Using SEER public-use datasets
SEER is the U.S. National Cancer Institute's Surveillance, Epidemiology and End Results program. It is an amazing resource for information about the cancers that occur in the U.S. One of the products of SEER is the set of Public Use datasets, which contain de-identified records on over 3.5 million cancers that occurred between 1973 and 2005.
When you have 3.5 million cancer cases to study, you can draw certain types of inferences that could not possibly be made with the data accumulated at a single medical institution.
The SEER public-use data files are available as a DVD or they can be
downloaded from the web.
Information for obtaining these files is available at:
http://seer.cancer.gov/data/
To get these files, you need to fax a signed agreement to the National
Library of Medicine and wait for a response (it just took a few hours
when I tried a week or two ago).
For this exercise, I downloaded the following file:
SEER_1973_2005_CD2.d08042008.zip (199,212,317 bytes)
Expanded, it produced several directories of files. The raw data
records are contained in the following files.
08/04/2008 11:09 AM 143,352,097 BREAST.TXT
08/04/2008 11:09 AM 109,510,121 COLRECT.TXT
08/04/2008 11:09 AM 67,313,323 DIGOTHR.TXT
08/04/2008 11:09 AM 93,497,187 FEMGEN.TXT
08/04/2008 11:09 AM 70,809,823 LYMYLEUK.TXT
08/04/2008 11:09 AM 119,312,494 MALEGEN.TXT
08/04/2008 11:09 AM 127,835,148 OTHER.TXT
08/04/2008 11:09 AM 129,526,418 RESPIR.TXT
08/04/2008 11:09 AM 59,133,585 URINARY.TXT
9 File(s) 920,290,196 bytes
The SEER files comprise over 920 megabytes of text.
Each record looks something like this (note that the following is a
fake and truncated record, because I didn't want to
list an actual record from SEER on my web page. JB):
246000990000001521205001078191409902051986C64908130381303211
An actual record might go on for 258 characters.
Suppose we wanted to parse through every record of every file
in the SEER data set.
A few lines of Perl will suffice:
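#!/usr/local/bin/perl
# Count every record (line) in every SEER data file.
opendir(SEERDIR, "c\:\\seer") || die ("Unable to open directory");
@files = readdir(SEERDIR);
closedir(SEERDIR);
chdir("c\:\\seer");
$totalcount = 0;
foreach $datafile (@files)
{
next if ($datafile !~ /\.txt$/i);   # skip ".", ".." and non-data files
open (TEXT, $datafile) || die ("Unable to open $datafile");
# The loop ends at end of file, so the failed final read is not counted.
while (defined($line = <TEXT>))
{
$totalcount++;
}
close TEXT;
}
print "\n$totalcount";
exit;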
The output of the script is the number "3553255" (total number of records).
It will appear on your
computer monitor, after about 11 seconds. That's all the time
it takes (on a 2.8 GHz CPU) to parse 920 megabytes of text.
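The same count could be sketched in Python. This is a minimal sketch, not the script that produced the number above; the function name count_records and its directory argument are made up for illustration.

```python
import os

def count_records(directory):
    """Count lines (one record per line) in every .txt file in a directory."""
    total = 0
    for name in os.listdir(directory):
        # Skip anything that is not a data file (case-insensitive .txt match).
        if not name.lower().endswith(".txt"):
            continue
        path = os.path.join(directory, name)
        # Binary mode avoids spending time on text decoding.
        with open(path, "rb") as handle:
            for _ in handle:
                total += 1
    return total
```

Calling count_records(r"c:\seer") would walk the same directory as the Perl script.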
Each SEER record is a cancer case, described by a string of 258
characters (mostly digits). Each data item occupies fixed byte positions,
which are listed in a data dictionary document (available at the SEER
web site). Here are the first 14 items in the dictionary.
List of the data dictionary items accounting
for the first 46 bytes of a SEER record
Item                            Bytes
Patient ID number               01-08
Registry ID                     09-18
Marital Status at DX            19-19
Race/Ethnicity                  20-21
Spanish/Hispanic Origin         22-22
NHIA Derived Hispanic Origin    23-23
Sex                             24-24
Age at diagnosis                25-27
Year of Birth                   28-31
Birth Place                     32-34
Sequence Number--Central        35-36
Month of diagnosis              37-38
Year of diagnosis               39-42
Primary Site                    43-46
These first 14 items are the only demographic items I have used
for any of my SEER projects.
When you know the byte locations for the data dictionary entries,
you can easily write a short script (I like to use Perl, Ruby, or Python)
that can extract and compile data any way you wish.
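Here is how that idea might look in Python. Treat it as a sketch: the names SEER_FIELDS and parse_record are invented for this example, and the byte ranges are simply copied from the dictionary list above. The sample string is the fake, truncated record shown earlier, not real SEER data.

```python
# Byte ranges (1-based, inclusive) copied from the data dictionary list above.
SEER_FIELDS = {
    "patient_id": (1, 8),
    "registry_id": (9, 18),
    "marital_status_at_dx": (19, 19),
    "race_ethnicity": (20, 21),
    "spanish_hispanic_origin": (22, 22),
    "nhia_derived_hispanic_origin": (23, 23),
    "sex": (24, 24),
    "age_at_dx": (25, 27),
    "year_of_birth": (28, 31),
    "birth_place": (32, 34),
    "sequence_number_central": (35, 36),
    "month_of_dx": (37, 38),
    "year_of_dx": (39, 42),
    "primary_site": (43, 46),
}

def parse_record(line, fields=SEER_FIELDS):
    """Return a dict of field name -> raw string for one fixed-width record."""
    record = {}
    for name, (first, last) in fields.items():
        # Python slices are 0-based and exclude the end index,
        # so bytes first..last become line[first - 1:last].
        record[name] = line[first - 1:last]
    return record

# The fake, truncated record shown earlier on this page (not real SEER data).
fake = "246000990000001521205001078191409902051986C64908130381303211"
parsed = parse_record(fake)
print(parsed["age_at_dx"], parsed["year_of_dx"], parsed["primary_site"])
```

For the fake record this prints 078 1986 C649. Keeping the byte ranges in one table means a new data item can be added without touching the parsing code.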
Here's a short Perl script that reads the first two records
of each SEER public use file, extracts the age at diagnosis and the
year of diagnosis from each record, and prints the result
to the screen.
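#!/usr/local/bin/perl
# Print the age and year of diagnosis for the first two records of each SEER file.
opendir(SEERDIR, "c\:\\seer") || die ("Unable to open directory");
@files = readdir(SEERDIR);
closedir(SEERDIR);
chdir("c\:\\seer");
foreach $datafile (@files)
{
next if ($datafile !~ /\.txt$/i);   # skip ".", ".." and non-data files
open (TEXT, $datafile) || die ("Unable to open $datafile");
for($i=0;$i<2;$i++)
{
$line = <TEXT>;
last unless defined($line);          # file has fewer than two records
$age_at_dx = substr($line,24,3);     # bytes 25-27 (substr offsets start at 0)
$year_of_dx = substr($line,38,4);    # bytes 39-42
print "$datafile Age $age_at_dx Year $year_of_dx\n";
}
close TEXT;
}
exit;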
The following output consists of a smattering of information from the first
two records from each of the SEER files.
BREAST.TXT Age 057 Year 1988
BREAST.TXT Age 094 Year 1979
COLRECT.TXT Age 059 Year 1973
COLRECT.TXT Age 069 Year 1989
DIGOTHR.TXT Age 062 Year 1981
DIGOTHR.TXT Age 062 Year 1978
FEMGEN.TXT Age 080 Year 2000
FEMGEN.TXT Age 067 Year 1983
LYMYLEUK.TXT Age 071 Year 1990
LYMYLEUK.TXT Age 078 Year 1990
MALEGEN.TXT Age 088 Year 1984
MALEGEN.TXT Age 082 Year 1979
OTHER.TXT Age 064 Year 1979
OTHER.TXT Age 067 Year 1983
RESPIR.TXT Age 074 Year 1977
RESPIR.TXT Age 091 Year 2004
URINARY.TXT Age 071 Year 1986
URINARY.TXT Age 074 Year 1973
The most important line from the Perl script is:
$age_at_dx = substr($line,24,3);
This pulls the string consisting of bytes 25, 26, and 27 from the
record. Perl's substr counts offsets from zero, so offset 24 is the
25th byte, and the length of 3 carries through byte 27. The data
dictionary tells us that bytes 25-27 hold the age at diagnosis.
Two of the most important data items in the record will require a little
extra work, if, once extracted, you want to understand their meaning.
These are the morphology codes and the anatomic site codes.
The morphology code occupies bytes 53-57 (for ICDO-3) and bytes
48-52 (for ICDO-2), for each record.
Examples of morphology codes are:
M96783 primary effusion lymphoma
M83703 adrenal cortical carcinoma
You can get a copy of the ICDO-3 codes from:
http://seer.cancer.gov/icd-o-3/sitetype.icdo3.d08152007.pdf
You will need to cut and paste the PDF file into a plain ASCII text
file, free of formatting characters, to use it in your scripts.
I put the list of codes and equivalent terms into a text file that
I named "ICDO-3".
I use the following short snippet to put the codes, and their term equivalents,
into a hash. When I need to convert a code into its term, I just look up
the hash value for that code.
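# Build a hash that maps each morphology code (e.g. "96783") to its term.
open (ICD, "c:\\ftp\\icd03.txt") || die ("Unable to open ICD-O-3 file");
while (defined($line = <ICD>))
{
# Match a code such as "9678/3", followed by spaces and then the term.
if ($line =~ /([0-9]{4})\/([0-9]) +(.*)/)
{
$code = $1 . $2;
$term = lc($3);
$term =~ s/\s+$//;                   # drop trailing spaces
$dictionary{$code} = $term;
}
}
close ICD;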
This snippet of code will mean something to you when you
look at the data format in SEER's ICDO-3 file.
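For readers who prefer Python, the same lookup table might be built as follows. This is only a sketch: load_morphology_terms is an invented name, and the sample lines are made up in the layout the pattern expects (four digits, a slash, one digit, spaces, then the term). The real text file may need extra cleanup.

```python
import re

# Four-digit histology, a slash, a one-digit behavior code, then the term.
CODE_LINE = re.compile(r"([0-9]{4})/([0-9]) +(.*)")

def load_morphology_terms(lines):
    """Map codes such as '96783' to lower-cased terms."""
    terms = {}
    for line in lines:
        match = CODE_LINE.search(line)
        if match:
            code = match.group(1) + match.group(2)
            terms[code] = match.group(3).strip().lower()
    return terms

# Made-up sample lines in the assumed layout; the real file may differ.
sample = [
    "9678/3 Primary effusion lymphoma\n",
    "8370/3 Adrenal cortical carcinoma\n",
    "a line without a code\n",
]
terms = load_morphology_terms(sample)
print(terms["96783"])
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the sample list.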
The same can be done with an ICDO-2 file, and with the anatomic
site codes (the primary site item, bytes 43-46).
Anatomic site codes and terms are available at:
http://www.ncri.ie/data.cgi/html/icdo2sites.shtml
An example of a site code and its term equivalent is:
C649 Kidney NOS*
*NOS means not otherwise specified
Everything we want to do with the SEER files involves parsing
through the files, line (record) by line; and then pulling out
the data we're interested in.
For example, the following list is a compilation of all of
the diagnoses that occur at least 10 times in the data set,
sorted by the age of the person at the time of diagnosis.
Each diagnosis is accompanied by the number of cases and the
cumulative fraction of cases accounted for. The names of the
diagnoses are truncated for reasons of space.
The list was produced by a short Perl script composed with various
combinations and variations of the Perl code appearing on this page.
Age  Number of  Cumulative  Name of
     cases      fraction    neoplasm
-----------------------------------------------------
000 0000041 0.000 retinoblastoma, differentiated type
001 0000641 0.000 retinoblastoma nos
002 0000050 0.000 infantile fibrosarcoma
002 0000016 0.000 juvenile myelomonocytic leukemia
002 0000021 0.000 atypical teratoid/rhabdoid tumor
002 0000043 0.000 retinoblastoma, undifferentiated type
004 0000303 0.000 hepatoblastoma
004 0001762 0.000 neuroblastoma nos
004 0000012 0.000 medulloepithelioma nos
005 0001466 0.001 nephroblastoma nos
007 0000308 0.001 ganglioneuroblastoma
011 0000013 0.001 subependymal giant cell astrocytoma
012 0000010 0.001 pancreatoblastoma
013 0001163 0.001 medulloblastoma nos
013 0000018 0.001 neurofibromatosis nos
014 0000756 0.001 embryonal rhabdomyosarcoma
014 0000084 0.001 langerhans cell histiocytosis, disseminated
016 0000060 0.001 choroid plexus papilloma, malignant
017 0001545 0.002 pilocytic astrocytoma (c71._) 9421/1
018 0000031 0.002 embryonal sarcoma
018 0000337 0.002 alveolar rhabdomyosarcoma
019 0001088 0.002 ewing's sarcoma
019 0000100 0.002 desmoplastic medulloblastoma
019 0000507 0.002 primitive neuroectodermal tumor
020 0000584 0.003 endodermal sinus tumor
020 0000055 0.003 clear cell sarcoma of kidney
021 0000069 0.003 ganglioglioma
021 0000827 0.003 precursor b-cell lymphoblastic leukemia
022 0000139 0.003 ependymoma, anaplastic type
022 0000025 0.003 dysembryoplastic neuroepithelial tumor
022 0000049 0.003 precursor b-cell lymphoblastic lymphoma
023 0001242 0.003 teratoma, malignant nos
023 0000021 0.003 choroid plexus papilloma nos
023 0009222 0.006 precursor cell lymphoblastic leukemia, nos
024 0000098 0.006 pineoblastoma
024 0000090 0.006 malignant rhabdoid tumor
024 0000117 0.006 precursor t-cell lymphoblastic leukemia
025 0000011 0.006 periosteal osteosarcoma
025 0000023 0.006 mixed type rhabdomyosarcoma
025 0000093 0.006 pleomorphic xanthoastrocytoma
026 0000103 0.006 alveolar soft part sarcoma
026 0000019 0.006 chondroblastoma, malignant
026 0000053 0.006 telangiectatic osteosarcoma
027 0000011 0.006 spindle cell rhabdomyosarcoma
027 0000039 0.006 desmoplastic small round cell tumor
028 0000690 0.006 germinoma
028 0001717 0.007 teratocarcinoma
028 0000118 0.007 parosteal osteosarcoma
028 0000227 0.007 chondroblastic osteosarcoma
028 0000095 0.007 germ cell tumor, nonseminomatous
028 0000556 0.007 choriocarcinoma combined with teratoma
029 0000023 0.007 ganglioglioma, anaplastic
029 0000019 0.007 intratubular malignant germ cells
029 0000021 0.007 malignant placental site trophoblastic tumor
030 0002388 0.008 mixed germ cell tumor
030 0003184 0.009 embryonal carcinoma nos
030 0000115 0.009 precursor t-cell lymphoblastic lymphoma
031 0000966 0.009 dysgerminoma
031 0000832 0.009 choriocarcinoma
031 0000021 0.009 central neurocytoma
032 0000016 0.009 lipoma nos
032 0000082 0.009 neuroepithelioma nos
032 0000027 0.009 adamantinomatous craniopharyngioma
032 0000019 0.009 malignant teratoma, undifferentiated type
033 0001848 0.010 osteosarcoma nos
033 0000114 0.010 fibroblastic osteosarcoma
033 0000030 0.010 adamantinoma of long bones
033 0000156 0.010 synovial sarcoma, biphasic type
033 0011933 0.013 hodgkin lymphoma, nodular sclerosis, nos
034 0001388 0.014 ependymoma nos
034 0000149 0.014 peripheral neuroectodermal tumor
035 0000051 0.014 prolactinoma
035 0000094 0.014 protoplasmic astrocytoma
035 0000963 0.014 precursor cell lymphoblastic lymphoma, nos
036 0000435 0.014 hodgkin lymphoma, nod. scler., grade 2
037 0009849 0.017 seminoma nos
037 0000087 0.017 small cell sarcoma
037 0000010 0.017 acidophil carcinoma
037 0000555 0.017 synovial sarcoma nos
037 0001609 0.017 burkitt lymphoma, nos
037 0000045 0.017 mediastinal large b-cell lymphoma
037 0000178 0.017 synovial sarcoma, spindle cell type
037 0000118 0.018 giant cell tumor of bone, malignant
037 0000388 0.018 sq. cell carcinoma, lg. cell, non-ker., in situ
038 0000219 0.018 astroblastoma
038 0000053 0.018 craniopharyngioma
038 0057571 0.034 carcinoma in situ nos
038 0000016 0.034 androblastoma, malignant
038 0000627 0.034 seminoma, anaplastic type
038 0000106 0.034 mesenchymal chondrosarcoma
038 0063370 0.052 squamous cell carcinoma in situ nos
038 0000416 0.052 hodgkin lymphoma, nod. scler., grade 1
038 0000548 0.052 hodgkin lymphoma, nod. scler., cellular phase
039 0000019 0.052 neurofibroma nos
039 0000026 0.052 sertoli cell carcinoma
039 0000096 0.052 hepatocellular carcinoma, fibrolamellar
039 0000021 0.052 hepatosplenic gamma-delta cell lymphoma
039 0000481 0.052 hodgkin lymph., nodular lymphocyte predom.
039 0000024 0.052 mpnst with rhabdomyoblastic differentiation
040 0001012 0.053 mixed glioma
040 0000038 0.053 cavernous hemangioma
040 0000070 0.053 myxopapillary ependymoma
040 0000040 0.053 pigmented dermatofibrosarcoma protuberans
040 0000123 0.053 clear cell sarcoma of tendons and aponeuroses
040 0000040 0.053 sertoli-leydig cell tumor, poorly differentiated
041 0014107 0.057 kaposi's sarcoma
041 0002297 0.057 oligodendroglioma nos
041 0000014 0.057 spongioblastoma polare
041 0000010 0.057 papillary carcinoma, oxyphilic cell
041 0024527 0.064 squamous intraepithelial neoplasia, grade iii
042 0000024 0.064 pulmonary blastoma
042 0000118 0.064 burkitt cell leukemia
042 0000764 0.065 fibrillary astrocytoma
042 0003634 0.066 dermatofibrosarcoma nos
042 0000282 0.066 epithelioid cell sarcoma
042 0000861 0.066 hodgkin's disease, lymphocytic predominance
043 0000036 0.066 hemangioma nos
043 0000505 0.066 rhabdomyosarcoma nos
043 0000019 0.066 solid pseudopapillary carcinoma
044 0000015 0.066 papillary meningioma
044 0000013 0.066 papillary meningioma 9538/3
044 0000017 0.066 juxtacortical chondrosarcoma
044 0000115 0.066 adenocarcinoma, endocervical type
044 0003873 0.067 squamous cell carcinoma, microinvasive
044 0000079 0.067 papillary cystadenoma, borderline malignancy (c56.9)
045 0000086 0.067 hemangioblastoma
045 0000037 0.067 oligodendroblastoma
045 0000016 0.067 struma ovarii, malignant
045 0000029 0.067 undifferentiated sarcoma
045 0000010 0.067 clear cell chondrosarcoma
045 0000011 0.067 carotid body tumor, malignant
045 0000091 0.067 papillary carcinoma, encapsulated
045 0000035 0.067 extra-adrenal paraganglioma, malignant
045 0000482 0.067 sq. cell carcinoma, keratinizing, nos, in situ
046 0000285 0.068 burkitt's tumor
046 0003337 0.069 glioma, malignant
046 0000024 0.069 periosteal fibrosarcoma
046 0004414 0.070 hodgkin's disease, mixed cellularity
046 0010260 0.073 papillary and follicular adenocarcinoma
046 0000215 0.073 follicular carcinoma, minimally invasive
047 0008335 0.075 astrocytoma nos
047 0000690 0.075 neurofibrosarcoma
047 0000047 0.075 leydig cell tumor, malignant
047 0001140 0.076 acute promyelocytic leukemia
047 0000163 0.076 malignant melanoma in giant pigmented nevus
047 0000517 0.076 follicular adenocarcinoma, well differentiated type
047 0000111 0.076 papillary mucinous cystadenoma, borderline malignancy (c56.9)
047 0000084 0.076 squamous cell carcinoma in situ with questionable stromal invasion
048 0003078 0.077 hodgkin's disease nos
048 0000257 0.077 megakaryocytic leukemia
048 0000039 0.077 ovarian stromal tumor, mal.
048 0002320 0.077 astrocytoma, anaplastic type
048 0000500 0.078 oligodendroglioma, anaplastic type
048 0000194 0.078 endometrial stromal sarcoma, low grade
048 0000104 0.078 epithelioid hemangioendothelioma, malignant
048 0000057 0.078 ac. myelomonocytic leuk. w abn. mar. eosinophils
048 0001603 0.078 serous papillary cystic tumor of borderline malignancy (c56.9)
049 0000033 0.078 subependymal glioma
049 0000179 0.078 mesenchymoma, malignant
049 0000461 0.078 gemistocytic astrocytoma
049 0000012 0.078 primary effusion lymphoma
049 0000196 0.078 pheochromocytoma, malignant
049 0000159 0.078 nonencapsulated sclerosing carcinoma
049 0000034 0.078 dermoid cyst with malignant transformation
049 0001041 0.079 serous cystadenoma, borderline malignancy (c56.9)
049 0001486 0.079 mucinous cystic tumor of borderline malignancy (c56.9)
050 0002700 0.080 bowen's disease
050 0000036 0.080 hodgkin's granuloma
050 0000012 0.080 chromophobe adenoma
050 0001225 0.080 pituitary adenoma, nos
050 0000722 0.080 neurilemmoma, malignant
050 0000172 0.081 giant cell glioblastoma
050 0022382 0.087 papillary carcinoma nos
050 0000089 0.087 ameloblastoma, malignant
050 0000516 0.087 papillary microcarcinoma
050 0000034 0.087 acute biphenotypic leukemia
050 0004050 0.088 follicular adenocarcinoma nos
050 0000017 0.088 mixed medullary-papillary carcinoma
050 0001004 0.088 medullary carcinoma with amyloid stroma
050 0000085 0.088 acute myeloid leukemia, t(8;21)(q22;q22)
050 0000030 0.088 mucinous cystadenocarcinoma, non-invasive
050 0000029 0.088 mucinous adenocarcinoma, endocervical type
050 0000141 0.089 follicular adenocarcinoma, trabecular type
051 0001932 0.089 chondrosarcoma nos
051 0000116 0.089 gastrinoma, malignant
051 0000115 0.089 round cell liposarcoma
051 0000117 0.089 paraganglioma, malignant
051 0000761 0.089 adrenal cortical carcinoma
051 0000884 0.090 lymphoepithelial carcinoma
051 0000358 0.090 hemangiopericytoma, malignant
051 0004547 0.091 superficial spreading melanoma, in situ
051 0000335 0.091 medullary carcinoma with lymphoid stroma
051 0000026 0.091 malignant giant cell tumor of soft parts
051 0000179 0.091 adenocarcinoma in adenomatous polyposis coli
052 0000043 0.091 myosarcoma
052 0000153 0.091 adenoma nos
052 0000978 0.091 neurilemmoma nos
052 0001258 0.092 fibrosarcoma nos
052 0000032 0.092 histiocytic sarcoma
052 0000191 0.092 myxoid chondrosarcoma
052 0000302 0.092 esthesioneuroblastoma
052 0000110 0.092 precancerous melanosis nos
052 0040929 0.104 superficial spreading melanoma
052 0000159 0.104 hemangioendothelioma, malignant
052 0000934 0.104 cystosarcoma phyllodes, malignant
052 0000031 0.104 malignant melanoma in precancerous melanosis
052 0000248 0.104 chronic myelogenous leukemia, bcr/abl positive
052 0000058 0.104 hodgkin's disease, lymphocytic depletion, diffuse fibrosis
053 0000097 0.104 neoplasm, benign
053 0000265 0.104 fibromyxosarcoma
053 0001167 0.104 myxoid liposarcoma
053 0000046 0.104 fascial fibrosarcoma
053 0000027 0.104 blue nevus, malignant
053 0000128 0.104 spermatocytic seminoma
053 0007534 0.107 medullary carcinoma nos
053 0000053 0.107 hypereosinophilic syndrome
053 0000042 0.107 odontogenic tumor, malignant
053 0000780 0.107 granulosa cell tumor, malignant
053 0000350 0.107 malignant melanoma in junctional nevus
054 0000032 0.107 neuroma nos
054 0000025 0.107 balloon cell melanoma
054 0000135 0.107 malignant mastocytosis
054 0000142 0.107 mesonephroma, malignant
054 0003757 0.108 mucoepidermoid carcinoma
054 0010414 0.111 lobular carcinoma in situ
054 0000029 0.111 thymoma, type b2, malignant
054 0000117 0.111 nk/t-cell lymphoma, nasal and nasal-type
054 0000890 0.111 anaplastic large cell lymphoma, t-cell and null cell type
055 0000614 0.111 chordoma
055 0000023 0.111 angiomyosarcoma
055 0000111 0.111 fibrous meningioma
055 0000959 0.112 thymoma, malignant
055 0000328 0.112 adenocarcinoid tumor
055 0000057 0.112 chromophobe carcinoma
055 0000032 0.112 megakaryocytic myelosis
055 0000903 0.112 endometrial stromal sarcoma
055 0000120 0.112 atypical medullary carcinoma
055 0000265 0.112 mucocarcinoid tumor, malignant
055 0000070 0.112 juvenile carcinoma of the breast
055 0000029 0.112 adenocarcinoma in situ in familial polyp. coli
056 0000042 0.112 gliomatosis cerebri
056 0006891 0.114 comedocarcinoma nos
056 0000028 0.114 theca cell carcinoma
056 0000096 0.114 myxoid leiomyosarcoma
056 0048315 0.128 malignant melanoma nos
056 0000030 0.128 angiomatous meningioma
056 0000144 0.128 transitional meningioma
056 0000045 0.128 thymoma, type ab, malignant
056 0000140 0.128 pleomorphic rhabdomyosarcoma
056 0000412 0.128 malignant melanoma, regressing
056 0003428 0.129 mucinous cystadenocarcinoma nos
056 0000086 0.129 papillary carcinoma, columnar cell
056 0000485 0.129 papillary mucinous cystadenocarcinoma
056 0002754 0.130 intraductal and lobular in situ carcinoma
056 0000626 0.130 hodgkin's disease, lymphocytic depletion nos
056 0000039 0.130 endometrioid adenocarcinoma, secretory variant
056 0000147 0.130 primary cutan. cd30+ t-cell lymphoprolif. disorder
057 0000089 0.130 myxosarcoma
057 0000403 0.130 adenosarcoma
057 0021999 0.137 melanoma in situ
057 0000855 0.137 islet cell carcinoma
057 0000145 0.137 mixed type liposarcoma
057 0003285 0.138 inflammatory carcinoma
057 0000066 0.138 thymoma, type b1, malignant
057 0000047 0.138 spindle cell melanoma, type a
057 0000026 0.138 subcutaneous panniculitis-like t-cell lymphoma
058 0000011 0.138 vipoma
058 0000202 0.138 myeloid sarcoma
058 0002147 0.138 hairy cell leukemia
058 0000191 0.139 stromal sarcoma, nos
058 0000038 0.139 insulinoma, malignant
058 0000032 0.139 glucagonoma, malignant
058 0006684 0.140 adenocarcinoma in situ
058 0001435 0.141 oxyphilic adenocarcinoma
058 0001770 0.141 acute monocytic leukemia
058 0000054 0.141 malignant myoepithelioma
058 0003319 0.142 adenoid cystic carcinoma
058 0000079 0.142 thymoma, type b3, malignant
058 0000477 0.142 spindle cell melanoma, type b
058 0000229 0.142 meningotheliomatous meningioma
058 0000234 0.143 malignant tumor, small cell type
058 0009679 0.145 comedocarcinoma, noninfiltrating
058 0000207 0.145 cyst-associated renal cell carcinoma
058 0000843 0.146 intraductal micropapillary carcinoma
058 0004480 0.147 ml, large b-cell, diffuse, immunoblastic, nos
058 0008089 0.149 squamous cell carcinoma, large cell, nonkeratinizing type
059 0003124 0.150 sarcoma nos
059 0008070 0.152 leiomyosarcoma nos
059 0000114 0.152 atypical meningioma
059 0000082 0.152 thymic carcinoma, nos
059 0000016 0.152 aggressive nk-cell leukemia
059 0002839 0.153 cribriform carcinoma in situ
059 0027965 0.161 papillary adenocarcinoma nos
059 0000018 0.161 malignant eccrine spiradenoma
059 0001301 0.161 papillary cystadenocarcinoma nos
059 0000050 0.161 solitary fibrous tumor, malignant
059 0000091 0.161 polymorphous low grade adenocarcinoma
059 0004086 0.163 adenocarcinoma with squamous metaplasia
059 0000448 0.163 acute myeloid leukemia without maturation
059 0035630 0.173 intraductal carcinoma, noninfiltrating nos
059 0000022 0.173 acute myeloid leukemia, 11q23 abnormalities
059 0000020 0.173 mixed islet cell and exocrine adenocarcinoma
059 0004558 0.174 infiltr. duct mixed with other types of carcinoma, in situ
060 0008688 0.176 nodular melanoma
060 0000030 0.176 insular carcinoma
060 0003377 0.177 mycosis fungoides
060 0000764 0.178 amelanotic melanoma
060 0001025 0.178 spindle cell sarcoma
060 0000113 0.178 queyrat's erythroplasia
060 0000401 0.178 epithelioid cell melanoma
060 0016028 0.183 carcinoid tumor, malignant
060 0000486 0.183 epithelioid leiomyosarcoma
060 0001267 0.183 mature t-cell lymphoma, nos
060 0001045 0.183 cutaneous t-cell lymphoma, nos
060 0000020 0.183 carcinosarcoma, embryonal type
060 0001669 0.184 duct carcinoma in situ, solid type
060 0001021 0.184 liposarcoma, well differentiated type
060 0000014 0.184 granulosa cell-theca cell tumor, mal.
060 0000614 0.184 acute myeloid leukemia with maturation
060 0000617 0.184 renal cell carcinoma, chromophobe type
060 0000166 0.185 carcinoid tumor, argentaffin, malignant
060 0000126 0.185 adult t-cell leukemia/lymphoma (htlv-1 pos.)
060 0000299 0.185 squamous cell carcinoma, small cell, nonkeratinizing type
060 0008938 0.187 malignant lymphoma, follicular center cell, cleaved, follicular
061 0019907 0.193 glioblastoma nos
061 0000906 0.193 meningioma, malignant
061 0000015 0.193 epithelioma, malignant
061 0026705 0.201 endometrioid carcinoma
061 0018131 0.206 acute myeloid leukemia
061 0010280 0.209 chronic myeloid leukemia
061 0000026 0.209 polygonal cell carcinoma
061 0000057 0.209 collecting duct carcinoma
061 0000303 0.209 metaplastic carcinoma, nos
061 0000382 0.209 mixed tumor, malignant nos
061 0333623 0.303 infiltrating duct carcinoma
061 0002006 0.303 acute myelomonocytic leukemia
061 0020557 0.309 clear cell adenocarcinoma nos
061 0000885 0.309 infiltrating ductular carcinoma
061 0000165 0.309 carcinoma in pleomorphic adenoma
061 0000661 0.310 mixed epithel. & spindle cell melanoma
061 0000295 0.310 glioblastoma with sarcomatous component
061 0000042 0.310 adenocarcinoma with apocrine metaplasia
061 0021695 0.316 infiltrating duct and lobular carcinoma
061 0000324 0.316 acute myeloid leukemia, minimal differentiation
061 0000100 0.316 papillary squamous cell carcinoma, non-invasive
061 0000012 0.316 endometrioid adenocarcinoma, ciliated cell variant
061 0000074 0.316 hodgkin's disease, lymphocytic depletion, reticular type
061 0002217 0.317 paget's disease and infiltrating duct carcinoma of breast
062 0001323 0.317 liposarcoma nos
062 0006337 0.319 tubular adenocarcinoma
062 0000014 0.319 heavy chain disease, nos
062 0000077 0.319 atypical carcinoid tumor
062 0000494 0.319 skin appendage carcinoma
062 0001092 0.319 mixed cell adenocarcinoma
062 0001281 0.320 spindle cell melanoma nos
062 0004682 0.321 serous cystadenocarcinoma nos
062 0005812 0.323 fibrous histiocytoma, malignant
062 0001263 0.323 gastrointestinal stromal sarcoma
062 0000106 0.323 malignant tumor, giant cell type
062 0000493 0.323 basaloid squamous cell carcinoma
062 0000258 0.323 renal cell carcinoma, sarcomatoid
062 0000118 0.323 epithelial-myoepithelial carcinoma
062 0000884 0.323 acral lentiginous melanoma, malig.
062 0015063 0.328 papillary serous cystadenocarcinoma
062 0000039 0.328 endometrioid adenofibroma, malignant
062 0012432 0.331 malignant lymphoma, non hodgkin's type
062 0000037 0.331 adenocarcinoma with spindle cell metaplasia
062 0000086 0.331 therapy-related myelodysplastic syndrome, nos
062 0002593 0.332 infiltr. duct mixed with other types of carcinoma
062 0003162 0.333 noninfiltrating intraductal papillary adenocarcinoma
062 0005812 0.334 malignant lymphoma, mixed lymphocytic-histiocytic, nodular
062 0003414 0.335 malignant lymphoma, follicular center cell, noncleaved, follicular
063 0001353 0.336 hemangiosarcoma
063 0000023 0.336 hodgkin's sarcoma
063 0000024 0.336 meningiomatosis nos
063 0000239 0.336 solid carcinoma nos
063 0000135 0.336 composite carcinoid
063 0001142 0.336 cystadenocarcinoma nos
063 0000023 0.336 schneiderian carcinoma
063 0011193 0.339 adenosquamous carcinoma
063 0000084 0.339 psammomatous meningioma
063 0037088 0.350 ml, large b-cell, diffuse
063 0000080 0.350 basal cell adenocarcinoma
063 0000016 0.350 intestinal t-cell lymphoma
063 0000287 0.350 dedifferentiated liposarcoma
063 0000939 0.350 plasmacytoma, extramedullary
063 0000869 0.350 plasma cell tumor, malignant
063 0003345 0.351 malignant lymphoma, nodular nos
063 0001358 0.352 paget disease and intraductal ca.
063 0002708 0.353 serous surface papillary carcinoma
063 0000020 0.353 squamous cell carcinoma, clear cell type
063 0000012 0.353 basal cell carcinoma, fibroepithelial type
063 0000334 0.353 giant cell sarcoma (except of bone m9250/3)
063 0023109 0.359 squamous cell carcinoma, keratinizing type nos
063 0000472 0.359 infiltrating lobular mixed with other types of carc.
063 0000277 0.359 combined hepatocellular carcinoma and cholangiocarcinoma
064 0001275 0.360 polycythemia vera
064 0000041 0.360 lymphangiosarcoma
064 0017866 0.365 oat cell carcinoma
064 0037543 0.375 renal cell carcinoma
064 0001065 0.376 giant cell carcinoma
064 0011520 0.379 acinar cell carcinoma
064 0001358 0.379 cloacogenic carcinoma
064 0034336 0.389 lobular carcinoma nos
064 0013199 0.393 malignant lymphoma nos
064 0000947 0.393 granular cell carcinoma
064 0000405 0.393 pleomorphic liposarcoma
064 0000723 0.393 apocrine adenocarcinoma
064 0003164 0.394 scirrhous adenocarcinoma
064 0005106 0.396 neuroendocrine carcinoma
064 0000363 0.396 sweat gland adenocarcinoma
064 0247826 0.465 squamous cell carcinoma nos
064 0019957 0.471 hepatocellular carcinoma nos
064 0000757 0.471 papillary squamous cell carcinoma
064 0010254 0.474 carcinoma, undifferentiated type nos
064 0000087 0.474 acute panmyelosis with myelofibrosis
064 0000248 0.474 giant cell and spindle cell carcinoma
064 0000083 0.474 small cell carcinoma, fusiform cell type
064 0000151 0.474 adenocarcinoma in mult. adenomatous polyps
064 0000018 0.474 atypical chronic myeloid leuk., bcr/abl negative
064 0000026 0.474 adenocarcinoma with cartilaginous and osseous metaplasia
065 0002723 0.475 meningioma nos
065 0003731 0.476 acute leukemia nos
065 0000880 0.476 cribriform carcinoma
065 0000200 0.477 plasma cell leukemia
065 0000514 0.477 pleomorphic carcinoma
065 0000052 0.477 eccrine adenocarcinoma
065 0029592 0.485 large cell carcinoma nos
065 0000123 0.485 brenner tumor, malignant
065 0000998 0.485 essential thrombocythemia
065 0000016 0.485 ceruminous adenocarcinoma
065 0010967 0.488 signet ring cell carcinoma
065 0000125 0.488 nodular hidradenoma, malignant
065 0004446 0.490 carcinoma, anaplastic type nos
065 0000086 0.490 adenoid squamous cell carcinoma
065 0000116 0.490 sclerosing sweat duct carcinoma
065 0000856 0.490 adenocarcinoma with mixed subtypes
065 0003638 0.491 marginal zone b-cell lymphoma, nos
065 0004319 0.492 ml, mixed sm. and lg. cell, diffuse
065 0000183 0.492 superficial spreading adenocarcinoma
065 0000054 0.492 hepatocellular carcinoma, clear cell type
065 0000020 0.492 composite hodgkin and non-hodgkin lymphoma
065 0000105 0.492 adenocarcinoma with neuroendocrine differen.
065 0001094 0.493 intraductal papillary adenocarcinoma with invasion
066 0000815 0.493 erythroleukemia
066 0001654 0.493 linitis plastica
066 0000461 0.493 basaloid carcinoma
066 0002890 0.494 mantle cell lymphoma
066 0001270 0.495 ml, lymphoplasmacytic
066 0000787 0.495 spindle cell carcinoma
066 0001267 0.495 carcinoma, diffuse type
066 0000390 0.495 alveolar adenocarcinoma
066 0000618 0.496 paget's disease, mammary
066 0050173 0.510 small cell carcinoma nos
066 0000624 0.510 papillary carcinoma in situ
066 0000783 0.510 desmoplastic melanoma, malignant
066 0011881 0.513 bronchiolo-alveolar adenocarcinoma
066 0000241 0.513 angioimmunoblastic t-cell lymphoma
066 0000403 0.514 large cell neuroendocrine carcinoma
066 0000315 0.514 malignant tumor, fusiform cell type
066 0000042 0.514 prolymphocytic leukemia, t-cell type
066 0001669 0.514 small cell carcinoma, intermediate cell
066 0000050 0.514 adenocarc. in situ in mult. adenomatous polyps
066 0000127 0.514 neoplasm, uncertain whether benign or malignant
066 0000032 0.514 intraductal papillary-mucinous carcinoma, invasive
067 0000114 0.514 sezary's disease
067 0001015 0.515 lymphoid leukemia nos
067 0000779 0.515 mesodermal mixed tumor
067 0000184 0.515 trabecular adenocarcinoma
067 0000514 0.515 intracystic carcinoma, nos
067 0009635 0.518 ml, small b lymphocytic, nos
067 0000557 0.518 combined small cell carcinoma
067 0025451 0.525 mucin-producing adenocarcinoma
067 0000032 0.525 dedifferentiated chondrosarcoma
067 0000016 0.525 immunoproliferative disease, nos
067 0001263 0.525 epithelioid mesothelioma, malignant
067 0000179 0.525 splenic marginal zone b-cell lymphoma
067 0000212 0.525 bronchiolo-alveolar carcinoma, mucinous
067 0000584 0.526 squamous cell carcinoma, spindle cell type
067 0008163 0.528 adenocarcinoma in situ in adenomatous polyp
067 0000165 0.528 bronchiolo-alveolar carcinoma, non-mucinous
067 0005831 0.530 adenocarcinoma in situ in tubulovillous adenoma
067 0000327 0.530 acute myeloid leuk. with multilineage dysplasia
067 0000026 0.530 bronch.-alv. carc., mixed mucin. and non-mucinous
067 0000038 0.530 intraductal papillary-mucinous carcinoma, non-inv.
068 0000054 0.530 carcinoma simplex
068 0002264 0.530 carcinosarcoma nos
068 1021940 0.818 adenocarcinoma nos
068 0003088 0.819 mullerian mixed tumor
068 0000758 0.819 villous adenocarcinoma
068 0045778 0.832 mucinous adenocarcinoma
068 0004897 0.834 mesothelioma, malignant
068 0001673 0.834 verrucous carcinoma nos
068 0014878 0.838 non-small cell carcinoma
068 0000032 0.838 thymoma, type a, malignant
068 0000295 0.838 pseudosarcomatous carcinoma
068 0000056 0.838 granular cell tumor, malignant
068 0018652 0.844 hutchinson's melanotic freckle
068 0020880 0.849 adenocarcinoma in adenomatous polyp
068 0000032 0.849 prolymphocytic leukemia, b-cell type
068 0096537 0.877 papillary transitional cell carcinoma
068 0016555 0.881 adenocarcinoma in tubulovillous adenoma
069 0036429 0.892 multiple myeloma
069 0004610 0.893 cholangiocarcinoma
069 0001872 0.893 myeloid leukemia nos
069 0000156 0.893 basosquamous carcinoma
069 0000046 0.893 eccrine poroma, malignant
069 0000010 0.893 clear cell adenocarcinofibroma
069 0000476 0.894 fibrous mesothelioma, malignant
069 0016754 0.898 adenocarcinoma in villous adenoma
069 0000940 0.899 transitional cell carcinoma in situ
069 0000419 0.899 noninfiltrating intracystic carcinoma
069 0000362 0.899 myelosclerosis with myeloid metaplasia
069 0000586 0.899 chronic myeloproliferative disease, nos
069 0004672 0.900 adenocarcinoma in situ in villous adenoma
069 0002438 0.901 papillary trans. cell carcinoma, non-invasive
069 0007272 0.903 malignant melanoma in hutchinson's melanotic freckle
070 0003575 0.904 tumor cells, malignant
070 0030328 0.913 chronic lymphoid leukemia
070 0000301 0.913 prolymphocytic leukemia, nos
070 0050243 0.927 transitional cell carcinoma nos
070 0000049 0.927 osteosarcoma in paget's disease of bone
071 0001585 0.927 waldenstrom macroglobulinemia
071 0000122 0.927 transitional cell carcinoma, spindle cell type
071 0000263 0.927 refractory cytopenia with multilineage dysplasia
071 0000080 0.927 refract. anemia with excess blasts in transformation
071 0000811 0.928 paget's disease, extramammary (except paget's disease of bone)
072 0002581 0.928 leukemia nos
072 0182616 0.980 carcinoma nos
072 0000408 0.980 klatskin tumor
072 0000012 0.980 hepatoid adenocarcinoma
072 0000796 0.980 sebaceous adenocarcinoma
072 0000786 0.980 basal cell carcinoma nos
072 0000012 0.980 adenoid basal cell carcinoma
072 0002766 0.981 adenocarcinoma, intestinal type
072 0000015 0.981 multicentric basal cell carcinoma
072 0000763 0.981 refractory anemia with excess blasts
072 0000261 0.981 mesothelioma, biphasic type, malignant
072 0000028 0.981 transitional cell carcinoma, micropapillary
073 0000820 0.982 refractory anemia
073 0000036 0.982 basal cell carcinoma, nodular
074 0001716 0.982 merkel cell carcinoma
074 0002422 0.983 myelodysplastic syndrome, nos
074 0000655 0.983 refractory anemia with sideroblasts
074 0001798 0.984 chronic myelomonocytic leukemia, nos
074 0000089 0.984 myelodysplastic syndr. with 5q deletion syndrome
076 0056558 1.000 neoplasm, malignant
Once we have the columned data, we can easily produce a graphic that represents the salient features we would like to emphasize.
You can see for yourself, from the tabulated list and from the graph, that
the combined occurrences of all neoplasms whose average age of occurrence is under 33
contribute only 1% of the total number of cancers in the SEER collection.
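The same kind of check can be run directly against the columned list. What follows is only a minimal sketch, not the script used to build the list. It assumes each row holds four space-separated fields: the average age of occurrence, the number of occurrences, the cumulative fraction of all occurrences, and the neoplasm name. The subroutine names parse_row and fraction_below_age are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one row of the columned list into its four fields.
# Assumed layout: average age, occurrence count, cumulative fraction, name.
# Returns an empty list for lines that do not match that layout.
sub parse_row {
    my ($line) = @_;
    return unless $line =~ /^(\d+)\s+(\d+)\s+(\d\.\d+)\s+(.+)$/;
    return ($1 + 0, $2 + 0, $3 + 0, $4);
}

# Fraction of all counted occurrences that come from neoplasms whose
# average age is below $age_cutoff.
sub fraction_below_age {
    my ($rows, $age_cutoff) = @_;
    my ($below, $total) = (0, 0);
    for my $line (@$rows) {
        my @fields = parse_row($line);
        next unless @fields;              # skip lines that are not data rows
        my ($age, $count) = @fields;
        $total += $count;
        $below += $count if $age < $age_cutoff;
    }
    return $total ? $below / $total : 0;  # avoid dividing by zero
}

# Four rows copied from the list above, for illustration only.
my @sample = (
    "067 0009635 0.518 ml, small b lymphocytic, nos",
    "068 1021940 0.818 adenocarcinoma nos",
    "072 0182616 0.980 carcinoma nos",
    "076 0056558 1.000 neoplasm, malignant",
);

printf "Fraction of sampled occurrences below age 70: %.3f\n",
    fraction_below_age(\@sample, 70);
```

With only four rows, the printed fraction describes the sample, not SEER. To check the 1% figure, read every row of the complete list into the array and call fraction_below_age with a cutoff of 33.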
Additional web sites discussing the SEER public use data files are:
Using SEER public-use datasets
Data Mining the differences in occurrences of neoplasms in the U.S. African-American and White populations: http://www.julesberman.info/seerwhbl.htm
Using SEER public-use datasets: further examples: http://www.julesberman.info/seer_all.htm
Graphic representation of cancer occurrences, by age, for over 700 neoplasms:
http://www.julesberman.info/seerdist.pdf
Using the CDC (Centers for Disease Control and Prevention) public use data
files: http://www.julesberman.info/cdc_ch.pdf
As specified in the SEER Data Agreement, the citation for the SEER data
is as follows:
Surveillance, Epidemiology, and End Results (SEER) Program
(www.seer.cancer.gov) Limited-Use Data (1973-2005), National Cancer
Institute, DCCPS, Surveillance Research Program, Cancer Statistics
Branch, released April 2008, based on the November 2007 submission.
If you want to do creative data mining, you will need to
learn a little computer programming.
For Perl and Ruby programmers, methods and scripts for using SEER and other publicly available biomedical databases are described in detail in my prior books:
Perl Programming for Medicine and Biology
Ruby Programming for Medicine and Biology
An overview of the many uses of biomedical information is available in my book,
Biomedical Informatics.
More information on cancer is available in my
recently published book, Neoplasms: Principles of Development and Diversity.
I also maintain a blog where I regularly write about data organization,
data annotation, data retrieval, and data mining.
http://julesberman.blogspot.com
© 2008 Jules Berman
As with all of my scripts, lists, web sites, and blog entries,
the following disclaimer applies.
This material is provided by its creator, Jules J. Berman, "as is",
without warranty of any kind, expressed or implied, including but
not limited to the warranties of merchantability, fitness for a
particular purpose and noninfringement. In no event shall the
author or copyright holder be liable for any claim, damages or
other liability, whether in an action of contract, tort or otherwise,
arising from, out of or in connection with the material or the use or
other dealings in the material.
Last modified: December 31, 2008