Using SEER public-use datasets: Examples
SEER is the U.S. National Cancer Institute's Surveillance,
Epidemiology and End Results program. It is an amazing
resource for information about the cancers that occur
in the U.S. One of the products of SEER is the Public
Use datasets, which contain de-identified records on
over 3.5 million cancers that have occurred between
1973 and 2005.
When you have 3.5 million cancer cases to study, you can draw certain
types of inferences that could not possibly be made with the data
accumulated at a single medical institution.
In separate web pages, I have described how you can obtain and
analyze the free SEER public-use data files, and I have provided
one example of a SEER data mining project.
This page contains several additional short projects that were
done using the SEER data files and a few short Perl scripts.
GERM CELL TUMORS IN DIFFERENT ETHNICITIES
In the output from the prior example page, germ cell tumors seemed
to occur more frequently in the white population than in the
African-American population.
Would this difference extend to the Hispanic population?
I went to the SEER site and used SEER's public query engine to see if
this observation could be verified.
The SEER query site is:
http://seer.cancer.gov/canques/index.html
I looked for tumors in male testes, comparing
white Hispanics with African-Americans.
The SEER interface produces a list of the input parameters and a bar chart of your results.
You may be wondering, if I am interested in germ cell tumors, why did I query over tumors of the testes? I did this because the SEER interface does not allow me to do a query on specific types of germ cell tumors or of any specific testicular tumor. I know that most testicular tumors are germ cell tumors, so I settled, figuring that if there were a difference in the incidence of germ cell tumors in the Hispanic and the African-American populations, it would show up in the query.
And that's what happened. The SEER output demonstrated that white Hispanics had a much higher incidence of testicular tumors (and, presumably, testicular germ cell tumors) than the African-American population.
If I want to find the ratio for specific tumors, I need to do a little more work. A simple Perl script produced the following list:
2.031 0173 026 germinoma
2.756 0005 038 intratubular malignant germ cells *
2.756 0005 025 malignant teratoma, undifferentiated type *
3.409 0104 019 teratoma, malignant nos
3.478 1125 035 seminoma nos
4.452 0054 036 seminoma, anaplastic type
4.757 0157 028 teratocarcinoma
5.053 0060 027 choriocarcinoma combined with teratoma
5.168 0018 050 spermatocytic seminoma *
5.523 0299 028 embryonal carcinoma nos
5.548 0363 028 mixed germ cell tumor
6.389 0028 027 germ cell tumor, nonseminomatous
* Rows marked with an asterisk have too few cases (second column)
to carry any significance
The left column is the ratio of cases per total population of white Hispanics divided by the same ratio for African-Americans. The second column is the number of cases, the third column is the average age of cases, and the final column is the ICD-O term for the neoplasm.
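The ratio in the left column can be sketched in a few lines. This is a minimal Python illustration, not the Perl script used for this page; the function name and the example numbers are made up, and real population denominators would have to come from census figures for the SEER registry areas.

```python
def rate_ratio(cases_a, population_a, cases_b, population_b):
    """Ratio of (cases per person) in group A to (cases per person) in group B."""
    if population_a <= 0 or population_b <= 0:
        raise ValueError("populations must be positive")
    if cases_b == 0:
        raise ValueError("group B has no cases; the ratio is undefined")
    rate_a = cases_a / population_a
    rate_b = cases_b / population_b
    return rate_a / rate_b


# Made-up numbers, for illustration only (these are NOT SEER values):
# 30 cases among 1,000,000 people versus 10 cases among 1,000,000 people.
example = rate_ratio(30, 1_000_000, 10, 1_000_000)
print(f"{example:.3f}")
```

A value above 1 means the first group has more cases per person than the second. With very small case counts (such as the rows marked with asterisks) the ratio is unstable and should not be over-interpreted.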
We see (column 1) that white Hispanics have a higher case ratio for every type of germ cell tumor in males (regardless of site).
This tells us a few things. First, that all of the germ cell tumors are related to each other by more than histogenesis (cell of origin). They must have a relationship that extends to causation and development. Second, it tells us that the relatively high level of occurrence of germ cell tumors in white Hispanics is not just a fluke occurring in one cancer of one particular site. It is a consistent phenomenon that extends to several different related tumors and their histologic variants.
I should stress that germ cell tumors are rare, even in the Hispanic population. We are discussing relative rates of uncommon tumors among different ethnicities. An individual's risk of developing a germ cell tumor is low, regardless of ethnicity.
CONFIRMING PRECANCER BIOLOGY
Here is an example. Precancers are the identifiable, easily-treated, lesions from which advanced cancers develop. When you eradicate the precancer, the cancer never develops.
In theory, if precancers precede cancers, the average age at diagnosis of precancers should be lower than the average age at diagnosis of the cancers that develop from those precancers. This biologic tautology has been hard to verify, because not all precancers will develop in people of the same age. The same is true for cancers. And it's hard to come up with a population large enough to separate the (overlapping) precancer and cancer populations.
However, the SEER population provides the numbers we need. When we extract all of the SEER neoplasms that arise in the uterine cervix, we find the following average ages for the resulting set of tumors:
Age  Number of cases  Neoplasm name
--------------------------------------------------------------
034 0049176 carcinoma in situ nos
034 0051906 squamous cell carcinoma in situ nos
035 0000359 sq cell carcinoma lg cell non-ker in situ
037 0000313 sq cell carcinoma keratinizing nos in situ
039 0001348 adenocarcinoma in situ
039 0018551 squamous intraepithelial neoplasia, grade iii
039 0000058 squamous cell carcinoma in situ with questionable stromal invasion
041 0003213 squamous cell carcinoma, microinvasive
043 0000113 adenocarcinoma, endocervical type
048 0001320 adenosquamous carcinoma
048 0000049 neuroendocrine carcinoma
048 0000093 large cell carcinoma nos
049 0003118 squamous cell carcinoma, large cell, nonkeratinizing type
050 0002524 carcinoma nos
050 0000259 endometrioid carcinoma
050 0000021 mucinous adenocarcinoma, endocervical type
051 0004121 adenocarcinoma nos
051 0000250 small cell carcinoma nos
051 0002727 squamous cell carcinoma, keratinizing type nos
051 0000104 squamous cell carcinoma, small cell, nonkeratinizing type
052 0000233 mucinous adenocarcinoma
052 0018774 squamous cell carcinoma nos
054 0000218 clear cell adenocarcinoma nos
055 0000088 papillary squamous cell carcinoma
056 0000023 mesodermal mixed tumor
056 0000141 mucin-producing adenocarcinoma
058 0000034 verrucous carcinoma nos
060 0000289 neoplasm, malignant
060 0000037 papillary serous cystadenocarcinoma
062 0000025 sarcoma nos
062 0000055 mullerian mixed tumor
062 0000044 carcinoma, anaplastic type nos
062 0000076 carcinoma, undifferentiated type nos
062 0000033 adenocarcinoma with squamous metaplasia
063 0000043 carcinosarcoma nos
----------------------------------------------------------------
The average age for each of the in situ lesions (i.e., non-invasive precancers) is lower than the average age for any of the observed invasive cancers arising from the cervix! I have never seen this observation demonstrated from any other data set.
It took about one minute to generate the table, using a Perl script that parsed through 3.5 million SEER records.
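For readers who want to try this kind of tabulation, here is a minimal Python sketch of the aggregation step. It is not the Perl script used for this page. It assumes the SEER records have already been parsed into (site, histology, age) tuples; the parsing itself depends on the SEER file layout and is left out. The function names and the `min_cases` cutoff are assumptions of the sketch (the cutoff of 20 is borrowed from the later listings on this page).

```python
from collections import defaultdict


def average_age_by_histology(records, site, min_cases=20):
    """Return sorted (mean_age, case_count, histology) rows for one site.

    `records` is any iterable of (site, histology, age) tuples.
    Histologies with fewer than `min_cases` cases are dropped.
    """
    totals = defaultdict(lambda: [0, 0])  # histology -> [sum of ages, cases]
    for rec_site, histology, age in records:
        if rec_site != site or age is None:
            continue  # skip other sites and records with unknown age
        totals[histology][0] += age
        totals[histology][1] += 1
    rows = [
        (age_sum / count, count, name)
        for name, (age_sum, count) in totals.items()
        if count >= min_cases
    ]
    return sorted(rows)  # youngest average age first


def format_row(mean_age, count, name):
    """Zero-padded layout similar to the listings on this page."""
    return f"{round(mean_age):03d} {count:07d} {name}"


# Toy records, invented for illustration only
toy = [
    ("cervix", "carcinoma in situ nos", 30),
    ("cervix", "carcinoma in situ nos", 38),
    ("cervix", "squamous cell carcinoma nos", 50),
    ("cervix", "squamous cell carcinoma nos", 54),
    ("cervix", "sarcoma nos", 62),
    ("ovary", "carcinoma nos", 70),
]
for row in average_age_by_histology(toy, "cervix", min_cases=2):
    print(format_row(*row))
```

Because the function makes a single pass and keeps only two running numbers per histology, the same approach should scale to millions of records without holding them in memory.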
BONE MARROW PRECANCERS
Today, we'll look at the neoplasms that occur in the blood and bone marrow. Do the precancers of blood occur at a younger age than the cancers that develop from those precancers (as we might expect)?
Here are the SEER listings for neoplasms of the blood and bone marrow. The left-hand column is the average age of occurrence of each neoplasm. The middle column is the number of cases in the SEER collection (neoplasms with fewer than 20 SEER cases were considered un-informative and were omitted from the list). The column on the right is the ICD-O term.
Age  Number of cases  Neoplasm name
--------------------------------------------------------------
015 0000057 langerhans cell histiocytosis, disseminated
021 0000827 precursor b-cell lymphoblastic leukemia
023 0009220 precursor cell lymphoblastic leukemia, nos
024 0000117 precursor t-cell lymphoblastic leukemia
042 0000117 burkitt cell leukemia
043 0000024 burkitt lymphoma, nos
045 0000233 burkitt's tumor
047 0001140 acute promyelocytic leukemia
048 0000257 megakaryocytic leukemia
048 0000057 ac. myelomonocytic leuk. w abn. mar. eosinophils
050 0000034 acute biphenotypic leukemia
050 0000085 acute myeloid leukemia, t(8;21)(q22;q22)
052 0000248 chronic myelogenous leukemia, bcr/abl positive
053 0000053 hypereosinophilic syndrome
054 0000121 malignant mastocytosis
055 0000032 megakaryocytic myelosis
055 0000031 mature t-cell lymphoma, nos
058 0002146 hairy cell leukemia
058 0001770 acute monocytic leukemia
059 0000030 hodgkin's disease nos
059 0000447 acute myeloid leukemia without maturation
059 0000022 acute myeloid leukemia, 11q23 abnormalities
060 0000115 myeloid sarcoma
060 0000614 acute myeloid leukemia with maturation
060 0000113 adult t-cell leukemia/lymphoma (htlv-1 pos.)
061 0018129 acute myeloid leukemia
061 0010279 chronic myeloid leukemia
061 0002006 acute myelomonocytic leukemia
061 0000324 acute myeloid leukemia, minimal differentiation
061 0000027 malignant lymphoma, mixed lymphocytic-histiocytic, nodular
062 0000086 therapy-related myelodysplastic syndrome, nos
063 0000032 hemangiosarcoma
063 0000119 plasma cell tumor, malignant
063 0000054 ml, large b-cell, diffuse, immunoblastic, nos
064 0001275 polycythemia vera
064 0000428 ml, large b-cell, diffuse
064 0000087 acute panmyelosis with myelofibrosis
065 0003731 acute leukemia nos
065 0000200 plasma cell leukemia
065 0000998 essential thrombocythemia
065 0000482 plasmacytoma, extramedullary
066 0000815 erythroleukemia
066 0000036 malignant lymphoma, nodular nos
066 0000036 ml, mixed sm. and lg. cell, diffuse
066 0000042 prolymphocytic leukemia, t-cell type
066 0000162 splenic marginal zone b-cell lymphoma
066 0000026 malignant lymphoma, follicular center cell, cleaved, follicular
067 0001014 lymphoid leukemia nos
067 0000225 malignant lymphoma, non hodgkin's type
067 0000327 acute myeloid leuk. with multilineage dysplasia
068 0000148 ml, lymphoplasmacytic
068 0000132 marginal zone b-cell lymphoma, nos
068 0000032 prolymphocytic leukemia, b-cell type
069 0036377 multiple myeloma
069 0001870 myeloid leukemia nos
069 0000094 mantle cell lymphoma
069 0000314 malignant lymphoma nos
069 0000362 myelosclerosis with myeloid metaplasia
069 0000586 chronic myeloproliferative disease, nos
070 0030307 chronic lymphoid leukemia
070 0000301 prolymphocytic leukemia, nos
071 0000276 ml, small b lymphocytic, nos
071 0001582 waldenstrom macroglobulinemia
071 0000263 refractory cytopenia with multilineage dysplasia
071 0000080 refract. anemia with excess blasts in transformation
072 0002580 leukemia nos
072 0000763 refractory anemia with excess blasts
073 0000820 refractory anemia
074 0002422 myelodysplastic syndrome, nos
074 0000655 refractory anemia with sideroblasts
074 0001798 chronic myelomonocytic leukemia, nos
074 0000089 myelodysplastic syndr. with 5q deletion syndrome
----------------------------------------------------------------
The precancer lesions of the blood cells are the myelodysplasias (previously called preleukemias). They include the refractory anemias and chronic myelomonocytic leukemia (not to be confused with chronic myeloid leukemia). These lesions sometimes progress to acute myelogenous leukemia.
Here are the average ages of development of the myelodysplasias:
Age  Number of cases  Neoplasm name
--------------------------------------------------------------
071 0000263 refractory cytopenia with multilineage dysplasia
071 0000080 refract. anemia with excess blasts in transformation
072 0000763 refractory anemia with excess blasts
073 0000820 refractory anemia
074 0002422 myelodysplastic syndrome, nos
074 0000655 refractory anemia with sideroblasts
074 0001798 chronic myelomonocytic leukemia, nos
074 0000089 myelodysplastic syndr. with 5q deletion syndrome
--------------------------------------------------------------
All of the myelodysplasias cluster at the upper end of ages for blood neoplasms (70+ years old). This is far older than the average age of occurrence of the acute leukemias (into which the myelodysplasias develop).
Age  Number of cases  Neoplasm name
--------------------------------------------------------------
050 0000085 acute myeloid leukemia, t(8;21)(q22;q22)
058 0001770 acute monocytic leukemia
059 0000447 acute myeloid leukemia without maturation
059 0000022 acute myeloid leukemia, 11q23 abnormalities
060 0000614 acute myeloid leukemia with maturation
061 0018129 acute myeloid leukemia
061 0002006 acute myelomonocytic leukemia
061 0000324 acute myeloid leukemia, minimal differentiation
065 0003731 acute leukemia nos
067 0000327 acute myeloid leuk. with multilineage dysplasia
--------------------------------------------------------------
How can precursor lesions occur in a population that is older than the population in which the developed cancer occurs?
The answer is simple. Most acute leukemias do not develop from the myelodysplasias. When we look at column two, we see at a glance that the acute leukemias are much more numerous than the myelodysplasias.
The pathway of myelodysplasia to acute leukemia is the exception, not the rule, and we would need to find some other precursor lesion to account for the bulk of acute myeloid leukemias.
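The comparison of case counts can be checked directly from the two short listings. The Python sketch below simply copies the (age, cases) pairs from those listings; the function names are mine, and the case-weighted mean is only approximate because the listed ages are already rounded to whole years.

```python
# (average age, number of cases), copied from the myelodysplasia listing
myelodysplasias = [(71, 263), (71, 80), (72, 763), (73, 820),
                   (74, 2422), (74, 655), (74, 1798), (74, 89)]

# (average age, number of cases), copied from the acute leukemia listing
acute_leukemias = [(50, 85), (58, 1770), (59, 447), (59, 22), (60, 614),
                   (61, 18129), (61, 2006), (61, 324), (65, 3731), (67, 327)]


def total_cases(rows):
    """Sum the case counts over all rows."""
    return sum(cases for _, cases in rows)


def weighted_mean_age(rows):
    """Case-weighted mean of the (already rounded) average ages."""
    return sum(age * cases for age, cases in rows) / total_cases(rows)


mds_total = total_cases(myelodysplasias)    # 6890
acute_total = total_cases(acute_leukemias)  # 27455
print(mds_total, acute_total, round(acute_total / mds_total, 1))
print(round(weighted_mean_age(myelodysplasias), 1),
      round(weighted_mean_age(acute_leukemias), 1))
```

On these listings the acute leukemias outnumber the myelodysplasias by roughly four to one, and the case-weighted average ages differ by about twelve years.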
This is an example of how to use the SEER data to examine and test existing hypotheses and to develop new hypotheses. It took under a minute to generate the table, using a Perl script that parsed through 3.5 million SEER records.
APPENDICEAL TUMORS
Today, we'll look at the neoplasms that occur in the appendix of the colon.
Here are the SEER listings. The left-hand column is the average age of occurrence of each neoplasm. The middle column is the number of cases in the SEER collection (neoplasms with fewer than 20 SEER cases were considered un-informative and were omitted from the list). The column on the right is the ICD-O term.
Age  Number of cases  Neoplasm name
--------------------------------------------------------------
039 022 carcinoid tumor, argentaffin, malignant
040 434 carcinoid tumor, malignant
050 217 adenocarcinoid tumor
054 224 mucocarcinoid tumor, malignant
055 038 composite carcinoid
058 022 carcinoma nos
058 138 signet ring cell carcinoma
059 186 mucin-producing adenocarcinoma
060 640 mucinous adenocarcinoma
061 175 mucinous cystadenocarcinoma nos
063 649 adenocarcinoma nos
065 033 adenocarcinoma in tubulovillous adenoma
067 063 adenocarcinoma in villous adenoma
--------------------------------------------------------------
Though many different kinds of malignant neoplasms can occur in the appendix (and can be found in the SEER data), only carcinoids and adenocarcinomas occur frequently.
All of the carcinoid tumors cluster within a younger average age of occurrence than the adenocarcinomas.
This tells us a few things:
1. All of the carcinoids are biologically related to each other.
2. The carcinoids have a different developmental history than the adenocarcinomas.
3. When a pathologist sees a focus of adenocarcinoma in an appendiceal tumor, particularly in a young or middle-aged patient, he or she should carefully look for a focus of carcinoid, because the tumor might be a mixed adenocarcinoid tumor.
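The clustering of the carcinoids at younger ages can be checked mechanically from the appendix listing. The Python sketch below parses the listing rows (age, cases, name) and groups them by keyword. The keyword grouping is my own simplification: any name containing "carcinoid" (including adenocarcinoid and mucocarcinoid) is counted as a carcinoid, and any name containing "adenocarcinoma" (including cystadenocarcinoma) is counted as an adenocarcinoma.

```python
LISTING = """\
039 022 carcinoid tumor, argentaffin, malignant
040 434 carcinoid tumor, malignant
050 217 adenocarcinoid tumor
054 224 mucocarcinoid tumor, malignant
055 038 composite carcinoid
058 022 carcinoma nos
058 138 signet ring cell carcinoma
059 186 mucin-producing adenocarcinoma
060 640 mucinous adenocarcinoma
061 175 mucinous cystadenocarcinoma nos
063 649 adenocarcinoma nos
065 033 adenocarcinoma in tubulovillous adenoma
067 063 adenocarcinoma in villous adenoma
"""


def parse_listing(text):
    """Turn 'AGE CASES NAME' lines into (age, cases, name) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        age, cases, name = line.split(maxsplit=2)
        rows.append((int(age), int(cases), name))
    return rows


rows = parse_listing(LISTING)
carcinoid_ages = [age for age, _, name in rows if "carcinoid" in name]
adenocarcinoma_ages = [age for age, _, name in rows if "adenocarcinoma" in name]

# Oldest carcinoid average age versus youngest adenocarcinoma average age
print(max(carcinoid_ages), min(adenocarcinoma_ages))
```

If the oldest carcinoid average is still younger than the youngest adenocarcinoma average, the two groups do not overlap in this listing, which is the pattern described in the text.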
TUMORS OF THE PLEURA, PERITONEUM, RETROPERITONEUM, AND PELVIS
Today, we'll look at the neoplasms that occur in the related anatomic sites: pleura, peritoneum, retro-peritoneum, and pelvis.
Here are the SEER listings. The left-hand column is the number of occurrences. Neoplasms with fewer than 20 SEER cases were considered un-informative and were omitted from the lists. The second column is the average age of occurrence. The column on the right is the ICD-O term.
SEER captures malignant neoplasms. In the SEER data set, we saw the following distribution of cases by anatomic site:
PLEURA = 6,138 cases
PERITONEUM = 3,067 cases
RETROPERITONEUM = 3,640 cases
PELVIS = 470 cases
The pleura is the mesothelial-lined cavity of the chest, surrounding and covering the lungs. The peritoneum is the mesothelial-lined cavity of the abdomen, surrounding and covering all or part of the intestines and other organs of the abdomen (e.g., liver, spleen, pancreas).
The pleura accounts for many more occurrences of malignant tumors than does the peritoneum. Only a few different tumors account for the vast majority of malignant neoplasms arising in the pleura.
MALIGNANT NEOPLASMS OF THE PLEURA (MALE AND FEMALE)
TOTAL = 6,138 CASES
Number of cases  Age  Neoplasm name
--------------------------------------------------------------
0024 068 sarcoma nos
0038 071 ml, large b-cell, diffuse
0110 070 neoplasm, malignant
0236 073 mesothelioma, biphasic type, malignant
0433 069 fibrous mesothelioma, malignant
1079 068 epithelioid mesothelioma, malignant
4027 069 mesothelioma, malignant
--------------------------------------------------------------
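To put a number on "the vast majority," the share of pleural cases that are mesotheliomas can be worked out from the listing. This Python sketch copies the case counts from the pleura table; grouping the rows by the word "mesothelioma" in the name is my own shortcut.

```python
pleura_total = 6138  # total stated in the table heading

# Case counts copied from the pleura listing
pleura_rows = {
    "sarcoma nos": 24,
    "ml, large b-cell, diffuse": 38,
    "neoplasm, malignant": 110,
    "mesothelioma, biphasic type, malignant": 236,
    "fibrous mesothelioma, malignant": 433,
    "epithelioid mesothelioma, malignant": 1079,
    "mesothelioma, malignant": 4027,
}

mesothelioma_cases = sum(
    cases for name, cases in pleura_rows.items() if "mesothelioma" in name
)
share = mesothelioma_cases / pleura_total
print(mesothelioma_cases, f"{share:.1%}")

# Cases not shown in the listing (types with fewer than 20 cases each)
unlisted = pleura_total - sum(pleura_rows.values())
print(unlisted)
```

On these counts, the four mesothelioma entries account for about 94 percent of all pleural cases.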
The peritoneum, with about half as many cancer occurrences as the pleura, has more types of tumors that occur in significant numbers (20 or more), including the tumors that arise from the surface of the ovaries (e.g. papillary serous cystadenocarcinoma).
NEOPLASMS OF THE PERITONEUM (MALE AND FEMALE)
TOTAL = 3067
Number of cases  Age  Neoplasm name
--------------------------------------------------------------
022 062 liposarcoma nos
023 062 gastrointestinal stromal sarcoma
031 067 sarcoma nos
032 064 carcinoid tumor, malignant
038 085 endometrioid carcinoma
045 061 mucinous adenocarcinoma
046 066 fibrous histiocytoma, malignant
047 067 mullerian mixed tumor
053 066 neoplasm, malignant
073 072 carcinoma nos
073 063 ml, large b-cell, diffuse
097 068 papillary adenocarcinoma nos
127 063 leiomyosarcoma nos
134 062 epithelioid mesothelioma, malignant
180 065 serous cystadenocarcinoma nos
245 067 adenocarcinoma nos
359 066 serous surface papillary carcinoma
451 067 papillary serous cystadenocarcinoma
551 062 mesothelioma, malignant
--------------------------------------------------------------
The retroperitoneum is the collection of tissues that lie between the peritoneal lining and the surface wall of the abdomen. The retroperitoneum is often referred to as the retroperitoneal space. This is not the best term, as it calls to mind a body cavity (space), perhaps lined by mesothelium, and this is not the case. The retroperitoneum is mostly fat, connective tissue, and organs. Fully retroperitoneal organs, such as the kidney and attached adrenals, can drop a little bit along the potential space of their surrounding fascia, but that's about the closest thing to a space that the retroperitoneum can offer. Organs of the abdomen that are slapped tightly against the posterior wall of the peritoneum are, technically, retroperitoneal (such as the ascending and descending colon, and the rectum). Organs or parts of organs that dangle in the abdomen (such as the transverse colon) are fully peritoneal.
For the purposes of collecting data on retroperitoneal neoplasms, the tumors that arise from identifiable organs (e.g., kidney, head of pancreas, adrenals, rectum) are assigned to those organs in the SEER dataset, and NOT to the retroperitoneum. This leaves, for the most part, soft tissue tumors, muscle tumors and nerve tumors arising in the retroperitoneum. There is a great variety of these tumors, even when we restrict our list to those tumors that occur with a frequency of 20 or greater.
NEOPLASMS OF THE RETROPERITONEUM (MALE AND FEMALE)
TOTAL CASES = 3640
Number of cases  Age  Neoplasm name
--------------------------------------------------------------
020 049 rhabdomyosarcoma nos
020 020 endodermal sinus tumor
022 021 teratoma, malignant nos
022 065 epithelioid leiomyosarcoma
023 010 embryonal rhabdomyosarcoma
024 069 mesothelioma, malignant
025 054 mesenchymoma, malignant
027 057 hemangiopericytoma, malignant
030 028 embryonal carcinoma nos
034 064 malignant lymphoma nos
035 048 neurofibrosarcoma
038 058 mixed type liposarcoma
040 066 malignant lymphoma, non hodgkin's type
043 045 neurilemmoma, malignant
045 044 seminoma nos
051 012 ganglioneuroblastoma
055 058 fibrosarcoma nos
060 066 pleomorphic liposarcoma
075 063 spindle cell sarcoma
111 063 dedifferentiated liposarcoma
127 067 neoplasm, malignant
140 062 myxoid liposarcoma
142 064 ml, large b-cell, diffuse
222 060 liposarcoma, well differentiated type
223 062 sarcoma nos
255 004 neuroblastoma nos
298 063 liposarcoma nos
311 063 fibrous histiocytoma, malignant
622 062 leiomyosarcoma nos
--------------------------------------------------------------
The pelvis is a commonly used anatomic term that creates much confusion. It is sometimes described as the bowl-like invagination in the lower abdomen, or it may be described as the structures that support the lower abdomen, or it may be described as the set of bones that create the framework of the bowl-like invagination.
It is very difficult to assign neoplasms to the pelvis, because tumors arising in this area can best be assigned to the bones in which they are found, or to the peritoneum, or to the retroperitoneum. The difficulty of assigning neoplasms to the pelvis becomes apparent when we see that of the approximately 3.5 million cases in the SEER dataset, there are only 470 cases assigned to the pelvis, and of these cases, most seem to arise, more specifically, from the peritoneum (papillary serous cystadenocarcinoma) or the uterine cervix (squamous cell carcinoma), or the intestines (adenocarcinoma).
NEOPLASMS OF THE PELVIS (MALE AND FEMALE)
TOTAL = 470
Number of cases  Age  Neoplasm name
--------------------------------------------------------------
023 069 papillary serous cystadenocarcinoma
025 069 squamous cell carcinoma nos
050 075 carcinoma nos
061 066 adenocarcinoma nos
126 078 neoplasm, malignant
--------------------------------------------------------------
SUMMARY
There is a tautologic remark that pathologists use. "Common tumors occur commonly, and uncommon tumors occur uncommonly." This means that a pathologist should be cautious when assigning a diagnosis that rarely occurs at the location where the tumor has arisen.
To know which tumors commonly occur, at what sites, at what ages, in what ethnic populations, it is very useful to have a large collection of neoplasms from which to study, and to have a good understanding of the frequency of occurrence of the neoplasms that arise at the site. The SEER data set permits such determinations.
As specified in the SEER Data Agreement, the citation for the SEER data
is as follows:
Surveillance, Epidemiology, and End Results (SEER) Program
(www.seer.cancer.gov) Limited-Use Data (1973-2005), National Cancer
Institute, DCCPS, Surveillance Research Program, Cancer Statistics
Branch, released April 2008, based on the November 2007 submission.
If you want to do creative data mining, you will need to
learn a little computer programming.
For Perl and Ruby programmers, methods and scripts for using
SEER and other publicly available biomedical databases are
described in detail in my prior books:
Perl Programming for Medicine and Biology
Ruby Programming for Medicine and Biology
An overview of the many uses of biomedical information is available in my book,
Biomedical Informatics.
More information on cancer is available in my recently published
book, Neoplasms.
I also maintain a blog where I regularly write about data organization,
data annotation, data retrieval, and data mining.
http://julesberman.blogspot.com
© 2008 Jules Berman
The anatomic figures appearing on these pages were provided by
Wikipedia. They are digital reproductions of images found in Gray's
Anatomy. Because their copyright has expired, the images now fall
within the public domain.
As with all of my scripts, lists, web sites, and blog entries,
the following disclaimer applies.
This material is provided by its creator, Jules J. Berman, "as is",
without warranty of any kind, expressed or implied, including but
not limited to the warranties of merchantability, fitness for a
particular purpose and noninfringement. In no event shall the
author or copyright holder be liable for any claim, damages or
other liability, whether in an action of contract, tort or otherwise,
arising from, out of or in connection with the material or the use or
other dealings in the material.
Last modified: November 18, 2008