Supplemental Book Resources for:
- Author Information
- Corrections site for Neoplasms: Principles of Development and Diversity
- Approximately 240 lists from Biomedical Informatics
- Phylogeny (class hierarchy) construction for any of over 400,000 organisms.
- White paper: Jules J. Berman and G. William Moore. Implementing an RDF Schema for Pathology Images, September 10, 2007.
- Essay on Tumor Speciation, February 19, 2009.
- Developmental Lineage Classification and Taxonomy of Neoplasms
The Developmental Lineage Classification and Taxonomy of Neoplasms
is an open source, computer-parsable data set that can be used to organize,
collect, merge, share, and analyze data, to develop and test hypotheses,
and to discover new information related to neoplasia. The classification is
used extensively in software projects featured elsewhere on this page.
The Developmental Lineage Classification and Taxonomy of Neoplasms
is a free, open source document available in several
different file formats at: http://www.julesberman.info/devclass.htm
The ontology for the
Developmental Lineage Classification and Taxonomy of Neoplasms
is illustrated in the schematic below.
The classification and taxonomy contains about 6,000 classified
types of neoplasms and over 130,000 neoplasm names.
It is the largest cancer nomenclature in existence.
The classification has been used to generate a list of
inherited medical conditions that are associated with neoplasms,
available at:
http://www.julesberman.info/omimneo.htm
The classification also serves as the dataset underlying a neoplasm
search engine that provides synonyms and related terms for phrases
entered into a query box. This is available at:
http://www.julesberman.info/neoget.htm
- Parsable Doublets List
Word doublets are two-word phrases that
appear in text (i.e., they are not randomly chosen two-word
sequences).
Doublets can be used in a variety of informatics projects: indexing,
data scrubbing, nomenclature curation, etc.
The available public domain doublet list was generated from a large narrative pathology text. Thus
the doublets included here would be particularly suitable for
informatics projects involving surgical pathology reports, autopsy
reports, pathology papers and books, and so on.
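As an illustration of what a doublet list contains, here is a minimal Python sketch that collects every adjacent two-word phrase from a text. It is not the script used to build the public domain list; the function name `extract_doublets`, the lowercasing, the word pattern, and the rule that doublets do not cross sentence punctuation are all assumptions made for this example.

```python
import re

def extract_doublets(text):
    """Return the set of adjacent two-word phrases (doublets) found in text.

    Assumptions of this sketch: words are runs of letters or digits,
    case is ignored, and doublets never span sentence punctuation.
    """
    doublets = set()
    # Break the text into sentence-like fragments first.
    for fragment in re.split(r"[.!?;:\n]+", text.lower()):
        words = re.findall(r"[a-z0-9]+", fragment)
        # Pair each word with the word that follows it.
        for first, second in zip(words, words[1:]):
            doublets.add(f"{first} {second}")
    return doublets

if __name__ == "__main__":
    sample = "The tumor was excised. Margins were free of tumor."
    for doublet in sorted(extract_doublets(sample)):
        print(doublet)
```

Run over a large, identifier-free corpus, a routine of this kind would yield a doublet list suited to the vocabulary of that corpus.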
- De-identifier for confidential medical records.
A scrubber (deidentification) use-case for the doublets
list is available.
It consists of scrubbed output for over 15000 citations of
pathology papers (from PubMed).
Also available are a slightly more complex scrubber (deidentification)
that preserves punctuation in the original narrative text,
and a scrubber (deidentification)
that parses a public domain book.
The scrubber, which is distributed under a GNU license,
uses the doublet method (described in both of my
previously published books). It parses through any text,
matching doublets from the text against an external identifier-free
doublet list, preserving all matching doublets from the text, and
blocking all non-matching words with an asterisk. If your list of
doublets contains no identifiers, the scrubbed output should be
perfectly de-identified. Though perfection can never be guaranteed,
I have never encountered any "missed" identifiers in a text that was
parsed under these conditions.
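The doublet method described above can be sketched in a few lines of Python. This is an illustrative approximation, not the distributed Perl scrubber: the function name `scrub`, the word pattern, the lowercasing, and the tiny doublet list are assumptions made for the example, and this version discards punctuation.

```python
import re

def scrub(text, safe_doublets):
    """Keep only words that belong to a doublet in safe_doublets.

    Every other word is replaced with an asterisk. Assumes that
    safe_doublets is a set of lowercase "word word" strings that
    contains no identifiers.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    keep = [False] * len(words)
    # Slide a two-word window over the text; a match preserves both words.
    for i in range(len(words) - 1):
        if f"{words[i]} {words[i + 1]}" in safe_doublets:
            keep[i] = True
            keep[i + 1] = True
    return " ".join(w if k else "*" for w, k in zip(words, keep))

if __name__ == "__main__":
    # Hypothetical identifier-free doublet list, for illustration only.
    safe = {"the patient", "was seen", "seen by", "with a", "a tumor"}
    note = "The patient John Smith was seen by Dr. Jones with a tumor."
    print(scrub(note, safe))
```

Because a word survives only when it sits inside an approved doublet, names and other identifiers absent from the list are blocked by default, which is also why the method tends to over-scrub.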
The doublet scrubber is small (just a few dozen lines of
code) and fast. It took approximately 2 seconds to parse the 15000 citations using a
Perl script with access to a list of about 200,000 identifier-free
doublets. I used my home computer (2.8 GHz, 512 MByte RAM).
This is a scrubbing rate of about 1 MegaByte per second. At this speed, a
1 GByte file could be parsed in about 17 minutes, and a
1 Terabyte file in about 12 days. Large hospitals produce about 1 Terabyte
of data each week, so two modest desktop computers, each scrubbing half of the
data, can, for now, "keep up" with the vast load of
data produced by many hospitals.
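The parse-time estimates above come from dividing file size by scrubbing rate. A small Python sketch of that arithmetic follows, assuming decimal units (1 MegaByte = 10^6 bytes) and a constant rate of 1 MegaByte per second; the function name `parse_time_seconds` is hypothetical, and real throughput will depend on the hardware and the size of the doublet list.

```python
def parse_time_seconds(size_bytes, rate_bytes_per_sec=1_000_000):
    """Estimated time to scrub size_bytes at a constant rate (default 1 MB/s)."""
    return size_bytes / rate_bytes_per_sec

if __name__ == "__main__":
    # Decimal sizes are an assumption of this sketch.
    for label, size in [("1 GByte", 10**9), ("1 Terabyte", 10**12)]:
        seconds = parse_time_seconds(size)
        print(f"{label}: {seconds / 60:.1f} minutes, {seconds / 86400:.2f} days")
```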
The only limitation that I have found for the doublet
scrubber is that it scrubs too much, blocking all doublets not found
in the external doublet list. You can be the judge by studying
the scrubbed output files:
Over 15000 PubMed Citations, scrubbed
Anomalies and Curiosities of Medicine (scrubbed output)
- Medical Autocoding: A Ruby medical autocoder,
a Perl medical autocoder, and
a Python medical autocoder.
- Prime number generation: A Ruby prime number generator,
a Perl prime number generator, and
a Python prime number generator.
- A phylogeny annotator for taxonomy.dat, the European Bioinformatics Institute's 100+ Mbyte
list of species, in Ruby, Perl, and Python.
- A MeSH Tree (more precisely, MeSH ontology) climber in
Perl, which produces a full listing of the term relationships
for every entry in MeSH (the U.S. National Library of Medicine's Medical Subject
Headings).
- Combined Autocoded and Scrubbed output
for 95,260 PubMed Citations (computed in under a minute)
- Berman JJ. Nomenclature-based data retrieval without prior
annotation: facilitating biomedical data integration with fast
doublet matching.
In Silico Biology 5, 0029 (2005)
- Approximately 18,000 biomedical factoids
(distributed in 180 information pages).
- Over 12,000 biomedical abbreviations.
- Chronology of Earth (important events in the history
of science, medicine, informatics, and society).
- Common misspellings in medicine and pathology.
- Confusing terms in medicine and pathology.
- Specified Life Blog: Devoted to the topic of data
specification (including data organization, data description,
data retrieval and data sharing) in the life sciences and in medicine.
- Perl script: omim_4.pl extracts classified neoplasm terms from OMIM
(Online Mendelian Inheritance in Man). Described in Chapter 19 of
Perl Programming for Medicine and Biology.
- SEER AND CDC documents
- Compilation of age distributions (raw data and graphs) for hundreds
of types of cancers, using the SEER
(Surveillance, Epidemiology, and End Results) public-use data records:
PDF document
- Compilations (raw data and graphs) for cancers
that have multimodal distributions
(two or more peaks in ages of occurrence data) in the SEER
(Surveillance, Epidemiology, and End Results) public-use data records:
PDF document
- General instructions for acquiring and using the SEER
(Surveillance, Epidemiology, and End Results) public-use data records:
Web page
- SEER sample projects:
Web page
Web page
- Tutorial on data analysis techniques and projects for the CDC (Centers
for Disease Control and Prevention) public use data sets:
PDF document
- Ruby Scripts
Last modified: February 19, 2009