Author Information
Contact Jules J. Berman by email.
Publications of Jules J. Berman.
Jules J. Berman's blog.
Corrections site for Neoplasms: Principles of Development and Diversity
Approximately 240 lists from Biomedical Informatics
Phylogeny (class hierarchy) construction for any of over 400,000 organisms.
White paper: Jules J. Berman and G. William Moore. Implementing an RDF Schema for Pathology Images, September 10, 2007.
Essay on Tumor Speciation, February 19, 2009.
Developmental Lineage Classification and Taxonomy of Neoplasms
The Developmental Lineage Classification and Taxonomy of Neoplasms is an open source, computer-parsable data set that can be used to organize, collect, merge, share, analyze, and understand data related to neoplasia, to develop and test hypotheses, and to discover new information. The classification is used extensively in software projects featured elsewhere on this page.
The Developmental Lineage Classification and Taxonomy of Neoplasms is a free, open source document available in several different file formats at:
http://www.julesberman.info/devclass.htm
The ontology for the Developmental Lineage Classification and Taxonomy of Neoplasms is illustrated in the schematic below.
The classification and taxonomy contains about 6,000 classified types of neoplasms and over 130,000 neoplasm names. It is the largest cancer nomenclature in existence.
The classification has been used to generate a list of inherited medical conditions that are associated with neoplasms, available at:
http://www.julesberman.info/omimneo.htm
The classification also serves as the dataset underlying a neoplasm search engine that provides synonyms and related terms for phrases entered into a query box. This is available at:
http://www.julesberman.info/neoget.htm
Parsable Doublets List
Word doublets are two-word phrases that actually occur in text (i.e., they are not randomly chosen two-word sequences).
Doublets can be used in a variety of informatics projects: indexing, data scrubbing, nomenclature curation, etc.
The available public domain doublet list was generated from a large narrative pathology text. Thus, the doublets included here would be particularly suitable for informatics projects involving surgical pathology reports, autopsy reports, pathology papers and books, and so on.
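As an illustration of how a doublet list might be built from narrative text, here is a minimal Python sketch. It is not the script used to produce the list above; the function name `extract_doublets`, the sample sentence, and the choice to stop doublets at sentence punctuation are all assumptions made for the example.

```python
import re

def extract_doublets(text):
    """Return the set of adjacent two-word phrases (doublets) found in text."""
    doublets = set()
    # Split on sentence punctuation so no doublet spans a sentence boundary
    # (an assumption of this sketch, not a documented rule of the list).
    for sentence in re.split(r"[.!?;:\n]+", text):
        words = re.findall(r"[a-z]+", sentence.lower())
        # Pair each word with the word that follows it.
        doublets.update(f"{a} {b}" for a, b in zip(words, words[1:]))
    return doublets

# Hypothetical sample text, for illustration only.
sample = "The tumor was a poorly differentiated adenocarcinoma. Margins were clear."
for doublet in sorted(extract_doublets(sample)):
    print(doublet)
```

Because every doublet comes from real text, a phrase such as "poorly differentiated" is collected, while "adenocarcinoma margins", which straddles two sentences, is not.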
De-identifier for confidential medical records.
A scrubber (deidentification) use-case for the doublets list is available. It consists of scrubbed output for over 15,000 citations of pathology papers (from PubMed).
A slightly more complex scrubber (deidentification) that preserves punctuation in the original narrative text.
A scrubber (deidentification) that parses a public domain book.
The scrubber, which is distributed under a GNU license, uses the doublet method (described in both of my previously published books). It parses through any text, matching doublets from the text against an external identifier-free doublet list, preserving all matching doublets from the text, and blocking all non-matching words with an asterisk. If your list of doublets contains no identifiers, the scrubbed output should be perfectly de-identified. Though perfection can never be guaranteed, I have never encountered any "missed" identifiers in a text that was parsed under these conditions.
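The doublet method described above can be sketched in a few lines of Python. This is a simplified illustration, not the distributed script: the function name `scrub`, the tiny `safe` doublet set, and the sample sentence are hypothetical, and this version discards punctuation and lowercases the words it keeps.

```python
import re

def scrub(text, safe_doublets):
    """Keep words that fall inside an approved doublet; block the rest with '*'."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    keep = [False] * len(words)
    for i in range(len(words) - 1):
        # A word survives if it is part of at least one listed doublet.
        if f"{words[i]} {words[i + 1]}".lower() in safe_doublets:
            keep[i] = keep[i + 1] = True
    return " ".join(w.lower() if k else "*" for w, k in zip(words, keep))

# Hypothetical identifier-free doublet list, for illustration only.
safe = {"was admitted", "admitted with", "with acute", "acute appendicitis"}
print(scrub("John Smith was admitted with acute appendicitis.", safe))
# prints: * * was admitted with acute appendicitis
```

The name is blocked because neither "john smith" nor "smith was" appears in the approved list, so the safety of the output depends entirely on the list containing no identifiers.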
The doublet scrubber is small (just a few dozen lines of code) and fast. It took approximately 2 seconds to parse the 15,000 citations using a Perl script with access to a list of about 200,000 identifier-free doublets. I used my home computer (2.8 GHz, 512 MByte RAM). This is a scrubbing rate of about 1 MegaByte per second. At this speed, a 1 GByte file could be parsed in about 17 minutes, and a 1 Terabyte file in about 12 days. Large hospitals produce about 1 Terabyte of data each week, so two modest desktop computers running the scrubber in parallel could, for now, "keep up" with the vast load of data produced by many hospitals.
The only limitation that I have found for the doublet scrubber is that it scrubs too much, blocking all doublets not found in the external doublet list. You can be the judge by studying the scrubbed output files:
Over 15000 PubMed Citations, scrubbed
Anomalies and Curiosities of Medicine (scrubbed output)
Medical Autocoding: A Ruby medical autocoder, a Perl medical autocoder, and a Python medical autocoder.
Prime number generation: A Ruby prime number generator, a Perl prime number generator, and a Python prime number generator.
A phylogeny annotator for taxonomy.dat, the European Bioinformatics Institute's 100+ Mbyte list of species, in Ruby, Perl, and Python.
A MeSH Tree (more precisely MeSH ontology) climber in Perl, which produces a full listing of the term relationships for every entry in MeSH (the U.S. National Library of Medicine's Medical Subject Headings).
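For readers unfamiliar with how a MeSH hierarchy can be climbed, the following Python sketch shows one common approach, which may differ from the linked Perl script. It relies on the fact that MeSH tree numbers are dot-delimited, so each ancestor's tree number is obtained by dropping the last segment. The function name `ancestors` and the example tree number are chosen for illustration only.

```python
def ancestors(tree_number):
    """List the ancestor tree numbers of a MeSH tree number, nearest parent first."""
    parts = tree_number.split(".")
    # Drop one trailing segment at a time until only the top-level node remains.
    return [".".join(parts[:n]) for n in range(len(parts) - 1, 0, -1)]

# Example tree number used purely for illustration.
print(ancestors("C04.557.470.200"))
# prints: ['C04.557.470', 'C04.557', 'C04']
```

A term that carries several tree numbers has one such ancestor chain per tree number, so a full climber would repeat this step for each of them.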
Combined Autocoded and Scrubbed output for 95,260 PubMed Citations (computed in under a minute)
Berman JJ. Nomenclature-based data retrieval without prior annotation: facilitating biomedical data integration with fast doublet matching. In Silico Biology 5, 0029 (2005)
Approximately 18,000 biomedical factoids (distributed in 180 information pages).
Over 12,000 biomedical abbreviations.
Chronology of Earth (important events in the history of science, medicine, informatics, and society).
Common misspellings in medicine and pathology.
Confusing terms in medicine and pathology.
Specified Life Blog: Devoted to the topic of data specification (including data organization, data description, data retrieval and data sharing) in the life sciences and in medicine.
Perl script, omim_4.pl, extracts classified neoplasm terms from OMIM (Online Mendelian Inheritance in Man). Described in Chapter 19 of Perl Programming for Medicine and Biology.
SEER AND CDC documents
Compilation of age distributions (raw data and graphs) for hundreds of types of cancers, using the SEER (Surveillance, Epidemiology, and End Results) public-use data records: PDF document
Compilations (raw data and graphs) for cancers that have multimodal distributions (two or more peaks in the age-of-occurrence data) in the SEER (Surveillance, Epidemiology, and End Results) public-use data records: PDF document
General instructions for acquiring and using the SEER (Surveillance, Epidemiology, and End Results) public-use data records: Web page
SEER sample projects: Web page, Web page
Tutorial on data analysis techniques and projects for the CDC (Centers for Disease Control and Prevention) public use data sets: PDF document
Ruby Scripts
Basics of image display in Ruby
Using RMagick, Ruby's interface to ImageMagick
Creating and displaying a simple image in Ruby using Tk and RMagick.
Extracting and displaying a section of a simple image in Ruby using Tk and RMagick.
Ruby script that extracts a hierarchy for each individual organism listed in taxonomy.dat
Ruby script, embryo.rb, for building an anatomic hierarchy from an embryological nomenclature
Ruby script, combo.rb, for autocoding narrative medical text.
Another Ruby medical autocoder.
A prime number generator in Ruby.
Last modified: October 24, 2009, Jules Berman