Supplemental Book Resources for:
- Author Information
- Corrections site for Ruby Programming for Medicine and Biology
- Corrections site for Perl Programming for Medicine and Biology
- Corrections site for Biomedical Informatics
- Corrections site for The Ruby Programming Language
- Corrections site for The Perl Programming Language
- Corrections site for Neoplasms: Principles of Development and Diversity
- Approximately 240 lists from Biomedical Informatics
- Phylogeny (class hierarchy) construction for any of over 400,000 organisms.
- White paper: Jules J. Berman and G. William Moore.
Implementing an RDF Schema for Pathology Images
September 10, 2007
- Developmental Lineage Neoplasms Classification (now in RDF ontology format).
The RDF file is about 10 Megabytes and is so large that some
browsers cannot parse the entire file, with all of its child nodes.
I am providing the file as a simple text file. This way your
browser will not try to parse the nodal hierarchy.
Whenever you want to parse the
file as an RDF document, you can just rename
the file "neordf.xml".
http://www.julesberman.info/neordf.txt
The file was validated using the W3C validator service
at http://www.w3.org/rdf/validator/, with a caveat. The
full ontology file (10+ Mbytes) was too large for the
validator, so I truncated the ontology, validated the
truncated file (that contained all of the classes,
subclasses, properties), and left out the repetitive
list of terms. Then I took the entire file and validated
it with an XML parser to verify that the file was well-formed.
Together, the two checks cover both the RDF semantics and the XML structure.
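The well-formedness check on the full file can be sketched in a few lines of Python. This is only an illustration, not the parser that was actually used; the function name is_well_formed is hypothetical, and the check covers XML structure only, not RDF semantics.

```python
import io
import xml.etree.ElementTree as ET

def is_well_formed(source):
    """Return (True, None) if source parses as XML, else (False, error message).

    source may be a file name or a binary file object.
    """
    try:
        # iterparse reads the document incrementally; clearing each element
        # after it closes keeps memory use modest on a 10+ Mbyte file.
        for _event, element in ET.iterparse(source):
            element.clear()
        return True, None
    except ET.ParseError as err:
        return False, str(err)

# Small in-memory demonstration (hypothetical sample data).
print(is_well_formed(io.BytesIO(b"<a><b>term</b></a>")))  # well-formed input
print(is_well_formed(io.BytesIO(b"<a><b>term</a>")))      # mismatched closing tag
```

After renaming the downloaded text file to "neordf.xml", the same function could be called as is_well_formed("neordf.xml").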
The gzipped version of the RDF file (under 1 Megabyte)
http://www.julesberman.info/neorxml.gz
The flat file version, listing each term followed by its lineage (gzipped file).
http://www.julesberman.info/neoself.gz
The plain old XML version, with no RDF semantics (compressed gzip file).
http://www.julesberman.info/neoclxml.gz
The plain old XML version, with no RDF semantics (compressed zip file).
http://www.julesberman.info/neoclxml.zip
The ontology contains several parts, including the
neoplasms classification proper (illustrated in the schematic below).
In this version (October 27, 2007), there are 5,841 classified
types of neoplasms and
130,503 terms representing the 5,841 types of neoplasms.
This version of the neoplasms classification represents the largest
nomenclature of neoplasms and, with
today's publication, the largest formal ontology (in RDF syntax) of
neoplasm names.
- Parsable Doublets List
Word doublets are two-word phrases that
appear in text (i.e., they are not randomly chosen two-word
sequences).
Doublets can be used in a variety of informatics projects: indexing,
data scrubbing, nomenclature curation, etc.
The available public domain doublet list was generated from a large narrative pathology text. Thus
the doublets included here would be particularly suitable for
informatics projects involving surgical pathology reports, autopsy
reports, pathology papers and books, and so on.
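As a rough sketch of how a doublet list might be built from a text, the Python below collects every pair of adjacent words. It is not the script used to generate the public domain list; the function name extract_doublets, the lowercasing, the letters-only tokenization, and the rule that doublets do not cross sentence punctuation are all assumptions made for illustration.

```python
import re

def extract_doublets(text):
    """Return the set of adjacent two-word phrases found in text."""
    doublets = set()
    # Assumption: split on sentence punctuation so that no doublet
    # spans the end of one sentence and the start of the next.
    for sentence in re.split(r"[.!?;:]+", text.lower()):
        # Assumption: a word is a run of letters; digits and hyphens are dropped.
        words = re.findall(r"[a-z]+", sentence)
        for first, second in zip(words, words[1:]):
            doublets.add(f"{first} {second}")
    return doublets

# Hypothetical sample text.
sample = "Adenocarcinoma of the colon. Colon polyp"
for doublet in sorted(extract_doublets(sample)):
    print(doublet)
```

A set is used so that each doublet is stored once no matter how often it occurs in the source text.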
- De-identifier for confidential medical records.
A scrubber (deidentification) use-case for the doublets
list is available.
It consists of scrubbed output for over 15,000 citations of
pathology papers (from PubMed).
A slightly more complex scrubber (deidentification)
that preserves punctuation in the original narrative text.
A scrubber (deidentification)
that parses a public domain book.
The scrubber, which is distributed under a GNU license,
uses the doublet method (described in both of my
previously published books). It parses through any text,
matching doublets from the text against an external identifier-free
doublet list, preserving all matching doublets from the text, and
blocking all non-matching words with an asterisk. If your list of
doublets contains no identifiers, the scrubbed output should be
perfectly de-identified. Though perfection can never be guaranteed,
I have never encountered any "missed" identifiers in a text that was
parsed under these conditions.
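The doublet method described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the distributed GNU-licensed script: the function name scrub and the sample doublet list are hypothetical, and this version lowercases the text and discards punctuation.

```python
import re

def scrub(text, safe_doublets):
    """Keep words that belong to an approved doublet; replace all others with '*'."""
    # Assumption: a word is a run of lowercase letters or digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    keep = [False] * len(words)
    # Slide over every adjacent word pair; a pair found in the
    # identifier-free list marks both of its words as safe to keep.
    for i in range(len(words) - 1):
        if f"{words[i]} {words[i + 1]}" in safe_doublets:
            keep[i] = keep[i + 1] = True
    return " ".join(word if kept else "*" for word, kept in zip(words, keep))

# Hypothetical identifier-free doublet list and input sentence.
safe = {"carcinoma of", "of the", "the colon"}
print(scrub("John Smith has carcinoma of the colon", safe))
# prints: * * * carcinoma of the colon
```

Holding the doublet list in a set makes each lookup a constant-time operation, so the running time grows linearly with the length of the text.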
The doublet scrubber is small (just a few dozen lines of
code) and fast. It took approximately 2 seconds to parse the 15,000 citations using a
Perl script with access to a list of about 200,000 identifier-free
doublets. I used my home computer (2.8 GHz, 512 MByte RAM).
This is a scrubbing rate of about 1 Megabyte per second. At this speed, a
1 GByte file could be parsed in about 15 minutes. It can parse a
1 Terabyte file in about a week. Large hospitals produce about 1 Terabyte
of data each week, so this scrubber, running on a modest desktop computer,
can, for now, "keep up" with the vast load of data produced by many hospitals.
The only limitation that I have found for the doublet
scrubber is that it scrubs too much, blocking all doublets not found
in the external doublet list. You can be the judge by studying
the scrubbed output files:
Over 15,000 PubMed Citations, scrubbed
Anomalies and Curiosities of Medicine (scrubbed output)
- Medical Autocoding: A Ruby medical autocoder,
a Perl medical autocoder, and
a Python medical autocoder.
- Prime number generation: A Ruby prime number generator,
a Perl prime number generator, and
a Python prime number generator.
- A phylogeny annotator for taxonomy.dat, the European Bioinformatics Institute's 100+ Mbyte
list of species, in Ruby,
Perl, and
Python.
- A MeSH Tree (more precisely, MeSH ontology) climber in
Perl, which produces a full listing of the term relationships
for every entry in MeSH (the U.S. National Library of Medicine's Medical Subject
Headings).
- Combined Autocoded and Scrubbed output
for 95,260 PubMed Citations (computed in under a minute)
- Berman JJ. Nomenclature-based data retrieval without prior
annotation: facilitating biomedical data integration with fast
doublet matching.
In Silico Biology 5, 0029 (2005)
- Approximately 18,000 biomedical factoids
(distributed in 180 information pages).
- Over 12,000 biomedical abbreviations.
- Chronology of Earth (important events in the history
of science, medicine, informatics, and society).
- Common misspellings in medicine and pathology.
- Confusing terms in medicine and pathology.
- Specified Life Blog: Devoted to the topic of data
specification (including data organization, data description,
data retrieval and data sharing) in the life sciences and in medicine.
- Perl script, omim_4.pl, extracts classified neoplasm terms from OMIM
(Online Mendelian Inheritance in Man). Described in Chapter 19 of
Perl Programming for Medicine and Biology.
- Ruby Scripts
Last modified: June 7, 2008