Books by Jules J. Berman, covers


Author Information
Other Resources

Selected disease lists
  • Approximately 240 lists from Biomedical Informatics

  • Phylogeny (class hierarchy) construction for any of over 400,000 organisms.

  • White paper: Jules J. Berman and G. William Moore. Implementing an RDF Schema for Pathology Images, September 10, 2007.

  • Essay on Tumor Speciation, February 19, 2009.

  • Developmental Lineage Classification and Taxonomy of Neoplasms

    The Developmental Lineage Classification and Taxonomy of Neoplasms is an open source, computer-parsable data set that can be used to organize, collect, merge, share, analyze, and understand data, to develop and test hypotheses, and to discover new information related to neoplasia. The classification is used extensively in software projects featured elsewhere on this page.

    The Developmental Lineage Classification and Taxonomy of Neoplasms is available as a free, open source document in several different file formats at:

    The ontology for the Developmental Lineage Classification and Taxonomy of Neoplasms is illustrated in the schematic below.

    [Schematic: neoplasm classification]

    The classification and taxonomy contains about 6,000 classified types of neoplasms and over 130,000 neoplasm names. It is the largest cancer nomenclature in existence.

    The classification has been used to generate a list of inherited medical conditions that are associated with neoplasms, available at:

    The classification also serves as the dataset underlying a neoplasm search engine that provides synonyms and related terms for phrases entered into a query box. This is available at:

  • Parsable Doublets List

    Word doublets are two-word phrases that appear in text (i.e., they are not randomly chosen two-word sequences).

    Doublets can be used in a variety of informatics projects: indexing, data scrubbing, nomenclature curation, etc.

    The available public domain doublet list was generated from a large narrative pathology text. Thus the doublets included here would be particularly suitable for informatics projects involving surgical pathology reports, autopsy reports, pathology papers and books, and so on.
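
    As a rough illustration of how a doublet list might be built from narrative text, here is a minimal Python sketch. It is not the script used to generate the list above. The function name, the tokenizing rule, and the choice to stop doublets at sentence-ending punctuation are all assumptions made for this example.

```python
import re


def extract_doublets(text):
    """Return the set of adjacent two-word sequences found in text.

    Assumptions of this sketch: text is lowercased, words are runs of
    letters, digits, and apostrophes, and doublets never span
    sentence-ending punctuation.
    """
    doublets = set()
    for sentence in re.split(r"[.!?;:]+", text.lower()):
        words = re.findall(r"[a-z0-9']+", sentence)
        # Pair each word with the word that follows it.
        for first, second in zip(words, words[1:]):
            doublets.add(f"{first} {second}")
    return doublets


sample = "The tumor was a serous cystadenoma. Margins were free of tumor."
doublet_list = sorted(extract_doublets(sample))
for doublet in doublet_list:
    print(doublet)
```

    Run over a large corpus that contains no identifiers, a loop like this would yield one doublet per line, which is a convenient form for an external list that other scripts can load into a set or hash.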

  • De-identifier for confidential medical records.

    A scrubber (deidentification) use-case for the doublets list is available. It consists of scrubbed output for over 15000 citations of pathology papers (from PubMed).

    A slightly more complex scrubber (deidentification) that preserves punctuation in the original narrative text.

    A scrubber (deidentification) that parses a public domain book.

    The scrubber, which is distributed under a GNU license, uses the doublet method (described in both of my previously published books). It parses through any text, matching doublets from the text against an external identifier-free doublet list, preserving all matching doublets from the text, and blocking all non-matching words with an asterisk. If your list of doublets contains no identifiers, the scrubbed output should be perfectly de-identified. Though perfection can never be guaranteed, I have never encountered any "missed" identifiers in a text that was parsed under these conditions.
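
    The doublet method described above can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the distributed scrubber: the function name `scrub`, the tiny `approved` list, and the sample sentence are hypothetical, and this simple version lowercases the text and discards punctuation.

```python
import re


def scrub(text, approved_doublets):
    """Keep words that belong to an approved doublet; block the rest.

    A word survives if it forms an approved doublet with the word
    before it or the word after it. Every other word becomes "*".
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    keep = [False] * len(words)
    for i in range(len(words) - 1):
        if f"{words[i]} {words[i + 1]}" in approved_doublets:
            keep[i] = True
            keep[i + 1] = True
    return " ".join(word if kept else "*" for word, kept in zip(words, keep))


# Hypothetical identifier-free doublet list (a real list would be far larger).
approved = {
    "was diagnosed",
    "diagnosed with",
    "adenocarcinoma of",
    "of the",
    "the colon",
}

text = "John Smith was diagnosed with adenocarcinoma of the colon"
scrubbed = scrub(text, approved)
print(scrubbed)  # * * was diagnosed with adenocarcinoma of the colon
```

    Because a Python set gives constant-time membership tests, the work grows linearly with the length of the text, which fits the small and fast character of the method.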

    The doublet scrubber is small (just a few dozen lines of code) and fast. It took approximately 2 seconds to parse the 15000 citations using a Perl script with access to a list of about 200,000 identifier-free doublets. I used my home computer (2.8 GHz, 512 MByte RAM). This is a scrubbing rate of about 1 MByte per second. At this speed, a 1 GByte file could be parsed in about 17 minutes, and a 1 Terabyte file in about 12 days. Large hospitals produce about 1 Terabyte of data each week, so a pair of modest desktop computers sharing the load can, for now, "keep up" with the vast load of data produced by many hospitals.
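
    The throughput estimates can be checked with a short calculation. The sketch below assumes decimal units (1 GByte = 10^9 bytes, 1 Terabyte = 10^12 bytes) and a sustained rate of exactly 1 MByte per second. The names are chosen for this example only.

```python
RATE_BYTES_PER_SEC = 1_000_000  # assumed sustained rate: 1 MByte per second
SECONDS_PER_DAY = 86_400


def parse_time_seconds(size_bytes, rate=RATE_BYTES_PER_SEC):
    """Time needed to scrub size_bytes at a constant rate."""
    return size_bytes / rate


gigabyte_minutes = parse_time_seconds(10**9) / 60
terabyte_days = parse_time_seconds(10**12) / SECONDS_PER_DAY

print(f"1 GByte:    {gigabyte_minutes:.1f} minutes")
print(f"1 Terabyte: {terabyte_days:.1f} days")
```

    Under these assumptions the figures come out to roughly 17 minutes per gigabyte and a little under 12 days per terabyte on a single machine.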

    The only limitation that I have found for the doublet scrubber is that it scrubs too much, blocking all doublets not found in the external doublet list. You can be the judge by studying the scrubbed output files:

    Over 15000 PubMed Citations, scrubbed

    Anomalies and Curiosities of Medicine (scrubbed output)

  • Medical Autocoding: A Ruby medical autocoder, a Perl medical autocoder, and a Python medical autocoder.

  • Prime number generation: A Ruby prime number generator, a Perl prime number generator, and a Python prime number generator.

  • A phylogeny annotator for taxonomy.dat, the European Bioinformatics Institute's 100+ Mbyte list of species, in Ruby, Perl, and Python.

  • Perl script for moving files between (sub)directories

  • A MeSH Tree (more precisely, MeSH ontology) climber in Perl, which produces a full listing of the term relationships for every entry in MeSH (the U.S. National Library of Medicine's Medical Subject Headings).

  • Combined Autocoded and Scrubbed output for 95,260 PubMed Citations (computed in under a minute)

  • Berman JJ. Nomenclature-based data retrieval without prior annotation: facilitating biomedical data integration with fast doublet matching.
    In Silico Biology 5, 0029 (2005)

  • Over 12,000 biomedical abbreviations.

  • Chronology of Earth (important events in the history of science, medicine, informatics, and society).

  • Common misspellings in medicine and pathology.

  • Confusing terms in medicine and pathology.

  • Specified Life Blog: Devoted to the topic of data specification (including data organization, data description, data retrieval and data sharing) in the life sciences and in medicine.

  • Perl script that extracts classified neoplasm terms from OMIM (Online Mendelian Inheritance in Man). Described in Chapter 19 of Perl Programming for Medicine and Biology.

  • SEER and CDC documents

  • Ruby Scripts
    Last modified: May 28, 2014
    Tags: informatics, perl programming, ruby programming, perl scripts, ruby scripts
