
Lists from Biomedical Informatics, by Jules J. Berman, Ph.D., M.D.

These are the lists included in my book, Biomedical Informatics. The purpose of this page is to provide a sampling of the subjects covered in the book's narrative sections.

   List 0.0.0. What every determined reader will learn from this book.

1   How to acquire and organize biomedical data even when the
      data is received in the form of unstructured text
2   How to merge and share biomedical data even when the data 
      is confidential or comes from seemingly incompatible 
3   How to write your own programs in Perl that will allow you
      to perform common informatics tasks with just a few lines
      of code
4   How to automatically index biomedical text and code text
      using freely available biological and medical nomenclatures
5   How to use metadata to provide structure and meaning to 
      biomedical datasets
6   How to use confidential medical data while obeying current
      law and protecting patients
7   How to reduce the complexity of biomedical data and
      biomedical software
8   How to evaluate ethical problems related to intellectual
      property, privacy, human subjects research (see Glossary),
      data sharing (see Glossary), and software development.

   List 0.0.1. People who will benefit from reading this book.

1   Bioinformaticians
2   Biomedical scientists
3   Clinical trialists
4   Computer scientists who need cross-over skills in the
      biomedical sciences
5   Government officials at any of the health-related federal
6   Healthcare graduate students and professionals who use large
      biomedical datasets, who need to have data/software
      interoperability, or who need to comply with federal, state,
      or institutional data requirements
7   Hospital staff, including medical students, physicians,
      nurses, technicians, hospital administrators, information
8   Lawyers who handle intellectual property (see Glossary)
      cases related to biomedicine
9   Library scientists
10  Medical ethicists
11  Medical software developers and vendors
12  Medical transcriptionists
13  Members of IRBs (Institutional Review Boards, see Glossary)
    and Privacy Boards (see Glossary)
14  Privacy experts who work with medical scientists

   List 1.1.1. Roles of the biomedical informatician.

1   Biologist
2   Healthcare professional
3   Lawyer
4   Software programmer
5   Computer scientist
6   Cryptographer
7   Metadata expert
8   Linguist
9   Statistician
10  Diplomat

   List 1.2.1. Pre-1955 biomedical advances resulting in increased longevity in many developed countries.

1   Antisepsis
2   Refrigeration of food
3   Standards for the hygienic preparation of food
4   Eradication of insect vectors for yellow fever and malaria
5   Potable public drinking water
6   Antibiotics effective against many bacterial infections
      including syphilis, gonorrhea and tuberculosis
7   Vaccines against smallpox and polio
8   The virtual elimination of iodine-deficiency associated
9   The near elimination of vitamin deficiency diseases
10  The marked reduction of cervical cancer in women thanks to
    cytologic screening of cervical smears
11  The prevailing blood tests and quantitative blood cell
    analyses used to monitor deviations from normal function
12  The correction of diabetic hyperglycemia with insulin
13  The introduction of radiologic imaging
14  The treatment of hypertension with a large variety of
    effective drugs
15  The recognition of the association between cigarette use and
16  The role of diet and of cigarettes in the progression of
    vascular diseases.

   List 1.2.2. Medical setbacks since 1955.

1   The global spread of AIDS
2   Diminished access to potable water in much of the world
3   The emergence of multiple antibiotic resistant strains of S.
      aureus and other previously treatable organisms
4   Increased number of cancer patients due primarily to an
      absolute increase in the number of senior citizens at
      highest risk for cancer
5   The re-emergence of tuberculosis
6   The re-emergence of insect and other vectors carrying viral
      and parasitic diseases
7   The astronomical costs of new, effective medications for
      chronic diseases, including cancer
8   High quality, long-term health care attainable for only a
      small fraction of the earth's population
9   The rising incidence of obesity and sequelae disorders,
10  The rapid geographic spread of outbreaks of new strains of
    influenza and other evolving viruses, including HIV and
    hemorrhagic fever viruses
11  The threat of destructive and pathogenic species of plants,
    insects and animals that have been introduced to new
    habitats through acts of human negligence or error
12  Weakening of the earth's ozone layer, increasing human
    exposure to ultraviolet radiation
13  The political uses of toxic agents, endemic diseases, and
    public health infrastructures.

   List 1.3.1. Immediate consequences of Semmelweis' prevention of puerperal fever deaths.

1   The medical students were opposed to being forced to wash
      their hands
2   Semmelweis' superior, Johann Klein, was likewise opposed,
      considering the clinical trial a criticism of his
3   Other obstetricians agreed that Semmelweis' measures were an
      attack on their professional conduct
4   The maternity patients were opposed as well, interpreting
      sanitary measures as a criticism of their personal

   List 1.3.2. Beliefs held by biomedical informaticians.

1   Medical progress requires the integration of biological data
      and clinical data
2   Aggregate clinical data has value beyond its use in guiding
      the treatment of individual patients
3   Researchers need methods to acquire clinical data without
      harming patients
4   To be useful, biological and clinical data need to be
      organized in a standard manner that permits seamless data
      integration (see Glossary)
5   Classifications (see Glossary) drive down the complexity of
      clinical and biological data
6   Important new testable hypotheses may derive from
      pre-existing biological and clinical datasets, but only if
      the datasets are made available to scientists
7   The primary data that supports scientific assertions should
      be made publicly available, whenever feasible
8   Data analysis is an inexpensive science, particularly if you
      know how to program.

   List 1.4.1. Some important bottlenecks in translational research.

1   Access to clinically annotated tissues collected from human
2   Access to electronic medical records and other electronic
      archives of human clinical data
3   Methods to organize data in a manner that permits the data
      to be meaningful and comparable from laboratory to
      laboratory and institution to institution
4   Methods to draw clinically valid conclusions from large
      datasets containing heterogeneous types of data (e.g.
      molecular data and clinical test data).

   List 1.5.1. Basic skills and activities in biomedical informatics.

1   An understanding of a computer's file and subdirectory
2   The ability to download, install and use popular software
      applications and utilities
3   An awareness of the differences between structured and
      unstructured data
4   Basic understanding of XML (see Glossary) and metadata
5   Basic appreciation of computer algorithms
6   Some familiarity with data privacy rules and how these rules
      relate to the research uses of medical data. Most countries
      have such privacy regulations for biomedical data. In the
      U.S., this would be the HIPAA privacy rules, and in the
      United Kingdom, it would be the Data Protection Act
7   A general understanding of concepts of medical record
8   Familiarity with the publicly available biological search
      engines, databases and tools, including PubMed and

   List 1.5.2. Advanced skills and activities in biomedical informatics.

1   Programming at a moderate level in at least one programming
2   Experience choosing and implementing a laboratory or
      hospital information system
3   Knowledge of regulations pertaining to the use of identified
      medical data in research
4   Participation in an effort seeking FDA approval for a device
      or technology developed from a biomedical informatics effort
5   Participation on a standards committee
6   Intermediate level understanding of XML
7   Basic understanding of RDF (see Glossary)
8   Experience as a member of an IRB or Privacy Board
9   Competing for funding for a biomedical informatics grant
      (see Glossary) or contract.

   List 1.7.1. Steps in gold mining (or data mining).

1   Physical access to mine
2   Legal rights of access to mine
3   Acquire tools to find desired items in mine
4   Acquire tools to extract desired items in mine
5   Acquire tools to refine desired items
6   Acquire tools to certify the purity and quantity of desired
7   Transform the desired items into a standard format
8   Transport the desired items to an intended recipient
9   Arrange payment for the desired items
10  Store the desired items
11  Protect the stored items.

   List 1.8.1. Realistic uses of Biomedical Informatics.

1   Store, share, search, retrieve and analyze heterogeneous
      data sources. This entire process is vastly enhanced by our
      current ability to send any type of data anywhere at
      anytime, cheaply
2   Create large comprehensive databases (millions of cases)
      that allow you to ask questions that could not be asked of
      small or non-comprehensive databases
3   Drive down the complexity of biomedical data by using data
      specifications (see Glossary) and classifications
4   Track data collected in hospital information systems and
      dispatch automatic clinical alerts when data values fall
      outside an expected range of behavior or when values violate
      the expected properties of data classes
5   Develop new hypotheses by examining and correlating
      biological and clinical observations
6   Validate new clinical tests and treatments by examining the
      correlations between test values, treatment choices and
      clinical outcomes.
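
Item 4 of this list (automatic alerts for out-of-range values) can be sketched in a few lines of Python. This is a minimal illustration and not a clinical tool: the test names and numeric ranges in EXPECTED_RANGES are hypothetical placeholders, and check_value is a name invented for this example.

```python
# Hypothetical expected ranges, for illustration only (not clinical reference values).
EXPECTED_RANGES = {
    "potassium_mmol_per_l": (3.5, 5.0),
    "hemoglobin_g_per_dl": (12.0, 17.5),
}

def check_value(test_name, value):
    """Return an alert message, or None if the value looks acceptable."""
    if test_name not in EXPECTED_RANGES:
        return f"ALERT: unknown test '{test_name}'"
    # A non-numeric value violates the expected properties of the data class.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return f"ALERT: {test_name} has non-numeric value {value!r}"
    low, high = EXPECTED_RANGES[test_name]
    if not low <= value <= high:
        return f"ALERT: {test_name}={value} outside expected range {low}-{high}"
    return None

# Made-up incoming results: one normal, one out of range, one of the wrong type.
incoming = [
    ("potassium_mmol_per_l", 4.1),
    ("potassium_mmol_per_l", 6.8),
    ("hemoglobin_g_per_dl", "pending"),
]
alerts = [msg for name, value in incoming if (msg := check_value(name, value))]
for msg in alerts:
    print(msg)
```

The range check covers values that fall "outside an expected range", and the type check is one simple reading of values that "violate the expected properties of data classes".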

   List 1.8.2. Unrealistic uses of Biomedical Informatics.

1   Replace physicians with computers. Doctors are trained to
      make diagnoses, and they don't desire or use software that
      purports to do this for them
2   Create superdoctors through the use of computer tools. The
      practice of medicine is learned through personal
      experiences. Doctors do not need simulations of reality
3   Vastly improve upon books and traditional teaching
      strategies. Books are an adequate method of conveying
      knowledge. Computers can certainly provide some improvements
      to book learning, but there is no reason to think that a
      system of learning based on printed literature, that works
      perfectly fine, can be vastly improved
4   Solve subtle or complex problems via the use of medical
      ontologies (see Glossary). Complex systems are inherently
      chaotic, and inferences reached through a logical ontology
      modeling a complex system are likely to be misleading
5   Create, within the next decade, comprehensive medical
      records for all U.S. citizens that can be accessed and
      annotated by all authorized care-givers. This holy grail of
      U.S. medical informatics is a worthy long-term pursuit, but
      there is no reason to expect that it can be achieved within
      a decade or even two decades.

   List 2.2.1. When are databases particularly useful?

1   When the stored data is complex (e.g., hospitals and
      academic centers)
2   When the basic data structure is constant (i.e., when the
      model of the data records does not change)
3   When there are continuous real-time additions, deletions and
      modifications of records by multiple users
4   When the computer staff (responsible for the data) prefers
      to work with databases.

   List 2.2.2. When are data files particularly useful?

1   When the dataset is relatively stable
2   When the data structure is relatively unstable (i.e., when
      the fundamental model of the data records changes)
3   When XML is the native method of data representation
4   When the computer staff (responsible for the data) prefers
      to work with data files.

   List 2.3.1. Three properties of reality relevant to hospital databases.

1   Database records can be designed in such a manner as to
      corrupt the integrity of the database
2   Databases do not care if their integrity is corrupted
3   Modifications to the basic structures of database records
      almost always have negative (sometimes catastrophic)

   List 2.3.2. Common weaknesses of some hospital databases.

1   Inability to guarantee that every patient is uniquely
      identified within the database
2   Inability to classify types of data into groups with shared
3   Inability to extend data records to include data elements
      linked to other databases
4   Inability to organize data as simple collections of
      meaningful statements
5   Inability to produce self-describing data records (i.e.,
      including in data records all the data necessary to fully
      describe the meaning of the data record)

   List 2.3.3. Desiderata for hospital information systems.

1   Every patient must be uniquely identified within the system
2   Every report must be uniquely identified and associated with
      one patient
3   Data items contained in reports must be entered into reports
      once only
4   Data items must be well-defined and used in a consistent
      manner throughout the system
5   Data values must be bound to a unique identifier (see
      Glossary) and associated with a unique report
6   All data entered should be technically retrievable
7   Someone must have the authority to retrieve any and all data
      in the hospital information system
8   Data, once entered, should not be corrected or modified in
      any way without creating a visible transaction record of the
9   All electronic data related to an electronic record in the
      hospital information system should be included in the
      hospital information system.
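
Desiderata 5 and 8 (data values bound to unique identifiers, and corrections that leave a visible transaction record) can be illustrated with a small Python sketch. The class name DataItem, its fields, and the sample values are all hypothetical; a real hospital information system would persist these records in protected storage rather than in memory.

```python
import uuid
from datetime import datetime, timezone

class DataItem:
    """A data value bound to a unique identifier, one patient, and one report."""

    def __init__(self, patient_id, report_id, name, value):
        self.item_id = str(uuid.uuid4())   # unique identifier for this data value
        self.patient_id = patient_id
        self.report_id = report_id
        self.name = name
        self.value = value
        self.history = []                  # visible record of every correction

    def correct(self, new_value, user, reason):
        """Change the value, keeping the old value in the transaction record."""
        self.history.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "reason": reason,
            "old_value": self.value,
            "new_value": new_value,
        })
        self.value = new_value

# Made-up example: a transcription error is corrected, and the change stays visible.
item = DataItem("patient-001", "report-123", "tumor_size_cm", 21.0)
item.correct(2.1, user="jdoe", reason="decimal point omitted at entry")
print(item.value, item.history[0]["old_value"])
```

Keeping the old value in the history list, rather than overwriting it, is what makes the correction auditable.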

   List 2.11.1. General classes of patents.

1   Utilities - new and useful methods, machines, items, or
      chemical compounds
2   Designs - a new appearance for a manufactured article
3   Plants - the invention or discovery of a plant variety that
      can be asexually reproduced

   List 2.12.1. Copyright Act of 1976, Title 17, U.S. Code, section 107. Limitations on exclusive rights: Fair use.

1   Notwithstanding the provisions of sections 106 and 106A, the
      fair use of a copyrighted work, including such use by
      reproduction in copies or phonorecords or by any other means
      specified by that section, for purposes such as criticism,
      comment, news reporting, teaching (including multiple copies
      for classroom use), scholarship, or research, is not an
      infringement of copyright. In determining whether the use
      made of a work in any particular case is a fair use the
      factors to be considered shall include -
2   (1) the purpose and character of the use, including whether
      such use is of a commercial nature or is for nonprofit
      educational purposes;
3   (2) the nature of the copyrighted work;
4   (3) the amount and substantiality of the portion used in
      relation to the copyrighted work as a whole; and
5   (4) the effect of the use upon the potential market for or
      value of the copyrighted work
6   The fact that a work is unpublished shall not itself bar a
      finding of fair use if such finding is made upon
      consideration of all the above factors.

   List 2.15.1. Tissues that are routinely destroyed by pathology departments.

1   Institutions regularly dispose of tissues removed during
      surgical procedures. When a large specimen, such as a colon,
      is received in a pathology department, samples are routinely
      embedded in paraffin and saved for at least 5 years. The
      unsampled colon (the bulk of the specimen) is saved for
      several weeks, sufficient time to ensure that the
      pathologist has rendered a final diagnosis on the specimen,
      and then the specimen is discarded
2   Institutions regularly dispose of archived paraffin-embedded
      tissues. Most institutions archive paraffin-embedded tissues
      for at least 5 years. At that time, some medical centers
      conclude that the tissues are no longer of any importance to
      the patient. To avoid the expense of continued storage, some
      institutions simply dispose of archived material after 5

   List 2.15.2. Questions that institutions should ask before transferring tissues and medical records to an external tissue repository.

1   Would the transfer to a third party constitute a sale of
      human tissue?
2   Would the transfer to a third party harm any of the patients
      from whom the tissue was excised?
3   Would the transfer to a third party benefit society?
4   Do any of the institutional staff encouraging the transfer
      of tissues and data have relevant conflicts of interest?

   List 2.16.1. Recent developments that have enhanced access to experimental datasets.

1   Online journals that invite authors to submit data files
2   Editor policies that require the submission of data files
      supporting assertions made in manuscripts
3   Technical ease of storing large datasets on publicly
      available servers
4   Technical ease of downloading large datasets from servers
      via the internet
5   Data sharing requirements issued by biomedical funding
6   Expansion of Freedom of Information Act
7   Greater involvement of informaticians in biomedical research
8   Scientific advancements using publicly available datasets
9   Stunning power and scope of publicly available search
      engines, including Google (internet documents) and PubMed
      (medical abstracts)

   List 2.18.1. Some definitions of terms related to the open source movement.

1   Free software: The concept of free software, as popularized
      by the Free Software Foundation, refers to software that can
      be used freely, without restriction, and does not
      necessarily relate to the actual cost of the software. The
      generally acknowledged father of the free software movement
      is Richard Stallman, an MIT visionary who has led an
      energetic and unwavering campaign to create and freely
      distribute some of the most valued software applications in
      use today. The free software movement is similar to the open
      source software movement, but some of the features of free
      software (ability to modify and re-distribute software in a
      prescribed manner as discussed in the software license) are
      not always guaranteed in open source software (see List)
2   Open source - The Open Source Software movement is an
      offspring of the Free Software movement. The reason that the
      open source movement was created was, in part, to placate
      developers who wanted to sell software and felt that the term
      "free" as in "free software movement",
      would be misconstrued by prospective customers to mean that
      the developer requires no remuneration. Although a good deal
      of free software is no-cost software, the intended meaning
      of the term "free" is that the software can be
      used without restrictions. The term "open source"
      obviates the need to draw this distinction. The Open Source
      Initiative posts an open source definition and a list of
      approved open source licenses
3   Open Access - In general, open access applies to text and
      data the same way that open source applies to software. In
      general, open access biomedical data is retrievable (i.e.,
      you can find it by using a PubMed search or through a search
      engine), and once you've found it, you can download it and
      read it. There are several closely-related consensus
      statements on the meaning of open access
4   Open source software license - The Open Source Initiative
      has an approval process for open source licenses. Software
      distributed under an approved license can include a
      declaration that the software is "OSI Certified Open
      Source Software." The GNU copyleft licenses have been
      certified as open source software licenses.

   List 2.19.1. Examples of undifferentiated software.

1   Basic algorithms
2   Fundamental laws of physics, chemistry, mathematics and
3   Free, cross-platform programming languages
4   TCP/IP internet protocol
5   HTML and XML.

   List 2.19.2. Examples of undifferentiated data.

1   Human genome
2   Standards documents
3   Nomenclatures
4   Biological classification systems.

   List 2.19.3. Examples of differentiated software.

1   Programming languages with special features, such as
      easy-to-use interfaces, an integrated environment, or a
      specialized purpose
2   Neural network programs designed for specific types of data
3   Complex software designed to support commercial devices,
      such as CT-scanners
4   Most hospital information systems and laboratory information

   List 2.19.4. Examples of differentiated data.

1   Lexis/Nexis and other legal databases
2   Subscription journals
3   Codes for billable procedures
4   Science Citation Index
5   Chemical Abstracts (R) database.

   List 2.20.1. A few of the human databases that have been described in the Nucleic Acids Research Database Issue.

1   Androgen Receptor Gene Mutations Database
2   Atlas of Genetics and Cytogenetics in Oncology and
3   Atlas of Genetics and Cytogenetics in Oncology and
4   BGED - Brain Gene Expression Database
5   Cancer Chromosomes
6   Cancer gene databases
7   CGED - Cancer Gene Expression Database
8   Collagen Mutation Database
9   COSMIC - Catalogue Of Somatic Mutations In Cancer
10  Cypriot national mutation database
11  Cytokine Gene Polymorphism Database
12  Cytokine Gene Polymorphism Database
13  Cytokine Gene Polymorphism in Human Disease
14  Database of Genomic Variants
15  Database of Germline p53 Mutations
16  EICO DB - Expression-based Imprint Candidate Organiser
17  EpoDB - Erythropoiesis Database
18  ERGDB - Estrogen Responsive Genes Database
19  Gene-, system- or disease-specific databases
20  General polymorphism databases
21  GOLD.db - Genomics Of Lipid-associated Disorders
22  GRAP Mutant Databases
23  HAGR - Human Ageing Genomic Resources
24  HCAD - Human Chromosome Aberration Database
25  HemoPDB - Hematopoietic Promoter Database
26  HERVd - Human Endogenous Retrovirus database
27  HGMDr - Human Gene Mutation Database
28  HGMDr - Human Gene Mutation Database
29  HORDE - Human Olfactory Receptor Data Exploratorium
30  HPMR - Human Plasma Membrane Receptome
31  Human p53, human hprt, rodent lacI and rodent lacZ databases
32  Human PAX2 Allelic Variant Database
33  Human PAX6 Allelic Variant Database
34  IARC TP53 Database
35  Imprinted Gene Catalogue
36  IPD - Immuno Polymorphism Database
37  Lowe Syndrome Mutation Database
38  MTB - Mouse Tumor Biology Database
39  NCL Mutation Database
40  OMIM - Online Mendelian Inheritance in Man
41  Oral Cancer Gene Database
42  PTCH1 Mutation Database
43  RB1 Gene Mutation Database
44  RTCGD - Retroviral Tagged Cancer Gene Database
45  SNP500Cancer
46  SV40 Large T-Antigen Mutant Database
47  T1DBase - Type 1 Diabetes Database
48  The Autism Chromosome Rearrangement Database
49  The Lafora Database
50  The SNP Consortium database
51  TPMD - Taiwan polymorphic microsatellite marker database
52  Tumor Gene Family Databases (TGDBs)

   List 2.23.1. A record in Taxonomy.

1   ID : 50
2   PARENT ID : 49
3   RANK: genus
4   GC ID : 11
5   SCIENTIFIC NAME : Chondromyces
6   SYNONYM : Polycephalum
7   SYNONYM : Myxobotrys
8   SYNONYM : Chondromyces Berkeley and Curtis 1874
9   SYNONYM : "Polycephalum" Kalchbrenner and Cooke
10  SYNONYM : "Myxobotrys" Zukal 1896
11  MISSPELLING : Chrondromyces
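
A record of this shape can be read into a simple data structure. The Python sketch below assumes each line follows the "FIELD NAME : value" layout shown in the list and that a field such as SYNONYM may repeat; parse_record is a name invented for this example, and only some of the lines above are reproduced in the sample text.

```python
from collections import defaultdict

# Sample text copied from the list above (list numbering removed, some lines omitted).
RECORD = """\
ID : 50
PARENT ID : 49
RANK: genus
GC ID : 11
SCIENTIFIC NAME : Chondromyces
SYNONYM : Polycephalum
SYNONYM : Myxobotrys
MISSPELLING : Chrondromyces
"""

def parse_record(text):
    """Return a dict mapping each field name to the list of its values."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only, so a value may itself contain colons.
        name, value = line.split(":", 1)
        fields[name.strip()].append(value.strip())
    return dict(fields)

record = parse_record(RECORD)
print(record["SCIENTIFIC NAME"][0], "has parent ID", record["PARENT ID"][0])
print("Synonyms:", ", ".join(record["SYNONYM"]))
```

Storing every field as a list keeps repeated fields (here, the synonyms) without special cases; the PARENT ID value is what links a record to the record one level up in the hierarchy.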

   List 3.2.1. The types of human subject research risks.

1   The risk to life and health as a direct result of a medical
2   The risk of loss of database functionality
3   The risk of loss of confidentiality resulting from
      participation in a medical study
4   The risk of loss of privacy resulting from participation in
      a medical study.

   List 3.8.1. Confidentiality issues for biomedical informaticians.

1   Demonstrating to the hospital's IRB (see Glossary) that the
      chosen methodology for anonymizing or de-identifying records
      is safe and reliable
2   Demonstrating to the hospital's IRB and to the hospital's
      information officers that the anonymization and
      de-identification processes can be performed automatically,
      without giving the informatician any access to the primary
      patient record and without opening any HIS vulnerabilities
      when data is transferred out of the system.

   List 3.9.1. Exemption 4 (E4) permitting unconsented research on de-identified medical records.

   List 3.9.2. Section 164.502(f) of the HIPAA Privacy Rule -- Deceased Individuals.

1   We proposed to extend privacy protections to the protected
      health information of a deceased individual for two years
      following the date of death. During the two-year time frame,
      we proposed in the definition of ``individual'' that the
      right to control the deceased individual's protected health
      information would be held by an executor or administrator,
      or other person (e.g., next of kin) authorized under
      applicable law to act on behalf of the decedent's estate.
      The only proposed exception to this standard allowed for
      uses and disclosures of a decedent's protected health
      information for research purposes without the authorization
      of a legal representative and without the Institutional
      Review Board (IRB) or privacy board approval required (in
      proposed Sec. 164.510(j)) for most other uses and
      disclosures for research
2   In the final rule (Sec. 164.502(f)), we modify the standard
      to extend protection of protected health information about
      deceased individuals for as long as the covered entity
      maintains the information. We retain the exception for uses
      and disclosures for research purposes, now part of Sec.
      164.512(i), but also require that the covered entity take
      certain verification measures prior to release of the
      decedent's protected health information for such purposes
      (see Secs. 164.514(h) and 164.512(i)(1)(iii))
3   We remove from the definition of ``individual'' the
      provision related to deceased persons...

   List 3.10.1. Five requirements for de-identifying medical records.

1   De-identification of data fields that specifically
      characterize the patient (name, social security number,
      hospital number, address, age, etc.)
2   Free-text data scrubbing, removing identifiers from the
      textual portion of medical reports
3   Free-text data privatizing, removing any information of a
      private nature that may be contained within the report
4   Rendering the dataset ambiguous, ensuring that patients
      cannot be identified by data records containing a unique set
      of characterizing information
5   Rendering the data non-complementary, ensuring that the data
      cannot be combined with data from other databases or
      from multiple searches of the same database that can lead to
      the identification of records.
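
Requirements 1, 2 and 4 can be illustrated with a short Python sketch. It is a toy under stated assumptions, not a safe de-identification method: the field names, the sample records, and the helper functions are all made up for this example, and the text scrubber only removes identifiers it is handed verbatim (it would miss a surname used alone, a nickname, or a misspelling).

```python
import re
from collections import Counter

# Assumed names of the fields that directly characterize the patient.
IDENTIFYING_FIELDS = {"name", "ssn", "hospital_number", "address", "age"}

def strip_identifying_fields(record):
    """Requirement 1: drop fields that specifically characterize the patient."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}

def scrub_text(text, identifiers):
    """Requirement 2 (partial): blank out known identifiers found in free text."""
    for term in identifiers:
        if term:
            text = re.sub(re.escape(term), "[REMOVED]", text, flags=re.IGNORECASE)
    return text

def unique_profiles(records, keys):
    """Requirement 4 (check only): value combinations held by exactly one record."""
    counts = Counter(tuple(r.get(k) for k in keys) for r in records)
    return [profile for profile, n in counts.items() if n == 1]

# Made-up sample records.
records = [
    {"name": "Jane Doe", "ssn": "000-00-0000", "sex": "F",
     "diagnosis": "colon adenoma", "report": "Jane Doe returned for follow-up."},
    {"name": "Ann Roe", "ssn": "000-00-0001", "sex": "F",
     "diagnosis": "colon adenoma", "report": "Biopsy from Ann Roe reviewed."},
    {"name": "John Poe", "ssn": "000-00-0002", "sex": "M",
     "diagnosis": "melanoma", "report": "John Poe, SSN 000-00-0002, seen in clinic."},
]

deidentified = []
for rec in records:
    known = [rec["name"], rec["ssn"]]
    clean = strip_identifying_fields(rec)
    clean["report"] = scrub_text(clean["report"], known)
    deidentified.append(clean)

print(deidentified[2]["report"])
print(unique_profiles(deidentified, ["sex", "diagnosis"]))
```

Any profile returned by unique_profiles marks a record that is still distinguishable by its remaining fields, which is the situation requirement 4 is meant to prevent. Requirements 3 and 5 are not addressed here.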

   List 3.12.1. Some possible consequences of Common Rule violations.

1   The loss to the institution of its funding for the grant in
2   The loss to the institution of its Federal Assurance. The
      Office for Human Research Protections (OHRP) issues Assurances
      (currently called Federalwide Assurances or FWAs) to
      institutions that have in-place processes for IRB reviews of
      research and for maintaining research standards. An
      institution must have an assurance registered with OHRP in
      order to receive federal funding for human subjects research
3   An institution-wide suspension of human subject research
4   The imposition of grant-related restrictions imposed on the
      investigators (e.g. a prohibition from applying for federal
      grant funding).

   List 3.13.1. Section 1177 of the Act established civil and criminal penalties.

1   Civil Money Penalties. HHS may impose civil money penalties
      on a covered entity of $100 per failure to comply with a
      Privacy Rule requirement. Pub. L. 104-191; 42 U.S.C.
      1320d-5. That penalty may not exceed $25,000 per year for
      multiple violations of the identical Privacy Rule
      requirement in a calendar year. HHS may not impose a civil
      money penalty under specific circumstances, such as when a
      violation is due to reasonable cause and did not involve
      willful neglect and the covered entity corrected the
      violation within 30 days of when it knew or should have
      known of the violation
2   Criminal Penalties. A person who knowingly obtains or
      discloses individually identifiable health information in
      violation of HIPAA faces a fine of $50,000 and up to
      one-year imprisonment. Pub. L. 104-191; 42 U.S.C. 1320d-6.
      The criminal penalties increase to $100,000 and up to five
      years imprisonment if the wrongful conduct involves false
      pretenses, and to $250,000 and up to ten years imprisonment
      if the wrongful conduct involves the intent to sell,
      transfer, or use individually identifiable health
      information for commercial advantage, personal gain, or
      malicious harm. Criminal sanctions will be enforced by the
      Department of Justice.
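
The civil penalty arithmetic in item 1 can be worked through in Python. The sketch uses only the dollar figures quoted in this list ($100 per failure, capped at $25,000 per calendar year for the identical requirement); the function name civil_penalty is invented for the example, and the amounts in force today may differ from those quoted.

```python
PENALTY_PER_FAILURE = 100   # dollars per failure, as quoted in the list
ANNUAL_CAP = 25_000         # dollars per identical requirement per calendar year

def civil_penalty(failures):
    """Penalty for repeated failures of one identical requirement in one year."""
    if failures < 0:
        raise ValueError("failures must be non-negative")
    return min(failures * PENALTY_PER_FAILURE, ANNUAL_CAP)

for n in (10, 250, 300):
    print(n, "failures ->", civil_penalty(n), "dollars")
```

The cap is reached at 250 failures, so any further failures of the same requirement in that year add nothing to the civil penalty under the figures quoted.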

   List 3.15.1. Questions related to consent tracking that institutions must be able to answer.

1   Does each consent form have an identifier and a locator, a
      study number, and a data element indicating that the consent
      form itself was approved by an IRB?
2   If needed, could you put your hands on the physical consent
3   Does your database indicate the specific study for which
      consent was approved?
4   Was the consent form sufficiently detailed, allowing the
      patient to approve certain uses of specimens/data and
      decline other uses?
5   Is each consent tagged with tracking data?
6   Was the consent approved or declined?
7   What day was the consent signed?
8   Does the institution have a policy that applies to
      situations wherein a subject cannot provide an informed
      consent (e.g., infants, patients with dementia)?
9   If the institution has a policy of excluding certain classes
      of patient from providing informed consent, has the
      institution received approval for the policy from its IRB?
10  For children and challenged subjects, was the informed
    consent document signed by a surrogate?
11  For children and challenged subjects, how is it determined
    who may act as a surrogate, and how is the identity of the
    surrogate recorded and tracked?
12  Did the consenting subject change her mind and withdraw
    consent after consent had been approved?
13  If consent was withdrawn, what date did this occur?
14  If consent was withdrawn, was consent withdrawn for a
    particular use of a specimen/data, or for all purposes
    described by the consent document?
15  If consent was withdrawn, does the withdrawal of consent
    apply to more than one consent form?

   List 3.16.1. Advantages of unconsented medical record research.

1   Saves money and time by eliminating the tedious and
      expensive process of obtaining individual consents
2   Sometimes favored by patient advocacy organizations who see
      unconsented research as a way of expediting medical progress
      and improving the chances of survival of the patients in
      their disease constituencies
3   De-identification requirements for most unconsented patient
      record research essentially guarantee that no harm will
      come to the patient
4   De-identified unconsented databases can be shared and used
      for multiple scientific efforts. Consented databases, in
      most cases, can be used only for the purposes specified in
      the consent form
5   De-identified unconsented databases pose no particular
      threat over time to patients. Consented databases often
      contain patient identifiers and may pose a confidentiality
      and privacy threat long after the consented research is

   List 4.1.1. Examples of dealt standards

1   The permitted levels of toxic substances in foods
2   TCP/IP (Transmission Control Protocol/Internet Protocol),
      the internet specification
3   IEEE 802.11, the wireless data transfer standard
4   Longitude and latitude assignments
5   Divisions of time (days, hours, minutes and seconds)
6   Statutes governing medical privacy

   List 4.2.1. Some causes of medical errors in the field of biomedical informatics.

1   Absence of standards (for describing clinical data)
2   Inadequate terminologies
3   Poorly written text
4   Inadequate object identifiers (e.g., identifiers for names,
      tests, reports)
5   Poor interoperability of software tools
6   Poor integration of biomedical databases
7   Poor documentation (of software, of medical devices, of
8   Poor annotation (of medical encounters and transactions)
9   Inadequate data structuring (of reports)
10   Sloppy data representation.

   List 4.2.2. Purposes of data standards.

1   Enhance interoperability of software
2   Enable data integration
3   Increase the efficiency of medical services
4   Increase the speed of medical research
5   Reduce medical errors.

   List 4.3.1. Why governments may choose to avoid creating biomedical standards.

1   Private entities that use a standard may be in the best
      position to create the best possible standard
2   Private entities that use a standard may be willing to pay
      for the standards development process
3   Private entities are more likely to adopt a new standard if
      they had a part in developing the standard
4   Governments may be unwilling to accept the responsibility of
      promoting a new standard
5   Governments know that many standards are never adopted by
      the public and do not want to waste their resources on a
      standard that will be ignored
6   Governments may be reluctant to face criticism for standards
      that may adversely affect certain segments of its

   List 4.4.1. Excerpt from RICO that may be applicable to standards developers.

1   "1951. Interference with commerce by threats or
2   (a) Whoever in any way or degree obstructs, delays, or
      affects commerce or the movement of any article or commodity
      in commerce, by robbery or extortion or attempts or
      conspires so to do, or commits or threatens physical
      violence to any person or property in furtherance of a plan
      or purpose to do anything in violation of this section shall
      be fined under this title or imprisoned not more than twenty
      years, or both
3   (b) As used in this section-
4   (1) The term "robbery" means the unlawful taking
      or obtaining of personal property from the person or in the
      presence of another, against his will, by means of actual or
      threatened force, or violence, or fear of injury, immediate
      or future, to his person or property, or property in his
      custody or possession, or the person or property of a
      relative or member of his family or of anyone in his company
      at the time of the taking or obtaining
5   (2) The term "extortion" means the obtaining of
      property from another, with his consent, induced by wrongful
      use of actual or threatened force, violence, or fear, or
      under color of official right."

   List 4.4.2. Disclaimer against hidden patents within standards

1   "The attention of adopters is directed to the
      possibility that compliance with or adoption of OMG
      specifications may require use of an invention covered by
      patent rights. OMG shall not be responsible for identifying
      patents for which a license may be required by any OMG
      specification, or for conducting legal inquiries into the
      legal validity or scope of those patents that are brought to
      its attention. OMG specifications are prospective and
      advisory only. Prospective users are responsible for
      protecting themselves against liability for infringement of
      patents."

   List 4.4.3. Perceived risks of developing a new standard.

1   The standard may inadvertently contain intellectual property
      (particularly patented methods) resulting in a legal
      complaint against the creators of the standard
2   The standard may create loss of revenue or property to
      certain entities, resulting in legal actions taken against
      the creators of the standard
3   The standard may result in medical errors, resulting in
      injury to patients and subsequent legal actions taken
      against the creators of the standard
4   The standard may have been developed in a manner that
      excluded participation by an entity, resulting in a legal

   List 4.5.1. Questions that should be asked prior to developing a new standard.

1   Is there a pre-existing standard that covers the same
2   If there is a pre-existing standard, can it be enhanced or
      modified to provide a desired functionality?
3   How much will it cost to develop the standard?
4   How long will the standards development process take?
5   Will the intended beneficiaries of the standard pay for the
      standards development process?
6   Who will develop the standard? Are the selected developers
      competent to produce an adequate standard?
7   Are any of the developers conflicted?  Do they stand to
      profit if the standard is developed in a specific way?
8   Do any of the developers have proprietary software or data
      that they may wish to include in the standard?
9   Are the expected developers committed to work through the
      duration of the standards development process, and are they
      committed to providing all of the time and energy needed to
      develop the standard?
10   Will there be a mechanism whereby drafts of the standard are
      reviewed openly by the public?  Will the minutes of the
      working committee be made public? Will public comments be
      used to modify successive drafts of the standard?
11  Will the standard have dependencies on other standards? If
    so, are there intellectual property issues that must be
    resolved before development begins?  Will these issues
    require licenses or royalty agreements from the standards
    developers or the standards users?
12  Once created, is the standard likely to be adopted?  Is the
    anticipated standard easily implemented?
13  Who will be the adopters of the standard? Are the expected 
    standard adopters included in the development process for
    the standard?
14  Will the standard benefit a range of users beyond the
    standards developers?
15  What are the hazards that the standard may produce, and who
    might be hurt by the standard? In particular, will any
    entities be disadvantaged if they cannot readily adopt the 
16  Is it necessary to have the standard approved by an external
17  If so, who will pay for the extra costs of obtaining
    approval from an external standards organization?
18  Will the standard need to be continuously updated and
    modified? Is there a planned process for producing multiple
    versions of the standard?
19  Is it really important to have the standard?  Is it worth
    the effort?

   List 4.6.1. Organizations active in the field of biomedical standards.

1   ASTM, American Society for Testing and Materials
2   ANSI, American National Standards Institute (see Glossary)
3   HISB, Health Information Standards Board
4   IEEE,  Institute of Electrical and Electronics Engineers,
5   ACR/NEMA, American College of Radiology (ACR) and National
      Electrical Manufacturers Association (NEMA), which oversees
      the DICOM (Digital Imaging and Communications in Medicine)
      image standard
6   NCPDP, National Council for Prescription Drug Programs, Inc
7   NIST, National Institute of Standards and Technology
8   ISO, International Organization for Standardization
9   IEC, International Electrotechnical Commission.

   List 4.6.2. Some American National Standards programming languages.

1   Mumps (ANSI approval 1977)
2   Basic (ANSI approval 1978)
3   Ada (ANSI approval 1983)
4   C (ANSI approval 1989)
5   Common Lisp (ANSI approval 1994)
6   Ada 95 (ANSI approval 1995)
7   Smalltalk (ANSI approval 1998)
8   C++ (ANSI approval 1999).

   List 4.7.1. New and future technologies that create biomedical data.

1   Gene Expression arrays (see Glossary)
2   Proteomic arrays
3   Tissue Microarrays
4   Metabolomic arrays
5   Image morphometric arrays.

   List 4.8.1. Problems created by the introduction of new standards.

1   New classes of data object require a new standard for the
      new object class (examples: Tissue Microarray Data, Gene
      Expression Array Data)
2   New standards require new implementations
3   Existing data standards require revision
4   Revisions of existing standards require retroactive
      implementation in data records conforming to the prior
      version of the standard
5   New data standards require harmonization with other existing
      standards. Otherwise multiple standards may compete for the
      standards-based data structures and data descriptors
      applicable to data elements common to multiple standards
6   Because standards often become the intellectual property of
      the  standards development organization, new standards
      cannot include parts of standards developed by other
      organizations. This means that redundant standards may
      describe the same objects.

   List 4.9.1. Fundamental properties of a specification.

1   The object specified must be defined and distinguished from
      all other objects. (i.e., one object cannot have two
      different specifications and one specification cannot apply
      equally to two non-equivalent objects)
2   The description must be organized in a way that is
      understandable and unambiguous. (i.e., a standard method of
      describing things, in the general sense, can be used.
      Languages are standard methods of describing things, but a
      better method might employ a formal semantic logic)
3   The descriptors must be well-defined in the context of the
      specification and not confused with descriptors of the same
      name but different meaning that may appear in other
      specifications (e.g., a "date"  may be a calendar
      notation in one standard and a type of dried fruit in
      another specification)
4   The measurements and descriptor values must be well-defined
      and not confused with measurements and values of the same
      alphanumeric value but different meaning that may appear in
      other specifications. (e.g., 10 pounds is not the same as 10
5   The specification must describe itself, include information 
      pertaining to its purpose, its creator, its ownership,  any
      restrictions on its uses, and any instructions necessary to
      interpret the specification.

   List 4.9.2. Logistical advantages of specifications over standards.

1   A specification need not be developed through a standards
      development process. A specification is basically a
      descriptive document and only requires fully unambiguous
      language. An individual can create a specification that
      everyone in the world can understand and use
2   Specifications do not require approval by any federal agency
       or organization. Standards have almost no meaning unless
      they are approved. In some cases, standards are enforced by
      authority of law
3   There are usually many different ways of specifying  things.
      The same object can be described by different
      specifications. Standards tend to impose a monolithic
4   A specification is a general way of describing things and
      can be used for many different and new types of things.
      Standards are typically developed for specific items and
      cannot accommodate  new items without pursuing a development
      and approval process through a standards development
      organization. Biomedical informaticians who use research
      data will almost certainly find that existing standards will
      not keep pace with the arrival of new techniques and data 
      objects. The chair shown (see Figure) is a fully specified
      image created with Pov-Ray, a free, open source rendering
      program (see Appendix). It was created using a .pov file,
      which is a plain-text set of instructions written for the
      rendering application.

   List 4.9.3. Snippet from chair.pov rendering specification, modified from Matthias Opitz's public domain scene file.

   List 4.11.1. Parts of an LSID, from The LSID Resolution Protocol Project.

1   Network Identifier (NID)
2   root DNS name of the issuing authority
3   namespace chosen by the issuing authority
4   object id unique in that namespace and assigned locally
5   revision id for storing versioning information
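
The five parts listed above can be pulled out of an identifier with a single split. The sketch below assumes the colon-delimited layout urn:lsid:authority:namespace:object:revision, with the revision optional. The sample identifier and the subroutine name parse_lsid are invented for illustration and are not taken from the LSID project.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split an LSID into its parts. Returns a hash reference, or undef
# if the string does not have the assumed colon-delimited layout.
sub parse_lsid {
    my ($lsid) = @_;
    my @parts = split /:/, $lsid;
    return undef unless @parts == 5 || @parts == 6;
    return undef unless lc($parts[0]) eq "urn" && lc($parts[1]) eq "lsid";
    return {
        nid       => $parts[1],   # network identifier
        authority => $parts[2],   # root DNS name of the issuing authority
        namespace => $parts[3],   # namespace chosen by the authority
        object    => $parts[4],   # object id, unique in that namespace
        revision  => $parts[5],   # undef when no revision is given
    };
}

my $lsid  = "urn:lsid:example.org:specimens:12345:2";   # hypothetical LSID
my $parts = parse_lsid($lsid);
print "$_: $parts->{$_}\n" for qw(nid authority namespace object revision);
```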

   List 4.11.2. Examples of LSIDs, from The LSID Resolution Protocol Project.

1    This is the first version of the
      1AFT protein in the Protein Data Bank
2    References a
      PubMed article
3    Refers to the
      second version of an entry in GenBank

   List 4.13.1. Principles of unique object identification.

1   A unique object can be distinguished from all other unique
2   A unique object cannot be distinguished from itself
3   A class (or collection) of instances can be unique.

   List 4.13.2. Some registries that continually assign unique identifiers to requesting entities.

1   DOI, Digital Object Identifier
2   PMID, PubMed identification number
3   LSID (Life Science Identifier)
4   HL7 OID  (Health Level 7 Object Identifier)
5   DICOM (Digital Imaging and Communications in Medicine)
6   ISSN (International Standard Serial Numbers)
7   Social Security Numbers (for U.S. population)
8   NPI, National Provider Identifier, for physicians
9   Clinical Trials Protocol Registration System
10   Office of Human Research Protections FederalWide Assurance
11  Data Universal Numbering System (DUNS) number
12  DNS, Domain Name Service.

   List 4.13.3. Dependable computer systems that rely on unique object identifiers.

1   Google (relies on URLs)
2   PubMed (relies on PubMed identifiers)
3   Libraries (rely on ISSN, DOI)
4   Swiss banks (rely on unique account numbers).

   List 4.13.4. Some medical errors related to misidentification.

1   Correctly identified medication provided to incorrectly
      identified person
2   Incorrectly identified medication provided to correctly
      identified person
3   Incorrectly identified dosage of correct medication provided
      to correctly identified person
4   Blood transfusion provided to incorrectly identified person
5   Report sent to incorrectly identified physician
6   Report identified with wrong person's name
7   Bill sent to incorrectly identified person
8   Report provided with diagnosis intended for different person
9   Wrong operation performed on incorrectly identified patient
10   Incorrectly identified patient treated for another patient's

   List 4.15.1. Information deficiencies in the statement "John Smith has a blood glucose of 85".

1   No unique patient identifier (many people are named John
      Smith)
2   No unique time identifier (indicating when the test was
      performed and distinguishing the test results from other
      blood glucose values obtained from the patient at other
      times)
3   No unique test identifier (indicating the specific protocol
      used to measure blood glucose in this instance)
4   No unique identifier for the units of measurement
5   No unique report identifier (indicating that the report
      itself is a unique laboratory object that can be archived
      and retrieved)

   List 4.15.2. Three conditions for a meaningful assertion in informatics.

1   There is a specified object about which the statement is
      made. When the object is a unique object (such as a
      patient), the object must be specified in a manner that
      distinguishes the object from all other objects, and this is
      typically done with a unique object identifier
2   There is data that pertains to the specified object
3   There is metadata that describes the data (that pertains to
      the specified object.)
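
The three conditions can be made concrete by rewriting the bare statement "John Smith has a blood glucose of 85" as an identified object, its data, and the metadata describing that data. This is a sketch only. Every identifier, the units, and the timestamp are invented, and is_meaningful is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Identified object + metadata + data.
# All identifiers and values below are invented for illustration.
my %assertion = (
    object_id => "patient-77300183",          # unique patient identifier
    metadata  => {
        test_id   => "glucose-protocol-01",   # protocol used for the measurement
        units     => "mg/dL",
        timestamp => "2006-03-14T08:30:00",   # when the test was performed
        report_id => "report-000451",         # the report as an archivable object
    },
    data      => 85,
);

# Returns 1 only if all three parts of the assertion are present.
sub is_meaningful {
    my ($ref) = @_;
    return 0 unless defined($ref->{object_id}) && length($ref->{object_id});
    return 0 unless defined($ref->{data});
    return 0 unless ref($ref->{metadata}) eq "HASH" && %{ $ref->{metadata} };
    return 1;
}

print "Assertion is ",
      (is_meaningful(\%assertion) ? "complete" : "incomplete"), "\n";
```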

   List 4.15.3. Generalizable scientific statements.

1   f=ma -- Force is mass times acceleration
2   If a gas is held at constant temperature, its volume is
      inversely proportional to its pressure - Boyle's law
3   Ontogeny recapitulates phylogeny - fetal development follows
      the evolutionary path of the species (a false assertion)
4   There are 10 types of people, those who use binary notation
      and those who do not
5   (love of money) = (evil)x(evil) -- The love of money is the
      root of all evil.

   List 4.15.4. Algorithm for de-identifying with an identifier.

1   Collect data on unique object. "Joe Public has brown
      eyes."
2   Assign a unique identifier. "Joe Public has unique
      identifier, 77300183."
3   Substitute name of object with its identifier
4   Consistently use the identifier with data. "77300183
      has brown eyes."
5   Do not let anyone know that Joe Public is 77300183.
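
The steps above can be sketched in a few lines of Perl. This is a minimal illustration, not a production de-identifier. The subroutine name deidentify and the lookup table %id_for are hypothetical, and handing out identifiers sequentially is an assumption made to keep the example short (a real system would more likely use unpredictable identifiers and keep the lookup table under lock and key).

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %id_for;                 # private lookup table: name -> identifier
my $next_id = 77300183;     # starting identifier, taken from the list above

# Replace every occurrence of a name in a statement with its identifier.
sub deidentify {
    my ($name, $statement) = @_;
    $id_for{$name} = $next_id++ unless exists $id_for{$name};  # assign once
    $statement =~ s/\Q$name\E/$id_for{$name}/g;                # substitute
    return $statement;
}

print deidentify("Joe Public", "Joe Public has brown eyes."), "\n";
```

Because the identifier is assigned once and reused, later statements about the same person carry the same identifier, which is what allows de-identified records to be merged.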

   List 5.2.1. Some questions that can be answered with short program scripts.

1   Strip all the private identifiers from a medical record
2   Find all the surgical procedures included in the dataset of
      surgical post-op notes, and annotate each procedure with its
      frequency of occurrence in the dataset
3   Index a book with the page location of all terms that are
      names of diseases
4   Find all the palindromes in a gene sequence database and
      arrange them by frequency of occurrence
5   Find the most commonly occurring octamer sequence in the
      human genome database
6   Find all octamers that occur only once in the human genome
7   Rank sequences from a gene expression array experiment based
      on levels of over-expression
8   From a patient database, find the diseases that have a
      chronologic relationship with another condition (e.g.
      chicken pox never occurs after shingles)
9   Find all tumors associated with a gene fusion mutation
10   Collect 100 histopathologic images of liver disease from the

   List 5.2.2. The three programming tricks in medical informatics.

1   File parsing (opening a file and examining the contents of
      the file, one line at a time)
2   Pattern matching (finding a fragment of parsed text that
      matches a word, a phrase or a character pattern of interest)
3   Assigning data structures to hold numbers or textual data
      that can be operated on, with outputs placed in an external
      file)

   List 5.2.3. Pseudocode to collect all the lines from a file that contain the phrase "biomedical informatics".

1   Open a file for reading. (Verbose equivalent: Get a file
      from the hard drive that has a particular name and prepare
      it so that the data in the file can be extracted and put
      into holders in the computer's memory.)
2   Parse the lines of the file. (Verbose equivalent: Grab
      the characters from the first line of the file and put them
      into a data holder that occupies a specific place in
      computer memory. Be prepared to repeat this for all the
      lines of the file.)
3   Collect all the lines that contain the phrase
      "biomedical informatics". (Verbose equivalent: As each
      line is placed in a holder in computer memory, determine
      whether the line contains the string "biomedical
      informatics" and, if it does, add the held data to a
      structure called an array, which can hold many character
      strings, in sequence.)
4   When the file is exhausted, empty all the matching lines
      into an external file, opened for writing, named
      "output.txt". (Verbose equivalent: At the end of
      the file parsing loop, take the array structure and
      transfer all the character strings from the array, in
      sequence, into a newly created file that has been prepared
      to accept data.)
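
One way the four steps might look in Perl is sketched below. To keep the example self-contained, it reads from and writes to in-memory filehandles. The sample text and the subroutine name collect_lines are invented. To work with real files, the references \$sample and \$collected would be replaced with file names such as "output.txt".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy every line containing "biomedical informatics" from one
# filehandle to another; returns the number of lines copied.
sub collect_lines {
    my ($in, $out) = @_;
    my @matches;                        # array that holds the matching lines
    while (my $line = <$in>) {          # parse the file, one line at a time
        push @matches, $line if $line =~ /biomedical informatics/;
    }
    print {$out} @matches;              # empty the array into the output
    return scalar @matches;
}

# Sample text standing in for a real file (invented for this sketch).
my $sample = "Perl is popular in biomedical informatics.\n"
           . "This line is about something else.\n"
           . "Many biomedical informatics tasks are short scripts.\n";

open(my $in, "<", \$sample) || die "Can't open input";
my $collected = "";
open(my $out, ">", \$collected) || die "Can't open output";
my $count = collect_lines($in, $out);
close($in);
close($out);

print "Collected $count lines:\n$collected";
```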

   List 5.2.4. Reasons to program in Perl.

1   Perl can be obtained at no cost
2   Perl is available for virtually every operating system and
      comes bundled into Unix and Linux distributions
3   Perl is extremely popular among bioinformaticians
4   It takes just a few hours to learn enough Perl to write your
      own biomedical informatics programs
5   Perl programs tend to be much shorter and easier to
      understand than programs written in C or Java
6   A Perl script written for your computer will probably work
      on any other computer loaded with a Perl interpreter, even
      if the other computer has a different operating system
7   Unlike C and C++, Perl comes with native pattern matching
      commands (so-called regular expressions), which are used in
      virtually every program in the field of biomedical
      informatics
8   There are many thousands of freely available Perl tools that
      perform a wide range of useful operations that can extend
      the functionality of your own programs
9   Perl code can be written in a manner that looks much like
      simple narrative text (if you make the effort), making it
      easy for others to read
10   Once you've learned Perl, you can migrate to almost any
      other programming language with ease.

   List 5.5.1. Contents of typical flat-file, "taxo.txt" extracted from "Taxonomy".

1   SYNONYM        : Bacillus aegyptius
2   SYNONYM        : Haemophilus aegyptius
3   SYNONYM        : Hemophilus conjunctivitidis
4   SYNONYM        : Haemophilus influenzae aegyptius
5   SYNONYM        : Bacillus conjunctivitidis
6   SYNONYM        : Bacterium aegyptiacum
7   SYNONYM        : Bacterium conjunctivitis
8   SYNONYM        : Bacterium pseudo conjunctivitidis

   List 5.5.2. Perl script to open a file and read a file.

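     #!/usr/bin/perl
     #opens taxo.txt and prints it, one line at a time
     open(FILE, "taxo.txt") || die "Can't open file";
     $line = " ";
     while ($line ne "")   #the read returns nothing at the end of the file
       {
       $line = <FILE>;
       print $line;
       }
     close(FILE);
     exit;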

   List 5.5.3. Output of

1   C:\ftp>perl
2   SYNONYM        : Bacillus aegyptius
3   SYNONYM        : Haemophilus aegyptius
4   SYNONYM        : Hemophilus conjunctivitidis
5   SYNONYM        : Haemophilus influenzae aegyptius
6   SYNONYM        : Bacillus conjunctivitidis
7   SYNONYM        : Bacterium aegyptiacum
8   SYNONYM        : Bacterium conjunctivitis
9   SYNONYM        : Bacterium pseudo conjunctivitidis

   List 5.10.1. A ridiculously short text editor, in Perl.

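     #!/usr/bin/perl
     #copies typed lines into two files until a blank line is entered
     open (OUT, ">>mycumu.txt")||die"Can't open file";  #cumulative file
     open (NEW, ">mynew.txt")||die"Can't open file";    #current-session file
     $line = " ";
     until ($line eq "\n")
       {
       $line = <STDIN>;
       print OUT $line;
       print NEW $line;
       }
     exit;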

   List 5.10.2. Until loop in Perl.

          $line = " ";
          until ($line eq "\n")  #loop stops when all you've entered is
                                 #the return key
            {
            $line = <STDIN>;  #waits for the next line of input
            print OUT $line;  #appends to the cumulative file
            print NEW $line;  #writes to the current script-session file
            }

   List 5.11.1. Common errors in Perl scripts.

1   Perl blocks must be balanced with curly brackets. Every 
      block (e.g., while, if, for, unless, foreach) must have a
      beginning curly bracket,"{" and a balanced closing
      curly bracket, "}". This can become hairy in
      scripts that have multi-nested blocks
2   Command lines must end with a semicolon
3   String variables must be pre-pended with a "$", 
      as in, $date
4   Spelling counts in scripts. Perl cannot interpret a
      misspelled command or variable
5   An uppercase character has a different ascii value than its
      lowercase equivalent. With few exceptions, you will find it
      useful to maintain case consistency in Perl scripts
6   Characters that serve as reserved Perl symbols must be 
      backslashed if they are used as string characters. For
      example, use \. \/ \\ \$  if you want to use ./\$ as 
      characters. There are exceptions to this rule: \n, \d, and \w
      are reserved symbols and never refer to the letters n, d, and
      w. The strange and non-intuitive use of backslashes in Perl takes
      some mental adjustment and accounts for the "leaning
      toothpick syndrome" in Perl scripts. Complex regular
      expressions often resemble toothpicks tossed amidst string
7   Certain operations must be enclosed by parentheses (e.g.,
      if (1 == 2), not if 1 == 2)
8   The "=" operator assigns a value and does not
      test for equality. To test for equality, use "=="
      if you are comparing two numbers and use "eq" if
      you are comparing two strings. Remember that string
      comparison operators (eq, ne, lt, gt) are different from
      number  comparison operators (==, >, <)
9   Using an "=" operator when you really want to use
      the regex comparison operator, "=~".
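
The last two items are easy to get wrong, so a short sketch of the four operators side by side may help. The variable names and values are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $count = 10;        # "=" assigns a value
my $name  = "perl";

my $numbers_equal = ($count == 10)    ? "yes" : "no";   # numeric comparison
my $strings_equal = ($name eq "perl") ? "yes" : "no";   # string comparison
my $pattern_found = ($name =~ /^pe/)  ? "yes" : "no";   # regex match

# The same two values can be equal as numbers but different as strings.
my $as_numbers = ("10.0" == 10)   ? "equal" : "different";
my $as_strings = ("10.0" eq "10") ? "equal" : "different";

print "count == 10: $numbers_equal\n";
print "name eq 'perl': $strings_equal\n";
print "name =~ /^pe/: $pattern_found\n";
print "10.0 and 10 as numbers: $as_numbers, as strings: $as_strings\n";
```

Writing if ($count = 10) instead of if ($count == 10) assigns 10 to $count and always evaluates as true, which is why the mistake can go unnoticed.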

   List 5.11.2. Summary of the first Perl programming section.

1   Perl scripts are simple text files. Perl scripts should be
      named using the .pl extension. Perl is a quintessential
      command-line language. At the command prompt, run your
      scripts by typing perl, then the name of the script, then
      the return-key (on some systems, you needn't include the
      name perl)
2   Perl scripts start off with a header line
3   Perl commands end with a semicolon
4   Perl blocks are delineated by curly brackets ({ })
5   You can assign strings to variables by using the assignment
      operator, "="
6   You can read, write or append to files using the
      "open" command

   List 5.12.1. Pseudocode that outlines the general construction of a Perl script.

     header (shebang) line;
     input something;
     if (something evaluates to true)
       {
       do something;
       for or while (some condition)
          {
          do something;
          }
       do something;
       do something;
       }
     for or while (some condition)
       {
       do something;
       if (something evaluates to true)
          {
          do something;
          do something;
          do something;
          }
       output something;
       }
     exit;

   List 5.14.1. Perl script

     #!/usr/bin/perl
     #This script lets you page through enormous files,
     #20 lines at a time, with no file load time
     print "What file do you want to read?";
     $filename = <STDIN>;
     chomp($filename);
     open (TEXT, $filename)||die"Can't open file";
     $line = " ";
     while ($line ne "")   #while $line is not equal to empty
       {
       for ($count = 1; $count <= 20; $count++)
         {
         $line = <TEXT>;
         print $line;
         }
       print "Type QUIT if you want to quit. Otherwise press any key\n";
       $response = <STDIN>;
       if ($response =~ /QUIT/i)
         {
         last;
         }
       }
     exit;

   List 5.14.2. Output of File Reader.

1   C:\ftp>perl
2   What file do you want to read?e:\omim.txt
3   *RECORD*
4   *FIELD* NO
      100050
      *FIELD* TI
      100050 AARSKOG SYNDROME
      *FIELD* TX
      Grier et al. (1983) reported father and 2 sons with typical Aarskog
      syndrome, including short stature, hypertelorism, and shawl scrotum.
7   sons and that this suggested autosomal dominant inheritance.
8   the mother seemed less severely affected, compatible with
9   Type QUIT if you want to quit. Otherwise press any key

   List 5.14.3. Summary of the second Perl programming section.

1   How to open and read from files, line by line
2   How to prompt a user for input
3   Looping using for() and while()
4   Evaluating if() blocks
5   Simple pattern matching

   List 5.15.1. Things you can do with a one-line Regular expression.

1   Collect the lines from a file that contain a specific word,
      phrase or number
2   Collect the lines from a file that contain any desired
      combination of the above
3   Substitute any alphanumeric character string for any other,
      for the entire file

   List 5.16.1. Using the match operator with regular expressions.

     for all the lines of a given file
       {
       put the next line from the file into some variable;
       check the line to see if it matches your regular expression;
       if the line matches the regular expression
          {
          do something with it, like put it into another file;
          or do an operation on the matching value;
          }
       }
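
A minimal Perl rendering of this outline follows. The sample lines imitate the taxo.txt format shown earlier and are read from an in-memory filehandle so that the sketch runs on its own. The pattern and the array name @genera are chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines standing in for a file (same layout as taxo.txt).
my $text = "SYNONYM        : Bacillus aegyptius\n"
         . "SYNONYM        : Haemophilus aegyptius\n"
         . "SYNONYM        : Bacterium conjunctivitis\n";

open(my $fh, "<", \$text) || die "Can't open text";
my @genera;
while (my $line = <$fh>) {                    # for all the lines of the file
    if ($line =~ /^SYNONYM\s*:\s*(\w+)/) {    # does the line match?
        push @genera, $1;                     # operate on the matching value
    }
}
close($fh);

print join(", ", @genera), "\n";
```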

   List 5.16.2. Using the substitution operator with regular expressions.

     for all the lines of a given file
       {
       put the next line from the file into some variable;
       do a substitution on all of the parts of the line that match
       your regular expression;
       do something with the revised line, like rearranging it
       and then putting the rearranged line into another file;
       }
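
The substitution outline can be sketched the same way. Here each line has one spelling replaced by another and the revised lines are kept in an array. The sample text and the choice of spellings are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines standing in for a file (invented for this sketch).
my $text = "Haemophilus aegyptius was isolated.\n"
         . "The isolate of Haemophilus aegyptius grew slowly.\n";

open(my $fh, "<", \$text) || die "Can't open text";
my @revised;
while (my $line = <$fh>) {                  # for all the lines of the file
    $line =~ s/Haemophilus/Hemophilus/g;    # substitute every match on the line
    push @revised, $line;                   # keep the revised line
}
close($fh);

print @revised;
```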

   List 5.17.1. Pattern match options.

1   g     Match globally, (find all occurrences)
2   i     Do case-insensitive pattern matching
3   m     Treat string as multiple lines
4   o     Compile pattern only once
5   s     Treat string as single line
6   x     Use extended regular expressions
7   ^     Match the beginning of the line
8   .     Match any character (except newline)
9   $     Match the end of the line (or before newline at
      the end)
10  |     Alternation
11  ()    Grouping
12  []    Character class
13  *     Match 0 or more times
14  +     Match 1 or more times
15  ?     Match 1 or 0 times
16  {n}   Match exactly n times
17  {n,}  Match at least n times
18  {n,m} Match at least n but not more than m times
19  \n    newline(LF, NL)
20  \W    Match a non-word character
21  \s    Match a whitespace character
22  \S    Match a non-whitespace character
23  \d    Match a digit character
24  \D    Match a non-digit character.
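
A few of these options in combination are sketched below. The sample sentence is invented, and the patterns are only examples of how the modifiers, anchors, and quantifiers can be put together.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Gene BRCA1 and gene brca2; sample 17 of 204.";   # invented text

my @genes   = ($text =~ /\bgene\s+(\w+)/gi);   # g: all matches, i: ignore case
my @numbers = ($text =~ /\b(\d{2,3})\b/g);     # {n,m}: two or three digits
my $starts  = ($text =~ /^Gene/)   ? 1 : 0;    # ^ anchors at the beginning
my $ends    = ($text =~ /\d+\.$/)  ? 1 : 0;    # $ anchors at the end

print "genes: @genes\n";
print "numbers: @numbers\n";
print "starts with Gene: $starts, ends with a number and period: $ends\n";
```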

   List 5.17.2. Perl script, which creates a file wherein each new sentence begins on a new line.

     #!/usr/local/bin/perl
     open (TEXT, "1DFRE10.TXT")||die"Can't open file";
     open (OUT,">1DFRE10.OUT")||die"Can't open file";
     undef($/);                  #read the whole file as one string
     $string = <TEXT>;
     $string =~ s/[\n]+/ /g;     #replace line breaks with spaces
     $string =~ s/([^A-Z]+\.[ ]{1,2})([A-Z])/$1\n$2/g;  #newline after each sentence
     print OUT $string;
     exit;

   List 5.18.1. A Perl script for removing periods that do not delineate sentences.

     #!/usr/bin/perl
     #replaces periods with *, except when period marks end of sentence
     $k = "Mr. P.I.N. Ph.D. M.D. 0.3 .4 5. end_of_sentence. Hello";
     $firstvalue = $k;
     $k =~ s/\b([ \w\d]*)\.+(?=[\w\d]*)(?!  [A-Z])/$1\*$2/g;
     print "$firstvalue =>\n$k";
     exit;

   List 5.18.2. Output of

1   C:\ftp>perl
2        Mr. Dr. P.I.N. Ph.D. M.D. 0.3 .4 5.
      end_of_sentence. Hello =>
3   Mr* Dr* P*I*N* Ph*D* M*D* 0*3 *4 5*  4*6*7*8*9
      end_of_sentence. Hello
4        C:\ftp>

   List 5.18.3. Regex (regular expression) substitution examples.

1   $string =~ s/^ +//o; Removes leading spaces from a character
      string
2   $string =~ s/ +$//o; Removes trailing spaces from a
      character string
3   $string =~ s/ +/ /g; Changes all sequences of one or more
      spaces to just a single space
4   $string =~ s/\n//g; Gets rid of newline (sometimes called
      linebreak) characters in your string
5        $string =~ s/\b(\w+\.[ ]{1,2})([A-Z])/$1\n$2/g;
6   This finds the most common sentence delimiter (the end of a
      word, followed by a period, followed by one or two spaces,
      followed by an uppercase letter) and substitutes a newline
      character so that each new sentence begins on a new line
7   $string =~ tr/A-Z/a-z/ Every uppercase letter is converted
      to a lowercase letter using the translate operator
      (tr/a-z/A-Z/ does the opposite)
8   $string = lc($string) Every uppercase letter is
      converted to a lowercase letter using the lc operator
      (uc($string) does the opposite)
9   $string =~ s/\b([A-Z0-9])\.[ \n]/$1\*/g; replaces the period
      after a stand-alone single uppercase letter or digit with an
      asterisk, so that it is not mistaken for a sentence break
10  $string =~ s/\<[^\<]+\>/ /g; removes angle-bracketed
      expressions, such as HTML or XML markup
11  $string =~ s/^([^ ]*) *([^ ]*)/$2 $1/; The first word in a
    string is switched with the second word
12  $string =~ s/\b(.+)(\@)(.+)\b/email/g; replaces email
    addresses with the word "email."
13  $string =~ s/\bhttp\:(.+)\b/webURL/ig; replaces http
    addresses with the word webURL
14  $string =~ tr/0-9a-zA-Z.\n' \-\)\(/ /c; replaces with a
    space everything that is not a letter, number, period,
    line break, apostrophe, space, hyphen, or parenthesis.

   List 5.19.1. Perl script, which counts the words in a file in 5 commands.

1        #!/usr/local/bin/perl
2        open (TEXT, "1DFRE10.TXT");
3        undef($/);
4        $all_text = <TEXT>;
5        @wordarray = split(/[\n\s]+/, $all_text);
6        print scalar(@wordarray);
7        exit;

   List 5.20.1. The Zipf distribution of the prior paragraph.

1   c:\ftp>perl
2   00007 of
3   00005 a
4   00004 the
5   00003 words
6   00003 is
7   00003 in
8   00002 zipf
9   00002 text
10   00002 occurrences
11  00002 distribution
12  00001 zipf's
13  00001 way
14  00001 this
15  00001 their
16  00001 that
17  00001 small
18  00001 shown
19  00001 see
20  00001 practical
21  00001 paragraph
22  00001 order
23  00001 most
24  00001 listing
25  00001 list
26  00001 law
27  00001 interpreting
28  00001 for
29  00001 different
30  00001 descending
31  00001 any
32  00001 amount
33  00001 account

   List 5.20.2. The first ten items in the Zipf distribution of The Decline and Fall of the Roman Empire.

1   26856 the
2   18032 of
3   09136 and
4   06026 to
5   04654 a
6   04155 in
7   03170 was
8   03081 his
9   02815 by
10   02391 that

   List 5.20.3. A Perl script that creates a Zipf distribution in 6 commands.

1        #!/usr/local/bin/perl
2        open (TEXT, "1DFRE10.TXT");
3        open (OUT, ">1DFRE10.OUT");
4        undef($/);
5        $all_text = <TEXT>;
6        $all_text = lc($all_text);
7        $all_text =~ s/[^a-z\-\']/ /g;
8        @wordarray = split(/[\n\s]+/, $all_text);
9        foreach $thing (@wordarray)
10          {
11         $freq{$thing}++;
12         }
13       #The Zipf list is finished. The next lines just display
         #the distribution
14       while ((my $key, my $value) = each(%freq))
15           {
16           $value = "00000" . $value;
17           $value = substr($value,-5,5);
18           push (@termarray, "$value $key")
19           }
20       @finalarray = reverse (sort (@termarray));
21       print join("\n",@finalarray);
22       exit;
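
The script above pads each count with zeros so that a plain string sort puts the most frequent words first. As an alternative sketch (not the book's method), the counts can be sorted numerically, which avoids the five-digit limit. The subroutine name `zipf` and the sample sentence are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns a list of [word, count] pairs, most frequent word first.
sub zipf {
    my ($text) = @_;
    my %freq;
    $text = lc $text;
    $text =~ s/[^a-z\-']/ /g;      # same character filter as the script above
    $freq{$_}++ for grep { length } split /\s+/, $text;
    # Sort by descending count, then alphabetically to break ties.
    return map  { [ $_, $freq{$_} ] }
           sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;
}

# Hypothetical sample sentence.
for my $pair (zipf("The cat and the dog and the bird")) {
    printf "%05d %s\n", $pair->[1], $pair->[0];
}
```

The `printf "%05d"` format reproduces the zero-padded display of the original listing without affecting the sort order.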

   List 5.20.4. Example of an associative array, %patient_weight.

1        $patient_weight{"John Public"} = 155;
2        $patient_weight{"Mary Smith"} = 110;
3        $patient_weight{"Jules Berman"} = 195;
4        $patient_weight{"Jules Berman"}++; #evaluates to 196

   List 5.20.5. Summary of the third Perl programming section.

1   Creating and interpreting complex regular expressions
2   Looping through arrays with foreach blocks
3   Looping through associative arrays with while blocks
4   New Perl operators and commands: split(), push(), lc(),
      sort(), join(), substr(), scalar(), undef(), incrementing
      values, and concatenating strings
5   Advanced pattern substitution and substitution options

   List 5.21.1. A sample MESH record.

3   MH = Heparin
5   PRINT ENTRY = Heparinic Acid|T118|T121|T123|
6        NON|EQV|UNK (19XX)|800523|abbbcdef
7   PRINT ENTRY = alpha-Heparin|T118|T121|T123|NON|NRW|
8        UNK (19XX)|800523|abbbcdef
9   ENTRY = Liquaemin|T118|T121|T123|TRD|NRW|UNK
10   ENTRY = Sodium Heparin|T118|T121|NON|NRW|UNK
11  ENTRY = Heparin, Sodium
12  ENTRY = alpha Heparin
13  MN = D09.698.373.400
14  PA = Anticoagulants
15  PA = Fibrinolytic Agents
16  EC = antagonists & inhibitors:Heparin Antagonists
17  MH_TH = BAN (19XX)
18  ST = T118
19  ST = T121
20  ST = T123
21  N1 = Heparin
22  RN = 9005-49-6
23  MS = A highly acidic mucopolysaccharide formed of equal
    parts of sulfated D-glucosamine and D-glucuronic acid with
24  sulfaminic bridges. The molecular weight ranges from six to
    twenty thousand. Heparin occurs in and is obtained from
25  lung, mast cells, etc., of vertebrates. Its function is
    unknown, but it is used to prevent blood clotting in vivo
    and vitro, in
26  the form of many different salts
27  PM = /therapeutic use was HEPARIN, THERAPEUTIC 1965
28  HN = /therapeutic use was HEPARIN, THERAPEUTIC 1965
29  MED = *1635
30  MED = 3275
31  M90 = *2406
32  M94 = 4517
33  MR = 20040707
34  DA = 19990101
35  DC = 1
36  UI = D006493

   List 5.21.2. Creating a persistent database object from the MESH flat-file.

1        #!/usr/bin/perl
2        use Fcntl;
3        use SDBM_File;
4        tie%item, "SDBM_File", 'mesh',
      O_RDWR|O_CREAT|O_EXCL, 0644;
5        untie%item;     #these two lines simply create a file
6        open (TEXT, "d2002.bin")||die"Can't open file";
7        $/ = "*NEWRECORD";
8        $line = " ";
9        while ($line ne "")
10           {
11          tie%item, "SDBM_File", 'mesh', O_RDWR,
    0644;  #use the created file
12            $line = <TEXT>;
13            @linearray = split(/\n/,$line);
14            foreach $piece (@linearray)
15              {
16              if ($piece =~ /MN = /)
17                {
18                $meshno = $';
19                }
20              if ($piece =~ /ENTRY = /)
21                {
22                $entry = $';
23                if ($entry =~ /\|/o)
24                   {
25                   $entry = $`;
26                   }
27                $entry =~ s/s\b//g;
28                $entry = lc($entry);
29                push (@synonyms, $entry);
30                }
31              }
32            foreach $term (@synonyms)
33              {
34              $item{$term} = $meshno;
35              }
36           undef $meshno;
37           undef @synonyms;
38           untie%item;
39          }
40       undef(%item);
41       close TEXT;
42       exit;

   List 5.22.1. Retrieving a persistent database object from the MESH flat-file.

1        #!/usr/bin/perl
2        use Fcntl;
3        use SDBM_File;
4        tie%item, "SDBM_File", 'mesh', O_RDWR, 0644;
5        while(($key, $value) = each (%item))
6          {
7          print "$key => $value\n";
8          }
9        untie%item;
10        exit;
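
The two scripts above split the work between a writer and a reader. The following minimal sketch shows the same round trip in one place: tie a hash to an SDBM file, store a pair, untie, then tie again and read the pair back. The database name `mesh_demo` and the use of a temporary directory are assumptions made so that the sketch leaves no files behind.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use File::Temp qw(tempdir);

# Work in a throwaway directory (an assumption of this sketch).
my $dir  = tempdir(CLEANUP => 1);
my $base = "$dir/mesh_demo";    # hypothetical database name

# First pass: create the database and store one term => MeSH number pair.
my %item;
tie(%item, 'SDBM_File', $base, O_RDWR|O_CREAT, 0644)
    or die "Can't create database: $!";
$item{'heparin'} = 'D09.698.373.400';
untie %item;

# Second pass: reopen the same files read-only and fetch the value.
my %again;
tie(%again, 'SDBM_File', $base, O_RDONLY, 0644)
    or die "Can't reopen database: $!";
my $meshno = $again{'heparin'};
untie %again;

print "heparin => $meshno\n";
```

Because the value survives the `untie`, the hash behaves like a small persistent database rather than an in-memory variable.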

   List 5.23.1. Syntax rules for valid XML tags.

1   XML tags, like Perl variables, are case-sensitive
      ("Name" is different from "name").
      Parsers must preserve character case
2   Letters, underscores, hyphens, periods and numbers may be
      used in a tag
3   Only letters and underscores are eligible as the first
      character of a tag
4   Colons are allowed, but only as part of a declared namespace
      prefix. For all practical purposes, this means that only one
      colon is allowed in a tag, and the colon must appear in an
      internal location in the tag (not at the beginning or the
      end of a tag).

   List 5.23.2. A program that validates XML tags.

1        #!/usr/bin/perl
2        @elements = qw (gene 4gene gene:ncbi gene-autry ge::ne
                 gene&autry -gene _gene gene- gene:
                 :gene ge:n:e ge:ne: ge,ne);
3        foreach $value (@elements)
4          {
5           if ($value =~
6             {
7             print "$value is good\n";
8             }
9      else
10             {
11            print "$value is bad\n";
12            }
13         }
14       exit;
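
The pattern in line 5 of the listing above did not survive, so the following is only a sketch of one regular expression that follows the syntax rules in List 5.23.1. It is not the book's pattern, and for edge cases such as a trailing colon it may give different answers from the printed output. The subroutine name `is_valid_tag` is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $tag obeys the rules in List 5.23.1.
sub is_valid_tag {
    my ($tag) = @_;
    return $tag =~ /
        \A [A-Za-z_]               # first character: letter or underscore
        [A-Za-z0-9_.\-]*           # then letters, digits, underscore, period, hyphen
        (?: : [A-Za-z0-9_.\-]+ )?  # optionally one colon, in an internal position
        \z
    /x ? 1 : 0;
}

for my $tag (qw(gene 4gene gene:ncbi ge::ne -gene _gene :gene ge:n:e)) {
    print "$tag is ", (is_valid_tag($tag) ? "good" : "bad"), "\n";
}
```

Anchoring with `\A` and `\z` makes the pattern test the whole tag, so a stray character anywhere causes a rejection.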

   List 5.23.3. Output of the program in List 5.23.2.

1   c:\ftp>perl
2   gene is good
3   4gene is bad
4   gene:ncbi is good
5   gene-autry is good
6   ge::ne is bad
7   gene&autry is bad
8   -gene is bad
9   _gene is good
10   gene- is good
11  gene: is good
12  :gene is bad
13  ge:n:e is bad
14  ge:ne: is bad
15  ge,ne is bad
16 is good

   List 5.24.1. What we have learned so far.

1   The =~ operator tells Perl to look for the pattern that
      follows the operator in the variable that precedes the
      operator. Regular expressions are Perl's way of describing a
      pattern
2   You can create most of your patterns by following a few
      simple rules and by "borrowing" regular expressions from
      published listings
3   The most common use for regular expressions is in scripts
      that examine a line (or all the lines) from a file and that
      perform a substitution or rearrangement or other operation
      on the line, based on the results of the pattern match
4   Regular expressions are a powerful and fast tool for
      modifying text or data records or finding exactly what you
      want in any text
5   Perl associative arrays can be tied to an external database
      object that persists even when the Perl script has finished

   List 6.1.1. Some biomedical informatics tasks that can be accomplished with Perl.

1   Statistics
2   Mathematical Computations
3   Mathematical modeling
4   Web protocols (e.g., http and ftp)
5   Cryptographic techniques
6   Integrating data
7   Glue functions (e.g., calling subroutines written in C)
8   Digital Signal Processing (including Image analysis)
9   Bioinformatics methods (e.g., interfacing to BLAST)
10  Database interfaces
11  Remote procedure calls and distributed computing
12  Middleware (see Glossary); software agents (via web
    services, GRID, SOAP (see Glossary), or related protocols)
13  Transformations to and from XML
14  XML data queries
15  Logical annotation of data (e.g., RDF)

   List 6.2.1. Creating an MD_5 one-way hash value for any provided string.

1        #!/usr/local/bin/perl
2        use MD5;
3        print "What words would you like to digest?\n";
4        $holdstring = <STDIN>;
5        chomp;
6        $hexhashstring = MD5->hexhash($holdstring);
7        print "md_5 hexhash => $hexhashstring\n";
8        exit;
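
The listing above relies on the older MD5 module. As a hedged alternative, the sketch below uses Digest::MD5, which ships with current Perl distributions. The helper name `digest_string` is an assumption, and the sketch removes the trailing newline before hashing, so its digests are not guaranteed to match those printed for the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: hex MD5 digest of a string, ignoring one trailing newline.
sub digest_string {
    my ($string) = @_;
    chomp $string;              # chomp the variable itself, not $_
    return md5_hex($string);
}

# A line typed at the keyboard arrives with a newline attached.
print "md_5 hexhash => ", digest_string("hello\n"), "\n";
```

Passing the variable to `chomp` explicitly matters: a bare `chomp;` operates on `$_` and would leave the newline in the string being hashed.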

   List 6.2.2. Three executions of the MD_5 algorithm.

1   Execution 1:
2   c:\ftp>perl
3   What words would you like to digest?
4   Jules Berman
5   md_5 hexhash => 0ab7ad79962fd2ea036cc8dbaade6f2a

   List 6.2.3. Creating an MD_5 one-way hash for a file.

1        #!/usr/local/bin/perl
2        use MD5;
3        print "What file would you like to
4        $holdfile = <STDIN>;
5        chomp;
6        open (TEXT,"$holdfile");
7        $context = new MD5;
8        $context->addfile(TEXT);
9        $digest = $context->digest();
10        print (unpack ("H*", $digest));
11       exit;

   List 6.3.1. Simple Perl script for computing the mean from an array of numbers.

1        #!/usr/bin/perl
3        #computes the mean of an array of numbers
4        @numbersarray = (1,2,3,4,5,6,7,8,9,10);
5        $arraysize = scalar(@numbersarray);
6        print "The number of elements in our array is $arraysize\n";
7        $sum = 0;
8        foreach $value(@numbersarray)
9          {
10          $sum = $sum + $value;
11         }
12       $mean = $sum / $arraysize;
13       print "Your population number is
14       print "The array mean is $mean\n";
15       exit;

   List 6.3.2. General method of building an array that can be used in a statistical or mathematical Perl routine.

1   Open the file containing your records
2   Go through the file, one line (record) at a time
3   From a complex record, pick out the number you want using
      a regular expression
4   Add that number to your array variable (using the Perl push
      command)
5   Calculate the mean (or any other statistical test) on the
      array variable.
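
The steps above can be sketched as follows. This is an illustration under stated assumptions: the record layout (pipe-delimited fields with a `weight=` entry), the helper name `mean_of_field`, and the use of an in-memory filehandle in place of a real file are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical records; the pipe-delimited "weight=NNN" layout is assumed.
my $records = <<'END_RECORDS';
id=001|name=John Public|weight=155
id=002|name=Mary Smith|weight=110
id=003|name=Jules Berman|weight=195
END_RECORDS

# Read records from a filehandle and return the mean of one numeric field.
sub mean_of_field {
    my ($fh, $field) = @_;
    my @numbers;
    while (my $line = <$fh>) {                          # one record at a time
        if ($line =~ /\b\Q$field\E=(\d+(?:\.\d+)?)/) {  # pick out the number
            push @numbers, $1;                          # add it to the array
        }
    }
    return undef unless @numbers;                       # no values found
    my $sum = 0;
    $sum += $_ for @numbers;
    return $sum / @numbers;
}

# An in-memory filehandle stands in for the file of records.
open(my $fh, '<', \$records) or die "Can't open in-memory records: $!";
my $mean = mean_of_field($fh, 'weight');
close $fh;
printf "mean weight: %.2f\n", $mean;
```

Once the numbers are in an array, any other statistic can replace the mean without changing the reading loop.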

   List 6.3.3. Computing the mean of an array entered at keyboard.

1        #!/usr/bin/perl
3        #computes the mean of an array of numbers entered at
         #the keyboard
4        print "Type a bunch of numbers, pressing the return key\n";
5        print "after each number. Decimal numbers are allowed\n";
6        $number = " ";
7        until ($number eq "")
8          {
9          $number = <STDIN>;
10           $number =~ s/\n//o;  #deletes the newline character
11          if ($number eq "")
12            {
13            next;
14            }
15          if ($number !~ /[0-9]+/)     #the entry must contain
                                         #at least one digit
16            {
17            print "You're only allowed to enter
18            print " We just won't count this
19            next;
20            }
21          if ($number !~ /^[0-9

   List 6.3.4. Output of the script in List 6.3.3.

1   C:\ftp>perl
2   Type a bunch of numbers, pressing the return key after each
      number. Decimal numbers are allowed

   List 6.4.1. Some of the available Perl statistics modules.

1   Statistics-Basic
2   Statistics-ChiSquare
3   Statistics-ConwayLife
4   Statistics-DEA
5   Statistics-DependantTTest
6   Statistics-Descriptive-Discrete
7   Statistics-Frequency
8   Statistics-KruskalWallis
9   Statistics-Lite
10  Statistics-LSNoHistory
11  Statistics-LTU
12  Statistics-OLS
13  Statistics-RankOrder
14  Statistics-ROC
15  Statistics-Shannon
16  Statistics-Table-F
17  Statistics-Test
18  Statistics-TTest

   Before you can use these tests, you must download the
   appropriate module into your Perl installation. A sample
   installation of Statistics-Descriptive (by Colin Kuskie,
   Andrea Spinelli and Jason Kastner) through the ActiveState
   package manager is shown here. Only the first line is input.

1   ppm> install statistics-descriptive
2   ==================== Install 'statistics-descriptive'
      version 2.6 in ActivePerl ====================
3   Downloaded 10294 bytes.
4   Extracting 5/5: blib/arch/auto/Statistics/Descriptive/.exists
5   Installing
6   Installing C:\activepl\site\lib\Statistics\
7   Successfully installed statistics-descriptive version 2.6 in
      ActivePerl
   List 6.4.2. Perl script for calculating variance.

1        #!/usr/local/bin/perl
2        use Statistics::Descriptive;
3        $stat = Statistics::Descriptive::Full->new();
4        $stat->add_data(1,2,3,4,5,6,7,8,9,10);
5        $mean = $stat->mean();
6        $var  = $stat->variance();
7        print "mean $mean\nvariance $var\n";
8        exit;

   List 6.4.3. Output of statistics script.

1   c:\ftp>perl
2   mean 5.5
3   variance 9.16666666666667
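
The variance shown above is the sample variance, in which the sum of squared deviations is divided by n - 1 rather than n. The sketch below computes it by hand, without the module, to make the arithmetic visible. The subroutine names are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Arithmetic mean of a list of numbers.
sub mean {
    my $sum = 0;
    $sum += $_ for @_;
    return $sum / @_;
}

# Sample variance: sum of squared deviations divided by (n - 1).
sub sample_variance {
    my @x  = @_;
    my $m  = mean(@x);
    my $ss = 0;
    $ss += ($_ - $m) ** 2 for @x;
    return $ss / (@x - 1);
}

my @data = (1 .. 10);
print "mean ", mean(@data), "\n";
print "variance ", sample_variance(@data), "\n";
```

For the numbers 1 through 10 the squared deviations sum to 82.5, and 82.5 / 9 gives the value reported by the module.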

   List 6.4.4. Perl script for computing the ChiSquare statistic.

1        #!/usr/bin/perl
2        use Statistics::ChiSquare;
3        print chisquare([1, 9, 1, 15, 4, 7]), "\n";
4        print chisquare([20, 20, 20, 30, 20, 20, 30 ]), "\n";
5        exit;

   List 6.4.5. Output of the script in List 6.4.4.

1   C:\ftp>perl
2   There's a <1% chance that this data is random
3   There's a >50% chance, and a <70% chance, that this
      data is random.

   List 6.5.1. Types of statistical errors.

1   Type 1. Rejecting the null hypothesis when the null
      hypothesis is correct (i.e., seeing an effect when there was
      none)
2   Type 2. Accepting the null hypothesis when the null
      hypothesis is false (i.e., seeing no effect when there was
      one)
3   Type 3. Rejecting the null hypothesis correctly, but for the
      wrong reason, leading to an erroneous interpretation of the
      data in favor of an incorrect affirmative statement
4   Type 4. Erroneous conclusion based on performing the wrong
      statistical test. The type 4 error is the most embarrassing
      and the least excusable. You cannot blame a type 4 error on
      the data. It's all on you. Considering the rich variety of
      exotic statistical tests available to the novice, the
      opportunities for type 4 errors are endless.

   One way of avoiding type 4 errors is to have a dedicated
   statistician analyze your data. For those informaticians who
   have access to the services of a trustworthy statistician,
   this may actually be the best and most practical solution.
   There is, however, an alternative approach: resampling.
   Resampling is a type of statistical analysis that uses
   computers to model experiments and then repeats the
   experiments thousands or millions of times to determine the
   occurrence frequencies for particular sets of data. This area
   of statistics was popularized by Bradley Efron, and may have
   particular interest for readers of this book, for these
   reasons:

1   Does not require any knowledge of statistical tests
2   Applicable to a wide range of problems, including clinical
      trial design and decision analyses
3   Easy to understand
4   Easy to program with Perl
   List 6.6.1. A Perl script that simulates 600,000 casts of the die.

1        #!/usr/bin/perl
3        #Simulation of a throw of a die
4        $count = 0;
5        while ($count < 600000)
6           {
7           $count++;
8           $one_of_six = (int(rand(6))+1);
9           $hash{$one_of_six}++;
10           }
11        while(($key, $value) = each (%hash))
12          {
13        print "$key => $value\n";
14          }
15       exit;

   List 6.6.2. Output of the first run of the script in List 6.6.1.

1   C:\ftp>perl
2   1 => 100002
3   2 => 99902
4   3 => 99997
5   4 => 100103
6   5 => 99926
7   6 => 100070

   List 6.6.3. Output of the second run of the script in List 6.6.1.

1   C:\ftp>perl
2   1 => 100766
3   2 => 99515
4   3 => 100157
5   4 => 99570
6   5 => 100092
7   6 => 99900

   List 6.6.4. A Perl script that assigns random names to newly created files.

1        #!/usr/bin/perl
3        #Makes 10 randomly named files, with 8 leading characters,
4        #a period and three trailing characters
5        while ($count < 10)
6         {
7         $count++;
8         &ranfile;
9         }
10       sub ranfile
11       {
12       my @listchar;
13       my $count;
14       for ($count = 1; $count <= 12; $count++)
15           {
16           push(@listchar, chr(int(rand(26))+65));
17           }
18       $listchar[8]= ".";
19       my $randomfilename = join("",@listchar);
20       print "Your filename is $randomfilename\n";
21       return $randomfilename;
22       }
23       exit;

   List 6.6.5. Output of the script in List 6.6.4.

1   C:\ftp>perl
2   Your filename is EKDUFKBR.YNX
3   Your filename is QVDKUVBY.QUI
4   Your filename is FNZXNKEE.MLV
5   Your filename is NRTXEHQI.VFX
6   Your filename is GWMOLKMX.AYU
7   Your filename is LZAKZQDW.RYR
8   Your filename is PRUAONQQ.OSJ
9   Your filename is XDEDHLKD.GAY
10   Your filename is RUSLNSXI.XVR
11  Your filename is IEPGAWDP.LEH

   List 6.7.1. A Perl script that simulates clonal tumor growth.

1        #!/usr/bin/perl
3        #Simulates the growth of a tumor from a single cell, with
4        #a cell death probability per generation as provided by
         #the user
5        print "Enter the death probability for your simulation\n";
6        print "Number must be between zero and one.\n";
7        print "Most realistic numbers are .45 to .46\n";
8        $value = <STDIN>;
9        $value =~ s/\n//o;
10       if ($value > 1)
           {
11         print "Exiting... you must pick a number between " .
               "zero and one\n";
12         exit;
13         }
    SIMULATION IS $value\n\n";
15       my $roundnumber = 1; #initiate the generation counter
16       &cycle;

   List 6.7.2. Output of the script in List 6.7.1.

1   C:\ftp>perl
2   Enter the death probability for your simulation
    Number must be between zero and one.
    Most realistic numbers are .45 to .46
    Starting with a single malignant cell, let's watch the
      clonal growth. Tumor terminated...good!
3   Starting with a single malignant cell, let's watch the
      clonal growth. 1 Tumor terminated...good!
4   Starting with a single malignant cell, let's watch the
      clonal growth. 2 1 4 2 1 1 1 Tumor terminated...good!
5   Starting with a single malignant cell, let's watch the
      clonal growth. 2 1 5 6 8 8 8 12 15 18 18 20 19 27 32 30 31
      20 14 16 23 30 30 36 38 34 50 52 67 75 97 114 133 143 150
      156 159 178 200 254 302 292 329 336 382 441 489 603 630 701
      770 862 923 1056 1084 1210 1369 1473 1664 1776 1959 2196
      2475 2862 3098 3327 3740 4095 4634 Bad news. Let's stop
      watching this malignancy
6   Starting with a single malignant cell, let's watch the
      clonal growth. Tumor terminated...good!
7   Starting with a single malignant cell, let's watch the
      clonal growth. Tumor terminated...good!
8   Starting with a single malignant cell, let's watch the
      clonal growth. 3 1 3 5 3 1 1 Tumor terminated...good!
9   Starting with a single malignant cell, let's watch the
      clonal growth. 4 6 5 6 3 3 1 Tumor terminated...good!
10  Starting with a single malignant cell, let's watch the
      clonal growth. 2 2 5 3 2 2 6 5 6 5 4 2 1 Tumor
      terminated...good!
11  Starting with a single malignant cell, let's watch the
    clonal growth. 3 5 3 7 6 3 3 1 1 Tumor terminated...good!
    I've seen enough!

   List 6.7.3. Perl snippet showing the algorithm that repeatedly assigns probabilistic outcomes to an event.

1        while ($i < $sum +1)
2          {
3          $i++;
4          $randnum = int( rand(100) ) + 1;
5          if ($randnum > (100 * $value))
6             {
7             $sum = $sum + 1;
8             }
9          if ($randnum < ((100 * $value) -1))
10             {
11            $sum = $sum - 1;
12            }
13         }
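
The snippet above adds a cell when the random draw favors survival and removes one when it favors death. The following sketch expresses the same idea as a self-contained subroutine that follows one clone generation by generation. It is a simplified restatement, not the book's full script: the name `grow_clone`, the size cap, and the generation limit are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: follow one clone until it dies out, reaches $cap
# cells, or runs for $max_generations. In each generation every cell either
# dies (with probability $death) or divides into two cells.
sub grow_clone {
    my ($death, $cap, $max_generations) = @_;
    my $cells = 1;                  # start with a single malignant cell
    my @history;
    for (1 .. $max_generations) {
        my $next = 0;
        for (1 .. $cells) {
            $next += 2 unless rand() < $death;   # a survivor becomes two cells
        }
        $cells = $next;
        push @history, $cells;
        last if $cells == 0 || $cells >= $cap;
    }
    return @history;                # clone size after each generation
}

my @sizes = grow_clone(0.45, 4000, 200);
print "@sizes\n";
print $sizes[-1] == 0 ? "Tumor terminated\n" : "Clone still growing\n";
```

With a death probability of 0.45 each cell leaves 1.1 descendants on average, so most clones die out early while a few escape and grow steadily, as in the output above.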

   List 6.8.1. A resampling script in Perl that simulates runs of errors.

1        #!/usr/local/bin/perl
2        $errorno = 0;
3        while ($count < 100000)
4          {
5          $count++;
6          $x = rand(100);
7          if ($x < 2)
8             #simulates a 2% error rate
9             {
10             $errorno++;
11            }
12    else
13            {
14            $errorno = 0;
15            }
16          if ($errorno == 3)
17            {
18            print "Uh oh. 3 consecutive errors\n";
19            $errorno = 0;
20            }
21         }
22       exit;

   The Perl script simulates 100,000 diagnoses, which is a fair
   estimate of the total number of diagnoses a pathologist might
   render in their entire career (at 4,000 diagnoses per year
   over 25 years of service). Each diagnosis is assigned a random
   number between 0 and 100. The "diagnosis" loop is repeated
   100,000 times. In each loop, if the randomly assigned number
   is less than 2, the pathologist's error number is incremented
   by 1. If the next diagnosis is randomly assigned a number of 2
   or greater, the error number is dropped back down to 0 (i.e.,
   the diagnosis is correct and the run of errors is broken). If
   an error occurs on 3 consecutive occasions, the event is
   printed to the computer monitor.

   Output of the script:

23  c:\ftp>perl
24  Uh oh. 3 consecutive errors
25  Uh oh. 3 consecutive errors
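
A single simulated career gives only one sample. The sketch below wraps the same logic in a subroutine so that the career can be repeated many times and the spread of results inspected. The name `count_error_runs` and the choice of ten repetitions are assumptions. As a rough back-of-envelope check, 100,000 x 0.02 x 0.02 x 0.02 is about 0.8, so roughly one run of three errors per career is plausible.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count runs of $run_length consecutive errors in $n
# simulated diagnoses, each wrong with probability $rate (between 0 and 1).
sub count_error_runs {
    my ($n, $rate, $run_length) = @_;
    my ($streak, $runs) = (0, 0);
    for (1 .. $n) {
        if (rand() < $rate) { $streak++ } else { $streak = 0 }
        if ($streak == $run_length) {
            $runs++;
            $streak = 0;            # start counting a fresh run
        }
    }
    return $runs;
}

# Repeat the 100,000-diagnosis career ten times.
my @results = map { count_error_runs(100_000, 0.02, 3) } 1 .. 10;
print "runs of 3 consecutive errors per career: @results\n";
```

Changing the rate or the run length in the call shows how quickly such runs become rare or common.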

   List 6.9.1. Snippet of Perl code to determine unbiased random selection.

1        open
2        while ($n < 1000000)
3          {
4          $x = int(rand(100)) + 1;  #pick a number between 1
                                     #and a hundred
5          #make a hash of the numbers picked and the
6          #number of times each is picked
           $randhash{$x}++;
7          $n++;
           }
8        foreach $key (sort byval keys %randhash)
9          {
10          print HOLD "$randhash{$key} $key\n";
11         }
12  sub byval
13         {
14         $randhash{$a} <=> $randhash{$b};
15         }

   List 6.10.1. Output of

1   C:\ftp>perl
2   6598
3   C:\ftp>perl
4   3408

   List 6.11.1. Calling a POSIX function from a Perl script.

1        #!/usr/local/bin/perl
2        use POSIX qw(ceil floor);
3        $num = 11.3;
4        print "Floor is ", floor($num), "\n";
5        print "Ceil is ", ceil($num), "\n";
6        exit;

   List 6.11.2. Output of the script in List 6.11.1.

1   c:\ftp>perl
2   Floor is 11
3   Ceil is 12

   List 6.12.1. Using the ActiveState Programmer's Package Manager.

1   c:\ftp>ppm
2   ppm - programmer's package manager version 3.3
3   copyright (c) 2001 activestate corp. all rights reserved
4   activestate is a division of sophos.

   List 6.12.2. Simple example script for the Fast Fourier Transform Module.

1        #!/usr/local/bin/perl
         use Math::FFT;
2        my $PI = 3.1415926536;
3        my $N = 8; #N can be any power of 2, such as 4,8,16,64
4        $series = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16];
         #could be anything
5        print "series " . join(" ",@$series). "\n";
6        my $fft = new Math::FFT($series);
7        my $coeff = $fft->rdft();
8        print "coefficients \n @{$coeff}\n\n";
9        my $spectrum = $fft->spctrm;
10        print "spectrum \n @{$spectrum}\n";
11       exit;

   List 6.12.3. Output of Fast Fourier Transform script

1   C:\FTP>perl
2   series 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

   List 6.13.1. A full concordance in 10 commands.

1        #!/usr/local/bin/perl
2        open (TEXT, "1DFRE10.TXT");
3        open (OUT, ">1DFRE10.OUT");
4        $line = " ";
5        while ($line ne "")
6           {
7           $cumline = "";
8           for($i=0;$i<100;$i++)
9              {
10              $line = <TEXT>;
11             $cumline = $cumline . $line;
12             }
13          $page++;
14          $cumline = lc($cumline);
15          $cumline =~ s/[^a-z\-\']/ /g;
16          @wordarray =  sort(split(/[\n\s]+/,$cumline));
17          @concordance = grep { $marked{$_}++; $marked{$_} ==
    1; } @wordarray;
18          undef %marked;
19          foreach $thing (@concordance)
20             {
21             $wordpage{$thing} = $wordpage{$thing} . " $page";
22             }
23          }
24       #The concordance is finished. The next lines just
         #write it to the output file
25       foreach $key (sort keys %wordpage)
26          {
27          print OUT "$key \= $wordpage{$key}\n";
28          }
29       exit;

   List 6.13.2. A full indexing script in 10 commands.

1        #!/usr/local/bin/perl
2        open (TEXT,
3        open (OUT,
4        $line = " ";
5        @indextermarray = ("gaul","roman
6        while ($line ne "")
7           {
8           $cumline = "";
9           for($i=0;$i<100;$i++)
10              {
11             $line = <TEXT>;
12             $cumline = $cumline . $line;
13             }
14          $page++;
15          $cumline = lc($cumline);
16          $cumline =~ s/[^a-z\-\']/ /g;
17          $cumline =~ s/ +/ /g;
18          foreach $thing (@indextermarray)
19              {
20              if ($cumline =~ /\b$thing\b/)
21                 {
22                 $wordpage{$thing} = $wordpage{$thing} .
    " $page";
23                 }
24              }
25          }
26       #The index is finished. The next lines just display it
         #on screen
27       foreach $key (sort keys %wordpage)
28          {
29          print OUT "$key \= $wordpage{$key}\n";
30          print "$key \= $wordpage{$key}\n";
31          }
32       exit;

   List 6.13.3. An excerpted output for the indexing program, listing only the terms "england" and "village" and the pages on which they are found.

1   c:\ftp>perl
5   england =  5 14 23 134 207 208 229 277
9   village =  43 77 81 94 128 141 147 184 185 225 226 244

   List 6.13.4. Problems with human-based indexing.

1   Incredibly labor-intensive and time-consuming
2   The index cannot be built until the book is in final form
      and the page numbers are known, delaying the publication of
      the book until the indexing is completed
3   If important phrases are omitted completely or if one or
      more of their locations are omitted, no one will likely
      catch the error
4   The indexing effort needs to be repeated if there are book
      revisions and pagination changes.

   List 6.13.5. Extracting candidate index phrases from a test file.

1   #!/usr/local/bin/perl
    @stop = qw(
2   a about absent absence again all almost also although always
      among an
3   and another any are as at be because been before being
      between both
4   but by can could cm did do does done due during each either
5   especially etc for found from further had has have having
      here how
6   however i if in into is it its itself just kg km made mainly
      make may
7   mg might ml mm most mostly must nearly neither no nor
8   obtained often on our overall perhaps present presence quite
9   really regarding seem seen several should show showed shown
10   significantly since so some such than that the their theirs
      them then
11  there therefore these they this those through thus to upon
    use used
12  using various very was we were what when which while with
13  without would or can't doesn't not
14       );
15       open
16       open
17       undef($/);
18       $phrase = <TEXT>;
19       $phrase =~ s/\n/ /g;
20       $phrase = lc($phrase);
21       $phrase =~ s/[^a-z \']/ /g;
22       foreach $stopword (@stop)
23         {
24         $phrase =~ s/ $stopword / \# /g;
25         }
26       $phrase =~ s/[\s]+/ /g;
27       $phrase =~ s/ ?\# ?/\#/g;
28       @phraselist = sort (split("#",$phrase));
29  @phraselist = grep
30       {$i{$_}++;(($i{$_}==2)&&(scalar(split("
31       print OUT join("\n",@phraselist);
         exit;

   First 9 lines of output from script
32  abate fortis
33  abbe foucher
34  abdication of diocletian
35  abilities of
36  able leader
37  abolition of
38  absolute power
39  abuse of
40  academy of inscriptions
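
Parts of the script above are truncated, so the following is only a sketch of the general technique it describes: stop words are turned into phrase breaks, and the multi-word phrases that occur more than once are kept as candidate index terms. It is not a reconstruction of the book's code. The short stop list, the helper name `candidate_phrases`, and the sample text are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A short stop list for illustration, drawn from the longer list above.
my @stop = qw(a an and are as by in is it on or the to was were with);

# Hypothetical helper: return the multi-word phrases that occur at least
# twice, where phrases are the runs of words between stop words.
sub candidate_phrases {
    my ($text) = @_;
    $text = lc $text;
    $text =~ s/[^a-z ']/ /g;                  # keep letters, spaces, apostrophes
    my $stop_re = join '|', map { quotemeta } @stop;
    # Replace each whole stop word with a phrase-break marker.
    $text =~ s/(?<![a-z'])(?:$stop_re)(?![a-z'])/#/g;
    my (%seen, @phrases);
    for my $phrase (split /#/, $text) {
        $phrase =~ s/\s+/ /g;                 # collapse runs of whitespace
        $phrase =~ s/^ | $//g;                # trim the ends
        next unless $phrase =~ / /;           # multi-word phrases only
        push @phrases, $phrase if ++$seen{$phrase} == 2;  # second sighting
    }
    return sort @phrases;
}

# Hypothetical sample text.
my $sample = "The Roman empire was vast. The Roman empire is old, "
           . "and the Roman senate was weak.";
print join("\n", candidate_phrases($sample)), "\n";
```

Matching stop words with look-around assertions, rather than with a space on each side, lets consecutive stop words and stop words at the start of the text be caught as well.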

   List 6.14.1. Algorithm for regular expression searches of text files.

1   Asks you for a regular expression to search a file. If
      you're not adept at regular expressions, just enter any
      word. Remember, a word or phrase is always the simplest
      regular expression. In the output example, we'll search for
      the word "adenocarcinoma"
2   If you enter the return-key without entering a regular
      expression, it simply exits the script
3   Asks Perl to give you the current epoch time (number of
      seconds passed since some point in history)
4   Opens an enormous publicly available file (138 Mbytes)
      named MRCON (we'll learn a lot about this file in Biomedical
5   Reads every line of MRCON (about 2 million of them),
      testing each line to see if it contains a substring that
      matches the regular expression that you provided (step 1)
6   If it finds a match, it adds the line number and the line
      to an external file named regexout.txt
7   When it's finished reading the file, it asks Perl again
      for the epoch time, and determines the script execution time
      by subtracting the script's beginning time from the script's
      end time
8   It prints to the monitor the time spent executing the
      script, as well as the filename containing the output of all
      the lines from the MRCON file that matched your provided
      regular expression.
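
The steps above can be sketched as a subroutine. This is an illustration under assumptions, not the book's script: the name `regex_search` is hypothetical, the pattern is passed as an argument instead of being read from the keyboard, and a small temporary file stands in for MRCON so that the sketch runs anywhere.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write every line of $infile that matches $regex to
# $outfile, prefixed by its line number. Returns (matches, elapsed seconds).
sub regex_search {
    my ($regex, $infile, $outfile) = @_;
    return (0, 0) if !defined $regex || $regex eq '';  # nothing to search for
    my $re    = qr/$regex/i;                           # compile the pattern once
    my $start = time();                                # epoch time at the start
    open(my $in,  '<', $infile)  or die "Can't open $infile: $!";
    open(my $out, '>', $outfile) or die "Can't open $outfile: $!";
    my $matches = 0;
    while (my $line = <$in>) {
        next unless $line =~ $re;
        print {$out} "$. $line";                       # $. is the line number
        $matches++;
    }
    close $in;
    close $out;
    return ($matches, time() - $start);
}

# Build a three-line stand-in for the large file (an assumption of this sketch).
my $dir = tempdir(CLEANUP => 1);
my ($infile, $outfile) = ("$dir/sample.txt", "$dir/regexout.txt");
open(my $fh, '>', $infile) or die "Can't write $infile: $!";
print {$fh} "Adenocarcinoma of colon\n",
            "Squamous cell carcinoma\n",
            "adenocarcinoma of lung\n";
close $fh;

my ($count, $seconds) = regex_search("adenocarcinoma", $infile, $outfile);
print "Retrieval time is $seconds seconds.\n";
print "$count matching lines written to $outfile\n";
```

Compiling the pattern once with `qr//` keeps the per-line cost low, which matters when the file has millions of lines.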

   List 6.14.2. Perl script for regular expression searches of text files.

1        #!/usr/bin/perl
3        #11/20/01
4        #this will pull out all the matching lines for a
5        #regular expression from any text file. This short
         #script is incredibly
6        #powerful, but it requires the user to have facility with
7        #regular expressions
8        open (OUT, ">regexout.txt")||die"Can't open file";
9        $filename = "regexout\.txt";
10         print "What's your search regex?\n";
11        $regex = <STDIN>;
12        $regex =~ s/\n//o;
13        if ($regex eq "")
14          {
15          close TEXT;
16          close OUT;
17          print "\nYou didn't give a
18          }
19        #$re = qr/$regex/oi;
20        $start = time();
21        &searchsub;
22        $end = time() - $start;
23        print "Retrieval time is $end seconds.\n";
24        print "Your search results are in file

   List 6.14.3. Output of regular expression search.

1   C:\ftp>perl
2   What's your search regex?
3   adenocarcinoma
4   Retrieval time is 5 seconds
5   Your search results are in file regexout.txt.

   List 6.15.1. A short script that performs a binary search on a file.

         #!/usr/local/bin/perl
         #The script assumes a sorted file whose lines begin with
         #a lowercase word followed by a space
         open (TEXT, "find_bin.txt")||die "Cannot open find_bin.txt";
         print "What word would you like to find?\n";
         $findword = <STDIN>;
         $findword =~ s/\n$//o;
         #seek to the end of the file to learn its size in bytes
         seek(TEXT, 0, 2);
         $filesize = tell (TEXT);
         #divide the file into 128 portions, saving each offset
         for($i=1;$i<129;$i++)
            {
            $portion = int(($filesize * $i)/128);
            push(@portionarray,$portion);
            }
         seek(TEXT, 0, 0);
         $arraynumber = 64;
         #start at the midpoint and halve the step on each pass
         foreach $division (4,8,16,32,64,128)
            {
            $place = $portionarray[$arraynumber-1];
            seek(TEXT, $place, 0);
            $line = <TEXT>;   #discard the partial line at the offset
            $line = <TEXT>;   #read the first complete line
            $line =~ /^([a-z]+) /;
            $estimate_word = $1;
            if ($estimate_word gt $findword)
              {
              $arraynumber = $arraynumber - (128/$division);
              }
            else
              {
              $arraynumber = $arraynumber + (128/$division);
              }
            }
         #read 20,000 bytes around the last offset and search them
         $begin = $place - 10000;
         $begin = 0 if ($begin < 0);
         seek(TEXT, $begin, 0);
         read(TEXT,$holder,20000);
         #\Q...\E keeps characters in the word from acting as regex syntax
         if ($holder =~ /\n(\Q$findword\E)[0-9\= ]+\n/)
            {
            print $&;
            }
         else
            {
            print "Sorry. Couldn't find $findword in the file.\n";
            }
         exit;

   List 6.16.1. A Perl script demonstrating a clustering algorithm.

1        #!/usr/local/bin/perl
2        use Algorithm::Cluster;

   List 6.16.2. Output of script.

1   c:\ftp>perl
2   Row0 => Cluster 1
3   Row1 => Cluster 2
4   Row2 => Cluster 2
5   Row3 => Cluster 1
6   Row4 => Cluster 0
7   Row5 => Cluster 0

   List 6.17.1. Example of a very simple program using the LWP (Library for WWW in Perl).

1        #!/usr/bin/perl
2        use LWP::Simple;
3        print (get "");
4        exit;

   List 6.18.1. Some Perl books in bioinformatics (a very different field from biomedical informatics)

1   Beginning Perl for Bioinformatics, by James Tisdall
2   Mastering Perl for Bioinformatics, by James Tisdall
3   Genomic Perl: From Bioinformatics Basics to Working Code by
      Rex A. Dwyer
4   Perl Programming for Biologists, by D. Curtis Jamison
5   Developing Bioinformatics Computer Skills by Per Jambeck and
      Cynthia Gibas
6   Bioinformatics Biocomputing and Perl: An Introduction to
      Bioinformatics Computing Skills, by Michael Moorhouse and
      Paul Barry

   List 6.18.2. A simple DNA palindrome, GAATTC.
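
GAATTC is a palindrome in the DNA sense: read on the opposite strand, in the opposite direction, it spells the same sequence. The short sketch below checks this by comparing a sequence with its reverse complement. The subroutine names reverse_complement and is_dna_palindrome are chosen here for illustration and do not come from the book, and the sketch assumes uppercase sequences containing only A, C, G and T.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the reverse complement of a DNA string (uppercase A, C, G, T only).
sub reverse_complement
   {
   my ($seq) = @_;
   my $rev = reverse($seq);   # scalar context reverses the string
   $rev =~ tr/ACGT/TGCA/;     # swap each base for its complement
   return $rev;
   }

# A sequence is a DNA palindrome when it equals its own reverse complement.
sub is_dna_palindrome
   {
   my ($seq) = @_;
   return $seq eq reverse_complement($seq) ? 1 : 0;
   }

foreach my $seq ("GAATTC", "GATTACA")
   {
   print "$seq ", (is_dna_palindrome($seq) ? "is" : "is not"), " a palindrome\n";
   }
```

Reversing GAATTC gives CTTAAG, and complementing each base of CTTAAG gives GAATTC again, so the first sequence passes the test and the second does not.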

   List 6.18.3. Perl script for finding palindromes in a gene sequence.

         #!/usr/bin/perl
         $filename = "sample";
         open (TEXT, $filename)||die "Cannot open $filename";
         $line = " ";
         $count = 0;
         #precompile one regex for each flank length from 5 to 20
         for $n (5..20)
            {
            $re = qr/[CAGT]{$n}/;
            $regexes[$n-5]= $re;
            }
         NEXTLINE: while ($count < 1000)
            {
            $line = <TEXT>;
            last unless defined($line);   #stop at end of file
            $count++;
            foreach my $value (@regexes)
               {
               $start = 0;
               while ($line =~ /$value/g)
                  {
                  $endline = $';
                  $match = $&;
                  #build the reverse complement of the matched flank
                  $revmatch = reverse($match);
                  $revmatch =~ tr/CAGT/GTCA/;
                  #allow a spacer of up to 15 bases between the flanks
                  if ($endline =~ /^([CAGT]{0,15})($revmatch)/)
                     {
                     $start = 1;
                     $palindrome = $match . "*" . $1 . "*" . $2;
                     $palhash{$palindrome}++;
                     }
                  }
               #no palindrome at this flank length: go to the next line
               if ($start == 0)
                  {
                  next NEXTLINE;
                  }
               }
            }
         close TEXT;
         while(($key, $value) = each (%palhash))
            {
            print "$key => $value\n";
            }
         exit;

   List 6.18.4. Input of (line-breaks omitted from original file).


   List 6.18.5. Output of

1   (* separates the spacer region from the flanking palindromic
2   C:\FTP>perl
4   AGTAT*T*ATACT => 1

   List 6.21.1. Examples of software utility functions.

1   Archiving utilities
2   Calculator utility
3   Compression/decompression utilities
4   Conversion utilities - Converts files (text, images, sound,
      video) to and from different formats
5   Database utilities
6   Directory searching
7   Email service
8   Encryption/decryption utilities
9   File copying utilities
10   File reading and parsing utilities
11  FTP file retrieval
12  Indexing utilities
13  Sorting utilities
14  Searching utilities
15  Telnet remote computer access
16  Text editing
17  Web retrieval utilities

   List 6.22.1. Types of software of possible interest to the FDA.

1   Software used as a component, part, or accessory of a
      medical device
2   Software that is itself a medical device (e.g., blood
      establishment software)
3   Software used in the production of a device (e.g.,
      programmable logic controllers in manufacturing equipment)
4   Software used in implementation of the device manufacturer's
      quality system (e.g., software that records and maintains
      the device history record).

   List 6.22.2. Features of software that buyers want.

1   Easy installation
2   Simple instructions and documentation
3   Friendly graphic user interface
4   Functionality that supports the user's goals
5   Transparency (no need for user to understand the underlying
      assumptions, algorithms and data structures upon which the
      functionality of the software is based)
6   Compatibility with operating system and other software
      residing on the user's computer
7   Good user support services

   List 6.22.3. Features of good software (that serious biomedical informaticians need).

1   Extensibility. The functionality of the software and the
      data can be modified and expanded
2   Scalability. Should work with any size of inputs
3   Standardization of all data (input and output)
4   Open source code
5   Open access data
6   Self-describing software
7   Cross-platform functionality. Software should operate in
      multiple operating systems
8   Interoperability
9   Availability of updates
10   Full documentation of methods and algorithms

   List 6.22.4. Some properties of valid software, modeled on FDA Principles of Software Validation.

1   Verified. "Software verification looks for consistency,
      completeness, and correctness of the software and its
      supporting documentation, as it is being developed, and
      provides support for a subsequent conclusion that software
      is validated."
2   Tested. "confirm that software development output meets
      its input requirements." Includes testing at user site
3   Validated. "confirmation by examination and provision of
      objective evidence that software specifications conform to
      user needs and intended uses, and that the particular
      requirements implemented through software can be
      consistently fulfilled."
4   Risk-assessed. The safety risks posed by the software should
      be specified
5   Requirement-documented. "The software and system
      requirements must be fully documented." The
      validation step requires an analysis of compliance with
      documented requirements
6   Controlled development process. Bugs are often introduced
      during the software development process. A controlled
      development process ensures that changes in the design of
      software are tracked and evaluated, documented and corrected
      when necessary
7   Design review. "Design reviews are documented,
      comprehensive, and systematic examinations of a design to
      evaluate the adequacy of the design requirements, to
      evaluate the capability of the design to meet these
      requirements, and to identify problems."

   List 6.22.5. Conditions that can make software difficult to evaluate.

1   Intrinsic complexity of the software
2   Use of off-the-shelf software (see Glossary)
3   Use of external software components
4   Withheld source code and poor documentation.

   List 7.1.1. Equivalent terms for the Concept identifier C4863000.

1   C4863000  prostate with adenoca
2   C4863000  adenoca arising in prostate
3   C4863000  adenoca involving prostate
4   C4863000  adenoca arising from prostate
5   C4863000  adenoca of prostate
6   C4863000  adenoca of the
7   C4863000  prostate with adenocarcinoma
8   C4863000  adenocarcinoma arising in prostate
9   C4863000  adenocarcinoma involving prostate
10  C4863000  adenocarcinoma arising from prostate
11  C4863000  adenocarcinoma of prostate
12  C4863000  adenocarcinoma of the prostate
13  C4863000  adenocarcinoma arising in the prostate
14  C4863000  adenocarcinoma involving the prostate
15  C4863000  adenocarcinoma arising from the prostate
16  C4863000  prostate with ca
17  C4863000  ca arising in prostate
18  C4863000  ca involving
19  C4863000  ca arising from prostate
20  C4863000  ca of
21  C4863000  ca of the prostate
22  C4863000  prostate with
23  C4863000  cancer arising in prostate
24  C4863000  cancer involving prostate
25  C4863000  cancer arising from prostate
26  C4863000  cancer of
27  C4863000  cancer of the prostate
28  C4863000  cancer arising in the prostate
29  C4863000  cancer involving the prostate
30  C4863000  cancer arising from the prostate
31  C4863000  prostate with carcinoma
32  C4863000  carcinoma arising in prostate
33  C4863000  carcinoma involving prostate
34  C4863000  carcinoma arising from prostate
35  C4863000  carcinoma of prostate
36  C4863000  carcinoma of the
37  C4863000  carcinoma arising in the prostate
38  C4863000  carcinoma involving the prostate
39  C4863000  carcinoma arising from the prostate
40  C4863000  prostate adenoca
41  C4863000  prostate adenocarcinoma
42  C4863000  prostate ca
43  C4863000  prostate cancer
44  C4863000  prostate carcinoma
45  C4863000  prostatic cancer
46  C4863000  prostatic carcinoma
47  C4863000  prostatic adenocarcinoma
48  C4863000  prostate gland adenocarcinoma
49  C4863000  adenocarcinoma of the prostate gland
50  C4863000  adenocarcinoma of prostate gland
51  C4863000  prostate gland carcinoma
52  C4863000  carcinoma of the prostate gland
53  C4863000  carcinoma of prostate gland

When a nomenclature collects synonymous terms under unique concept
identifiers, medical text containing any of the terms that correspond
to a single concept can be assigned that concept's unique code. When
all the terms in a medical database have been coded, they can be
retrieved through a concept search that collects all synonymous terms
by their unifying concept identifier. A medical sentence can be coded
many different ways (see List).

   List. Example of a sentence coded using a neoplasm

1   Primary synovial sarcoma of the mediastinum a
      clinicopathologic immunohistochemical and ultrastructural
      study of 15 cases
2   term="sarcoma of the mediastinum"
3   term="synovial sarcoma" code="C3400000"
4   term="synovial sarcoma of the mediastinum"
5   term="primary synovial sarcoma"
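
As a rough illustration of coding and concept search, the sketch below maps a handful of terms to their concept codes and reports the codes found in each text record. It is a minimal sketch, not the book's autocoder: the hash %code_for, the subroutine codes_in_text and the sample records are invented for this example, only the two codes (C4863000 and C3400000) come from the lists above, and the plain substring test would need refinement before use on real reports.

```perl
use strict;
use warnings;

# Hypothetical in-memory nomenclature: term => concept code.
# A real nomenclature would be loaded from a file.
my %code_for = (
   "prostate cancer"            => "C4863000",
   "adenocarcinoma of prostate" => "C4863000",
   "prostatic carcinoma"        => "C4863000",
   "synovial sarcoma"           => "C3400000",
   );

# Return the concept codes of every nomenclature term found in the text.
# Call in list context.
sub codes_in_text
   {
   my ($text) = @_;
   $text = lc($text);
   my %found;
   foreach my $term (keys %code_for)
      {
      # Naive substring match; synonyms collapse onto one code
      $found{$code_for{$term}} = 1 if index($text, $term) >= 0;
      }
   my @codes = sort keys %found;
   return @codes;
   }

# Differently worded records are retrieved under the same code.
my @records = (
   "Biopsy shows adenocarcinoma of prostate",
   "History of prostatic carcinoma",
   "Primary synovial sarcoma of the mediastinum",
   );
foreach my $record (@records)
   {
   print join(",", codes_in_text($record)), "\t$record\n";
   }
```

Because both prostate records receive C4863000, a search on that one code retrieves them together even though they share no term.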

   List 7.1.2. Synonyms for the rare tumor, nasopharyngeal carcinoma.

1   Regaud tumor
2   nasopharyngeal carcinoma
3   lymphoepithelial carcinoma
4   Schmincke tumor

   List 7.2.1. Advantages of standardized nomenclatures.

1   Permanence
2   Vetting - review and approval by experts and
      community stakeholders
3   Wide use and universal recognition
4   Comprehensive over a knowledge domain that extends over
      many specialized areas
5   Mapping (between standards)
6   Relationships between different domains
7   Proven utility
8   Development costs often transferred to users or to
      funding agencies.

   List 7.2.2. Advantages of small, specialized nomenclatures.

1   Rapid addition of new terms
2   Complete vocabulary for a narrow knowledge domain
3   Immediate availability of frequently updated versions of
4   Data model appropriate for specialized uses of
5   Comprehensible by experts in the field
6   Inexpensive to create, and often available free to users.

   List 7.3.1. How names of diseases are chosen.

1   As an expression of a characteristic pathologic process
      (e.g., muscular dystrophy)
2   For the physical agent that produced the disease (e.g.,
3   For a group of people who were at high risk for the disease
      (e.g., Legionnaires' Disease, named after a group of
      conventioneers who succumbed in an early outbreak)
4   For a molecule found in diseased cells (e.g., amyloidosis,
      prion disease)
5   For a geographic region in which the disease occurs (e.g.,
      Tangier Disease from Tangier Island, Virginia)
6   For a geographic region from which an epidemic emanated
      (e.g., Lyme disease from Lyme, Connecticut)
7   For a striking clinical feature of the disease (e.g.,
      sleeping sickness)
8   As a crude and insensitive comparison to an inanimate
      object (e.g., gargoylism)
9   As a literary metaphor (e.g., Pickwickian syndrome, Mad
      Hatter's disease)
10   For a striking morphologic feature (e.g., sickle cell
11  For a patient who had the disease (e.g., Lou Gehrig disease)
12  For a physician or scientist who treated, described or
    researched the disease (e.g., Hodgkin disease, Cushing
    disease, Kaposi sarcoma)
13  As a clueless acronym (e.g., CATCH 22, cardiac abnormality,
    abnormal facies, T-cell deficit due to thymic hypoplasia,
    cleft palate, hypocalcemia resulting from a deletion on
    chromosome 22)
14  As a trope from any existing language (e.g., Moyamoya
    disease derives from  "moyamoya" meaning
    "puff of smoke" in Japanese,  for the
    characteristic tangle of tiny cerebral vessels seen on
15  As an homage to Greek and Latin scholarship (e.g.,
    pityriasis lichenoides et varioliformis acuta)
16  As inscrutable combinations of one or more of the above (My
    personal favorite inscrutable disease name is the
    wistful-sounding "Floating-Harbor syndrome". This
    disease was named by combining the names of the hospital in
    which one of the first cases appeared, Boston Floating
    Hospital, and a second hospital in which another case
    appeared, Harbor General Hospital in Torrance, California.)

   List 7.3.2. Examples of offensive medical terms.

1   Gargoylism. The name invites comparison of a patient with a
2   Mongolism and Mongoloid idiot (naming a disease after the
      peoples that the doctor believes look most like the person
      with the disease)
3   Monster. The name suggests that the individual is not human
4   Cretinism. The name links a patient with a pejorative term

   List 7.3.3. Tasks of the ancient curator.

1   Select canonical (best) terms for concepts
2   Delete obsolete or otherwise denigrated terms
3   Prepare precise definitions for the included terms
4   Prepare revised versions of the nomenclature at intervals,
      perhaps once each century
5   Prepare nomenclature in an academic language, such as Latin,
      that limits access to scholars.

   List 7.3.4. Tasks of the modern curator.

1   Add new terms to the nomenclature when they occur in the
      domain literature
2   Group synonymous terms under a unique concept code
3   Determine the relationships among the different terms in the
      nomenclature and provide links or ontologic classes that
      express these relationships
4   Comply with standards for representing the terms in  a
      nomenclature that will support data integration with other
5   Ensure the logical consistency of the nomenclature
6   Update and release revised versions of the nomenclature at
      intervals, perhaps daily
7   Develop methods for representing variations in the manner
      that terms are  interpreted and used
8   Post the nomenclature to the Internet, as an Open Access
9   Prepare a legal "use" disclaimer
10   Develop methodology for linking concept codes to annotative
      data on the internet.

   List 7.3.5. Good curation practices for Medical Nomenclatures.

1   General characteristics relate to utility and
      appropriateness in  clinical applications, including that
      concepts are not vague, ambiguous, or redundant; purpose and
      scope are clear; coverage is in-depth, explicit, and
      comprehensive; there are systematic and formal definitions
      of all concepts; and the concepts are built into a reference
2   Structure of the vocabulary model determines the ease with
      which practical and useful interfaces for term navigation,
      entry, or retrieval can be supported
3   Maintenance characteristics provide the technical choices
      which impact the capacity of a vocabulary to evolve, change,
      and remain usable over time, including context-free
      identifiers, persistence of identifiers, and version control
4   Evaluation criteria address how a vocabulary should be
      evaluated, and include a clear statement of purpose and
      scope, availability of tools for mapping, and usability.

   List 7.4.1. Doublet method for finding candidate terms from text.

1   Collect all the doublets that occur in the entire
      nomenclature (i.e., accumulate a list of the doublets from
      every term in the nomenclature)
2   Parse text into an ordered collection of overlapping
      doublets. As an example, "serous borderline ovarian
      tumor" would be parsed as "serous borderline,
      borderline ovarian, ovarian tumor"
3   Compare each consecutive text doublet against the array
      of doublets from the nomenclature to determine whether the
      doublet exists somewhere in the nomenclature
4   If the doublet from the text does not exist in the
      nomenclature, it can be deleted. If it exists in the
      nomenclature, it is concatenated with the following doublet
      if the following doublet exists in the nomenclature.
      Otherwise, it is deleted. This process continues,
      concatenating doublets that exist somewhere in the
      nomenclature. Extraneous leading words (the, in, of, with,
      and) and trailing words (the, and, with, from, a) are
      automatically deleted from the final concatenated sequence.
      Final concatenated sequences of two or more consecutive
      doublets that match doublets from the nomenclature are
      saved as candidate terms.
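
The four steps above can be sketched in a few lines of Perl. This is one possible reading of the algorithm, not the book's script: the subroutine names doublet_hash and candidate_terms and the three sample nomenclature terms are invented for illustration, text is assumed to be lowercase words separated by spaces with no punctuation, and the leading and trailing words are trimmed before the length of the run is checked.

```perl
use strict;
use warnings;

# Step 1: collect every doublet (two consecutive words) from the nomenclature terms.
sub doublet_hash
   {
   my %doubhash;
   foreach my $term (@_)
      {
      my @words = split(/ +/, lc($term));
      for my $i (0 .. $#words - 1)
         {
         $doubhash{"$words[$i] $words[$i+1]"} = 1;
         }
      }
   return \%doubhash;
   }

# Steps 2-4: walk the overlapping doublets of the text, chaining consecutive
# doublets that occur in the nomenclature, and keep runs of two or more doublets.
sub candidate_terms
   {
   my ($text, $doubhash) = @_;
   my @words = split(/ +/, lc($text));
   my (@candidates, @run);
   for my $i (0 .. $#words)
      {
      # The last word has no following word, so it only closes a run
      my $doublet = $i < $#words ? "$words[$i] $words[$i+1]" : "";
      if ($doublet ne "" && exists $doubhash->{$doublet})
         {
         @run = ($words[$i]) unless @run;
         push(@run, $words[$i+1]);
         next;
         }
      # The run has ended: trim extraneous leading and trailing words, then save it.
      shift(@run) while @run && $run[0] =~ /^(the|in|of|with|and)$/;
      pop(@run) while @run && $run[-1] =~ /^(the|and|with|from|a)$/;
      # Two or more overlapping doublets span at least three words
      push(@candidates, join(" ", @run)) if @run >= 3;
      @run = ();
      }
   return @candidates;
   }

my $nomenclature = doublet_hash("serous borderline tumor",
                                "borderline ovarian cancer",
                                "ovarian tumor");
print join("\n", candidate_terms("a serous borderline ovarian tumor was found",
                                 $nomenclature)), "\n";
```

None of the three sample terms contains the full phrase, yet every doublet of "serous borderline ovarian tumor" occurs somewhere among them, so the phrase is reported as a candidate term.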

   List 7.4.2. Snippet of Perl code for the doublet method.

2           @hoparray = split(/ /,$line);
3           my $olddoublet = "";
4           for ($i=0;$i<(scalar(@hoparray)-1);$i++)
5              {
6              $doublet =
7              if (exists $doubhash{$doublet})
8                  {
9                  if ($englishline ne "")
10                    {
11                   $englishline = $englishline . "
12                   }
13            else
14                   {
15                   $englishline = $doublet;
16                   }
17                 }

   List 7.4.3. Criteria for including a phrase as a candidate new term.

1   Candidate phrases are composed of concatenated strings of
      word doublets that are contained in terms found in an
      existing nomenclature
2   Candidate phrases do not already occur in the

   List 8.2.1. Output of fuzzy spelling match.

1   c:\ftp>perl
2   What word would you like to approximate?
3   hemocromatosis
4   Approximate matches
5   haemochromatosis
6   hemochromatoses
7   hemochromatosis
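
The listing above shows only the output, so the matching method itself is not given here. One common way to get this kind of result is to keep every word within a small edit distance of the query. The sketch below is an assumption along those lines, using Levenshtein distance; the subroutine names, the cutoff of two edits and the short word list are all chosen for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein edit distance between two strings
# (insertions, deletions and substitutions each cost 1).
sub edit_distance
   {
   my ($s, $t) = @_;
   my @prev = (0 .. length($t));
   for my $i (1 .. length($s))
      {
      my @curr = ($i);
      for my $j (1 .. length($t))
         {
         my $cost = substr($s, $i-1, 1) eq substr($t, $j-1, 1) ? 0 : 1;
         push(@curr, min($prev[$j] + 1, $curr[$j-1] + 1, $prev[$j-1] + $cost));
         }
      @prev = @curr;
      }
   return $prev[-1];
   }

# Return the words within $max edits of the query, in sorted order.
# Call in list context.
sub approximate_matches
   {
   my ($query, $max, @words) = @_;
   my @hits = grep { edit_distance($query, $_) <= $max } @words;
   @hits = sort @hits;
   return @hits;
   }

my @vocabulary = qw(hemochromatosis hematoma haemochromatosis hemochromatoses);
print "Approximate matches\n";
print "$_\n" foreach approximate_matches("hemocromatosis", 2, @vocabulary);
```

With a cutoff of two edits, the misspelled query picks up the three spellings shown in the output list and skips "hematoma".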

   List 8.2.2. One-letter (mostly) differences among properly spelled words.

1   arteritis <=> arthritis
2   auxiliary <=> axillary
3   brachial <=> branchial
4   callous <=> callus
5   chlorpromazine <=> chlorpropamide
6   chorionic <=> chronic
7   coitus <=> colitis
8   colic <=> colonic
9   costal <=> coastal
10   cygnet <=> signet
11  diploic <=> diploid
12  disc <=> disk
13  disease <=> decease
14  dyskaryosis <=> dyskeratosis
15  ectatic <=> ecstatic
16  enema <=> anemia
17  facial <=> fascial
18  facies <=> feces
19  faeces <=> facies
20  fascial <=> facial
21  fetal <=> fatal
22  firearm <=> forearm
23  hallux <=> helicis
24  helicis <=> hallux
25  herpangina <=> herpetic
26  herpetic <=> herpangina
27  hydatid <=> hydatidiform
28  hydatidiform <=> hydatid
29  ileitis <=> iliitis
30  ileum <=> ilium
31  isotope <=> isotrope
32  keratin <=> kerasin
33  keratinocytic <=> keratinolytic
34  keratosis <=> ketosis
35  lipoma <=> lymphoma
36  live <=> liver
37  lover <=> liver
38  malleolus <=> malleus
39  milia <=> milium
40  mucous <=> mucus
41  myelofibrosis <=> myofibrosis
42  oncology <=> ontology
43  osteoblastoma <=> osteoclastoma
44  paleodontology <=> paleontology
45  paleontology <=> paleodontology
46  palette <=> palate
47  palpation <=> palpitation
48  parental <=> parenteral
49  penal <=> penile
50  penicillamine <=> penicillin
51  penile <=> penal
52  perineal <=> peroneal
53  pleural <=> plural
54  porphyria <=> porphyruria
55  prostate <=> prostrate
56  protuberant <=> protruberant
57  quinidine <=> quinine
58  rachischisis <=> rachitis
59  rachischitic <=> rachitic
60  ret <=> rett
61  rosacea <=> rosea
62  semantic <=> somatic
63  silicon <=> silicone
64  taenia <=> tinea
65  thecoma <=> thekeoma
66  tinnitus <=> tinnitis
67  trichinosis <=> trichosis
68  ureteral <=> urethral
69  vagitis <=> vaginitis

   List 8.2.3. Drugs with similar names.

1   acetazolamide <=> acetohexamide
2   ambien <=> amen
3   amiodarone <=> amrinone
4   cardene sr <=> cardizem sr
5   chlorpropamide <=> chlorpromazine
6   clonidine <=> klonopin
7   clozapine <=> olanzapine
8   feldene <=> seldane
9   flomax <=> volmax
10   flutamide <=> flumadine
11  imipenem <=> omnipen
12  lodine <=> codeine
13  methadone <=> methylphenidate
14  ms contin <=> oxycontin
15  oruvail <=> clinoril
16  penicillin <=> penicillamine
17  prilosec <=> prozac
18  quinidine <=> quinine
19  retrovir <=> ritonavir
20  zocor <=> cozaar

   List 8.2.4. Common misspellings appearing in pathology reports.

1   abcess  (should be abscess)
2   anastamosis  (anastomosis)
3   bissected (dissections are done with bisections)
4   caricnoma (the most commonly occurring terms are commonly
5   casset
6   cassett  (both cassette and casette are permissible)
7   debridment
8   entirley
9   formlain
10  illeocecal (one "l" please)
11  lymphnode (a lymph node is two words)
12  membraneous (the noun, "membrane" has an
      adjective, "membranous")
13  mesentary
14  palmer
15  spleenic (the noun is "spleen" but the adjective
    is "splenic")
16  tannish  ("ish" is a popular but unnecessary
17  uretheral (ureteral is a word and so is urethral, but
    uretheral is not)

   List 8.2.5. Permissible alternate spellings.

1   anonymization = anonymisation
2   artifact = artefact
3   cassette = casette
4   catheterisation = catheterization
5   dilatation = dilation
6   exotropia = exotrophia
7   preventative = preventive
8   sulfate = sulphate
9   sulfur = sulphur
10   sulfuric = sulphuric
11  travelling = traveling

   List 8.2.6. Dually occurring orthographic variants in UMLS that are probably not proper equivalences.

1   neurilemmoma and neurilemoma
2   sacroiliitis and sacroileitis
3   costalchondritis, costochondritis and costal chondritis
4   azoospermia and azospermia
5   Bartter's Disease and Barter's Disease
6   in situ and insitu
7   gall bladder and gallbladder.

   List 8.3.1. Disease homonyms.

1   cervical carcinoma (of neck or of uterus?)
2   medullary carcinoma (can refer to medullary carcinoma of
      breast, or thyroid or of adrenal medulla)
3   Paget's disease (can refer to different diseases involving
      either breast or bone)
4   Bowen's disease (can refer to different diseases in skin and

   List 8.17.1. Dangerous pathology abbreviations.

1   abg aortic bifurcation graft, or aortobifemoral graft
2   aha acquired hemolytic anemia, or autoimmune hemolytic
3   ascvd arteriosclerotic cardiovascular disease, or
      arteriosclerotic cerebrovascular disease
4   chd congenital heart disease, or congestive heart disease,
      or coronary heart disease
5   doa date of admission, or dead on arrival
6   edc estimated date of conception, or estimated date of
      confinement ("due date" means almost the opposite
      of "conception date")
7   hzo herpes zoster ophthalmicus, or herpes zoster oticus
8   ibd inflammatory bowel disease, or irritable bowel disease
9   lll left lower lid, or left lower lip, or left lower lobe,
      or left lower lung
10   mcgn mesangiocapillary glomerulonephritis or minimal change
11  mvr mitral valve regurgitation, or mitral valve repair, or
    mitral valve replacement
12  nc no change, or noncontributory
13  nkda no known drug allergies, or nonketotic diabetic
14  pe pulmonary effusion, or pulmonary edema, or pulmonary
    embolectomy or pulmonary embolism
15  sk seborrheic keratosis, or solar keratosis
16  uvf ureterovaginal fistula, or urethrovaginal fistula

   List 8.18.1. JCAHO "do not use" abbreviations (minimum list, effective January 1, 2004).

1   U (for unit) Reason: "U" visually mistaken as 0 or
      . Write "unit."
2   IU (for international unit) Reason: Mistaken as IV
      (intravenous or the number 4), or the number 10. Write
      "international unit."
3   Q.D., Q.O.D. (Latin abbreviation for once daily and every
      other day) Reason: Mistaken for each other. The period after
      the Q can be mistaken for an "I" and the
      "O" can be mistaken for "I." Write
      "daily" and "every other day."
4   Trailing zero (X.0 mg), Lack of leading zero (.X mg) Reason:
      Decimal point is missed. Never write a zero by itself after
      a decimal point (X mg), and always use a zero before a
      decimal point (0.X mg).
5   MS, MSO4, MgSO4 Reason: Confused for one another. Can mean
      morphine sulfate or magnesium sulfate. Write "morphine
      sulfate" or "magnesium sulfate."
6   µg (for microgram) Reason: Mistaken for mg (milligrams),
      resulting in a 1000-fold dosing overdose. Write
7   H.S. (for half-strength or Latin abbreviation for bedtime),
      q.H.S. Reason: Mistaken for either half-strength or hour of
      sleep (at bedtime). q.H.S. mistaken for every hour. All can
      result in a dosing error. Write out
      "half-strength" or "at bedtime."
8   T.I.W. (for three times a week) Reason: Mistaken for three
      times a day or twice weekly, resulting in an overdose. Write
      "3 times weekly" or "three times
9   S.C. or S.Q. (for subcutaneous) Reason: Mistaken as SL for
      sublingual, or "5 every." Write "Sub-Q,"
      "subQ," or "subcutaneously."
10   D/C (for discharge) Reason: Interpreted as discontinue
      whatever medications follow (typically discharge meds).
      Write "discharge."
11  cc (for cubic centimeter) Reason: Mistaken for U (units)
    when poorly written. Write "ml" for milliliters
12  A.S., A.D., A.U. (Latin abbreviation for left, right, or
    both ears) O.S., O.D., O.U. (Latin abbreviation for left,
    right, or both eyes) Reason: Mistaken for each other (e.g.,
    AS for OS, AD for OD, AU for OU, etc.). Write "left
    ear," "right ear," or "both ears;"
    "left eye," "right eye," or "both

   List 9.1.1. Seven interpretations of "I didn't say you lied to me".

1   "I didn't say you lied to me."  Stressing
      "I," the sentence means that somebody else said
      that you lied to me
2   "I didn't say you lied to me." Stressing
      "didn't," the sentence means that I had nothing to
      do with it
3   "I didn't say you lied to me."  Stressing
      "say," the sentence means that I didn't speak the
      assertion but I may have made the assertion in a written or
      other non-verbal communication
4   "I didn't say you lied to me."  Stressing
      "you," the sentence means that someone else lied to me
5   "I didn't say you lied to me." Stressing
      "lied," the sentence means that I say you did
      something to me (other than lying)
6   "I didn't say you lied to me." Stressing
      "to," the sentence means that you lied but not to
      my face
7   "I didn't say you lied to me." Stressing
      "me," the sentence means that you lied to someone
   List 9.1.2. Common problems that reduce the meaning of narrative text.

1   Complex or run-on sentences
2   Inscrutable use of negations
3   Polysemous words and terms
4   Idiomatic phrases
5   Indiscriminate use of abbreviations
6   Ambiguous pronouns
7   Misspellings.

   List 9.1.3. The following have been widely distributed over the web and purportedly came from real medical charts.

1   "The baby was delivered, the cord clamped and cut, and
      handed to the pediatrician, who breathed and cried
2   "The patient had waffles for breakfast and anorexia for
3   "The patient lives at home with his mother, father, and
      pet turtle, who is presently enrolled in day care three
      times a week."
4   "Bleeding started in the rectal area and continued all
      the way to Los Angeles."
5   "Coming from Detroit, this man has no children."
6   "Examination reveals a well-developed male lying in bed
      with his family in no distress."

   List 9.2.1. Some steps in machine translation.

1   Parsing sentences into grammatic structures
2   Identifying idiomatic expressions
3   Disambiguating polysemous terms (based on sentence context)
4   Re-ordering terms based on grammar rules
5   Providing gender, tense and specialized language structures
      that may be absent in the source language
6   Determining grammar rule exceptions existing for words and
      terms in the source and target languages
7   Mapping between two different vocabularies.

   List 9.2.2. Some controlled English rules.

1   Each word in the text may convey only one meaning (e.g., if
      iris is an anatomic part of the eye, it cannot also be a
2   For each meaning, only one term may be used (e.g., if you
      use the term "tumor" you should not use the terms
      "neoplasm, neoplastic growth, or mass" when you
      want to convey the same conceptual meaning as
3   Each word is used in only one word class (e.g., if
      "report" is a noun, as in surgical pathology
      report, it cannot be used as a verb, as in "Please
      report the pathology results.")

   List 9.2.3. Regular English version of excerpt from the Winston Churchill's

   List 9.2.4. Basic English version of excerpt from the Winston Churchill's

   List 9.2.5. Suggestions for controlling medical text.

1   Sentences should be short and declarative, with an
      unambiguous sentence terminator
2   Negations should include the word "not" and double
      negations should never be used
3   Abbreviations and acronyms should be represented as
      all-uppercase letters and should not contain periods, except
      when they occur at the end of a sentence. Abbreviations can
      be made plural by adding a lowercase "s".

   List 9.3.1. Research value of archived surgical pathology data.

1   All biopsied disease entities are included in the database,
      representing every category of biopsied disease (e.g.,
      metabolic/toxic, traumatic, genetic/congenital, neoplastic,
      degenerative, inflammatory, infectious)
2   Specimens can be characterized not only by diagnosis but
      also by descriptive terminology that may relate to
      prognostic or treatment categories
3   Database entries correspond to archived material (glass
      slides and paraffin blocks) that can be recovered for
      research purposes
4   Preparing and coding reports is an established and required
      activity of surgical pathology departments.

   List 9.4.1. Five common coding errors in human-coded reports.

1   Factually correct but unhelpful codes (e.g., coding all
      benign lesions as `negative for tumor')
2   Inconsistent codes (coding `dysplasia' on Monday and
      `atypia' on Tuesday)
3   Idiosyncratic codes (using a mnemonic for a lesion, often
      inscrutable to other people, such as coding all fungal
      infections as "fungus ball," under the morphology
      axis, rather than taking the time to assign a specific code
      from the infection axis, and remembering that the now
      private code "fungus ball" must be used for any
      future fungal searches)
4   Entry errors (e.g., entering `lipoma' when one intends to
      enter `lymphoma' and accepting the wrong code matched by the
5   Incomplete coding due to impatience or laziness.

   List 9.5.1. Algorithm for the doublet autocoder.

1   1. Each phrase (term) in the nomenclature (neocl.xml) is
      converted into intercalated doublets, and each doublet is
      assigned a consecutive number
2   2. Each nomenclature phrase is assigned the concatenated
      list of numbers that represent the ordered doublets
      composing the phrase
3   3. Every text record (PubMed abstract in this case) is split
      into an array consisting of the consecutive words in the
      text record
4   4. The text array is parsed as intercalated doublets.
      Intercalated doublets from the text that match doublets
      found anywhere in the nomenclature are assigned their
      numeric values (from the doublet index created for the
      nomenclature). Runs of consecutive doublets from the text
      that match doublets from the nomenclature are built into
      concatenated strings of doublet values. A text doublet
      that does not match any doublet in the nomenclature
      cannot possibly be part of a nomenclature term.
      Such text doublets serve as "stop" doublets
      between candidate runs of text doublets that match
      nomenclature doublets
5   5. The runs of matching doublets are tested to see if they
      match any of the runs of doublets that compose nomenclature
      terms or if they contain any subsumed terms that match
      nomenclature terms
6   6. The array of doublet runs extracted from the text that
      match nomenclature terms is cached in an external file.
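
The steps above can be sketched in a few lines. This is a minimal illustration written in Python rather than the book's Perl, and it is not the author's script: it keeps doublets as strings instead of concatenated doublet numbers, handles only terms of two or more words, and the function names and the toy nomenclature are assumptions made for the example.

```python
def doublets(words):
    """Return the overlapping (intercalated) word pairs of a word list."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def build_doublet_index(terms):
    """Give each distinct nomenclature doublet a consecutive number."""
    index = {}
    for term in terms:
        for d in doublets(term.split()):
            index.setdefault(d, len(index))
    return index


def autocode(text, terms):
    """Return multi-word nomenclature terms found in the text."""
    term_set = set(terms)
    index = build_doublet_index(terms)
    words = text.lower().split()
    found = []

    def flush(run):
        # Test every contiguous sub-run, so subsumed terms are found too.
        for i in range(len(run)):
            for j in range(i + 2, len(run) + 1):
                candidate = " ".join(run[i:j])
                if candidate in term_set and candidate not in found:
                    found.append(candidate)

    run = []  # words in the current run of matching doublets
    for a, b in zip(words, words[1:]):
        if f"{a} {b}" in index:
            if not run:
                run = [a]
            run.append(b)
        else:
            flush(run)  # a non-matching doublet acts as a "stop" doublet
            run = []
    flush(run)
    return found


# Toy nomenclature (assumed for illustration only)
TERMS = ["basal cell carcinoma", "cell carcinoma", "renal cell carcinoma"]
print(autocode("the biopsy showed basal cell carcinoma of skin", TERMS))
```

Because every stop doublet ends a run, only short runs of candidate words are ever compared against the nomenclature.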

   List 9.8.1. Algorithm for code-based searches without pre-annotation.

1   The user enters a query term
2   All the terms from a preferred nomenclature that are
      synonymous to the query term are collected into a list
3   Each term in the list is matched against the text corpus to
      determine the locations in the corpus where the term is
4   A list is assembled of corpus locations matching the query
      term or its synonyms
5   The user's query term is matched against all the equivalent
      terms included in a preferred vocabulary.
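
A minimal sketch of this kind of search follows, in Python rather than the book's Perl. The nomenclature is a toy dictionary; C0007134 is the renal cell carcinoma code quoted elsewhere on this page, while the second code, the records, and the function names are made up for illustration.

```python
# Toy nomenclature: term -> concept code (second code is hypothetical)
NOMENCLATURE = {
    "renal cell carcinoma": "C0007134",
    "hypernephroma": "C0007134",
    "grawitz tumor": "C0007134",
    "basal cell carcinoma": "X0000001",
}


def synonyms_for(query, nomenclature):
    """Collect every term that shares the query term's code."""
    code = nomenclature.get(query.lower())
    if code is None:
        return []
    return sorted(t for t, c in nomenclature.items() if c == code)


def search(query, records, nomenclature):
    """Return (record_number, character_offset, term) for every hit."""
    hits = []
    for term in synonyms_for(query, nomenclature):
        for n, record in enumerate(records):
            text = record.lower()
            pos = text.find(term)
            while pos != -1:
                hits.append((n, pos, term))
                pos = text.find(term, pos + 1)
    return sorted(hits)


RECORDS = [
    "A case of hypernephroma.",
    "Basal cell carcinoma of skin.",
    "Grawitz tumor, also called renal cell carcinoma.",
]
for hit in search("renal cell carcinoma", RECORDS, NOMENCLATURE):
    print(hit)
```

The corpus itself is never annotated; synonyms are expanded at query time and matched directly against the text.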

   List 9.8.2. Steps involved in implementing on-the-fly code searches.

1   Users will need to have a dataset or plain-text corpus in
      which every record is separated by the same delimiter. For
      testing purposes, the author created a 105 megabyte text.
      The corpus was created by a PubMed query on "pathology
      (ad) AND neoplasm", at the U.S. National Library of
      Medicine's website. The query gathered all
      abstracts from the PubMed database in which the term
      neoplasm occurs somewhere in the PubMed entry, and in which
      the affiliation of the author contains the word
      "pathology". The query yielded abstracts that are
      likely to contain names of neoplasms. The PubMed output file
      can serve as a good test for an autocoder that uses a
      neoplasm nomenclature. The PubMed search yielded 66,509
      abstracts. All of the abstracts were downloaded into a
      single file from the PubMed site by setting the
      "Display" attribute to "Abstract" and
      the "Send to" attribute to "file". This
      produced a 105,689,546 byte plain-text file. The file was
      given the filename tumorab.txt, and this filename was used
      by the autocoders as a parsing input file. Although this
      file is not included with this manuscript, anyone in the
      world with internet access can obtain a near-identical file
      by repeating the same PubMed query. The records from the
      text consisted of paragraphs delimited by double return
      characters (also referred to as double-newline characters or
      double ASCII(13)-ASCII(10) character pairs).
      This is a common way of delimiting textual paragraphs
2   Users will create a file of all the doublets in the file,
      one line to each doublet, and each doublet followed by the
      record number and the record word-offset for the doublet
      occurrence. When a doublet occurs several times in a single
      record, each occurrence is indexed, with a different
      word-offset for each occurrence. The Perl script that
      creates the doublet index is It may require
      several minutes to execute, and it produces an index file,
      that is 400+ Megabytes in length. This file is, in turn,
      sorted alphabetically by a short Perl script that
      sorts files of any size. The index file is named
      doubdat.out. also keeps track of the begin-byte
      location in tumorab.txt where each text record begins.
      Another Perl script,, creates a database file
      containing the byte location in doubdat.out for the first
      occurrences of doublets. performs term searches
      on 100 terms selected randomly from a chosen vocabulary. In
      this case, the vocabularies tested are Snomed-CT, extracted
      from the 2004 version of UMLS, and the
      "Taxonomy for the developmental lineage classification
      of neoplasms", made available for public download by
      the Association for Pathology Informatics.

   List 9.9.1. When pre-annotation is useful.

1   When there is the need for a global analysis of the dataset.
      (In a global analysis of a dataset, all the data elements
      are examined at once. Typically, the researcher is looking
      for relationships among the different data elements)
2   When the query response time must be very rapid
3   When the expected number of queries may be very large.

   List 9.9.2. When fast doublet matching is useful.

1   When the dataset is typically searched one item at a time
2   When the dataset does not change or changes only by the
      addition of records
3   When the dataset is being searched using many different
      types of vocabularies
4   When the dataset is searched with a single vocabulary that
      is constantly changing
5   When one dataset needs to be integrated with another
      dataset, and the datasets have not been annotated with the
      same nomenclature.

   List 9.10.1. When you will need to re-code.

1   Whenever you want to change from one nomenclature to another
      (eliminates problem of brand-name loyalty)
2   Whenever you introduce a new version of a nomenclature
3   Whenever you want to use a new coding algorithm (e.g.,
      parsimonious versus comprehensive, or linking code to a
      particular extracted portion of report)
4   Whenever you add legacy data to your LIS
5   Whenever you merge different pathology datasets (eliminates
      many mapping projects).

   List 10.2.1. Making medical record data harmless.

1   1. De-identification of data fields that specifically
      characterize the patient (name, social security number,
      hospital number, address, age, etc.)
2   2. Free-text data scrubbing, removing identifiers from the
      textual portion of medical reports
3   3. Rendering the dataset ambiguous, ensuring that patients
      cannot be identified by data records containing a unique set
      of characterizing information
4   4. Free-text data privatizing, removing any information of a
      private nature that may be contained within the report.

   List 10.3.1. HIPAA safe harbor identifiers.

1   Names
2   Geographic subdivisions smaller than a State
3   Dates (except year) directly related to patient
4   Telephone numbers
5   Fax numbers
6   E-mail addresses
7   Social security numbers
8   Medical record numbers
9   Health plan beneficiary numbers
10   Account numbers
11  Certificate/license numbers
12  Vehicle identifiers and serial numbers
13  Device identifiers and serial numbers
14  Web URLs
15  Internet Protocol (IP) address numbers
16  Biometric identifiers, including finger and voice prints
17  Full face photographic images and any comparable images
18  Any other unique identifying number, characteristic, or
    code, except as permitted under HIPAA to re-identify

   List 10.6.1. Deficiencies of subtractive data scrubbing methods.

1   Requires the creation and continuous maintenance of an
      identifier list consisting of names of patients, staff, and
      medical centers as well as addresses and other geographic
2   Requires the creation and continuous maintenance of rules
      for excluding text based on co-locations or patterns of
      expression that might signify a HIPAA identifier (e.g. a
      sequence of digits and slashes that might represent a date)
3   Does not exclude private information that is non-identifying
      but which may be incriminating or distasteful
4   Does not satisfy the "minimum necessary" (see
      Glossary) principle, holding that medical data convey only
      that  information which is needed for research purposes
5   Slow. Each parsed sentence is typically evaluated through
      the  entire list of pattern rules. This means that parsing a
      long corpus of medical text will take considerable time
6   Complex. Maintaining the rule list and the identifier list
      will add to the overall complexity of the software. Each
      institution that implements the software will need to
      maintain their own lists created for their patients and for
      their textual styles and formats
7   Inadequate. Subtractive scrubbers, under the best of
      circumstances, will occasionally miss an identifier. If a
      scrubber is 99% accurate, it may miss thousands of
      identifiers in a large text.
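
The arithmetic behind item 7 is easy to check. The sketch below is in Python, and the identifier count is a hypothetical figure chosen only to show the scale; it is not a measurement from any real corpus.

```python
def expected_misses(identifier_count, accuracy):
    """Identifiers a scrubber is expected to leave behind."""
    return round(identifier_count * (1 - accuracy))


# Hypothetical corpus containing one million identifiers
print(expected_misses(1_000_000, 0.99))
```

At 99% accuracy, a corpus with a million identifiers would be expected to retain about ten thousand of them.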

   List 10.6.2. Algorithm for the Concept-Match method of data scrubbing.

1   1. Parse all input into sentences
2   2. Parse each sentence into words
3   3. Each stop word (high-frequency words, including
      prepositions and common adjectives) is preserved in its
      original place within each sentence
4   4. Intervening words and phrases are mapped to a standard
      nomenclature. This step requires breaking phrases into all
      possible ordered concatenations of words. For instance,
      "Margins free of tumor" would become "margins
      free of tumor, margins free of, free of tumor, margins free,
      free of, of tumor, margins, free, of, tumor." Each
      member of the derivative list is matched against the entire
      database of Unified Medical Language System (UMLS) terms to
      determine if a code exists for the term. Large terms subsume
      smaller substring terms
5   5. Each coded term is replaced by an alternate term that
      maps to the same concept code, if an alternate term exists.
      For instance, the term renal cell carcinoma appearing in the
      text would be replaced by C0007134 (the UMLS Concept Unique
      Identifier for renal cell carcinoma) and by a different term
      that maps to the same code (such as rcc, hypernephroma,
      hypernephroid carcinoma, or Grawitz tumor). This step
      produces an output containing a different set of words than
      the original text (see List)
6   6. All other words are replaced by a blocking symbol
      consisting of an asterisk.
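
Step 4 is the least obvious part of the method, so here is a minimal sketch of it in Python rather than the book's Perl. It generates the ordered concatenations of a phrase, longest first, and returns the first one found in a term-to-code table. The table holds a single toy entry and the function names are assumptions; a real run would match against the full UMLS term list.

```python
def ordered_concatenations(phrase):
    """All contiguous word sequences of a phrase, longest first."""
    words = phrase.lower().split()
    n = len(words)
    return [" ".join(words[i:i + size])
            for size in range(n, 0, -1)
            for i in range(n - size + 1)]


def longest_coded_term(phrase, term_to_code):
    """Return (term, code) for the longest sub-phrase that has a code.

    Trying longer candidates first lets large terms subsume their
    smaller substring terms.
    """
    for candidate in ordered_concatenations(phrase):
        if candidate in term_to_code:
            return candidate, term_to_code[candidate]
    return None


print(ordered_concatenations("Margins free of tumor"))

# Toy lookup table with one entry (code taken from step 5 above)
TERM_TO_CODE = {"renal cell carcinoma": "C0007134"}
print(longest_coded_term("left renal cell carcinoma", TERM_TO_CODE))
```

For "Margins free of tumor" the first function yields the same ten candidates, in the same order, as the example in step 4.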

   List 10.6.3. Sample output of the Concept-Match scrubbing method.

1        <rdf:Description
2            <dc:title>
3   primary synovial sarcoma of the mediastinum a
      clinicopathologic immunohistochemical and ultrastructural
4       study of 15 cases
5            </dc:title>
6            <v:autocode term="sarcoma"
      code="C0000000" />
7            <v:autocode term="sarcoma of the
      mediastinum" code="C6606000" />
8            <v:autocode term="synovial sarcoma"
      code="C3400000" />
9            <v:autocode term="synovial sarcoma of the
      mediastinum" code="C6618000" />
10            <v:autocode term="primary synovial
      sarcoma" code="C8826000" />
11           <de_id>
12  * primary synovial sarcoma of the mediastinum a * *
    and * * of * * * *
13           </de_id>
14       </rdf:Description>

   List 10.6.4., a Perl script to scrub text.

     #!/usr/local/bin/perl
     # Load the list of approved doublets, one per line, into a hash.
     open (TEXT, "doubdb.txt")||die"Can't open file";
     $line = " ";
     while ($line ne "")
       {
       $line = <TEXT>;
       $line =~ s/\n//o;
       $doubhash{$line} = "";
       }
     close TEXT;
     print "What would you like to scrub?\n";
     $line = <STDIN>;
     print "Scrubbed text.... ";
     # Normalize: lowercase, drop possessives, strip punctuation.
     $line = lc($line);
     $line =~ s/\'s//g;
     $line =~ s/[^a-z0-9 \-]/ /g;
     @hoparray = split(/ +/,$line);
     $lastword = "\*";
     # Walk the text one doublet at a time. A word is printed only
     # when it belongs to an approved doublet; otherwise an asterisk
     # is printed in its place.
     for ($i=0;$i<(scalar(@hoparray));$i++)
       {
       $doublet = "$hoparray[$i] $hoparray[$i+1]";
       if (exists $doubhash{$doublet})
         {
         print " $hoparray[$i]";
         $lastword = " $hoparray[$i+1]";
         }
       else
         {
         print $lastword;
         $lastword = " \*";
         }
       }
     print "\n";
     exit;

   List 10.6.5. Sample output from

1   C:\ftp>perl
2   What would you like to scrub?
3   Basal cell carcinoma, margins involved
4   Scrubbed text.... basal cell carcinoma margins involved
5   C:\ftp>perl
6   What would you like to scrub?
7   Rhabdoid tumor of kidney
8   Scrubbed text.... rhabdoid tumor of kidney
9   C:\ftp>perl
10   What would you like to scrub?
11  Mr Brown has a basal cell carcinoma
12  Scrubbed text.... * * has a basal cell carcinoma
13  C:\ftp>perl
14  What would you like to scrub?
15  Mr. Brown was born on Tuesday, March 14, 1985
16  Scrubbed text.... * * * * * * * * *
17  C:\ftp>perl
18  What would you like to scrub?
19  The doctor killed the patient
20  Scrubbed text.... * * * * *

   List 10.10.1. Rules of non-uniqueness in database de-identification.

1   1. No unique cancers are allowed  (i.e., every concept in
      the database must be present in at least two records). Any
      record with a uniquely occurring tumor is expunged
2   2. N-plicate records cannot contain values that are unique
      to the n-plicate set (if you have n identical records, you
      should be able to look at all the cancer diagnoses in the
      replicated record and find other records with the same
      diagnosis). If that is not the case, all such n-plicate
      records should be expunged from the public use database
3   3. No value may be allowed to occur in every record from a
      set of records that share a common diagnosis. If that is the
      case, every record containing the concept is expunged
4   4. Every value that co-occurs in a record that contains a
      gender-specific tumor must occur in more than one additional
      record that does not contain a gender-specific tumor.
      Otherwise, all records containing the concept are expunged.
      (NOTE: the same argument may be made for tumors that are
      highly age specific, or ethnicity-specific or possibly
5   5. No value-binned record can be unique to a second
      diagnosis (i.e. there may be 10 rhabdoid tumor cases, but
      only one case of rhabdoid tumor and basal cell carcinoma).
      Such cases are expunged.
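
Rule 1 can be sketched in a few lines. The example is in Python rather than the book's Perl, treats each record as a set of concept names, and repeats the pass until no record is removed, because expunging one record can leave another concept unique. The repetition, the record contents, and the function name are assumptions made for illustration.

```python
from collections import Counter


def expunge_unique_concepts(records):
    """Drop records holding a concept that occurs in only one record."""
    kept = list(records)
    while True:
        counts = Counter(c for rec in kept for c in set(rec))
        survivors = [rec for rec in kept
                     if all(counts[c] >= 2 for c in rec)]
        if len(survivors) == len(kept):
            return survivors
        kept = survivors


records = [
    {"rhabdoid tumor"},
    {"rhabdoid tumor", "basal cell carcinoma"},
    {"basal cell carcinoma"},
    {"chordoma", "basal cell carcinoma"},  # chordoma occurs only once
]
print(expunge_unique_concepts(records))
```

Rules 2 through 5 need the same kind of counting, applied to sets of records rather than to single concepts.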

   List 10.11.1. Performance issues for de-identification software.

1   1. Product availability. Is the software product freely
      available and open source? Grant applications that propose
      proprietary data sharing solutions may receive disparaging
      reviews in study sections. Reviewers may expect large,
      multi-institutional efforts to implement open source
      de-identification algorithms. Conversely, proprietary
      solutions may be ideal for laboratory personnel who lack the
      resources to implement and test published algorithms and who
      prefer turnkey applications
2   2. Product speed. Is the de-identification process fast?
      This becomes important when the research project involves
      millions of records or requires reprocessing records to
      satisfy research objectives that change over time or that
      serve different research protocols
3   3. Product error rate. There is a trade-off between the
      accurate preservation of textual information and the
      successful elimination of all identifiers. If the research
      project requires the human review of de-identified reports,
      it may be necessary to use a de-identification method that
      preserves as much of the original text as is feasible.
      De-identification methods that maximize the preservation of
      original text will tend to have the highest error rates
4   4. Product integration and support issues. Will the
      de-identification software work with heterogeneous data
      sources, or is it constrained to work with a specific data
      input? Will the software permit an interface to the
      researcher's preferred database, or will the researcher be
      required to transform the primary data structure to a
      secondary data structure? If so, will the secondary data
      structure conform to an open standard (see Glossary), or
      will it be a proprietary data structure? Will the software
      be upgraded and will the upgrades be freely available? Can
      the software be modified without violating license
5   5. Convenience. Will the product require continual 
      maintenance, staff training, and quality assurance? 
      Sometimes simplicity and easy maintenance will  justifiably
      outweigh performance.

   List 11.1.1. Cryptographic methods vital to biomedical informatics.

1   Encrypting and decrypting messages
2   Electronic signatures
3   Message authentication
4   Time stamping
5   Creating unique identifiers
6   Reconciling patients across institutions
7   De-identification and Re-identification
8   Privatizing data sharing protocols
9   Data referencing (with message digests)
10   Watermarking and steganography (see Glossary) utilities

   List 11.2.1. Example protocol for a one-way hash de-identified record linkage.

1   1. John Q. Public arrives for the first time in your medical
2   2. John Q. Public has a glucose test ordered and receives a
      glucose value of 85
3   3. Using the MD5 one-way hash algorithm on the character
      string "John Q. Public", a hash value of
      "3f875ec450dfbb07ed889e7b9c36da92" is generated
4   4. In addition to John Q. Public's identified medical
      record, a de-identified record is prepared:
5   3f875ec450dfbb07ed889e7b9c36da92^^glucose^^85
6   A property of the one-way hash value is that it is a
      seemingly random collection of letters and numbers and no
      computational efforts applied to the one-way hash value can
      yield the patient's name
7   The de-identified record is given to a trusted database
      administrator who adds it to the database of de-identified
      records. The database administrator cannot identify any of
      the patients whose records are included in the database
8   5. Ten years later, John Q. Public returns to the medical
      clinic and has another glucose test. This time, the glucose
      value is 95
9   A one-way hash is performed on the string "John Q.
      Public" yielding 3f875ec450dfbb07ed889e7b9c36da92, and
      a new de-identified record is prepared:
10   3f875ec450dfbb07ed889e7b9c36da92^^glucose^^95
11  The de-identified record is given to the trusted database
    administrator who adds it to the aggregate database. The
    database program finds a match to the one-way hash and
    concatenates the new record to the old record:
12  3f
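
The linkage step can be sketched as follows, in Python rather than the book's Perl. The example uses the MD5 routine from Python's standard hashlib module and the "^^" field separator shown above. The function names are assumptions, and the digest it prints is not guaranteed to equal the value quoted in the list, since the exact input bytes used there are not given.

```python
import hashlib


def deidentified_record(name, test, value):
    """Replace the patient name with its one-way hash."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest}^^{test}^^{value}"


def link(records):
    """Group de-identified records by their hash value."""
    linked = {}
    for rec in records:
        key, test, value = rec.split("^^")
        linked.setdefault(key, []).append((test, value))
    return linked


first = deidentified_record("John Q. Public", "glucose", 85)
later = deidentified_record("John Q. Public", "glucose", 95)
print(link([first, later]))
```

The same name always yields the same hash, so the two visits fall under one key even though the name itself is never stored.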

   List 11.4.1. Section of HIPAA regulation specifically addressing use of one-way hash de-identifiers.

1   Comment: Several commenters who supported the creation of
      de-identified data for research based on removal of facial
      identifiers asked if a keyed-hash message authentication
      code (HMAC) can be used as a re-identification code even
      though it is derived from patient information, because it is
      not intended to re-identify the patient and it is not
      possible to identify the patient from the code. The
      commenters stated that use of the keyed-hash message
      authentication code would be valuable for research, public
      health and bio-terrorism detection purposes where there is a
      need to link clinical events on the same person occurring in
      different health care settings (e.g. to avoid double
      counting of cases or to observe long-term outcomes)
2   Response: The HMAC does not meet the conditions for use as a
      re-identification code for de-identified information. It is
      derived from individually identified information and it
      appears the key is shared with or provided by the recipient
      of the data in order for that recipient to be able to link
      information about the individual from multiple entities or
      over time. Since the HMAC allows identification of
      individuals by the recipient, disclosure of the HMAC
      violates the Rule.

   List 11.4.2. Steps of the protocol.

1   Step 1. Institution A and Institution B each create a random
      character string and send it to the other institution
2   Step 2. Each institution receives the random character
      string from the other institution and sums it with their own
      random character string, producing a random character string
      common to both institutions (RandA+B)
3   Step 3. Each institution takes a patient identifier (a name,
      a social security number, a birth date, or some combination
      of identifiers) and sums it with RandA+B. The result is a
      patient random character string that is identical across
      institutions when the patient is identical in both
      institutions. This step may be implemented several different
4   Step 3, implementation strategy 1. RandA+B is now destroyed
      at both institutions, and RandA and RandB are destroyed by
      the institutions that created each random string, leaving
      only the patient random character string at each
      institution. The destruction of these random numbers makes
      it impossible to recompute the original identifier from the
      patient random character string
5   Step 3, implementation strategy 2. At this point,
      institutions may provide the patient random character string
      to a data broker. Having only the patient random character
      strings, the broker has zero patient-related information
6   Step 3, implementation strategy 3. The summation function
      can be any one of many logical operations on characters or
      strings or their constitutive bits, including logical or,
      xor, modulo addition, etc
7   Step 4. Institutions A and B compare a subset of their
      patient random character strings
8   Step 4, implementation strategy 1. Institution A sends the
      first character of the patient random character string to
      Institution B. If the first character (or any subsequent
      character) is not identical in both institutions, the
      protocol ends. The 2 patients are not the same person. If
      the first character is identical in both institutions,
      Institution B sends the second character of its patient
      random character string to Institution A. If the second
      character held by Institution B is the same as the second
      character held by Institution A, the process is repeated
      until a sufficient number of transactions have occurred,
      short of the length of the random character string, to
      convince the institutions that they have the same patient
      random character string. Implementing this optional strategy
      ensures that the patient random character strings are never
      actually exchanged between institutions.
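
The data flow of these steps can be sketched as follows, in Python rather than the book's Perl. It is an illustration only, not a vetted security implementation. XOR stands in for the summation function (one of the options named above), the shorter string is repeated to match the longer one, and the function names, string lengths, and number of comparison rounds are all assumptions.

```python
import secrets
from itertools import cycle


def combine(a: bytes, b: bytes) -> bytes:
    """'Sum' two byte strings with XOR, repeating the shorter one."""
    if len(a) < len(b):
        a, b = b, a
    return bytes(x ^ y for x, y in zip(a, cycle(b)))


def same_patient(string_a: bytes, string_b: bytes, rounds: int) -> bool:
    """Compare the first `rounds` characters, one exchange at a time."""
    for i in range(rounds):
        if string_a[i:i + 1] != string_b[i:i + 1]:
            return False  # the protocol ends at the first mismatch
    return True


# Steps 1 and 2: each institution makes a random string and swaps it
rand_a = secrets.token_bytes(16)
rand_b = secrets.token_bytes(16)
rand_ab_at_a = combine(rand_a, rand_b)
rand_ab_at_b = combine(rand_b, rand_a)

# Step 3: each institution sums its patient identifier with RandA+B
patient_at_a = combine(b"John Q. Public", rand_ab_at_a)
patient_at_b = combine(b"John Q. Public", rand_ab_at_b)

# Step 4: compare a subset, short of the full string length
print(same_patient(patient_at_a, patient_at_b, rounds=8))
```

Only individual characters cross between the institutions, and the comparison stops before the whole string has been revealed.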

   List 11.4.3. Zero-Check Properties.

1   No knowledge about the patient is transmitted across
      institutions. When institutions use an institutional broker
      to complete the transactions, the institutions themselves
      have no knowledge of the identity of the individuals
2   The protocol uses no encryption or 1-way hash algorithm, and
      therefore, there is no need to protect the protocol from
3   By destroying RandA, RandB, and RandA+B, the protocol can be
      implemented in a manner that makes it impossible to
      recompute the original identifier from the patient random
      character string.

   List 11.5.1. The basic threshold protocol.

1   1. Text is divided into short phrases
2   2. Each phrase is converted by a one-way hash algorithm into
      a seemingly-random set of characters
3   3. Threshold Piece 1 is composed of the list of all phrases,
      with each phrase followed by its one-way hash
4   4. Threshold Piece 2 is composed of the text with all
      phrases replaced by their one-way hash values, and with
      high-frequency words preserved.
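
The protocol can be sketched as follows, in Python rather than the book's Perl. The example assumes that phrases are the runs of words lying between high-frequency (stop) words, uses a very small stop-word list, and hashes with MD5 from the standard hashlib module. The stop-word list and function names are assumptions, and the digests produced are not claimed to match any shown elsewhere on this page.

```python
import hashlib
import re

# Tiny stop-word list, assumed for this example only
STOP_WORDS = {"they", "that", "the", "were", "as", "in", "and", "this"}


def threshold_pieces(text):
    """Split text into Piece 1 (hash -> phrase) and Piece 2 (token list)."""
    words = re.findall(r"[a-z0-9\-]+", text.lower())
    piece1 = {}   # each phrase keyed by its one-way hash
    piece2 = []   # stop words and hashes, in original order
    phrase = []

    def flush():
        if phrase:
            p = " ".join(phrase)
            h = hashlib.md5(p.encode("utf-8")).hexdigest()
            piece1[h] = p
            piece2.append(h)
            phrase.clear()

    for w in words:
        if w in STOP_WORDS:
            flush()
            piece2.append(w)
        else:
            phrase.append(w)
    flush()
    return piece1, piece2


def reconstruct(piece1, piece2):
    """Rebuild the text; this works only when both pieces are held."""
    return " ".join(piece1.get(token, token) for token in piece2)


text = ("they suggested that the manifestations were as severe in the "
        "mother as in the sons and that this suggested autosomal "
        "dominant inheritance")
p1, p2 = threshold_pieces(text)
print(len(p1), "phrases;", len(p2), "tokens")
print(reconstruct(p1, p2) == text)
```

Piece 1 on its own is an unordered phrase list, and Piece 2 on its own is a string of stop words and hashes; the text reappears only when the two are combined.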

   List 11.5.2. Bob's piece 1.

1   684327ec3b2f020aa3099edb177d3794 => suggested autosomal
      dominant inheritance
2   3c188dace2e7977fd6333e4d8010e181 => mother
3   8c81b4aaf9c2009666d532da3b19d5f8 => manifestations
4   db277da2e82a4cb7e9b37c8b0c7f66f0 => suggested
5   e183376eb9cc9a301952c05b5e4e84e3 => sons
6   22cf107be97ab08b33a62db68b4a390d => severe

   List 11.5.3. Bob's piece 2.

1   they db277da2e82a4cb7e9b37c8b0c7f66f0 that the
2   8c81b4aaf9c2009666d532da3b19d5f8 were as
3   22cf107be97ab08b33a62db68b4a390d in the
4   3c188dace2e7977fd6333e4d8010e181 as in the
5   e183376eb9cc9a301952c05b5e4e84e3 and that this
6   684327ec3b2f020aa3099edb177d3794.

   List 11.5.4. Properties of Piece 1 (the listing of phrases and their one-way hashes).

1   Contains no information on the frequency of occurrence of
      the phrases found in the original text (because recurring
      phrases map to the same hash code and appear as a single
      entry in piece 1)
2   Contains no information that Alice can use to connect any
      patient to any particular patient record. Records do not
      exist as entities in Piece 1
3   Contains no information on the order or locations of the
      phrases found in the original text
4   Contains all the concepts found in the original text. Stop
      words are a popular method of parsing text into concepts
5   Bob can destroy piece 1 and re-create it later from the
      original file
6   Alice can use the phrases in Piece 1 to transform, annotate
      or search the concepts found in the original file
7   Alice can transfer piece 1 to a third party without
      violating HIPAA privacy rules or Common Rule human subject
      regulations (in the U.S.). For that matter, Alice can keep
      piece 1 and add it to her database of piece 1 files
      collected from all of her clients.

   List 11.5.5. Properties of Piece 2.

1   Contains no information that can be used to connect any
      patient to any particular patient record
2   Contains nothing but hash values of phrases and stop words,
      in their correct order of occurrence in the original text
3   Anyone obtaining piece 1 and piece 2 can reconstruct the
      original text
4   Bob can lose or destroy piece 2, and re-create it later from
      the original file.

   List 11.6.1. FDA definitions related to digital signatures.

1   Biometrics means a method of verifying an individual's
      identity based on measurement of the individual's physical
      feature(s) or repeatable action(s) where those features
      and/or actions are both unique to that individual and
      measurable
2   Digital signature means an electronic signature based upon
      cryptographic methods of originator authentication, computed
      by using a set of rules and a set of parameters such that
      the identity of the signer and the integrity of the data
      can be verified
3   Electronic signature means a computer data compilation of
      any symbol or series of symbols executed, adopted, or
      authorized by an individual to be the legally binding
      equivalent of the individual's handwritten signature.

   List 11.6.2. Qualities of an adequate electronic signature.

1   Non-repudiation. Can the signator plausibly deny that she
      signed the document?
2   Unique identification. Does the electronic signature
      uniquely identify the signator?
3   Universal. Can the signature be used with other existing
      signature or biometric systems?
4   User-friendliness. Can the signature be signed incorrectly
      or  misinterpreted?
5   Non-obsolescence. Will everyone be able to read the
      signature ten years or 100 years from now?
6   Security. Can someone change the file that was signed,
      change the signature to someone else's, invalidate a valid
      signature, or steal the signature?
7   Extensibility. Can someone sign for the signator, sign with
      the signator, or notarize the signature with another
      electronic signature?

   List 11.6.3. Do you trust digital signatures?

1   Is your computer secure?  Might someone have stolen your
      private key?
2   Is your encryption software secure?  Might it contain a
      trapdoor subprogram that sends your private key to a
      malevolent entity?
3   Is your encryption software reliable?  Might it produce the
      same private key for different customers?
4   Can you be certain that the software "signed" the
      correct document?  Might it have signed a different document
      by mistake during a signing transaction?
5   Are you certain that you will not lose track of your 
      private keys over time?

   List 12.2.1. Six extraordinary properties of XML.

1   Enforced and defined structure (XML rules and schema)
2   Formal metadata (through ISO 11179 specification)
3   Namespaces (permits sharing of uniquely identifiable common
      data elements (CDEs))
4   Linking data via the internet (through Unique Resource
5   Logic and meaning (the Semantic Web and Ontologies)
6   Self-awareness (software agents (see Glossary), artificial
      intelligence (see Glossary), embedded protocols and

   List 12.4.1. Descriptors for the metadata tag <core_organism>.

1   core_organism
2   Identifier: core_organism
3   Version: 1.0
4   Registration Authority: Association for Pathology
5   Language: English (en)
6   Obligation: Optional
7   Datatype: Character String representing taxonomy.dat
      identifier number followed by an allowable taxonomy.dat name
      for the identifier number
8   Maximum Occurrence: Unlimited
9   Definition: Organism name at species level for organism
      whose tissue is represented in the donor block
10   Comment: URI for taxonomy.dat is
      The correct entry for human tissue is "9606

   List 12.10.1. Steps to create data documents that can be easily integrated across heterogeneous datasets.

1   Find a simple way of describing pathology data using a
      syntax that confers meaning onto data (i.e., RDF syntax)
2   Develop a simple approach to listing the unique objects
      relevant to the pathology domain (e.g., surgical pathology
      report, specimen, block, stain, laboratory test, submission
      data, completion data) and a way to uniquely identify these
      objects that can be used by any laboratory without risk of
      losing track of unique objects and without risk of assigning
      the same identifier to different objects
3   Develop a repository for metadata that defines each
      metadata element in the pathology domain and describes the
      data constraints (if any) of the data elements described by
      the metadata
4   Develop general algorithms/software for integrating,
      aggregating and transforming data held in RDF triple

   List 12.11.1. Image description using RDF triples in Notation 3 format.

     file:image.n3
     @prefix :
     @prefix rdf:
     :Baltimore_Hospital_Center rdf:type "Hospital"
     :Baltimore_Hospital_Center_4357
     :Baltimore_Hospital_Center_4357 :patient_name
     :Baltimore_Hospital_Center_4357 :surgical_pathology_specimen
     :S_3456_2001 rdf:type
     :S_3456_2001 :image
     :S_3456_2001 :log_in_date "2001-08-15"
     :S_3456_2001 :clinical_history
     <> rdf:type
     <> :surgical_pathology_accession_number "S3456-2001"
     <> :specimen "2"
     <> :block
     <> :format
     <> :width
     <> :height
     <> :hash_value "84027730gjsj350489"
     <> :hash_type "md_5"
     <> :watermark "none"
     <> :camera
     <> :camera_model "342"
     <> :capture_date "2002-02-02"
     <> :diagnosis "squamous_cell_carcinoma"
     <> :topography "floor_of_mouth"
     <> :has
     <> :copyright "all_rights_reserved"
     <> :copyright_holder "Baltimore_Hospital_Center"
     <> :microscope "Olympus"
     <> :microscope_model "3453"
     <> :microscope_objective_power "40X"
     <> :photographer_name "Jules_Berman".

   List 12.11.2. A Perl script that converts Notation 3 into RDF.

     #!/usr/local/bin/perl
     use strict;
     use warnings;
     use RDF::Notation3;
     use RDF::Notation3::Triples;
     use RDF::Notation3::XML;

     my $path = "image.n3";

     # Parse the Notation 3 file into a list of triples
     my $rdf = RDF::Notation3::Triples->new();
     $rdf->parse_file($path);
     my $triples = $rdf->get_triples;
     print $triples->[0]->[0];   # first element of the first triple
     print "###\n";

     # Parse the same file again and serialize it as XML
     $rdf = RDF::Notation3::XML->new();
     $rdf->parse_file($path);
     my $string = $rdf->get_string;
     print $string;
     exit;

   List 12.11.3. Properties of image.n3.

1   Every statement has a fully specified object followed by a
      property and a value (a triple)
2   Every unique object belongs to a class (at least one).
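
The two properties above can be checked mechanically. The following Perl sketch is illustrative only. It assumes a simplified, one-statement-per-line subset of Notation 3 (subject, property, quoted value, closing period). The sample statements and the names parse_statements and untyped_subjects are hypothetical; they are not taken from image.n3 or from the RDF::Notation3 module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data in a simplified, one-statement-per-line
# subset of Notation 3: subject, property, quoted value, period.
my @lines = (
    ':S_3456_2001 rdf:type "Surgical_Pathology_Specimen".',
    ':S_3456_2001 :log_in_date "2001-08-15".',
    ':Image_1 :format "jpeg".',
);

# Split each line into [subject, property, value]; die on anything
# that is not a complete triple.
sub parse_statements {
    my @triples;
    for my $line (@_) {
        $line =~ /^\s*(\S+)\s+(\S+)\s+"([^"]*)"\s*\.\s*$/
            or die "Not a triple: $line\n";
        push @triples, [$1, $2, $3];
    }
    return @triples;
}

# Return the subjects (sorted) that never appear with an rdf:type
# property, i.e. objects that have not been assigned to any class.
sub untyped_subjects {
    my (%seen, %typed);
    for my $triple (@_) {
        my ($subject, $property) = @$triple;
        $seen{$subject}  = 1;
        $typed{$subject} = 1 if $property eq 'rdf:type';
    }
    my @untyped = grep { !$typed{$_} } keys %seen;
    @untyped = sort @untyped;
    return @untyped;
}

my @triples = parse_statements(@lines);
print "Subject without a class: $_\n" for untyped_subjects(@triples);
```

With the sample data, only :Image_1 is reported, because it is described by a property but never assigned to a class.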

   List 12.12.1. A few examples of XML schema primitives that can be incorporated in DAML.

1   enumeration
2   positiveInteger
3   minInclusive
4   integer
5   pattern

   List 13.1.1. Complexities of commercial biomedical software.

1   Must protect the confidentiality and privacy of patients
2   Must not produce errors. Patient lives are at stake
3   Must not crash. Patient lives are at stake
4   Must have a graphical user interface that anyone can use. Staff
      training is an expensive proposition, and most medical
      centers hate to train their staff to use their computers
5   Must provide the key functionality to the user. Most errors
      found by users stem from a misunderstanding between the user
      and the developer about the intended purpose of the
      software. Most software developers know nothing about the
      workflow of hospitals. Developers seldom know
      which functions of their software are vital to patient care.
      Hospital personnel have trouble understanding how anyone
      could NOT know these things
6   All functionalities must be scalable (able to accommodate 
      unanticipated increases in usage demands)
7   Must successfully interoperate with many other  systems
      (hardware and software) throughout the hospital and over the
8   Must anticipate future needs. Hospitals pay many millions of
      dollars for information systems. They would much prefer not
      to buy a new system every time their activities change
9   Must survive the demise of the company that produced the
      software. The death rate of commercial biomedical software
      developing companies is atrociously high. Hospitals have
      been  left with virtually useless systems when their
      software vendors have vanished
10   Must not be vulnerable to malicious attack. There are
      actually very few programmers with the expertise to design
      software  with principles of computer security. The
      likelihood that  any piece of software has been written with
      the  help of a computer security expert is quite low
11  Must not be vulnerable to the unintended consequences of 
    exuberant users. Large systems have been known to crash when
    users strain computational resources with recursive

   List 13.2.1. Reasons why hospital informatics projects, such as CPOE, may fail.

1   Tasks that were traditionally accomplished through
      interpersonal communication may be replaced by solitary entry
      sessions with HIS computer terminals. Opportunities to
      share helpful explanations and patient status updates may be
2   Computer entry tasks may be tedious, time-consuming, and
      repetitive. Harried staff, under these circumstances, may do
      an incomplete or sloppy job
3   Computer orders, once entered, may have no mechanism for
      correcting entry errors, leading to miscommunications
4   The asynchronous nature of multi-user entries into the HIS
      may cause havoc in a system that depends on coordinated
      workflow. For instance, a prescription may not be filled by
      the pharmacy until an order entered by a clerk-typist is
      released by a physician. If there is no system to ensure
      that each entry occurs in a timely and coordinated manner,
      workflow is halted.

   List 13.3.1. Desperate actions intended to cope with complexity.

1   Lock down your data. Restricting the variety of data
      permitted in the database can sometimes reduce complexity
2   Lock down your software components. Use the last software
      version that worked and stop trying to enhance any
3   Lock down your operating system. Do not even try to offer 
      cross-platform interoperability
4   Lock down your set of assumptions. Software built on a
      static set of assumptions about the user's world is often
      manageable. Changing these assumptions is asking for
5   Lock down activities. Reduce the number of people or
      services  that are supported by the software
6   Hire specialized programmers. Each programmer should
      concentrate on a very small component of your software. Hire
      more programmers so that every software component is
      adequately staffed with  developers, analysts, and technical
      support staff. There is strength in numbers
7   Stop thinking about fundamental approaches to problems. There
      is no going back once the juggernaut is launched
8   Replace system with a newer, more expensive and more complex
      system. Ahh. The cycle of life is renewed.

   List 13.3.2. A few effective measures to cope with software complexity.

1   Write clean software code and use in-line documentation to
      explain the purpose of software commands
2   Provide detailed and clear documentation for all software
3   Use object-oriented languages and follow standard techniques
      for good object design
4   Use UML
5   Use refactoring (see Glossary) methodology to improve
      complex code
6   Continually test software and carefully document all

   List 13.3.3. Intrinsically complex components of biomedical informatics.

   List 13.3.4. Intrinsically simple components of biomedical informatics.

1   Classifications. A class inherits properties in a direct
      lineage  from a parent class. An object can only occupy a
      single class. Classifications are easy to understand and
2   Flat data files that can be extended but not re-written. A
      telephone book is a close example. If people never changed
      their names, never died, and never changed their telephone
      numbers, a telephone directory would be an ideal flat-file
3   The EMR (electronic medical record). The EMR is the  digital
      equivalent of the patient chart. In this model, all new
      clinical reports pertaining to a patient are inserted into
      the EMR object for the patient. This is a simple data model
      that can work well so long as one and only one record is
      created for each patient
4   Small, self-contained specialized information systems. These
      applications are designed for a specific and narrow
      function (e.g., a cytopathology information system). Complexity
      does not intervene until the specialized information system
      needs to interact with other systems in the hospital
5   Fundamental algorithms. Almost all important algorithms are
      simple and can be explained in a few steps. From  these
      simple algorithms, complex systems can arise
6   Simple protocols. Very simple protocols can support
      incredibly complex systems. TCP/IP (the internet protocol)
      is a simple  strategy for transferring packets of
      information over a network of computers
7   Elegant object-oriented programming languages, such as Ruby.
      Though Ruby is a simple and elegant language, it can be used
      to  create hopelessly complex software. Programmers need
      extensive training in design principles that minimize
8   Specifications. Specifications are formal ways of explaining
      what you've done so that computers and humans can understand
      and replicate your work
9   Unique data identifiers. Computers are good at creating and
      tracking unique identifiers
10   Encryption. It is easy to make something a secret
11  De-identified public datasets. Publicly released
    de-identified  data has immense scientific value and, with
    remarkably few exceptions, has not hurt patients.

   List 13.5.1. CDC report of annual number of deaths in the U.S. from leading diseases.

1   Heart disease: 696,947
2   Cancer: 557,271
3   Stroke (cerebrovascular diseases): 162,672
4   Chronic lower respiratory diseases: 124,816
5   Accidents (unintentional injuries): 106,742
6   Diabetes: 73,249
7   Influenza/Pneumonia: 65,681
8   Alzheimer's disease: 58,866
9   Nephritis, nephrotic syndrome, and nephrosis: 40,974
10   Septicemia: 33,865

   List 13.6.1. Properties of a classification.

1   A classification is a grouped taxonomy (listing of all
      objects in a knowledge domain) with the following four
2   1. Inheritance: Hierarchical structure, with each class of
      tumors inheriting properties of its ancestors
3   2. Uniqueness: Each tumor occurs in only one place in the
4   3. Comprehensive: All tumors are included
5   4. Class-intransitive: A tumor from one class does not
      change into a tumor from another class (e.g. an
      adenocarcinoma does not become a lymphoma)

   List 13.6.2. Limitations of current neoplasm classifications.

1   Classifications are created piecemeal for specific sites or
      organ systems. Nobody has published a comprehensive
      classification, although comprehensive taxonomies have been
2   Classifications are often based on medical disciplines,
      rather than on any biologic principles (e.g. classification
      of dermatologic tumors)
3   A given tumor will appear redundantly when
      subclassifications are merged
4   No tumor classification has been prepared in a standard
      format designed to exchange, merge or analyze heterogeneous
      biological data. The most widely used authoritative resources
      are the World Health Organization classifications, which
      list the tumors that occur at different body sites. The
      problem with an organ system approach to classification is
      that every organ contains organ-specific and organ
      non-specific cell types. The brain, for instance, contains
      connective tissue and lymphoid tissue, and therefore is
      prone to tumors of connective tissue and lymphoid tissue. A
      listing of tumors that occur in the brain must include:
      osteocartilaginous tumors, lipoma, fibrous histiocytoma,
      hemangiopericytoma, rhabdomyosarcoma, melanoma, lymphoma and
      myeloma, among others. These same tumors will be included
      again and again in every site-specific classification.
      Although each term may occur only once in each site-specific
      classification, the same lesion may occur a virtually
      limitless number of times when the site classifications are
      combined into a comprehensive classification of tumors.
      Although cancer taxonomies are different from
      classifications, they usefully provide all the instances of
      tumors that must be grouped within a classification.
      Excellent tumor taxonomies are now publicly available at no

   List 13.6.3. Schematic for the Developmental Lineage Classification of Cancer.

1   embryonic
2   primitive
3       primitive_differentiating
4          totipotent_or_multipotent_differentiating
5          limited_differentiating
6       germ_cell
7       primitive_non_differentiating
8   non_primitive
9          endoderm_or_ectoderm
10            endoderm_or_ectoderm_surface
11            endoderm_or_ectoderm_endocrine
12            endoderm_or_ectoderm_parenchymal
13            odontogenic_epithelium
14         mesoderm
15            mesenchyme
16               connective_tissue
17                   muscle
18                   fibrous_tissue
19                   vascular
20                   adipose_tissue
21                   bone_cartilage
22               heme_lymphoid
23            non_mesenchymal_mesoderm
24               coelomic
25                  coelomic_ductal
26                  coelomic_cavities
27                  coelomic_gonadal
28               sub_coelomic
29                  sub_coelomic_gonadal
30                  sub_coelomic_endocrine
31                  sub_coelomic_nephric
32         neuroectoderm_neural_plate
33            neural_tube
34               neural_tube_parenchyma
35               neural_tube_lining
36            neural_crest
37               peripheral_nervous_system
38               neural_crest_endocrine
39               neural_crest_melanocytic
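
A schematic like the one above can be held in a simple child-to-parent table, which makes the single-parent rule of a classification explicit. The Perl sketch below is an illustration, not the book's implementation. It copies one branch of the schematic (non_primitive down to muscle and vascular), and the name lineage is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One branch of the schematic, stored as child => parent.
# Because a hash key holds exactly one value, each class can have
# at most one parent, as a classification requires.
my %parent_of = (
    mesoderm          => 'non_primitive',
    mesenchyme        => 'mesoderm',
    connective_tissue => 'mesenchyme',
    muscle            => 'connective_tissue',
    vascular          => 'connective_tissue',
);

# Walk from a class up to the root, returning the class followed by
# each of its ancestors in order.
sub lineage {
    my ($class, $parents) = @_;
    my @path    = ($class);
    my %visited = ($class => 1);
    while (defined(my $parent = $parents->{$class})) {
        die "Cycle detected at $parent\n" if $visited{$parent}++;
        push @path, $parent;
        $class = $parent;
    }
    return @path;
}

print join(' < ', lineage('muscle', \%parent_of)), "\n";
```

Any property attached to an ancestor class (for example, mesoderm) can then be looked up for every class in the returned lineage, which is what inheritance in a direct line means in practice.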

   List 13.6.4. General features of the tumor classification relevant to biomedical informatics.

1   Instance uniqueness. Each tumor entity occurs only
      once within the classification
2   Comprehensive. The classification ensures that every tumor
      of man can be placed somewhere within the classification
3   Simplicity. One of the purposes of a classification is to
      drive down the complexity that exists when the domain
      taxonomy is large. The entire classification is described by
      under 40 classifiers
4   Principled. The classification is based on known principles
      of developmental biology, not on political or artifactual
      distinctions between tumors. A counterexample would be a
      tumor classification based on medical specialty (e.g.,
      dermatologic neoplasms, hematologic neoplasms, head and neck
      tumors, etc.)
5   The classification has "competence." In the field
      of informatics, competence is the ability to answer
      questions related to the instances of a data group
6   Standard method of organization. The classification is
      represented as an XML document
7   Scalability. It is easy to expand the classification with
      new subclasses. This is important, as the molecular analysis
      of tumors is likely to provide new taxa
8   Modifiable. It is easy to move subdivisions of the
      classification. Classifications are hypothetical
      re-creations of reality and must be changed as information
      is accrued
9   Understandable. The classification is easily understood by
      developmental biologists. Developmental biologists are major
      participants in post-genomic science and need to have tools
      to relate basic research with clinical exigencies
10   Credible. The classification complements modern theories of
      the "stem cell" origin of tumors
11  Compatible with other visions of reality. The classification
    does not invalidate existing diagnoses found in pathology
    reports. The medicolegal importance of this feature cannot
    be exaggerated. This relieves pathologists from reviewing
    all their prior cases and re-diagnosing them in conformance
    with a new classification
12  Open access. The classification is an open access document
    that can be used or criticized freely by the biomedical
    community.

   List 13.8.1. Distinctions between ontologies and classifications.

1   An ontology does not need to provide a theoretic
      embodiment of a data domain. In fact, an ontology need not
      be comprehensive (i.e., an ontology does not need to include
      all the instances of a knowledge domain) and may extend over
      several different knowledge domains
2   Ontologic classes are characterized by one or more
      logical rules and include object instances that behave in
      conformity to the class rules. The classes in
      classifications are determined by a set of features that are
      shared among the members of the class. The features that
      define a class within a classification are usually not
      logical rules
3   Ontologies permit multi-class inheritance. Any ontologic
      class can inherit from any number of parent classes. A class
      within a classification can have at most one parent class
4   An ontologic object instance may belong to more than one
      class, as long as the object obeys the rules of every
      class of which it is a member. An object in a
      classification can belong to only one class
5   An ontologic class inherits the rules of its superclass.
      However, an ontologic class is not required to have a
      superclass (i.e., an ancestor class) or descendant classes
      (i.e., subclasses).
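
The inheritance distinction in item 3 can be tested on a table of classes. The Perl sketch below is a minimal illustration under stated assumptions: both class tables are invented for the example, and the name multi_parent_classes is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical class tables: class name => list of direct parent classes.
my %classification = (
    tumor    => [],
    sarcoma  => ['tumor'],
    lymphoma => ['tumor'],
);
my %ontology = (
    digestive_organ => [],
    endocrine_organ => [],
    pancreas        => ['digestive_organ', 'endocrine_organ'],  # two parents
);

# Return the classes (sorted) that list more than one direct parent.
# An empty result means the table obeys the single-parent rule of a
# classification.
sub multi_parent_classes {
    my ($parents) = @_;
    my @found = grep { @{ $parents->{$_} } > 1 } keys %$parents;
    @found = sort @found;
    return @found;
}

for my $table ([classification => \%classification], [ontology => \%ontology]) {
    my ($name, $parents) = @$table;
    my @multi = multi_parent_classes($parents);
    print "$name: ", (@multi ? "multi-parent classes: @multi" : "single-parent rule holds"), "\n";
}
```

A table that passes this check is not thereby a classification (the other properties still have to hold), but a table that fails it can only be an ontology.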

   List 13.8.2. Good things about ontologies.

1   Ontologies are computable and fit neatly into the
      object-oriented programming model
2   Ontologies are semantically sensible and can be described
      with standard RDF syntax
3   Ontologies have competence, meaning they can be used to draw
      a variety of inferences about class members based on class
4   Ontologies are extensible and can be integrated with other
5   When the same ontology is used by different researchers,
      concepts in common use will have the same meaning and
      properties.

   List 13.8.3. Bad things about ontologies.

1   Ontologies place no constraints on internal complexity and
      can quickly become incomprehensible to humans
2   Ontologic complexity may lead to unanticipated consequences
      (including paradoxes of self-referral)
3   Ontologies are relatively new and there are very few
      examples where they have been shown to be of any biomedical
      value. Classifications have proven their value over
4   Ontologies work on the assumption that medically relevant
      domains have an intrinsic logic that can be described by
      rules. This assumption may be false.

   List 14.2.1. Some medical breakthroughs that occurred without benefit of randomized prospective clinical trials.

1   1796. Edward Jenner successfully vaccinates 8-year-old James
      Phipps with an unproven smallpox vaccine (prepared from cowpox)
2   1885. Louis Pasteur successfully vaccinates Joseph Meister
      with an unproven rabies vaccine
3   1900. Jesse Lazear demonstrates (on himself) that yellow
      fever is transmitted by mosquito bite. Lazear dies from
      successful inoculation
4   1944, 1972, 1992. Sudden Infant Death syndrome often
      associated with infants sleeping on stomach, on soft
5   1985. Marshall infects himself with H. pylori, thus
      developing gastritis and demonstrating the bacterial
      origin of gastric ulcers.

   List 14.3.1. Questions for clinical trialists.

1   Are data from clinical trials made available to the public?
2   Are we making the best use of data collected in clinical
3   How often do clinical trials fail to provide definitive
      answers to the question that motivated the trial?
4   When a prospective clinical trial fails to answer the
      question that motivated the trial, is the trial data made
      available to the public?
5   Are we using the best available methods to guarantee that
      clinical trials are designed properly?
6   Might clinical trials be designed in a manner that enhances
      the scientific and medical value of the trials beyond a
      single hypothesis?
7   Might some prospective clinical trials be replaced by
      cheaper, faster retrospective trials?
8   Might some clinical trials be replaced by new, innovative
      models producing clinically sound results in less time and
      for less money?

   List 14.5.1. Snippet of Perl code demonstrating how metastatic events can be simulated using the Monte Carlo technique.

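     use strict;
     use warnings;

     my $badoutcome = "No mets for the bad tumor";  # begin with no metastases
     my $bad = 1;   # tumor size in cells; a one-cell start is assumed here
     my $n   = 0;   # number of completed doublings
     my $start = time();
     srand;         # seed the random number generator once
     while (1)      # loops until one of the exits below is reached
       {
       $bad = 2 * $bad;   # the tumor doubles in size
       print "$bad\n";
       for (1 .. $bad)    # each cell gets one chance to metastasize
         {
         my $badchance = int(rand(5000)) + 1;
         if ($badchance == 5000)          # a 1-in-5,000 event...
           {
           $badchance = int(rand(10000)) + 1;
           if ($badchance == 10000)       # ...followed by a 1-in-10,000 event
             {
             $badoutcome = "$n $bad\n";   # doublings completed and tumor size
             print $badoutcome;
             exit;
             }
           }
         }
       $n++;
       if ($bad > 50000000)   # stop once the tumor exceeds 50 million cells
         {
         print "$badoutcome\n";
         my $end = time();
         my $totaltime = $end - $start;
         print "Time for execution is $totaltime seconds\n";
         exit;
         }
       }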
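
As a rough cross-check on a simulation of this kind, the chance of finishing with no metastasis can be computed directly. The Perl sketch below is not from the book. It assumes that the tumor starts as one cell and doubles before each pass, that every cell independently has a (1/5000) x (1/10000) chance of metastasizing on each pass, and that the run stops after the first pass in which the tumor exceeds 50,000,000 cells. The name prob_no_mets is hypothetical.

```perl
use strict;
use warnings;

# Probability that no cell metastasizes during the whole run, under the
# assumptions stated above. $p_cell is the per-cell chance on each pass;
# $limit is the tumor size after which the run stops.
sub prob_no_mets {
    my ($p_cell, $limit) = @_;
    my $size   = 1;   # assumed starting size: one cell
    my $trials = 0;   # total number of per-cell chances taken
    while (1) {
        $size   *= 2;       # the tumor doubles
        $trials += $size;   # every cell gets one chance on this pass
        last if $size > $limit;
    }
    # Independent trials: raise the per-trial survival probability
    # to the number of trials.
    return (1 - $p_cell) ** $trials;
}

my $p_cell = (1 / 5000) * (1 / 10000);
printf "Chance of no metastasis: %.3f\n", prob_no_mets($p_cell, 50_000_000);
```

Under these assumptions the figure comes out near 7 percent, so most simulated runs would be expected to end with a metastasis before the size limit is reached.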

   List 14.7.1. Role of biomedical informaticians in clinical trials-based translational research.

1   Protecting human subject privacy and confidentiality (always
      the first responsibility of biomedical informaticians)
2   Developing new approaches for clinical trials that reduce
      the cost and length of trials without sacrificing scientific
3   Choosing a primary hypothesis whose scientific importance
      will endure to the end of the study
4   Expanding study designs to test multiple hypotheses during
      the course of the trial
5   Capturing data that can complement other scientific efforts
6   Designing the studies in ways that ensure that the primary
      hypothesis (i.e., the hypothesis that justifies the study)
      is adequately tested
7   Organizing the data (using common standards such as CDISC
8   Ensuring that the analysis of data is conducted in a manner
      free from bias
9   Reporting the conclusions of the study
10   Distributing the data that support the conclusions of the

   List 15.1.1. Plain-English description of a software agent protocol.

1   You provide the software program with the following inputs:
      a list of people with whom you would like to make an
      appointment, and a calendar file that contains free dates and times
      as well as dates and times that have already been obligated
2   The software agent uses the standard http (web) protocol to
      visit all the URLs of all the names on your list
3   At each web site, the software agent searches for the class
      objects associated with the unique person name. If all goes
      well, the software agent finds a calendar object that
      belongs to the named person
4   The calendar object contains unique date-time entries
      associated with prior appointments. The agent matches these
      against a list of available dates and times from your calendar.
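
The matching step in item 4 can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: the date-time strings, both lists, and the name open_slots are hypothetical, and the web retrieval described in items 2 and 3 is left out.

```perl
use strict;
use warnings;

# Hypothetical inputs: your free date-times, and the date-times already
# taken in the other person's calendar object.
my @my_free      = ('2006-05-01T09:00', '2006-05-01T14:00', '2006-05-02T10:00');
my @their_booked = ('2006-05-01T09:00', '2006-05-03T11:00');

# Return your free slots that do not collide with the other person's
# prior appointments, in their original order.
sub open_slots {
    my ($free, $booked) = @_;
    my %taken = map { $_ => 1 } @$booked;
    return grep { !$taken{$_} } @$free;
}

print "Possible meeting time: $_\n" for open_slots(\@my_free, \@their_booked);
```

Storing the booked entries as hash keys keeps each lookup cheap, which matters once the agent has to compare calendars for every name on the list.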

   List 15.2.1. Increasingly complex task-sharing network protocols.

1   FTP (file transfer protocol)
2   TELNET
3   HTTP (hypertext transfer protocol)
4   RPC (remote procedure calls)
5   XML-RPC (XML-based remote procedure calls)
6   SOAP (simple object access protocol)
7   P2P (peer-to-peer networking)
8   Web services
9   Grid computing

   List 16.1.1. Infrastructural issues that have delayed advancement in the field of biomedical informatics.

1   Lack of standards for acquiring, collecting, annotating, and
      exchanging all types of biomedical data
2   Poor quality of clinical data in hospital information
3   Inability of institutions to cope with HIPAA privacy
4   Reluctance of funded researchers to share data
5   Questionable data analysis methodologies for new
6   Enormous administrative cost of obtaining and tracking the
      patient  consent process. High cost and complexity of
      privacy/security tasks related to prospective clinical
7   Poor access to large clinically annotated banks of human
      tissue samples of a wide variety of diseases, required to
      validate candidate diagnostic tests.

   List 16.3.1. Statement on the key function of biomedical informatics in CTSA funding.

1   "Biomedical Informatics is the cornerstone of
      communication within  C/D/Is and with all collaborating
      organizations. Applicants should  consider both internal,
      intra-institution and external interoperability  to allow
      for communication among C/D/Is and the necessary research 
      partners of clinical and translational investigators (e.g.
      government,  clinical research networks, pharmaceutical
      companies, commercial  vendors, laboratories, and equipment
      manufacturers). Biomedical  Informatics support is expected
      to be flexible and innovative. Interoperability, security,
      workflow, usability and standards  are essential areas of
      work. To facilitate the conduct of research  in health care
      settings and the transfer of research findings into  routine
      care, clinical and translational research must employ
      applicable  standards (e.g., identifiers, vocabularies,
      transactions, security measures)  adopted by the Department
      of Health and Human Services for use in U.S. health care and
      public health operations. All human subject data must be 
      handled securely to ensure privacy and confidentiality.
      Biomedical  informatics research activity should be
      innovative in the development  of new tools, methods, and

   List 16.3.2. Some common biomedical informatics goals for institutions.

1   Develop a thoughtful approach to issues of human subjects
      protection, data organization and data sharing. These may
      not be the focus of your research, but your research will
      suffer if you minimize their importance
2   Develop general protocols, approved by your IRB, for dealing
      with issues of confidentiality and data sharing. Experienced
      grant reviewers appreciate institutions that use a tested
      infrastructure to support their research staff
3   Develop collaborations with researchers outside your
      department, and outside your institution. Translational
      research needs cross-disciplinary expertise and biomedical
      informatics requires large datasets collected from multiple
      institutions. Funding agencies and grant reviewers
      understand this and will look favorably at innovative
      proposals that draw information and expertise from diverse
4   Train staff in the fundamentals of biomedical informatics.
      Hiring an informatics guru does not compensate for a staff
      of luddites (unless the guru can bring enlightenment to the
      entire staff).

   List 16.4.1. Ingredients of a good grant application in the field of biomedical informatics (in order of descending importance).

1   A solid, credible set of specific aims (absolutely crucial)
2   A known track record in the general area (usually determines
      who in the research team will be named the PI)
3   Preliminary data
4   A respected institutional infrastructure supporting the
5   Collaborators who have the ability to provide a
      translational component
6   Statistical expertise sufficient to convince the study
      section that the project is well-designed and that the data
      analysis will be unimpeachable
7   A clear understanding of data sharing and human subjects
      issues related to the project
8   A sense of where the project will go after the
      initial funding period
9   A sense of where the project complements other current
      efforts in the same field
10   Good communication with a bright and competent Program
11  An important idea.

   List 16.4.2. An ancient observation that success falls occasionally on the undeserving.

1   The race is not to the swift,
2   nor the battle to the strong,
3   nor bread to the wise,
4   nor riches to the intelligent,
5   nor favor to the men of skill;
6   but time and chance happen to them all.

   List 16.4.3. Suggestions for researchers.

1   When an investigator submits a work for publication, where
      the data is derived from patient records, the Methods
      section should include a description of the steps taken to
      minimize patient risks, and should document that the
      IRB reviewed the research proposal. When these items are
      missing from a paper, editors and reviewers should feel free
      to ask authors to supply this information
2   When an investigator submits a grant application
      (particularly an application to a U.S. Federal Agency), a
      detailed strategy for protecting human subjects from
      research risks is required. Investigators should be aware
      that research using patient records is human subject
      research. Investigators should also be aware that current
      U.S. Federal Guidelines call for the inclusion of
      minorities, women and children in clinical studies, unless
      there is a good reason for excluding them from the study
      population. For the purpose of satisfying federal inclusion
      guidelines, most agencies consider studies based on patient
      records to be clinical studies. A statement describing the
      inclusion of minorities, women and children will be, in most
      circumstances, a requirement for would-be biomedical
      informaticians who seek federal funding
3   Human subjects issues are a legitimate area of research for
      biomedical informaticians. Novel protocols for achieving
      confidentiality and security while performing increasingly
      ambitious studies (distributed network queries across
      disparate databases, extending the patient's record to
      collect rich data from an expanding electronic medical
      record, linking patient records to the records of relatives
      or probands, peer-to-peer exchange of medical data) will be
      urgently needed by the data mining community. Data miners
      would do well to stay abreast of regulations controlling the
      use of medical data so that they can develop
      regulation-compliant protocols for data mining activities
4   Anonymized data, by definition, cannot be linked to
      patients. Therefore, there is no legal or ethical reason to
      withhold anonymized datasets from the public. Quite the
      opposite. Anonymized datasets have enormous value to other
      researchers who can merge your data with theirs, derive new
      ways of analyzing your data, or develop new questions that
      can be addressed by your dataset. Researchers who have
      created anonymized datasets should seriously consider
      publishing their data as a primary resource or as a
      secondary resource attached to any publication that results
      from the research project. Many journals and on-line
      publication services (such as PubMed Central and BioMed
      Central) encourage authors to submit their datasets as
      publication attachments. Issues of intellectual property
      affecting the investigator and the institution (e.g.,
      ownership, licensing of data, derivative work
      "reach-through") have accumulated very little
      legal precedent
5   Funding agencies often have grandiose hopes for their
      research initiatives. They are willing to pay large sums of
      taxpayer money to support ambitious research projects. With
      few exceptions, large research projects create complexity.
      Historically, projects whose chief goal is to simplify data
      resources or reduce software complexity are seldom funded. I
      would encourage investigators to persevere. When complexity
      has become an impediment to biomedical progress,
      investigators must provide a convincing description of the
      problem, along with a clear explanation of the benefits
      derived when complexity is reduced. A savvy study section
      may be receptive to an application that contains a credible
      set of goals for reducing complexity.

   List 17.1.1. Sources of ethical challenges for biomedical informaticians.

1   HIPAA regulations
2   IRB approvals
3   Flawed de-identification methods
4   Problematic network security protocols
5   Data sharing requirements from funding agencies
6   Contractual IP arrangements with employer
7   Conflicts of interest
8   Hidden patent violations
9   Unanticipated lawsuits

   List 17.2.1. Conditions under which it may be OK to lie. All conditions must hold.

1   The lie protects a human subject for whom you have a
      fiduciary responsibility
2   You are certain that the lie will not harm another human
3   You do not personally benefit from the lie
4   The lie is not an included component in a plan to mislead
5   There is no honest way of protecting the patient that does
      not involve lying
6   The liar is willing to accept any negative consequences that
      may ensue.

   List 17.3.1. HIPAA: 164.512 Uses and disclosures for which consent, an authorization, or opportunity to agree or object is not required.

2   (1) Permitted uses and disclosures. A covered entity may use
      or disclose protected health information for research,
      regardless of the source of funding of the research,
      provided that:
3   (iii) Research on decedent's information. The covered entity
      obtains from the researcher:
4   (A) Representation that the use or disclosure sought is
      solely for research on the protected health information of
5   (B) Documentation, at the request of the covered entity, of
      the death of such individuals; and
6   (C) Representation that the protected health information for
      which use or disclosure is sought is necessary for the
      research purposes.

   List 17.3.2. Suggested research purposes of an autopsy database integrated with the electronic medical records of the deceased patients.

1   Validating the sensitivity and specificity of new diagnostic
      tests, including imaging techniques and new medical devices
2   Determining the extent of disease, at death, of persons
      enrolled in clinical trials, as a measurement of response to
      different treatment protocols
3   Documenting adverse effects of medications administered
      during the patient's life
4   Correlating autopsy findings with pharmacogenomic databases,
      attributing diseases found at autopsy with specific gene

   List 17.4.1. Questions that researchers should ask before purchasing a software license.

1   Will colleagues who want to review my work be required to
      buy their own version of the software?
2   Can the software be modified by the licensee?
3   Will the software export files or data objects in a standard
      format that can be exchanged with and used by colleagues who
      use other software applications?
4   Will the software export files or data objects that can be
      publicly distributed (as in supplementary files distributed
      with a manuscript), or publicly displayed (as on a web
5   Does the software license contain reach-through clauses? A
      reach-through is a legal device through which licensees must
      pay royalties on intellectual property produced with the
      help of the software.

   List 17.9.1. Reasons for refusing to share research data.

1   The data is confidential and it would be unethical to
      release the data to the public
2   Members of the public may misunderstand the data
3   Competitors may purposely misinterpret the data to dispute
      the validity of the work
4   Preparing the data in a format that can be distributed to
      the public (e.g., the internet) requires time, effort,
      expertise and money that could be better spent on research
5   The data owner wishes to control the direction of research
      in her specific area, and her control would be lost if
      everyone in the field had access to the data
6   The data owner is offended that anyone would question her
      integrity by asking to review the primary data
7   The data is integral to another manuscript that has not yet
      been published. Distributing the data would violate the
      publisher's ban on pre-publication release of research

   List 17.9.2. When is data sharing important?

1   When the data contributes one piece of a planned multi-part
      research effort towards which many different laboratories
2   When the data can contribute towards other research efforts
3   When the validity of the assertions drawn from the data are
4   When the validity of the data itself is doubted
5   When there is reason to believe that the data can be
      re-examined to yield additional results.

   List 17.10.1. Excerpted from 164.514 Other requirements relating to uses and disclosures of protected health information.

1   (1) A person with appropriate knowledge of and experience
      with generally accepted statistical and scientific
      principles and methods for rendering information not
      individually identifiable:
2   (i) Applying such principles and methods, determines that
      the risk is very small that the information could be used,
      alone or in combination with other reasonably available
      information, by an anticipated recipient to identify an
      individual who is a subject of the information; and
3   (ii) Documents the methods and results of the analysis that
      justify such determination;

   List 17.10.2. Excerpted from Section 164.512 Uses and disclosures for which consent, an authorization, or opportunity to agree or object is not required.

1   (ii) Waiver criteria. A statement that the IRB or privacy
      board has determined that the alteration or waiver, in whole
      or in part, of authorization satisfies the following
2   (A) The use or disclosure of protected health information
      involves no more than minimal risk to the individuals;
3   (B) The alteration or waiver will not adversely affect the
      privacy rights and the welfare of the individuals;
4   (C) The research could not practicably be conducted without
      the alteration or waiver;
5   (D) The research could not practicably be conducted without
      access to and use of the protected health information;
6   (E) The privacy risks to individuals whose protected health
      information is to be used or disclosed are reasonable in
      relation to the anticipated benefits if any to the
      individuals, and the importance of the knowledge that may
      reasonably be expected to result from the research;
7   (F) There is an adequate plan to protect the identifiers
      from improper use and disclosure;
8   (G) There is an adequate plan to destroy the identifiers at
      the earliest opportunity consistent with conduct of the
      research, unless there is a health or research justification
      for retaining the identifiers, or such retention is
      otherwise required by law; and
9   (H) There are adequate written assurances that the protected
      health information will not be reused or disclosed to any
      other person or entity, except as required by law, for
      authorized oversight of the research project, or for other
      research for which the use or disclosure of protected health
      information would be permitted by this subpart.

   List 17.13.1. Ethical questions for biomedical informaticians building tissue biorepositories.

1   Does the use of the tissues for research purposes deprive
      patients of material that may have importance to their
      current or future medical care?
2   Were the patients informed that their tissues may be used
      for research purposes that could result in commercial
      products, and that they (the patients) will not share in any
      resultant profits?
3   Are patients fully protected from any harms that may result
      from the research on their tissues?
4   Is the data collected from patient charts de-identified and
      is it compliant with the "Minimum necessary" policy
      (see Glossary) described in HIPAA (see Glossary)?
5   Are the anticipated revenues from research so large as to be
      onerous or otherwise conflicting with the best interests of
      the patients whose tissues are used in the research?

   List 17.14.1. Some physician rejoinders to the opening phrase, "I would describe the HIPAA Privacy Rule as:"

1   A disaster for future of medical practice and a windfall for
      trial attorneys
2   A disingenuous scam and a ridiculous waste of time and money
3   A further burden on physicians who are already having
      trouble surviving
4   A joke, not really privacy. Maybe impossible to actually
      comply with
5   A major hassle; no protection for patients' privacy
6   A Trojan horse
7   A way not to pay doctors
8   A worthless but potentially damaging body of legislation
9   Another example of bureaucrats justifying their existence
10   Another governmental intervention to make medicine

   List 17.14.2. Expectations of HIPAA privacy act prosecutions.

1   Prosecutions would not be frequent (the August 19, 2004
      conviction was the first criminal conviction under the HIPAA
      regulations that went into effect 16 months earlier)
2   Criminal prosecutions would apply to cases where patients
      suffered actual harm as the result of a willful HIPAA
      violation (as in the first reported criminal prosecution)
3   Hospitals would not be subject to frivolous prosecutions
      for minor HIPAA violations that did not result in harm to

   List 17.16.1. Examples of "reasonableness" standard applied within HIPAA.

1   Statutory Background... The security standard authority
      applies to both the transmission and the maintenance of
      health information, and requires the entities described in
      section 1172(a) to maintain reasonable and
      appropriate safeguards to ensure the integrity and
      confidentiality of the information, protect against
      reasonably anticipated threats or hazards to the security or
      integrity of the information or unauthorized uses or
      disclosures of the information, and to ensure compliance
      with part C by the entity's officers and employees
2   Sec. 164.502(g)... We proposed to define ``individually
      identifiable health information'' to mean information that
      is a subset of health information, including demographic
      information collected from an individual, and that:
3   (1) Is created by or received from a health care provider,
      health plan, employer, or health care clearinghouse; and
4   (2) Relates to the past, present, or future physical or
      mental health or condition of an individual, the provision
      of health care to an individual, or the past, present, or
      future payment for the provision of health care to an
      individual, and
5   (i) Which identifies the individual, or
6   (ii) With respect to which there is a reasonable basis to
      believe that the information can be used to identify the
7   Section 164.504(e)--Business Associates... A covered entity
      would have been in violation of this rule if the covered
      entity knew or reasonably should have known of a material
      breach of the contract by a business associate and it failed
      to take reasonable steps to cure the breach or terminate the

   List 17.17.1. Closing platitudes.

1   Insisting that something is true does not make it true.
      (Don't let people intimidate you into believing anything)
2   All biomedical informatics research is human subjects
      research and all human subjects must be protected from harm.
      (This feature distinguishes biomedical informatics from
      bioinformatics and reminds us that biomedical data comes
      from patients who place their trust in us.)
3   Get funded so that you can do research, don't do research so
      that you can get funded. (Getting funded is not an
      achievement. Funding is a societal contract in which the
      investigator promises that she will achieve something.)
4   The good fight is the fight that you lose and lose and lose
      until you win. (Don't worry about losing. Just worry about
      working toward an important goal.)
5   Life is short but art is long. (This was written by
      Hippocrates around 400 BC and refers specifically to the art
      of medicine. Advances in biomedical informatics will endure
      beyond our short lives.)

   List 19.5.1. Ruby script to read lines from a file.

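      #!/usr/local/bin/ruby
      # READsome.rb, reads 300 lines from a big file
      f = File.open("big")
      outf = File.open("bigout.out", "w")
      count = 0
      while count < 300
        line = f.gets          # read each line once
        break if line.nil?     # stop early if the file has fewer than 300 lines
        STDOUT.puts line       # echo the line to the screen
        outf.puts line         # and copy the same line to the output file
        count = count + 1
      end
      f.close
      outf.close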
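
The same task can also be sketched with Ruby's block-based file methods, which close both files automatically and stop cleanly when the input holds fewer lines than requested. The helper name copy_head, its parameters, and the temporary demo file are assumptions made for illustration; they are not part of the book's listing.

```ruby
require "tmpdir"

# Hypothetical helper: copy the first `limit` lines of `src` to `dest`.
# Returns the number of lines actually copied.
def copy_head(src, dest, limit = 300)
  copied = 0
  File.open(dest, "w") do |out|
    File.foreach(src) do |line|
      break if copied >= limit   # stop once enough lines have been copied
      out.write(line)            # line still carries its trailing newline
      copied += 1
    end
  end
  copied
end

# Usage: build a small 5-line file and copy its first 3 lines.
Dir.mktmpdir do |dir|
  src  = File.join(dir, "big")
  dest = File.join(dir, "bigout.out")
  File.write(src, (1..5).map { |i| "line #{i}\n" }.join)
  puts copy_head(src, dest, 3)
  print File.read(dest)
end
```

Returning the count lets the caller detect a short input file without a separate check.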

   List 19.5.2. Ruby script to scrub an input line.

      #!/usr/local/bin/ruby
      f = File.open("doubdb.txt")
      outf = File.open("scrub.out", "w")
      # load the approved two-word phrases, one per line, into a hash
      doubhash = Hash.new
      while line = f.gets
        line = line.chomp
        doubhash[line] = " "
      end
      f.close
      puts "What would you like to scrub?"
      # normalize the input: lowercase, drop possessives and punctuation
      line = gets.chomp.downcase
      line = line.gsub(/\'s/, '')
      line = line.gsub(/[^\w\s]/, ' ')
      line = line.gsub(/ +/, ' ')
      linearray = line.split
      arraysize = linearray.length - 2
      lastword = "*"
      # test each adjacent word pair; words outside approved pairs print as "*"
      for arrayword in (0 .. arraysize)
        phrase = linearray[arrayword] + " " + linearray[arrayword+1]
        if doubhash.key?(phrase)
          print " " + linearray[arrayword]
          lastword = " " + linearray[arrayword+1]
        else
          print lastword
          lastword = " *"
        end
        if arrayword == arraysize
          print lastword
        end
      end
      outf.close
      exit
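
The pair-matching idea in the listing above can be packaged as a function that takes the approved two-word phrases as an argument, which makes it easy to try without a doubdb.txt file or keyboard input. This is a minimal sketch, not the book's code: the function name scrub, the use of a Set, and the sample phrases are assumptions made for illustration. A word is kept only when it belongs to at least one approved adjacent pair; every other word becomes "*".

```ruby
require "set"

# Hypothetical helper: keep words that belong to an approved two-word
# phrase and replace all other words with "*".
def scrub(text, doublets)
  # Normalize as the listing does: lowercase, drop possessives, strip punctuation.
  words = text.downcase.gsub(/'s/, "").gsub(/[^\w\s]/, " ").split
  keep = Array.new(words.length, false)
  # Mark both words of every adjacent pair found in the approved set.
  words.each_cons(2).with_index do |(a, b), i|
    if doublets.include?("#{a} #{b}")
      keep[i] = true
      keep[i + 1] = true
    end
  end
  words.each_with_index.map { |w, i| keep[i] ? w : "*" }.join(" ")
end

# Usage with a tiny, made-up phrase list.
approved = Set["was admitted", "admitted for", "for chest", "chest pain"]
puts scrub("John Smith was admitted for chest pain.", approved)
# prints "* * was admitted for chest pain"
```

Because the names "john" and "smith" never appear in an approved pair, they are masked while the clinical phrase passes through unchanged.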

Page last updated May 29, 2014, Jules Berman
