Biomedical books by Jules J. Berman, Ph.D., M.D.


Biomedical Informatics Lists


This page was composed by a Perl script that extracted the lists from the book document and formatted each list as an HTML table. Some of the original book formatting may have been lost in the process, particularly in Perl scripts appearing in lists. This page should therefore be considered a somewhat flawed representation of the actual book material. Its purpose is to give readers an indication of the kinds of subject matter explained in the narrative sections of Biomedical Informatics.

- Jules Berman, December 7, 2006

   List 0.0.0. What every determined reader will learn from this book.

1   How to acquire and organize biomedical data even when the
      data is received in the form of unstructured text
2   How to merge and share biomedical data even when the data 
      is confidential or comes from seemingly incompatible 
      sources
3   How to write your own programs in Perl that will allow you
      to perform common informatics tasks with just a few lines
      of code
4   How to automatically index biomedical text and code text
      using freely available biological and medical nomenclatures
5   How to use metadata to provide structure and meaning to 
      biomedical datasets
6   How to use confidential medical data while obeying current
      law and protecting patients
7   How to reduce the complexity of biomedical data and
      biomedical software
8   How to evaluate ethical problems related to intellectual
      property, privacy, human subjects research (see Glossary),
      data sharing (see Glossary), and software development.


   List 0.0.1. People who will benefit from reading this book.

1   Bioinformaticians
2   Biomedical scientists
3   Clinical trialists
4   Computer scientists who need cross-over skills in the
      biomedical sciences
5   Government officials at any of the health-related federal
      agencies
6   Healthcare graduate students and professionals who use large
      biomedical datasets, who need to have data/software
      interoperability, or who need to comply with federal, state,
      or institutional data requirements
7   Hospital staff, including medical students, physicians,
      nurses, technicians, hospital administrators, information
      officers
8   Lawyers who handle intellectual property (see Glossary)
      cases related to biomedicine
9   Library scientists
10   Medical ethicists
11  Medical software developers and vendors
12  Medical transcriptionists
13  Members of IRBs (Institutional Review Boards, see Glossary)
    and Privacy Boards (see Glossary)
14  Privacy experts who work with medical scientists


   List 1.1.1. Roles of the biomedical informatician.

1   Biologist
2   Healthcare professional
3   Lawyer
4   Software programmer
5   Computer scientist
6   Cryptographer
7   Metadata expert
8   Linguist
9   Statistician
10   Diplomat


   List 1.2.1. Pre-1955 biomedical advances resulting in increased longevity in many developed countries.

1   Antisepsis
2   Refrigeration of food
3   Standards for the hygienic preparation of food
4   Eradication of insect vectors for yellow fever and malaria
5   Potable public drinking water
6   Antibiotics effective against many bacterial infections
      including syphilis, gonorrhea and tuberculosis
7   Vaccines against smallpox and polio
8   The virtual elimination of iodine-deficiency associated
      goiter
9   The near elimination of vitamin deficiency diseases
10   The marked reduction of cervical cancer in women thanks to 
      cytologic screening of cervical smears
11  The prevailing blood tests and quantitative blood cell
    analyses used to monitor deviations from normal function
12  The correction of diabetic hyperglycemia with insulin
13  The introduction of radiologic imaging
14  The treatment of hypertension with a large variety of
    effective drugs
15  The recognition of the association between cigarette use and
    cancer
16  The role of diet and of cigarettes in the progression of
    vascular diseases.


   List 1.2.2. Medical setbacks since 1955.

1   The global spread of AIDS
2   Diminished access to potable water for much of the world's
      population
3   The emergence of multiple antibiotic resistant strains of S.
      aureus and other previously treatable organisms
4   Increased number of cancer patients due primarily to an
      absolute increase in the number of senior citizens at
      highest risk for cancer
5   The re-emergence of tuberculosis
6   The re-emergence of insect and other vectors carrying viral
      and parasitic diseases
7   The astronomical costs of new, effective medications for
      chronic diseases, including cancer
8   High quality, long-term health care attainable for only a
      small fraction of the earth's population
9   The rising incidence of obesity and sequelae disorders,
      worldwide
10   The rapid geographic spread of outbreaks of new strains of
      influenza and other evolving viruses, including HIV and
      hemorrhagic fever viruses
11  The threat of destructive and pathogenic species of plants,
    insects and animals that have been introduced to new
    habitats through acts of human negligence or error
12  Weakening of the earth's ozone layer, increasing human
    exposure to ultraviolet radiation
13  The political uses of toxic agents, endemic diseases, and
    public health infrastructures.


   List 1.3.1. Immediate consequences of Semmelweis' prevention of puerperal fever deaths.

1   The medical students were opposed to being forced to wash
      their hands
2   Semmelweis' superior, Johann Klein, was likewise opposed,
      considering the clinical trial a criticism of his
      performance
3   Other obstetricians agreed that Semmelweis' measures were an
      attack on their professional conduct
4   The maternity patients were opposed as well, interpreting
      sanitary measures as a criticism of their personal
      hygiene


   List 1.3.2. Beliefs held by biomedical informaticians.

1   Medical progress requires the integration of biological data
      and clinical data
2   Aggregate clinical data has value beyond its use in guiding
      the treatment of individual patients
3   Researchers need methods to acquire clinical data without
      harming patients
4   To be useful, biological and clinical data need to be
      organized in a standard manner that permits seamless data
      integration (see Glossary)
5   Classifications (see Glossary) drive down the complexity of
      clinical and biological data
6   Important new testable hypotheses may derive from
      pre-existing biological and clinical datasets, but only if
      the datasets are made available to scientists
7   The primary data that supports scientific assertions should
      be made publicly available, whenever feasible
8   Data analysis is an inexpensive science, particularly if you
      know how to program.


   List 1.4.1. Some important bottlenecks in translational research.

1   Access to clinically annotated tissues collected from human
      subjects
2   Access to electronic medical records and other electronic
      archives of human clinical data
3   Methods to organize data in a manner that permits the data
      to be meaningful and comparable from laboratory to
      laboratory and institution to institution
4   Methods to draw clinically valid conclusions from large
      datasets containing heterogeneous types of data (e.g.
      molecular data and clinical test data).


   List 1.5.1. Basic skills and activities in biomedical informatics.

1   An understanding of a computer's file and subdirectory
      system
2   The ability to download, install and use popular software
      applications and utilities
3   An awareness of the differences between structured and
      unstructured data
4   Basic understanding of XML (see Glossary) and metadata
      annotation
5   Basic appreciation of computer algorithms
6   Some familiarity with data privacy rules and how these rules
      relate to the research uses of medical data. Most countries
      have such privacy regulations for biomedical data. In the
      U.S., this would be the HIPAA privacy rules, and in the
      United Kingdom, it would be the Data Protection Act
7   A general understanding of concepts of medical record
      de-identification
8   Familiarity with the publicly available biological search
      engines, databases and tools, including PubMed and
      GenBank.


   List 1.5.2. Advanced skills and activities in biomedical informatics.

1   Programming at a moderate level in at least one programming
      language
2   Experience choosing and implementing a laboratory or
      hospital information system
3   Knowledge of regulations pertaining to the use of identified
      medical data in research
4   Participation in an effort seeking FDA approval for a device
      or technology developed from a biomedical informatics effort
5   Participation on a standards committee
6   Intermediate level understanding of XML
7   Basic understanding of RDF (see Glossary)
8   Experience as a member of an IRB or Privacy Board
9   Competing for funding for a biomedical informatics grant
      (see Glossary) or contract.


   List 1.7.1. Steps in gold mining (or data mining).

1   Physical access to mine
2   Legal rights of access to mine
3   Acquire tools to find desired items in mine
4   Acquire tools to extract desired items in mine
5   Acquire tools to refine desired items
6   Acquire tools to certify the purity and quantity of desired
      items
7   Transform the desired items into a standard format
8   Transport the desired items to an intended recipient
9   Arrange payment for the desired items
10   Store the desired items
11  Protect the stored items.


   List 1.8.1. Realistic uses of Biomedical Informatics.

1   Store, share, search, retrieve and analyze heterogeneous
      data sources. This entire process is vastly enhanced by our
      current ability to send any type of data anywhere, at any
      time, cheaply
2   Create large comprehensive databases (millions of cases)
      that allow you to ask questions that could not be asked of
      small or non-comprehensive databases
3   Drive down the complexity of biomedical data by using data
      specifications (see Glossary) and classifications
4   Track data collected in hospital information systems and
      dispatch automatic clinical alerts when data values fall
      outside an expected range of behavior or when values violate
      the expected properties of data classes
5   Develop new hypotheses by examining and correlating
      biological and clinical observations
6   Validate new clinical tests and treatments by examining the
      correlations between test values, treatment choices and
      clinical outcomes.
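
Item 4 of the list above (automatic alerts for out-of-range values) can be sketched in a few lines. This is a minimal illustration written in Python rather than the book's Perl. The test names, the numeric limits and the check_value function are all hypothetical placeholders, not values taken from the book or from any clinical reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExpectedRange:
    low: float
    high: float


# Hypothetical test names and limits, for illustration only.
EXPECTED_RANGES = {
    "potassium_mmol_per_l": ExpectedRange(3.5, 5.0),
    "hemoglobin_g_per_dl": ExpectedRange(12.0, 17.5),
}


def check_value(test_name: str, value: object) -> Optional[str]:
    """Return an alert message, or None when the value needs no alert."""
    expected = EXPECTED_RANGES.get(test_name)
    if expected is None:
        return f"ALERT {test_name}: no expected range is defined"
    # A value of the wrong type violates the expected properties of its data class.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return f"ALERT {test_name}: non-numeric value {value!r}"
    if not expected.low <= value <= expected.high:
        return f"ALERT {test_name}: {value} is outside {expected.low}-{expected.high}"
    return None
```

Calling check_value("potassium_mmol_per_l", 6.8) returns an alert string, while a value of 4.2 returns None. A production system would also need units, patient context and a delivery mechanism for the alert, none of which is modeled here.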


   List 1.8.2. Unrealistic uses of Biomedical Informatics.

1   Replace physicians with computers. Doctors are trained to
      make diagnoses, and they don't desire or use software that
      purports to do this for them
2   Create superdoctors through the use of computer tools. The
      practice of medicine is learned through personal
      experiences. Doctors do not need simulations of reality
3   Vastly improve upon books and traditional teaching
      strategies. Books are an adequate method of conveying
      knowledge. Computers can certainly provide some improvements
      to book learning, but there is no reason to think that a
      system of learning based on printed literature, that works
      perfectly fine, can be vastly improved
4   Solve subtle or complex problems via the use of medical
      ontologies (see Glossary). Complex systems are inherently
      chaotic, and inferences reached through a logical ontology
      modeling a complex system are likely to be misleading
5   Create, within the next decade, comprehensive medical
      records for all U.S. citizens that can be accessed and
      annotated by all authorized care-givers. This holy grail of
      U.S. medical informatics is a worthy long-term pursuit, but
      there is no reason to expect that it can be achieved within
      a decade or even two decades.


   List 2.2.1. When are databases particularly useful?

1   When the stored data is complex (e.g., hospitals and
      academic centers)
2   When the basic data structure is constant (i.e., when the
      model of the data records does not change)
3   When there are continuous real-time additions, deletions and
      modifications of records by multiple users
4   When the computer staff (responsible for the data) prefers
      to work with databases.


   List 2.2.2. When are data files particularly useful?

1   When the dataset is relatively stable
2   When the data structure is relatively unstable (i.e., when
      the fundamental model of the data records changes)
3   When XML is the native method of data representation
4   When the computer staff (responsible for the data) prefers
      to work with data files.


   List 2.3.1. Three properties of reality relevant to hospital databases.

1   Database records can be designed in such a manner as to
      corrupt the integrity of the database
2   Databases do not care if their integrity is corrupted
3   Modifications to the basic structures of database records
      almost always have negative (sometimes catastrophic)
      consequences.


   List 2.3.2. Common weaknesses of some hospital databases.

1   Inability to guarantee that every patient is uniquely
      identified within the database
2   Inability to classify types of data into groups with shared
      properties
3   Inability to extend data records to include data elements
      linked to other databases
4   Inability to organize data as simple collections of
      meaningful statements
5   Inability to produce self-describing data records (i.e.,
      including in data records all the data necessary to fully
      describe the meaning of the data record)


   List 2.3.3. Desiderata for hospital information systems.

1   Every patient must be uniquely identified within the system
2   Every report must be uniquely identified and associated with
      one patient
3   Data items contained in reports must be entered into reports
      once only
4   Data items must be well-defined and used in a consistent
      manner throughout the system
5   Data values must be bound to a unique identifier (see
      Glossary) and associated with a unique report
6   All data entered should be technically retrievable
7   Someone must have the authority to retrieve any and all data
      in the hospital information system
8   Data, once entered, should not be corrected or modified in
      any way without creating a visible transaction record of the
      modification
9   All electronic data related to an electronic record in the
      hospital information system should be included in the
      hospital information system.
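
Item 8 of the list above (no modification without a visible transaction record) lends itself to a short sketch. The example is in Python rather than the book's Perl, and the AuditedRecord class, its field names and the sample identifiers are hypothetical. It shows one possible way to make every change leave a trace, not a description of how any particular hospital information system works.

```python
from datetime import datetime, timezone


class AuditedRecord:
    """A data record whose modifications always append a transaction entry."""

    def __init__(self, record_id, values):
        self.record_id = record_id      # unique identifier bound to the data
        self._values = dict(values)     # current values
        self.transactions = []          # visible, append-only history

    def get(self, field):
        return self._values[field]

    def modify(self, field, new_value, user):
        # Log the old and new values before applying the change.
        self.transactions.append({
            "record_id": self.record_id,
            "field": field,
            "old": self._values.get(field),
            "new": new_value,
            "user": user,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        self._values[field] = new_value


# Hypothetical report and user identifiers.
report = AuditedRecord("report-0001", {"diagnosis": "adenoma"})
report.modify("diagnosis", "adenocarcinoma", user="pathologist-17")
```

After the call to modify, report.transactions holds one entry that preserves the original value "adenoma", so the correction stays visible to anyone who reviews the record.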


   List 2.11.1. General classes of patents.

1   Utilities - new and useful methods, machines, items, or
      chemical compounds
2   Designs - a new appearance for a manufactured article
3   Plants - the invention or discovery of a plant variety that
      can be asexually reproduced


   List 2.12.1. Copyright Act of 1976, Title 17, U.S. Code, section 107. Limitations on exclusive rights: Fair use.

1   Notwithstanding the provisions of sections 106 and 106A, the
      fair use of a copyrighted work, including such use by
      reproduction in copies or phonorecords or by any other means
      specified by that section, for purposes such as criticism,
      comment, news reporting, teaching (including multiple copies
      for classroom use), scholarship, or research, is not an
      infringement of copyright. In determining whether the use
      made of a work in any particular case is a fair use the
      factors to be considered shall include -
2   (1) the purpose and character of the use, including whether
      such use is of a commercial nature or is for nonprofit
      educational purposes;
3   (2) the nature of the copyrighted work;
4   (3) the amount and substantiality of the portion used in
      relation to the copyrighted work as a whole; and
5   (4) the effect of the use upon the potential market for or
      value of the copyrighted work
6   The fact that a work is unpublished shall not itself bar a
      finding of fair use if such finding is made upon
      consideration of all the above factors.


   List 2.15.1. Tissues that are routinely destroyed by pathology departments.

1   Institutions regularly dispose of tissues removed during
      surgical procedures. When a large specimen, such as a colon,
      is received in a pathology department, samples are routinely
      embedded in paraffin and saved for at least 5 years. The
      unsampled colon (the bulk of the specimen) is saved for
      several weeks, sufficient time to ensure that the
      pathologist has rendered a final diagnosis on the specimen,
      and then the specimen is discarded
2   Institutions regularly dispose of archived paraffin-embedded
      tissues. Most institutions archive paraffin-embedded tissues
      for at least 5 years. At that time, some medical centers
      conclude that the tissues are no longer of any importance to
      the patient. To avoid the expense of continued storage, some
      institutions simply dispose of archived material after 5
      years.


   List 2.15.2. Questions that institutions should ask before transferring tissues and medical records to an external tissue repository.

1   Would the transfer to a third party constitute a sale of
      human tissue?
2   Would the transfer to a third party harm any of the patients
      from whom the tissue was excised?
3   Would the transfer to a third party benefit society?
4   Do any of the institutional staff encouraging the transfer
      of tissues and data have relevant conflicts of interest?


   List 2.16.1. Recent developments that have enhanced access to experimental datasets.

1   Online journals that invite authors to submit data files
2   Editor policies that require the submission of data files
      supporting assertions made in manuscripts
3   Technical ease of storing large datasets on publicly
      available servers
4   Technical ease of downloading large datasets from servers
      via the internet
5   Data sharing requirements issued by biomedical funding
      agencies
6   Expansion of Freedom of Information Act
7   Greater involvement of informaticians in biomedical research
8   Scientific advancements using publicly available datasets
9   Stunning power and scope of publicly available search
      engines, including Google (internet documents) and PubMed
      (medical abstracts)


   List 2.18.1. Some definitions of terms related to the open source movement.

1   Free software: The concept of free software, as popularized
      by the Free Software Foundation, refers to software that can
      be used freely, without restriction, and does not
      necessarily relate to the actual cost of the software. The
      generally acknowledged father of the free software movement
      is Richard Stallman, an MIT visionary who has led an
      energetic and unwavering campaign to create and freely
      distribute some of the most valued software applications in
      use today. The free software movement is similar to the open
      source software movement, but some of the features of free
      software (ability to modify and re-distribute software in a
      prescribed manner as discussed in the software license) are
      not always guaranteed in open source software (see List)
2   Open source - The Open Source Software movement is an
      offspring of the Free Software movement. The reason that the
      open source movement was created was, in part, to placate
      developers who wanted to sell software and felt that the term
      "free", as in "free software movement",
      would be misconstrued by prospective customers to mean that
      the developer requires no remuneration. Although a good deal
      of free software is no-cost software, the intended meaning
      of the term "free" is that the software can be
      used without restrictions. The term "open source"
      obviates the need to draw this distinction. The Open Source
      Initiative posts an open source definition and a list of
      approved open source licenses
3   Open Access - In general, open access applies to text and
      data the same way that open source applies to software. In
      general, open access biomedical data is retrievable (i.e.,
      you can find it by using a PubMed search or through a search
      engine), and once you've found it, you can download it and
      read it. There are several closely-related consensus
      statements on the meaning of open access
4   Open source software license - The Open Source Initiative
      has an approval process for open source licenses. Software
      distributed under an approved license can include a
      declaration that the software is "OSI Certified Open
      Source Software." The GNU copyleft licenses have been
      certified as open source software licenses.


   List 2.19.1. Examples of undifferentiated software.

1   Basic algorithms
2   Fundamental laws of physics, chemistry, mathematics and
      biology
3   Free, cross-platform programming languages
4   TCP/IP internet protocol
5   HTML and XML.


   List 2.19.2. Examples of undifferentiated data.

1   Human genome
2   Standards documents
3   Nomenclatures
4   Biological classification systems.


   List 2.19.3. Examples of differentiated software.

1   Programming languages with special features such as an
      easy-to-use interface or integrated environment, or a
      specialized purpose
2   Neural network programs designed for specific types of data
      input
3   Complex software designed to support commercial devices,
      such as CT-scanners
4   Most hospital information systems and laboratory information
      systems.


   List 2.19.4. Examples of differentiated data.

1   Lexis/Nexis and other legal databases
2   Subscription journals
3   Codes for billable procedures
4   Science Citation Index
5   Chemical Abstracts (R) database.


   List 2.20.1. A few of the human databases that have been described in the Nucleic Acids Research Database Issue.

1   Androgen Receptor Gene Mutations Database
2   Atlas of Genetics and Cytogenetics in Oncology and
      Haematology
3   BGED - Brain Gene Expression Database
4   Cancer Chromosomes
5   Cancer gene databases
6   CGED - Cancer Gene Expression Database
7   Collagen Mutation Database
8   COSMIC - Catalogue Of Somatic Mutations In Cancer
9   Cypriot national mutation database
10  Cytokine Gene Polymorphism Database
11  Cytokine Gene Polymorphism in Human Disease
12  Database of Genomic Variants
13  Database of Germline p53 Mutations
14  EICO DB - Expression-based Imprint Candidate Organiser
15  EpoDB - Erythropoiesis Database
16  ERGDB - Estrogen Responsive Genes Database
17  Gene-, system- or disease-specific databases
18  General polymorphism databases
19  GOLD.db - Genomics Of Lipid-associated Disorders
20  GRAP Mutant Databases
21  HAGR - Human Ageing Genomic Resources
22  HCAD - Human Chromosome Aberration Database
23  HemoPDB - Hematopoietic Promoter Database
24  HERVd - Human Endogenous Retrovirus database
25  HGMDr - Human Gene Mutation Database
26  HORDE - Human Olfactory Receptor Data Exploratorium
27  HPMR - Human Plasma Membrane Receptome
28  Human p53, human hprt, rodent lacI and rodent lacZ databases
29  Human PAX2 Allelic Variant Database
30  Human PAX6 Allelic Variant Database
31  IARC TP53 Database
32  Imprinted Gene Catalogue
33  IPD - Immuno Polymorphism Database
34  Lowe Syndrome Mutation Database
35  MTB - Mouse Tumor Biology Database
36  NCL Mutation Database
37  OMIM - Online Mendelian Inheritance in Man
38  Oral Cancer Gene Database
39  PTCH1 Mutation Database
40  RB1 Gene Mutation Database
41  RTCGD - Retroviral Tagged Cancer Gene Database
42  SNP500Cancer
43  SV40 Large T-Antigen Mutant Database
44  T1DBase - Type 1 Diabetes Database
45  The Autism Chromosome Rearrangement Database
46  The Lafora Database
47  The SNP Consortium database
48  TPMD - Taiwan polymorphic microsatellite marker database
49  Tumor Gene Family Databases (TGDBs)


   List 2.23.1. A record in Taxonomy.

1   ID : 50
2   PARENT ID : 49
3   RANK: genus
4   GC ID : 11
5   SCIENTIFIC NAME : Chondromyces
6   SYNONYM : Polycephalum
7   SYNONYM : Myxobotrys
8   SYNONYM : Chondromyces Berkeley and Curtis 1874
9   SYNONYM : "Polycephalum" Kalchbrenner and Cooke
      1880
10   SYNONYM : "Myxobotrys" Zukal 1896
11  MISSPELLING : Chrondromyces
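
A record laid out as "FIELD : value" lines, like the one above, is easy to load into a data structure. The following is a minimal sketch in Python rather than the book's Perl. The parse_record function is a hypothetical helper, and the record text is retyped from the list with the list numbers removed and the wrapped line rejoined. Because a field such as SYNONYM can repeat, each field name maps to a list of values.

```python
from collections import defaultdict

RECORD = """\
ID : 50
PARENT ID : 49
RANK: genus
GC ID : 11
SCIENTIFIC NAME : Chondromyces
SYNONYM : Polycephalum
SYNONYM : Myxobotrys
SYNONYM : Chondromyces Berkeley and Curtis 1874
SYNONYM : "Polycephalum" Kalchbrenner and Cooke 1880
SYNONYM : "Myxobotrys" Zukal 1896
MISSPELLING : Chrondromyces
"""


def parse_record(text):
    """Map each field name to the list of values it takes in the record."""
    fields = defaultdict(list)
    for line in text.splitlines():
        # Split on the first colon only; skip lines that have none.
        name, sep, value = line.partition(":")
        if not sep:
            continue
        fields[name.strip()].append(value.strip())
    return dict(fields)


record = parse_record(RECORD)
```

Here record["PARENT ID"] is ["49"], which is the link a program would follow to walk up the hierarchy, and record["SYNONYM"] holds all five synonyms.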


   List 3.2.1. The types of human subject research risks.

1   The risk to life and health as a direct result of a medical
      intervention
2   The risk of loss of database functionality
3   The risk of loss of confidentiality resulting from
      participation in a medical study
4   The risk of loss of privacy resulting from participation in
      a medical study.


   List 3.8.1. Confidentiality issues for biomedical informaticians.

1   Demonstrating to the hospital's IRB (see Glossary) that the
      chosen methodology for anonymizing or de-identifying records
      is safe and reliable
2   Demonstrating to the hospital's IRB and to the hospital's
      information officers that the anonymization and
      de-identification processes can be performed automatically,
      without giving the informatician any access to the primary
      patient record and without opening any HIS vulnerabilities
      when data is transferred out of the system.


   List 3.9.1. Exemption 4 (E4) permitting unconsented research on de-identified medical records.


   List 3.9.2. Section 164.502(f) of the HIPAA Privacy Rule -- Deceased Individuals.

1   We proposed to extend privacy protections to the protected
      health information of a deceased individual for two years
      following the date of death. During the two-year time frame,
      we proposed in the definition of ``individual'' that the
      right to control the deceased individual's protected health
      information would be held by an executor or administrator,
      or other person (e.g., next of kin) authorized under
      applicable law to act on behalf of the decedent's estate.
      The only proposed exception to this standard allowed for
      uses and disclosures of a decedent's protected health
      information for research purposes without the authorization
      of a legal representative and without the Institutional
      Review Board (IRB) or privacy board approval required (in
      proposed Sec. 164.510(j)) for most other uses and
      disclosures for research
2   In the final rule (Sec. 164.502(f)), we modify the standard
      to extend protection of protected health information about
      deceased individuals for as long as the covered entity
      maintains the information. We retain the exception for uses
      and disclosures for research purposes, now part of Sec.
      164.512(i), but also require that the covered entity take
      certain verification measures prior to release of the
      decedent's protected health information for such purposes
      (see Secs. 164.514(h) and 164.512(i)(1)(iii))
3   We remove from the definition of ``individual'' the
      provision related to deceased persons...


   List 3.10.1. Five requirements for de-identifying medical records.

1   De-identification of data fields that specifically
      characterize the patient (name, social security number,
      hospital number, address, age, etc.)
2   Free-text data scrubbing, removing identifiers from the
      textual portion of medical reports
3   Free-text data privatizing, removing any information of a
      private nature that may be contained within the report
4   Rendering the dataset ambiguous, ensuring that patients
      cannot be identified by data records containing a unique set
      of characterizing information
5   Rendering the data non-complementary, ensuring that the data
      cannot be combined with data from other databases or
      from multiple searches of the same database that can lead to
      the identification of records.
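
Requirement 4 of the list above (rendering the dataset ambiguous) suggests a simple check: find the records whose combination of characterizing values occurs only once. The sketch below is in Python rather than the book's Perl. The unique_profiles function, the field names and the sample records are hypothetical, and the check covers only this one requirement, not the other four.

```python
from collections import Counter


def unique_profiles(records, fields):
    """Return the records whose values for `fields` are shared by no other record."""
    def profile(record):
        return tuple(record.get(f) for f in fields)

    counts = Counter(profile(r) for r in records)
    return [r for r in records if counts[profile(r)] == 1]


# Invented sample data, for illustration only.
records = [
    {"age": 47, "zip3": "212", "diagnosis": "colon adenocarcinoma"},
    {"age": 47, "zip3": "212", "diagnosis": "colon adenocarcinoma"},
    {"age": 93, "zip3": "212", "diagnosis": "colon adenocarcinoma"},
]
at_risk = unique_profiles(records, ["age", "zip3", "diagnosis"])
```

In this sample, at_risk contains only the third record, whose age makes its profile unique. Such records would need to be generalized (for example, by reporting age in ranges) or withheld before the dataset is released.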


   List 3.12.1. Some possible consequences of Common Rule violations.

1   The loss to the institution of its funding for the grant in
      question
2   The loss to the institution of its Federal Assurance. The
      Office of Human Research Protections issues Assurances
      (currently called Federalwide Assurances or FWAs) to
      institutions that have in-place processes for IRB reviews of
      research and for maintaining research standards. An
      institution must have an assurance registered with OHRP in
      order to receive federal funding for human subjects research
3   An institution-wide suspension of human subject research
      efforts
4   The imposition of grant-related restrictions on the
      investigators (e.g. a prohibition from applying for federal
      grant funding).


   List 3.13.1. Section 1177 of the Act established civil and criminal penalties.

1   Civil Money Penalties. HHS may impose civil money penalties
      on a covered entity of $100 per failure to comply with a
      Privacy Rule requirement. Pub. L. 104-191; 42 U.S.C.
      1320d-5. That penalty may not exceed $25,000 per year for
      multiple violations of the identical Privacy Rule
      requirement in a calendar year. HHS may not impose a civil
      money penalty under specific circumstances, such as when a
      violation is due to reasonable cause and did not involve
      willful neglect and the covered entity corrected the
      violation within 30 days of when it knew or should have
      known of the violation
2   Criminal Penalties. A person who knowingly obtains or
      discloses individually identifiable health information in
      violation of HIPAA faces a fine of $50,000 and up to
      one-year imprisonment. Pub. L. 104-191; 42 U.S.C. 1320d-6.
      The criminal penalties increase to $100,000 and up to five
      years imprisonment if the wrongful conduct involves false
      pretenses, and to $250,000 and up to ten years imprisonment
      if the wrongful conduct involves the intent to sell,
      transfer, or use individually identifiable health
      information for commercial advantage, personal gain, or
      malicious harm. Criminal sanctions will be enforced by the
      Department of Justice.
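
The civil penalty arithmetic in item 1 can be worked through in code. The sketch below is in Python rather than the book's Perl and uses only the two figures quoted in the list ($100 per failure, capped at $25,000 per calendar year for the identical requirement). The civil_penalty function and the requirement labels are hypothetical, and the exceptions described in the list (for example, reasonable cause with timely correction) are not modeled.

```python
PENALTY_PER_FAILURE = 100            # dollars, as quoted in the list
ANNUAL_CAP_PER_REQUIREMENT = 25_000  # dollars per identical requirement per year


def civil_penalty(failures_by_requirement):
    """Total one calendar year's penalty from a {requirement: failure count} map."""
    return sum(
        min(count * PENALTY_PER_FAILURE, ANNUAL_CAP_PER_REQUIREMENT)
        for count in failures_by_requirement.values()
    )


# Hypothetical labels: 300 failures of one requirement hit the cap,
# while 10 failures of another do not.
total = civil_penalty({"requirement_a": 300, "requirement_b": 10})
```

With these inputs the first requirement contributes the capped $25,000 (not 300 x $100 = $30,000) and the second contributes $1,000, so total is 26000.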


   List 3.15.1. Questions related to consent tracking that institutions must be able to answer.

1   Does each consent form have an identifier and a locator, a
      study number, and a data element indicating that the consent
      form itself was approved by an IRB?
2   If needed, could you put your hands on the physical consent
      document?
3   Does your database indicate the specific study for which
      consent was approved?
4   Was the consent form sufficiently detailed, allowing the
      patient to approve certain uses of specimens/data and
      decline other uses?
5   Is each consent tagged with tracking data?
6   Was the consent approved or declined?
7   What day was the consent signed?
8   Does the institution have a policy that applies to
      situations wherein a subject cannot provide an informed
      consent (e.g., infants, patients with dementia)?
9   If the institution has a policy of excluding certain classes
      of patient from providing informed consent, has the
      institution received approval for the policy from its IRB?
10   For children and challenged subjects, was the informed
      consent document signed by a surrogate?
11  For children and challenged subjects, how is it determined
    who may act as a surrogate, and how is the identity of the
    surrogate recorded and tracked?
12  Did the consenting subject change her mind and withdraw
    consent after consent had been approved?
13  If consent was withdrawn, what date did this occur?
14  If consent was withdrawn, was consent withdrawn for a
    particular use of a specimen/data, or for all purposes
    described by the consent document?
15  If consent was withdrawn, does the withdrawal of consent
    apply to more than one consent form?


   List 3.16.1. Advantages of unconsented medical record research.

1   Saves money and time by eliminating the tedious and
      expensive process of obtaining individual consents
2   Sometimes favored by patient advocacy organizations who see
      unconsented research as a way of expediting medical progress
      and improving the chances of survival of the patients in
      their disease constituencies
3   De-identification requirements for most unconsented patient
      record research essentially guarantee that no harm will
      come to the patient
4   De-identified unconsented databases can be shared and used
      for multiple scientific efforts. Consented databases, in
      most cases, can be used only for the purposes specified in
      the consent form
5   De-identified unconsented databases pose no particular
      threat over time to patients. Consented databases often
      contain patient identifiers and may pose a confidentiality
      and privacy threat long after the consented research is
      concluded.


   List 4.1.1. Examples of dealt standards

1   The permitted levels of toxic substances in foods
2   TCP/IP (Transmission Control Protocol/Internet Protocol),
      the internet specification
3   IEEE 802.11, the wireless data transfer standard
4   Longitude and latitude assignments
5   Divisions of time (days, hours, minutes and seconds)
6   Statutes governing medical privacy


   List 4.2.1. Some causes of medical errors in the field of biomedical informatics.

1   Absence of standards (for describing clinical data)
2   Inadequate terminologies
3   Poorly written text
4   Inadequate object identifiers (e.g., identifiers for names,
      tests, reports)
5   Poor interoperability of software tools
6   Poor integration of biomedical databases
7   Poor documentation (of software, of medical devices, of
      protocols)
8   Poor annotation (of medical encounters and transactions)
9   Inadequate data structuring (of reports)
10   Sloppy data representation.


   List 4.2.2. Purposes of data standards.

1   Enhance interoperability of software
2   Enable data integration
3   Increase the efficiency of medical services
4   Increase the speed of medical research
5   Reduce medical errors.


   List 4.3.1. Why governments may choose to avoid creating biomedical standards.

1   Private entities that use a standard may be in the best
      position to create the best possible standard
2   Private entities that use a standard may be willing to pay
      for the standards development process
3   Private entities are more likely to adopt a new standard if
      they had a part in developing the standard
4   Governments may be unwilling to accept the responsibility of
      promoting a new standard
5   Governments know that many standards are never adopted by
      the public and do not want to waste their resources on a
      standard that will be ignored
6   Governments may be reluctant to face criticism for standards
      that may adversely affect certain segments of their
      populations.


   List 4.4.1. Excerpt from RICO that may be applicable to standards developers.

1   "1951. Interference with commerce by threats or
      violence
2   (a) Whoever in any way or degree obstructs, delays, or
      affects commerce or the movement of any article or commodity
      in commerce, by robbery or extortion or attempts or
      conspires so to do, or commits or threatens physical
      violence to any person or property in furtherance of a plan
      or purpose to do anything in violation of this section shall
      be fined under this title or imprisoned not more than twenty
      years, or both
3   (b) As used in this section-
4   (1) The term "robbery" means the unlawful taking
      or obtaining of personal property from the person or in the
      presence of another, against his will, by means of actual or
      threatened force, or violence, or fear of injury, immediate
      or future, to his person or property, or property in his
      custody or possession, or the person or property of a
      relative or member of his family or of anyone in his company
      at the time of the taking or obtaining
5   (2) The term "extortion" means the obtaining of
      property from another, with his consent, induced by wrongful
      use of actual or threatened force, violence, or fear, or
      under color of official right."


   List 4.4.2. Disclaimer against hidden patents within standards

1   "The attention of adopters is directed to the
      possibility that compliance with or adoption of OMG
      specifications may require use of an invention covered by
      patent rights. OMG shall not be responsible for identifying
      patents for which a license may be required by any OMG
      specification, or for conducting legal inquiries into the
      legal validity or scope of those patents that are brought to
      its attention. OMG specifications are prospective and
      advisory only. Prospective users are responsible for
      protecting themselves against liability for infringement of
      patents."


   List 4.4.3. Perceived risks of developing a new standard.

1   The standard may inadvertently contain intellectual property
      (particularly patented methods) resulting in a legal
      complaint against the creators of the standard
2   The standard may create loss of revenue or property to
      certain entities, resulting in legal actions taken against
      the creators of the standard
3   The standard may result in medical errors, resulting in
      injury to patients and subsequent legal actions taken
      against the creators of the standard
4   The standard may have been developed in a manner that
      excluded participation by an entity, resulting in a legal
      action


   List 4.5.1. Questions that should be asked prior to developing a new standard.

1   Is there a pre-existing standard that covers the same
      technology?
2   If there is a pre-existing standard, can it be enhanced or
      modified to provide a desired functionality?
3   How much will it cost to develop the standard?
4   How long will the standards development process take?
5   Will the intended beneficiaries of the standard pay for the
      standards development process?
6   Who will develop the standard? Are the selected developers
      competent to produce an adequate standard?
7   Are any of the developers conflicted?  Do they stand to
      profit if the standard is developed in a specific way?
8   Do any of the developers have proprietary software or data
      that they may wish to include in the standard?
9   Are the expected developers committed to work through the
      duration of the standards development process, and are they
      committed to providing all of the time and energy needed to
      develop the standard?
10   Will there be a mechanism whereby drafts of the standard are
      reviewed openly by the public?  Will the minutes of the
      working committee be made public? Will public comments be
      used to modify successive drafts of the standard?
11  Will the standard have dependencies on other standards? If
    so, are there intellectual property issues that must be
    resolved before development begins?  Will these issues
    require licenses or royalty agreements from the standards
    developers or the standards users?
12  Once created, is the standard likely to be adopted?  Is the
    anticipated standard easily implemented?
13  Who will be the adopters of the standard? Are the expected 
    standard adopters included in the development process for
    the standard?
14  Will the standard benefit a range of users beyond the
    standards developers?
15  What are the hazards that the standard may produce, and who
    might be hurt by the standard? In particular, will any
    entities be disadvantaged if they cannot readily adopt the 
    standard?
16  Is it necessary to have the standard approved by an external
    organization?
17  If so, who will pay for the extra costs of obtaining
    approval from an external standards organization?
18  Will the standard need to be continuously updated and
    modified? Is there a planned process for producing multiple
    versions of the standard?
19  Is it really important to have the standard?  Is it worth
    the effort?


   List 4.6.1. Organizations active in the field of biomedical standards.

1   ASTM, American Society for Testing and Materials
2   ANSI, American National Standards Institute (see Glossary)
3   HISB, Health Information Standards Board
4   IEEE,  Institute of Electrical and Electronics Engineers,
      Inc
5   ACR/NEMA, American College of Radiology (ACR) and National
      Electrical Manufacturers Association (NEMA), which oversees
      the DICOM (Digital Imaging and Communications in Medicine)
      image standard
6   NCPDP, National Council for Prescription Drug Programs, Inc
7   NIST, National Institute of Standards and Technology
8   ISO, International Organization for Standardization
9   IEC, International Electrotechnical Commission.


   List 4.6.2. Some American National Standards programming languages.

1   Mumps (ANSI approval 1977)
2   Basic (ANSI approval 1978)
3   ADA (ANSI approval 1983)
4   C (ANSI approval 1989)
5   Common Lisp (ANSI approval 1994)
6   ADA 95 (ANSI approval 1995)
7   Smalltalk (ANSI approval 1998)
8   C++ (ANSI approval 1999).


   List 4.7.1. New and future technologies that create biomedical data.

1   Gene Expression arrays (see Glossary)
2   Proteomic arrays
3   Tissue Microarrays
4   Metabolomic arrays
5   Image morphometric arrays.


   List 4.8.1. Problems created by the introduction of new standards.

1   Each new class of data object requires a new standard for
      the new object class (examples: Tissue Microarray Data,
      Gene Expression Array Data)
2   New standards require new implementations
3   Existing data standards require revision
4   Revisions of existing standards require retro-active
      implementation in data records conforming to the prior
      version of the standard
5   New data standards require harmonization with other existing
      standards. Otherwise multiple standards may compete for the
      standards-based data structures and data descriptors
      applicable to data elements common to multiple standards
6   Because standards often become the intellectual property of
      the  standards development organization, new standards
      cannot include parts of standards developed by other
      organizations. This means that redundant standards may
      describe the same objects.


   List 4.9.1. Fundamental properties of a specification.

1   The object specified must be defined and distinguished from
      all other objects. (i.e., one object cannot have two
      different specifications and one specification cannot apply
      equally to two non-equivalent objects)
2   The description must be organized in a way that is
      understandable and unambiguous. (i.e., a standard method of
      describing things, in the general sense, can be used.
      Languages are standard methods of describing things, but a
      better method might employ a formal semantic logic)
3   The descriptors must be well-defined in the context of the
      specification and not confused with descriptors of the same
      name but different meaning that may appear in other
      specifications (e.g., a "date"  may be a calendar
      notation in one standard and a type of dried fruit in
      another specification)
4   The measurements and descriptor values must be well-defined
      and not confused with measurements and values of the same
      alphanumeric value but different meaning that may appear in
      other specifications. (e.g., 10 pounds is not the same as 10
      Kg)
5   The specification must describe itself, include information 
      pertaining to its purpose, its creator, its ownership,  any
      restrictions on its uses, and any instructions necessary to
      interpret the specification.


   List 4.9.2. Logistical advantages of specifications over standards.

1   A specification need not be developed through a standards
      development process. A specification is basically a
      descriptive document and only requires fully unambiguous
      language. An individual can create a specification that
      everyone in the world can understand and use
2   Specifications do not require approval by any federal agency
       or organization. Standards have almost no meaning unless
      they are approved. In some cases, standards are enforced by
      authority of law
3   There are usually many different ways of specifying things.
      The same object can be described by different
      specifications. Standards tend to impose monolithic
      implementations
4   A specification is a general way of describing things and
      can be used for many different and new types of things.
      Standards are typically developed for specific items and
      cannot accommodate  new items without pursuing a development
      and approval process through a standards development
      organization. Biomedical informaticians who use research
      data will almost certainly find that existing standards will
      not keep pace with the arrival of new techniques and data 
      objects. The chair shown (see Figure) is a fully specified
      image created with Pov-Ray, a free, open source rendering
      program (see Appendix). It was created using a .pov file,
      which is a plain-text set of instructions written for the
      rendering application.


   List 4.9.3. Snippet from chair.pov rendering specification, modified from Matthias Opitz's public domain scene file.


   List 4.11.1. Parts of an LSID, from The LSID Resolution Protocol Project.

1   Network Identifier (NID)
2   root DNS name of the issuing authority
3   namespace chosen by the issuing authority
4   object id unique in that namespace and assigned locally
5   revision id for storing versioning information
      (optional)


   List 4.11.2. Examples of LSIDs, from The LSID Resolution Protocol Project.

1   urn:lsid:pdb.org:1AFT:1    This is the first version of the
      1AFT protein in the Protein Data Bank
2   urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434    References a
      PubMed article
3   urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2    Refers to the
      second version of an entry in GenBank
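
The parts listed in List 4.11.1 can be pulled out of an LSID string by splitting it on colons. The following is a minimal sketch, assuming the "urn:lsid:" prefix shown in the examples above. The subroutine name parse_lsid and the hash keys are hypothetical, chosen only for illustration, and the parse is purely positional: the fourth colon-separated field is taken as the namespace and the fifth as the object id.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_lsid is a hypothetical helper, not part of any LSID library.
# It splits an LSID into the parts named in List 4.11.1.
sub parse_lsid
  {
  my ($lsid) = @_;
  my ($urn, $nid, $authority, $namespace, $object, $revision)
      = split(/:/, $lsid);
  # reject strings that lack the "urn:lsid" prefix or a required part
  return undef unless defined($object)
      && lc($urn) eq "urn" && lc($nid) eq "lsid";
  return {
      nid       => $nid,
      authority => $authority,
      namespace => $namespace,
      object    => $object,
      revision  => $revision,    # undef when no revision id is present
      };
  }

my $parts = parse_lsid("urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434");
print "authority: $parts->{authority}\n";
print "namespace: $parts->{namespace}\n";
print "object id: $parts->{object}\n";
```

Because the revision id is optional, the sketch leaves the revision entry undefined when an LSID has only five fields.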


   List 4.13.1. Principles of unique object identification.

1   A unique object can be distinguished from all other unique
      objects
2   A unique object cannot be distinguished from itself
3   A class (or collection) of instances can be unique.


   List 4.13.2. Some registries that continually assign unique identifiers to requesting entities.

1   DOI, Digital Object Identifier
2   PMID, PubMed identification number
3   LSID (Life Science Identifier)
4   HL7 OID  (Health Level 7 Object Identifier)
5   DICOM (Digital Imaging and Communications in Medicine)
      identifiers
6   ISSN (International Standard Serial Numbers)
7   Social Security Numbers (for U.S. population)
8   NPI, National Provider Identifier, for physicians
9   Clinical Trials Protocol Registration System
10   Office for Human Research Protections Federalwide Assurance
      number
11  Data Universal Numbering System (DUNS) number
12  DNS, Domain Name Service.


   List 4.13.3. Dependable computer systems that rely on unique object identifiers.

1   Google (relies on URLs)
2   PubMed (relies on PubMed identifiers)
3   Libraries (rely on ISSN, DOI)
4   Swiss banks (rely on unique account numbers).


   List 4.13.4. Some medical errors related to misidentification.

1   Correctly identified medication provided to incorrectly
      identified person
2   Incorrectly identified medication provided to correctly
      identified person
3   Incorrectly identified dosage of correct medication provided
      to correctly identified person
4   Blood transfusion provided to incorrectly identified person
5   Report sent to incorrectly identified physician
6   Report identified with wrong person's name
7   Bill sent to incorrectly identified person
8   Report provided with diagnosis intended for different person
9   Wrong operation performed on incorrectly identified patient
10   Incorrectly identified patient treated for another patient's
      illness.


   List 4.15.1. Information deficiencies in the statement "John Smith has a blood glucose of 85".

1   No unique patient identifier (many people are named John
      Smith)
2   No unique time identifier (indicating when the test was
      performed and distinguishing the test results from other
      blood glucose values obtained from the patient at other
      times)
3   No unique test identifier (indicating the specific protocol
      used to measure blood glucose in this instance)
4   No unique identifier for the units of measurement
5   No unique report identifier (indicating that the report
      itself is a unique laboratory object that can be archived
      and retrieved)


   List 4.15.2. Three conditions for a meaningful assertion in informatics.

1   There is a specified object about which the statement is
      made. When the object is a unique object (such as a
      patient), the object must be specified in a manner that
      distinguishes the object from all other objects, and this is
      typically done with a unique object identifier
2   There is data that pertains to the specified object
3   There is metadata that describes the data (that pertains to
      the specified object.)


   List 4.15.3. Generalizable scientific statements.

1   f=ma -- Force is mass times acceleration
2   If a gas is held at constant temperature, its volume is
      inversely proportional to its pressure - Boyle's law
3   Ontogeny recapitulates phylogeny - fetal development follows
      the evolutionary path of the species (a false assertion)
4   There are 10 types of people, those who use binary notation
      and those who do not
5   (love of money) = (evil)x(evil) -- The love of money is the
      root of all evil.


   List 4.15.4. Algorithm for de-identifying with an identifier.

1   Collect data on unique object. "Joe Public has brown
      eyes."
2   Assign a unique identifier. "Joe Public has unique
      identifier, 77300183."
3   Substitute name of object with its identifier
4   Consistently use the identifier with data. "77300183
      has brown eyes."
5   Do not let anyone know that Joe Public is 77300183.
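
The five steps above can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: the subroutine names assign_identifier and deidentify are hypothetical, and a simple counter stands in for whatever identifier scheme an institution would actually use. The lookup table that links names to identifiers is the part that must be kept private.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# %id_for is the private lookup table (step 5: keep it secret).
my %id_for;
my $next_id = 77300183;    # arbitrary starting value, for illustration

# assign_identifier is a hypothetical helper: it returns the same
# identifier every time it is given the same name (steps 2 and 4).
sub assign_identifier
  {
  my ($name) = @_;
  $id_for{$name} = $next_id++ unless exists($id_for{$name});
  return $id_for{$name};
  }

# deidentify substitutes the identifier for every occurrence of the
# name in a statement (step 3).
sub deidentify
  {
  my ($name, $statement) = @_;
  my $id = assign_identifier($name);
  $statement =~ s/\Q$name\E/$id/g;
  return $statement;
  }

print deidentify("Joe Public", "Joe Public has brown eyes."), "\n";
print deidentify("Joe Public", "Joe Public has a blood glucose of 85."), "\n";
```

Both statements come out attached to the same identifier, so data collected at different times can still be linked to one unique object without revealing the name.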


   List 5.2.1. Some questions that can be answered with short program scripts.

1   Strip all the private identifiers from a medical record
2   Find all the surgical procedures included in the dataset of
      surgical post-op notes, and annotate each procedure with its
      frequency of occurrence in the dataset
3   Index a book with the page location of all terms that are
      names of diseases
4   Find all the palindromes in a gene sequence database and
      arrange them by frequency of occurrence
5   Find the most commonly occurring octamer sequence in the
      human genome database
6   Find all octamers that occur only once in the human genome
      database
7   Rank sequences from a gene expression array experiment based
      on levels of over-expression
8   From a patient database, find the diseases that have a
      chronologic relationship with another condition (e.g.
      chicken pox never occurs after shingles)
9   Find all tumors associated with a gene fusion mutation
10   Collect 100 histopathologic images of liver disease from the
      Web.
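
Several of the tasks above reduce to counting items in a hash. As a rough sketch of items 5 and 6, the following script counts the overlapping octamers in a short, made-up sequence. The subroutine name count_kmers and the sample sequence are assumptions for illustration; a real genome database would be read from a file rather than typed into the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_kmers is a hypothetical helper: it tallies every overlapping
# substring of length $k in $sequence and returns a hash reference.
sub count_kmers
  {
  my ($sequence, $k) = @_;
  my %count;
  for (my $i = 0; $i + $k <= length($sequence); $i++)
    {
    $count{substr($sequence, $i, $k)}++;
    }
  return \%count;
  }

# A made-up sequence for illustration, not taken from any database
my $sequence = "ACGTACGTACGTTTGACCA";
my $count = count_kmers($sequence, 8);

# Sort octamers by descending frequency, then alphabetically
my @ranked = sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
print "Most common octamer: $ranked[0] ($count->{$ranked[0]} times)\n";

# Octamers that occur only once
my @singletons = grep { $count->{$_} == 1 } @ranked;
print "Octamers occurring once: ", scalar(@singletons), "\n";
```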


   List 5.2.2. The three programming tricks in medical informatics.

1   File parsing (opening a file and examining the contents of
      the file, one line at a time)
2   Pattern matching (finding a fragment of parsed text that
      matches a word, a phrase or a character pattern of interest)
3   Assigning data structures to hold numbers or textual data
      that can be operated on, with outputs placed in an external
      file.


   List 5.2.3. Pseudocode to collect all the lines from a file that contain the phrase "biomedical informatics".

1   1. Open a file for reading. (Verbose equivalent: Get a file
      from the hard drive that has a particular name and prepare
      it so that the data in the file can be extracted and put
      into holders in the computer's memory)
2   2. Parse the lines of the file. (Verbose equivalent: Grab
      the characters from the first line of the file and put it
      into a data holder that occupies a specific place in
      computer memory. Be prepared to repeat this for all the
      lines of the file.)
3   3. Collect all the lines that contain the phrase
      "biomedical informatics". (Verbose equivalent: As each
      line is placed in a holder in computer memory, determine
      whether the line contains  the string "biomedical
      informatics" and if it does, add the held data to a
      structure called an array, which can hold many character
      strings, in sequence.)
4   4. When the file is exhausted, empty all the matching lines
      into an external file, opened for writing, named
      "output.txt". (Verbose equivalent: At the end of
      the file parsing loop, take the array structure, and
      transfer all the character strings from the array, in
      sequence, into a newly created file that has been prepared
      to accept data.)
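
The four pseudocode steps translate almost line for line into Perl. The following is one possible sketch, not the book's own script; the subroutine name collect_lines and the file name input.txt are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# collect_lines is a hypothetical subroutine name. It follows the four
# pseudocode steps and returns the number of matching lines.
sub collect_lines
  {
  my ($infile, $outfile, $phrase) = @_;
  # Step 1: open a file for reading
  open(my $in, "<", $infile) || die "Can't open $infile: $!";
  my @matches;
  # Step 2: parse the lines of the file, one at a time
  while (my $line = <$in>)
    {
    # Step 3: collect the lines that contain the phrase
    push(@matches, $line) if index($line, $phrase) >= 0;
    }
  close($in);
  # Step 4: empty the matching lines into an external file
  open(my $out, ">", $outfile) || die "Can't open $outfile: $!";
  print $out @matches;
  close($out);
  return scalar(@matches);
  }

# Usage, assuming a text file named input.txt exists:
# collect_lines("input.txt", "output.txt", "biomedical informatics");
```

The sketch uses index() for a literal, case-sensitive phrase search. A regular expression match would serve equally well and would allow case-insensitive matching.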


   List 5.2.4. Reasons to program in Perl.

1   Perl can be obtained at no cost
2   Perl is available for virtually every operating system and
      comes bundled into Unix and Linux distributions
3   Perl is extremely popular among bioinformaticians
4   It takes just a few hours to learn enough Perl to write your
      own biomedical informatics programs
5   Perl programs tend to be much shorter and easier to
      understand than programs written in C or Java
6   A Perl script written for your computer will probably work
      on any other computer loaded with a Perl interpreter, even
      if the other computer has a different operating system
7   Unlike C and C++, Perl comes with native pattern matching
      commands (so called regular expressions) which are used in
      virtually every program in the field of biomedical
      informatics
8   There are many thousands of freely available Perl tools that
      perform a wide range of useful operations that can extend
      the functionality of your own programs
9   Perl code can be written in a manner that looks much like
      simple narrative text (if you make the effort), making it
      easy for others to read
10   Once you've learned Perl, you can migrate to almost any
      other programming language with ease.


   List 5.5.1. Contents of typical flat-file, "taxo.txt" extracted from "Taxonomy".

1   SYNONYM        : Bacillus aegyptius
2   SYNONYM        : Haemophilus aegyptius
3   SYNONYM        : Hemophilus conjunctivitidis
4   SYNONYM        : Haemophilus influenzae aegyptius
5   SYNONYM        : Bacillus conjunctivitidis
6   SYNONYM        : Bacterium aegyptiacum
7   SYNONYM        : Bacterium conjunctivitis
8   SYNONYM        : Bacterium pseudo conjunctivitidis


   List 5.5.2. Perl script, open1.pl, to open a file and read a file.

     #!/usr/bin/perl
     #open1.pl prints every line of taxo.txt
     open(FILE, "taxo.txt") || die "Can't open taxo.txt";
     while ($line = <FILE>)   #<FILE> returns undef at the end of the file
       {
       print $line;
       }
     close(FILE);
     exit;


   List 5.5.3. Output of open1.pl.

1   C:\ftp>perl open1.pl
2   SYNONYM        : Bacillus aegyptius
3   SYNONYM        : Haemophilus aegyptius
4   SYNONYM        : Hemophilus conjunctivitidis
5   SYNONYM        : Haemophilus influenzae aegyptius
6   SYNONYM        : Bacillus conjunctivitidis
7   SYNONYM        : Bacterium aegyptiacum
8   SYNONYM        : Bacterium conjunctivitis
9   SYNONYM        : Bacterium pseudo conjunctivitidis


   List 5.10.1. Mwp.pl, a ridiculously short text editor, in Perl.

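     #!/usr/bin/perl
     #appends typed lines to mycumu.txt and writes them to mynew.txt
     open (OUT, ">>mycumu.txt") || die "Can't open mycumu.txt";
     open (NEW, ">mynew.txt") || die "Can't open mynew.txt";
     $line = " ";
     until ($line eq "\n")   #stops after the return key is pressed alone
       {
       $line = <STDIN>;
       last unless defined($line);   #also stops at the end of input
       print OUT $line;
       print NEW $line;
       }
     exit;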


   List 5.10.2. Until loop in Perl.

          $line = " ";
          until ($line eq "\n")  #loop stops when all you've entered is
                                 #the return key
            {
            $line = <STDIN>;  #waits for the next line of input
            print OUT $line;  #appends to the cumulative file
            print NEW $line;  #writes to the current script-session file
            }


   List 5.11.1. Common errors in Perl scripts.

1   Perl blocks must be balanced with curly brackets. Every
      block (e.g., while, if, for, unless, foreach) must have a
      beginning curly bracket, "{", and a balanced closing
      curly bracket, "}". This can become hairy in
      scripts that have multi-nested blocks
2   Command lines must end with a semicolon
3   Scalar variables, including string variables, must be
      prefixed with a "$", as in $date
4   Spelling counts in scripts. Perl cannot interpret a
      misspelled command or variable
5   An uppercase character has a different ascii value than its
      lowercase equivalent. With few exceptions, you will find it
      useful to maintain case consistency in Perl scripts
6   Characters that serve as reserved Perl symbols must be 
      backslashed if they are used as string characters. For
      example, use \. \/ \\ \$  if you want to use ./\$ as 
      characters. There are exceptions to this rule: \n,\d, \w are
      reserved symbols and never refer to the letters, ndw. The
      strange and non-intuitive use of backslashes in Perl  takes
      some mental adjustment and accounts for the "leaning
      toothpick syndrome" in Perl scripts. Complex regular
      expressions often resemble toothpicks tossed amidst string
      characters
7   Certain operations must be enclosed by parentheses (e.g.,
      "if (1 == 2)", not "if 1 == 2")
8   The "=" operator assigns a value and does not
      test for equality. To test for equality, use "=="
      if you are comparing two numbers and use "eq" if
      you are comparing two strings. Remember that string
      comparison operators (eq, ne, lt, gt) are different from
      number comparison operators (==, !=, <, >)
9   Using an "=" operator when you really want to use
      the regex comparison operator, "=~".
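
The confusion between the numeric and string comparison operators (item 8) is easy to demonstrate. In the following sketch, which uses made-up variable names, warnings are deliberately switched off so that the mistaken comparison runs silently, as it would in a script written without warnings.

```perl
#!/usr/bin/perl
use strict;
no warnings;    # warnings are silenced so the mistaken comparison runs quietly

my $first  = "kidney";
my $second = "liver";

# Wrong: "==" compares numbers. Both strings have the numeric value 0,
# so this test is true even though the strings differ.
my $numeric_test = ($first == $second) ? "same" : "different";

# Right: "eq" compares strings.
my $string_test = ($first eq $second) ? "same" : "different";

print "== says the strings are $numeric_test\n";
print "eq says the strings are $string_test\n";

# "=~" tests a pattern match; "=" would overwrite the variable instead.
my $organ = "kidney";
my $matched = ($organ =~ /kid/) ? "yes" : "no";
print "Does $organ match /kid/? $matched\n";
```

Running scripts with warnings enabled makes Perl report the non-numeric arguments to "==", which is one inexpensive way of catching this class of error.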


   List 5.11.2. Summary of the first Perl programming section.

1   Perl scripts are simple text files. Perl scripts should be
      named using the .pl extension. Perl is a quintessential
      command-line language. At the command prompt, run your
      scripts by typing perl, then the name of the script, then
      the return-key (on some systems, you needn't include the
      name perl)
2   Perl scripts start off with a header line
3   Perl commands end with a semicolon
4   Perl blocks are delineated by curly brackets ({ })
5   You can assign strings to variables by using the assignment
      operator, "="
6   You can read, write or append to files using the
      "open" command


   List 5.12.1. Pseudocode that outlines the general construction of a Perl script.

     header (shebang) line;
     input something;
     if (something evaluates to true)
       {
       do something;
       for or while (some condition)
         {
         do something;
         }
       do something;
       do something;
       }
     for or while (some condition)
       {
       do something;
       if (something evaluates to true)
         {
         do something;
         do something;
         do something;
         }
       output something;
       }
     exit;


   List 5.14.1. Perl script bigread.pl.

     #!/usr/bin/perl
     #bigread.pl
     #This script lets you page through enormous files,
     #20 lines at a time, with no file load time
     print "What file do you want to read?";
     $filename = <STDIN>;
     chomp($filename);
     open (TEXT, $filename)||die"Can't open file";
     $line = " ";
     while (defined($line))   #loop until the end of the file
        {
        for ($count = 1; $count <= 20; $count++)
          {
          $line = <TEXT>;
          last unless defined($line);   #no more lines to read
          print $line;
          }
        last unless defined($line);
        print "Type QUIT if you want to quit. Otherwise press any key\n";
        $response = <STDIN>;
        if ($response =~ /QUIT/i)
          {
          last;
          }
        }
     exit;


   List 5.14.2. Output of File Reader.

1   C:\ftp>perl bigread.pl
2   What file do you want to read?e:\omim.txt
3   *RECORD*
4   *FIELD* NO
      100050
      *FIELD* TI
      100050 AARSKOG SYNDROME
      *FIELD* TX
      Grier et al. (1983) reported father and 2 sons with typical Aarskog
      syndrome, including short stature, hypertelorism, and shawl scrotum.
5   
6   
7   sons and that this suggested autosomal dominant inheritance.
      Actually,
8   the mother seemed less severely affected, compatible with
      X-linked
9   Type QUIT if you want to quit. Otherwise press any key


   List 5.14.3. Summary of the second Perl programming section.

1   How to open and read from files, line by line
2   How to prompt a user for input
3   Looping using for() and while() blocks
4   Evaluating if() blocks
5   Simple pattern matching


   List 5.15.1. Things you can do with a one-line Regular expression.

1   Collect the lines from a file that contain a specific word,
      phrase or number
2   Collect the lines from a file that contain any desired
      combination of the above
3   Substitute any alphanumeric character string for any other,
      for the entire file


   List 5.16.1. Using the match operator with regular expressions.

     for all the lines of a given file
       {
       put the next line from the file into some variable;
       check the line to see if it matches your regular expression;
       if the line matches the regular expression
         {
         do something with it, like put it into another file;
         or do an operation on the matching value;
         }
       }
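
The outline above might look like this in working Perl. It is a sketch under stated assumptions: the subroutine name matching_lines is hypothetical, and the sample lines, written in the style of taxo.txt, are held in a string so that the script runs without an external file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# matching_lines is a hypothetical helper. It reads every line from an
# open filehandle and returns the lines that match the given pattern.
sub matching_lines
  {
  my ($fh, $pattern) = @_;
  my @found;
  while (my $line = <$fh>)           # put the next line into a variable
    {
    if ($line =~ /$pattern/)         # does the line match the regex?
      {
      push(@found, $line);           # do something with the match
      }
    }
  return @found;
  }

# Sample lines in the style of taxo.txt, held in a string so that the
# sketch runs without an external file
my $text = "SYNONYM        : Bacillus aegyptius\n"
         . "SYNONYM        : Haemophilus aegyptius\n"
         . "SYNONYM        : Bacterium conjunctivitis\n";
open(my $fh, "<", \$text) || die "Can't open in-memory file";
print matching_lines($fh, qr/aegyptius$/);
close($fh);
```

The pattern is passed in with qr//, which lets the same loop be reused for any regular expression.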


   List 5.16.2. Using the substitution operator with regular expressions.

     for all the lines of a given file
       {
       put the next line from the file into some variable;
       do a substitution on all of the parts of the line that match
         your regular expression;
       do something with the revised line, like rearranging it and
         then putting the rearranged line into another file;
       }
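
A hedged Perl rendering of this outline follows. The subroutine name substitute_lines is hypothetical, and in-memory strings stand in for the input and output files so that the sketch is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# substitute_lines is a hypothetical helper. It applies one substitution
# to every line read from $in and writes the revised line to $out.
sub substitute_lines
  {
  my ($in, $out, $pattern, $replacement) = @_;
  while (my $line = <$in>)
    {
    $line =~ s/$pattern/$replacement/g;   # g: change every match on the line
    print $out $line;
    }
  }

# In-memory "files" keep the sketch self-contained
my $source = "Hemophilus conjunctivitidis\nHemophilus aegyptius\n";
my $result = "";
open(my $in,  "<", \$source) || die "Can't open input";
open(my $out, ">", \$result) || die "Can't open output";
substitute_lines($in, $out, qr/\bHemophilus\b/, "Haemophilus");
close($in);
close($out);
print $result;
```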


   List 5.17.1. Pattern match options.

1   g     Match globally, (find all occurrences)
2   i     Do case-insensitive pattern matching
3   m     Treat string as multiple lines
4   o     Compile pattern only once
5   s     Treat string as single line
6   x     Use extended regular expressions
7   ^     Match the beginning of the line
8   .     Match any character (except newline)
9   $     Match the end of the line (or before newline at
      the end)
10  |     Alternation
11  ()    Grouping
12  []    Character class
13  *     Match 0 or more times
14  +     Match 1 or more times
15  ?     Match 1 or 0 times
16  {n}   Match exactly n times
17  {n,}  Match at least n times
18  {n,m} Match at least n but not more than m times
19  \n    newline(LF, NL)
20  \W    Match a non-word character
21  \s    Match a whitespace character
22  \S    Match a non-whitespace character
23  \d    Match a digit character
24  \D    Match a non-digit character.
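
A few of the options listed above are shown at work in the following sketch. The sample strings and variable names are assumptions chosen for illustration; only the g, i, x, s and m modifiers and the {n} quantifier are exercised.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Dx: Adenoma. dx: adenocarcinoma. DX: ADENOMA.";

# g with i: find every occurrence, ignoring case.
my @all = $text =~ /adenoma/gi;

# Without i, only the exact-case form is found.
my @exact = $text =~ /Adenoma/g;

# x: whitespace and comments inside the pattern are ignored.
my ($year, $month, $day) = "Collected 2004-07-07" =~ /
    (\d{4}) -    # exactly four digits
    (\d{2}) -    # exactly two digits
    (\d{2})      # exactly two digits
/x;

# s: lets the dot match a newline as well.
my $two_lines   = "first\nsecond";
my $dot_default = ($two_lines =~ /first.second/)  ? 1 : 0;
my $dot_s       = ($two_lines =~ /first.second/s) ? 1 : 0;

# m: lets ^ match at the start of every line, not just the string.
my @starts = $two_lines =~ /^(\w+)/gm;

print "case-insensitive matches: ", scalar(@all), "\n";
print "exact-case matches: ", scalar(@exact), "\n";
print "date parts: $year $month $day\n";
print "dot matches newline: default=$dot_default with s=$dot_s\n";
print "line starts: @starts\n";
```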


   List 5.17.2. Sentence.pl Perl script, which creates a file wherein each new sentence begins on a new line.

    #!/usr/local/bin/perl
    open (TEXT, "1DFRE10.TXT")||die"Can't open file";
    open (OUT,">1DFRE10.OUT")||die"Can't open file";
    undef($/);                #read the whole file as a single string
    $string = <TEXT>;
    $string =~ s/[\n]+/ /g;   #replace line breaks with spaces
    #start each new sentence on a new line
    $string =~ s/([^A-Z]+\.[ ]{1,2})([A-Z])/$1\n$2/g;
    print OUT $string;
    exit;


   List 5.18.1. Periods.pl, a Perl script for removing periods that do not delineate sentences.

    #!/usr/bin/perl
    #disbrev2.pl
    #replaces periods with *, except when period marks end of sentence
    #(a sentence end is a period followed by two spaces and an uppercase letter)
    $k = "Mr. Dr. P.I.N. Ph.D. M.D. 0.3 .4 5. 4.6.7.8.9 end_of_sentence.  Hello";
    $firstvalue = $k;
    #the pattern has a single capture group, so only $1 is used in the replacement
    $k =~ s/\b([ \w\d]*)\.+(?=[\w\d]*)(?!  [A-Z])/$1\*/g;
    print "$firstvalue =>\n$k";
    exit;


   List 5.18.2. Output of disbrev2.pl.

1   C:\ftp>perl disbrev2.pl
2        Mr. Dr. P.I.N. Ph.D. M.D. 0.3 .4 5. 4.6.7.8.9
      end_of_sentence. Hello =>
3   Mr* Dr* P*I*N* Ph*D* M*D* 0*3 *4 5*  4*6*7*8*9
      end_of_sentence. Hello
4        C:\ftp>


   List 5.18.3. Regex (regular expression) substitution examples.

1   $string =~ s/^ +//o; Removes leading spaces from a character
      string
2   $string =~ s/ +$//o; Removes trailing spaces from a
      character string
3   $string =~ s/ +/ /g; Changes all sequences of one or more
      spaces to just a single space
4   $string =~ s/\n//g; Gets rid of newline (sometimes called
      linebreak) characters in your string
5   $string =~ s/\b(\w+\.[ ]{1,2})([A-Z])/$1\n$2/g; Finds the
      most common sentence delimiter (the end of a word followed
      by a period, followed by one or two spaces, followed by an
      uppercase letter) and inserts a newline character so that
      each new sentence begins on a new line
6   $string =~ tr/A-Z/a-z/; Every uppercase letter is converted
      to a lowercase letter using the translate operator
      (tr/a-z/A-Z/ does the opposite)
7   $string = lc($string); Every uppercase letter is converted
      to a lowercase letter using the lc operator (uc($string)
      does the opposite)
8   $string =~ s/\b([A-Z0-9])\.[ \n]/$1\*/g; Replaces with an
      asterisk the period (and the space or line-break after it)
      that follows a stand-alone single uppercase letter or
      digit, so that the period is not mistaken for a sentence
      break
9   $string =~ s/\<[^\<]+\>/ /g; Removes angle-bracketed
      expressions, such as HTML or XML markup
10  $string =~ s/^([^ ]*) *([^ ]*)/$2 $1/; The first word in a
      string is switched with the second word
11  $string =~ s/\b\S+\@\S+\b/email/g; Replaces email addresses
      with the word "email"
12  $string =~ s/\bhttp\:\S+\b/webURL/ig; Replaces http
      addresses with the word webURL
13  $string =~ tr/0-9a-zA-Z.\n' \-\)\(/ /c; Replaces with a
      space everything that is not a letter, number, period,
      line-break, apostrophe, space, hyphen or parenthesis.


   List 5.19.1. Wc.pl Perl script, which counts the words in a file in 5 commands.

    #!/usr/local/bin/perl
    open (TEXT, "1DFRE10.TXT");
    undef($/);
    $all_text = <TEXT>;
    @wordarray = split(/[\n\s]+/, $all_text);
    print scalar(@wordarray);
    exit;


   List 5.20.1. The Zipf distribution of the prior paragraph.

1   c:\ftp>perl zipf.pl
2   00007 of
3   00005 a
4   00004 the
5   00003 words
6   00003 is
7   00003 in
8   00002 zipf
9   00002 text
10   00002 occurrences
11  00002 distribution
12  00001 zipf's
13  00001 way
14  00001 this
15  00001 their
16  00001 that
17  00001 small
18  00001 shown
19  00001 see
20  00001 practical
21  00001 paragraph
22  00001 order
23  00001 most
24  00001 listing
25  00001 list
26  00001 law
27  00001 interpreting
28  00001 for
29  00001 different
30  00001 descending
31  00001 any
32  00001 amount
33  00001 account


   List 5.20.2. The first ten items in the Zipf distribution of The Decline and Fall of the Roman Empire.

1   26856 the
2   18032 of
3   09136 and
4   06026 to
5   04654 a
6   04155 in
7   03170 was
8   03081 his
9   02815 by
10   02391 that


   List 5.20.3. Zipf.pl, a Perl script that creates a Zipf distribution in 6 commands.

    #!/usr/local/bin/perl
    open (TEXT, "1DFRE10.TXT");
    open (OUT, ">1DFRE10.OUT");
    undef($/);
    $all_text = <TEXT>;
    $all_text = lc($all_text);
    $all_text =~ s/[^a-z\-\']/ /g;
    @wordarray = split(/[\n\s]+/, $all_text);
    foreach $thing (@wordarray)
      {
      $freq{$thing}++;
      }
    #The Zipf list is finished. The next lines just display the distribution
    while ((my $key, my $value) = each(%freq))
      {
      $value = "00000" . $value;
      $value = substr($value,-5,5);
      push (@termarray, "$value $key");
      }
    @finalarray = reverse (sort (@termarray));
    print join("\n",@finalarray);
    exit;


   List 5.20.4. Example of an associative array, %patient_weight.

    $patient_weight{"John Public"} = 155;
    $patient_weight{"Mary Smith"} = 110;
    $patient_weight{"Jules Berman"} = 195;
    $patient_weight{"Jules Berman"}++; #evaluates to 196
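
Once values are stored in an associative array, they can be looked up, tested for and listed. The sketch below reuses the %patient_weight entries shown above; the lookup of "Jane Doe" and the choice to list the heaviest patient first are assumptions added for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %patient_weight = (
    "John Public"  => 155,
    "Mary Smith"   => 110,
    "Jules Berman" => 195,
);
$patient_weight{"Jules Berman"}++;    # now 196

# Look up one value by its key.
print "Mary Smith weighs $patient_weight{'Mary Smith'}\n";

# exists() tests for a key without creating it.
# "Jane Doe" is a hypothetical name that is not in the array.
print "No record for Jane Doe\n" unless exists $patient_weight{"Jane Doe"};

# List every key/value pair, heaviest patient first.
for my $name (sort { $patient_weight{$b} <=> $patient_weight{$a} }
              keys %patient_weight) {
    print "$name => $patient_weight{$name}\n";
}
```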


   List 5.20.5. Summary of the third Perl programming section.

1   Creating and interpreting complex regular expressions
2   Looping through arrays with foreach blocks
3   Looping through associative arrays with while blocks
4   New Perl operators and commands split(), push(), lc(),
      sort(), join(), substr(), scalar(), undef(), incrementing
      values and concatenating strings
5   Advanced pattern substitution and substitution options


   List 5.21.1. A sample MESH record.

1   *NEWRECORD
2   RECTYPE = D
3   MH = Heparin
4   AQ = AA AD AE AG AN BI BL CF CH CL CS CT DF DU EC GE HI IM
      IP ME PD PH PK PO RE SD SE ST TO TU UL UR
5   PRINT ENTRY = Heparinic Acid|T118|T121|T123|NON|EQV|UNK
      (19XX)|800523|abbbcdef
6   PRINT ENTRY = alpha-Heparin|T118|T121|T123|NON|NRW|UNK
      (19XX)|800523|abbbcdef
7   ENTRY = Liquaemin|T118|T121|T123|TRD|NRW|UNK
      (19XX)|861029|abbbcdef
8   ENTRY = Sodium Heparin|T118|T121|NON|NRW|UNK
      (19XX)|830330|abbcdef
9   ENTRY = Heparin, Sodium
10  ENTRY = alpha Heparin
11  MN = D09.698.373.400
12  PA = Anticoagulants
13  PA = Fibrinolytic Agents
14  EC = antagonists & inhibitors:Heparin Antagonists
15  MH_TH = BAN (19XX)
16  ST = T118
17  ST = T121
18  ST = T123
19  N1 = Heparin
20  RN = 9005-49-6
21  MS = A highly acidic mucopolysaccharide formed of equal
      parts of sulfated D-glucosamine and D-glucuronic acid with
      sulfaminic bridges. The molecular weight ranges from six to
      twenty thousand. Heparin occurs in and is obtained from
      liver, lung, mast cells, etc., of vertebrates. Its function
      is unknown, but it is used to prevent blood clotting in
      vivo and vitro, in the form of many different salts
22  PM = /therapeutic use was HEPARIN, THERAPEUTIC 1965
23  HN = /therapeutic use was HEPARIN, THERAPEUTIC 1965
24  MED = *1635
25  MED = 3275
26  M90 = *2406
27  M94 = 4517
28  MR = 20040707
29  DA = 19990101
30  DC = 1
31  UI = D006493


   List 5.21.2. Creating a persistent database object from the MESH flat-file.

    #!/usr/bin/perl
    use Fcntl;
    use SDBM_File;
    tie %item, "SDBM_File", 'mesh', O_RDWR|O_CREAT|O_EXCL, 0644;
    untie %item;     #these two lines simply create a file
    open (TEXT, "d2002.bin")||die"Can't open file";
    $/ = "*NEWRECORD";     #read the file one MESH record at a time
    $line = " ";
    while ($line ne "")
      {
      tie %item, "SDBM_File", 'mesh', O_RDWR, 0644;  #use the created file
      $line = <TEXT>;
      @linearray = split(/\n/,$line);
      foreach $piece (@linearray)
        {
        if ($piece =~ /MN = /)
          {
          $meshno = $';     #the text that follows "MN = "
          }
        if ($piece =~ /ENTRY = /)
          {
          $entry = $';
          if ($entry =~ /\|/o)
            {
            $entry = $`;     #keep only the text before the first |
            }
          $entry =~ s/s\b//g;
          $entry = lc($entry);
          push (@synonyms, $entry);
          }
        }
      foreach $term (@synonyms)
        {
        $item{$term} = $meshno;
        }
      undef $meshno;
      undef @synonyms;
      untie %item;
      }
    undef(%item);
    close TEXT;
    exit;


   List 5.22.1. Retrieving a persistent database object from the MESH flat-file.

    #!/usr/bin/perl
    use Fcntl;
    use SDBM_File;
    tie %item, "SDBM_File", 'mesh', O_RDWR, 0644;
    while(($key, $value) = each (%item))
      {
      print "$key => $value\n";
      }
    untie %item;
    exit;


   List 5.23.1. Syntax rules for valid XML tags.

1   XML tags, like Perl variables, are case-sensitive
      ("Name" is different from "name").
      Parsers must preserve character case
2   Letters, underscores, hyphens, periods and numbers may be
      used in a tag
3   Only letters and underscores are eligible as the first
      character
4   Colons are allowed, but only as part of a declared namespace
      prefix. For all practical purposes, this means that only one
      colon is allowed in a tag, and the colon must appear in an
      internal location in the tag (not at the beginning or the
      end of a tag).


   List 5.23.2. Tagcheck.pl, a program that validates XML tags.

    #!/usr/bin/perl
    @elements = qw (gene 4gene gene:ncbi gene-autry ge::ne
                    gene&autry -gene _gene gene- gene:
                    :gene ge:n:e ge:ne: ge,ne ge.ne);
    foreach $value (@elements)
      {
      if ($value =~ /^[a-z\_][a-z0-9\-\.\_]*[\:]?[a-z0-9\-\.\_]*$/i)
        {
        print "$value is good\n";
        }
      else
        {
        print "$value is bad\n";
        }
      }
    exit;


   List 5.23.3. Output of tagcheck.pl

1   c:\ftp>perl tagcheck.pl
2   gene is good
3   4gene is bad
4   gene:ncbi is good
5   gene-autry is good
6   ge::ne is bad
7   gene&autry is bad
8   -gene is bad
9   _gene is good
10   gene- is good
11  gene: is good
12  :gene is bad
13  ge:n:e is bad
14  ge:ne: is bad
15  ge,ne is bad
16  ge.ne is good


   List 5.24.1. What we have learned so far.

1   The =~ operator tells Perl to look for the pattern that
      follows the operator in the variable that precedes the
      operator. Regular Expressions are Perl's way of describing a
      pattern
2   You can create most of your patterns by following a few
      simple rules and by "borrowing" regular
      expressions from published listings
3   The most common use of regular expressions is in scripts
      that examine a line (or all the lines) from a file and that
      perform a substitution or rearrangement or other operation
      on the line, based on the results of the pattern match
4   Regular expressions are a powerful and fast tool for
      modifying text or data records or finding exactly what you
      want in any text
5   Perl associative arrays can be tied to an external database
      object that persists even when the Perl script has finished
      executing.


   List 6.1.1. Some biomedical informatics tasks that can be accomplished with Perl.

1   Statistics
2   Mathematical Computations
3   Mathematical modeling
4   Web protocols (e.g., http and ftp)
5   Cryptographic techniques
6   Integrating data
7   Glue functions (e.g., calling subroutines written in C)
8   Digital Signal Processing (including Image analysis)
9   Bioinformatics methods (e.g. interfacing to Blast)
10   Database interfaces
11  Remote procedure calls and distributed computing
12  Middleware (see Glossary)
13  Software agents (via web services, GRID, SOAP (see
      Glossary), or related protocols)
14  Transformations to and from XML
15  XML data queries
16  Logical annotation of data (e.g., RDF)


   List 6.2.1. Creating an MD_5 one-way hash value for any provided string.

    #!/usr/local/bin/perl
    use MD5;
    print "What words would you like to digest?\n";
    $holdstring = <STDIN>;
    chomp($holdstring);    #remove the trailing newline from the input
    $hexhashstring = MD5->hexhash($holdstring);
    print "md_5 hexhash => $hexhashstring\n";
    exit;


   List 6.2.2. Three executions of the MD_5 algorithm.

1   Execution 1:
2   c:\ftp>perl md5_word.pl
3   What words would you like to digest?
4   Jules Berman
5   md_5 hexhash => 0ab7ad79962fd2ea036cc8dbaade6f2a


   List 6.2.3. Creating an MD_5 one-way hash for a file.

    #!/usr/local/bin/perl
    use MD5;
    print "What file would you like to digest?\n";
    $holdfile = <STDIN>;
    chomp($holdfile);    #remove the trailing newline from the filename
    open (TEXT,"$holdfile")||die"Can't open file";
    $context = new MD5;
    $context->addfile(TEXT);
    $digest = $context->digest();
    print (unpack ("H*", $digest));
    exit;


   List 6.3.1. Simple Perl script for computing the mean from an array of numbers.

    #!/usr/bin/perl
    #mean.pl
    #computes the mean of an array of numbers
    @numbersarray = (1,2,3,4,5,6,7,8,9,10);
    $arraysize = scalar(@numbersarray);
    print "The number of elements in our array is $arraysize\n";
    $sum = 0;
    foreach $value(@numbersarray)
      {
      $sum = $sum + $value;
      }
    $mean = $sum / $arraysize;
    print "Your population number is $arraysize\n";
    print "The array mean is $mean\n";
    exit;


   List 6.3.2. General method of building an array that can be used in a statistical or mathematical Perl routine.

1   Open the file containing your records
2   Go through the file, one line (record) at a time
3   From a complex record, pick out the number you want using
      Regex
4   Add that number to your array variable (using the Perl push
      command)
5   Calculate the mean (or any other statistical test) on the
      array variable.
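
The five steps above can be sketched as follows. The record layout (fields separated by |), the weight field and the subroutine names extract_numbers and mean are assumptions made for the example, and an in-memory filehandle stands in for the file of records.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Steps 2 to 4: read one record per line, pick out the number
# captured by $regex, and push it onto an array.
sub extract_numbers {
    my ($fh, $regex) = @_;
    my @numbers;
    while (my $line = <$fh>) {
        push @numbers, $1 if $line =~ $regex;
    }
    return @numbers;
}

# Step 5: the mean of a list of numbers (undef for an empty list).
sub mean {
    my @numbers = @_;
    return undef unless @numbers;
    my $sum = 0;
    $sum += $_ for @numbers;
    return $sum / scalar(@numbers);
}

# Step 1: open the records (hypothetical data, held in memory).
my $records = "id=1|name=John Public|weight=155\n"
            . "id=2|name=Mary Smith|weight=110\n"
            . "id=3|name=Jules Berman|weight=196\n";
open(my $in, '<', \$records) or die "Can't open records: $!";
my @weights = extract_numbers($in, qr/weight=(\d+(?:\.\d+)?)/);
close $in;

printf "n = %d, mean = %.2f\n", scalar(@weights), mean(@weights);
```

Any other statistic can be substituted for mean() once the array has been built.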


   List 6.3.3. Computing the mean of an array entered at keyboard.

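    #!/usr/bin/perl
    #mean2.pl
    #computes the mean of an array of numbers entered at keyboard
    print "Type a bunch of numbers, pressing the return key\n";
    print "after each number. Decimal numbers are allowed\n\n";
    $number = " ";
    until ($number eq "")
      {
      $number = <STDIN>;
      $number =~ s/\n//o;  #deletes the newline character
      if ($number eq "")
        {
        next;
        }
      if ($number !~ /[0-9]+/)     #the entry must contain at least one digit
        {
        print "You're only allowed to enter numbers...";
        print " We just won't count this entry\n";
        next;
        }
      #The source listing breaks off here, in the middle of the test
      #"if ($number !~ /^[0-9". Everything below this comment is an
      #assumed reconstruction, not the author's code.
      if ($number !~ /^[0-9]*\.?[0-9]+$/)  #assumed test: digits, optional decimal point
        {
        print "That entry is not a plain number. We won't count it\n";
        next;
        }
      push(@numbersarray, $number);
      }
    $arraysize = scalar(@numbersarray);
    if ($arraysize > 0)
      {
      $sum = 0;
      foreach $value (@numbersarray)
        {
        $sum = $sum + $value;
        }
      $mean = $sum / $arraysize;
      print "The array mean is $mean\n";
      }
    exit;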

   List 6.3.4. Output of mean2.pl.

1   C:\ftp>perl mean2.pl
2   Type a bunch of numbers, pressing the return key after each
      number. Decimal numbers are allowed


   List 6.4.1. Some of the available Perl statistics modules.

1   Statistics-Basic                   Statistics-ChisqIndep
2   Statistics-ChiSquare               Statistics-Contingency
3   Statistics-ConwayLife              Statistics-DEA
4   Statistics-DependantTTest          Statistics-Descriptive
5   Statistics-Descriptive-Discrete    Statistics-Distributions
6   Statistics-Frequency               Statistics-GammaDistribution
7   Statistics-KruskalWallis           Statistics-LineFit
8   Statistics-Lite                    Statistics-LogRank
9   Statistics-LSNoHistory             Statistics-LTU
10  Statistics-OLS                     Statistics-RankCorrelation
11  Statistics-RankOrder               Statistics-Regression
12  Statistics-ROC                     Statistics-SerialCorrelation
13  Statistics-Shannon                 Statistics-Simpson
14  Statistics-Table-F                 Statistics-Test
15  Statistics-TTest

Before you can use these tests, you must download the appropriate module into your Perl installation. A sample installation of Statistics-Descriptive (by Colin Kuskie, Andrea Spinelli and Jason Kastner), through the ActiveState package manager, is shown (see List).

      ppm> install statistics-descriptive
      ====================
      Install 'statistics-descriptive' version 2.6 in ActivePerl 5.8.7.815.
      ====================
      Downloaded 10294 bytes.
      Extracting 5/5: blib/arch/auto/Statistics/Descriptive/.exists
      Installing C:\activepl\html\site\lib\Statistics\Descriptive.html
      Installing C:\activepl\site\lib\Statistics\Descriptive.pm
      Successfully installed statistics-descriptive version 2.6 in ActivePerl 5.8.7.815.

Only the first line is input:

      ppm> install statistics-descriptive


   List 6.4.2. Perl script for calculating variance.

    #!/usr/local/bin/perl
    use Statistics::Descriptive;
    $stat = Statistics::Descriptive::Full->new();
    $stat->add_data(1,2,3,4,5,6,7,8,9,10);
    $mean = $stat->mean();
    $var  = $stat->variance();
    print "mean $mean\nvariance $var\n";
    exit;


   List 6.4.3. Output of statistics script.

1   c:\ftp>perl stat.pl
2   mean 5.5
3   variance 9.16666666666667


   List 6.4.4. Perl script for computing the ChiSquare statistic.

    #!/usr/bin/perl
    use Statistics::ChiSquare;
    print chisquare([1, 9, 1, 15, 4, 7]), "\n";
    print chisquare([20, 20, 20, 30, 20, 20, 30 ]), "\n";
    exit;


   List 6.4.5. Output of chi.pl.

1   C:\ftp>perl chi.pl
2   There's a <1% chance that this data is random
3   There's a >50% chance, and a <70% chance, that this
      data is random.


   List 6.5.1. Types of statistical errors.

1   Type 1. Rejecting the null hypothesis when the null
      hypothesis is correct (i.e., seeing an effect when there was
      none)
2   Type 2. Accepting the null hypothesis when the null
      hypothesis is false (i.e., seeing no effect when there was
      one)
3   Type 3. Rejecting the null hypothesis correctly, but for the
      wrong reason, leading to an erroneous interpretation of the
      data in favor of an incorrect affirmative statement
4   Type 4. Erroneous conclusion based on performing the wrong
      statistical test. The type 4 error is the most embarrassing
      and the least excusable. You cannot blame a type 4 error on
      the data. It's all on you.

Considering the rich variety of exotic statistical tests available to the novice, the opportunities for type 4 errors are endless. One way of avoiding type 4 errors is to have a dedicated statistician analyze your data. For those informaticians who have access to the services of a trustworthy statistician, this may actually be the best and most practical solution. There is, however, an alternate approach: resampling. Resampling is a type of statistical analysis that uses computers to model experiments and then repeats the experiments thousands or millions of times to determine the occurrence frequencies for particular sets of data. This area of statistics was popularized by Bradley Efron, and may have particular interest for readers of this book (see List).

   List. Reasons why resampling statistics are of interest to biomedical informaticians.

1   Does not require any knowledge of statistical tests
2   Applicable to a wide range of problems, including clinical
      trial design and decision analyses
3   Easy to understand
4   Easy to program with Perl
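
As a small illustration of the resampling idea, the sketch below estimates a probability by repeating a simulated experiment many times instead of applying a statistical formula. The coin-toss question, the number of trials, the fixed random seed and the subroutine name estimate_probability are all assumptions chosen for the example; they do not come from the book.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Estimates the probability of seeing at least $threshold heads in
# $tosses tosses of a fair coin by simulating the experiment
# $trials times and counting how often the event occurs.
sub estimate_probability {
    my ($trials, $tosses, $threshold) = @_;
    my $hits = 0;
    for (1 .. $trials) {
        my $heads = 0;
        for (1 .. $tosses) {
            $heads++ if rand() < 0.5;    # one toss of a fair coin
        }
        $hits++ if $heads >= $threshold;
    }
    return $hits / $trials;
}

srand(42);    # fixed seed so that repeated runs can be compared
my $estimate = estimate_probability(100_000, 10, 8);
printf "Simulated probability of 8 or more heads in 10 tosses: %.4f\n",
       $estimate;
# The exact binomial value is (45 + 10 + 1) / 1024.
printf "Exact value for comparison: %.4f\n", 56 / 1024;
```

The simulated value should land close to the exact one, and the same loop structure carries over to questions for which no convenient formula is available.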


   List 6.6.1. Randtest.pl, a Perl script that simulates 600,000 casts of the die.

    #!/usr/bin/perl
    #randtest.pl
    #Simulation of a throw of a die
    $count = 0;
    while ($count < 600000)
      {
      $count++;
      $one_of_six = (int(rand(6))+1);
      $hash{$one_of_six}++;
      }
    while(($key, $value) = each (%hash))
      {
      print "$key => $value\n";
      }
    exit;


   List 6.6.2. Output of first test of randtest.pl.

1   C:\ftp>perl randtest.pl
2   1 => 100002
3   2 => 99902
4   3 => 99997
5   4 => 100103
6   5 => 99926
7   6 => 100070


   List 6.6.3. Output of second test of randtest.pl.

1   C:\ftp>perl randtest.pl
2   1 => 100766
3   2 => 99515
4   3 => 100157
5   4 => 99570
6   5 => 100092
7   6 => 99900


   List 6.6.4. Ranfile.pl, a Perl script that assigns random names to newly created files.

    #!/usr/bin/perl
    #ranfile.pl
    #Makes 10 random filenames, each with 8 leading characters,
    #a period and three trailing characters
    $count = 0;
    while ($count < 10)
      {
      $count++;
      &ranfile;
      }
    exit;

    sub ranfile
      {
      my @listchar;
      my $count;
      for ($count = 1; $count <= 12; $count++)
        {
        push(@listchar, chr(int(rand(26))+65));
        }
      $listchar[8]= ".";
      my $randomfilename = join("",@listchar);
      print "Your filename is $randomfilename\n";
      return $randomfilename;
      }


   List 6.6.5. Output of ranfile.pl.

1   C:\ftp>perl ranfile.pl
2   Your filename is EKDUFKBR.YNX
3   Your filename is QVDKUVBY.QUI
4   Your filename is FNZXNKEE.MLV
5   Your filename is NRTXEHQI.VFX
6   Your filename is GWMOLKMX.AYU
7   Your filename is LZAKZQDW.RYR
8   Your filename is PRUAONQQ.OSJ
9   Your filename is XDEDHLKD.GAY
10   Your filename is RUSLNSXI.XVR
11  Your filename is IEPGAWDP.LEH


   List 6.7.1. Ai.pl, a Perl script that simulates clonal tumor growth.

    #!/usr/bin/perl
    #ai.pl
    #Simulates the growth of a tumor from a single cell, with
    #a cell death probability per generation as provided by the user
    print "Enter the death probability for your simulation\n";
    print "Number must be between zero and one.\n";
    print "Most realistic numbers are .45 to .50\n";
    $value = <STDIN>;
    $value =~ s/\n//o;
    if ($value < 0 || $value > 1)
      {
      print "Exiting... you must pick a number between zero and one\n";
      exit;
      }
    print "THE CELL DEATH PROBABILITY FOR THIS SIMULATION IS $value\n\n";
    my $roundnumber = 1; #initiate the generation counter
    &cycle;    #the cycle subroutine is not included in this listing


   List 6.7.2. Output of ai.pl.

1   C:\ftp>perl ai.pl
2   Enter the death probability for your simulation
      Number must be between zero and one.
      Most realistic numbers are .45 to .50
      .46
      THE CELL DEATH PROBABILITY FOR THIS SIMULATION IS .46
      Starting with a single malignant cell, let's watch the
      clonal growth. Tumor terminated...good!
3   Starting with a single malignant cell, let's watch the
      clonal growth. 1 Tumor terminated...good!
4   Starting with a single malignant cell, let's watch the
      clonal growth. 2 1 4 2 1 1 1 Tumor terminated...good!
5   Starting with a single malignant cell, let's watch the
      clonal growth. 2 1 5 6 8 8 8 12 15 18 18 20 19 27 32 30 31
      20 14 16 23 30 30 36 38 34 50 52 67 75 97 114 133 143 150
      156 159 178 200 254 302 292 329 336 382 441 489 603 630 701
      770 862 923 1056 1084 1210 1369 1473 1664 1776 1959 2196
      2475 2862 3098 3327 3740 4095 4634 Bad news. Let's stop
      watching this malignancy
6   Starting with a single malignant cell, let's watch the
      clonal growth. Tumor terminated...good!
7   Starting with a single malignant cell, let's watch the
      clonal growth. Tumor terminated...good!
8   Starting with a single malignant cell, let's watch the
      clonal growth. 3 1 3 5 3 1 1 Tumor terminated...good!
9   Starting with a single malignant cell, let's watch the
      clonal growth. 4 6 5 6 3 3 1 Tumor terminated...good!
10   Starting with a single malignant cell, let's watch the
      clonal growth. 2 2 5 3 2 2 6 5 6 5 4 2 1 Tumor
      terminated...good!
11  Starting with a single malignant cell, let's watch the
    clonal growth. 3 5 3 7 6 3 3 1 1 Tumor terminated...good!
    I've seen enough!


   List 6.7.3. Perl snippet showing the algorithm that repeatedly assigns probabilistic outcomes to an event.

    while ($i < $sum +1)
      {
      $i++;
      $randnum = int( rand(100) ) + 1;
      if ($randnum > (100 * $value))     #outcome above the death probability
        {
        $sum = $sum + 1;
        }
      if ($randnum < ((100 * $value) -1))     #outcome below the death probability
        {
        $sum = $sum - 1;
        }
      }


   List 6.8.1. Run.pl, a resampling script in Perl, that simulates runs of errors.

    #!/usr/local/bin/perl
    $errorno = 0;
    $count = 0;
    while ($count < 100000)
      {
      $count++;
      $x = rand(100);
      if ($x < 2)     #simulates a 2% error rate
        {
        $errorno++;
        }
      else
        {
        $errorno = 0;
        }
      if ($errorno == 3)
        {
        print "Uh oh. 3 consecutive errors\n";
        $errorno = 0;
        }
      }
    exit;

The Perl script simulates 100,000 diagnoses, which is a fair estimate of the total number of diagnoses a pathologist might render in an entire career (at 4,000 diagnoses per year over 25 years of service). Each diagnosis is assigned a random number between 0 and 100. The "diagnosis" loop is repeated 100,000 times. In each loop, if the randomly assigned number is less than 2, the pathologist's error number is incremented by 1. If the next diagnosis is randomly assigned a number of 2 or greater, the error number is dropped back down to 0 (i.e., the diagnosis is correct and the run of errors is broken). If an error occurs on 3 consecutive occasions, the event is printed to the computer monitor (see List).

   List. Output of run.pl.

1   c:\ftp>perl run.pl
2   Uh oh. 3 consecutive errors
3   Uh oh. 3 consecutive errors


   List 6.9.1. Snippet of Perl code to determine unbiased random selection.

    open (HOLD,">holder.txt")||die"cannot";
    while ($n < 1000000)
      {
      $x = int(rand(100)) + 1;  #pick a number between 1 and a hundred
      #make a hash of the numbers picked and the
      #number of times each is picked
      $randhash{$x}++;
      $n++;
      }
    foreach $key (sort byval keys %randhash)
      {
      print HOLD "$randhash{$key} $key\n";
      }
    sub byval
      {
      $randhash{$a} <=> $randhash{$b};
      }


   List 6.10.1. Output of montesw.pl.

1   C:\ftp>perl montesw.pl
2   6598
3   C:\ftp>perl monteno.pl
4   3408


   List 6.11.1. Ceil.pl, calling a POSIX function from a Perl script.

    #!/usr/local/bin/perl
    use POSIX qw(ceil floor);
    $num = 11.3;
    print "Floor is ", floor($num), "\n";
    print "Ceil is ", ceil($num), "\n";
    exit;


   List 6.11.2. Output of ceil.pl.

1   c:\ftp>perl ceil.pl
2   Floor is 11
3   Ceil is 12


   List 6.12.1. Using the ActiveState Programmer's Package Manager.

1   c:\ftp>ppm
2   ppm - programmer's package manager version 3.3
3   copyright (c) 2001 activestate corp. all rights reserved
4   activestate is a division of sophos.


   List 6.12.2. Simple example script for the Fast Fourier Transform Module.

    #!/usr/local/bin/perl
    use Math::FFT;
    my $PI = 3.1415926536;
    my $N = 16; #N, the length of the series, can be any power of 2, such as 4,8,16,64
    $series = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]; #could be anything
    print "series " . join(" ",@$series). "\n";
    my $fft = new Math::FFT($series);
    my $coeff = $fft->rdft();
    print "coefficients \n @{$coeff}\n\n";
    my $spectrum = $fft->spctrm;
    print "spectrum \n @{$spectrum}\n";
    exit;


   List 6.12.3. Output of Fast Fourier Transform script

1   C:\FTP>perl fft.pl
2   series 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16


   List 6.13.1. A full concordance in 10 commands.

    #!/usr/local/bin/perl
    open (TEXT, "1DFRE10.TXT");
    open (OUT, ">1DFRE10.OUT");
    $line = " ";
    while ($line ne "")
      {
      $cumline = "";
      for($i=0;$i<100;$i++)     #every 100 lines count as one page
        {
        $line = <TEXT>;
        $cumline = $cumline . $line;
        }
      $page++;
      $cumline = lc($cumline);
      $cumline =~ s/[^a-z\-\']/ /g;
      @wordarray =  sort(split(/[\n\s]+/,$cumline));
      @concordance = grep { $marked{$_}++; $marked{$_} == 1; } @wordarray;
      undef %marked;
      foreach $thing (@concordance)
        {
        $wordpage{$thing} = $wordpage{$thing} . " $page";
        }
      }
    #The concordance is finished. The next lines just write it to the output file
    foreach $key (sort keys %wordpage)
      {
      print OUT "$key \= $wordpage{$key}\n";
      }
    exit;


   List 6.13.2. A full indexing script in 10 commands.

    #!/usr/local/bin/perl
    open (TEXT, "1DFRE10.TXT")||die"cannot";
    open (OUT, ">1DFRE10.OUT")||die"cannot";
    $line = " ";
    @indextermarray = ("gaul","roman empire","emperor","village","england");
    while ($line ne "")
      {
      $cumline = "";
      for($i=0;$i<100;$i++)
        {
        $line = <TEXT>;
        $cumline = $cumline . $line;
        }
      $page++;
      $cumline = lc($cumline);
      $cumline =~ s/[^a-z\-\']/ /g;
      $cumline =~ s/ +/ /g;
      foreach $thing (@indextermarray)
        {
        if ($cumline =~ /\b$thing\b/)
          {
          $wordpage{$thing} = $wordpage{$thing} . " $page";
          }
        }
      }
    #The index is finished. The next lines just display it on screen
    foreach $key (sort keys %wordpage)
      {
      print OUT "$key \= $wordpage{$key}\n";
      print "$key \= $wordpage{$key}\n";
      }
    exit;


   List 6.13.3. An excerpted output for the indexing program, listing only the terms "england" and "village" and the pages on which they are found.

1   c:\ftp>perl indexer.pl
2   
3   
4   
5   england =  5 14 23 134 207 208 229 277
6   
7   
8   
9   village =  43 77 81 94 128 141 147 184 185 225 226 244


   List 6.13.4. Problems with human-based indexing.

1   Incredibly labor-intensive and time consuming
2   The index cannot be built until the book is in final form
      and the page numbers are known, delaying the publication of
      the book until the indexing is completed
3   If important phrases are omitted completely or if one or
      more of their locations are omitted, no one will likely
      catch the error
4   The indexing effort needs to be repeated if there are book
      revisions and pagination changes.


   List 6.13.5. Extracting candidate index phrases from a test file.

1        #!/usr/local/bin/perl
2        @stop = qw(
3        a about absent absence again all almost also although always among an
4        and another any are as at be because been before being between both
5        but by can could cm did do does done due during each either enough
6        especially etc for found from further had has have having here how
7        however i if in into is it its itself just kg km made mainly make may
8        mg might ml mm most mostly must nearly neither no nor observed
9        obtained often on our overall perhaps present presence quite rather
10       really regarding seem seen several should show showed shown shows
11       significantly since so some such than that the their theirs them then
12       there therefore these they this those through thus to upon use used
13       using various very was we were what when which while with within
14       without would or can't doesn't not
15       );
16       open (TEXT,"1DFRE10.TXT")||die"Cannot";
17       open (OUT,">1DFRE10.OUT")||die"Cannot";
18       undef($/);
19       $phrase = <TEXT>;
20       $phrase =~ s/\n/ /g;
21       $phrase = lc($phrase);
22       $phrase =~ s/[^a-z \']/ /g;
23       foreach $stopword (@stop)
24         {
25         $phrase =~ s/ $stopword / \# /g;
26         }
27       $phrase =~ s/[\s]+/ /g;
28       $phrase =~ s/ ?\# ?/\#/g;
29       @phraselist = sort (split("#",$phrase));
30       @phraselist = grep
31         {$i{$_}++;(($i{$_}==2)&&(scalar(split(" ",$_))>1));}@phraselist;
32       print OUT join("\n",@phraselist);
33       exit;


   List. First 9 lines of output from phrase.pl script.

1   abate fortis
2   abbe foucher
3   abdication of diocletian
4   abilities of
5   able leader
6   abolition of
7   absolute power
8   abuse of
9   academy of inscriptions


   List 6.14.1. Algorithm for regular expression searches of text files.

1   The script asks you for a regular expression to search a
      file. If you're not adept at regular expressions, just
      enter any word. Remember, a word or phrase is always the
      simplest regular expression. In the output example, we'll
      search for the word "adenocarcinoma"
2   If you press the return key without entering a regular
      expression, the script simply exits
3   The script asks Perl for the current epoch time (the number
      of seconds elapsed since a fixed reference point in the
      past)
4   It opens an enormous publicly available file (138 Mbytes)
      named MRCON (we'll learn a lot about this file in Biomedical
      Perl)
5   It reads every line of MRCON (about 2 million of them),
      testing each line to see if it contains a substring that
      matches the regular expression that you provided (step 1)
6   If it finds a match, it adds the line number and the line
      to an external file named regexout.txt
7   When it has finished reading the file, it asks Perl again
      for the epoch time, and determines the script execution time
      by subtracting the script's beginning time from the script's
      end time
8   It prints to the monitor the time spent executing the
      script, as well as the name of the file containing all
      the lines from the MRCON file that matched your provided
      regular expression.


   List 6.14.2. Perl script for regular expression searches of text files.

1        #!/usr/bin/perl
2        #perlfind.pl
3        #11/20/01
4        #this will pull out all the matching lines for a prompted
5        #regular expression from any text file. This short script is incredibly
6        #powerful, but it requires the user to have facility creating
7        #regular expressions
8        open (OUT, ">regexout.txt")||die"Can't open file regexout.txt";
9        $filename = "regexout.txt";
10       print "What's your search regex?\n";
11       $regex = <STDIN>;
12       $regex =~ s/\n//o;
13       if ($regex eq "")
14         {
15         close TEXT;
16         close OUT;
17         print "\nYou didn't give a regex...Goodbye\n";
18         exit;
19         }
20       #$re = qr/$regex/oi;
21       $start = time();
22       &searchsub;
23       $end = time() - $start;
24       print "Retrieval time is $end seconds.\n";
25       print "Your search results are in file $filename.";
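
The script above calls &searchsub, but the subroutine itself does not appear in this list. The following is only a minimal sketch of what such a search routine might look like, not the book's own code. The name search_file, its three arguments, and the case-insensitive match are assumptions made for illustration; the output format (line number followed by the line) follows step 6 of List 6.14.1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the book): scan $infile line by line and
# write "line-number: line" to $outfile for every line matching $regex.
# Returns the number of matching lines.
sub search_file {
    my ($regex, $infile, $outfile) = @_;
    my $re = qr/$regex/i;    # compile once; case-insensitivity is an assumption
    open(my $in,  '<', $infile)  or die "Can't open $infile: $!";
    open(my $out, '>', $outfile) or die "Can't open $outfile: $!";
    my $matches = 0;
    while (my $line = <$in>) {
        if ($line =~ $re) {
            print {$out} "$.: $line";    # $. holds the current input line number
            $matches++;
        }
    }
    close $in;
    close $out;
    return $matches;
}

# Runs only when called with two arguments, for example:
#   perl searchsketch.pl MRCON adenocarcinoma
if (@ARGV == 2) {
    my ($file, $regex) = @ARGV;
    my $start   = time();
    my $count   = search_file($regex, $file, 'regexout.txt');
    my $elapsed = time() - $start;       # end time minus start time
    print "$count matching lines; retrieval time is $elapsed seconds.\n";
}
```

Reading one line at a time keeps memory use small even for a file the size of MRCON, and compiling the pattern once with qr// avoids recompiling it for each of the roughly 2 million lines.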


   List 6.14.3. Output of regular expression search.

1   C:\ftp>perl perlfind.pl
2   What's your search regex?
3   adenocarcinoma
4   Retrieval time is 5 seconds
5   Your search results are in file regexout.txt.


   List 6.15.1. A short script that performs a binary search on a file.

1        #!/usr/local/bin/perl
2        open (TEXT, "find_bin.txt");
3        seek(TEXT, 0, 2);
4        print "What word would you like to find?\n";
5        $findword = <STDIN>;
6        $findword =~ s/\n$//o;
7        $filesize = tell (TEXT);
8        for($i=1;$i<129;$i++)
9           {
10           $portion = int(($filesize * $i)/128);
11          push(@portionarray,$portion);
12          }
13       seek(TEXT, 0, 0);
14       $arraynumber = 64;
15       foreach $division (4,8,16,32,64,128)
16          {
17          $place = $portionarray[$arraynumber-1];
18          seek(TEXT, $place, 0);
19          $line = <TEXT>;
20          $line = <TEXT>;
21          $line =~ /^([a-z]+) /;
22          $estimate_word = $1;
23          if ($estimate_word gt $findword)
24            {
25            $arraynumber = $arraynumber - (128/$division);
26            }
27          else
28            {
29            $arraynumber = $arraynumber + (128/$division);
30            }
31          }
32       undef ($/);
33       seek(TEXT,($place > 10000 ? $place - 10000 : 0), 0);
34       read(TEXT,$holder,20000);
35       if ($holder =~ /\n($findword)[0-9\= ]+\n/)
36          {
37          print $&;
38          }
39       else
40          {
41          print "Sorry. Couldn't find $findword in the index.\n";
42          }
43       exit;
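
The script above narrows its search by repeatedly halving the region of the file in which the word can lie. The same halving idea can be sketched on an index that has already been read into memory as a sorted array of "term = pages" lines. This is an illustrative sketch only: the subroutine name find_index_entry and the sample entries (including their page numbers) are hypothetical and are not taken from the book's index file.

```perl
use strict;
use warnings;

# Hypothetical helper: binary search over a sorted array of "term = pages"
# lines. Returns the matching line, or undef if the term is absent.
sub find_index_entry {
    my ($lines, $word) = @_;
    my ($lo, $hi) = (0, $#{$lines});
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        # The term is everything before the first " ="
        my ($term) = split / =/, $lines->[$mid], 2;
        my $order = $word cmp $term;
        return $lines->[$mid] if $order == 0;
        if ($order < 0) { $hi = $mid - 1 }   # word sorts before the midpoint term
        else            { $lo = $mid + 1 }   # word sorts after the midpoint term
    }
    return;    # not found
}

# Made-up sample index, sorted by term (illustrative data only)
my @index = (
    "emperor = 2 9 11",
    "england = 5 14 23",
    "gaul = 7 8",
    "roman empire = 1 3",
    "village = 43 77 81",
);

my $hit = find_index_entry(\@index, "gaul");
print defined $hit ? "$hit\n" : "Sorry. Couldn't find the word in the index.\n";
```

Each pass discards half of the remaining entries, so an index of n lines needs about log2(n) comparisons. The file-based script trades this exactness for speed on very large files: it probes byte offsets, then reads a 20,000-byte window around the final probe.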


   List 6.16.1. Cluster.pl, a Perl script demonstrating clustering algorithm.

1        #!/usr/local/bin/perl
2        use Algorithm::Cluster;


   List 6.16.2. Output of cluster.pl script.

1   c:\ftp>perl cluster.pl
2   Row0 => Cluster 1
3   Row1 => Cluster 2
4   Row2 => Cluster 2
5   Row3 => Cluster 1
6   Row4 => Cluster 0
7   Row5 => Cluster 0


   List 6.17.1. Example of a very simple program using the LWP (Library for WWW in Perl).

1        #!/usr/bin/perl
2        use LWP::Simple;
3        print (get "http://www.nih.gov");
4        exit;


   List 6.18.1. Some Perl books in bioinformatics (a very different field from biomedical informatics)

1   Beginning Perl for Bioinformatics, by James Tisdall
2   Mastering Perl for Bioinformatics, by James Tisdall
3   Genomic Perl: From Bioinformatics Basics to Working Code, by
      Rex A. Dwyer
4   Perl Programming for Biologists, by D. Curtis Jamison
5   Developing Bioinformatics Computer Skills, by Per Jambeck and
      Cynthia Gibas
6   Bioinformatics, Biocomputing and Perl: An Introduction to
      Bioinformatics Computing Skills, by Michael Moorhouse and
      Paul Barry


   List 6.18.2. A simple DNA palindrome, GAATTC.
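
A DNA palindrome such as GAATTC reads the same as its own reverse complement: reversing the sequence gives CTTAAG, and swapping each base for its partner (C with G, A with T) gives GAATTC again. The short sketch below checks this property. The subroutine names reverse_complement and is_dna_palindrome are introduced here for illustration and do not come from the book; the second test sequence, GAATTG, is an arbitrary non-palindromic example.

```perl
use strict;
use warnings;

# Return the reverse complement of a DNA sequence (upper-case C, A, G, T only)
sub reverse_complement {
    my ($seq) = @_;
    my $rc = reverse $seq;      # scalar context: reverses the string
    $rc =~ tr/CAGT/GTCA/;       # C<->G and A<->T
    return $rc;
}

# A sequence is a DNA palindrome when it equals its own reverse complement
sub is_dna_palindrome {
    my ($seq) = @_;
    return $seq eq reverse_complement($seq) ? 1 : 0;
}

for my $seq ("GAATTC", "GAATTG") {
    printf "%s -> %s : %s\n", $seq, reverse_complement($seq),
        is_dna_palindrome($seq) ? "palindrome" : "not a palindrome";
}
```

The tr/CAGT/GTCA/ translation is the same one used on the reversed match in the palindrome-finding script of List 6.18.3.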


   List 6.18.3. Perl script for finding palindromes in a gene sequence.

1        #!/usr/bin/perl
2        $filename = "sample";
3        open (TEXT, "sample")||die"Cannot";
4        $line = " ";
5        $count = 0;
6        for $n (5..20)
7           {
8           $re = qr /[CAGT]{$n}/;
9           $regexes[$n-5]= $re;
10           }
11       NEXTLINE: while ($count < 1000)
12          {
13          $line = <TEXT> ;
14          $count++;
15          foreach my $value (@regexes)
16             {
17             $start = 0;
18             while ($line =~ /$value/g)
19                {
20                $endline = $';
21                $match = $&;
22                $revmatch = reverse($match);
23                $revmatch =~ tr/CAGT/GTCA/;
24                if ($endline =~ /^([CAGT]{0,15})($revmatch)/)
25                   {
26                   $start = 1;
27                   $palindrome = $match . "*" . $1 . "*" . $2;
28                   $palhash{$palindrome}++;
29                   }
30                }
31             if ($start == 0)
32                {
33                next NEXTLINE;
34                }
35             }
36          }
37       close TEXT;
38       while(($key, $value) = each (%palhash))
39          {
40          print "$key => $value\n";
41          }
42       exit;


   List 6.18.4. Input of sample.pl (line-breaks omitted from original file).

1   ATGAGCGAAGAAAGCTTATTCGAGTCTTCTCCACAGAAGATGGAGTACGAAATTACAAAC
2   TACTCAGAAAGACATACAGAACTTCCAGGTCATTTCATTGGCCTCAATACAGTAGATAAA
3   
4   
5   
6   
7   AAGATCAGAAGCGACCATGACAATGCTATTGATGGATTATCTGAAGTTATCAAGATGTTA
8   TCTACCGATGATAAAGAAAAATTGTTGAAGACTTTGAAATAA


   List 6.18.5. Output of sample.pl.

1   (* separates the spacer region from the flanking palindromic
      regions)
2   C:\FTP>perl sample.pl
3   CTTTG*TCAGGATGGGC*CAAAG => 1
4   AGTAT*T*ATACT => 1
5   GAAATC**GATTTC => 1
6   AGTTT*GGCATCC*AAACT => 1
7   CCTTA*CCCTGT*TAAGG => 1
8   CTTCT*GGAGATTGAGA*AGAAG => 1
9   
10   
11  
12  GATGG*ATTCAAG*CCATC => 1
13  GTTTGG*CAT*CCAAAC => 1
14  CTTCT*CCAC*AGAAG => 1


   List 6.21.1. Examples of software utility functions.

1   Archiving utilities
2   Calculator utility
3   Compression/decompression utilities
4   Conversion utilities - Converts files (text, images, sound,
      video) to and from different formats
5   Database utilities
6   Directory searching
7   Email service
8   Encryption/decryption utilities
9   File copying utilities
10  File reading and parsing utilities
11  FTP file retrieval
12  Indexing utilities
13  Sorting utilities
14  Searching utilities
15  Telnet remote computer access
16  Text editing
17  Web retrieval utilities


   List 6.22.1. Types of software of possible interest to the FDA.

1   Software used as a component, part, or accessory of a
      medical device
2   Software that is itself a medical device (e.g., blood
      establishment software)
3   Software used in the production of a device (e.g.,
      programmable logic controllers in manufacturing equipment)
4   Software used in implementation of the device manufacturer's
      quality system (e.g., software that records and maintains
      the device history record).


   List 6.22.2. Features of software that buyers want.

1   Easy installation
2   Simple instructions and documentation
3   Friendly graphic user interface
4   Functionality that supports the user's goals
5   Transparency (no need for user to understand the underlying
      assumptions, algorithms and data structures upon which the
      functionality of the software is based)
6   Compatibility with operating system and other software
      residing on the user's computer
7   Good user support services


   List 6.22.3. Features of good software (that serious biomedical informaticians need).

1   Extensibility. The functionality of the software and the
      data can be modified and expanded
2   Scalability. The software should work with inputs of any size
3   Standardization of all data (input and output)
4   Open source code
5   Open access data
6   Self-describing software
7   Cross-platform functionality. Software should operate in
      multiple operating systems
8   Interoperability
9   Availability of updates
10  Full documentation of methods and algorithms


   List 6.22.4. Some properties of valid software, modeled on FDA Principles of