List 0.0.0. What every determined reader will learn from this book.
1 How to acquire and organize biomedical data even when the
data is received in the form of unstructured text
2 How to merge and share biomedical data even when the data
is confidential or comes from seemingly incompatible
sources
3 How to write your own programs in Perl that will allow you
to perform common informatics tasks with just a few lines
of code
4 How to automatically index biomedical text and code text
using freely available biological and medical nomenclatures
5 How to use metadata to provide structure and meaning to
biomedical datasets
6 How to use confidential medical data while obeying current
law and protecting patients
7 How to reduce the complexity of biomedical data and
biomedical software
8 How to evaluate ethical problems related to intellectual
property, privacy, human subjects research (see Glossary),
data sharing (see Glossary), and software development.
|
List 0.0.1. People who will benefit from reading this book.
1 Bioinformaticians
2 Biomedical scientists
3 Clinical trialists
4 Computer scientists who need cross-over skills in the
biomedical sciences
5 Government officials at any of the health-related federal
agencies
6 Healthcare graduate students and professionals who use large
biomedical datasets, who need to have data/software
interoperability, or who need to comply with federal, state,
or institutional data requirements
7 Hospital staff, including medical students, physicians,
nurses, technicians, hospital administrators, information
officers
8 Lawyers who handle intellectual property (see Glossary)
cases related to biomedicine
9 Library scientists
10 Medical ethicists
11 Medical software developers and vendors
12 Medical transcriptionists
13 Members of IRBs (Institutional Review Boards, see Glossary)
and Privacy Boards (see Glossary)
14 Privacy experts who work with medical scientists
|
List 1.1.1. Roles of the biomedical informatician.
1 Biologist
2 Healthcare professional
3 Lawyer
4 Software programmer
5 Computer scientist
6 Cryptographer
7 Metadata expert
8 Linguist
9 Statistician
10 Diplomat
|
List 1.2.1. Pre-1955 biomedical advances resulting in increased longevity in many developed countries.
1 Antisepsis
2 Refrigeration of food
3 Standards for the hygienic preparation of food
4 Eradication of insect vectors for yellow fever and malaria
5 Potable public drinking water
6 Antibiotics effective against many bacterial infections
including syphilis, gonorrhea and tuberculosis
7 Vaccines against smallpox and polio
8 The virtual elimination of iodine-deficiency associated
goiter
9 The near elimination of vitamin deficiency diseases
10 The marked reduction of cervical cancer in women thanks to
cytologic screening of cervical smears
11 The prevailing blood tests and quantitative blood cell
analyses used to monitor deviations from normal function
12 The correction of diabetic hyperglycemia with insulin
13 The introduction of radiologic imaging
14 The treatment of hypertension with a large variety of
effective drugs
15 The recognition of the association between cigarette use and
cancer
16 The role of diet and of cigarettes in the progression of
vascular diseases.
|
List 1.2.2. Medical setbacks since 1955.
1 The global spread of AIDS
2 Diminished access to potable water for much of the world's
population
3 The emergence of multiple antibiotic resistant strains of S.
aureus and other previously treatable organisms
4 Increased number of cancer patients due primarily to an
absolute increase in the number of senior citizens at
highest risk for cancer
5 The re-emergence of tuberculosis
6 The re-emergence of insect and other vectors carrying viral
and parasitic diseases
7 The astronomical costs of new, effective medications for
chronic diseases, including cancer
8 High quality, long-term health care attainable for only a
small fraction of the earth's population
9 The rising incidence of obesity and sequelae disorders,
worldwide
10 The rapid geographic spread of outbreaks of new strains of
influenza and other evolving viruses, including HIV and
hemorrhagic fever viruses
11 The threat of destructive and pathogenic species of plants,
insects and animals that have been introduced to new
habitats through acts of human negligence or error
12 Weakening of the earth's ozone layer, increasing human
exposure to ultraviolet radiation
13 The political uses of toxic agents, endemic diseases, and
public health infrastructures.
|
List 1.3.1. Immediate consequences of Semmelweis' prevention of puerperal fever deaths.
1 The medical students were opposed to being forced to wash
their hands
2 Semmelweis' superior, Johann Klein, was likewise opposed,
considering the clinical trial a criticism of his
performance
3 Other obstetricians agreed that Semmelweis' measures were an
attack on their professional conduct
4 The maternity patients were opposed as well, interpreting
sanitary measures as a criticism of their personal
hygiene
|
List 1.3.2. Beliefs held by biomedical informaticians.
1 Medical progress requires the integration of biological data
and clinical data
2 Aggregate clinical data has value beyond its use in guiding
the treatment of individual patients
3 Researchers need methods to acquire clinical data without
harming patients
4 To be useful, biological and clinical data need to be
organized in a standard manner that permits seamless data
integration (see Glossary)
5 Classifications (see Glossary) drive down the complexity of
clinical and biological data
6 Important new testable hypotheses may derive from
pre-existing biological and clinical datasets, but only if
the datasets are made available to scientists
7 The primary data that supports scientific assertions should
be made publicly available, whenever feasible
8 Data analysis is an inexpensive science, particularly if you
know how to program.
|
List 1.4.1. Some important bottlenecks in translational research.
1 Access to clinically annotated tissues collected from human
subjects
2 Access to electronic medical records and other electronic
archives of human clinical data
3 Methods to organize data in a manner that permits the data
to be meaningful and comparable from laboratory to
laboratory and institution to institution
4 Methods to draw clinically valid conclusions from large
datasets containing heterogeneous types of data (e.g.
molecular data and clinical test data).
|
List 1.5.1. Basic skills and activities in biomedical informatics.
1 An understanding of a computer's file and subdirectory
system
2 The ability to download, install and use popular software
applications and utilities
3 An awareness of the differences between structured and
unstructured data
4 Basic understanding of XML (see Glossary) and metadata
annotation
5 Basic appreciation of computer algorithms
6 Some familiarity with data privacy rules and how these rules
relate to the research uses of medical data. Most countries
have such privacy regulations for biomedical data. In the
U.S., these are the HIPAA privacy rules, and in the
United Kingdom, the Data Protection Act
7 A general understanding of concepts of medical record
de-identification
8 Familiarity with the publicly available biological search
engines, databases and tools, including PubMed and
GenBank.
|
List 1.5.2. Advanced skills and activities in biomedical informatics.
1 Programming at a moderate level in at least one programming
language
2 Experience choosing and implementing a laboratory or
hospital information system
3 Knowledge of regulations pertaining to the use of identified
medical data in research
4 Participation in an effort seeking FDA approval for a device
or technology developed from a biomedical informatics effort
5 Participation on a standards committee
6 Intermediate level understanding of XML
7 Basic understanding of RDF (see Glossary)
8 Experience as a member of an IRB or Privacy Board
9 Competing for funding for a biomedical informatics grant
(see Glossary) or contract.
|
List 1.7.1. Steps in gold mining (or data mining).
1 Physical access to mine
2 Legal rights of access to mine
3 Acquire tools to find desired items in mine
4 Acquire tools to extract desired items in mine
5 Acquire tools to refine desired items
6 Acquire tools to certify the purity and quantity of desired
items
7 Transform the desired items into a standard format
8 Transport the desired items to an intended recipient
9 Arrange payment for the desired items
10 Store the desired items
11 Protect the stored items.
|
List 1.8.1. Realistic uses of biomedical informatics.
1 Store, share, search, retrieve and analyze heterogeneous
data sources. This entire process is vastly enhanced by our
current ability to send any type of data anywhere, at
any time, cheaply
2 Create large comprehensive databases (millions of cases)
that allow you to ask questions that could not be asked of
small or non-comprehensive databases
3 Drive down the complexity of biomedical data by using data
specifications (see Glossary) and classifications
4 Track data collected in hospital information systems and
dispatch automatic clinical alerts when data values fall
outside an expected range of behavior or when values violate
the expected properties of data classes
5 Develop new hypotheses by examining and correlating
biological and clinical observations
6 Validate new clinical tests and treatments by examining the
correlations between test values, treatment choices and
clinical outcomes.
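
Item 4 above (automatic alerts for out-of-range values) can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the test names, the numeric ranges, and the `check_value` function are hypothetical and are not clinical reference values or part of any real hospital information system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedRange:
    low: float
    high: float


# Hypothetical ranges for illustration only; not clinical reference values.
EXPECTED = {
    "potassium_mmol_per_l": ExpectedRange(3.5, 5.0),
    "heart_rate_bpm": ExpectedRange(50, 120),
}


def check_value(test_name, value):
    """Return an alert string if the value looks wrong, otherwise None."""
    expected = EXPECTED.get(test_name)
    if expected is None:
        return f"ALERT: no expected range defined for {test_name}"
    # A value of the wrong type violates the expected properties of its data class.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return f"ALERT: {test_name} has non-numeric value {value!r}"
    if value < expected.low or value > expected.high:
        return f"ALERT: {test_name}={value} outside {expected.low}-{expected.high}"
    return None


for name, value in [("potassium_mmol_per_l", 4.2), ("potassium_mmol_per_l", 6.1)]:
    alert = check_value(name, value)
    if alert:
        print(alert)
```

The sketch checks both conditions named in item 4: a value outside its expected range, and a value that does not have the expected type for its data class.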
|
List 1.8.2. Unrealistic uses of biomedical informatics.
1 Replace physicians with computers. Doctors are trained to
make diagnoses, and they don't desire or use software that
purports to do this for them
2 Create superdoctors through the use of computer tools. The
practice of medicine is learned through personal
experiences. Doctors do not need simulations of reality
3 Vastly improve upon books and traditional teaching
strategies. Books are an adequate method of conveying
knowledge. Computers can certainly provide some improvements
to book learning, but there is no reason to think that a
system of learning based on printed literature, that works
perfectly fine, can be vastly improved
4 Solve subtle or complex problems via the use of medical
ontologies (see Glossary). Complex systems are inherently
chaotic, and inferences reached through a logical ontology
modeling a complex system are likely to be misleading
5 Create, within the next decade, comprehensive medical
records for all U.S. citizens that can be accessed and
annotated by all authorized care-givers. This holy grail of
U.S. medical informatics is a worthy long-term pursuit, but
there is no reason to expect that it can be achieved within
a decade or even two decades.
|
List 2.2.1. When are databases particularly useful?
1 When the stored data is complex (e.g., in hospitals and
academic centers)
2 When the basic data structure is constant (i.e., when the
model of the data records does not change)
3 When there are continuous real-time additions, deletions and
modifications of records by multiple users
4 When the computer staff (responsible for the data) prefers
to work with databases.
|
List 2.2.2. When are data files particularly useful?
1 When the dataset is relatively stable
2 When the data structure is relatively unstable (i.e., when
the fundamental model of the data records changes)
3 When XML is the native method of data representation
4 When the computer staff (responsible for the data) prefers
to work with data files.
|
List 2.3.1. Three properties of reality relevant to hospital databases.
1 Database records can be designed in such a manner as to
corrupt the integrity of the database
2 Databases do not care if their integrity is corrupted
3 Modifications to the basic structures of database records
almost always have negative (sometimes catastrophic)
consequences.
|
List 2.3.2. Common weaknesses of some hospital databases.
1 Inability to guarantee that every patient is uniquely
identified within the database
2 Inability to classify types of data into groups with shared
properties
3 Inability to extend data records to include data elements
linked to other databases
4 Inability to organize data as simple collections of
meaningful statements
5 Inability to produce self-describing data records (i.e.,
including in data records all the data necessary to fully
describe the meaning of the data record)
|
List 2.3.3. Desiderata for hospital information systems.
1 Every patient must be uniquely identified within the system
2 Every report must be uniquely identified and associated with
one patient
3 Data items contained in reports must be entered into reports
once only
4 Data items must be well-defined and used in a consistent
manner throughout the system
5 Data values must be bound to a unique identifier (see
Glossary) and associated with a unique report
6 All data entered should be technically retrievable
7 Someone must have the authority to retrieve any and all data
in the hospital information system
8 Data, once entered, should not be corrected or modified in
any way without creating a visible transaction record of the
modification
9 All electronic data related to an electronic record in the
hospital information system should be included in the
hospital information system.
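
Desideratum 8 (no modification without a visible transaction record) can be illustrated with a small sketch. The `AuditedStore` class, its method names, and the identifier format are hypothetical, introduced here only for illustration; a production system would add persistence, authentication, and access control.

```python
import datetime


class AuditedStore:
    """Keeps current values plus an append-only list of every change."""

    def __init__(self):
        self._values = {}
        self.transactions = []

    def set(self, identifier, value, user):
        # Record the change before applying it, so no edit is ever silent.
        self.transactions.append({
            "identifier": identifier,
            "old": self._values.get(identifier),
            "new": value,
            "user": user,
            "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        })
        self._values[identifier] = value

    def get(self, identifier):
        return self._values[identifier]

    def history(self, identifier):
        """Return every recorded transaction for one identifier, oldest first."""
        return [t for t in self.transactions if t["identifier"] == identifier]


store = AuditedStore()
store.set("report-1/glucose", 98, "tech_a")
store.set("report-1/glucose", 89, "tech_b")   # a correction, still traceable
print(store.history("report-1/glucose"))
```

Each data value is bound to one identifier (desideratum 5), and the earlier value remains retrievable from the transaction list after a correction (desideratum 6).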
|
List 2.11.1. General classes of patents.
1 Utilities - new and useful methods, machines, items, or
chemical compounds
2 Designs - a new appearance for a manufactured article
3 Plants - the invention or discovery of a plant variety that
can be asexually reproduced
|
List 2.12.1. Copyright Act of 1976, Title 17, U.S. Code, section 107. Limitations on exclusive rights: Fair use.
1 Notwithstanding the provisions of sections 106 and 106A, the
fair use of a copyrighted work, including such use by
reproduction in copies or phonorecords or by any other means
specified by that section, for purposes such as criticism,
comment, news reporting, teaching (including multiple copies
for classroom use), scholarship, or research, is not an
infringement of copyright. In determining whether the use
made of a work in any particular case is a fair use the
factors to be considered shall include -
2 (1) the purpose and character of the use, including whether
such use is of a commercial nature or is for nonprofit
educational purposes;
3 (2) the nature of the copyrighted work;
4 (3) the amount and substantiality of the portion used in
relation to the copyrighted work as a whole; and
5 (4) the effect of the use upon the potential market for or
value of the copyrighted work
6 The fact that a work is unpublished shall not itself bar a
finding of fair use if such finding is made upon
consideration of all the above factors.
|
List 2.15.1. Tissues that are routinely destroyed by pathology departments.
1 Institutions regularly dispose of tissues removed during
surgical procedures. When a large specimen, such as a colon,
is received in a pathology department, samples are routinely
embedded in paraffin and saved for at least 5 years. The
unsampled colon (the bulk of the specimen) is saved for
several weeks, sufficient time to ensure that the
pathologist has rendered a final diagnosis on the specimen,
and then the specimen is discarded
2 Institutions regularly dispose of archived paraffin-embedded
tissues. Most institutions archive paraffin-embedded tissues
for at least 5 years. At that time, some medical centers
conclude that the tissues are no longer of any importance to
the patient. To avoid the expense of continued storage, some
institutions simply dispose of archived material after 5
years.
|
List 2.15.2. Questions that institutions should ask before transferring tissues and medical records to an external tissue repository.
1 Would the transfer to a third party constitute a sale of
human tissue?
2 Would the transfer to a third party harm any of the patients
from whom the tissue was excised?
3 Would the transfer to a third party benefit society?
4 Do any of the institutional staff encouraging the transfer
of tissues and data have relevant conflicts of interest?
|
List 2.16.1. Recent developments that have enhanced access to experimental datasets.
1 Online journals that invite authors to submit data files
2 Editor policies that require the submission of data files
supporting assertions made in manuscripts
3 Technical ease of storing large datasets on publicly
available servers
4 Technical ease of downloading large datasets from servers
via the internet
5 Data sharing requirements issued by biomedical funding
agencies
6 Expansion of Freedom of Information Act
7 Greater involvement of informaticians in biomedical research
8 Scientific advancements using publicly available datasets
9 Stunning power and scope of publicly available search
engines, including Google (internet documents) and PubMed
(medical abstracts)
|
List 2.18.1. Some definitions of terms related to the open source movement.
1 Free software: The concept of free software, as popularized
by the Free Software Foundation, refers to software that can
be used freely, without restriction, and does not
necessarily relate to the actual cost of the software. The
generally acknowledged father of the free software movement
is Richard Stallman, an MIT visionary who has led an
energetic and unwavering campaign to create and freely
distribute some of the most valued software applications in
use today. The free software movement is similar to the open
source software movement, but some of the features of free
software (ability to modify and re-distribute software in a
prescribed manner as discussed in the software license) are
not always guaranteed in open source software (see List)
2 Open source - The Open Source Software movement is an
offspring of the Free Software movement. The reason that the
open source movement was created was, in part, to placate
developers who wanted to sell software and felt that the term
"free" as in "free software movement",
would be misconstrued by prospective customers to mean that
the developer requires no remuneration. Although a good deal
of free software is no-cost software, the intended meaning
of the term "free" is that the software can be
used without restrictions. The term "open source"
obviates the need to draw this distinction. The Open Source
Initiative posts an open source definition and a list of
approved open source licenses
3 Open Access - In general, open access applies to text and
data the same way that open source applies to software. In
general, open access biomedical data is retrievable (i.e.,
you can find it by using a PubMed search or through a search
engine), and once you've found it, you can download it and
read it. There are several closely-related consensus
statements on the meaning of open access
4 Open source software license - The Open Source Initiative
has an approval process for open source licenses. Software
distributed under an approved license can include a
declaration that the software is "OSI Certified Open
Source Software." The GNU copyleft licenses have been
certified as open source software licenses.
|
List 2.19.1. Examples of undifferentiated software.
1 Basic algorithms
2 Fundamental laws of physics, chemistry, mathematics and
biology
3 Free, cross-platform programming languages
4 TCP/IP internet protocol
5 HTML and XML.
|
List 2.19.2. Examples of undifferentiated data.
1 Human genome
2 Standards documents
3 Nomenclatures
4 Biological classification systems.
|
List 2.19.3. Examples of differentiated software.
1 Programming languages with special features such as
easy-to-use interfaces, an integrated environment, or a
specialized purpose
2 Neural network programs designed for specific types of data
input
3 Complex software designed to support commercial devices,
such as CT-scanners
4 Most hospital information systems and laboratory information
systems.
|
List 2.19.4. Examples of differentiated data.
1 Lexis/Nexis and other legal databases
2 Subscription journals
3 Codes for billable procedures
4 Science Citation Index
5 Chemical Abstracts (R) database.
|
List 2.20.1. A few of the human databases that have been described in the Nucleic Acids Research Database Issue.
1 Androgen Receptor Gene Mutations Database
2 Atlas of Genetics and Cytogenetics in Oncology and
Haematology
3 BGED - Brain Gene Expression Database
4 Cancer Chromosomes
5 Cancer gene databases
6 CGED - Cancer Gene Expression Database
7 Collagen Mutation Database
8 COSMIC - Catalogue Of Somatic Mutations In Cancer
9 Cypriot national mutation database
10 Cytokine Gene Polymorphism Database
11 Cytokine Gene Polymorphism in Human Disease
12 Database of Genomic Variants
13 Database of Germline p53 Mutations
14 EICO DB - Expression-based Imprint Candidate Organiser
15 EpoDB - Erythropoiesis Database
16 ERGDB - Estrogen Responsive Genes Database
17 Gene-, system- or disease-specific databases
18 General polymorphism databases
19 GOLD.db - Genomics Of Lipid-associated Disorders
20 GRAP Mutant Databases
21 HAGR - Human Ageing Genomic Resources
22 HCAD - Human Chromosome Aberration Database
23 HemoPDB - Hematopoietic Promoter Database
24 HERVd - Human Endogenous Retrovirus database
25 HGMD (R) - Human Gene Mutation Database
26 HORDE - Human Olfactory Receptor Data Exploratorium
27 HPMR - Human Plasma Membrane Receptome
28 Human p53, human hprt, rodent lacI and rodent lacZ databases
29 Human PAX2 Allelic Variant Database
30 Human PAX6 Allelic Variant Database
31 IARC TP53 Database
32 Imprinted Gene Catalogue
33 IPD - Immuno Polymorphism Database
34 Lowe Syndrome Mutation Database
35 MTB - Mouse Tumor Biology Database
36 NCL Mutation Database
37 OMIM - Online Mendelian Inheritance in Man
38 Oral Cancer Gene Database
39 PTCH1 Mutation Database
40 RB1 Gene Mutation Database
41 RTCGD - Retroviral Tagged Cancer Gene Database
42 SNP500Cancer
43 SV40 Large T-Antigen Mutant Database
44 T1DBase - Type 1 Diabetes Database
45 The Autism Chromosome Rearrangement Database
46 The Lafora Database
47 The SNP Consortium database
48 TPMD - Taiwan polymorphic microsatellite marker database
49 Tumor Gene Family Databases (TGDBs)
|
List 2.23.1. A record in Taxonomy.
1 ID : 50
2 PARENT ID : 49
3 RANK: genus
4 GC ID : 11
5 SCIENTIFIC NAME : Chondromyces
6 SYNONYM : Polycephalum
7 SYNONYM : Myxobotrys
8 SYNONYM : Chondromyces Berkeley and Curtis 1874
9 SYNONYM : "Polycephalum" Kalchbrenner and Cooke
1880
10 SYNONYM : "Myxobotrys" Zukal 1896
11 MISSPELLING : Chrondromyces
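
A record laid out as "FIELD : value" lines, like the one above, is easy to load into a program. The following is a minimal sketch that assumes only the layout shown in this list; the actual file format of the taxonomy database is not reproduced here, and `parse_record` is a hypothetical helper name.

```python
# A few lines of the record above, in the layout shown in the list.
RECORD = """\
ID : 50
PARENT ID : 49
RANK: genus
GC ID : 11
SCIENTIFIC NAME : Chondromyces
SYNONYM : Polycephalum
SYNONYM : Myxobotrys
MISSPELLING : Chrondromyces
"""


def parse_record(text):
    """Map each field name to the list of values found for it."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        field, value = line.split(":", 1)  # split on the first colon only
        record.setdefault(field.strip(), []).append(value.strip())
    return record


record = parse_record(RECORD)
print(record["SCIENTIFIC NAME"], record["SYNONYM"])
```

Values are kept in lists because a field such as SYNONYM can occur more than once in a single record. The PARENT ID field is what links each record to the next higher level of the classification.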
|
List 3.2.1. The types of human subject research risks.
1 The risk to life and health as a direct result of a medical
intervention
2 The risk of loss of database functionality
3 The risk of loss of confidentiality resulting from
participation in a medical study
4 The risk of loss of privacy resulting from participation in
a medical study.
|
List 3.8.1. Confidentiality issues for biomedical informaticians.
1 Demonstrating to the hospital's IRB (see Glossary) that the
chosen methodology for anonymizing or de-identifying records
is safe and reliable
2 Demonstrating to the hospital's IRB and to the hospital's
information officers that the anonymization and
de-identification processes can be performed automatically,
without giving the informatician any access to the primary
patient record and without opening any HIS vulnerabilities
when data is transferred out of the system.
|
List 3.9.1. Exemption 4 (E4) permitting unconsented research on de-identified medical records. |
List 3.9.2. Section 164.502(f) of the HIPAA Privacy Rule -- Deceased Individuals.
1 We proposed to extend privacy protections to the protected
health information of a deceased individual for two years
following the date of death. During the two-year time frame,
we proposed in the definition of "individual" that the
right to control the deceased individual's protected health
information would be held by an executor or administrator,
or other person (e.g., next of kin) authorized under
applicable law to act on behalf of the decedent's estate.
The only proposed exception to this standard allowed for
uses and disclosures of a decedent's protected health
information for research purposes without the authorization
of a legal representative and without the Institutional
Review Board (IRB) or privacy board approval required (in
proposed Sec. 164.510(j)) for most other uses and
disclosures for research
2 In the final rule (Sec. 164.502(f)), we modify the standard
to extend protection of protected health information about
deceased individuals for as long as the covered entity
maintains the information. We retain the exception for uses
and disclosures for research purposes, now part of Sec.
164.512(i), but also require that the covered entity take
certain verification measures prior to release of the
decedent's protected health information for such purposes
(see Secs. 164.514(h) and 164.512(i)(1)(iii))
3 We remove from the definition of "individual" the
provision related to deceased persons...
|
List 3.10.1. Five requirements for de-identifying medical records.
1 De-identification of data fields that specifically
characterize the patient (name, social security number,
hospital number, address, age, etc.)
2 Free-text data scrubbing, removing identifiers from the
textual portion of medical reports
3 Free-text data privatizing, removing any information of a
private nature that may be contained within the report
4 Rendering the dataset ambiguous, ensuring that patients
cannot be identified by data records containing a unique set
of characterizing information
5 Rendering the data non-complementary, ensuring that the data
cannot be combined with data from other databases or
from multiple searches of the same database that can lead to
the identification of records.
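
Requirements 1 and 4 above can be sketched in code. This is a minimal illustration under stated assumptions, not a complete or validated de-identification method: the field names and both helper functions are hypothetical, and free-text scrubbing (requirements 2 and 3) is not attempted.

```python
from collections import Counter

# Hypothetical field names for illustration; real record layouts will differ.
IDENTIFIER_FIELDS = {"name", "social_security_number", "hospital_number", "address"}


def strip_identifier_fields(record):
    """Requirement 1: return a copy of the record without identifying fields."""
    return {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}


def rare_profiles(records, fields, minimum=2):
    """Requirement 4: list value combinations shared by fewer than `minimum` records."""
    counts = Counter(tuple(r.get(f) for f in fields) for r in records)
    return [profile for profile, n in counts.items() if n < minimum]


records = [
    {"name": "A", "age_group": "60-69", "diagnosis": "colon adenocarcinoma"},
    {"name": "B", "age_group": "60-69", "diagnosis": "colon adenocarcinoma"},
    {"name": "C", "age_group": "90+", "diagnosis": "rare tumor"},
]
cleaned = [strip_identifier_fields(r) for r in records]
print(rare_profiles(cleaned, ["age_group", "diagnosis"]))
```

Any profile returned by `rare_profiles` describes a record that is still unique after the identifier fields are removed, so that record would need further generalization or suppression before the dataset could be considered ambiguous.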
|
List 3.12.1. Some possible consequences of Common Rule violations.
1 The loss to the institution of its funding for the grant in
question
2 The loss to the institution of its Federal Assurance. The
Office for Human Research Protections (OHRP) issues Assurances
(currently called Federalwide Assurances or FWAs) to
institutions that have in-place processes for IRB reviews of
research and for maintaining research standards. An
institution must have an assurance registered with OHRP in
order to receive federal funding for human subjects research
3 An institution-wide suspension of human subject research
efforts
4 The imposition of grant-related restrictions on the
investigators (e.g. a prohibition from applying for federal
grant funding).
|
List 3.13.1. Section 1177 of the Act established civil and criminal penalties.
1 Civil Money Penalties. HHS may impose civil money penalties
on a covered entity of $100 per failure to comply with a
Privacy Rule requirement. Pub. L. 104-191; 42 U.S.C.
1320d-5. That penalty may not exceed $25,000 per year for
multiple violations of the identical Privacy Rule
requirement in a calendar year. HHS may not impose a civil
money penalty under specific circumstances, such as when a
violation is due to reasonable cause and did not involve
willful neglect and the covered entity corrected the
violation within 30 days of when it knew or should have
known of the violation
2 Criminal Penalties. A person who knowingly obtains or
discloses individually identifiable health information in
violation of HIPAA faces a fine of $50,000 and up to
one-year imprisonment. Pub. L. 104-191; 42 U.S.C. 1320d-6.
The criminal penalties increase to $100,000 and up to five
years imprisonment if the wrongful conduct involves false
pretenses, and to $250,000 and up to ten years imprisonment
if the wrongful conduct involves the intent to sell,
transfer, or use individually identifiable health
information for commercial advantage, personal gain, or
malicious harm. Criminal sanctions will be enforced by the
Department of Justice.
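
The civil penalty arithmetic in item 1 can be written out as a short calculation. This sketch uses only the two figures quoted above ($100 per failure, $25,000 per calendar year for violations of the identical requirement) and ignores the circumstances under which no penalty may be imposed; the function name and the requirement labels are hypothetical.

```python
PENALTY_PER_FAILURE = 100            # dollars, as quoted in item 1
ANNUAL_CAP_PER_REQUIREMENT = 25_000  # dollars per identical requirement per year


def civil_penalty(failures_by_requirement):
    """Total penalty for one calendar year.

    `failures_by_requirement` maps a requirement label to the number of
    failures to comply with that requirement during the year.
    """
    total = 0
    for count in failures_by_requirement.values():
        if count < 0:
            raise ValueError("failure counts cannot be negative")
        # The cap applies separately to each identical requirement.
        total += min(count * PENALTY_PER_FAILURE, ANNUAL_CAP_PER_REQUIREMENT)
    return total


print(civil_penalty({"requirement_a": 400, "requirement_b": 10}))
```

In this example the 400 failures against the first requirement are capped at $25,000, and the 10 failures against the second add $1,000.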
|
List 3.15.1. Questions related to consent tracking that institutions must be able to answer.
1 Does each consent form have an identifier and a locator, a
study number, and a data element indicating that the consent
form itself was approved by an IRB?
2 If needed, could you put your hands on the physical consent
document?
3 Does your database indicate the specific study for which
consent was approved?
4 Was the consent form sufficiently detailed, allowing the
patient to approve certain uses of specimens/data and
decline other uses?
5 Is each consent tagged with tracking data?
6 Was the consent approved or declined?
7 What day was the consent signed?
8 Does the institution have a policy that applies to
situations wherein a subject cannot provide an informed
consent (e.g., infants, patients with dementia)?
9 If the institution has a policy of excluding certain classes
of patient from providing informed consent, has the
institution received approval for the policy from its IRB?
10 For children and challenged subjects, was the informed
consent document signed by a surrogate?
11 For children and challenged subjects, how is it determined
who may act as a surrogate, and how is the identity of the
surrogate recorded and tracked?
12 Did the consenting subject change her mind and withdraw
consent after consent had been approved?
13 If consent was withdrawn, what date did this occur?
14 If consent was withdrawn, was consent withdrawn for a
particular use of a specimen/data, or for all purposes
described by the consent document?
15 If consent was withdrawn, does the withdrawal of consent
apply to more than one consent form?
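
Many of the questions above can be answered only if each consent is stored as a structured record. The following is a minimal sketch of such a record; the `ConsentRecord` class and all of its field names are hypothetical, chosen to mirror the questions in this list and not taken from any existing system.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ConsentRecord:
    consent_id: str          # identifier for the consent form (question 1)
    study_number: str        # specific study (questions 1 and 3)
    document_location: str   # where the physical document is kept (question 2)
    irb_approved_form: bool  # the form itself was IRB-approved (question 1)
    approved: bool           # consent approved or declined (question 6)
    signed_on: date          # date of signature (question 7)
    permitted_uses: set = field(default_factory=set)  # question 4
    surrogate: Optional[str] = None                   # questions 10 and 11
    withdrawn: dict = field(default_factory=dict)     # use -> withdrawal date

    def withdraw(self, on, uses=None):
        """Withdraw consent for the given uses, or for all uses if none given."""
        for use in (uses if uses is not None else set(self.permitted_uses)):
            self.withdrawn[use] = on

    def permits(self, use, on):
        """Was this use covered by valid consent on the given date?"""
        if not self.approved or use not in self.permitted_uses:
            return False
        if on < self.signed_on:
            return False
        withdrawn_on = self.withdrawn.get(use)
        return withdrawn_on is None or on < withdrawn_on


record = ConsentRecord("C-001", "S-42", "cabinet 3", True, True,
                       date(2006, 1, 10), {"tissue", "records"})
record.withdraw(date(2006, 3, 1), {"tissue"})
print(record.permits("tissue", date(2006, 3, 2)),
      record.permits("records", date(2006, 3, 2)))
```

Storing the withdrawal date per use makes it possible to answer questions 12 through 14: whether consent was withdrawn, when, and for which uses.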
|
List 3.16.1. Advantages of unconsented medical record research.
1 Saves money and time by eliminating the tedious and
expensive process of obtaining individual consents
2 Sometimes favored by patient advocacy organizations who see
unconsented research as a way of expediting medical progress
and improving the chances of survival of the patients in
their disease constituencies
3 De-identification requirements for most unconsented patient
record research essentially guarantee that no harm will
come to the patient
4 De-identified unconsented databases can be shared and used
for multiple scientific efforts. Consented databases, in
most cases, can be used only for the purposes specified in
the consent form
5 De-identified unconsented databases pose no particular
threat over time to patients. Consented databases often
contain patient identifiers and may pose a confidentiality
and privacy threat long after the consented research is
concluded.
|
List 4.1.1. Examples of dealt standards
1 The permitted levels of toxic substances in foods
2 TCP/IP (Transmission Control Protocol/Internet Protocol),
the internet specification
3 IEEE 802.11, the wireless data transfer standard
4 Longitude and latitude assignments
5 Divisions of time (days, hours, minutes and seconds)
6 Statutes governing medical privacy
|
List 4.2.1. Some causes of medical errors in the field of biomedical informatics.
1 Absence of standards (for describing clinical data)
2 Inadequate terminologies
3 Poorly written text
4 Inadequate object identifiers (e.g., identifiers for names,
tests, reports)
5 Poor interoperability of software tools
6 Poor integration of biomedical databases
7 Poor documentation (of software, of medical devices, of
protocols)
8 Poor annotation (of medical encounters and transactions)
9 Inadequate data structuring (of reports)
10 Sloppy data representation.
|
List 4.2.2. Purposes of data standards. 1 Enhance interoperability of software 2 Enable data integration 3 Increase the efficiency of medical services 4 Increase the speed of medical research 5 Reduce medical errors. |
List 4.3.1. Why governments may choose to avoid creating biomedical standards.
1 Private entities that use a standard may be in the best
position to create the best possible standard
2 Private entities that use a standard may be willing to pay
for the standards development process
3 Private entities are more likely to adopt a new standard if
they had a part in developing the standard
4 Governments may be unwilling to accept the responsibility of
promoting a new standard
5 Governments know that many standards are never adopted by
the public and do not want to waste their resources on a
standard that will be ignored
6 Governments may be reluctant to face criticism for standards
that may adversely affect certain segments of their
populations.
|
List 4.4.1. Excerpt from RICO that may be applicable to standards developers.
1 "1951. Interference with commerce by threats or
violence
2 (a) Whoever in any way or degree obstructs, delays, or
affects commerce or the movement of any article or commodity
in commerce, by robbery or extortion or attempts or
conspires so to do, or commits or threatens physical
violence to any person or property in furtherance of a plan
or purpose to do anything in violation of this section shall
be fined under this title or imprisoned not more than twenty
years, or both
3 (b) As used in this section-
4 (1) The term "robbery" means the unlawful taking
or obtaining of personal property from the person or in the
presence of another, against his will, by means of actual or
threatened force, or violence, or fear of injury, immediate
or future, to his person or property, or property in his
custody or possession, or the person or property of a
relative or member of his family or of anyone in his company
at the time of the taking or obtaining
5 (2) The term "extortion" means the obtaining of
property from another, with his consent, induced by wrongful
use of actual or threatened force, violence, or fear, or
under color of official right."
|
List 4.4.2. Disclaimer against hidden patents within standards
1 "The attention of adopters is directed to the
possibility that compliance with or adoption of OMG
specifications may require use of an invention covered by
patent rights. OMG shall not be responsible for identifying
patents for which a license may be required by any OMG
specification, or for conducting legal inquiries into the
legal validity or scope of those patents that are brought to
its attention. OMG specifications are prospective and
advisory only. Prospective users are responsible for
protecting themselves against liability for infringement of
patents."
|
List 4.4.3. Perceived risks of developing a new standard.
1 The standard may inadvertently contain intellectual property
(particularly patented methods) resulting in a legal
complaint against the creators of the standard
2 The standard may create loss of revenue or property to
certain entities, resulting in legal actions taken against
the creators of the standard
3 The standard may result in medical errors, resulting in
injury to patients and subsequent legal actions taken
against the creators of the standard
4 The standard may have been developed in a manner that
excluded participation by an entity, resulting in a legal
action
|
List 4.5.1. Questions that should be asked prior to developing a new standard.
1 Is there a pre-existing standard that covers the same
technology?
2 If there is a pre-existing standard, can it be enhanced or
modified to provide a desired functionality?
3 How much will it cost to develop the standard?
4 How long will the standards development process take?
5 Will the intended beneficiaries of the standard pay for the
standards development process?
6 Who will develop the standard? Are the selected developers
competent to produce an adequate standard?
7 Are any of the developers conflicted? Do they stand to
profit if the standard is developed in a specific way?
8 Do any of the developers have proprietary software or data
that they may wish to include in the standard?
9 Are the expected developers committed to work through the
duration of the standards development process, and are they
committed to providing all of the time and energy needed to
develop the standard?
10 Will there be a mechanism whereby drafts of the standard are
reviewed openly by the public? Will the minutes of the
working committee be made public? Will public comments be
used to modify successive drafts of the standard?
11 Will the standard have dependencies on other standards? If
so, are there intellectual property issues that must be
resolved before development begins? Will these issues
require licenses or royalty agreements from the standards
developers or the standards users?
12 Once created, is the standard likely to be adopted? Is the
anticipated standard easily implemented?
13 Who will be the adopters of the standard? Are the expected
standard adopters included in the development process for
the standard?
14 Will the standard benefit a range of users beyond the
standards developers?
15 What are the hazards that the standard may produce, and who
might be hurt by the standard? In particular, will any
entities be disadvantaged if they cannot readily adopt the
standard?
16 Is it necessary to have the standard approved by an external
organization?
17 If so, who will pay for the extra costs of obtaining
approval from an external standards organization?
18 Will the standard need to be continuously updated and
modified? Is there a planned process for producing multiple
versions of the standard?
19 Is it really important to have the standard? Is it worth
the effort?
|
List 4.6.1. Organizations active in the field of biomedical standards.
1 ASTM, American Society for Testing and Materials
2 ANSI, American National Standards Institute (see Glossary)
3 HISB, Health Information Standards Board
4 IEEE, Institute of Electrical and Electronics Engineers,
Inc
5 ACR/NEMA, American College of Radiology (ACR) and National
Electrical Manufacturers Association (NEMA), which oversees
the DICOM (Digital Imaging and Communications in Medicine)
image standard
6 NCPDP, National Council for Prescription Drug Programs, Inc
7 NIST, National Institute of Standards and Technology
8 ISO, International Organization for Standardization
9 IEC, International Electrotechnical Commission.
|
List 4.6.2. Some American National Standards programming languages.
1 Mumps (ANSI approval 1977)
2 Basic (ANSI approval 1978)
3 ADA (ANSI approval 1983)
4 C (ANSI approval 1989)
5 Common Lisp (ANSI approval 1994)
6 ADA 95 (ANSI approval 1995)
7 Smalltalk (ANSI approval 1998)
8 C++ (ANSI approval 1999).
|
List 4.7.1. New and future technologies that create biomedical data.
1 Gene Expression arrays (see Glossary)
2 Proteomic arrays
3 Tissue Microarrays
4 Metabolomic arrays
5 Image morphometric arrays.
|
List 4.8.1. Problems created by the introduction of new standards.
1 New classes of data objects require a new standard for each
new object class (examples: Tissue Microarray Data, Gene
Expression Array Data)
2 New standards require new implementations
3 Existing data standards require revision
4 Revisions of existing standards require retroactive
implementation in data records conforming to the prior
version of the standard
5 New data standards require harmonization with other existing
standards. Otherwise multiple standards may compete for the
standards-based data structures and data descriptors
applicable to data elements common to multiple standards
6 Because standards often become the intellectual property of
the standards development organization, new standards
cannot include parts of standards developed by other
organizations. This means that redundant standards may
describe the same objects.
|
List 4.9.1. Fundamental properties of a specification.
1 The object specified must be defined and distinguished from
all other objects. (i.e., one object cannot have two
different specifications and one specification cannot apply
equally to two non-equivalent objects)
2 The description must be organized in a way that is
understandable and unambiguous. (i.e., a standard method of
describing things, in the general sense, can be used.
Languages are standard methods of describing things, but a
better method might employ a formal semantic logic)
3 The descriptors must be well-defined in the context of the
specification and not confused with descriptors of the same
name but different meaning that may appear in other
specifications (e.g., a "date" may be a calendar
notation in one standard and a type of dried fruit in
another specification)
4 The measurements and descriptor values must be well-defined
and not confused with measurements and values of the same
alphanumeric value but different meaning that may appear in
other specifications. (e.g., 10 pounds is not the same as 10
Kg)
5 The specification must describe itself, include information
pertaining to its purpose, its creator, its ownership, any
restrictions on its uses, and any instructions necessary to
interpret the specification.
|
List 4.9.2. Logistical advantages of specifications over standards.
1 A specification need not be developed through a standards
development process. A specification is basically a
descriptive document and only requires fully unambiguous
language. An individual can create a specification that
everyone in the world can understand and use
2 Specifications do not require approval by any federal agency
or organization. Standards have almost no meaning unless
they are approved. In some cases, standards are enforced by
authority of law
3 There are usually many different ways of specifying things.
The same object can be described by different
specifications. Standards tend to impose monolithic
implementations
4 A specification is a general way of describing things and
can be used for many different and new types of things.
Standards are typically developed for specific items and
cannot accommodate new items without pursuing a development
and approval process through a standards development
organization. Biomedical informaticians who use research
data will almost certainly find that existing standards will
not keep pace with the arrival of new techniques and data
objects. The chair shown (see Figure) is a fully specified
image created with Pov-Ray, a free, open source rendering
program (see Appendix). It was created using a .pov file,
which is a plain-text set of instructions written for the
rendering application.
|
List 4.9.3. Snippet from chair.pov rendering specification, modified from Matthias Opitz's public domain scene file. |
List 4.11.1. Parts of an LSID, from The LSID Resolution Protocol Project.
1 Network Identifier (NID)
2 root DNS name of the issuing authority
3 namespace chosen by the issuing authority
4 object id unique in that namespace and assigned locally
5 revision id for storing versioning information
(optional)
|
List 4.11.2. Examples of LSIDs, from The LSID Resolution Protocol Project.
1 urn:lsid:pdb.org:1AFT:1 This is the first version of the
1AFT protein in the Protein Data Bank
2 urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434 References a
PubMed article
3 urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2 Refers to the
second version of an entry in GenBank
|
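The parts named in List 4.11.1 can be pulled out of an identifier by splitting on the colons. What follows is a minimal sketch, not a resolver: the subroutine name parse_lsid and the hash keys are hypothetical names chosen for illustration, and the only validation performed is a count of the colon-separated parts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_lsid is a hypothetical helper, not part of any LSID library.
# It splits "urn:lsid:authority:namespace:object[:revision]" into its
# parts and returns a hash reference, or nothing if the string does not fit.
sub parse_lsid
{
    my ($lsid) = @_;
    my @parts = split(/:/, $lsid);
    return if (@parts < 5 || @parts > 6);
    return unless (lc($parts[0]) eq "urn" && lc($parts[1]) eq "lsid");
    return {
        authority => $parts[2],   # root DNS name of the issuing authority
        namespace => $parts[3],   # namespace chosen by the authority
        object    => $parts[4],   # object id, unique in that namespace
        revision  => $parts[5],   # optional; undef when absent
    };
}

my $parsed = parse_lsid("urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434");
print "authority: $parsed->{authority}\n";
print "namespace: $parsed->{namespace}\n";
print "object id: $parsed->{object}\n";
```

A real application would hand the parsed authority to an LSID resolution service; that step is outside this sketch.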
List 4.13.1. Principles of unique object identification.
1 A unique object can be distinguished from all other unique
objects
2 A unique object cannot be distinguished from itself
3 A class (or collection) of instances can be unique.
|
List 4.13.2. Some registries that continually assign unique identifiers to requesting entities.
1 DOI, Digital Object Identifier
2 PMID, PubMed identification number
3 LSID (Life Science Identifier)
4 HL7 OID (Health Level 7 Object Identifier)
5 DICOM (Digital Imaging and Communications in Medicine)
identifiers
6 ISSN (International Standard Serial Numbers)
7 Social Security Numbers (for U.S. population)
8 NPI, National Provider Identifier, for physicians
9 Clinical Trials Protocol Registration System
10 Office for Human Research Protections Federalwide Assurance
number
11 Data Universal Numbering System (DUNS) number
12 DNS, Domain Name System.
|
List 4.13.3. Dependable computer systems that rely on unique object identifiers.
1 Google (relies on URLs)
2 PubMed (relies on PubMed identifiers)
3 Libraries (rely on ISSN, DOI)
4 Swiss banks (rely on unique account numbers).
|
List 4.13.4. Some medical errors related to misidentification.
1 Correctly identified medication provided to incorrectly
identified person
2 Incorrectly identified medication provided to correctly
identified person
3 Incorrectly identified dosage of correct medication provided
to correctly identified person
4 Blood transfusion provided to incorrectly identified person
5 Report sent to incorrectly identified physician
6 Report identified with wrong person's name
7 Bill sent to incorrectly identified person
8 Report provided with diagnosis intended for different person
9 Wrong operation performed on incorrectly identified patient
10 Incorrectly identified patient treated for another patient's
illness.
|
List 4.15.1. Information deficiencies in the statement "John Smith has a blood glucose of 85".
1 No unique patient identifier (many people are named John
Smith)
2 No unique time identifier (indicating when the test was
performed and distinguishing the test results from other
blood glucose values obtained from the patient at other
times)
3 No unique test identifier (indicating the specific protocol
used to measure blood glucose in this instance)
4 No unique identifier for the units of measurement
5 No unique report identifier (indicating that the report
itself is a unique laboratory object that can be archived
and retrieved)
|
List 4.15.2. Three conditions for a meaningful assertion in informatics.
1 There is a specified object about which the statement is
made. When the object is a unique object (such as a
patient), the object must be specified in a manner that
distinguishes the object from all other objects, and this is
typically done with a unique object identifier
2 There is data that pertains to the specified object
3 There is metadata that describes the data (that pertains to
the specified object.)
|
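The three conditions in List 4.15.2 map naturally onto a Perl associative array with one key per condition. The sketch below is only an illustration: the key names, the metadata string and the helper is_meaningful are invented here, and the identifier value is an arbitrary example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# All values below are invented for illustration.
my %assertion = (
    object_id => "77300183",                  # unique object identifier
    metadata  => "blood_glucose_mg_per_dL",   # says what the data means
    data      => 85,                          # the data itself
);

# is_meaningful is a hypothetical helper: an assertion qualifies only
# if the identified object, the metadata and the data are all present.
sub is_meaningful
{
    my ($assertion_ref) = @_;
    foreach my $part ("object_id", "metadata", "data")
    {
        my $value = $assertion_ref->{$part};
        return 0 unless (defined $value && $value ne "");
    }
    return 1;
}

if (is_meaningful(\%assertion))
{
    print "$assertion{object_id} $assertion{metadata} $assertion{data}\n";
}
```

By this test, "John Smith has a blood glucose of 85" fails on its own terms, because a name is not a unique identifier and the number carries no metadata.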
List 4.15.3. Generalizable scientific statements.
1 f=ma -- Force is mass times acceleration
2 If a gas is held at constant temperature, its volume is
inversely proportional to its pressure - Boyle's law
3 Ontogeny recapitulates phylogeny - fetal development follows
the evolutionary path of the species (a false assertion)
4 There are 10 types of people, those who use binary notation
and those who do not
5 (love of money) = (evil)x(evil) -- The love of money is the
root of all evil.
|
List 4.15.4. Algorithm for de-identifying with an identifier.
1 Collect data on unique object. "Joe Public has brown
eyes."
2 Assign a unique identifier. "Joe Public has unique
identifier, 77300183."
3 Substitute name of object with its identifier
4 Consistently use the identifier with data. "77300183
has brown eyes."
5 Do not let anyone know that Joe Public is 77300183.
|
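Steps 3 and 4 of List 4.15.4 amount to a pattern substitution. The following is a minimal sketch under a strong simplifying assumption, namely that every name to be hidden is already in a private lookup table; the hash %id_for and the subroutine deidentify are hypothetical names, and real de-identification has to deal with far more than names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# %id_for is the private lookup table. Step 5 of the list means this
# table must never be shared along with the data.
my %id_for = ("Joe Public" => "77300183");

# deidentify is a hypothetical helper that replaces each known name
# in a piece of text with its identifier.
sub deidentify
{
    my ($text) = @_;
    foreach my $name (keys %id_for)
    {
        # \Q...\E treats the name as literal text, not as a pattern
        $text =~ s/\Q$name\E/$id_for{$name}/g;
    }
    return $text;
}

print deidentify("Joe Public has brown eyes."), "\n";
```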
List 5.2.1. Some questions that can be answered with short program scripts.
1 Strip all the private identifiers from a medical record
2 Find all the surgical procedures included in the dataset of
surgical post-op notes, and annotate each procedure with its
frequency of occurrence in the dataset
3 Index a book with the page location of all terms that are
names of diseases
4 Find all the palindromes in a gene sequence database and
arrange them by frequency of occurrence
5 Find the most commonly occurring octamer sequence in the
human genome database
6 Find all octamers that occur only once in the human genome
database
7 Rank sequences from a gene expression array experiment based
on levels of over-expression
8 From a patient database, find the diseases that have a
chronologic relationship with another condition (e.g.
chicken pox never occurs after shingles)
9 Find all tumors associated with a gene fusion mutation
10 Collect 100 histopathologic images of liver disease from the
Web.
|
List 5.2.2. The three programming tricks in medical informatics.
1 File parsing (opening a file and examining the contents of
the file, one line at a time)
2 Pattern matching (finding a fragment of parsed text that
matches a word, a phrase or a character pattern of interest)
3 Assigning data structures to hold numbers or textual data
that can be operated on, with outputs placed in an external
file.
|
List 5.2.3. Pseudocode to collect all the lines from a file that contain the phrase "biomedical informatics".
1 Open a file for reading. (Verbose equivalent: Get a file
from the hard drive that has a particular name and prepare
it so that the data in the file can be extracted and put
into holders in the computer's memory.)
2 Parse the lines of the file. (Verbose equivalent: Grab
the characters from the first line of the file and put them
into a data holder that occupies a specific place in
computer memory. Be prepared to repeat this for all the
lines of the file.)
3 Collect all the lines that contain the phrase
"biomedical informatics". (Verbose equivalent: As each
line is placed in a holder in computer memory, determine
whether the line contains the string "biomedical
informatics" and if it does, add the held data to a
structure called an array, which can hold many character
strings, in sequence.)
4 When the file is exhausted, empty all the matching lines
into an external file, opened for writing, named
"output.txt". (Verbose equivalent: At the end of
the file parsing loop, take the array structure, and
transfer all the character strings from the array, in
sequence, into a newly created file that has been prepared
to accept data.)
|
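The pseudocode in List 5.2.3 translates almost line for line into Perl. This is one possible rendering, not the book's own script: the subroutine name collect_lines is hypothetical, and a short in-memory sample stands in for a real input file so that the sketch runs without any external data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# collect_lines is a hypothetical helper: it reads a filehandle one
# line at a time and returns the lines that contain the given phrase.
sub collect_lines
{
    my ($fh, $phrase) = @_;
    my @matches;
    while (my $line = <$fh>)
    {
        push(@matches, $line) if ($line =~ /\Q$phrase\E/);
    }
    return @matches;
}

# Sample text stands in for a real file (an assumption of this sketch).
my $sample = "An introduction to biomedical informatics\n"
           . "A line about something else\n"
           . "Perl for biomedical informatics\n";
open(my $in, "<", \$sample) || die "Can't open sample text";
my @found = collect_lines($in, "biomedical informatics");
close($in);

# Empty the matching lines into output.txt, opened for writing.
open(my $out, ">", "output.txt") || die "Can't open output.txt";
print $out @found;
close($out);
print scalar(@found), " matching lines written to output.txt\n";
```

To read a real file instead, replace the reference to $sample in the first open with a file name.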
List 5.2.4. Reasons to program in Perl.
1 Perl can be obtained at no cost
2 Perl is available for virtually every operating system and
comes bundled into Unix and Linux distributions
3 Perl is extremely popular among bioinformaticians
4 It takes just a few hours to learn enough Perl to write your
own biomedical informatics programs
5 Perl programs tend to be much shorter and easier to
understand than programs written in C or Java
6 A Perl script written for your computer will probably work
on any other computer loaded with a Perl interpreter, even
if the other computer has a different operating system
7 Unlike C and C++, Perl comes with native pattern matching
commands (so-called regular expressions) which are used in
virtually every program in the field of biomedical
informatics
8 There are many thousands of freely available Perl tools that
perform a wide range of useful operations that can extend
the functionality of your own programs
9 Perl code can be written in a manner that looks much like
simple narrative text (if you make the effort), making it
easy for others to read
10 Once you've learned Perl, you can migrate to almost any
other programming language with ease.
|
List 5.5.1. Contents of typical flat-file, "taxo.txt" extracted from "Taxonomy".
SYNONYM : Bacillus aegyptius
SYNONYM : Haemophilus aegyptius
SYNONYM : Hemophilus conjunctivitidis
SYNONYM : Haemophilus influenzae aegyptius
SYNONYM : Bacillus conjunctivitidis
SYNONYM : Bacterium aegyptiacum
SYNONYM : Bacterium conjunctivitis
SYNONYM : Bacterium pseudo conjunctivitidis
|
List 5.5.2. Perl script, open1.pl, to open a file and read a file.
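#!/usr/bin/perl
#open1.pl prints every line of taxo.txt to the screen
open(FILE, "taxo.txt")||die"Can't open file";
$line = " ";
while ($line ne "")   #an empty value marks the end of the file
{
    $line = <FILE>;   #read the next line
    print $line;
}
close(FILE);
exit;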
|
List 5.5.3. Output of open1.pl.
C:\ftp>perl open1.pl
SYNONYM : Bacillus aegyptius
SYNONYM : Haemophilus aegyptius
SYNONYM : Hemophilus conjunctivitidis
SYNONYM : Haemophilus influenzae aegyptius
SYNONYM : Bacillus conjunctivitidis
SYNONYM : Bacterium aegyptiacum
SYNONYM : Bacterium conjunctivitis
SYNONYM : Bacterium pseudo conjunctivitidis
|
List 5.10.1. Mwp.pl, a ridiculously short text editor, in Perl.
#!/usr/bin/perl
open (OUT, ">>mycumu.txt");   #cumulative file, opened for appending
open (NEW, ">mynew.txt");     #session file, overwritten on each run
$line = " ";
until ($line eq "\n")         #a bare return key ends the session
{
    $line = <STDIN>;
    print OUT $line;
    print NEW $line;
}
exit;
|
List 5.10.2. Until loop in Perl.
$line = " ";
until ($line eq "\n")   #loop stops when all you've entered is
                        #the return key
{
    $line = <STDIN>;    #waits for the next line of input
    print OUT $line;    #appends to the cumulative file
    print NEW $line;    #writes to the current script-session file
}
|
List 5.11.1. Common errors in Perl scripts.
1 Perl blocks must be balanced with curly brackets. Every
block (e.g., while, if, for, unless, foreach) must have a
beginning curly bracket, "{", and a balanced closing
curly bracket, "}". This can become hairy in
scripts that have multi-nested blocks
2 Command lines must end with a semicolon
3 Scalar variables, which hold strings or numbers, must be
prefixed with a "$", as in $date
4 Spelling counts in scripts. Perl cannot interpret a
misspelled command or variable
5 An uppercase character has a different ASCII value than its
lowercase equivalent. With few exceptions, you will find it
useful to maintain case consistency in Perl scripts
6 Characters that serve as reserved Perl symbols must be
backslashed if they are used as string characters. For
example, use \. \/ \\ \$ if you want to use ./\$ as
characters. There are exceptions to this rule: \n,\d, \w are
reserved symbols and never refer to the letters, ndw. The
strange and non-intuitive use of backslashes in Perl takes
some mental adjustment and accounts for the "leaning
toothpick syndrome" in Perl scripts. Complex regular
expressions often resemble toothpicks tossed amidst string
characters
7 Certain operations must be enclosed by parentheses (e.g.,
write if (1 == 2), not if 1 == 2)
8 The "=" operator assigns a value and does not
test for equality. To test for equality, use "=="
if you are comparing two numbers and use "eq" if
you are comparing two strings. Remember that string
comparison operators (eq, ne, lt, gt) are different from
number comparison operators (==, !=, <, >)
9 The "=" operator must not be used when you really want
the regex binding operator, "=~".
|
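Items 8 and 9 of List 5.11.1 are easiest to remember with the four operators side by side. The short script below is an illustrative sketch; the variable names and values are arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $count = 10;          # "=" assigns a value
my $name  = "Heparin";

print "numbers are equal\n" if ($count == 10);         # numeric test
print "strings are equal\n" if ($name eq "Heparin");   # string test
print "pattern matches\n"   if ($name =~ /^hep/i);     # regex test

# Numeric and string comparisons can disagree on the same values:
# as numbers 10 is greater than 9, but as strings "10" sorts before
# "9" because the character "1" comes before the character "9".
print "10 is greater than 9 as a number\n" if (10 > 9);
print "10 is less than 9 as a string\n"    if ("10" lt "9");
```

All five lines print, which shows why choosing the wrong family of comparison operators gives answers that look plausible but are wrong.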
List 5.11.2. Summary of the first Perl programming section.
1 Perl scripts are simple text files. Perl scripts should be
named using the .pl extension. Perl is a quintessential
command-line language. At the command prompt, run your
scripts by typing perl, then the name of the script, then
the return key (on some systems, you needn't include the
name perl)
2 Perl scripts start off with a header line
3 Perl commands end with a semicolon
4 Perl blocks are delineated by curly brackets ({ })
5 You can assign strings to variables by using the assignment
operator, "="
6 You can read, write or append to files using the
"open" command
|
List 5.12.1. Pseudocode that outlines the general construction of a Perl script.
header (shebang) line;
input something;
if (something evaluates to true)
{
    do something;
    for or while (some condition)
    {
        do something;
    }
    do something;
    do something;
}
for or while (some condition)
{
    do something;
    if (something evaluates to true)
    {
        do something;
        do something;
        do something;
    }
    output something;
}
exit;
|
List 5.14.1. Perl script bigread.pl.
#!/usr/bin/perl
#bigread.pl
#This script lets you page through enormous files,
#20 lines at a time, with no file load time
print "What file do you want to read?";
$filename = <STDIN>;
chomp($filename);
open (TEXT, $filename)||die"Can't open file";
$line = " ";
while ($line ne "")   #while $line is not equal to empty
{
    for ($count = 1; $count <= 20; $count++)
    {
        $line = <TEXT>;
        print $line;
    }
    print "Type QUIT if you want to quit. Otherwise press any key\n";
    $response = <STDIN>;
    if ($response =~ /QUIT/i)
    {
        last;
    }
}
exit;
|
List 5.14.2. Output of File Reader.
C:\ftp>perl bigread.pl
What file do you want to read?e:\omim.txt
*RECORD*
*FIELD* NO
100050
*FIELD* TI
100050 AARSKOG SYNDROME
*FIELD* TX
Grier et al. (1983) reported father and 2 sons with typical Aarskog
syndrome, including short stature, hypertelorism, and shawl scrotum.


sons and that this suggested autosomal dominant inheritance. Actually,
the mother seemed less severely affected, compatible with X-linked
Type QUIT if you want to quit. Otherwise press any key
|
List 5.14.3. Summary of the second Perl programming section.
1 How to open and read from files, line by line
2 How to prompt a user for input
3 Looping using
|
List 5.15.1. Things you can do with a one-line Regular expression.
1 Collect the lines from a file that contain a specific word,
phrase or number
2 Collect the lines from a file that contain any desired
combination of the above
3 Substitute any alphanumeric character string for any other,
for the entire file
|
List 5.16.1. Using the match operator with regular expressions.
for all the lines of a given file
{
    put the next line from the file into some variable;
    check the line to see if it matches your regular expression;
    if the line matches the regular expression
    {
        do something with it, like put it into another file;
        or do an operation on the matching value;
    }
}
|
List 5.16.2. Using the substitution operator with regular expressions.
for all the lines of a given file
{
    put the next line from the file into some variable;
    do a substitution on all of the parts of the line that match
    your regular expression;
    do something with the revised line, like rearranging it and
    then putting the rearranged line into another file;
}
|
List 5.17.1. Pattern match options.
1 g Match globally, (find all occurrences)
2 i Do case-insensitive pattern matching
3 m Treat string as multiple lines
4 o Compile pattern only once
5 s Treat string as single line
6 x Use extended regular expressions
7 ^ Match the beginning of the line
8 . Match any character (except newline)
9 $ Match the end of the line (or before newline at
the end)
10 | Alternation
11 () Grouping
12 [] Character class
13 * Match 0 or more times
14 + Match 1 or more times
15 ? Match 1 or 0 times
16 {n} Match exactly n times
17 {n,} Match at least n times
18 {n,m} Match at least n but not more than m times
19 \n newline(LF, NL)
20 \W Match a non-word character
21 \s Match a whitespace character
22 \S Match a non-whitespace character
23 \d Match a digit character
24 \D Match a non-digit character.
|
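A few of the entries in List 5.17.1 are easier to grasp when seen working together. The sketch below is illustrative only; the sample sentence and the helper name count_matches are invented, and the \w symbol (a word character, the counterpart of \W) is standard Perl although it does not appear in the list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_matches is a hypothetical helper that counts how often a word
# occurs in a text, ignoring case.
sub count_matches
{
    my ($text, $word) = @_;
    # i: case-insensitive; g: find all occurrences. In list context
    # the match returns every occurrence it finds.
    my @hits = ($text =~ /\Q$word\E/gi);
    return scalar(@hits);
}

my $text = "Haemophilus aegyptius and HAEMOPHILUS influenzae, 2 names";
print count_matches($text, "haemophilus"), " matches\n";

# ^ anchors the match at the start; \s+ is one or more whitespace
# characters; () groups and captures the second word into $1
if ($text =~ /^\w+\s+(\w+)/)
{
    print "second word: $1\n";
}

# \d is a digit; {1,3} allows one to three of them; $ anchors the end
if ($text =~ /(\d{1,3}) names$/)
{
    print "count found: $1\n";
}
```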
List 5.17.2. Sentence.pl Perl script, which creates a file wherein each new sentence begins on a new line.
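#!/usr/local/bin/perl
open (TEXT, "1DFRE10.TXT")||die"Can't open file";
open (OUT,">1DFRE10.OUT")||die"Can't open file";
undef($/);                 #read the whole file in one gulp
$string = <TEXT>;
$string =~ s/[\n]+/ /g;    #replace line breaks with spaces
#a period, one or two spaces, then an uppercase letter
#is taken to mark the start of a new sentence
$string =~ s/([^A-Z]+\.[ ]{1,2})([A-Z])/$1\n$2/g;
print OUT $string;
exit;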
|
List 5.18.1. Periods.pl, a Perl script for removing periods that do not delineate sentences.
#!/usr/bin/perl
#disbrev2.pl
#replaces periods with *, except when period marks end of sentence
$k = "Mr. Dr. P.I.N. Ph.D. M.D. 0.3 .4 5. 4.6.7.8.9 end_of_sentence. Hello";
$firstvalue = $k;
$k =~ s/\b([ \w\d]*)\.+(?=[\w\d]*)(?! [A-Z])/$1\*$2/g;
print "$firstvalue =>\n$k";
exit;
|
List 5.18.2. Output of disbrev2.pl.
C:\ftp>perl disbrev2.pl
Mr. Dr. P.I.N. Ph.D. M.D. 0.3 .4 5. 4.6.7.8.9 end_of_sentence. Hello =>
Mr* Dr* P*I*N* Ph*D* M*D* 0*3 *4 5* 4*6*7*8*9 end_of_sentence. Hello
C:\ftp>
|
List 5.18.3. Regex (regular expression) substitution examples.
1 $string =~ s/^ +//o; Removes leading spaces from a character
string
2 $string =~ s/ +$//o; Removes trailing spaces from a
character string
3 $string =~ s/ +/ /g; Changes all sequences of one or more
spaces to just a single space
4 $string =~ s/\n//g; Gets rid of newline (sometimes called
linebreak) characters in your string
5 $string =~ s/\b(\w+\.[ ]{1,2})([A-Z])/$1\n$2/g;
6 This finds the most common sentence delimiter (the end of a
word followed by a period, followed by one or two spaces,
followed by an uppercase letter) and inserts a
newline character so that each new sentence begins on a
new line
7 $string =~ tr/A-Z/a-z/; Every uppercase letter is converted
to a lowercase letter using the translate operator
(tr/a-z/A-Z/ does the opposite)
8 $string = |
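Several of the substitutions in List 5.18.3 are typically applied one after another to tidy a string before it is indexed or compared. A minimal sketch follows; the subroutine name normalize and the sample string are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# normalize is a hypothetical helper that chains four of the listed
# operations on a copy of its argument.
sub normalize
{
    my ($string) = @_;
    $string =~ s/^ +//;        # remove leading spaces
    $string =~ s/ +$//;        # remove trailing spaces
    $string =~ s/ +/ /g;       # collapse runs of spaces to one space
    $string =~ tr/A-Z/a-z/;    # convert uppercase to lowercase
    return $string;
}

print "[", normalize("   Bacillus    Aegyptius  "), "]\n";
```

The square brackets in the print statement are there only to make the trimmed ends visible.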
List 5.19.1. Wc.pl Perl script, which counts the words in a file in 5 commands.
#!/usr/local/bin/perl
open (TEXT, "1DFRE10.TXT");
undef($/);                                 #read the whole file at once
$all_text = <TEXT>;
@wordarray = split(/[\n\s]+/, $all_text);  #split on runs of whitespace
print scalar(@wordarray);                  #the number of words
exit;
|
List 5.20.1. The Zipf distribution of the prior paragraph.
c:\ftp>perl zipf.pl
00007 of
00005 a
00004 the
00003 words
00003 is
00003 in
00002 zipf
00002 text
00002 occurrences
00002 distribution
00001 zipf's
00001 way
00001 this
00001 their
00001 that
00001 small
00001 shown
00001 see
00001 practical
00001 paragraph
00001 order
00001 most
00001 listing
00001 list
00001 law
00001 interpreting
00001 for
00001 different
00001 descending
00001 any
00001 amount
00001 account
|
List 5.20.2. The first ten items in the Zipf distribution of The Decline and Fall of the Roman Empire.
26856 the
18032 of
09136 and
06026 to
04654 a
04155 in
03170 was
03081 his
02815 by
02391 that
|
List 5.20.3. Zipf.pl, a Perl script that creates a Zipf distribution in 6 commands.
#!/usr/local/bin/perl
open (TEXT, "1DFRE10.TXT");
open (OUT, ">1DFRE10.OUT");
undef($/);
$all_text = <TEXT>;
$all_text = lc($all_text);
$all_text =~ s/[^a-z\-\']/ /g;
@wordarray = split(/[\n\s]+/, $all_text);
foreach $thing (@wordarray)
{
    $freq{$thing}++;
}
#The Zipf list finished. The next lines just display the distribution
while ((my $key, my $value) =
|
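List 5.20.3 breaks off before its display loop. The following is not a reconstruction of the missing lines but an independent sketch of one way to count and display a Zipf distribution. The subroutine name zipf_counts and the sample sentence are invented, and the choice to reverse-sort the formatted lines is an assumption, made because it reproduces the ordering seen in List 5.20.1 (descending counts, with ties in reverse alphabetical order).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# zipf_counts is a hypothetical helper that tallies word frequencies
# using the same cleanup as the listing: lowercase the text and keep
# only letters, hyphens and apostrophes.
sub zipf_counts
{
    my ($all_text) = @_;
    my %freq;
    $all_text = lc($all_text);
    $all_text =~ s/[^a-z\-\']/ /g;
    foreach my $thing (split(/\s+/, $all_text))
    {
        next if ($thing eq "");   # skip the empty leading field
        $freq{$thing}++;
    }
    return %freq;
}

my %freq = zipf_counts("The words of the text, and the order of the words.");

# Zero-padded counts sort correctly as plain strings, so a reversed
# string sort puts the most frequent words first.
my @lines = map { sprintf("%05d %s", $freq{$_}, $_) } keys %freq;
print join("\n", reverse(sort(@lines))), "\n";
```

To process a file instead of a sample sentence, read the file into a string as in the listing and pass that string to zipf_counts.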
List 5.20.4. Example of an associative array, %patient_weight.
$patient_weight{"John Public"} = 155;
$patient_weight{"Mary Smith"} = 110;
$patient_weight{"Jules Berman"} = 195;
$patient_weight{"Jules Berman"}++;   #the stored value is now 196
|
List 5.20.5. Summary of the third Perl programming section.
1 Creating and interpreting complex regular expressions
2 Looping through arrays with foreach blocks
3 Looping through associative arrays with while blocks
4 New Perl operators and commands: split(), push(), lc(),
sort(), join(), substr(), scalar(), undef(), incrementing
values and concatenating strings
5 Advanced pattern substitution and substitution options
|
List 5.21.1. A sample MESH record.
*NEWRECORD
RECTYPE = D
MH = Heparin
AQ = AA AD AE AG AN BI BL CF CH CL CS CT DF DU EC GE HI IM IP ME PD PH PK PO RE SD SE ST TO TU UL UR
PRINT ENTRY = Heparinic Acid|T118|T121|T123|NON|EQV|UNK (19XX)|800523|abbbcdef
PRINT ENTRY = alpha-Heparin|T118|T121|T123|NON|NRW|UNK (19XX)|800523|abbbcdef
ENTRY = Liquaemin|T118|T121|T123|TRD|NRW|UNK (19XX)|861029|abbbcdef
ENTRY = Sodium Heparin|T118|T121|NON|NRW|UNK (19XX)|830330|abbcdef
ENTRY = Heparin, Sodium
ENTRY = alpha Heparin
MN = D09.698.373.400
PA = Anticoagulants
PA = Fibrinolytic Agents
EC = antagonists & inhibitors:Heparin Antagonists
MH_TH = BAN (19XX)
ST = T118
ST = T121
ST = T123
N1 = Heparin
RN = 9005-49-6
MS = A highly acidic mucopolysaccharide formed of equal parts of sulfated D-glucosamine and D-glucuronic acid with sulfaminic bridges. The molecular weight ranges from six to twenty thousand. Heparin occurs in and is obtained from liver, lung, mast cells, etc., of vertebrates. Its function is unknown, but it is used to prevent blood clotting in vivo and vitro, in the form of many different salts
PM = /therapeutic use was HEPARIN, THERAPEUTIC 1965
HN = /therapeutic use was HEPARIN, THERAPEUTIC 1965
MED = *1635
MED = 3275
M90 = *2406
M94 = 4517
MR = 20040707
DA = 19990101
DC = 1
UI = D006493
|
List 5.21.2. Creating a persistent database object from the MESH flat-file.
#!/usr/bin/perl
use Fcntl;
use SDBM_File;
tie %item, "SDBM_File", 'mesh', O_RDWR|O_CREAT|O_EXCL, 0644;
untie %item;   #these two lines simply create a file
open (TEXT, "d2002.bin")||die"Can't open file";
$/ = "*NEWRECORD";   #read one MESH record at a time
$line = " ";
while ($line ne "")
{
    tie %item, "SDBM_File", 'mesh', O_RDWR, 0644;   #use the created file
    $line = <TEXT>;
    @linearray = split(/\n/,$line);
    foreach $piece (@linearray)
    {
        if ($piece =~ /MN = /)
        {
            $meshno = $';   #the text after the match: the MESH number
        }
        if ($piece =~ /ENTRY = /)
        {
            $entry = $';
            if ($entry =~ /\|/o)
            {
                $entry = $`;   #keep only the text before the first "|"
            }
            $entry =~ s/s\b//g;   #crude removal of plural endings
            $entry = lc($entry);
            push (@synonyms, $entry);
        }
    }
    foreach $term (@synonyms)
    {
        $item{$term} = $meshno;
    }
    undef $meshno;
    undef @synonyms;
    untie %item;
}
undef(%item);
close TEXT;
exit;
|
List 5.22.1. Retrieving a persistent database object from the MESH flat-file.
1 #!/usr/bin/perl
2 use Fcntl;
3 use SDBM_File;
4 tie%item, "SDBM_File", 'mesh', O_RDWR, 0644;
5 while(($key, $value) = each (%item))
6 {
7 print "$key => $value\n";
8 }
9 untie%item;
10 exit;
|
List 5.23.1. Syntax rules for valid XML tags.
1 XML tags, like Perl variables, are case-sensitive
("Name" is different from "name").
Parsers must preserve character case
2 Letters, underscores, hyphens, periods and numbers may be
used in a tag
3 Only letters and underscores are eligible as the first
character
4 Colons are allowed, but only as part of a declared namespace
prefix. For all practical purposes, this means that only one
colon is allowed in a tag, and the colon must appear in an
internal location in the tag (not at the beginning or the
end of a tag).
|
List 5.23.2. Tagcheck.pl, a program that validates XML tags.
1 #!/usr/bin/perl
2 @elements = qw (gene 4gene gene:ncbi gene-autry ge::ne
gene&autry -gene _gene gene- gene:
:gene ge:n:e ge:ne: ge,ne ge.ne);
3 foreach $value (@elements)
4 {
5 if ($value =~ /^[a-z\_][a-z0-9\-\.\_]*[\:]?[a-z0-9\-\.\_]*$/i)
6 {
7 print "$value is good\n";
8 }
9 else
10 {
11 print "$value is bad\n";
12 }
13 }
14 exit;
|
List 5.23.3. Output of tagcheck.pl.
1 c:\ftp>perl tagcheck.pl
2 gene is good
3 4gene is bad
4 gene:ncbi is good
5 gene-autry is good
6 ge::ne is bad
7 gene&autry is bad
8 -gene is bad
9 _gene is good
10 gene- is good
11 gene: is good
12 :gene is bad
13 ge:n:e is bad
14 ge:ne: is bad
15 ge,ne is bad
16 ge.ne is good
|
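The pattern in tagcheck.pl reports "gene:" as good, although rule 4 of List 5.23.1 asks for the colon to sit in an internal position. A minimal sketch of a stricter check follows. The subroutine name tag_is_valid is hypothetical, and the pattern covers only the ASCII rules given in List 5.23.1, not the full XML name grammar.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of tagcheck.pl: returns 1 when a tag obeys
# the rules of List 5.23.1, with at most one colon, placed internally.
sub tag_is_valid {
    my ($tag) = @_;
    # prefix, then optionally a colon followed by a non-empty local part
    return $tag =~ /^[a-z_][a-z0-9._-]*(?::[a-z_][a-z0-9._-]*)?$/i ? 1 : 0;
}

foreach my $tag (qw(gene gene:ncbi gene: :gene ge:n:e)) {
    print "$tag is ", (tag_is_valid($tag) ? "good" : "bad"), "\n";
}
```

Requiring a letter or underscore immediately after the colon is a design choice of this sketch; it applies the first-character rule to the part of the tag that follows the namespace prefix.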
List 5.24.1. What we have learned so far.
1 The =~ operator tells Perl to look for the pattern that
follows the operator in the variable that precedes the
operator. Regular Expressions are Perl's way of describing a
pattern
2 You can create most of your patterns by following a few
simple rules and by "borrowing" regular
expressions from published listings
3 The most common usage for regular expressions is in scripts
that examine a line (or all the lines) from a file and that
perform a substitution or rearrangement or other operation
on the line, based on the results of the pattern match
4 Regular expressions are a powerful and fast tool for
modifying text or data records or finding exactly what you
want in any text
5 Perl associative arrays can be tied to an external database
object that persists even when the Perl script has finished
executing.
|
List 6.1.1. Some biomedical informatics tasks that can be accomplished with Perl.
1 Statistics
2 Mathematical Computations
3 Mathematical modeling
4 Web protocols (e.g., http and ftp)
5 Cryptographic techniques
6 Integrating data
7 Glue functions (e.g., calling subroutines written in C)
8 Digital Signal Processing (including Image analysis)
9 Bioinformatics methods (e.g. interfacing to Blast)
10 Database interfaces
11 Remote procedure calls and distributed computing
12 Middleware (see Glossary)
13 Software agents (via web services, GRID, SOAP (see
Glossary), or related protocols)
14 Transformations to and from XML
15 XML data queries
16 Logical annotation of data (e.g., RDF)
|
List 6.2.1. Creating an MD_5 one-way hash value for any provided string.
1 #!/usr/local/bin/perl
2 use MD5;
3 print "What words would you like to digest?\n";
4 $holdstring = <STDIN>;
5 chomp;
6 $hexhashstring = MD5->hexhash($holdstring);
7 print "md_5 hexhash => $hexhashstring\n";
8 exit;
|
List 6.2.2. Three executions of the MD_5 algorithm.
1 Execution 1:
2 c:\ftp>perl md5_word.pl
3 What words would you like to digest?
4 Jules Berman
5 md_5 hexhash => 0ab7ad79962fd2ea036cc8dbaade6f2a
|
List 6.2.3. Creating an MD_5 one-way hash for a file.
1 #!/usr/local/bin/perl
2 use MD5;
3 print "What file would you like to digest?\n";
4 $holdfile = <STDIN>;
5 chomp($holdfile); #removes the trailing newline from the filename
6 open (TEXT, $holdfile)||die"Can't open file";
7 $context = new MD5;
8 $context->addfile(TEXT);
9 $digest = $context->digest();
10 print (unpack ("H*", $digest));
11 exit;
|
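The MD5 module used in the two listings above has been superseded on CPAN by Digest::MD5, which is bundled with standard Perl distributions. A minimal sketch of the same two tasks with that module follows; the subroutine names digest_string and digest_file are assumptions made for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# One-way hash of a string, returned as 32 hexadecimal characters.
sub digest_string {
    my ($string) = @_;
    return md5_hex($string);
}

# One-way hash of a file's contents, read in binary mode so that
# line-ending translation cannot change the digest.
sub digest_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $hex;
}

print "md_5 hexhash => ", digest_string("abc"), "\n";
```

Note that a string read from the keyboard keeps its trailing newline unless it is chomped, and the newline changes the hash value.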
List 6.3.1. Simple Perl script for computing the mean from an array of numbers.
1 #!/usr/bin/perl
2 #mean.pl
3 #computes the mean of an array of numbers
4 @numbersarray = (1,2,3,4,5,6,7,8,9,10);
5 $arraysize = scalar(@numbersarray);
6 print "The number of elements in our array is $arraysize\n";
7 $sum = 0;
8 foreach $value(@numbersarray)
9 {
10 $sum = $sum + $value;
11 }
12 $mean = $sum / $arraysize;
13 print "Your population number is $arraysize\n";
14 print "The array mean is $mean\n";
15 exit;
|
List 6.3.2. General method of building an array that can be used in a statistical or mathematical Perl routine.
1 Open the file containing your records
2 Go through the file, one line (record) at a time
3 From a complex record, pick out the number you want using
Regex
4 Add that number to your array variable (using the Perl push
command)
5 Calculate the mean (or any other statistical test) on the
array variable.
|
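The five steps above can be sketched as follows. The record layout (free text containing "weight=" followed by a number) and the subroutine name mean_from_records are assumptions chosen for illustration; a real record would need its own regular expression.

```perl
use strict;
use warnings;

# Reads records from an open filehandle, extracts one number per record
# with a regular expression, and returns the mean of the numbers found.
sub mean_from_records {
    my ($fh) = @_;
    my @numbers;
    while (my $line = <$fh>) {
        # hypothetical record format: ...|weight=70.5|...
        if ($line =~ /weight=(-?\d+(?:\.\d+)?)/) {
            push(@numbers, $1);
        }
    }
    return undef unless @numbers;   # no usable records
    my $sum = 0;
    $sum += $_ foreach @numbers;
    return $sum / scalar(@numbers);
}

# Illustrative records held in a string so the sketch runs without an input file.
my $records = "id=1|weight=70.5|sex=F\nid=2|no weight recorded\nid=3|weight=80|sex=M\n";
open(my $fh, '<', \$records) or die "Cannot open in-memory file: $!";
print "mean weight: ", mean_from_records($fh), "\n";
close($fh);
```

Records without a matching number are skipped rather than counted as zero, so they do not distort the mean.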
List 6.3.3. Computing the mean of an array entered at keyboard.
1 #!/usr/bin/perl
2 #mean2.pl
3 #computes the mean of an array of numbers entered at keyboard
4 print "Type a bunch of numbers, pressing the return key\n";
5 print "after each number. Decimal numbers are allowed\n\n";
6 $number = " ";
7 until ($number eq "")
8 {
9 $number = <STDIN>;
10 $number =~ s/\n//o; #deletes the newline character
11 if ($number eq "")
12 {
13 next;
14 }
15 if ($number !~ /[0-9]+/) #the entry must contain at least one digit
16 {
17 print "You're only allowed to enter numbers...";
18 print " We just won't count this entry\n";
19 next;
20 } [ if ($number !~ /^[0-9
|
List 6.3.4. Output of mean2.pl.
1 C:\ftp>perl mean2.pl
2 Type a bunch of numbers, pressing the return key after each
number. Decimal numbers are allowed
|
List 6.4.1. Some of the available Perl statistics modules.
1 Statistics-Basic
2 Statistics-ChisqIndep
3 Statistics-ChiSquare
4 Statistics-Contingency
5 Statistics-ConwayLife
6 Statistics-DEA
7 Statistics-DependantTTest
8 Statistics-Descriptive
9 Statistics-Descriptive-Discrete
10 Statistics-Distributions
11 Statistics-Frequency
12 Statistics-GammaDistribution
13 Statistics-KruskalWallis
14 Statistics-LineFit
15 Statistics-Lite
16 Statistics-LogRank
17 Statistics-LSNoHistory
18 Statistics-LTU
19 Statistics-OLS
20 Statistics-RankCorrelation
21 Statistics-RankOrder
22 Statistics-Regression
23 Statistics-ROC
24 Statistics-SerialCorrelation
25 Statistics-Shannon
26 Statistics-Simpson
27 Statistics-Table-F
28 Statistics-Test
29 Statistics-TTest
Before you can use these tests, you must download the
appropriate module into your Perl installation. A sample
installation of Statistics-Descriptive (by Colin Kuskie,
Andrea Spinelli and Jason Kastner) through the ActiveState
package manager is shown here. Only the first line is input:
ppm> install statistics-descriptive
====================
Install 'statistics-descriptive' version 2.6 in ActivePerl 5.8.7.815.
====================
Downloaded 10294 bytes.
Extracting 5/5: blib/arch/auto/Statistics/Descriptive/.exists
Installing C:\activepl\html\site\lib\Statistics\Descriptive.html
Installing C:\activepl\site\lib\Statistics\Descriptive.pm
Successfully installed statistics-descriptive version 2.6 in
ActivePerl 5.8.7.815.
|
List 6.4.2. Perl script for calculating variance.
1 #!/usr/local/bin/perl
2 use Statistics::Descriptive;
3 $stat = Statistics::Descriptive::Full->new();
4 $stat->add_data(1,2,3,4,5,6,7,8,9,10);
5 $mean = $stat->mean();
6 $var = $stat->variance();
7 print "mean $mean\nvariance $var\n";
8 exit;
|
List 6.4.3. Output of statistics script.
1 c:\ftp>perl stat.pl
2 mean 5.5
3 variance 9.16666666666667
|
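The variance reported above is the sample variance: the sum of squared deviations from the mean, divided by n - 1. A minimal sketch in core Perl, useful when the Statistics-Descriptive module is not installed, is shown here. The subroutine name sample_variance is an assumption for illustration.

```perl
use strict;
use warnings;

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
sub sample_variance {
    my @data = @_;
    my $n = scalar(@data);
    die "need at least two values" if $n < 2;
    my $sum = 0;
    $sum += $_ foreach @data;
    my $mean = $sum / $n;
    my $squares = 0;
    $squares += ($_ - $mean) ** 2 foreach @data;
    return $squares / ($n - 1);
}

printf "variance %.4f\n", sample_variance(1 .. 10);
```

For the numbers 1 through 10 the mean is 5.5 and the squared deviations sum to 82.5, so the result is 82.5 / 9, which agrees with the module's output.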
List 6.4.4. Perl script for computing the ChiSquare statistic.
1 #!/usr/bin/perl
2 use Statistics::ChiSquare;
3 print chisquare([1, 9, 1, 15, 4, 7]), "\n";
4 print chisquare([20, 20, 20, 30, 20, 20, 30 ]), "\n";
5 exit;
|
List 6.4.5. Output of chi.pl.
1 C:\ftp>perl chi.pl
2 There's a <1% chance that this data is random
3 There's a >50% chance, and a <70% chance, that this
data is random.
|
List 6.5.1. Types of statistical errors.
1 Type 1. Rejecting the null hypothesis when the null
hypothesis is correct (i.e., seeing an effect when there was
none)
2 Type 2. Accepting the null hypothesis when the null
hypothesis is false (i.e., seeing no effect when there was
one)
3 Type 3. Rejecting the null hypothesis correctly, but for the
wrong reason, leading to an erroneous interpretation of the
data in favor of an incorrect affirmative statement
4 Type 4. Erroneous conclusion based on performing the wrong
statistical test
The type 4 error is the most embarrassing and the least
excusable. You cannot blame a type 4 error on the data. It's
all on you. Considering the rich variety of exotic
statistical tests available to the novice, the opportunities
for type 4 errors are endless. One way of avoiding type 4
errors is to have a dedicated statistician analyze your
data. For those informaticians who have access to the
services of a trustworthy statistician, this may actually be
the best and most practical solution. There is, however, an
alternative approach: resampling. Resampling is a type of
statistical analysis that uses computers to model
experiments and then repeats the experiments thousands or
millions of times to determine the occurrence frequencies
for particular sets of data. This area of statistics was
popularized by Bradley Efron, and may have particular
interest for readers of this book (see List).
|
List. Reasons why resampling statistics are of interest to biomedical informaticians.
1 Does not require any knowledge of statistical tests
2 Applicable to a wide range of problems, including clinical
trial design and decision analyses
3 Easy to understand
4 Easy to program with Perl
|
List 6.6.1. Randtest.pl, a Perl script that simulates 600,000 casts of the die.
1 #!/usr/bin/perl
2 #randtest.pl
3 #Simulation of a throw of a die
4 $count = 0;
5 while ($count < 600000)
6 {
7 $count++;
8 $one_of_six = (int(rand(6))+1);
9 $hash{$one_of_six}++;
10 }
11 while(($key, $value) = each (%hash))
12 {
13 print "$key => $value\n";
14 }
15 exit;
|
List 6.6.2. Output of first test of randtest.pl.
1 C:\ftp>perl randtest.pl
2 1 => 100002
3 2 => 99902
4 3 => 99997
5 4 => 100103
6 5 => 99926
7 6 => 100070
|
List 6.6.3. Output of second test of randtest.pl.
1 C:\ftp>perl randtest.pl
2 1 => 100766
3 2 => 99515
4 3 => 100157
5 4 => 99570
6 5 => 100092
7 6 => 99900
|
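One way to judge whether counts like these are consistent with a fair die is the chi-square statistic against equal expected counts. The sketch below computes it directly for both runs. The subroutine name chi_square_uniform is an assumption; the critical value quoted in the comment (about 11.07 for 5 degrees of freedom at the 5% level) comes from standard chi-square tables.

```perl
use strict;
use warnings;

# Chi-square statistic for observed counts against a uniform expectation:
# the sum over categories of (observed - expected)^2 / expected.
sub chi_square_uniform {
    my @counts = @_;
    my $total = 0;
    $total += $_ foreach @counts;
    my $expected = $total / scalar(@counts);
    my $chi2 = 0;
    $chi2 += ($_ - $expected) ** 2 / $expected foreach @counts;
    return $chi2;
}

my @first_run  = (100002, 99902, 99997, 100103, 99926, 100070);
my @second_run = (100766, 99515, 100157, 99570, 100092, 99900);

# Values below about 11.07 (5 degrees of freedom, 5% level) give no
# evidence against a fair die.
printf "first run:  %.4f\n", chi_square_uniform(@first_run);
printf "second run: %.4f\n", chi_square_uniform(@second_run);
```

Both runs stay under that threshold, although the second comes much closer to it than the first, which illustrates how much ordinary chance variation separates two runs of the same script.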
List 6.6.4. Ranfile.pl, a Perl script that assigns random names to newly created files.
1 #!/usr/bin/perl
2 #ranfile.pl
3 #Makes 10 randomly named files, with 8 leading characters
4 #a period and three trailing characters
5 while ($count < 10)
6 {
7 $count++;
8 &ranfile;
9 }
10 sub ranfile
11 {
12 my @listchar;
13 my $count;
14 for ($count = 1; $count <= 12; $count++)
15 {
16 push(@listchar, chr(int(rand(26))+65));
17 }
18 $listchar[8]= ".";
19 my $randomfilename = join("",@listchar);
20 print "Your filename is $randomfilename\n";
21 return $randomfilename;
22 }
23 exit;
|
List 6.6.5. Output of ranfile.pl.
1 C:\ftp>perl ranfile.pl
2 Your filename is EKDUFKBR.YNX
3 Your filename is QVDKUVBY.QUI
4 Your filename is FNZXNKEE.MLV
5 Your filename is NRTXEHQI.VFX
6 Your filename is GWMOLKMX.AYU
7 Your filename is LZAKZQDW.RYR
8 Your filename is PRUAONQQ.OSJ
9 Your filename is XDEDHLKD.GAY
10 Your filename is RUSLNSXI.XVR
11 Your filename is IEPGAWDP.LEH
|
List 6.7.1. Ai.pl, a Perl script that simulates clonal tumor growth.
1 #!/usr/bin/perl
2 #ai.pl
3 #Simulates the growth of a tumor from a single cell, with
4 #a cell death probability per generation as provided by the user
5 print "Enter the death probability for your simulation\n";
6 print "Number must be between zero and one.\n";
7 print "Most realistic numbers are .45 to .50\n";
8 $value = <STDIN>;
9 $value =~ s/\n//o;
10 if ($value > 1)
{
11 print "Exiting... you must pick a number between zero and one\n";
12 exit;
13 }
14 print "THE CELL DEATH PROBABILITY FOR THIS SIMULATION IS $value\n\n";
15 my $roundnumber = 1; #initiate the generation counter
16 &cycle;
|
List 6.7.2. Output of ai.pl.
1 C:\ftp>perl ai.pl
2 Enter the death probability for your simulation
Number must be between zero and one.
Most realistic numbers are .45 to .50
.46
THE CELL DEATH PROBABILITY FOR THIS SIMULATION IS .46
Starting with a single malignant cell, let's watch the
clonal growth. Tumor terminated...good!
3 Starting with a single malignant cell, let's watch the
clonal growth. 1 Tumor terminated...good!
4 Starting with a single malignant cell, let's watch the
clonal growth. 2 1 4 2 1 1 1 Tumor terminated...good!
5 Starting with a single malignant cell, let's watch the
clonal growth. 2 1 5 6 8 8 8 12 15 18 18 20 19 27 32 30 31
20 14 16 23 30 30 36 38 34 50 52 67 75 97 114 133 143 150
156 159 178 200 254 302 292 329 336 382 441 489 603 630 701
770 862 923 1056 1084 1210 1369 1473 1664 1776 1959 2196
2475 2862 3098 3327 3740 4095 4634 Bad news. Let's stop
watching this malignancy
6 Starting with a single malignant cell, let's watch the
clonal growth. Tumor terminated...good!
7 Starting with a single malignant cell, let's watch the
clonal growth. Tumor terminated...good!
8 Starting with a single malignant cell, let's watch the
clonal growth. 3 1 3 5 3 1 1 Tumor terminated...good!
9 Starting with a single malignant cell, let's watch the
clonal growth. 4 6 5 6 3 3 1 Tumor terminated...good!
10 Starting with a single malignant cell, let's watch the
clonal growth. 2 2 5 3 2 2 6 5 6 5 4 2 1 Tumor
terminated...good!
11 Starting with a single malignant cell, let's watch the
clonal growth. 3 5 3 7 6 3 3 1 1 Tumor terminated...good!
I've seen enough!
|
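The behavior in the output above (most clones die out early, while an occasional clone escapes and grows) can be reproduced with a very small branching model. The sketch below is not the ai.pl algorithm, whose cycle subroutine is not shown here. It assumes that in each generation every cell either dies with the given probability or divides into two daughter cells. The subroutine name simulate_clone and its parameters are hypothetical.

```perl
use strict;
use warnings;

# Assumed model (not taken from ai.pl): in each generation every cell either
# dies with probability $death_prob or divides into two daughter cells.
# Returns the population size after each generation.
sub simulate_clone {
    my ($death_prob, $max_cells, $max_generations) = @_;
    my $cells = 1;
    my @history;
    for my $generation (1 .. $max_generations) {
        my $next = 0;
        for (1 .. $cells) {
            $next += 2 if rand() >= $death_prob;   # survivor divides
        }
        $cells = $next;
        push(@history, $cells);
        last if $cells == 0 || $cells > $max_cells;
    }
    return @history;
}

my @sizes = simulate_clone(0.46, 4000, 200);
my $verdict = $sizes[-1] == 0 ? "Tumor terminated" : "Tumor still growing";
print "@sizes\n$verdict\n";
```

Under this model a death probability just below 0.5 gives each cell slightly more than one surviving descendant on average, so a clone that survives its first few generations tends to keep growing.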
List 6.7.3. Perl snippet showing the algorithm that repeatedly assigns probabilistic outcomes to an event.
1 while ($i < $sum +1)
2 {
3 $i++;
4 $randnum = int( |
List 6.8.1. Run.pl, a resampling script in Perl, that simulates runs of errors.
1 #!/usr/local/bin/perl
2 $errorno = 0;
3 while ($count < 100000)
4 {
5 $count++;
6 $x = rand(100);
7 if ($x < 2)
8 #simulates a 2% error rate
9 {
10 $errorno++;
11 }
12 else
13 {
14 $errorno = 0;
15 }
16 if ($errorno == 3)
17 {
18 print "Uh oh. 3 consecutive errors\n";
19 $errorno = 0;
20 }
21 }
22 exit;
The Perl script simulates 100,000 diagnoses, which is a fair
estimate of the total number of diagnoses a pathologist
might render in their entire career (at 4,000 diagnoses per
year over 25 years of service). Each diagnosis is assigned a
random number between 0 and 100. The "diagnosis" loop is
repeated 100,000 times. In each loop, if the randomly
assigned number is less than 2, the pathologist's error
count is incremented by 1. If the next diagnosis is randomly
assigned a number of 2 or greater, the error count is reset
to 0 (i.e., the diagnosis is correct and the run of errors
is broken). If an error occurs on 3 consecutive occasions,
the event is printed to the computer monitor (see List).
|
List. Output of run.pl.
1 c:\ftp>perl run.pl
2 Uh oh. 3 consecutive errors
3 Uh oh. 3 consecutive errors
|
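A single execution of run.pl describes one simulated career. Wrapping the loop in a subroutine makes it easy to repeat the simulation and average the results, which is the essence of resampling. The sketch below is a hypothetical generalization; the subroutine name count_error_runs and its parameters are not part of the original script. As a rough check, three consecutive errors at a 2% error rate have probability 0.02 x 0.02 x 0.02 = 0.000008, so on the order of one run per 100,000 diagnoses is to be expected.

```perl
use strict;
use warnings;

# Hypothetical generalization of run.pl: counts runs of $run_length
# consecutive errors among $n diagnoses made with error rate $rate.
sub count_error_runs {
    my ($n, $rate, $run_length) = @_;
    my ($errors, $runs) = (0, 0);
    for (1 .. $n) {
        if (rand() < $rate) {
            $errors++;          # another error extends the current run
        }
        else {
            $errors = 0;        # a correct diagnosis breaks the run
        }
        if ($errors == $run_length) {
            $runs++;
            $errors = 0;        # start counting a fresh run
        }
    }
    return $runs;
}

my $trials = 20;                # number of simulated careers
my $total = 0;
$total += count_error_runs(100_000, 0.02, 3) for 1 .. $trials;
printf "average runs per simulated career: %.2f\n", $total / $trials;
```

Averaging over many simulated careers gives a steadier estimate than the two events printed by any one run.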
List 6.9.1. Snippet of Perl code to determine unbiased random selection.
1 open
(HOLD,">holder.txt")||die"cannot";
2 while ($n < 1000000)
3 {
4 $x = |
List 6.10.1. Output of montesw.pl. 1 C:\ftp>perl montesw.pl 2 6598 3 C:\ftp>perl monteno.pl 4 3408 |
List 6.11.1. Ceil.pl, calling a POSIX function from a Perl script.
1 #!/usr/local/bin/perl
2 use POSIX qw(ceil floor);
3 $num = 11.3;
4 print "Floor is ", floor($num),
"\n";
5 print "Ceil is ", ceil($num), "\n";
6 exit;
|
List 6.11.2. Output of ceil.pl. 1 c:\ftp>perl ceil.pl 2 Floor is 11 3 Ceil is 12 |
List 6.12.1. Using the ActiveState Programmer's Package Manager. 1 c:\ftp>ppm 2 ppm - programmer's package manager version 3.3 3 copyright (c) 2001 activestate corp. all rights reserved 4 activestate is a division of sophos. |
List 6.12.2. Simple example script for the Fast Fourier Transform Module.
1 #!/usr/local/bin/perl
use Math::FFT;
2 my $PI = 3.1415926536;
3 my $N = 16; #N can be any power of 2, such as 4,8,16,64
4 $series = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]; #could be anything
5 print "series " . join(" ",@$series). "\n";
6 my $fft = new Math::FFT($series);
7 my $coeff = $fft->rdft();
8 print "coefficients \n @{$coeff}\n\n";
9 my $spectrum = $fft->spctrm;
10 print "spectrum \n @{$spectrum}\n";
11 exit;
|
List 6.12.3. Output of Fast Fourier Transform script 1 C:\FTP>perl fft.pl 2 series 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
List 6.13.1. A full concordance in 10 commands.
1 #!/usr/local/bin/perl
2 open (TEXT, "1DFRE10.TXT");
3 open (OUT, ">1DFRE10.OUT");
4 $line = " ";
5 while ($line ne "")
6 {
7 $cumline = "";
8 |
List 6.13.2. A full indexing script in 10 commands.
1 #!/usr/local/bin/perl
2 open (TEXT, "1DFRE10.TXT")||die"cannot";
3 open (OUT, ">1DFRE10.OUT")||die"cannot";
4 $line = " ";
5 @indextermarray = ("gaul","roman empire","emperor","village","england");
6 while ($line ne "")
7 {
8 $cumline = "";
9 |
List 6.13.3. An excerpted output for the indexing program, listing only the terms "england" and "village" and the pages on which they are found. 1 c:\ftp>perl indexer.pl 2 3 4 5 england = 5 14 23 134 207 208 229 277 6 7 8 9 village = 43 77 81 94 128 141 147 184 185 225 226 244 |
List 6.13.4. Problems with human-based indexing.
1 Incredibly labor-intensive and time consuming
2 The index cannot be built until the book is in final form
and the page numbers are known, delaying the publication of
the book until the indexing is completed
3 If important phrases are omitted completely or if one or
more of their locations are omitted, no one will likely
catch the error
4 The indexing effort needs to be repeated if there are book
revisions and pagination changes.
|
List 6.13.5. Extracting candidate index phrases from a test file.
1 #!/usr/local/bin/perl
@stop = qw(
2 a about absent absence again all almost also although always
among an
3 and another any are as at be because been before being
between both
4 but by can could cm did do does done due during each either
enough
5 especially etc for found from further had has have having
here how
6 however i if in into is it its itself just kg km made mainly
make may
7 mg might ml mm most mostly must nearly neither no nor
observed
8 obtained often on our overall perhaps present presence quite
rather
9 really regarding seem seen several should show showed shown
shows
10 significantly since so some such than that the their theirs
them then
11 there therefore these they this those through thus to upon
use used
12 using various very was we were what when which while with
within
13 without would or can't doesn't not
14 );
15 open (TEXT,"1DFRE10.TXT")||die"Cannot";
16 open (OUT,">1DFRE10.OUT")||die"Cannot";
17 undef($/);
18 $phrase = <TEXT>;
19 $phrase =~ s/\n/ /g;
20 $phrase = lc($phrase);
21 $phrase =~ s/[^a-z \']/ /g;
22 foreach $stopword (@stop)
23 {
24 $phrase =~ s/ $stopword / \# /g;
25 }
26 $phrase =~ s/[\s]+/ /g;
27 $phrase =~ s/ ?\# ?/\#/g;
28 @phraselist = sort (split("#",$phrase));
29 @phraselist = grep
30 {$i{$_}++;(($i{$_}==2)&&(scalar(split(" ",$_))>1));}@phraselist;
31 print OUT join("\n",@phraselist);
exit;
|
List. First 9 lines of output from phrase.pl script.
1 abate fortis
2 abbe foucher
3 abdication of diocletian
4 abilities of
5 able leader
6 abolition of
7 absolute power
8 abuse of
9 academy of inscriptions
|
List 6.14.1. Algorithm for regular expression searches of text files.
1 Asks you for a regular expression to search a file. If
you're not adept at regular expressions, just enter any
word. Remember, a word or phrase is always the simplest
regular expression. In the output example, we'll search for
the word "adenocarcinoma"
2 If you enter the return-key without entering a regular
expression, it simply exits the script
3 Asks Perl to give you the current epoch time (the number of
seconds elapsed since a fixed reference point in the past)
4 Opens an enormous publicly available file (138 Mbytes)
named MRCON (we'll learn a lot about this file in Biomedical
Perl)
5 Reads every line of MRCON (about 2 million of them),
testing each line to see if it contains a substring that
matches the regular expression that you provided (step 1)
6 If it finds a match, it adds the line number and the line
to an external file named regexout.txt
7 When it's finished reading the file, it asks Perl again
for the epoch time, and determines the script execution time
by subtracting the script's beginning time from the script's
end time
8 It prints to the monitor the time spent executing the
script, as well as the filename containing the output of all
the lines from the MRCON file that matched your provided
regular expression.
|
List 6.14.2. Perl script for regular expression searches of text files.
1 #!/usr/bin/perl
2 #perlfind.pl
3 #11/20/01
4 #this will pull out all the matching lines for a prompted
5 #regular expression from any text file. This short script is incredibly
6 #powerful, but it requires the user to have facility creating
7 #regular expressions
8 open (OUT, ">regexout.txt")||die"Can't open file regexout.txt";
9 $filename = "regexout\.txt";
10 print "What's your search regex?\n";
11 $regex = <STDIN>;
12 $regex =~ s/\n//o;
13 if ($regex eq "")
14 {
15 close OUT;
16 print "\nYou didn't give a regex...Goodbye\n";
17 exit;
18 }
19 #$re = qr/$regex/oi;
20 $start = time();
21 &searchsub;
22 $end = |
List 6.14.3. Output of regular expression search.
1 C:\ftp>perl perlfind.pl
2 What's your search regex?
3 adenocarcinoma
4 Retrieval time is 5 seconds
5 Your search results are in file regexout.txt.
|
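The search-and-time steps of List 6.14.1 can be condensed into a single subroutine, sketched below. This is not the remainder of perlfind.pl; the subroutine name search_file, its parameters, and the choice of case-insensitive matching are assumptions made for illustration.

```perl
use strict;
use warnings;

# Writes every line of $infile that matches $pattern, preceded by its line
# number, to $outfile. Returns the number of matching lines.
sub search_file {
    my ($pattern, $infile, $outfile) = @_;
    my $re = qr/$pattern/i;     # compiled once; case-insensitive by assumption
    open(my $in,  '<', $infile)  or die "Cannot open $infile: $!";
    open(my $out, '>', $outfile) or die "Cannot open $outfile: $!";
    my $matches = 0;
    while (my $line = <$in>) {
        if ($line =~ $re) {
            print {$out} "$. $line";   # $. holds the current line number
            $matches++;
        }
    }
    close($in);
    close($out);
    return $matches;
}

# Usage: perl thisscript.pl <regex> <input file> <output file>
if (@ARGV == 3) {
    my $start = time();
    my $count = search_file(@ARGV);
    my $elapsed = time() - $start;
    print "Found $count matching lines in $elapsed seconds\n";
    print "Your search results are in file $ARGV[2]\n";
}
```

Compiling the pattern once with qr, outside the loop, avoids recompiling it for each of the roughly 2 million lines.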
List 6.15.1. A short script that performs a binary search on a file.
1 #!/usr/local/bin/perl
2 open (TEXT, "find_bin.txt");
3 seek(TEXT, 0, 2);
4 print "What word would you like to find?\n";
5 $findword = <STDIN>;
6 $findword =~ s/\n$//o;
7 $filesize = tell (TEXT);
8 |
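A binary search on a file works by seeking to the middle of the remaining byte range, reading the next complete line, and discarding the half of the range that cannot contain the target. The sketch below is one way to do this and is not the remainder of the listing above. It assumes a file with one word per line, sorted in the byte order used by Perl's string comparison operators. The names find_in_sorted_file and line_after are hypothetical.

```perl
use strict;
use warnings;

# Returns the first complete line that starts after byte offset $pos
# (or the first line of the file when $pos is 0), without its newline.
sub line_after {
    my ($fh, $pos) = @_;
    seek($fh, $pos, 0);
    if ($pos > 0) {
        my $discard = <$fh>;    # skip the rest of the line we landed in
    }
    my $line = <$fh>;
    chomp($line) if defined $line;
    return $line;
}

# Returns 1 if $word is a line of the sorted file $path, otherwise 0.
sub find_in_sorted_file {
    my ($path, $word) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my ($low, $high) = (0, (-s $path) || 0);
    while ($low < $high) {
        my $mid = int(($low + $high) / 2);
        my $line = line_after($fh, $mid);
        if (!defined($line) || $line ge $word) {
            $high = $mid;       # target, if present, is at or before $mid
        }
        else {
            $low = $mid + 1;    # target, if present, lies after $mid
        }
    }
    my $line = line_after($fh, $low);
    close($fh);
    return (defined($line) && $line eq $word) ? 1 : 0;
}
```

Each pass halves the byte range, so even a file of several million lines needs only a few dozen reads. The file must be sorted in the same order that the comparison operator uses, or the search will miss entries.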
List 6.16.1. Cluster.pl, a Perl script demonstrating clustering algorithm. 1 #!/usr/local/bin/perl 2 use Algorithm::Cluster; |
List 6.16.2. Output of cluster.pl script. 1 c:\ftp>perl cluster.pl 2 Row0 => Cluster 1 3 Row1 => Cluster 2 4 Row2 => Cluster 2 5 Row3 => Cluster 1 6 Row4 => Cluster 0 7 Row5 => Cluster 0 |
List 6.17.1. Example of a very simple program using the LWP (Library for WWW in Perl).
1 #!/usr/bin/perl
2 use LWP::Simple;
3 print (get "http://www.nih.gov");
4 exit;
|
List 6.18.1. Some Perl books in bioinformatics (a very different field from biomedical informatics)
1 Beginning Perl for Bioinformatics, by James Tisdall
2 Mastering Perl for Bioinformatics, by James Tisdall
3 Genomic Perl: From Bioinformatics Basics to Working Code by
Rex A. Dwyer
4 Perl Programming for Biologists, by D. Curtis Jamison
5 Developing Bioinformatics Computer Skills by Per Jambeck and
Cynthia Gibas
6 Bioinformatics, Biocomputing and Perl: An Introduction to
Bioinformatics Computing Skills, by Michael Moorhouse and
Paul Barry
|
List 6.18.2. A simple DNA palindrome, GAATTC. |
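A DNA palindrome reads the same on both strands: reversing the sequence and replacing each base with its complement (A with T, C with G) returns the original sequence. A minimal sketch of that check follows; the subroutine names reverse_complement and is_dna_palindrome are assumptions for illustration.

```perl
use strict;
use warnings;

# Reverses the sequence, then swaps each base for its complement.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse($dna);
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# A sequence is a DNA palindrome when it equals its own reverse complement.
sub is_dna_palindrome {
    my ($dna) = @_;
    return uc($dna) eq uc(reverse_complement($dna)) ? 1 : 0;
}

foreach my $sequence (qw(GAATTC GATTACA)) {
    print "$sequence ", (is_dna_palindrome($sequence) ? "is" : "is not"),
        " a palindrome\n";
}
```

For GAATTC, the reversed sequence is CTTAAG, and its complement is GAATTC again.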
List 6.18.3. Perl script for finding palindromes in a gene sequence.
1 #!/usr/bin/perl
2 $filename = "sample";
3 open (TEXT, "sample")||die"Cannot";
4 $line = " ";
5 $count = 0;
6 for $n (5..20)
7 {
8 $re = qr /[CAGT]{$n}/;
9 $regexes[$n-5]= $re;
10 }
11 NEXTLINE: while ($count < 1000)
12 {
13 $line = <TEXT> ;
14 $count++;
15 foreach my $value (@regexes)
16 {
17 $start = 0;
18 while ($line =~ /$value/g)
19 {
20 $endline = $';
21 $match = $&;
22 $revmatch = reverse($match);
23 $revmatch =~ tr/CAGT/GTCA/;
24 if ($endline =~ /^([CAGT]{0,15})($revmatch)/)
25 {
26 $start = 1;
27 $palindrome = $match . "*" . $1 . "*" . $2;
28 $palhash{$palindrome}++;
29 }
30 }
31 if ($start == 0)
32 {
33 goto NEXTLINE;
34 }
35 }
36 }
37 close TEXT;
38 while(($key, $value) = each (%palhash))
39 {
40 print "$key => $value\n";
41 }
42 exit;
|
List 6.18.4. Input of sample.pl (line-breaks omitted from original file). 1 ATGAGCGAAGAAAGCTTATTCGAGTCTTCTCCACAGAAGATGGAGTACGAAATTACAAAC 2 TACTCAGAAAGACATACAGAACTTCCAGGTCATTTCATTGGCCTCAATACAGTAGATAAA 3 4 5 6 7 AAGATCAGAAGCGACCATGACAATGCTATTGATGGATTATCTGAAGTTATCAAGATGTTA 8 TCTACCGATGATAAAGAAAAATTGTTGAAGACTTTGAAATAA |
List 6.18.5. Output of sample.pl.
1 (* separates the spacer region from the flanking palindromic
regions)
2 C:\FTP>perl sample.pl
3 CTTTG*TCAGGATGGGC*CAAAG => 1
4 AGTAT*T*ATACT => 1
5 GAAATC**GATTTC => 1
6 AGTTT*GGCATCC*AAACT => 1
7 CCTTA*CCCTGT*TAAGG => 1
8 CTTCT*GGAGATTGAGA*AGAAG => 1
9
10
11
12 GATGG*ATTCAAG*CCATC => 1
13 GTTTGG*CAT*CCAAAC => 1
14 CTTCT*CCAC*AGAAG => 1
|
List 6.21.1. Examples of software utility functions.
1 Archiving utilities
2 Calculator utility
3 Compression/decompression utilities
4 Conversion utilities - Converts files (text, images, sound,
video) to and from different formats
5 Database utilities
6 Directory searching
7 Email service
8 Encryption/decryption utilities
9 File copying utilities
10 File reading and parsing utilities
11 FTP file retrieval
12 Indexing utilities
13 Sorting utilities
14 Searching utilities
15 Telnet remote computer access
16 Text editing
17 Web retrieval utilities
|
List 6.22.1. Types of software of possible interest to the FDA.
1 Software used as a component, part, or accessory of a
medical device
2 Software that is itself a medical device (e.g., blood
establishment software)
3 Software used in the production of a device (e.g.,
programmable logic controllers in manufacturing equipment)
4 Software used in implementation of the device manufacturer's
quality system (e.g., software that records and maintains
the device history record).
|
List 6.22.2. Features of software that buyers want.
1 Easy installation
2 Simple instructions and documentation
3 Friendly graphic user interface
4 Functionality that supports the user's goals
5 Transparency (no need for user to understand the underlying
assumptions, algorithms and data structures upon which the
functionality of the software is based)
6 Compatibility with operating system and other software
residing on the user's computer
7 Good user support services
|
List 6.22.3. Features of good software (that serious biomedical informaticians need).
1 Extensibility. The functionality of the software and the
data can be modified and expanded
2 Scalability. Should work with any size of inputs
3 Standardization of all data (input and output)
4 Open source code
5 Open access data
6 Self-describing software
7 Cross-platform functionality. Software should operate in
multiple operating systems
8 Interoperability
9 Availability of updates
10 Full documentation of methods and algorithms
|
List 6.22.4. Some properties of valid software, modeled on FDA Principles of Software Validation.
1 Verified "Software verification looks for consistency,
completeness, and correctness of the software and its
supporting documentation, as it is being developed, and
provides support for a subsequent conclusion that software
is validated."
2 Tested "confirm that software development output meets
its input requirements." Includes testing at user site
3 Validated "confirmation by examination and provision of
objective evidence that software specifications conform to
user needs and intended uses, and that the particular
requirements implemented through software can be
consistently fulfilled."
4 Risk-assessed The safety risks posed by the software should
be specified
5 Requirement-Documented "The software and system
requirements must be fully documented." The
validation step requires an analysis of compliance with
documented requirements
6 Controlled development process. Bugs are often introduced
during the software development process. A controlled
development process ensures that changes in the design of
software are tracked and evaluated, documented and corrected
when necessary
7 Design review "Design reviews are documented,
comprehensive, and systematic examinations of a design to
evaluate the adequacy of the design requirements, to
evaluate the capability of the design to meet these
requirements, and to identify problems."
|
List 6.22.5. Conditions that can make software difficult to evaluate.
1 Intrinsic complexity of the software
2 Use of off-the-shelf software (see Glossary)
3 Use of external software components
4 Withheld source code and poor documentation.
|
List 7.1.1. Equivalent terms for the Concept identifier C4863000.
1 C4863000 prostate with adenoca
2 C4863000 adenoca arising in prostate
3 C4863000 adenoca involving prostate
4 C4863000 adenoca arising from prostate
5 C4863000 adenoca of prostate
6 C4863000 adenoca of the prostate
7 C4863000 prostate with adenocarcinoma
8 C4863000 adenocarcinoma arising in prostate
9 C4863000 adenocarcinoma involving prostate
10 C4863000 adenocarcinoma arising from prostate
11 C4863000 adenocarcinoma of prostate
12 C4863000 adenocarcinoma of the prostate
13 C4863000 adenocarcinoma arising in the prostate
14 C4863000 adenocarcinoma involving the prostate
15 C4863000 adenocarcinoma arising from the prostate
16 C4863000 prostate with ca
17 C4863000 ca arising in prostate
18 C4863000 ca involving prostate
19 C4863000 ca arising from prostate
20 C4863000 ca of prostate
21 C4863000 ca of the prostate
22 C4863000 prostate with cancer
23 C4863000 cancer arising in prostate
24 C4863000 cancer involving prostate
25 C4863000 cancer arising from prostate
26 C4863000 cancer of prostate
27 C4863000 cancer of the prostate
28 C4863000 cancer arising in the prostate
29 C4863000 cancer involving the prostate
30 C4863000 cancer arising from the prostate
31 C4863000 prostate with carcinoma
32 C4863000 carcinoma arising in prostate
33 C4863000 carcinoma involving prostate
34 C4863000 carcinoma arising from prostate
35 C4863000 carcinoma of prostate
36 C4863000 carcinoma of the prostate
37 C4863000 carcinoma arising in the prostate
38 C4863000 carcinoma involving the prostate
39 C4863000 carcinoma arising from the prostate
40 C4863000 prostate adenoca
41 C4863000 prostate adenocarcinoma
42 C4863000 prostate ca
43 C4863000 prostate cancer
44 C4863000 prostate carcinoma
45 C4863000 prostatic cancer
46 C4863000 prostatic carcinoma
47 C4863000 prostatic adenocarcinoma
48 C4863000 prostate gland adenocarcinoma
49 C4863000 adenocarcinoma of the prostate gland
50 C4863000 adenocarcinoma of prostate gland
51 C4863000 prostate gland carcinoma
52 C4863000 carcinoma of the prostate gland
53 C4863000 carcinoma of prostate gland
When a nomenclature collects synonymous terms under unique
concept identifiers, medical text containing any of the
terms corresponding to a single concept can be assigned the
unique concept code. When all the terms in a medical
database have been coded, they can be retrieved through a
concept search that collects all synonymous terms by their
unifying concept identifier. A medical sentence can be coded
many different ways (see List).
|
List. Example of a sentence coded using a neoplasm nomenclature.
1 Primary synovial sarcoma of the mediastinum a
clinicopathologic immunohistochemical and ultrastructural
study of 15 cases
2 term="sarcoma of the mediastinum" code="C6606000"
3 term="synovial sarcoma" code="C3400000"
4 term="synovial sarcoma of the mediastinum" code="C6618000"
5 term="primary synovial sarcoma" code="C8826000"
|
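The coding of the example sentence can be sketched as a lookup of every nomenclature term that occurs in the text. The four term/code pairs below are taken from the list above. The subroutine name code_sentence is hypothetical, and a real nomenclature would hold many thousands of terms and call for a faster matching strategy than testing each term in turn.

```perl
use strict;
use warnings;

# Term/code pairs copied from the coded-sentence example.
my %code_for = (
    "sarcoma of the mediastinum"          => "C6606000",
    "synovial sarcoma"                    => "C3400000",
    "synovial sarcoma of the mediastinum" => "C6618000",
    "primary synovial sarcoma"            => "C8826000",
);

# Returns a hash of every nomenclature term found in the sentence,
# paired with its concept code.
sub code_sentence {
    my ($sentence, $nomenclature) = @_;
    my $text = lc($sentence);
    my %found;
    foreach my $term (keys %{$nomenclature}) {
        # \Q...\E treats the term as literal text; \b keeps whole words
        $found{$term} = $nomenclature->{$term} if $text =~ /\b\Q$term\E\b/;
    }
    return %found;
}

my %hits = code_sentence(
    "Primary synovial sarcoma of the mediastinum a clinicopathologic "
    . "immunohistochemical and ultrastructural study of 15 cases",
    \%code_for
);
print qq(term="$_" code="$hits{$_}"\n) foreach sort keys %hits;
```

Because overlapping terms are all reported, the sentence receives four codes, matching the "coded many different ways" point made above.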
List 7.1.2. Synonyms for the rare tumor, nasopharyngeal carcinoma. 1 Regaud tumor 2 nasopharyngeal carcinoma 3 lymphoepithelial carcinoma 4 Schmincke tumor |
List 7.2.1. Advantages of standardized nomenclatures.
1 Permanence
2 Vetting - review and approval by experts and community
stakeholders
3 Wide use and universal recognition
4 Comprehensive over a knowledge domain that extends over many
specialized areas
5 Mapping (between standards)
6 Relationships between different domains
7 Proven utility
8 Development costs often transferred to users or to funding
agencies.
|
List 7.2.2. Advantages of small, specialized nomenclatures.
1 Rapid addition of new terms
2 Complete vocabulary for a narrow knowledge domain
3 Immediate availability of frequently updated versions of
nomenclature
4 Data model appropriate for specialized uses of
nomenclature
5 Comprehensible by experts in the field
6 Inexpensive to create, and often available free to users.
|
List 7.3.1. How names of diseases are chosen.
1 As an expression of a characteristic pathologic process
(e.g., muscular dystrophy)
2 For the physical agent that produced the disease (e.g.,
plumbism)
3 For a group of people who were at high risk for the disease
(e.g., Legionnaires' Disease, named after a group of
conventioneers who succumbed in an early outbreak)
4 For a molecule found in diseased cells (e.g., amyloidosis,
prion disease)
5 For a geographic region in which the disease occurs (e.g.,
Tangier Disease from Tangier Island, Virginia)
6 For a geographic region from which an epidemic emanated
(e.g., Lyme disease from Lyme, Connecticut)
7 For a striking clinical feature of the disease (e.g.,
sleeping sickness)
8 As a crude and insensitive comparison to an inanimate
object (e.g., gargoylism)
9 As a literary metaphor (e.g., Pickwickian syndrome, Mad
Hatter's disease)
10 For a striking morphologic feature (e.g., sickle cell
anemia)
11 For a patient who had the disease (e.g., Lou Gehrig disease)
12 For a physician or scientist who treated, described, or
researched the disease (e.g., Hodgkin disease, Cushing
disease, Kaposi sarcoma)
13 As a clueless acronym (e.g., CATCH 22: cardiac abnormality,
abnormal facies, T-cell deficit due to thymic hypoplasia,
cleft palate, hypocalcemia resulting from a deletion on
chromosome 22)
14 As a trope from any existing language (e.g., Moyamoya
disease derives from "moyamoya" meaning
"puff of smoke" in Japanese, for the
characteristic tangle of tiny cerebral vessels seen on
x-ray)
15 As an homage to Greek and Latin scholarship (e.g.,
pityriasis lichenoides et varioliformis acuta)
16 As inscrutable combinations of one or more of the above (My
personal favorite inscrutable disease name is the
wistful-sounding "Floating-Harbor syndrome". This
disease was named by combining the names of the hospital in
which one of the first cases appeared, Boston Floating
Hospital, and a second hospital in which another case
appeared, Harbor General Hospital in Torrance, California.)
|
List 7.3.2. Examples of offensive medical terms.
1 Gargoylism. The name invites comparison of a patient with a
monster
2 Mongolism and Mongoloid idiot (naming a disease after the
peoples that the doctor believes look most like the person
with the disease)
3 Monster. The name suggests that the individual is not human
4 Cretinism. The name links a patient with a pejorative term
(cretin).
|
List 7.3.3. Tasks of the ancient curator.
1 Select canonical (best) terms for concepts
2 Delete obsolete or otherwise denigrated terms
3 Prepare precise definitions for the included terms
4 Prepare revised versions of the nomenclature at intervals,
perhaps once each century
5 Prepare nomenclature in an academic language, such as Latin,
that limits access to scholars.
|
List 7.3.4. Tasks of the modern curator.
1 Add new terms to the nomenclature when they occur in the
domain literature
2 Group synonymous terms under a unique concept code
3 Determine the relationships among the different terms in the
nomenclature and provide links or ontologic classes that
express these relationships
4 Comply with standards for representing the terms in a
nomenclature that will support data integration with other
nomenclatures
5 Ensure the logical consistency of the nomenclature
6 Update and release revised versions of the nomenclature at
intervals, perhaps daily
7 Develop methods for representing variations in the manner
that terms are interpreted and used
8 Post the nomenclature to the Internet, as an Open Access
document
9 Prepare a legal "use" disclaimer
10 Develop methodology for linking concept codes to annotative
data on the internet.
|
List 7.3.5. Good curation practices for Medical Nomenclatures.
1 General characteristics relate to utility and
appropriateness in clinical applications, including that
concepts are not vague, ambiguous, or redundant; purpose and
scope are clear; coverage is in-depth, explicit, and
comprehensive; there are systematic and formal definitions
of all concepts; and the concepts are built into a reference
vocabulary
2 Structure of the vocabulary model determines the ease with
which practical and useful interfaces for term navigation,
entry, or retrieval can be supported
3 Maintenance characteristics provide the technical choices
which impact the capacity of a vocabulary to evolve, change,
and remain usable over time, including context-free
identifiers, persistence of identifiers, and version control
4 Evaluation criteria address how a vocabulary should be
evaluated, and include a clear statement of purpose and
scope, availability of tools for mapping, and usability.
|
List 7.4.1. Doublet method for finding candidate terms from text.
1 Collect all the doublets that occur in the entire
nomenclature (i.e., accumulate a list of the doublets from
every term in the nomenclature)
2 Parse text into an ordered collection of overlapping
doublets. As an example, "serous borderline ovarian
tumor" would be parsed as "serous borderline,
borderline ovarian, ovarian tumor"
3 Compare each consecutive text doublet against the array
of doublets from the nomenclature to determine whether the
doublet exists somewhere in the nomenclature
4 If the doublet from the text does not exist in the
nomenclature, it can be deleted. If it exists in the
nomenclature, it is concatenated with the following doublet
if the following doublet exists in the nomenclature.
Otherwise, it is deleted. This process continues,
concatenating doublets that exist somewhere in the
nomenclature. Extraneous leading words (the, in, of, with,
and) and trailing words (the, and, with, from, a) are
automatically deleted from the final concatenated sequence.
Final concatenated sequences of two or more consecutive
doublets that match doublets from the nomenclature are
saved as candidate terms.
|
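List 7.4.2. Snippet of Perl code for the doublet method.
@hoparray = split(/ /,$line);
my $olddoublet = "";
for ($i=0;$i<(scalar(@hoparray)-1);$i++)
  {
  # Join two consecutive words into a doublet
  $doublet = "$hoparray[$i] $hoparray[$i+1]";
  if (exists $doubhash{$doublet})
    {
    if ($englishline ne "")
      {
      # Extend the growing phrase by the doublet's second word
      $englishline = $englishline . " $hoparray[$i+1]";
      }
    else
      {
      # Start a new phrase with the whole doublet
      $englishline = $doublet;
      }
    }
  # The snippet ends here; the handling of doublets that are
  # not found in the nomenclature is not shown
  }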
|
List 7.4.3. Criteria for including a phrase as a candidate new term.
1 Candidate phrases are composed of concatenated strings of
word doublets that are contained in terms found in an
existing nomenclature
2 Candidate phrases do not already occur in the
nomenclature.
|
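The doublet method and the two inclusion criteria above can be combined into one self-contained sketch. It is an illustration under stated assumptions rather than the author's program: the sub names doublet_index and candidate_terms are hypothetical, and the two nomenclature terms in the demonstration were invented so that their doublets cover the "serous borderline ovarian tumor" example.

```perl
use strict;
use warnings;

# Collect every doublet (two consecutive words) found in any nomenclature term.
sub doublet_index {
    my (@terms) = @_;
    my %doubhash;
    for my $term (@terms) {
        my @words = split ' ', lc $term;
        $doubhash{"$words[$_] $words[$_+1]"} = 1 for 0 .. $#words - 1;
    }
    return \%doubhash;
}

# Return phrases built from runs of consecutive text doublets that all occur
# somewhere in the nomenclature, excluding phrases that are already terms.
sub candidate_terms {
    my ($text, $doubhash, @terms) = @_;
    my %known = map { lc($_) => 1 } @terms;
    (my $clean = lc $text) =~ s/[^a-z0-9 \-]/ /g;
    my @words = split ' ', $clean;

    my (@runs, @run);
    for my $i (0 .. $#words - 1) {
        if ($doubhash->{"$words[$i] $words[$i+1]"}) {
            @run = ($words[$i]) unless @run;    # start a new run
            push @run, $words[$i+1];            # extend the run by one word
        }
        else {
            push @runs, [@run] if @run;         # an unknown doublet ends the run
            @run = ();
        }
    }
    push @runs, [@run] if @run;

    my %lead  = map { $_ => 1 } qw(the in of with and);
    my %trail = map { $_ => 1 } qw(the and with from a);
    my @candidates;
    for my $r (@runs) {
        my @w = @$r;
        shift @w while @w && $lead{ $w[0] };    # drop extraneous leading words
        pop   @w while @w && $trail{ $w[-1] };  # drop extraneous trailing words
        next if @w < 3;                         # need two or more doublets
        my $phrase = join ' ', @w;
        push @candidates, $phrase unless $known{$phrase};
    }
    return @candidates;
}

# Hypothetical two-term nomenclature, invented for this demonstration.
my @nomenclature = ("serous borderline tumor", "borderline ovarian tumor");
my $index = doublet_index(@nomenclature);
print "$_\n"
    for candidate_terms("A serous borderline ovarian tumor was found",
                        $index, @nomenclature);
```

The sketch treats punctuation as whitespace, so a run can cross a comma; a stricter version would split the text into sentences or clauses first.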
List 8.2.1. Output of fuzzy spelling match. 1 c:\ftp>perl spell.pl 2 What word would you like to approximate? 3 hemocromatosis 4 Approximate matches 5 haemochromatosis 6 hemochromatoses 7 hemochromatosis |
List 8.2.2. One-letter (mostly) differences among properly spelled words. 1 arteritis <=> arthritis 2 auxiliary => axillary 3 brachial <=> branchial 4 callous <=> callus 5 chlorpromazine <=> chlorpropamide 6 chorionic <=> chronic 7 coitus <=> colitis 8 colic <=> colonic 9 costal <=> coastal 10 cygnet <=> signet 11 diploic <=> diploid 12 disc <=> disk 13 disease <=> decease 14 dyskaryosis <=> dyskeratosis 15 ectatic <=> ecstatic 16 enema <=> anemia 17 facial <=> fascial 18 facies <=> feces 19 faeces <=> facies 20 fascial <=> facial 21 fetal <=> fatal 22 firearm <=> forearm 23 hallux <=> helicis 24 helicis <=> hallux 25 herpangina <=> herpetic 26 herpetic <=> herpangina 27 hydatid <=> hydatidiform 28 hydatidiform <=> hydatid 29 ileitis <=> iliitis 30 ileum <=> ilium 31 isotope <=> isotrope 32 keratin => kerasin 33 keratinocytic => keratinolytic 34 keratosis <=> ketosis 35 lipoma <=> lymphoma 36 live <=> liver 37 lover <=> liver 38 malleolus <=> malleus 39 milia <=> milium 40 mucous <=> mucus 41 myelofibrosis <=> myofibrosis 42 oncology <=> ontology 43 osteoblastoma <=> osteoclastoma 44 paleodontology <=> paleontology 45 paleontology => paleodontology 46 palette <=> palate 47 palpation => palpitation 48 parental <=> parenteral 49 penal <=> penile 50 penicillamine <=> penicillin 51 penile <=> penal 52 perineal <=> peroneal 53 pleural <=> plural 54 porphyria <=> porphyruria 55 prostate <=> prostrate 56 protuberant <=> protruberant 57 quinidine <=> quinine 58 rachischisis <=> rachitis 59 rachischitic <=> rachitic 60 ret <=> rett 61 rosacea <=>rosea 62 semantic <=> somatic 63 silicon <=> silicone 64 taenia <=> tinea 65 thecoma <=> thekeoma 66 tinnitus <=> tinnitis 67 trichinosis <=> trichosis 68 ureteral <=> urethral 69 vagitis <=> vaginitis |
List 8.2.3. Drugs with similar names. 1 acetazolamide <=> acetohexamide 2 ambien <=> amen 3 amiodarone <=> amrinone 4 cardene sr <=> cardizem sr 5 chlorpropamide <=> chlorpromazine 6 clonidine <=> klonopin 7 clozapine <=> olanzapine 8 feldene <=> seldane 9 flomax <=> volmax 10 flutamide <=> flumadine 11 imipenem <=> omnipen 12 lodine <=> codeine 13 methadone <=> methylphenidate 14 ms contin <=> oxycontin 15 oruvail <=> clinoril 16 penicillin <=> penicillamine 17 prilosec <=> prozac 18 quinidine <=> quinine 19 retrovir <=> ritonavir 20 zocor <=> cozaar |
List 8.2.4. Common misspellings appearing in pathology reports.
1 abcess (should be abscess)
2 anastamosis (anastomosis)
3 bissected (dissections are done with bisections)
4 caricnoma (the most commonly occurring terms are commonly
misspelled)
5 casset
6 cassett (both cassette and casette are permissible)
7 debridment
8 entirley
9 formlain
10 illeocecal (one "l" please)
11 lymphnode (a lymph node is two words)
12 membraneous (the noun, "membrane," has an
adjective, "membranous")
13 mesentary
14 palmer
15 spleenic (the noun is "spleen" but the adjective
is "splenic")
16 tannish ("ish" is a popular but unnecessary
suffix)
17 uretheral (ureteral is a word and so is urethral, but
uretheral is not)
|
List 8.2.5. Permissible alternate spellings. 1 anonymization = anonymisation 2 artifact = artefact 3 cassette = casette 4 catheterisation = catheterization 5 dilatation = dilation 6 exotropia = exotrophia 7 preventative = preventive 8 sulfate = sulphate 9 sulfur = sulphur 10 sulfuric = sulphuric 11 travelling = traveling |
List 8.2.6. Dually occurring orthographic variants in UMLS that are probably not proper equivalences. 1 neurilemmoma and neurilemoma 2 sacroiliitis and sacroileitis 3 costalchondritis, costochondritis and costal chondritis 4 azoospermia and azospermia 5 Bartter's Disease and Barter's Disease 6 in situ and insitu 7 gall bladder and gallbladder. |
List 8.3.1. Disease homonyms.
1 cervical carcinoma (of neck or of uterus?)
2 medullary carcinoma (can refer to medullary carcinoma of
breast, or thyroid or of adrenal medulla)
3 Paget's disease (can refer to different diseases involving
either breast or bone)
4 Bowen's disease (can refer to different diseases in skin and
nipple)
|
List 8.17.1. Dangerous pathology abbreviations.
1 abg aortic bifurcation graft, or aortobifemoral graft
2 aha acquired hemolytic anemia, or autoimmune hemolytic
anemia
3 ascvd arteriosclerotic cardiovascular disease, or
arteriosclerotic cerebrovascular disease
4 chd congenital heart disease, or congestive heart disease,
or coronary heart disease
5 doa date of admission, or dead on arrival
6 edc estimated date of conception, or estimated date of
confinement ("due date" means almost the opposite
of "conception date")
7 hzo herpes zoster ophthalmicus, or herpes zoster oticus
8 ibd inflammatory bowel disease, or irritable bowel disease
9 lll left lower lid, or left lower lip, or left lower lobe,
or left lower lung
10 mcgn mesangiocapillary glomerulonephritis or minimal change
glomerulonephritis
11 mvr mitral valve regurgitation, or mitral valve repair, or
mitral valve replacement
12 nc no change, or noncontributory
13 nkda no known drug allergies, or nonketotic diabetic
acidosis
14 pe pulmonary effusion, or pulmonary edema, or pulmonary
embolectomy or pulmonary embolism
15 sk seborrheic keratosis, or solar keratosis
16 uvf ureterovaginal fistula, or urethrovaginal fistula
|
List 8.18.1. JCAHO "do not use" abbreviations (minimum list, effective January 1, 2004).
1 U (for unit) Reason: "U" visually mistaken as 0 or
. Write "unit."
2 IU (for international unit) Reason: Mistaken as IV
(intravenous or the number 4), or the number 10. Write
"international unit."
3 Q.D., Q.O.D. (Latin abbreviation for once daily and every
other day) Reason: Mistaken for each other. The period after
the Q can be mistaken for an "I" and the
"O" can be mistaken for "I." Write
"daily" and "every other day."
4 Trailing zero (X.0 mg), Lack of leading zero (.X mg) Reason:
Decimal point is missed. Never write a zero by itself after
a decimal point (X mg), and always use a zero before a
decimal point (0.X mg).
5 MS, MSO4, MgSO4 Reason: Confused for one another. Can mean
morphine sulfate or magnesium sulfate. Write "morphine
sulfate" or "magnesium sulfate."
6 µg (for microgram) Reason: Mistaken for mg (milligrams),
resulting in a 1000-fold overdose. Write
"mcg."
7 H.S. (for half-strength or Latin abbreviation for bedtime),
q.H.S. Reason: Mistaken for either half-strength or hour of
sleep (at bedtime). q.H.S. mistaken for every hour. All can
result in a dosing error. Write out
"half-strength" or "at bedtime."
8 T.I.W. (for three times a week) Reason: Mistaken for three
times a day or twice weekly, resulting in an overdose. Write
"3 times weekly" or "three times
weekly."
9 S.C. or S.Q. (for subcutaneous) Reason: Mistaken as SL for
sublingual, or "5 every." Write "Sub-Q,"
"subQ," or "subcutaneously."
10 D/C (for discharge) Reason: Interpreted as discontinue
whatever medications follow (typically discharge meds).
Write "discharge."
11 cc (for cubic centimeter) Reason: Mistaken for U (units)
when poorly written. Write "ml" for milliliters
12 A.S., A.D., A.U. (Latin abbreviation for left, right, or
both ears) O.S., O.D., O.U. (Latin abbreviation for left,
right, or both eyes) Reason: Mistaken for each other (e.g.,
AS for OS, AD for OD, AU for OU, etc.). Write "left
ear," "right ear," or "both ears;"
"left eye," "right eye," or "both
eyes."
|
List 9.1.1. Seven interpretations of "I didn't say you lied to me".
1 "I didn't say you lied to me." Stressing
"I," the sentence means that somebody else said
that you lied to me
2 "I didn't say you lied to me." Stressing
"didn't," the sentence means that I had nothing to
do with it
3 "I didn't say you lied to me." Stressing
"say," the sentence means that I didn't speak the
assertion but I may have made the assertion in a written or
other non-verbal communication
4 "I didn't say you lied to me." Stressing
"you," the sentence means that someone else lied
to me
5 "I didn't say you lied to me." Stressing
"lied," the sentence means that I say you did
something to me (other than lying)
6 "I didn't say you lied to me." Stressing
"to," the sentence means that you lied but not to
my face
7 "I didn't say you lied to me." Stressing
"me," the sentence means that you lied to someone
else.
|
List 9.1.2. Common problems that reduce the meaning of narrative text. 1 Complex or run-on sentences 2 Inscrutable use of negations 3 Polysemous words and terms 4 Idiomatic phrases 5 Indiscriminate use of abbreviations 6 Ambiguous pronouns 7 Misspellings. |
List 9.1.3. The following have been widely distributed over the web and purportedly came from real medical charts.
1 "The baby was delivered, the cord clamped and cut, and
handed to the pediatrician, who breathed and cried
immediately."
2 "The patient had waffles for breakfast and anorexia for
lunch."
3 "The patient lives at home with his mother, father, and
pet turtle, who is presently enrolled in day care three
times a week."
4 "Bleeding started in the rectal area and continued all
the way to Los Angeles."
5 "Coming from Detroit, this man has no children."
6 "Examination reveals a well-developed male lying in bed
with his family in no distress."
|
List 9.2.1. Some steps in machine translation.
1 Parsing sentences into grammatic structures
2 Identifying idiomatic expressions
3 Disambiguating polysemous terms (based on sentence context)
4 Re-ordering terms based on grammar rules
5 Providing gender, tense and specialized language structures
that may be absent in the source language
6 Determining grammar rule exceptions existing for words and
terms in the source and target languages
7 Mapping between two different vocabularies.
|
List 9.2.2. Some controlled English rules.
1 Each word in the text may convey only one meaning (e.g., if
iris is an anatomic part of the eye, it cannot also be a
flower)
2 For each meaning, only one term may be used (e.g., if you
use the term "tumor" you should not use the terms
"neoplasm," "neoplastic growth," or "mass" when you
want to convey the same conceptual meaning as
"tumor")
3 Each word is used in only one word class (e.g., if
"report" is a noun, as in surgical pathology
report, it cannot be used as a verb, as in "Please
report the pathology results.")
|
List 9.2.3. Regular English version of excerpt from the Winston Churchill's |
List 9.2.4. Basic English version of excerpt from the Winston Churchill's |
List 9.2.5. Suggestions for controlling medical text.
1 Sentences should be short and declarative, with an
unambiguous sentence terminator
2 Negations should include the word "not" and double
negations should never be used
3 Abbreviations and acronyms should be represented as
all-uppercase letters and should not contain periods, except
when they occur at the end of a sentence. Abbreviations can
be made plural by adding a lowercase "s".
|
List 9.3.1. Research value of archived surgical pathology data.
1 All biopsied disease entities are included in the database,
representing every category of biopsied disease (e.g.,
metabolic/toxic, traumatic, genetic/congenital, neoplastic,
degenerative, inflammatory, infectious)
2 Specimens can be characterized not only by diagnosis but
also by descriptive terminology that may relate to
prognostic or treatment categories
3 Database entries correspond to archived material (glass
slides and paraffin blocks) that can be recovered for
research purposes
4 Preparing and coding reports is an established and required
activity of surgical pathology departments.
|
List 9.4.1. Five common coding errors in human-coded reports.
1 Factually correct but unhelpful codes (e.g., coding all
benign lesions as "negative for tumor")
2 Inconsistent codes (coding "dysplasia" on Monday and
"atypia" on Tuesday)
3 Idiosyncratic codes (using a mnemonic for a lesion, often
inscrutable to other people, such as coding all fungal
infections as "fungus ball," under the morphology
axis, rather than taking the time to assign a specific code
from the infection axis, and remembering that the now
private code "fungus ball" must be used for any
future fungal searches)
4 Entry errors (e.g., entering "lipoma" when one intends to
enter "lymphoma" and accepting the wrong code matched by the
software)
5 Incomplete coding due to impatience or laziness.
|
List 9.5.1. Algorithm for the doublet autocoder.
1 Each phrase (term) in the nomenclature (neocl.xml) is
converted into intercalated doublets, and each doublet is
assigned a consecutive number
2 Each nomenclature phrase is assigned the concatenated
list of numbers that represent the ordered doublets
composing the phrase
3 Every text record (a PubMed abstract in this case) is split
into an array consisting of the consecutive words in the
text record
4 The text array is parsed as intercalated doublets.
Intercalated doublets from the text that match doublets
found anywhere in the nomenclature are assigned their
numeric values (from the doublet index created for the
nomenclature). Runs of consecutive doublets from the text
that match doublets from the nomenclature are built into
concatenated strings of doublet values. A text doublet that
does not match any doublet in the nomenclature cannot
possibly be part of a nomenclature term. Such text doublets
serve as "stop" doublets between candidate runs of text
doublets that match nomenclature doublets
5 The runs of matching doublets are tested to see if they
match any of the runs of doublets that compose nomenclature
terms or if they contain any subsumed terms that match
nomenclature terms
6 The doublet runs extracted from the text that match
nomenclature terms are cached in an external file.
|
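Steps 1 through 5 of the doublet autocoder can be sketched as follows. This is a simplified illustration, not the author's autocoder: the sub names are hypothetical, the nomenclature is passed in as a plain hash rather than read from neocl.xml, the doublet numbers are joined with hyphens so that runs stay unambiguous, and the caching of step 6 is omitted. The four terms and codes are those of the synovial sarcoma example shown earlier.

```perl
use strict;
use warnings;

# Steps 1-2: number every doublet, then key each term's concept code by the
# hyphen-joined list of its doublet numbers.
sub index_nomenclature {
    my (%code_for_term) = @_;
    my (%doublet_number, %code_for_run);
    my $next = 1;
    for my $term (sort keys %code_for_term) {
        my @w = split ' ', lc $term;
        next if @w < 2;                 # single-word terms contain no doublet
        my @nums;
        for my $i (0 .. $#w - 1) {
            my $d = "$w[$i] $w[$i+1]";
            $doublet_number{$d} = $next++ unless exists $doublet_number{$d};
            push @nums, $doublet_number{$d};
        }
        $code_for_run{ join '-', @nums } = $code_for_term{$term};
    }
    return (\%doublet_number, \%code_for_run);
}

# Steps 3-5: split the text into words, build runs of matching doublets, and
# test every run (and every sub-run it subsumes) against the indexed terms.
sub autocode {
    my ($text, $doublet_number, $code_for_run) = @_;
    (my $clean = lc $text) =~ s/[^a-z0-9 \-]/ /g;
    my @w = split ' ', $clean;

    my (@runs, @run);
    for my $i (0 .. $#w - 1) {
        my $n = $doublet_number->{"$w[$i] $w[$i+1]"};
        if (defined $n) {
            push @run, $n;
        }
        else {                          # a "stop" doublet ends the current run
            push @runs, [@run] if @run;
            @run = ();
        }
    }
    push @runs, [@run] if @run;

    my %codes;
    for my $r (@runs) {
        for my $start (0 .. $#$r) {
            for my $end ($start .. $#$r) {
                my $key = join '-', @{$r}[$start .. $end];
                $codes{ $code_for_run->{$key} } = 1
                    if exists $code_for_run->{$key};
            }
        }
    }
    return sort keys %codes;
}

my ($numbers, $runs) = index_nomenclature(
    "sarcoma of the mediastinum"          => "C6606000",
    "synovial sarcoma"                    => "C3400000",
    "synovial sarcoma of the mediastinum" => "C6618000",
    "primary synovial sarcoma"            => "C8826000",
);
print join(" ", autocode(
    "Primary synovial sarcoma of the mediastinum a clinicopathologic "
  . "immunohistochemical and ultrastructural study of 15 cases",
    $numbers, $runs)), "\n";
```

Because the lookup works on doublets, one-word terms need separate handling, which this sketch leaves out.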
List 9.8.1. Algorithm for code-based searches without pre-annotation.
1 The user enters a query term
2 All the terms from a preferred nomenclature that are
synonymous to the query term are collected into a list
3 Each term in the list is matched against the text corpus to
determine the locations in the corpus where the term is
found
4 A list is assembled of corpus locations matching the query
term or its synonyms
5 The user's query term is matched against all the equivalent
terms included in a preferred vocabulary.
|
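The on-the-fly search above needs no pre-annotated corpus, as the following sketch suggests. It is a minimal illustration under stated assumptions: the sub names are hypothetical, the nomenclature is a small hash of lowercase terms, and the corpus is a single invented string. The renal cell carcinoma synonyms and their code are the ones cited in the Concept-Match list later in this chapter.

```perl
use strict;
use warnings;

# Collect every nomenclature term that shares a concept code with the query.
sub synonyms_of {
    my ($query, $code_for) = @_;
    my $code = $code_for->{ lc $query };
    return () unless defined $code;
    my @synonyms = grep { $code_for->{$_} eq $code } keys %$code_for;
    return sort @synonyms;
}

# Return [offset, term] pairs for each place in the corpus where the query
# term or one of its synonyms occurs, ordered by offset.
sub search_corpus {
    my ($query, $code_for, $corpus) = @_;
    my $lc_corpus = lc $corpus;
    my %term_at;
    for my $term (synonyms_of($query, $code_for)) {
        my $pos = index($lc_corpus, $term);
        while ($pos >= 0) {
            $term_at{$pos} = $term;
            $pos = index($lc_corpus, $term, $pos + 1);
        }
    }
    return map { [ $_, $term_at{$_} ] } sort { $a <=> $b } keys %term_at;
}

# Small sample nomenclature; keys must be lowercase.
my %code_for = (
    "renal cell carcinoma"   => "C0007134",
    "hypernephroma"          => "C0007134",
    "hypernephroid carcinoma" => "C0007134",
    "grawitz tumor"          => "C0007134",
);

# Invented corpus text, for illustration only.
my $corpus = "The mass was a hypernephroma. Grawitz tumor is an older name.";
for my $hit (search_corpus("renal cell carcinoma", \%code_for, $corpus)) {
    print "offset $hit->[0]: $hit->[1]\n";
}
```

The query term itself never appears in the sample corpus, yet both synonymous mentions are located, which is the point of searching by concept rather than by string.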
List 9.8.2. Steps involved in implementing on-the-fly code searches.
1 Users will need to have a dataset or plain-text corpus in
which every record is separated by the same delimiter. For
testing purposes, the author created a 105 megabyte text.
The corpus was created by a PubMed query on "pathology
(ad) AND neoplasm", at the U.S. National Library of
Medicine's website (www.pubmed.org). The query gathered all
abstracts from the PubMed database in which the term
neoplasm occurs somewhere in the PubMed entry, and in which
the affiliation of the author contains the word
"pathology". The query yielded abstracts that are
likely to contain names of neoplasms. The PubMed output file
can serve as a good test for an autocoder that uses a
neoplasm nomenclature. The PubMed search yielded 66,509
abstracts. All of the abstracts were downloaded into a
single file from the PubMed site by setting the
"Display" attribute to "Abstract" and
the "Send to" attribute to "file". This
produced a 105,689,546 byte plain-text file. The file was
given the filename tumorab.txt, and this filename was used
by the autocoders as a parsing input file. Although this
file is not included with this manuscript, anyone in the
world with internet access can obtain a near-identical file
by repeating the same PubMed query. The records from the
text consisted of paragraphs delimited by double return
characters (also referred to as double-newline characters or
double |
List 9.9.1. When pre-annotation is useful.
1 When there is the need for a global analysis of the dataset.
(In a global analysis of a dataset, all the data elements
are examined at once. Typically, the researcher is looking
for relationships among the different data elements.)
2 When the query response time must be very rapid
3 When the expected number of queries may be very large.
|
List 9.9.2. When fast doublet matching is useful.
1 When the dataset is typically searched one item at a time
2 When the dataset does not change or changes only by the
addition of records
3 When the dataset is being searched using many different
types of vocabularies
4 When the dataset is searched with a single vocabulary that
is constantly changing
5 When one dataset needs to be integrated with another
dataset, and the datasets have not been annotated with the
same nomenclature.
|
List 9.10.1. When you will need to re-code.
1 Whenever you want to change from one nomenclature to another
(eliminates problem of brand-name loyalty)
2 Whenever you introduce a new version of a nomenclature
3 Whenever you want to use a new coding algorithm (e.g.,
parsimonious versus comprehensive, or linking code to a
particular extracted portion of report)
4 Whenever you add legacy data to your LIS
5 Whenever you merge different pathology datasets (eliminates
many mapping projects)
|
List 10.2.1. Making medical record data harmless.
1 De-identification of data fields that specifically
characterize the patient (name, social security number,
hospital number, address, age, etc.)
2 Free-text data scrubbing, removing identifiers from the
textual portion of medical reports
3 Rendering the dataset ambiguous, ensuring that patients
cannot be identified by data records containing a unique set
of characterizing information
4 Free-text data privatizing, removing any information of a
private nature that may be contained within the report.
|
List 10.3.1. HIPAA safe harbor identifiers.
1 Names
2 Geographic subdivisions smaller than a State
3 Dates (except year) directly related to patient
4 Telephone numbers
5 Fax numbers
6 E-mail addresses
7 Social security numbers
8 Medical record numbers
9 Health plan beneficiary numbers
10 Account numbers
11 Certificate/license numbers
12 Vehicle identifiers and serial numbers
13 Device identifiers and serial numbers
14 Web URLs
15 Internet Protocol (IP) address numbers
16 Biometric identifiers, including finger and voice prints
17 Full face photographic images and any comparable images
18 Any other unique identifying number, characteristic, or
code, except as permitted under HIPAA to re-identify
data
|
List 10.6.1. Deficiencies of subtractive data scrubbing methods.
1 Requires the creation and continuous maintenance of an
identifier list consisting of names of patients, staff, and
medical centers as well as addresses and other geographic
minutiae
2 Requires the creation and continuous maintenance of rules
for excluding text based on co-locations or patterns of
expression that might signify a HIPAA identifier (e.g. a
sequence of digits and slashes that might represent a date)
3 Does not exclude private information that is non-identifying
but which may be incriminating or distasteful
4 Does not satisfy the "minimum necessary" (see
Glossary) principle, holding that medical data convey only
that information which is needed for research purposes
5 Slow. Each parsed sentence is typically evaluated through
the entire list of pattern rules. This means that parsing a
long corpus of medical text will take considerable time
6 Complex. Maintaining the rule list and the identifier list
will add to the overall complexity of the software. Each
institution that implements the software will need to
maintain their own lists created for their patients and for
their textual styles and formats
7 Inadequate. Subtractive scrubbers, under the best of
circumstances, will occasionally miss an identifier. If a
scrubber is 99% accurate, it may miss thousands of
identifiers in a large text.
|
List 10.6.2. Algorithm for the Concept-Match method of data scrubbing.
1 Parse all input into sentences
2 Parse each sentence into words
3 Each stop word (a high-frequency word, such as a
preposition or common adjective) is preserved in its
original place within each sentence
4 Intervening words and phrases are mapped to a standard
nomenclature. This step requires breaking phrases into all
possible ordered concatenations of words. For instance,
"Margins free of tumor" would become "margins
free of tumor, margins free of, free of tumor, margins free,
free of, of tumor, margins, free, of, tumor." Each
member of the derivative list is matched against the entire
database of Unified Medical Language System (UMLS) terms to
determine if a code exists for the term. Large terms subsume
smaller substring terms
5 Each coded term is replaced by an alternate term that
maps to the same concept code, if an alternate term exists.
For instance, the term renal cell carcinoma appearing in the
text would be replaced by C0007134 (the UMLS Concept Unique
Identifier for renal cell carcinoma) and by a different term
that maps to the same code (such as rcc, hypernephroma,
hypernephroid carcinoma, or Grawitz tumor). This step
produces an output containing a different set of words than
the original text (see List)
6 All other words are replaced by a blocking symbol
consisting of an asterisk.
|
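Two parts of the Concept-Match algorithm lend themselves to a short sketch: the expansion of a phrase into its ordered concatenations (step 4) and the replacement pass (steps 3, 5 and 6). What follows is a simplified illustration, not the published implementation. The sub names are hypothetical, a two-term hash stands in for the UMLS, the stop-word list is a small sample, and the replacement pass takes the shortcut of trying the longest phrase first at each word position instead of materializing the full list.

```perl
use strict;
use warnings;

# Step 4: list every run of consecutive words in a phrase, longest first.
sub ordered_concatenations {
    my ($phrase) = @_;
    my @w = split ' ', lc $phrase;
    my @out;
    for (my $len = @w; $len >= 1; $len--) {
        for my $start (0 .. @w - $len) {
            push @out, join ' ', @w[$start .. $start + $len - 1];
        }
    }
    return @out;
}

# Steps 3, 5 and 6 in miniature: coded terms become their code plus an
# alternate term, stop words are kept, and every other word becomes "*".
sub concept_match_scrub {
    my ($sentence, $code_for, $is_stop) = @_;
    (my $clean = lc $sentence) =~ s/[^a-z0-9 \-]/ /g;
    my @w = split ' ', $clean;
    my @out;
    my $i = 0;
    WORD: while ($i < @w) {
        # Longest phrase first, so that large terms subsume smaller ones.
        for (my $len = @w - $i; $len >= 1; $len--) {
            my $phrase = join ' ', @w[$i .. $i + $len - 1];
            next unless exists $code_for->{$phrase};
            my $code = $code_for->{$phrase};
            my @alts = grep { $_ ne $phrase && $code_for->{$_} eq $code }
                       sort keys %$code_for;
            push @out, $code, (@alts ? $alts[0] : $phrase);
            $i += $len;
            next WORD;
        }
        push @out, ($is_stop->{ $w[$i] } ? $w[$i] : '*');
        $i++;
    }
    return join ' ', @out;
}

# Terms and code come from the renal cell carcinoma example in step 5;
# the stop-word list is a short illustrative sample.
my %code_for = (
    "renal cell carcinoma" => "C0007134",
    "hypernephroma"        => "C0007134",
);
my %is_stop = map { $_ => 1 } qw(a an and has in of the with);

print join(", ", ordered_concatenations("Margins free of tumor")), "\n";
print concept_match_scrub("Mr Brown has a renal cell carcinoma",
                          \%code_for, \%is_stop), "\n";
```

The first print statement reproduces the ten-member list given in step 4. In the second, the patient's name is blocked because it is neither a stop word nor a nomenclature term; no list of names is consulted.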
List 10.6.3. Sample output of the Concept-Match scrubbing method.
<rdf:Description about="urn:PMID-15832079">
<dc:title>
primary synovial sarcoma of the mediastinum a clinicopathologic immunohistochemical and ultrastructural
study of 15 cases
</dc:title>
<v:autocode term="sarcoma" code="C0000000" />
<v:autocode term="sarcoma of the mediastinum" code="C6606000" />
<v:autocode term="synovial sarcoma" code="C3400000" />
<v:autocode term="synovial sarcoma of the mediastinum" code="C6618000" />
<v:autocode term="primary synovial sarcoma" code="C8826000" />
<de_id>
* primary synovial sarcoma of the mediastinum a * * and * * of * * * *
</de_id>
</rdf:Description>
|
List 10.6.4. Scrub.pl, a Perl script to scrub text.
#!/usr/local/bin/perl
# Load the doublet list (one doublet per line) into a hash
open (TEXT, "doubdb.txt")||die"Can't open file";
while ($line = <TEXT>)
  {
  $line =~ s/\n//o;
  $doubhash{$line} = "";
  }
close TEXT;
print "What would you like to scrub?\n";
$line = <STDIN>;
print "Scrubbed text.... ";
# Normalize: lowercase, drop possessives, blank out punctuation
$line = lc($line);
$line =~ s/\'s//g;
$line =~ s/[^a-z0-9 \-]/ /g;
@hoparray = split(/ +/,$line);
$lastword = "\*";
for ($i=0;$i<(scalar(@hoparray));$i++)
  {
  # On the last word, $hoparray[$i+1] is undefined and no doublet matches
  $doublet = "$hoparray[$i] $hoparray[$i+1]";
  if (exists $doubhash{$doublet})
    {
    # Known doublet: print its first word, hold its second word
    print " $hoparray[$i]";
    $lastword = " $hoparray[$i+1]";
    }
  else
    {
    # Unknown doublet: print the held word (or an asterisk)
    print $lastword;
    $lastword = " \*";
    }
  }
print "\n";
exit;
|
List 10.6.5. Sample output from scrub.pl. 1 C:\ftp>perl scrub.pl 2 What would you like to scrub? 3 Basal cell carcinoma, margins involved 4 Scrubbed text.... basal cell carcinoma margins involved 5 C:\ftp>perl scrub.pl 6 What would you like to scrub? 7 Rhabdoid tumor of kidney 8 Scrubbed text.... rhabdoid tumor of kidney 9 C:\ftp>perl scrub.pl 10 What would you like to scrub? 11 Mr Brown has a basal cell carcinoma 12 Scrubbed text.... * * has a basal cell carcinoma 13 C:\ftp>perl scrub.pl 14 What would you like to scrub? 15 Mr. Brown was born on Tuesday, March 14, 1985 16 Scrubbed text.... * * * * * * * * * 17 C:\ftp>perl scrub.pl 18 What would you like to scrub? 19 The doctor killed the patient 20 Scrubbed text.... * * * * * |
List 10.10.1. Rules of non-uniqueness in database de-identification.
1 No unique cancers are allowed (i.e., every concept in
the database must be present in at least two records). Any
record with a uniquely occurring tumor is expunged
2 N-plicate records cannot contain values that are unique
to the n-plicate set (if you have n identical records, you
should be able to look at all the cancer diagnoses in the
replicated record and find other records with the same
diagnosis). If that is not the case, all such n-plicate
records should be expunged from the public use database
3 No value may be allowed to occur in every record from a
set of records that share a common diagnosis. If that is the
case, every record containing the concept is expunged
4 Every value that co-occurs in a record that contains a
gender-specific tumor must occur in more than one additional
record that does not contain a gender-specific tumor.
Otherwise, all records containing the concept are expunged.
(NOTE: the same argument may be made for tumors that are
highly age-specific, or ethnicity-specific or possibly
locale-specific)
5 No value-binned record can be unique to a second
diagnosis (i.e., there may be 10 rhabdoid tumor cases, but
only one case of rhabdoid tumor and basal cell carcinoma).
Such cases are expunged.
|
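The first rule above is the easiest to express in code. The sketch below covers rule 1 only and rests on assumptions that are not in the text: each record is represented as a reference to an array of concept codes, and the sub name expunge_unique_concepts is hypothetical. The codes in the demonstration are borrowed from the synovial sarcoma example and do not describe real patient records.

```perl
use strict;
use warnings;

# Rule 1 only: expunge any record that holds a concept found in no other
# record. Each record is a reference to an array of concept codes.
sub expunge_unique_concepts {
    my (@records) = @_;

    # Count the number of records in which each concept appears,
    # counting a concept once per record even if it is repeated.
    my %record_count;
    for my $rec (@records) {
        my %seen;
        for my $concept (@$rec) {
            $record_count{$concept}++ unless $seen{$concept}++;
        }
    }

    # Keep only records whose concepts all appear in two or more records.
    my @kept;
    RECORD: for my $rec (@records) {
        for my $concept (@$rec) {
            next RECORD if $record_count{$concept} < 2;
        }
        push @kept, $rec;
    }
    return @kept;
}

# Illustrative records; the third holds a concept seen nowhere else.
my @records = (
    [ "C3400000" ],
    [ "C3400000", "C6606000" ],
    [ "C6606000", "C8826000" ],
);
my @public = expunge_unique_concepts(@records);
print scalar(@public), " of ", scalar(@records), " records retained\n";
```

This is a single pass. Removing a record can leave another concept with only one remaining occurrence, so a stricter version might repeat the pass until no further records are expunged.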
List 10.11.1. Performance issues for de-identification software.
1 1. Product availability. Is the software product freely
available and open source? Grant applications that propose
proprietary data sharing solutions may receive disparaging
reviews in study sections. Reviewers may expect large,
multi-institutional efforts to implement open source
de-identification algorithms. Conversely, proprietary
solutions may be ideal for laboratory personnel who lack the
resources to implement and test published algorithms and who
prefer turnkey applications
2 2. Product speed. Is the de-identification process fast?
This becomes important when the research project involves
millions of records or requires reprocessing records to
satisfy research objectives that change over time or that
serve different research protocols
3 3. Product error rate. There is a trade-off between the
accurate preservation of textual information and the
successful elimination of all identifiers. If the research
project requires the human review of de-identified reports,
it may be necessary to use a de-identification method that
preserves as much of the original text as is feasible.
De-identification methods that maximize the preservation of
original text will tend to have the highest error rates
4 4. Product integration and support issues. Will the
de-identification software work with heterogeneous data
sources, or is it constrained to work with a specific data
input? Will the software permit an interface to the
researcher's preferred database, or will the researcher be
required to transform the primary data structure to a
secondary data structure? If so, will the secondary data
structure conform to an open standard (see Glossary), or
will it be a proprietary data structure? Will the software
be upgraded and will the upgrades be freely available? Can
the software be modified without violating license
agreements?
5 5. Convenience. Will the product require continual
maintenance, staff training, and quality assurance?
Sometimes simplicity and easy maintenance will justifiably
outweigh performance.
|
List 11.1.1. Cryptographic methods vital to biomedical informatics. 1 Encrypting and decrypting messages 2 Electronic signatures 3 Message authentication 4 Time stamping 5 Creating unique identifiers 6 Reconciling patients across institutions 7 De-identification and Re-identification 8 Privatizing data sharing protocols 9 Data referencing (with message digests) 10 Watermarking and steganography (see Glossary) utilities |
List 11.2.1. Example protocol for a one-way hash de-identified record linkage.
1 1. John Q. Public arrives for the first time in your medical
clinic
2 2. John Q. Public has a glucose test ordered and receives a
glucose value of 85
3 3. Using the MD5 one-way hash algorithm on the character
string "John Q. Public", a hash value of
"3f875ec450dfbb07ed889e7b9c36da92" is generated
4 4. In addition to John Q. Public's identified medical
record, a de-identified record is prepared:
5 3f875ec450dfbb07ed889e7b9c36da92^^glucose^^85
6 A property of the one-way hash value is that it is a
seemingly random collection of letters and numbers and no
computational efforts applied to the one-way hash value can
yield the patient's name
7 The de-identified record is given to a trusted database
administrator who adds it to the database of de-identified
records. The database administrator cannot identify any of
the patients whose records are included in the database
8 5. Ten years later, John Q. Public returns to the medical
clinic and has another glucose test. This time, the glucose
value is 95
9 A one-way hash is performed on the string "John Q.
Public" yielding 3f875ec450dfbb07ed889e7b9c36da92, and
a new de-identified record is prepared:
10 3f875ec450dfbb07ed889e7b9c36da92^^glucose^^95
11 The de-identified record is given to the trusted database
administrator who adds it to the aggregate database. The
database program finds a match to the one-way hash and
concatenates the new record to the old record:
12 3f875ec450dfbb07ed889e7b9c36da92^^glucose^^85^^glucose^^95
|
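The record linkage in List 11.2.1 can be sketched with the core Digest::MD5 module. This is a minimal illustration under stated assumptions: the subroutine name add_result and the in-memory %database hash are hypothetical, and the digest the script computes is not claimed to equal the value printed in the list, because the exact input string used there is not known.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %database;    # one-way hash => concatenated de-identified record

# Build a de-identified record and merge it into the database.
# A repeat visit by the same patient yields the same hash, so the new
# result is concatenated onto the existing record.
sub add_result {
    my ($name, $test, $value) = @_;
    my $hash = md5_hex($name);    # 32 hexadecimal characters
    if (exists $database{$hash}) {
        $database{$hash} .= "^^$test^^$value";
    }
    else {
        $database{$hash} = "$hash^^$test^^$value";
    }
    return $database{$hash};
}

add_result("John Q. Public", "glucose", 85);           # first visit
print add_result("John Q. Public", "glucose", 95), "\n";  # ten years later
```

The linkage works only if the identifier is typed identically at every visit; "John Q Public" without the period would hash to an unrelated value.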
List 11.4.1. Section of HIPAA regulation specifically addressing use of one-way hash de-identifiers.
1 Comment: Several commenters who supported the creation of
de-identified data for research based on removal of facial
identifiers asked if a keyed-hash message authentication
code (HMAC) can be used as a re-identification code even
though it is derived from patient information, because it is
not intended to re-identify the patient and it is not
possible to identify the patient from the code. The
commenters stated that use of the keyed-hash message
authentication code would be valuable for research, public
health and bio-terrorism detection purposes where there is a
need to link clinical events on the same person occurring in
different health care settings (e.g. to avoid double
counting of cases or to observe long-term outcomes)
2 Response: The HMAC does not meet the conditions for use as a
re-identification code for de-identified information. It is
derived from individually identified information and it
appears the key is shared with or provided by the recipient
of the data in order for that recipient to be able to link
information about the individual from multiple entities or
over time. Since the HMAC allows identification of
individuals by the recipient, disclosure of the HMAC
violates the Rule.
|
List 11.4.2. Steps of the protocol.
1 Step 1. Institution A and Institution B each create a random
character string and send it to the other institution
2 Step 2. Each institution receives the random character
string from the other institution and sums it with their own
random character string, producing a random character string
common to both institutions (RandA+B)
3 Step 3. Each institution takes a patient identifier (a name,
a social security number, a birth date, or some combination
of identifiers) and sums it with RandA+B. The result is a
patient random character string that is identical across
institutions when the patient is identical in both
institutions. This step may be implemented several different
ways
4 Step 3, implementation strategy 1. RandA+B is now destroyed
at both institutions, and RandA and RandB are destroyed by
the institutions that created each random string, leaving
only the patient random character string at each
institution. The destruction of these random numbers makes
it impossible to recompute the original identifier from the
patient random character string
5 Step 3, implementation strategy 2. At this point,
institutions may provide the patient random character string
to a data broker. Having only the patient random character
strings, the broker has zero patient-related information
6 Step 3, implementation strategy 3. The summation function
can be any one of many logical operations on characters or
strings or their constitutive bits, including logical or,
xor, modulo addition, etc
7 Step 4. Institutions A and B compare a subset of their
patient random character strings
8 Step 4, implementation strategy 1. Institution A sends the
first character of the patient random character string to
Institution B. If the first character (or any subsequent
character) is not identical in both institutions, the
protocol ends. The 2 patients are not the same person. If
the first character is identical in both institutions,
Institution B sends the second character of its patient
random character string to Institution A. If the second
character held by Institution B is the same as the second
character held by Institution A, the process is repeated
until a sufficient number of transactions have occurred,
short of the length of the random character string, to
convince the institutions that they have the same patient
random character string. Implementing this optional strategy
ensures that the patient random character strings are never
actually exchanged between institutions.
|
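The steps in List 11.4.2 can be sketched as follows, taking bytewise xor as the summation function (one of the options named in Step 3, implementation strategy 3). The subroutine names, the 32-byte string length, the second patient name, and the choice to compare 8 characters are assumptions made for the example. Perl's built-in rand is not cryptographically strong and stands in here only to keep the sketch short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Create a string of $len random bytes.
sub random_string {
    my ($len) = @_;
    return join '', map { chr(int(rand(256))) } 1 .. $len;
}

# "Summation" implemented as bytewise xor of two strings.
# The shorter string is treated as if padded with zero bytes.
sub sum_strings {
    my ($x, $y) = @_;
    return "$x" ^ "$y";
}

# Step 4, strategy 1: compare the first $n characters one at a time,
# stopping at the first mismatch.
sub same_prefix {
    my ($s1, $s2, $n) = @_;
    for my $i (0 .. $n - 1) {
        return 0 if substr($s1, $i, 1) ne substr($s2, $i, 1);
    }
    return 1;
}

my $rand_a  = random_string(32);                # Step 1, Institution A
my $rand_b  = random_string(32);                # Step 1, Institution B
my $rand_ab = sum_strings($rand_a, $rand_b);    # Step 2, RandA+B

# Step 3: each institution sums its own patient identifier with RandA+B.
my $at_a  = sum_strings("John Q. Public", $rand_ab);
my $at_b  = sum_strings("John Q. Public", $rand_ab);
my $other = sum_strings("Jane Roe",       $rand_ab);

print same_prefix($at_a, $at_b,  8) ? "match\n" : "no match\n";
print same_prefix($at_a, $other, 8) ? "match\n" : "no match\n";
```

With xor as the summation, anyone who still holds RandA+B can recover the identifier from the patient random character string, which is why Step 3, implementation strategy 1 has both institutions destroy the random strings.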
List 11.4.3. Zero-Check Properties.
1 No knowledge about the patient is transmitted across
institutions. When institutions use an institutional broker
to complete the transactions, the institutions themselves
have no knowledge of the identity of the individuals
2 The protocol uses no encryption or 1-way hash algorithm, and
therefore, there is no need to protect the protocol from
discovery
3 By destroying RandA, RandB, and RandA+B, the protocol can be
implemented in a manner that makes it impossible to
recompute the original identifier from the patient random
character string.
|
List 11.5.1. The basic threshold protocol.
1 1. Text is divided into short phrases
2 2. Each phrase is converted by a one-way hash algorithm into
a seemingly random set of characters
3 3. Threshold Piece 1 is composed of the list of all phrases,
with each phrase followed by its one-way hash
4 4. Threshold Piece 2 is composed of the text with all
phrases replaced by their one-way hash values, and with
high-frequency words preserved.
|
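The four steps of List 11.5.1 can be sketched with the core Digest::MD5 module. This is an illustration under stated assumptions rather than the author's program: phrases are taken to be the runs of words between stop words, the eight-word stop list is hypothetical, and MD5 is assumed as the one-way hash. The hash values it prints are not claimed to match those shown in Lists 11.5.2 and 11.5.3.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stop word list; a real list would be much longer.
my %stop = map { $_ => 1 } qw(they that the were as in and this);

# Return (piece 1, piece 2) for a text.
# Piece 1: hash reference mapping one-way hash => phrase.
# Piece 2: the text with each phrase replaced by its hash, stop words kept.
sub threshold {
    my ($text) = @_;
    my (%piece1, @piece2, @phrase);
    my $flush = sub {
        return unless @phrase;
        my $phrase = join ' ', @phrase;
        my $hash   = md5_hex($phrase);
        $piece1{$hash} = $phrase;    # recurring phrases collapse to one entry
        push @piece2, $hash;
        @phrase = ();
    };
    for my $word (split ' ', lc($text)) {
        $word =~ s/[^a-z0-9]//g;     # strip punctuation
        next unless length $word;
        if ($stop{$word}) { $flush->(); push @piece2, $word; }
        else              { push @phrase, $word; }
    }
    $flush->();
    return (\%piece1, join(' ', @piece2));
}

my ($piece1, $piece2) = threshold(
    "They suggested that the manifestations were as severe in the mother "
  . "as in the sons and that this suggested autosomal dominant inheritance."
);
print "$_ => $piece1->{$_}\n" for sort keys %$piece1;
print "$piece2\n";
```

Piece 1 is printed in sorted hash order, so it carries no trace of where each phrase sat in the text.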
List 11.5.2. Bob's piece 1.
1 684327ec3b2f020aa3099edb177d3794 => suggested autosomal
dominant inheritance
2 3c188dace2e7977fd6333e4d8010e181 => mother
3 8c81b4aaf9c2009666d532da3b19d5f8 => manifestations
4 db277da2e82a4cb7e9b37c8b0c7f66f0 => suggested
5 e183376eb9cc9a301952c05b5e4e84e3 => sons
6 22cf107be97ab08b33a62db68b4a390d => severe
|
List 11.5.3. Bob's piece 2. 1 they db277da2e82a4cb7e9b37c8b0c7f66f0 that the 2 8c81b4aaf9c2009666d532da3b19d5f8 were as 3 22cf107be97ab08b33a62db68b4a390d in the 4 3c188dace2e7977fd6333e4d8010e181 as in the 5 e183376eb9cc9a301952c05b5e4e84e3 and that this 6 684327ec3b2f020aa3099edb177d3794. |
List 11.5.4. Properties of Piece 1 (the listing of phrases and their one-way hashes).
1 Contains no information on the frequency of occurrence of
the phrases found in the original text (because recurring
phrases map to the same hash code and appear as a single
entry in piece 1)
2 Contains no information that Alice can use to connect any
patient to any particular patient record. Records do not
exist as entities in Piece 1
3 Contains no information on the order or locations of the
phrases found in the original text
4 Contains all the concepts found in the original text. Stop
words are a popular method of parsing text into concepts
5 Bob can destroy piece 1 and re-create it later from the
original file
6 Alice can use the phrases in Piece 1 to transform, annotate
or search the concepts found in the original file
7 Alice can transfer piece 1 to a third party without
violating HIPAA privacy rules or Common Rule human subject
regulations (in the U.S.). For that matter, Alice can keep
piece 1 and add it to her database of piece 1 files
collected from all of her clients.
|
List 11.5.5. Properties of Piece 2.
1 Contains no information that can be used to connect any
patient to any particular patient record
2 Contains nothing but hash values of phrases and stop words,
in their correct order of occurrence in the original text
3 Anyone obtaining piece 1 and piece 2 can reconstruct the
original text
4 Bob can lose or destroy piece 2, and re-create it later from
the original file.
|
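Property 3 of List 11.5.5 can be checked directly against Bob's two pieces. The sketch below copies the hash values from Lists 11.5.2 and 11.5.3 (with the list item numbers removed) and substitutes each hash in Piece 2 with its phrase from Piece 1. The subroutine name reconstruct is an assumption made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Piece 1, copied from List 11.5.2: one-way hash => phrase.
my %piece1 = (
    '684327ec3b2f020aa3099edb177d3794' => 'suggested autosomal dominant inheritance',
    '3c188dace2e7977fd6333e4d8010e181' => 'mother',
    '8c81b4aaf9c2009666d532da3b19d5f8' => 'manifestations',
    'db277da2e82a4cb7e9b37c8b0c7f66f0' => 'suggested',
    'e183376eb9cc9a301952c05b5e4e84e3' => 'sons',
    '22cf107be97ab08b33a62db68b4a390d' => 'severe',
);

# Piece 2, copied from List 11.5.3.
my $piece2 = 'they db277da2e82a4cb7e9b37c8b0c7f66f0 that the '
           . '8c81b4aaf9c2009666d532da3b19d5f8 were as '
           . '22cf107be97ab08b33a62db68b4a390d in the '
           . '3c188dace2e7977fd6333e4d8010e181 as in the '
           . 'e183376eb9cc9a301952c05b5e4e84e3 and that this '
           . '684327ec3b2f020aa3099edb177d3794.';

# Replace every 32-character hexadecimal token found in piece 1 by its
# phrase; tokens with no entry in piece 1 are left untouched.
sub reconstruct {
    my ($text, $lookup) = @_;
    $text =~ s/\b([0-9a-f]{32})\b/exists $lookup->{$1} ? $lookup->{$1} : $1/ge;
    return $text;
}

print reconstruct($piece2, \%piece1), "\n";
# prints: they suggested that the manifestations were as severe in the
# mother as in the sons and that this suggested autosomal dominant inheritance.
```

Neither piece alone yields the sentence: Piece 1 has the phrases without their order, and Piece 2 has the order without the phrases.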
List 11.6.1. FDA definitions related to digital signatures, from ().
1 Biometrics means a method of verifying an individual's
identity based on measurement of the individual's physical
|
List 11.6.2. Qualities of an adequate electronic signature.
1 Non-repudiation. Can the signator plausibly deny that she
signed the document?
2 Unique identification. Does the electronic signature
uniquely identify the signator?
3 Universal. Can the signature be used with other existing
signature or biometric systems?
4 User-friendliness. Can the signature be signed incorrectly
or misinterpreted?
5 Non-obsolescence. Will everyone be able to read the
signature ten years or 100 years from now?
6 Security. Can someone change the file that was signed,
change the signature to someone else's, invalidate a valid
signature, or steal the signature?
7 Extensibility. Can someone sign for the signator, sign with
the signator, or notarize the signature with another
electronic signature?
|
List 11.6.3. Do you trust digital signatures?
1 Is your computer secure? Might someone have stolen your
private key?
2 Is your encryption software secure? Might it contain a
trapdoor subprogram that sends your private key to a
malevolent entity?
3 Is your encryption software reliable? Might it produce the
same private key for different customers?
4 Can you be certain that the software "signed" the
correct document? Might it have signed a different document
by mistake during a signing transaction?
5 Are you certain that you will not lose track of your
private keys over time?
|
List 12.2.1. Six extraordinary properties of XML.
1 Enforced and defined structure (XML rules and schema)
2 Formal metadata (through the ISO 11179 specification)
3 Namespaces (permit sharing of uniquely identifiable common
data elements (CDEs))
4 Linking data via the internet (through Uniform Resource
Identifiers)
5 Logic and meaning (the Semantic Web and Ontologies)
6 Self-awareness (software agents (see Glossary), artificial
intelligence (see Glossary), embedded protocols and
commands)
|
List 12.4.1. Descriptors for the metadata tag <core_organism>.
1 core_organism
2 Identifier: core_organism
3 Version: 1.0
4 Registration Authority: Association for Pathology
Informatics
5 Language: English (en)
6 Obligation: Optional
7 Datatype: Character String representing taxonomy.dat
identifier number followed by an allowable taxonomy.dat name
for the identifier number
8 Maximum Occurrence: Unlimited
9 Definition: Organism name at species level for organism
whose tissue is represented in the donor block
10 Comment: URI for taxonomy.dat is ftp://ftp.ebi.ac.uk/pub/databases/taxonomy/taxonomy.dat
The correct entry for human tissue is "9606 human"
|
List 12.10.1. Steps to create data documents that can be easily integrated across heterogeneous datasets.
1 1. Find a simple way of describing pathology data using a
syntax that confers meaning onto data (i.e., RDF syntax)
2 2. Develop a simple approach to listing the unique objects
relevant to the pathology domain (e.g. surgical pathology
report, specimen, block, stain, laboratory test, submission
data, completion data) and a way to uniquely identify these
objects that can be used by any laboratory without risk of
losing track of unique objects and without risk of assigning
non-unique objects the same identifier
3 3. Develop a repository for metadata that defines each
metadata element in the pathology domain and describes the
data constraints (if any) of the data elements described by
the metadata
4 4. Develop general algorithms/software for integrating,
aggregating and transforming data held in RDF triple
databases.
|
List 12.11.1. Image description using RDF triples in Notation 3 format.
# file: image.n3
@prefix : <http://www.pathologyinformatics.org/image_schema.rdf#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:Baltimore_Hospital_Center rdf:type "Hospital" .
:Baltimore_Hospital_Center_4357 rdf:type "Unique_medical_identifier" .
:Baltimore_Hospital_Center_4357 :patient_name "Sam_Someone" .
:Baltimore_Hospital_Center_4357 :surgical_pathology_specimen "S3456_2001" .
:S_3456_2001 rdf:type "Surgical_pathology_specimen" .
:S_3456_2001 :image <https://baltohosp.org/pathology/y49w3p2.jpg> .
:S_3456_2001 :log_in_date "2001-08-15" .
:S_3456_2001 :clinical_history "30_years_oral_tobacco_use" .
<https://baltohosp.org/pathology/y49w3p2.jpg> rdf:type "Medical_image" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :surgical_pathology_accession_number "S3456-2001" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :specimen "2" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :block "3" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :format "jpeg" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :width "524_pixels" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :height "429_pixels" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :hash_value "84027730gjsj350489" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :hash_type "md_5" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :watermark "none" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :camera "Sony" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :camera_model "342" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :capture_date "2002-02-02" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :diagnosis "squamous_cell_carcinoma" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :topography "floor_of_mouth" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :has "Intellectual_property_restriction" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :copyright "all_rights_reserved" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :copyright_holder "Baltimore_Hospital_Center" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :microscope "Olympus" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :microscope_model "3453" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :microscope_objective_power "40X" .
<https://baltohosp.org/pathology/y49w3p2.jpg> :photographer_name "Jules_Berman" .
|
List 12.11.2. RDFParse.pl, a Perl script that converts Notation 3 into RDF.
#!/usr/local/bin/perl
# rdfparse.pl
use strict;
use warnings;
use RDF::Notation3;
use RDF::Notation3::Triples;
use RDF::Notation3::XML;
my $path = "image.n3";
# Parse the Notation 3 file into triples and print the first element of the first triple
my $rdf = RDF::Notation3::Triples->new();
$rdf->parse_file($path);
my $triples = $rdf->get_triples;
print $triples->[0]->[0];
print "###\n";
# Parse the same file again and print it as RDF/XML
$rdf = RDF::Notation3::XML->new();
$rdf->parse_file($path);
my $string = $rdf->get_string;
print $string;
exit;
|
List 12.11.3. Properties of image.n3.
1 Every statement has a fully specified object followed by a
property and a value (a triple)
2 Every unique object belongs to a class (at least one).
|
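The two properties in List 12.11.3 can be tested mechanically. The sketch below is an assumption-laden illustration: it holds a handful of the triples from List 12.11.1 in a plain Perl array instead of parsing image.n3, and the subroutine name untyped_objects is hypothetical. It uses the document's terms (object, property, value) for the three parts of a triple.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few triples taken from List 12.11.1, held as [object, property, value].
my @triples = (
    [':Baltimore_Hospital_Center', 'rdf:type', 'Hospital'],
    [':S_3456_2001', 'rdf:type', 'Surgical_pathology_specimen'],
    [':S_3456_2001', ':log_in_date', '2001-08-15'],
    ['<https://baltohosp.org/pathology/y49w3p2.jpg>', 'rdf:type', 'Medical_image'],
    ['<https://baltohosp.org/pathology/y49w3p2.jpg>', ':format', 'jpeg'],
);

# Property 1: die if any statement is not a complete triple.
# Property 2: return the objects that have no rdf:type statement.
sub untyped_objects {
    my @statements = @_;
    my (%seen, %typed);
    for my $t (@statements) {
        my ($object, $property, $value) = @$t;
        die "not a triple\n"
            unless @$t == 3 && length($object) && length($property) && length($value);
        $seen{$object}  = 1;
        $typed{$object} = 1 if $property eq 'rdf:type';
    }
    my @missing = grep { !$typed{$_} } sort keys %seen;
    return @missing;
}

my @missing = untyped_objects(@triples);
print @missing ? "untyped: @missing\n" : "every object has a class\n";
# prints "every object has a class"
```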
List 12.12.1. A few examples of XML schema primitives that can be incorporated in DAML. 1 enumeration 2 positiveInteger 3 minInclusive 4 integer 5 pattern |
List 13.1.1. Complexities of commercial biomedical software.
1 Must protect the confidentiality and privacy of patients
2 Must not produce errors. Patient lives are at stake
3 Must not crash. Patient lives are at stake
4 Must have a graphical user interface that anyone can use. Staff
training is an expensive proposition, and most medical
centers hate to train their staff to use their computers
5 Must provide the key functionality to the user. Most errors
found by users relate to a misunderstanding between the user
and the developer relating to the intended purpose of the
software. Most software developers know nothing about the
workflow of hospitals. Developers seldom know
which functions of their software are vital to patient care.
Hospital personnel have trouble understanding how anyone
could NOT know these things
6 All functionalities must be scalable (able to accommodate
unanticipated increases in usage demands)
7 Must successfully interoperate with many other systems
(hardware and software) throughout the hospital and over the
internet
8 Must anticipate future needs. Hospitals pay many millions of
dollars for information systems. They would much prefer not
to buy a new system every time their activities change
9 Must survive the demise of the company that produced the
software. The death rate of commercial biomedical software
developing companies is atrociously high. Hospitals have
been left with virtually useless systems when their
software vendors have vanished
10 Must not be vulnerable to malicious attack. There are
actually very few programmers with the expertise to design
software with principles of computer security. The
likelihood that any piece of software has been written with
the help of a computer security expert is quite low
11 Must not be vulnerable to the unintended consequences of
exuberant users. Large systems have been known to crash when
users strain computational resources with recursive
queries.
|
List 13.2.1. Reasons why hospital informatics projects, such as CPOE, may fail.
1 Tasks that were traditionally accomplished through
interpersonal communication may be replaced by solitary entry
sessions with HIS computer terminals. Opportunities to
share helpful explanations and patient status updates may be
lost
2 Computer entry tasks may be tedious, time-consuming, and
repetitive. Harried staff, under these circumstances, may do
an incomplete or sloppy job
3 Computer orders, once entered, may have no mechanism for
correcting entry errors, leading to miscommunications
4 The asynchronous nature of multi-user entries into the HIS
may cause havoc in a system that depends on coordinated
workflow. For instance, a prescription may not be filled by
pharmacy until an order entered by a clerk-typist is
released by a physician. If there is no system to ensure
that each entry occurs in a timely and coordinated manner,
workflow is halted.
|
List 13.3.1. Desperate actions intended to cope with complexity.
1 Lock down your data. Restricting the variety of data
permitted in the database can sometimes reduce complexity
2 Lock down your software components. Use the last software
version that worked and stop trying to enhance any
components
3 Lock down your operating system. Do not even try to offer
cross-platform interoperability
4 Lock down your set of assumptions. Software built on a
static set of assumptions about the user's world is often
manageable. Changing these assumptions is asking for
problems
5 Lock down activities. Reduce the number of people or
services that are supported by the software
6 Hire specialized programmers. Each programmer should
concentrate on a very small component of your software. Hire
more programmers so that every software component is
adequately staffed with developers, analysts, and technical
support staff. There is strength in numbers
7 Stop thinking about fundamental approaches to problems. There
is no going back once the juggernaut is launched
8 Replace system with a newer, more expensive and more complex
system. Ahh. The cycle of life is renewed.
|
List 13.3.2. A few effective measures to cope with software complexity.
1 Write clean software code and use in-line documentation to
explain the purpose of software commands
2 Provide detailed and clear documentation for all software
components
3 Use object-oriented languages and follow standard techniques
for good object design
4 Use UML
5 Use refactoring (see Glossary) methodology to improve
complex code
6 Continually test software and carefully document all
modifications.
|
List 13.3.3. Intrinsically complex components of biomedical informatics. |
List 13.3.4. Intrinsically simple components of biomedical informatics.
1 Classifications. A class inherits properties in a direct
lineage from a parent class. An object can only occupy a
single class. Classifications are easy to understand and
compute
2 Flat data files that can be extended but not re-written. A
telephone book is a close example. If people never changed
their names, never died, and never changed their telephone
numbers, a telephone directory would be an ideal flat-file
3 The EMR (electronic medical record). The EMR is the digital
equivalent of the patient chart. In this model, all new
clinical reports pertaining to a patient are inserted into
the EMR object for the patient. This is a simple data model
that can work well so long as one and only one record is
created for each patient
4 Small, self-contained specialized information systems. These
applications are designed for a specific and narrow
function (e.g. cytopathology information system). Complexity
does not intervene until the specialized information system
needs to interact with other systems in the hospital
5 Fundamental algorithms. Almost all important algorithms are
simple and can be explained in a few steps. From these
simple algorithms, complex systems can arise
6 Simple protocols. Very simple protocols can support
incredibly complex systems. TCP/IP (the internet protocol)
is a simple strategy for transferring packets of
information over a network of computers
7 Elegant object oriented programming languages, such as Ruby.
Though Ruby is a simple and elegant language, it can be used
to create hopelessly complex software. Programmers need
extensive training in design principles that minimize
complexity
8 Specifications. Specifications are formal ways of explaining
what you've done so that computers and humans can understand
and replicate your work
9 Unique data identifiers. Computers are good at creating and
tracking unique identifiers
10 Encryption. It is easy to make something a secret
11 De-identified public datasets. Publicly released
de-identified data has immense scientific value and, with
remarkably few exceptions, has not hurt patients.
|
List 13.5.1. CDC report of annual number of deaths in the U.S. from leading diseases. 1 Heart disease: 696,947 2 Cancer: 557,271 3 Stroke (cerebrovascular diseases): 162,672 4 Chronic lower respiratory diseases: 124,816 5 Accidents (unintentional injuries): 106,742 6 Diabetes: 73,249 7 Influenza/Pneumonia: 65,681 8 Alzheimer's disease: 58,866 9 Nephritis, nephrotic syndrome, and nephrosis: 40,974 10 Septicemia: 33,865 |
List 13.6.1. Properties of a classification.
1 A classification is a grouped taxonomy (listing of all
objects in a knowledge domain) with the following four
properties:
2 1. Inheritance: Hierarchical structure, with each class of
tumors inheriting properties of its ancestors
3 2. Uniqueness: Each tumor occurs in only one place in the
classification
4 3. Comprehensive: All tumors are included
5 4. Class-intransitive: A tumor from one class does not
change into a tumor from another class (e.g. an
adenocarcinoma does not become a lymphoma)
|
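The inheritance and uniqueness properties in List 13.6.1 map naturally onto a hash in which each class names its single parent. The class names below are hypothetical, chosen to echo the adenocarcinoma and lymphoma example in the list; they are not taken from any published classification, and the subroutine name lineage is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical fragment of a classification: class => its single parent.
# A hash key can hold only one value, so each class occupies one place
# in the hierarchy and has at most one parent.
my %parent = (
    epithelial_tumor => 'tumor',
    lymphoid_tumor   => 'tumor',
    adenocarcinoma   => 'epithelial_tumor',
    lymphoma         => 'lymphoid_tumor',
);

# Return the class followed by all of its ancestors, nearest first.
# A class inherits the properties of every class in this lineage.
sub lineage {
    my ($class) = @_;
    my @line    = ($class);
    my %visited = ($class => 1);
    while (defined(my $up = $parent{ $line[-1] })) {
        die "cycle at $up\n" if $visited{$up}++;
        push @line, $up;
    }
    return @line;
}

print join(' > ', lineage('adenocarcinoma')), "\n";
# prints "adenocarcinoma > epithelial_tumor > tumor"
```

Because every class has one parent, computing a lineage is a simple walk to the root, which is part of what makes classifications easy to compute.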
List 13.6.2. Limitations of current neoplasm classifications.
1 Classifications are created piecemeal for specific sites or
organ systems. Nobody has published a comprehensive
classification, although comprehensive taxonomies have been
attempted
2 Classifications are often based on medical disciplines,
rather than on any biologic principles (e.g. classification
of dermatologic tumors)
3 A given tumor will appear redundantly when
subclassifications are merged
4 No tumor classification has been prepared in a standard
format designed to exchange, merge or analyze heterogeneous
biological data. The most widely used authoritative resources
are the World Health Organization classifications, which
list the tumors that occur at different body sites. The
problem with an organ system approach to classification is
that every organ contains organ-specific and organ
non-specific cell types. The brain, for instance, contains
connective tissue and lymphoid tissue, and therefore is
prone to tumors of connective tissue and lymphoid tissue. A
listing of tumors that occur in the brain must include:
osteocartilaginous tumors, lipoma, fibrous histiocytoma,
hemangiopericytoma, rhabdomyosarcoma, melanoma, lymphoma and
myeloma, among others. These same tumors will be included
again and again in every site-specific classification.
Although each term may occur only once in each site-specific
classification, the same lesion may occur a virtually
limitless number of times when the site classifications are
combined into a comprehensive classification of tumors.
Although cancer taxonomies are different from
classifications, they usefully provide all the instances of
tumors that must be grouped within a classification.
Excellent tumor taxonomies are now publicly available at no
cost.
|
List 13.6.3. Schematic for the Developmental Lineage Classification of Cancer. 1 embryonic 2 primitive 3 primitive_differentiating 4 totipotent_or_multipotent_differentiating 5 limited_differentiating 6 germ cell 7 primitive_non_differentiating 8 non_primitive 9 endoderm_or_ectoderm 10 endoderm_or_ectoderm_surface 11 endoderm_or_ectoderm_endocrine 12 endoderm_or_ectoderm_parenchymal 13 odontogenic_epithelium 14 mesoderm 15 mesenchyme 16 connective_tissue 17 muscle 18 fibrous_tissue 19 vascular 20 adipose_tissue 21 bone_cartilage 22 heme_lymphoid 23 non_mesenchymal_mesoderm 24 coelomic 25 coelomic_ductal 26 coelomic_cavities 27 coelomic_gonadal 28 sub_coelomic 29 sub_coelomic_gonadal 30 sub_coelomic_endocrine 31 sub_coelomic_nephric 32 neuroectoderm_neural_plate 33 neural_tube 34 neural_tube_parenchyma 35 neural_tube_lining 36 neural_crest 37 peripheral_nervous_system 38 neural_crest_endocrine 39 neural_crest_melanocytic |
List 13.6.4. General features of the tumor classification relevant to biomedical informatics.
1 Instance uniqueness. Each tumor entity occurs only
once within the classification
2 Comprehensive. The classification ensures that every tumor
of man can be placed somewhere within the classification
3 Simplicity. One of the purposes of a classification is to
drive down the complexity that exists when the domain
taxonomy is large. The entire classification is described by
under 40 classifiers
4 Principled. The classification is based on known principles
of developmental biology, not on political or artifactual
distinctions between tumors. A counterexample would be a
tumor classification based on medical specialty (e.g.
dermatologic neoplasms, hematologic neoplasms, head and neck
tumors, etc.)
5 The classification has "competence." In the field
of informatics, competence is the ability to answer
questions related to the instances of a data group
6 Standard method of organization. The classification is
represented as an XML document
7 Scalability. It is easy to expand the classification with
new subclasses. This is important, as the molecular analysis
of tumors is likely to provide new taxa
8 Modifiable. It is easy to move subdivisions of the
classification. Classifications are hypothetical
re-creations of reality and must be changed as information
is accrued
9 Understandable. The classification is easily understood by
developmental biologists. Developmental biologists are major
participants in post-genomic science and need to have tools
to relate basic research with clinical exigencies
10 Credible. The classification complements modern theories of
the "stem cell" origin of tumors
11 Compatible with other visions of reality. The classification
does not invalidate existing diagnoses found in pathology
reports. The medicolegal importance of this feature cannot
be exaggerated. This relieves pathologists from reviewing
all their prior cases and re-diagnosing them in conformance
with a new classification
12 Open access. The classification is an open access document
that can be used or criticized freely by the biomedical
community.
|
List 13.8.1. Distinctions between ontologies and classifications.
1 1. An ontology does not need to provide a theoretic
embodiment of a data domain. In fact, an ontology need not
be comprehensive (i.e. an ontology does not need to include
all the instances of a knowledge domain) and may extend over
several different knowledge domains
2 2. Ontologic classes are characterized by one or more
logical rules and include object instances that behave in
conformity to the class rules. The classes in
classifications are determined by a set of features that are
shared among the members of the class. The features that
define a class within a classification are usually not
logical rules
3 3. Ontologies permit multi-class inheritance. Any ontologic
class can inherit from any number of father classes. A class
within a classification can have at most one father class
4 4. An ontologic object instance may belong to more than one
class, as long as the object obeys the rules of every
class of which it is a member. An object in a
classification can belong to only one class
5 5. An ontologic class inherits the rules of its superclass.
However, an ontologic class is not required to have a
superclass (i.e., an ancestor class) or descendant classes
(i.e., subclasses).
|
List 13.8.2. Good things about ontologies.
1 Ontologies are computable and fit neatly into the object
oriented programming model
2 Ontologies are semantically sensible and can be described
with standard RDF syntax
3 Ontologies have competence, meaning they can be used to draw
a variety of inferences about class members based on class
rules
4 Ontologies are extensible and can be integrated with other
ontologies
5 When the same ontology is used by different researchers,
concepts in common use will have the same meaning and
properties.
|
List 13.8.3. Bad things about ontologies.
1 Ontologies place no constraints on internal complexity and
can quickly become incomprehensible to humans
2 Ontologic complexity may lead to unanticipated consequences
(including paradoxes of self-referral)
3 Ontologies are relatively new and there are very few
examples where they have been shown to be of any biomedical
value. Classifications have proven their value over
millennia
4 Ontologies work on the assumption that medically relevant
domains have an intrinsic logic that can be described by
rules. This assumption may be false.
|
List 14.2.1. Some medical breakthroughs that occurred without benefit of randomized prospective clinical trials.
1 1796. Edward Jenner successfully vaccinates 8 year old James
Phipps with unproven smallpox vaccine (prepared from cowpox)
2 1885. Louis Pasteur successfully vaccinates Joseph Meister
with unproven rabies vaccine
3 1900. Jesse Lazear demonstrates (on himself) that yellow
fever is transmitted by mosquito bite. Lazear dies from
successful inoculation
4 1944, 1972, 1992. Sudden Infant Death syndrome often
associated with infants sleeping on stomach, on soft
mattresses
5 1985. Marshall infects himself with H. pylori, thus
developing gastritis and demonstrating the bacterial
origin of gastric ulcers.
|
List 14.3.1. Questions for clinical trialists.
1 Are data from clinical trials made available to the public?
2 Are we making the best use of data collected in clinical
trials?
3 How often do clinical trials fail to provide definitive
answers to the question that motivated the trial?
4 When a prospective clinical trial fails to answer the
question that motivated the trial, is the trial data made
available to the public?
5 Are we using the best available methods to guarantee that
clinical trials are designed properly?
6 Might clinical trials be designed in a manner that enhances
the scientific and medical value of the trials beyond a
single hypothesis?
7 Might some prospective clinical trials be replaced by
cheaper, faster retrospective trials?
8 Might some clinical trials be replaced by new, innovative
models producing clinically sound results in less time and
for less money?
|
List 14.5.1. Snippet of Perl code demonstrating how metastatic events can be simulated using the Monte Carlo technique.
$badoutcome = "No mets for the bad tumor"; #begin with no metastases
$start = time();
while (1) #this will loop forever unless something in the block prompts an exit
{
  $bad = 2 * $bad;
  print "$bad\n";
  srand;
  for (1...$bad)
  {
    $badchance = |
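As an independent illustration of the Monte Carlo idea (not a completion of the Perl snippet above), the following Ruby sketch simulates a tumor whose cell count doubles at each step, with every cell given one random chance per doubling to seed a metastasis. The method name `doublings_until_metastasis`, the per-cell probability, the cap on doublings, and the fixed seed are all assumptions chosen for illustration.

```ruby
# Returns the number of doublings before the first simulated metastasis,
# or nil if none occurs within max_doublings.
def doublings_until_metastasis(per_cell_chance, max_doublings, rng)
  cells = 1
  (1..max_doublings).each do |doubling|
    cells *= 2
    # Each cell gets one random draw per doubling.
    cells.times do
      return doubling if rng.rand < per_cell_chance
    end
  end
  nil
end

rng = Random.new(42)   # fixed seed (an assumption) so runs are repeatable
trials = 200
results = Array.new(trials) { doublings_until_metastasis(0.001, 16, rng) }
with_mets = results.compact

puts "Trials with a metastasis: #{with_mets.length} of #{trials}"
unless with_mets.empty?
  mean = with_mets.sum.to_f / with_mets.length
  puts "Mean doublings before first metastasis: #{mean.round(2)}"
end
```

Repeating the trial many times and summarizing the outcomes is what makes this a Monte Carlo estimate. Changing `per_cell_chance` gives a rough way to compare a more aggressive tumor with a less aggressive one.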
List 14.7.1. Role of biomedical informaticians in clinical trials-based translational research.
1 Protecting human subject privacy and confidentiality (always
the first responsibility of biomedical informaticians)
2 Developing new approaches for clinical trials that reduce
the cost and length of trials without sacrificing scientific
value
3 Choosing a primary hypothesis whose scientific importance
will endure to the end of the study
4 Expanding study designs to test multiple hypotheses during
the course of the trial
5 Capturing data that can complement other scientific efforts
6 Designing the studies in ways that ensure that the primary
hypothesis (i.e., the hypothesis that justifies the study)
is adequately tested
7 Organizing the data (using common standards such as CDISC)
8 Ensuring that the analysis of data is conducted in a manner
free from bias
9 Reporting the conclusions of the study
10 Distributing the data that support the conclusions of the
study.
|
List 15.1.1. Plain-English description of a software agent protocol.
1 You provide the software program with the following inputs:
a list of people who you would like to make an appointment
with, and a calendar file that contains free dates and times
as well as dates and times that have already been obligated
2 The software agent uses the standard http (web) protocol to
visit all the URLs of all the names on your list
3 At each web site, the software agent searches for the class
objects associated with the unique person name. If all goes
well, the software agent finds a calendar object that
belongs to the named person
4 The calendar object contains unique date-time entries
associated with prior appointments. The agent matches these
entries against the available dates and times in your calendar.
|
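The final matching step of List 15.1.1 can be sketched without any networking. In the Ruby fragment below, the calendars are plain arrays of date-time strings rather than objects retrieved over http, and the method name `common_free_slots`, the sample names, and the sample dates are hypothetical. The sketch only shows how your free slots might be filtered against the appointments already booked by the people on your list.

```ruby
require "time"

# Return the slots that are free in your calendar and not already
# booked in any of the other people's calendars, earliest first.
def common_free_slots(my_free_slots, booked_by_person)
  booked = booked_by_person.values.flatten.map { |t| Time.parse(t) }
  my_free_slots.map { |t| Time.parse(t) }
               .reject { |slot| booked.include?(slot) }
               .sort
end

# Sample data (assumptions for illustration only).
my_free = ["2024-03-04 09:00", "2024-03-04 14:00", "2024-03-05 10:00"]
others = {
  "Dr. A" => ["2024-03-04 09:00"],
  "Dr. B" => ["2024-03-05 10:00", "2024-03-06 11:00"]
}

common_free_slots(my_free, others).each do |slot|
  puts slot.strftime("%Y-%m-%d %H:%M")
end
```

With the sample data, only the 2024-03-04 14:00 slot survives, because the other two free slots collide with an existing appointment. A real agent would need an agreed-upon calendar format at each site, which is the hard part of the protocol.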
List 15.2.1. Increasingly complex task-sharing network protocols. 1 1. FTP (file transfer protocol) 2 2. TELNET 3 3. HTTP (hypertext transfer protocol) 4 4. RPC (remote procedure calls) 5 5. XML-RPC (xml-based remote procedure calls) 6 6. SOAP (simple object access protocol) 7 7. P2P (peer-to-peer networking) 8 8. WEB Services 9 9. GRID computing |
List 16.1.1. Infrastructural issues that have delayed advancement in the field of biomedical informatics.
1 Lack of standards for acquiring, collecting, annotating, and
exchanging all types of biomedical data
2 Poor quality of clinical data in hospital information
systems
3 Inability of institutions to cope with HIPAA privacy
regulations
4 Reluctance of funded researchers to share data
5 Questionable data analysis methodologies for new
technologies
6 Enormous administrative cost of obtaining and tracking the
patient consent process. High cost and complexity of
privacy/security tasks related to prospective clinical
studies
7 Poor access to large clinically annotated banks of human
tissue samples of a wide variety of diseases, required to
validate candidate diagnostic tests.
|
List 16.3.1. Statement on the key function of biomedical informatics in CTSA funding.
1 "Biomedical Informatics is the cornerstone of
communication within C/D/Is and with all collaborating
organizations. Applicants should consider both internal,
intra-institution and external interoperability to allow
for communication among C/D/Is and the necessary research
partners of clinical and translational investigators (e.g.
government, clinical research networks, pharmaceutical
companies, commercial vendors, laboratories, and equipment
manufacturers). Biomedical Informatics support is expected
to be flexible and innovative. Interoperability, security,
workflow, usability and standards are essential areas of
work. To facilitate the conduct of research in health care
settings and the transfer of research findings into routine
care, clinical and translational research must employ
applicable standards (e.g., identifiers, vocabularies,
transactions, security measures) adopted by the Department
of Health and Human Services for use in U.S. health care and
public health operations. All human subject data must be
handled securely to ensure privacy and confidentiality.
Biomedical informatics research activity should be
innovative in the development of new tools, methods, and
algorithms."
|
List 16.3.2. Some common biomedical informatics goals for institutions.
1 Develop a thoughtful approach to issues of human subjects
protection, data organization and data sharing. These may
not be the focus of your research, but your research will
suffer if you minimize their importance
2 Develop general protocols, approved by your IRB, for dealing
with issues of confidentiality and data sharing. Experienced
grant reviewers appreciate institutions that use a tested
infrastructure to support their research staff
3 Develop collaborations with researchers outside your
department, and outside your institution. Translational
research needs cross-disciplinary expertise and biomedical
informatics requires large datasets collected from multiple
institutions. Funding agencies and grant reviewers
understand this and will look favorably at innovative
proposals that draw information and expertise from diverse
sources
4 Train staff in the fundamentals of biomedical informatics.
Hiring an informatics guru does not compensate for a staff
of luddites (unless the guru can bring enlightenment to the
entire staff).
|
List 16.4.1. Ingredients of a good grant application in the field of biomedical informatics (in order of descending importance).
1 A solid, credible set of specific aims (absolutely crucial)
2 A known track record in the general area (usually determines
who in the research team will be named the PI)
3 Preliminary data
4 A respected institutional infrastructure supporting the
effort
5 Collaborators who have the ability to provide a
translational component
6 Statistical expertise sufficient to convince the study
section that the project is well-designed and that the data
analysis will be unimpeachable
7 A clear understanding of data sharing and human subjects
issues related to the project
8 A sense of where the project will move into after the
initial funding period
9 A sense of where the project complements other current
efforts in the same field
10 Good communication with a bright and competent Program
Director
11 An important idea.
|
List 16.4.2. An ancient observation that success falls occasionally on the undeserving. 1 The race is not to the swift, 2 nor the battle to the strong, 3 nor bread to the wise, 4 nor riches to the intelligent, 5 nor favor to the men of skill; 6 but time and chance happen to them all. |
List 16.4.3. Suggestions for researchers.
1 When an investigator submits a work for publication, where
the data is derived from patient records, the Methods
section should include a description of the steps taken to
minimize patient risks, and should document that the
IRB reviewed the research proposal. When these items are
missing from a paper, editors and reviewers should feel free
to ask authors to supply this information
2 When an investigator submits a grant application
(particularly an application to a U.S. Federal Agency), a
detailed strategy for protecting human subjects from human
subject risks is required. Investigators should be aware
that research using patient records is human subject
research. Investigators should also be aware that current
U.S. Federal Guidelines call for the inclusion of
minorities, women and children in clinical studies, unless
there is a good reason for excluding them from the study
population. For the purpose of satisfying federal inclusion
guidelines, most agencies consider studies based on patient
records to be clinical studies. A statement describing the
inclusion of minorities, women and children will be, in most
circumstances, a requirement for would-be biomedical
informaticians who seek federal funding
3 Human subjects issues are a legitimate area of research for
biomedical informaticians. Novel protocols for achieving
confidentiality and security while performing increasingly
ambitious studies (distributed network queries across
disparate databases, extending the patient's record to
collect rich data from an expanding electronic medical
record, linking patient records to the records of relatives
or probands, peer-to-peer exchange of medical data) will be
urgently needed by the data mining community. Data miners
would do well to stay abreast of regulations controlling the
use of medical data so that they can develop
regulation-compliant protocols for data mining activities
4 Anonymized data, by definition, cannot be linked to
patients. Therefore, there is no legal or ethical reason to
withhold anonymized datasets from the public. Quite the
opposite. Anonymized datasets have enormous value to other
researchers who can merge your data with theirs, derive new
ways of analyzing your data, or develop new questions that
can be addressed by your dataset. Researchers who have
created anonymized datasets should seriously consider
publishing their data as a primary resource or as a
secondary resource attached to any publication that results
from the research project. Many journals and on-line
publication services (such as PubMed Central and BioMed
Central) encourage authors to submit their datasets as
publication attachments. Issues of intellectual property
impacting on the investigator and the institution (e.g.
ownership, licensing of data, derivative work
"reach-through") have accumulated very little
legal precedent
5 Funding agencies often have grandiose hopes for their
research initiatives. They are willing to pay large sums of
taxpayer money to support ambitious research projects. With
few exceptions, large research projects create complexity.
Historically, projects whose chief goal is to simplify data
resources or reduce software complexity are seldom funded. I
would encourage investigators to persevere. When complexity
has become an impediment to biomedical progress,
investigators must provide a convincing description of the
problem, along with a clear explanation of the benefits
derived when complexity is reduced. A savvy study section
may be receptive to an application that contains a credible
set of goals for reducing complexity.
|
List 17.1.1. Sources of ethical challenges for biomedical informaticians. 1 HIPAA regulations 2 IRB approvals 3 Flawed de-identification methods 4 Problematic network security protocols 5 Data sharing requirements from funding agencies 6 Contractual IP arrangements with employer 7 Conflicts of interest 8 Hidden patent violations 9 Unanticipated lawsuits |
List 17.2.1. Conditions under which it may be OK to lie. All conditions must hold.
1 The lie protects a human subject for which you have a
fiduciary responsibility
2 You are certain that the lie will not harm another human
subject
3 You do not personally benefit from the lie
4 The lie is not an included component in a plan to mislead
people
5 There is no honest way of protecting the patient that does
not involve lying
6 The liar is willing to accept any negative consequences that
may ensue.
|
List 17.3.1. HIPAA: 164.512 Uses and disclosures for which consent, an authorization, or opportunity to agree or object is not required.
1 (I) STANDARD: USES AND DISCLOSURES FOR RESEARCH PURPOSES
2 (1) Permitted uses and disclosures. A covered entity may use
or disclose protected health information for research,
regardless of the source of funding of the research,
provided that:
3 (iii) Research on decedent's information. The covered entity
obtains from the researcher:
4 (A) Representation that the use or disclosure sought is
solely for research on the protected health information of
decedents;
5 (B) Documentation, at the request of the covered entity, of
the death of such individuals; and
6 (C) Representation that the protected health information for
which use or disclosure is sought is necessary for the
research purposes.
|
List 17.3.2. Suggested research purposes of an autopsy database integrated with the electronic medical records of the deceased patients.
1 Validating the sensitivity and specificity of new diagnostic
tests, including imaging techniques and new medical devices
2 Determining the extent of disease, at death, of persons
enrolled in clinical trials, as a measurement of response to
different treatment protocols
3 Documenting adverse effects of medications administered
during the patient's life
4 Correlating autopsy findings with pharmacogenomic databases,
attributing diseases found at autopsy with specific gene
variations.
|
List 17.4.1. Questions that researchers should ask before purchasing a software license.
1 Will colleagues who want to review my work be required to
buy their own version of the software?
2 Can the software be modified by the licensee?
3 Will the software export files or data objects in a standard
format that can be exchanged with and used by colleagues who
use other software applications?
4 Will the software export files or data objects that can be
publicly distributed (as in supplementary files distributed
with a manuscript), or publicly displayed (as on a web
site)?
5 Does the software license contain reach-through clauses? A
reach-through is a legal device through which licensees must
pay royalties on intellectual property produced with the
help of the software.
|
List 17.9.1. Reasons for refusing to share research data.
1 The data is confidential and it would be unethical to
release the data to the public
2 Members of the public may misunderstand the data
3 Competitors may purposely misinterpret the data to dispute
the validity of the work
4 Preparing the data in a format that can be distributed to
the public (e.g., the internet) requires time, effort,
expertise and money that could be better spent on research
activities
5 The data owner wishes to control the direction of research
in her specific area, and her control would be lost if
everyone in the field had access to the data
6 The data owner is offended that anyone would question her
integrity by asking to review the primary data
7 The data is integral to another manuscript that has not yet
been published. Distributing the data would violate the
publisher's ban on pre-publication release of research
data.
|
List 17.9.2. When is data sharing important?
1 When the data contributes one piece of a planned multi-part
research effort towards which many different laboratories
contribute
2 When the data can contribute towards other research efforts
3 When the validity of the assertions drawn from the data is
doubted
4 When the validity of the data itself is doubted
5 When there is reason to believe that the data can be
re-examined to yield additional results.
|
List 17.10.1. Excerpted from 164.514 Other requirements relating to uses and disclosures of protected health information.
1 (1) A person with appropriate knowledge of and experience
with generally accepted statistical and scientific
principles and methods for rendering information not
individually identifiable:
2 (i) Applying such principles and methods, determines that
the risk is very small that the information could be used,
alone or in combination with other reasonably available
information, by an anticipated recipient to identify an
individual who is a subject of the information; and
3 (ii) Documents the methods and results of the analysis that
justify such determination;
|
List 17.10.2. Excerpted from Section 164.512 Uses and disclosures for which consent, an authorization, or opportunity to agree or object is not required.
1 (ii) Waiver criteria. A statement that the IRB or privacy
board has determined that the alteration or waiver, in whole
or in part, of authorization satisfies the following
criteria:
2 (A) The use or disclosure of protected health information
involves no more than minimal risk to the individuals;
3 (B) The alteration or waiver will not adversely affect the
privacy rights and the welfare of the individuals;
4 (C) The research could not practicably be conducted without
the alteration or waiver;
5 (D) The research could not practicably be conducted without
access to and use of the protected health information;
6 (E) The privacy risks to individuals whose protected health
information is to be used or disclosed are reasonable in
relation to the anticipated benefits if any to the
individuals, and the importance of the knowledge that may
reasonably be expected to result from the research;
7 (F) There is an adequate plan to protect the identifiers
from improper use and disclosure;
8 (G) There is an adequate plan to destroy the identifiers at
the earliest opportunity consistent with conduct of the
research, unless there is a health or research justification
for retaining the identifiers, or such retention is
otherwise required by law; and
9 (H) There are adequate written assurances that the protected
health information will not be reused or disclosed to any
other person or entity, except as required by law, for
authorized oversight of the research project, or for other
research for which the use or disclosure of protected health
information would be permitted by this subpart.
|
List 17.13.1. Ethical questions for biomedical informaticians building tissue biorepositories.
1 Does the use of the tissues for research purposes deprive
patients of material that may have importance to their
current or future medical care?
2 Were the patients informed that their tissues may be used
for research purposes that could result in commercial
products, and that they (the patients) will not share in any
resultant profits?
3 Are patients fully protected from any harms that may result
from the research on their tissues?
4 Is the data collected from patient charts de-identified and
is it compliant with the "Minimum necessary" policy
(see Glossary) described in HIPAA (see Glossary)?
5 Are the anticipated revenues from research so large as to be
onerous or otherwise conflicting with the best interests of
the patients whose tissues are used in the research?
|
List 17.14.1. Some physician rejoinders to the opening phrase, "I would describe the HIPAA Privacy Rule as:"
1 A disaster for future of medical practice and a windfall for
trial attorneys
2 A disingenuous scam and a ridiculous waste of time and money
3 A further burden on physicians who are already having
trouble surviving
4 A joke, not really privacy. Maybe impossible to actually
comply with
5 A major hassle; no protection for patients' privacy
6 A Trojan horse
7 A way not to pay doctors
8 A worthless but potentially damaging body of legislation
9 Another example of bureaucrats justifying their existence
10 Another governmental intervention to make medicine
difficult.
|
List 17.14.2. Expectations of HIPAA privacy act prosecutions.
1 1. Prosecutions would not be frequent (the August 19, 2004
conviction was the first criminal conviction under the HIPAA
regulations that went into effect 16 months earlier)
2 2. Criminal prosecutions would apply to cases where patients
suffered actual harm as the result of a willful HIPAA
violation (as in the first reported criminal prosecution)
3 3. Hospitals would not be subject to frivolous prosecutions
for minor HIPAA violations that did not result in harm to
patients.
|
List 17.16.1. Examples of "reasonableness" standard applied within HIPAA.
1 Statutory Background... The security standard authority
applies to both the transmission and the maintenance of
health information, and requires the entities described in
section |
List 17.17.1. Closing platitudes.
1 Insisting that something is true does not make it true.
(Don't let people intimidate you into believing anything)
2 All biomedical informatics research is human subjects
research and all human subjects must be protected from harm.
(This feature distinguishes biomedical informatics from
bioinformatics and reminds us that biomedical data comes
from patients who place their trust in us.)
3 Get funded so that you can do research, don't do research so
that you can get funded. (Getting funded is not an
achievement. Funding is a societal contract in which the
investigator promises that she will achieve something.)
4 The good fight is the fight that you lose and lose and lose
until you win. (Don't worry about losing. Just worry about
working toward an important goal.)
5 Life is short but art is long. (This was written by
Hippocrates around 400 BC and refers specifically to the art
of medicine. Advances in biomedical informatics will endure
beyond our short lives.)
|
List 19.5.1. Ruby script to read lines from a file.
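#!/usr/local/bin/ruby
# READsome.rb, reads the first 300 lines from a big file
f = File.open("big")
outf = File.open("bigout.out", "w")
count = 0
while count < 300
  line = f.gets        # read each line once, then reuse it
  break if line.nil?   # stop early if the file has fewer than 300 lines
  STDOUT.puts line
  outf.puts line
  count = count + 1
end
f.close
outf.close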
|
List 19.5.2. Ruby script to scrub an input line.
#!/usr/local/bin/ruby
# Loads a list of approved two-word phrases (doublets) from doubdb.txt,
# then prints the input line with every word that is not part of an
# approved doublet replaced by an asterisk.
f = File.open("doubdb.txt")
outf = File.open("scrub.out", "w")   # output file (not written to in this script)
doubhash = Hash.new
while line = f.gets
  line = line.chomp                  # strip the trailing newline from each doublet
  doubhash[line] = " "
end
f.close
puts "What would you like to scrub?"
line = gets.chomp.downcase
line = line.gsub(/\'s/, '')          # drop possessives
line = line.gsub(/[^\w\s]/, ' ')     # replace punctuation with spaces
line = line.gsub(/ +/, ' ')          # collapse runs of spaces
linearray = line.split
arraysize = linearray.length - 2     # index of the last word that starts a doublet
lastword = "*"
for arrayword in (0 .. arraysize)
  phrase = linearray[arrayword] + " " + linearray[arrayword+1]
  if doubhash.key?(phrase)
    print " " + linearray[arrayword]
    lastword = " " + linearray[arrayword+1]
  else
    print lastword
    lastword = " *"
  end
  if arrayword == arraysize
    print lastword
  end
end
outf.close
exit
|