List 0.0.0. What every determined reader will learn from this book.
1 How to acquire and organize biomedical data even when the
data is received in the form of unstructured text
2 How to merge and share biomedical data even when the data
is confidential or comes from seemingly incompatible
sources
3 How to write your own programs in Perl that will allow you
to perform common informatics tasks with just a few lines
of code
4 How to automatically index biomedical text and code text
using freely available biological and medical nomenclatures
5 How to use metadata to provide structure and meaning to
biomedical datasets
6 How to use confidential medical data while obeying current
law and protecting patients
7 How to reduce the complexity of biomedical data and
biomedical software
8 How to evaluate ethical problems related to intellectual
property, privacy, human subjects research (see Glossary),
data sharing (see Glossary), and software development.
|
List 0.0.1. People who will benefit from reading this book.
1 Bioinformaticians
2 Biomedical scientists
3 Clinical trialists
4 Computer scientists who need cross-over skills in the
biomedical sciences
5 Government officials at any of the health-related federal
agencies
6 Healthcare graduate students and professionals who use large
biomedical datasets, who need to have data/software
interoperability, or who need to comply with federal, state,
or institutional data requirements
7 Hospital staff, including medical students, physicians,
nurses, technicians, hospital administrators, and information
officers
8 Lawyers who handle intellectual property (see Glossary)
cases related to biomedicine
9 Library scientists
10 Medical ethicists
11 Medical software developers and vendors
12 Medical transcriptionists
13 Members of IRBs (Institutional Review Boards, see Glossary)
and Privacy Boards (see Glossary)
14 Privacy experts who work with medical scientists
|
List 1.1.1. Roles of the biomedical informatician.
1 Biologist
2 Healthcare professional
3 Lawyer
4 Software programmer
5 Computer scientist
6 Cryptographer
7 Metadata expert
8 Linguist
9 Statistician
10 Diplomat
|
List 1.2.1. Pre-1955 biomedical advances resulting in increased longevity in many developed countries.
1 Antisepsis
2 Refrigeration of food
3 Standards for the hygienic preparation of food
4 Eradication of insect vectors for yellow fever and malaria
5 Potable public drinking water
6 Antibiotics effective against many bacterial infections
including syphilis, gonorrhea and tuberculosis
7 Vaccines against smallpox and polio
8 The virtual elimination of iodine-deficiency associated
goiter
9 The near elimination of vitamin deficiency diseases
10 The marked reduction of cervical cancer in women thanks to
cytologic screening of cervical smears
11 The prevailing blood tests and quantitative blood cell
analyses used to monitor deviations from normal function
12 The correction of diabetic hyperglycemia with insulin
13 The introduction of radiologic imaging
14 The treatment of hypertension with a large variety of
effective drugs
15 The recognition of the association between cigarette use and
cancer
16 The role of diet and of cigarettes in the progression of
vascular diseases.
|
List 1.2.2. Medical setbacks since 1955.
1 The global spread of AIDS
2 Diminished access to potable water for much of the world's
population
3 The emergence of multiple antibiotic resistant strains of S.
aureus and other previously treatable organisms
4 Increased number of cancer patients due primarily to an
absolute increase in the number of senior citizens at
highest risk for cancer
5 The re-emergence of tuberculosis
6 The re-emergence of insect and other vectors carrying viral
and parasitic diseases
7 The astronomical costs of new, effective medications for
chronic diseases, including cancer
8 High quality, long-term health care attainable for only a
small fraction of the earth's population
9 The rising incidence of obesity and sequelae disorders,
worldwide
10 The rapid geographic spread of outbreaks of new strains of
influenza and other evolving viruses, including HIV and
hemorrhagic fever viruses
11 The threat of destructive and pathogenic species of plants,
insects and animals that have been introduced to new
habitats through acts of human negligence or error
12 Weakening of the earth's ozone layer, increasing human
exposure to ultraviolet radiation
13 The political uses of toxic agents, endemic diseases, and
public health infrastructures.
|
List 1.3.1. Immediate consequences of Semmelweis' prevention of puerperal fever deaths.
1 The medical students were opposed to being forced to wash
their hands
2 Semmelweis' superior, Johann Klein, was likewise opposed,
considering the clinical trial a criticism of his
performance
3 Other obstetricians agreed that Semmelweis' measures were an
attack on their professional conduct
4 The maternity patients were opposed as well, interpreting
sanitary measures as a criticism of their personal
hygiene
|
List 1.3.2. Beliefs held by biomedical informaticians.
1 Medical progress requires the integration of biological data
and clinical data
2 Aggregate clinical data has value beyond its use in guiding
the treatment of individual patients
3 Researchers need methods to acquire clinical data without
harming patients
4 To be useful, biological and clinical data need to be
organized in a standard manner that permits seamless data
integration (see Glossary)
5 Classifications (see Glossary) drive down the complexity of
clinical and biological data
6 Important new testable hypotheses may derive from
pre-existing biological and clinical datasets, but only if
the datasets are made available to scientists
7 The primary data that supports scientific assertions should
be made publicly available, whenever feasible
8 Data analysis is an inexpensive science, particularly if you
know how to program.
|
List 1.4.1. Some important bottlenecks in translational research.
1 Access to clinically annotated tissues collected from human
subjects
2 Access to electronic medical records and other electronic
archives of human clinical data
3 Methods to organize data in a manner that permits the data
to be meaningful and comparable from laboratory to
laboratory and institution to institution
4 Methods to draw clinically valid conclusions from large
datasets containing heterogeneous types of data (e.g.
molecular data and clinical test data).
|
List 1.5.1. Basic skills and activities in biomedical informatics.
1 An understanding of a computer's file and subdirectory
system
2 The ability to download, install and use popular software
applications and utilities
3 An awareness of the differences between structured and
unstructured data
4 Basic understanding of XML (see Glossary) and metadata
annotation
5 Basic appreciation of computer algorithms
6 Some familiarity with data privacy rules and how these rules
relate to the research uses of medical data. Most countries
have such privacy regulations for biomedical data. In the
U.S., this would be the HIPAA privacy rules, and in the
United Kingdom, it would be the Data Protection Act
7 A general understanding of concepts of medical record
de-identification
8 Familiarity with the publicly available biological search
engines, databases and tools, including PubMed and
GenBank.
|
List 1.5.2. Advanced skills and activities in biomedical informatics.
1 Programming at a moderate level in at least one programming
language
2 Experience choosing and implementing a laboratory or
hospital information system
3 Knowledge of regulations pertaining to the use of identified
medical data in research
4 Participation in an effort seeking FDA approval for a device
or technology developed from a biomedical informatics effort
5 Participation on a standards committee
6 Intermediate level understanding of XML
7 Basic understanding of RDF (see Glossary)
8 Experience as a member of an IRB or Privacy Board
9 Competing for funding for a biomedical informatics grant
(see Glossary) or contract.
|
List 1.7.1. Steps in gold mining (or data mining).
1 Physical access to mine
2 Legal rights of access to mine
3 Acquire tools to find desired items in mine
4 Acquire tools to extract desired items in mine
5 Acquire tools to refine desired items
6 Acquire tools to certify the purity and quantity of desired
items
7 Transform the desired items into a standard format
8 Transport the desired items to an intended recipient
9 Arrange payment for the desired items
10 Store the desired items
11 Protect the stored items.
|
List 1.8.1. Realistic uses of Biomedical Informatics.
1 Store, share, search, retrieve and analyze heterogeneous
data sources. This entire process is vastly enhanced by our
current ability to send any type of data anywhere at
anytime, cheaply
2 Create large comprehensive databases (millions of cases)
that allow you to ask questions that could not be asked of
small or non-comprehensive databases
3 Drive down the complexity of biomedical data by using data
specifications (see Glossary) and classifications
4 Track data collected in hospital information systems and
dispatch automatic clinical alerts when data values fall
outside an expected range of behavior or when values violate
the expected properties of data classes
5 Develop new hypotheses by examining and correlating
biological and clinical observations
6 Validate new clinical tests and treatments by examining the
correlations between test values, treatment choices and
clinical outcomes.
|
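Item 4 of List 1.8.1 (automatic alerts when values fall outside an expected range) can be illustrated with a minimal sketch. The test names and numeric ranges below are hypothetical placeholders chosen for illustration; they are not clinical reference values, and a real hospital information system would draw its ranges from a validated source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedRange:
    low: float
    high: float


# Hypothetical ranges for illustration only; not clinical reference values.
EXPECTED_RANGES = {
    "test_a": ExpectedRange(low=10.0, high=20.0),
    "test_b": ExpectedRange(low=0.5, high=1.5),
}


def find_alerts(observations):
    """Return (test_name, value) pairs whose value lies outside the expected range.

    Observations for tests that have no configured range are skipped.
    """
    alerts = []
    for test_name, value in observations:
        expected = EXPECTED_RANGES.get(test_name)
        if expected is None:
            continue
        if value < expected.low or value > expected.high:
            alerts.append((test_name, value))
    return alerts
```

Calling `find_alerts` on a batch of (test, value) pairs returns only the out-of-range observations, which could then be dispatched to whatever alerting channel the institution uses.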
List 1.8.2. Unrealistic uses of Biomedical Informatics.
1 Replace physicians with computers. Doctors are trained to
make diagnoses, and they don't desire or use software that
purports to do this for them
2 Create superdoctors through the use of computer tools. The
practice of medicine is learned through personal
experiences. Doctors do not need simulations of reality
3 Vastly improve upon books and traditional teaching
strategies. Books are an adequate method of conveying
knowledge. Computers can certainly provide some improvements
to book learning, but there is no reason to think that a
system of learning based on printed literature, that works
perfectly fine, can be vastly improved
4 Solve subtle or complex problems via the use of medical
ontologies (see Glossary). Complex systems are inherently
chaotic, and inferences reached through a logical ontology
modeling a complex system are likely to be misleading
5 Create, within the next decade, comprehensive medical
records for all U.S. citizens that can be accessed and
annotated by all authorized care-givers. This holy grail of
U.S. medical informatics is a worthy long-term pursuit, but
there is no reason to expect that it can be achieved within
a decade or even two decades.
|
List 2.2.1. When are databases particularly useful?
1 When the stored data is complex (e.g., in hospitals and
academic centers)
2 When the basic data structure is constant (i.e., when the
model of the data records does not change)
3 When there are continuous real-time additions, deletions and
modifications of records by multiple users
4 When the computer staff (responsible for the data) prefers
to work with databases.
|
List 2.2.2. When are data files particularly useful?
1 When the dataset is relatively stable
2 When the data structure is relatively unstable (i.e., when
the fundamental model of the data records changes)
3 When XML is the native method of data representation
4 When the computer staff (responsible for the data) prefers
to work with data files.
|
List 2.3.1. Three properties of reality relevant to hospital databases.
1 Database records can be designed in such a manner as to
corrupt the integrity of the database
2 Databases do not care if their integrity is corrupted
3 Modifications to the basic structures of database records
almost always have negative (sometimes catastrophic)
consequences.
|
List 2.3.2. Common weaknesses of some hospital databases.
1 Inability to guarantee that every patient is uniquely
identified within the database
2 Inability to classify types of data into groups with shared
properties
3 Inability to extend data records to include data elements
linked to other databases
4 Inability to organize data as simple collections of
meaningful statements
5 Inability to produce self-describing data records (i.e.,
including in data records all the data necessary to fully
describe the meaning of the data record)
|
List 2.3.3. Desiderata for hospital information systems.
1 Every patient must be uniquely identified within the system
2 Every report must be uniquely identified and associated with
one patient
3 Data items contained in reports must be entered into reports
once only
4 Data items must be well-defined and used in a consistent
manner throughout the system
5 Data values must be bound to a unique identifier (see
Glossary) and associated with a unique report
6 All data entered should be technically retrievable
7 Someone must have the authority to retrieve any and all data
in the hospital information system
8 Data, once entered, should not be corrected or modified in
any way without creating a visible transaction record of the
modification
9 All electronic data related to an electronic record in the
hospital information system should be included in the
hospital information system.
|
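Items 5 and 8 of List 2.3.3 (values bound to a unique identifier, and no modification without a visible transaction record) can be sketched as follows. The class name, field names, and example identifiers are assumptions made for illustration; they do not describe any particular hospital information system.

```python
from datetime import datetime, timezone


class AuditedRecord:
    """A data value bound to a unique identifier, with a visible modification history."""

    def __init__(self, identifier, value):
        self.identifier = identifier
        self._value = value
        self.transactions = []  # append-only log of modifications

    @property
    def value(self):
        return self._value

    def modify(self, new_value, modified_by):
        # Log the old and new values before the stored value is changed.
        self.transactions.append({
            "identifier": self.identifier,
            "old_value": self._value,
            "new_value": new_value,
            "modified_by": modified_by,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        self._value = new_value
```

Because the value can only be changed through `modify`, every change leaves an entry in `transactions` showing what was replaced, by whom, and when.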
List 2.11.1. General classes of patents.
1 Utilities - new and useful methods, machines, items, or
chemical compounds
2 Designs - a new appearance for a manufactured article
3 Plants - the invention or discovery of a plant variety that
can be asexually reproduced
|
List 2.12.1. Copyright Act of 1976, Title 17, U.S. Code, section 107. Limitations on exclusive rights: Fair use.
1 Notwithstanding the provisions of sections 106 and 106A, the
fair use of a copyrighted work, including such use by
reproduction in copies or phonorecords or by any other means
specified by that section, for purposes such as criticism,
comment, news reporting, teaching (including multiple copies
for classroom use), scholarship, or research, is not an
infringement of copyright. In determining whether the use
made of a work in any particular case is a fair use the
factors to be considered shall include -
2 (1) the purpose and character of the use, including whether
such use is of a commercial nature or is for nonprofit
educational purposes;
3 (2) the nature of the copyrighted work;
4 (3) the amount and substantiality of the portion used in
relation to the copyrighted work as a whole; and
5 (4) the effect of the use upon the potential market for or
value of the copyrighted work
6 The fact that a work is unpublished shall not itself bar a
finding of fair use if such finding is made upon
consideration of all the above factors.
|
List 2.15.1. Tissues that are routinely destroyed by pathology departments.
1 Institutions regularly dispose of tissues removed during
surgical procedures. When a large specimen, such as a colon,
is received in a pathology department, samples are routinely
embedded in paraffin and saved for at least 5 years. The
unsampled colon (the bulk of the specimen) is saved for
several weeks, sufficient time to ensure that the
pathologist has rendered a final diagnosis on the specimen,
and then the specimen is discarded
2 Institutions regularly dispose of archived paraffin-embedded
tissues. Most institutions archive paraffin-embedded tissues
for at least 5 years. At that time, some medical centers
conclude that the tissues are no longer of any importance to
the patient. To avoid the expense of continued storage, some
institutions simply dispose of archived material after 5
years.
|
List 2.15.2. Questions that institutions should ask before transferring tissues and medical records to an external tissue repository.
1 Would the transfer to a third party constitute a sale of
human tissue?
2 Would the transfer to a third party harm any of the patients
from whom the tissue was excised?
3 Would the transfer to a third party benefit society?
4 Do any of the institutional staff encouraging the transfer
of tissues and data have relevant conflicts of interest?
|
List 2.16.1. Recent developments that have enhanced access to experimental datasets.
1 Online journals that invite authors to submit data files
2 Editor policies that require the submission of data files
supporting assertions made in manuscripts
3 Technical ease of storing large datasets on publicly
available servers
4 Technical ease of downloading large datasets from servers
via the internet
5 Data sharing requirements issued by biomedical funding
agencies
6 Expansion of the Freedom of Information Act
7 Greater involvement of informaticians in biomedical research
8 Scientific advancements using publicly available datasets
9 Stunning power and scope of publicly available search
engines, including Google (internet documents) and PubMed
(medical abstracts)
|
List 2.18.1. Some definitions of terms related to the open source movement.
1 Free software: The concept of free software, as popularized
by the Free Software Foundation, refers to software that can
be used freely, without restriction, and does not
necessarily relate to the actual cost of the software. The
generally acknowledged father of the free software movement
is Richard Stallman, an MIT visionary who has led an
energetic and unwavering campaign to create and freely
distribute some of the most valued software applications in
use today. The free software movement is similar to the open
source software movement, but some of the features of free
software (ability to modify and re-distribute software in a
prescribed manner as discussed in the software license) are
not always guaranteed in open source software (see List)
2 Open source - The Open Source Software movement is an
offspring of the Free Software movement. The reason that the
open source movement was created was, in part, to placate
developers who wanted to sell software and felt that the term
"free" as in "free software movement",
would be misconstrued by prospective customers to mean that
the developer requires no remuneration. Although a good deal
of free software is no-cost software, the intended meaning
of the term "free" is that the software can be
used without restrictions. The term "open source"
obviates the need to draw this distinction. The Open Source
Initiative posts an open source definition and a list of
approved open source licenses
3 Open Access - In general, open access applies to text and
data the same way that open source applies to software. In
general, open access biomedical data is retrievable (i.e.,
you can find it by using a PubMed search or through a search
engine), and once you've found it, you can download it and
read it. There are several closely-related consensus
statements on the meaning of open access
4 Open source software license - The Open Source Initiative
has an approval process for open source licenses. Software
distributed under an approved license can include a
declaration that the software is "OSI Certified Open
Source Software." The GNU copyleft licenses have been
certified as open source software licenses.
|
List 2.19.1. Examples of undifferentiated software.
1 Basic algorithms
2 Fundamental laws of physics, chemistry, mathematics and
biology
3 Free, cross-platform programming languages
4 TCP/IP internet protocol
5 HTML and XML.
|
List 2.19.2. Examples of undifferentiated data.
1 Human genome
2 Standards documents
3 Nomenclatures
4 Biological classification systems.
|
List 2.19.3. Examples of differentiated software.
1 Programming languages with special features such as
easy-to-use interfaces, an integrated environment, or a
specialized purpose
2 Neural network programs designed for specific types of data
input
3 Complex software designed to support commercial devices,
such as CT-scanners
4 Most hospital information systems and laboratory information
systems.
|
List 2.19.4. Examples of differentiated data.
1 Lexis/Nexis and other legal databases
2 Subscription journals
3 Codes for billable procedures
4 Science Citation Index
5 Chemical Abstracts (R) database.
|
List 2.20.1. A few of the human databases that have been described in the Nucleic Acids Research Database Issue.
1 Androgen Receptor Gene Mutations Database
2 Atlas of Genetics and Cytogenetics in Oncology and
Haematology
3 BGED - Brain Gene Expression Database
4 Cancer Chromosomes
5 Cancer gene databases
6 CGED - Cancer Gene Expression Database
7 Collagen Mutation Database
8 COSMIC - Catalogue Of Somatic Mutations In Cancer
9 Cypriot national mutation database
10 Cytokine Gene Polymorphism Database
11 Cytokine Gene Polymorphism in Human Disease
12 Database of Genomic Variants
13 Database of Germline p53 Mutations
14 EICO DB - Expression-based Imprint Candidate Organiser
15 EpoDB - Erythropoiesis Database
16 ERGDB - Estrogen Responsive Genes Database
17 Gene-, system- or disease-specific databases
18 General polymorphism databases
19 GOLD.db - Genomics Of Lipid-associated Disorders
20 GRAP Mutant Databases
21 HAGR - Human Ageing Genomic Resources
22 HCAD - Human Chromosome Aberration Database
23 HemoPDB - Hematopoietic Promoter Database
24 HERVd - Human Endogenous Retrovirus database
25 HGMD(R) - Human Gene Mutation Database
26 HORDE - Human Olfactory Receptor Data Exploratorium
27 HPMR - Human Plasma Membrane Receptome
28 Human p53, human hprt, rodent lacI and rodent lacZ databases
29 Human PAX2 Allelic Variant Database
30 Human PAX6 Allelic Variant Database
31 IARC TP53 Database
32 Imprinted Gene Catalogue
33 IPD - Immuno Polymorphism Database
34 Lowe Syndrome Mutation Database
35 MTB - Mouse Tumor Biology Database
36 NCL Mutation Database
37 OMIM - Online Mendelian Inheritance in Man
38 Oral Cancer Gene Database
39 PTCH1 Mutation Database
40 RB1 Gene Mutation Database
41 RTCGD - Retroviral Tagged Cancer Gene Database
42 SNP500Cancer
43 SV40 Large T-Antigen Mutant Database
44 T1DBase - Type 1 Diabetes Database
45 The Autism Chromosome Rearrangement Database
46 The Lafora Database
47 The SNP Consortium database
48 TPMD - Taiwan polymorphic microsatellite marker database
49 Tumor Gene Family Databases (TGDBs)
|
List 2.23.1. A record in Taxonomy.
1 ID : 50
2 PARENT ID : 49
3 RANK: genus
4 GC ID : 11
5 SCIENTIFIC NAME : Chondromyces
6 SYNONYM : Polycephalum
7 SYNONYM : Myxobotrys
8 SYNONYM : Chondromyces Berkeley and Curtis 1874
9 SYNONYM : "Polycephalum" Kalchbrenner and Cooke
1880
10 SYNONYM : "Myxobotrys" Zukal 1896
11 MISSPELLING : Chrondromyces
|
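The "FIELD : value" layout of the Taxonomy record in List 2.23.1 lends itself to a simple parser. The sketch below is one possible approach, not the book's own method: it splits each line at the first colon and collects repeated fields such as SYNONYM into lists. The embedded sample text is an abbreviated copy of the record above, and the function name is an assumption.

```python
# Abbreviated copy of the record in List 2.23.1, used as sample input.
RECORD_TEXT = """\
ID : 50
PARENT ID : 49
RANK: genus
GC ID : 11
SCIENTIFIC NAME : Chondromyces
SYNONYM : Polycephalum
SYNONYM : Myxobotrys
MISSPELLING : Chrondromyces
"""


def parse_record(text):
    """Parse 'FIELD : value' lines into a dict mapping each field to a list of values."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        field, _, value = line.partition(":")  # split at the first colon only
        record.setdefault(field.strip(), []).append(value.strip())
    return record


record = parse_record(RECORD_TEXT)
```

Storing every field as a list keeps single-valued fields (ID, RANK) and multi-valued fields (SYNONYM) in one uniform structure, at the cost of indexing `[0]` for the single-valued ones.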
List 3.2.1. The types of human subject research risks.
1 The risk to life and health as a direct result of a medical
intervention
2 The risk of loss of database functionality
3 The risk of loss of confidentiality resulting from
participation in a medical study
4 The risk of loss of privacy resulting from participation in
a medical study.
|
List 3.8.1. Confidentiality issues for biomedical informaticians.
1 Demonstrating to the hospital's IRB (see Glossary) that the
chosen methodology for anonymizing or de-identifying records
is safe and reliable
2 Demonstrating to the hospital's IRB and to the hospital's
information officers that the anonymization and
de-identification processes can be performed automatically,
without giving the informatician any access to the primary
patient record and without opening any HIS vulnerabilities
when data is transferred out of the system.
|
List 3.9.1. Exemption 4 (E4) permitting unconsented research on de-identified medical records. |
List 3.9.2. Section 164.502(f) of the HIPAA Privacy Rule -- Deceased Individuals.
1 We proposed to extend privacy protections to the protected
health information of a deceased individual for two years
following the date of death. During the two-year time frame,
we proposed in the definition of ``individual'' that the
right to control the deceased individual's protected health
information would be held by an executor or administrator,
or other person (e.g., next of kin) authorized under
applicable law to act on behalf of the decedent's estate.
The only proposed exception to this standard allowed for
uses and disclosures of a decedent's protected health
information for research purposes without the authorization
of a legal representative and without the Institutional
Review Board (IRB) or privacy board approval required (in
proposed Sec. 164.510(j)) for most other uses and
disclosures for research
2 In the final rule (Sec. 164.502(f)), we modify the standard
to extend protection of protected health information about
deceased individuals for as long as the covered entity
maintains the information. We retain the exception for uses
and disclosures for research purposes, now part of Sec.
164.512(i), but also require that the covered entity take
certain verification measures prior to release of the
decedent's protected health information for such purposes
(see Secs. 164.514(h) and 164.512(i)(1)(iii))
3 We remove from the definition of ``individual'' the
provision related to deceased persons...
|
List 3.10.1. Five requirements for de-identifying medical records.
1 De-identification of data fields that specifically
characterize the patient (name, social security number,
hospital number, address, age, etc.)
2 Free-text data scrubbing, removing identifiers from the
textual portion of medical reports
3 Free-text data privatizing, removing any information of a
private nature that may be contained within the report
4 Rendering the dataset ambiguous, ensuring that patients
cannot be identified by data records containing a unique set
of characterizing information
5 Rendering the data non-complementary, ensuring that the data
cannot be combined with data from other databases or
from multiple searches of the same database that can lead to
the identification of records.
|
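Requirement 4 of List 3.10.1 (rendering the dataset ambiguous) can be checked mechanically by looking for records whose combination of characterizing values occurs only once. The sketch below assumes records are dictionaries; the field names and values are hypothetical, and a real de-identification protocol would also need a policy for which fields count as characterizing.

```python
from collections import Counter


def unique_combinations(records, fields):
    """Return the records whose combination of values for `fields` occurs exactly once."""
    def key(rec):
        return tuple(rec[f] for f in fields)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]


# Hypothetical records for illustration only.
records = [
    {"age_group": "60-69", "region": "A", "diagnosis": "D1"},
    {"age_group": "60-69", "region": "A", "diagnosis": "D1"},
    {"age_group": "90+", "region": "A", "diagnosis": "D2"},
]
risky = unique_combinations(records, ["age_group", "region", "diagnosis"])
```

Here the third record is the only one of its kind, so it is returned as a candidate for generalization or suppression before the dataset is released.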
List 3.12.1. Some possible consequences of Common Rule violations.
1 The loss to the institution of its funding for the grant in
question
2 The loss to the institution of its Federal Assurance. The
Office for Human Research Protections (OHRP) issues Assurances
(currently called Federalwide Assurances or FWAs) to
institutions that have in-place processes for IRB reviews of
research and for maintaining research standards. An
institution must have an assurance registered with OHRP in
order to receive federal funding for human subjects research
3 An institution-wide suspension of human subject research
efforts
4 The imposition of grant-related restrictions on the
investigators (e.g., a prohibition from applying for federal
grant funding).
|
List 3.13.1. Section 1177 of the Act established civil and criminal penalties.
1 Civil Money Penalties. HHS may impose civil money penalties
on a covered entity of $100 per failure to comply with a
Privacy Rule requirement. Pub. L. 104-191; 42 U.S.C.
1320d-5. That penalty may not exceed $25,000 per year for
multiple violations of the identical Privacy Rule
requirement in a calendar year. HHS may not impose a civil
money penalty under specific circumstances, such as when a
violation is due to reasonable cause and did not involve
willful neglect and the covered entity corrected the
violation within 30 days of when it knew or should have
known of the violation
2 Criminal Penalties. A person who knowingly obtains or
discloses individually identifiable health information in
violation of HIPAA faces a fine of $50,000 and up to
one-year imprisonment. Pub. L. 104-191; 42 U.S.C. 1320d-6.
The criminal penalties increase to $100,000 and up to five
years imprisonment if the wrongful conduct involves false
pretenses, and to $250,000 and up to ten years imprisonment
if the wrongful conduct involves the intent to sell,
transfer, or use individually identifiable health
information for commercial advantage, personal gain, or
malicious harm. Criminal sanctions will be enforced by the
Department of Justice.
|
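The civil penalty arithmetic in item 1 of List 3.13.1 can be worked through in a few lines. This sketch uses only the two figures quoted in the list ($100 per failure, capped at $25,000 per calendar year for the identical requirement) and deliberately ignores the exceptions the list mentions, such as violations corrected within 30 days. The function and constant names are assumptions.

```python
PENALTY_PER_FAILURE = 100            # dollars, as quoted in List 3.13.1
ANNUAL_CAP_PER_REQUIREMENT = 25_000  # dollars per calendar year, identical requirement


def civil_penalty(failures_in_year):
    """Penalty for repeated failures of one identical requirement in one calendar year."""
    if failures_in_year < 0:
        raise ValueError("failures_in_year must be non-negative")
    return min(failures_in_year * PENALTY_PER_FAILURE, ANNUAL_CAP_PER_REQUIREMENT)
```

Under these figures the cap is reached at 250 failures, so `civil_penalty(250)` and `civil_penalty(1000)` give the same amount.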
List 3.15.1. Questions related to consent tracking that institutions must be able to answer.
1 Does each consent form have an identifier and a locator, a
study number, and a data element indicating that the consent
form itself was approved by an IRB?
2 If needed, could you put your hands on the physical consent
document?
3 Does your database indicate the specific study for which
consent was approved?
4 Was the consent form sufficiently detailed, allowing the
patient to approve certain uses of specimens/data and
decline other uses?
5 Is each consent tagged with tracking data?
6 Was the consent approved or declined?
7 What day was the consent signed?
8 Does the institution have a policy that applies to
situations wherein a subject cannot provide an informed
consent (e.g., infants, patients with dementia)?
9 If the institution has a policy of excluding certain classes
of patient from providing informed consent, has the
institution received approval for the policy from its IRB?
10 For children and challenged subjects, was the informed
consent document signed by a surrogate?
11 For children and challenged subjects, how is it determined
who may act as a surrogate, and how is the identity of the
surrogate recorded and tracked?
12 Did the consenting subject change her mind and withdraw
consent after consent had been approved?
13 If consent was withdrawn, what date did this occur?
14 If consent was withdrawn, was consent withdrawn for a
particular use of a specimen/data, or for all purposes
described by the consent document?
15 If consent was withdrawn, does the withdrawal of consent
apply to more than one consent form?
|
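Many of the questions in List 3.15.1 translate directly into fields of a consent-tracking record. The sketch below is one hypothetical layout, with invented field names, showing how partial approval and partial withdrawal of consent might be represented; it is not a description of any existing consent management system.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ConsentRecord:
    consent_id: str            # identifier for the consent form
    study_number: str          # the specific study the consent applies to
    irb_approved_form: bool    # was the form itself approved by an IRB?
    document_location: str     # where the physical document can be found
    approved: bool             # did the subject approve or decline?
    signed_on: date
    permitted_uses: set = field(default_factory=set)
    surrogate_id: Optional[str] = None
    withdrawn_on: Optional[date] = None
    withdrawn_uses: set = field(default_factory=set)

    def withdraw(self, on, uses=None):
        """Withdraw consent for the given uses, or for all permitted uses if none are given."""
        self.withdrawn_on = on
        if uses is None:
            self.withdrawn_uses |= set(self.permitted_uses)
        else:
            self.withdrawn_uses |= set(uses)

    def permits(self, use):
        """True if consent was approved for this use and has not been withdrawn for it."""
        return (
            self.approved
            and use in self.permitted_uses
            and use not in self.withdrawn_uses
        )
```

Keeping `permitted_uses` and `withdrawn_uses` as separate sets preserves the history of what was originally approved, which questions 12 through 14 require.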
List 3.16.1. Advantages of unconsented medical record research.
1 Saves money and time by eliminating the tedious and
expensive process of obtaining individual consents
2 Sometimes favored by patient advocacy organizations who see
unconsented research as a way of expediting medical progress
and improving the chances of survival of the patients in
their disease constituencies
3 De-identification requirements for most unconsented patient
record research essentially guarantee that no harm will
come to the patient
4 De-identified unconsented databases can be shared and used
for multiple scientific efforts. Consented databases, in
most cases, can be used only for the purposes specified in
the consent form
5 De-identified unconsented databases pose no particular
threat over time to patients. Consented databases often
contain patient identifiers and may pose a confidentiality
and privacy threat long after the consented research is
concluded.
|
List 4.1.1. Examples of dealt standards
1 The permitted levels of toxic substances in foods
2 TCP/IP (Transmission Control Protocol/Internet Protocol),
the internet specification
3 IEEE 802.11, the wireless data transfer standard
4 Longitude and latitude assignments
5 Divisions of time (days, hours, minutes and seconds)
6 Statutes governing medical privacy
|
List 4.2.1. Some causes of medical errors in the field of biomedical informatics.
1 Absence of standards (for describing clinical data)
2 Inadequate terminologies
3 Poorly written text
4 Inadequate object identifiers (e.g., identifiers for names,
tests, reports)
5 Poor interoperability of software tools
6 Poor integration of biomedical databases
7 Poor documentation (of software, of medical devices, of
protocols)
8 Poor annotation (of medical encounters and transactions)
9 Inadequate data structuring (of reports)
10 Sloppy data representation.
|
List 4.2.2. Purposes of data standards.
1 Enhance interoperability of software
2 Enable data integration
3 Increase the efficiency of medical services
4 Increase the speed of medical research
5 Reduce medical errors.
|
List 4.3.1. Why governments may choose to avoid creating biomedical standards.
1 Private entities that use a standard may be in the best
position to create the best possible standard
2 Private entities that use a standard may be willing to pay
for the standards development process
3 Private entities are more likely to adopt a new standard if
they had a part in developing the standard
4 Governments may be unwilling to accept the responsibility of
promoting a new standard
5 Governments know that many standards are never adopted by
the public and do not want to waste their resources on a
standard that will be ignored
6 Governments may be reluctant to face criticism for standards
that may adversely affect certain segments of their
populations.
|
List 4.4.1. Excerpt from RICO that may be applicable to standards developers.
1 "1951. Interference with commerce by threats or
violence
2 (a) Whoever in any way or degree obstructs, delays, or
affects commerce or the movement of any article or commodity
in commerce, by robbery or extortion or attempts or
conspires so to do, or commits or threatens physical
violence to any person or property in furtherance of a plan
or purpose to do anything in violation of this section shall
be fined under this title or imprisoned not more than twenty
years, or both
3 (b) As used in this section-
4 (1) The term "robbery" means the unlawful taking
or obtaining of personal property from the person or in the
presence of another, against his will, by means of actual or
threatened force, or violence, or fear of injury, immediate
or future, to his person or property, or property in his
custody or possession, or the person or property of a
relative or member of his family or of anyone in his company
at the time of the taking or obtaining
5 (2) The term "extortion" means the obtaining of
property from another, with his consent, induced by wrongful
use of actual or threatened force, violence, or fear, or
under color of official right."
|
List 4.4.2. Disclaimer against hidden patents within standards
1 "The attention of adopters is directed to the
possibility that compliance with or adoption of OMG
specifications may require use of an invention covered by
patent rights. OMG shall not be responsible for identifying
patents for which a license may be required by any OMG
specification, or for conducting legal inquiries into the
legal validity or scope of those patents that are brought to
its attention. OMG specifications are prospective and
advisory only. Prospective users are responsible for
protecting themselves against liability for infringement of
patents."
|
List 4.4.3. Perceived risks of developing a new standard.
1 The standard may inadvertently contain intellectual property
(particularly patented methods) resulting in a legal
complaint against the creators of the standard
2 The standard may create loss of revenue or property to
certain entities, resulting in legal actions taken against
the creators of the standard
3 The standard may result in medical errors, resulting in
injury to patients and subsequent legal actions taken
against the creators of the standard
4 The standard may have been developed in a manner that
excluded participation by an entity, resulting in a legal
action
|
List 4.5.1. Questions that should be asked prior to developing a new standard.
1 Is there a pre-existing standard that covers the same
technology?
2 If there is a pre-existing standard, can it be enhanced or
modified to provide a desired functionality?
3 How much will it cost to develop the standard?
4 How long will the standards development process take?
5 Will the intended beneficiaries of the standard pay for the
standards development process?
6 Who will develop the standard? Are the selected developers
competent to produce an adequate standard?
7 Are any of the developers conflicted? Do they stand to
profit if the standard is developed in a specific way?
8 Do any of the developers have proprietary software or data
that they may wish to include in the standard?
9 Are the expected developers committed to work through the
duration of the standards development process, and are they
committed to providing all of the time and energy needed to
develop the standard?
10 Will there be a mechanism whereby drafts of the standard are
reviewed openly by the public? Will the minutes of the
working committee be made public? Will public comments be
used to modify successive drafts of the standard?
11 Will the standard have dependencies on other standards? If
so, are there intellectual property issues that must be
resolved before development begins? Will these issues
require licenses or royalty agreements from the standards
developers or the standards users?
12 Once created, is the standard likely to be adopted? Is the
anticipated standard easily implemented?
13 Who will be the adopters of the standard? Are the expected
standard adopters included in the development process for
the standard?
14 Will the standard benefit a range of users beyond the
standards developers?
15 What are the hazards that the standard may produce, and who
might be hurt by the standard? In particular, will any
entities be disadvantaged if they cannot readily adopt the
standard?
16 Is it necessary to have the standard approved by an external
organization?
17 If so, who will pay for the extra costs of obtaining
approval from an external standards organization?
18 Will the standard need to be continuously updated and
modified? Is there a planned process for producing multiple
versions of the standard?
19 Is it really important to have the standard? Is it worth
the effort?
|
List 4.6.1. Organizations active in the field of biomedical standards.
1 ASTM, American Society for Testing and Materials
2 ANSI, American National Standards Institute (see Glossary)
3 HISB, Healthcare Informatics Standards Board
4 IEEE, Institute of Electrical and Electronics Engineers,
Inc
5 ACR/NEMA, American College of Radiology (ACR) and National
Electrical Manufacturers Association (NEMA), which oversees
the DICOM (Digital Imaging and Communications in Medicine)
image standard
6 NCPDP, National Council for Prescription Drug Programs, Inc
7 NIST, National Institute of Standards and Technology
8 ISO, International Organization for Standardization
9 IEC, International Electrotechnical Commission.
|
List 4.6.2. Some American National Standards programming languages.
1 Mumps (ANSI approval 1977)
2 Basic (ANSI approval 1978)
3 ADA (ANSI approval 1983)
4 C (ANSI approval 1989)
5 Common Lisp (ANSI approval 1994)
6 ADA 95 (ANSI approval 1995)
7 Smalltalk (ANSI approval 1998)
8 C++ (ANSI approval 1999).
|
List 4.7.1. New and future technologies that create biomedical data.
1 Gene Expression arrays (see Glossary)
2 Proteomic arrays
3 Tissue Microarrays
4 Metabolomic arrays
5 Image morphometric arrays.
|
List 4.8.1. Problems created by the introduction of new standards.
1 New classes of data objects require a new standard for each
new object class (examples: Tissue Microarray Data, Gene
Expression Array Data)
2 New standards require new implementations
3 Existing data standards require revision
4 Revisions of existing standards require retroactive
implementation in data records conforming to the prior
version of the standard
5 New data standards require harmonization with other existing
standards. Otherwise multiple standards may compete for the
standards-based data structures and data descriptors
applicable to data elements common to multiple standards
6 Because standards often become the intellectual property of
the standards development organization, new standards
cannot include parts of standards developed by other
organizations. This means that redundant standards may
describe the same objects.
|
List 4.9.1. Fundamental properties of a specification.
1 The object specified must be defined and distinguished from
all other objects. (i.e., one object cannot have two
different specifications and one specification cannot apply
equally to two non-equivalent objects)
2 The description must be organized in a way that is
understandable and unambiguous. (i.e., a standard method of
describing things, in the general sense, can be used.
Languages are standard methods of describing things, but a
better method might employ a formal semantic logic)
3 The descriptors must be well-defined in the context of the
specification and not confused with descriptors of the same
name but different meaning that may appear in other
specifications (e.g., a "date" may be a calendar
notation in one standard and a type of dried fruit in
another specification)
4 The measurements and descriptor values must be well-defined
and not confused with measurements and values of the same
alphanumeric value but different meaning that may appear in
other specifications (e.g., 10 pounds is not the same as
10 kg)
5 The specification must describe itself, include information
pertaining to its purpose, its creator, its ownership, any
restrictions on its uses, and any instructions necessary to
interpret the specification.
|
List 4.9.2. Logistical advantages of specifications over standards.
1 A specification need not be developed through a standards
development process. A specification is basically a
descriptive document and only requires fully unambiguous
language. An individual can create a specification that
everyone in the world can understand and use
2 Specifications do not require approval by any federal agency
or organization. Standards have almost no meaning unless
they are approved. In some cases, standards are enforced by
authority of law
3 There are usually many different ways of specifying things.
The same object can be described by different
specifications. Standards tend to impose a monolithic
implementation
4 A specification is a general way of describing things and
can be used for many different and new types of things.
Standards are typically developed for specific items and
cannot accommodate new items without pursuing a development
and approval process through a standards development
organization. Biomedical informaticians who use research
data will almost certainly find that existing standards will
not keep pace with the arrival of new techniques and data
objects. The chair shown (see Figure) is a fully specified
image created with Pov-Ray, a free, open source rendering
program (see Appendix). It was created using a .pov file,
which is a plain-text set of instructions written for the
rendering application.
|
List 4.9.3. Snippet from chair.pov rendering specification, modified from Matthias Opitz's public domain scene file. |
List 4.11.1. Parts of an LSID, from The LSID Resolution Protocol Project.
1 Network Identifier (NID)
2 root DNS name of the issuing authority
3 namespace chosen by the issuing authority
4 object id unique in that namespace and assigned locally
5 revision id for storing versioning information
(optional)
|
List 4.11.2. Examples of LSIDs, from The LSID Resolution Protocol Project.
1 urn:lsid:pdb.org:1AFT:1 This is the first version of the
1AFT protein in the Protein Data Bank
2 urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434 References a
PubMed article
3 urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2 Refers to the
second version of an entry in GenBank
|
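The parts named in List 4.11.1 can be pulled apart by splitting an LSID on its colons. The following is a minimal sketch, not code from the LSID Resolution Protocol Project; the subroutine name parse_lsid and the hash keys are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_lsid is a hypothetical helper, not part of any LSID library.
# It splits an LSID into the parts named in List 4.11.1 and returns a
# hash reference, or undef if the string does not look like an LSID.
sub parse_lsid
  {
  my ($lsid) = @_;
  my @parts = split(/:/, $lsid);
  return if (@parts < 5 || @parts > 6);
  return unless (lc($parts[0]) eq "urn" && lc($parts[1]) eq "lsid");
  my %lsid_parts = (
    nid       => $parts[1],   # network identifier, always "lsid"
    authority => $parts[2],   # root DNS name of the issuing authority
    namespace => $parts[3],   # namespace chosen by the issuing authority
    object    => $parts[4],   # object id, unique in that namespace
    revision  => $parts[5],   # optional revision id (undef when absent)
    );
  return \%lsid_parts;
  }

my $parsed = parse_lsid("urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434");
print "authority: $parsed->{authority}\n";
print "namespace: $parsed->{namespace}\n";
print "object:    $parsed->{object}\n";
```

The sketch checks only the shape of the identifier. It does not contact the issuing authority or confirm that the object exists.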
List 4.13.1. Principles of unique object identification.
1 A unique object can be distinguished from all other unique
objects
2 A unique object cannot be distinguished from itself
3 A class (or collection) of instances can be unique.
|
List 4.13.2. Some registries that continually assign unique identifiers to requesting entities.
1 DOI, Digital Object Identifier
2 PMID, PubMed identification number
3 LSID (Life Science Identifier)
4 HL7 OID (Health Level 7 Object Identifier)
5 DICOM (Digital Imaging and Communications in Medicine)
identifiers
6 ISSN (International Standard Serial Numbers)
7 Social Security Numbers (for U.S. population)
8 NPI, National Provider Identifier, for physicians
9 Clinical Trials Protocol Registration System
10 Office of Human Research Protections FederalWide Assurance
number
11 Data Universal Numbering System (DUNS) number
12 DNS, Domain Name System.
|
List 4.13.3. Dependable computer systems that rely on unique object identifiers.
1 Google (relies on URLs)
2 PubMed (relies on PubMed identifiers)
3 Libraries (rely on ISSN, DOI)
4 Swiss banks (rely on unique account numbers).
|
List 4.13.4. Some medical errors related to misidentification.
1 Correctly identified medication provided to incorrectly
identified person
2 Incorrectly identified medication provided to correctly
identified person
3 Incorrectly identified dosage of correct medication provided
to correctly identified person
4 Blood transfusion provided to incorrectly identified person
5 Report sent to incorrectly identified physician
6 Report identified with wrong person's name
7 Bill sent to incorrectly identified person
8 Report provided with diagnosis intended for different person
9 Wrong operation performed on incorrectly identified patient
10 Incorrectly identified patient treated for another patient's
illness.
|
List 4.15.1. Information deficiencies in the statement "John Smith has a blood glucose of 85".
1 No unique patient identifier (many people are named John
Smith)
2 No unique time identifier (indicating when the test was
performed and distinguishing the test results from other
blood glucose values obtained from the patient at other
times)
3 No unique test identifier (indicating the specific protocol
used to measure blood glucose in this instance)
4 No unique identifier for the units of measurement
5 No unique report identifier (indicating that the report
itself is a unique laboratory object that can be archived
and retrieved)
|
List 4.15.2. Three conditions for a meaningful assertion in informatics.
1 There is a specified object about which the statement is
made. When the object is a unique object (such as a
patient), the object must be specified in a manner that
distinguishes the object from all other objects, and this is
typically done with a unique object identifier
2 There is data that pertains to the specified object
3 There is metadata that describes the data (that pertains to
the specified object.)
|
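As a sketch of these three conditions, the glucose statement from List 4.15.1 can be held in a Perl hash so that the object identifier, the data and the metadata are each explicit. The identifier, field names, units and time stamp below are invented for illustration; they do not come from a real record or from any standard.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# All values here are hypothetical; they only illustrate the structure.
my %assertion = (
  object_id => "77300183",             # unique identifier for the patient
  data      => 85,                     # the measured value
  metadata  => {                       # describes the data
    test  => "blood glucose",
    units => "mg/dL",
    time  => "2004-07-07T08:30:00",
    },
  );

# is_meaningful is an illustrative helper. It checks only that all three
# parts are present, not that their contents are correct.
sub is_meaningful
  {
  my ($statement) = @_;
  foreach my $part ("object_id", "data", "metadata")
    {
    return 0 unless defined $statement->{$part};
    }
  return 1;
  }

print "Complete assertion\n" if is_meaningful(\%assertion);
print "Incomplete assertion\n" unless is_meaningful({ data => 85 });
```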
List 4.15.3. Generalizable scientific statements.
1 f=ma -- Force is mass times acceleration
2 If a gas is held at constant temperature, its volume is
inversely proportional to its pressure - Boyle's law
3 Ontogeny recapitulates phylogeny - fetal development follows
the evolutionary path of the species (a false assertion)
4 There are 10 types of people, those who use binary notation
and those who do not
5 (love of money) = (evil)x(evil) -- The love of money is the
root of all evil.
|
List 4.15.4. Algorithm for de-identifying with an identifier.
1 Collect data on unique object. "Joe Public has brown
eyes."
2 Assign a unique identifier. "Joe Public has unique
identifier, 77300183."
3 Substitute name of object with its identifier
4 Consistently use the identifier with data. "77300183
has brown eyes."
5 Do not let anyone know that Joe Public is 77300183.
|
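The steps above can be sketched in a few lines of Perl. This is a simplified illustration, not a complete de-identification method: the subroutine name deidentify is invented, and the sequential identifiers are used only to keep the example short. A working system would presumably assign identifiers that cannot be guessed and would keep the lookup table under strict control.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# %id_for is the private lookup table (step 5: it must be kept secret).
my %id_for;
my $next_id = 77300183;   # arbitrary starting value, taken from the example

# deidentify is a hypothetical helper. It replaces each occurrence of a
# name with that name's identifier, assigning an identifier on first use.
sub deidentify
  {
  my ($name, $text) = @_;
  $id_for{$name} = $next_id++ unless exists $id_for{$name};   # step 2
  $text =~ s/\Q$name\E/$id_for{$name}/g;                      # steps 3 and 4
  return $text;
  }

print deidentify("Joe Public", "Joe Public has brown eyes."), "\n";
print deidentify("Joe Public", "Joe Public is 52 years old."), "\n";
```

Because the same name always maps to the same identifier, data collected at different times can still be joined without revealing the name.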
List 5.2.1. Some questions that can be answered with short program scripts.
1 Strip all the private identifiers from a medical record
2 Find all the surgical procedures included in the dataset of
surgical post-op notes, and annotate each procedure with its
frequency of occurrence in the dataset
3 Index a book with the page location of all terms that are
names of diseases
4 Find all the palindromes in a gene sequence database and
arrange them by frequency of occurrence
5 Find the most commonly occurring octamer sequence in the
human genome database
6 Find all octamers that occur only once in the human genome
database
7 Rank sequences from a gene expression array experiment based
on levels of over-expression
8 From a patient database, find the diseases that have a
chronologic relationship with another condition (e.g.
chicken pox never occurs after shingles)
9 Find all tumors associated with a gene fusion mutation
10 Collect 100 histopathologic images of liver disease from the
Web.
|
List 5.2.2. The three programming tricks in medical informatics.
1 File parsing (opening a file and examining the contents of
the file, one line at a time)
2 Pattern matching (finding a fragment of parsed text that
matches a word, a phrase or a character pattern of interest)
3 Assigning data structures to hold numbers or textual data
that can be operated on, with outputs placed in an external
file.
|
List 5.2.3. Pseudocode to collect all the lines from a file that contain the phrase "biomedical informatics".
1 Open a file for reading. (Verbose equivalent: Get a file
from the hard drive that has a particular name and prepare
it so that the data in the file can be extracted and put
into holders in the computer's memory.)
2 Parse the lines of the file. (Verbose equivalent: Grab
the characters from the first line of the file and put them
into a data holder that occupies a specific place in
computer memory. Be prepared to repeat this for all the
lines of the file.)
3 Collect all the lines that contain the phrase
"biomedical informatics". (Verbose equivalent: As each
line is placed in a holder in computer memory, determine
whether the line contains the string "biomedical
informatics" and if it does, add the held data to a
structure called an array, which can hold many character
strings, in sequence.)
4 When the file is exhausted, empty all the matching lines
into an external file, opened for writing, named
"output.txt". (Verbose equivalent: At the end of
the file parsing loop, take the array structure, and
transfer all the character strings from the array, in
sequence, into a newly created file that has been prepared
to accept data.)
|
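The four pseudocode steps translate almost line for line into Perl. The following is one possible sketch; the subroutine name collect_lines is invented, and the input file name is taken from the command line because the pseudocode does not name one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# collect_lines is a hypothetical name. It copies every line of $infile
# that contains $phrase into $outfile and returns the number of lines copied.
sub collect_lines
  {
  my ($infile, $outfile, $phrase) = @_;
  open(my $in, "<", $infile) || die "Can't open $infile: $!";     # step 1
  my @matches;
  while (my $line = <$in>)                                        # step 2
    {
    push(@matches, $line) if ($line =~ /\Q$phrase\E/);            # step 3
    }
  close($in);
  open(my $out, ">", $outfile) || die "Can't open $outfile: $!";
  print $out @matches;                                            # step 4
  close($out);
  return scalar(@matches);
  }

# Run only when an input file name is supplied on the command line.
if (@ARGV)
  {
  my $count = collect_lines($ARGV[0], "output.txt", "biomedical informatics");
  print "$count matching lines written to output.txt\n";
  }
```

The match is case-sensitive, so a line containing "Biomedical Informatics" would be skipped. Adding the i option to the pattern would collect it as well.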
List 5.2.4. Reasons to program in Perl.
1 Perl can be obtained at no cost
2 Perl is available for virtually every operating system and
comes bundled into Unix and Linux distributions
3 Perl is extremely popular among bioinformaticians
4 It takes just a few hours to learn enough Perl to write your
own biomedical informatics programs
5 Perl programs tend to be much shorter and easier to
understand than programs written in C or Java
6 A Perl script written for your computer will probably work
on any other computer loaded with a Perl interpreter, even
if the other computer has a different operating system
7 Unlike C and C++, Perl comes with native pattern matching
commands (so called regular expressions) which are used in
virtually every program in the field of biomedical
informatics
8 There are many thousands of freely available Perl tools that
perform a wide range of useful operations that can extend
the functionality of your own programs
9 Perl code can be written in a manner that looks much like
simple narrative text (if you make the effort), making it
easy for others to read
10 Once you've learned Perl, you can migrate to almost any
other programming language with ease.
|
List 5.5.1. Contents of typical flat-file, "taxo.txt" extracted from "Taxonomy".
1 SYNONYM : Bacillus aegyptius
2 SYNONYM : Haemophilus aegyptius
3 SYNONYM : Hemophilus conjunctivitidis
4 SYNONYM : Haemophilus influenzae aegyptius
5 SYNONYM : Bacillus conjunctivitidis
6 SYNONYM : Bacterium aegyptiacum
7 SYNONYM : Bacterium conjunctivitis
8 SYNONYM : Bacterium pseudo conjunctivitidis
|
List 5.5.2. Perl script, open1.pl, to open a file and read a file.
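#!/usr/bin/perl
open(FILE, "taxo.txt") || die "Can't open file";
$line = " ";            #start with a non-empty value so the loop begins
while ($line ne "")     #an empty value signals the end of the file
  {
  $line = <FILE>;       #read the next line
  print $line;
  }
exit;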
|
List 5.5.3. Output of open1.pl.
1 C:\ftp>perl open1.pl
2 SYNONYM : Bacillus aegyptius
3 SYNONYM : Haemophilus aegyptius
4 SYNONYM : Hemophilus conjunctivitidis
5 SYNONYM : Haemophilus influenzae aegyptius
6 SYNONYM : Bacillus conjunctivitidis
7 SYNONYM : Bacterium aegyptiacum
8 SYNONYM : Bacterium conjunctivitis
9 SYNONYM : Bacterium pseudo conjunctivitidis
|
List 5.10.1. Mwp.pl, a ridiculously short text editor, in Perl.
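#!/usr/bin/perl
open (OUT, ">>mycumu.txt") || die "Can't open file";   #append to the cumulative file
open (NEW, ">mynew.txt") || die "Can't open file";     #overwrite the session file
$line = " ";
until ($line eq "\n")    #an empty line (return key only) ends the session
  {
  $line = <STDIN>;
  print OUT $line;
  print NEW $line;
  }
exit;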
|
List 5.10.2. Until loop in Perl.
$line = " ";
until ($line eq "\n")   #loop stops when all you've entered is the return key
  {
  $line = <STDIN>;      #waits for the next line of input
  print OUT $line;      #appends to the cumulative file
  print NEW $line;      #writes to the current script-session file
  }
|
List 5.11.1. Common errors in Perl scripts.
1 Perl blocks must be balanced with curly brackets. Every
block (e.g., while, if, for, unless, foreach) must have a
beginning curly bracket, "{", and a balancing closing
curly bracket, "}". This can become hairy in
scripts that have multi-nested blocks
2 Command lines must end with a semicolon
3 String variables must be prefixed with a "$",
as in $date
4 Spelling counts in scripts. Perl cannot interpret a
misspelled command or variable
5 An uppercase character has a different ASCII value than its
lowercase equivalent. With few exceptions, you will find it
useful to maintain case consistency in Perl scripts
6 Characters that serve as reserved Perl symbols must be
backslashed if they are used as string characters. For
example, use \. \/ \\ \$ if you want to use . / \ $ as
characters. There are exceptions to this rule: \n, \d and \w
are reserved symbols and never refer to the letters n, d and
w. The strange and non-intuitive use of backslashes in Perl
takes some mental adjustment and accounts for the "leaning
toothpick syndrome" in Perl scripts. Complex regular
expressions often resemble toothpicks tossed amidst string
characters
7 Certain operations must be enclosed by parentheses (e.g.,
if (1 == 2), not if 1 == 2)
8 The "=" operator assigns a value and does not
test for equality. To test for equality, use "=="
if you are comparing two numbers and use "eq" if
you are comparing two strings. Remember that string
comparison operators (eq, ne, lt, gt) are different from
number comparison operators (==, >, <)
9 Using an "=" operator when you really want to use
the regex comparison operator, "=~".
|
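The difference between the numeric, string and regex comparison operators described in items 8 and 9 is easiest to see side by side. This is a small illustrative sketch; the variable names are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dose = "10.0";

# "==" compares numbers: the string "10.0" is converted to the number 10
my $numeric_equal = ($dose == 10) ? 1 : 0;

# "eq" compares strings character by character: "10.0" is not "10"
my $string_equal = ($dose eq "10") ? 1 : 0;

# "=~" tests a string against a regular expression
my $looks_decimal = ($dose =~ /^\d+\.\d+$/) ? 1 : 0;

print "numeric: $numeric_equal, string: $string_equal, regex: $looks_decimal\n";
```

The same value is "equal to 10" under one operator and "not equal" under another, which is why choosing the wrong operator produces errors that are hard to spot.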
List 5.11.2. Summary of the first Perl programming section.
1 Perl scripts are simple text files. Perl scripts should be
named using the .pl extension. Perl is a quintessential
command-line language. At the command prompt, run your
scripts by typing perl, then the name of the script, then
the return key (on some systems, you needn't include the
name perl)
2 Perl scripts start off with a header line
3 Perl commands end with a semicolon
4 Perl blocks are delineated by curly brackets ({ })
5 You can assign strings to variables by using the assignment
operator, "="
6 You can read, write or append to files using the
"open" command
|
List 5.12.1. Pseudocode that outlines the general construction of a Perl script.
header (shebang) line;
input something;
if (something evaluates to true)
  {
  do something;
  for or while (some condition)
    {
    do something;
    }
  do something;
  do something;
  }
for or while (some condition)
  {
  do something;
  if (something evaluates to true)
    {
    do something;
    do something;
    do something;
    }
  output something;
  }
exit;
|
List 5.14.1. Perl script bigread.pl.
#!/usr/bin/perl
#bigread.pl
#This script lets you page through enormous files,
#20 lines at a time, with no file load time
print "What file do you want to read?";
$filename = <STDIN>;
chomp($filename);
open (TEXT, $filename)||die"Can't open file";
$line = " ";
while ($line ne "")   #while $line is not equal to empty
  {
  for ($count = 1; $count <= 20; $count++)
    {
    $line = <TEXT>;
    print $line;
    }
  print "Type QUIT if you want to quit. Otherwise press any key\n";
  $response = <STDIN>;
  if ($response =~ /QUIT/i)
    {
    last;
    }
  }
exit;
|
List 5.14.2. Output of File Reader.
1 C:\ftp>perl bigread.pl
2 What file do you want to read?e:\omim.txt
3 *RECORD*
4 *FIELD* NO
100050
*FIELD* TI
100050 AARSKOG SYNDROME
*FIELD* TX
Grier et al. (1983) reported father and 2 sons with typical Aarskog
syndrome, including short stature, hypertelorism, and shawl scrotum.
5
6
7 sons and that this suggested autosomal dominant inheritance.
Actually,
8 the mother seemed less severely affected, compatible with
X-linked
9 Type QUIT if you want to quit. Otherwise press any key
|
List 5.14.3. Summary of the second Perl programming section.
1 How to open and read from files, line by line
2 How to prompt a user for input
3 Looping using
|
List 5.15.1. Things you can do with a one-line Regular expression.
1 Collect the lines from a file that contain a specific word,
phrase or number
2 Collect the lines from a file that contain any desired
combination of the above
3 Substitute any alphanumeric character string for any other,
for the entire file
|
List 5.16.1. Using the match operator with regular expressions.
for all the lines of a given file
  {
  put the next line from the file into some variable;
  check the line to see if it matches your regular expression;
  if the line matches the regular expression
    {
    do something with it, like put it into another file;
    or do an operation on the matching value;
    }
  }
|
List 5.16.2. Using the substitution operator with regular expressions.
for all the lines of a given file
  {
  put the next line from the file into some variable;
  do a substitution on all of the parts of the line that match your regular expression;
  do something with the revised line, like rearranging it and then putting the rearranged line into another file;
  }
|
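The two pseudocode outlines (List 5.16.1 for matching, List 5.16.2 for substitution) might look like this in Perl. This is a minimal sketch: an in-memory array stands in for the file so that the example runs on its own, and the sample lines and variable names are chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines stand in for a file so that the sketch is self-contained.
my @lines = (
  "Haemophilus aegyptius\n",
  "Bacillus conjunctivitidis\n",
  "Haemophilus influenzae aegyptius\n",
  );

my @matched;   # lines that match the regular expression
my @revised;   # every line, after substitution

foreach my $line (@lines)
  {
  # match operator: keep the line if it contains "aegyptius"
  if ($line =~ /aegyptius/)
    {
    push(@matched, $line);
    }
  # substitution operator: work on a copy so the original is unchanged
  my $copy = $line;
  $copy =~ s/Haemophilus/Hemophilus/g;
  push(@revised, $copy);
  }

print "Matched:\n", @matched;
print "Revised:\n", @revised;
```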
List 5.17.1. Pattern match options.
1 g Match globally, (find all occurrences)
2 i Do case-insensitive pattern matching
3 m Treat string as multiple lines
4 o Compile pattern only once
5 s Treat string as single line
6 x Use extended regular expressions
7 ^ Match the beginning of the line
8 . Match any character (except newline)
9 $ Match the end of the line (or before newline at
the end)
10 | Alternation
11 () Grouping
12 [] Character class
13 * Match 0 or more times
14 + Match 1 or more times
15 ? Match 1 or 0 times
16 {n} Match exactly n times
17 {n,} Match at least n times
18 {n,m} Match at least n but not more than m times
19 \n newline(LF, NL)
20 \W Match a non-word character
21 \s Match a whitespace character
22 \S Match a non-whitespace character
23 \d Match a digit character
24 \D Match a non-digit character.
|
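A few of the options and symbols listed above can be tried on a single test string. This is an illustrative sketch only; the sample sentence and variable names are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Heparin 5000 units; heparin 250 units";

# g finds all occurrences; \d+ matches one or more digits
my @numbers = ($text =~ /\d+/g);

# i ignores case, so "Heparin" and "heparin" both match
my @drug_names = ($text =~ /heparin/gi);

# ^ anchors the match to the beginning of the string, $ to the end
my $starts_with_drug = ($text =~ /^heparin/i) ? 1 : 0;
my $ends_with_units  = ($text =~ /units$/) ? 1 : 0;

# {3} matches exactly three digits; parentheses capture what was matched
# (here, the first three digits of 5000)
my ($first_three_digits) = ($text =~ /(\d{3})/);

# | offers alternatives inside a group
my $has_dose_word = ($text =~ /\b(units|tablets)\b/) ? 1 : 0;

print "numbers: @numbers\n";
print "drug names found: ", scalar(@drug_names), "\n";
print "first three digits: $first_three_digits\n";
```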
List 5.17.2. Sentence.pl Perl script, which creates a file wherein each new sentence begins on a new line.
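#!/usr/local/bin/perl
open (TEXT, "1DFRE10.TXT")||die"Can't open file";
open (OUT,">1DFRE10.OUT")||die"Can't open file";
undef($/);                #read the whole file as one string
$string = <TEXT>;
$string =~ s/[\n]+/ /g;   #replace line breaks with spaces
#insert a newline after each period that is followed by one or two
#spaces and an uppercase letter
$string =~ s/([^A-Z]+\.[ ]{1,2})([A-Z])/$1\n$2/g;
print OUT $string;
exit;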
|
List 5.18.1. Periods.pl, a Perl script for removing periods that do not delineate sentences.
#!/usr/bin/perl
#disbrev2.pl
#replaces periods with *, except when period marks end of sentence
$k = "Mr. Dr. P.I.N. Ph.D. M.D. 0.3 .4 5. 4.6.7.8.9 end_of_sentence. Hello";
$firstvalue = $k;
#A period is kept only when it follows an all-lowercase word and
#precedes a space and an uppercase letter; every other period becomes *
$k =~ s/(\b[a-z_]+\. +(?=[A-Z]))|\./defined($1) ? $1 : "*"/ge;
print "$firstvalue =>\n$k";
exit;
|
List 5.18.2. Output of disbrev2.pl.
1 C:\ftp>perl disbrev2.pl
2 Mr. Dr. P.I.N. Ph.D. M.D. 0.3 .4 5. 4.6.7.8.9
end_of_sentence. Hello =>
3 Mr* Dr* P*I*N* Ph*D* M*D* 0*3 *4 5* 4*6*7*8*9
end_of_sentence. Hello
4 C:\ftp>
|
List 5.18.3. Regex (regular expression) substitution examples.
1 $string =~ s/^ +//o; Removes leading spaces from a character
string
2 $string =~ s/ +$//o; Removes trailing spaces from a
character string
3 $string =~ s/ +/ /g; Changes all sequences of one or more
spaces to just a single space
4 $string =~ s/\n//g; Gets rid of newline (sometimes called
linebreak) characters in your string
5 $string =~ s/\b(\w+\.[ ]{1,2})([A-Z])/$1\n$2/g;
6 This finds the most common sentence delimiter (the end of a
word followed by a period, followed by one or two spaces,
followed by an uppercase letter) and inserts a newline
character so that each new sentence begins on a
new line
7 $string =~ tr/A-Z/a-z/; Every uppercase letter is converted
to a lowercase letter using the translate operator
(tr/a-z/A-Z/ does the opposite)
8 $string = |
List 5.19.1. Wc.pl Perl script, which counts the words in a file in 5 commands.
#!/usr/local/bin/perl
open (TEXT, "1DFRE10.TXT");
undef($/);                                  #read the whole file at once
$all_text = <TEXT>;
@wordarray = split(/[\n\s]+/, $all_text);   #split on runs of whitespace
print scalar(@wordarray);                   #number of words
exit;
|
List 5.20.1. The Zipf distribution of the prior paragraph.
1 c:\ftp>perl zipf.pl
2 00007 of
3 00005 a
4 00004 the
5 00003 words
6 00003 is
7 00003 in
8 00002 zipf
9 00002 text
10 00002 occurrences
11 00002 distribution
12 00001 zipf's
13 00001 way
14 00001 this
15 00001 their
16 00001 that
17 00001 small
18 00001 shown
19 00001 see
20 00001 practical
21 00001 paragraph
22 00001 order
23 00001 most
24 00001 listing
25 00001 list
26 00001 law
27 00001 interpreting
28 00001 for
29 00001 different
30 00001 descending
31 00001 any
32 00001 amount
33 00001 account
|
List 5.20.2. The first ten items in the Zipf distribution of The Decline and Fall of the Roman Empire.
1 26856 the
2 18032 of
3 09136 and
4 06026 to
5 04654 a
6 04155 in
7 03170 was
8 03081 his
9 02815 by
10 02391 that
|
List 5.20.3. Zipf.pl, a Perl script that creates a Zipf distribution in 6 commands.
#!/usr/local/bin/perl
open (TEXT, "1DFRE10.TXT");
open (OUT, ">1DFRE10.OUT");
undef($/);
$all_text = <TEXT>;
$all_text = lc($all_text);
$all_text =~ s/[^a-z\-\']/ /g;
@wordarray = split(/[\n\s]+/, $all_text);
foreach $thing (@wordarray)
  {
  $freq{$thing}++;
  }
#The Zipf list is finished. The next lines just display the distribution
while ((my $key, my $value) =
|
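For readers who want a complete, runnable word-frequency example, here is a self-contained sketch that counts words in the same way and prints them in the zero-padded format seen in List 5.20.1. It is not the author's Zipf.pl: the subroutine name word_frequencies, the sample sentence, and the sort used for display are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# word_frequencies is a hypothetical helper. It lowercases the text,
# replaces everything except letters, hyphens and apostrophes with
# spaces, and returns a hash of word => number of occurrences.
sub word_frequencies
  {
  my ($text) = @_;
  $text = lc($text);
  $text =~ s/[^a-z\-\']/ /g;
  my %count;
  foreach my $word (split(/\s+/, $text))
    {
    next if ($word eq "");   # split yields an empty field for leading spaces
    $count{$word}++;
    }
  return %count;
  }

my %freq = word_frequencies("The cat and the dog and the bird");

# Display in descending order of frequency; ties are broken in reverse
# alphabetical order, an assumption chosen to resemble List 5.20.1.
foreach my $word (sort { $freq{$b} <=> $freq{$a} || $b cmp $a } keys %freq)
  {
  printf("%05d %s\n", $freq{$word}, $word);
  }
```

To process a file instead of a sample sentence, read the file into a string (for example by undefining $/ as in the listings above) and pass that string to the subroutine.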
List 5.20.4. Example of an associative array, %patient_weight.
$patient_weight{"John Public"} = 155;
$patient_weight{"Mary Smith"} = 110;
$patient_weight{"Jules Berman"} = 195;
$patient_weight{"Jules Berman"}++;   #evaluates to 196
|
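Once an associative array has been filled, its key/value pairs can be walked through with a while block or, when a predictable order is needed, with a foreach block over the sorted keys. The following is a short sketch that reuses the sample weights from List 5.20.4.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %patient_weight = (
  "John Public"  => 155,
  "Mary Smith"   => 110,
  "Jules Berman" => 195,
  );
$patient_weight{"Jules Berman"}++;   # now 196

# each() returns the next key/value pair, in no particular order
while (my ($name, $weight) = each(%patient_weight))
  {
  print "$name => $weight\n";
  }

# sort the keys when a predictable order is needed
foreach my $name (sort keys %patient_weight)
  {
  print "$name weighs $patient_weight{$name}\n";
  }
```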
List 5.20.5. Summary of the third Perl programming section.
1 Creating and interpreting complex regular expressions
2 Looping through arrays with foreach blocks
3 Looping through associative arrays with while blocks
4 New Perl operators and commands: split(), push(), lc(),
sort(), join(), substr(), scalar(), undef(); incrementing
values; and concatenating strings
5 Advanced pattern substitution and substitution options
|
List 5.21.1. A sample MESH record.
1 *NEWRECORD
2 RECTYPE = D
3 MH = Heparin
4 AQ = AA AD AE AG AN BI BL CF CH CL CS CT DF DU EC GE HI IM
IP ME PD PH PK PO RE SD SE ST TO TU UL UR
5 PRINT ENTRY = Heparinic Acid|T118|T121|T123|
6 NON|EQV|UNK (19XX)|800523|abbbcdef
7 PRINT ENTRY = alpha-Heparin|T118|T121|T123|NON|NRW|
8 UNK (19XX)|800523|abbbcdef
9 ENTRY = Liquaemin|T118|T121|T123|TRD|NRW|UNK
(19XX)|861029|abbbcdef
10 ENTRY = Sodium Heparin|T118|T121|NON|NRW|UNK
(19XX)|830330|abbcdef
11 ENTRY = Heparin, Sodium
12 ENTRY = alpha Heparin
13 MN = D09.698.373.400
14 PA = Anticoagulants
15 PA = Fibrinolytic Agents
16 EC = antagonists & inhibitors:Heparin Antagonists
17 MH_TH = BAN (19XX)
18 ST = T118
19 ST = T121
20 ST = T123
21 N1 = Heparin
22 RN = 9005-49-6
23 MS = A highly acidic mucopolysaccharide formed of equal
parts of sulfated D-glucosamine and D-glucuronic acid with
24 sulfaminic bridges. The molecular weight ranges from six to
twenty thousand. Heparin occurs in and is obtained from
liver,
25 lung, mast cells, etc., of vertebrates. Its function is
unknown, but it is used to prevent blood clotting in vivo
and vitro, in
26 the form of many different salts
27 PM = /therapeutic use was HEPARIN, THERAPEUTIC 1965
28 HN = /therapeutic use was HEPARIN, THERAPEUTIC 1965
29 MED = *1635
30 MED = 3275
31 M90 = *2406
32 M94 = 4517
33 MR = 20040707
34 DA = 19990101
35 DC = 1
36 UI = D006493
|
List 5.21.2. Creating a persistent database object from the MESH flat-file.
#!/usr/bin/perl
use Fcntl;
use SDBM_File;
tie %item, "SDBM_File", 'mesh', O_RDWR|O_CREAT|O_EXCL, 0644;
untie %item; #these two lines simply create a file
open (TEXT, "d2002.bin")||die"Can't open file";
$/ = "*NEWRECORD";   #read one MESH record at a time
$line = " ";
while ($line ne "")
  {
  tie %item, "SDBM_File", 'mesh', O_RDWR, 0644; #use the created file
  $line = <TEXT>;
  @linearray = split(/\n/,$line);
  foreach $piece (@linearray)
    {
    if ($piece =~ /MN = /)
      {
      $meshno = $';   #the text after the match is the MESH number
      }
    if ($piece =~ /ENTRY = /)
      {
      $entry = $';
      if ($entry =~ /\|/o)
        {
        $entry = $`;   #keep only the term that precedes the first "|"
        }
      $entry =~ s/s\b//g;   #drop a trailing "s" from each word
      $entry = lc($entry);
      push (@synonyms, $entry);
      }
    }
  foreach $term (@synonyms)
    {
    $item{$term} = $meshno;   #each synonym points to the MESH number
    }
  undef $meshno;
  undef @synonyms;
  untie %item;
  }
undef(%item);
close TEXT;
exit;
|
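Because List 5.21.2 stores each synonym only after trimming it, any later lookup has to trim the query term the same way. The following is a minimal sketch of that idea. The name normalize_term is a hypothetical helper, not part of the original script; it repeats the three steps applied to each ENTRY value in the listing (keep the text before the first "|", strip a word-final "s", convert to lowercase).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original listing): applies the same
# normalization that List 5.21.2 applies before storing a synonym.
sub normalize_term {
    my ($term) = @_;
    $term =~ s/\|.*$//s;   # keep only the text before the first "|"
    $term =~ s/s\b//g;     # crude de-pluralization, as in the listing
    return lc($term);      # keys are stored in lowercase
}

print normalize_term("Heparin, Sodium"), "\n";                       # heparin, sodium
print normalize_term("Liquaemin|T118|T121|T123|TRD|NRW|UNK"), "\n";  # liquaemin
print normalize_term("Fibrinolytic Agents"), "\n";                   # fibrinolytic agent
```

Note that the word-final "s" rule is deliberately crude: it also shortens words that are not plurals (for example, "gas" becomes "ga"), which is harmless only as long as stored terms and query terms pass through the same routine.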
List 5.22.1. Retrieving a persistent database object from the MESH flat-file.
1 #!/usr/bin/perl
2 use Fcntl;
3 use SDBM_File;
4 tie%item, "SDBM_File", 'mesh', O_RDWR, 0644;
5 while(($key, $value) = each (%item))
6 {
7 print "$key => $value\n";
8 }
9 untie%item;
10 exit;
|
List 5.23.1. Syntax rules for valid XML tags.
1 XML tags, like Perl variable names, are case-sensitive
("Name" is different from "name").
Parsers must preserve character case
2 Letters, underscores, hyphens, periods and numbers may be
used in a tag
3 Only letters and underscores are eligible as the first
character
4 Colons are allowed, but only as part of a declared namespace
prefix. For all practical purposes, this means that only one
colon is allowed in a tag, and the colon must appear in an
internal location in the tag (not at the beginning or the
end of a tag).
|
List 5.23.2. Tagcheck.pl, a program that validates XML tags.
1 #!/usr/bin/perl
2 @elements = qw (gene 4gene gene:ncbi gene-autry ge::ne
gene&autry -gene _gene gene- gene:
:gene ge:n:e ge:ne: ge,ne ge.ne);
3 foreach $value (@elements)
4 {
5 if ($value =~
/^[a-z\_][a-z0-9\-\.\_]*[\:]?[a-z0-9\-\.\_]*$/i)
6 {
7 print "$value is good\n";
8 }
9 else
10 {
11 print "$value is bad\n";
12 }
13 }
14 exit;
|
List 5.23.3. Output of tagcheck.pl 1 c:\ftp>perl tagcheck.pl 2 gene is good 3 4gene is bad 4 gene:ncbi is good 5 gene-autry is good 6 ge::ne is bad 7 gene&autry is bad 8 -gene is bad 9 _gene is good 10 gene- is good 11 gene: is good 12 :gene is bad 13 ge:n:e is bad 14 ge:ne: is bad 15 ge,ne is bad 16 ge.ne is good |
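The output above reports "gene:" as good, although rule 4 of List 5.23.1 says a colon may not end a tag. The pattern in tagcheck.pl makes the text after the colon optional, so a trailing colon slips through. The sketch below shows one possible tightening, under stated assumptions: is_valid_tag is a hypothetical name, and requiring the part after the colon to begin with a letter or underscore is borrowed from the XML Namespaces convention that both the prefix and the local part are names in their own right.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stricter check: an optional single colon must be followed
# by a complete name part, so "gene:" and "ge:ne:" are both rejected.
sub is_valid_tag {
    my ($tag) = @_;
    return ($tag =~ /^[a-z_][a-z0-9._-]*(?::[a-z_][a-z0-9._-]*)?$/i) ? 1 : 0;
}

foreach my $tag (qw(gene gene:ncbi gene: ge:ne: :gene ge:n:e gene-)) {
    print "$tag is ", (is_valid_tag($tag) ? "good" : "bad"), "\n";
}
```

With this pattern only "gene", "gene:ncbi" and "gene-" from the sample list are accepted.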
List 5.24.1. What we have learned so far.
1 The =~ operator tells Perl to look for the pattern that
follows the operator in the variable that precedes the
operator. Regular Expressions are Perl's way of describing a
pattern
2 You can create most of your patterns by following a few
simple rules and by "borrowing" regular
expressions from published listings
3 The most common use of regular expressions is in scripts
that examine a line (or all the lines) from a file and that
perform a substitution or rearrangement or other operation
on the line, based on the results of the pattern match
4 Regular expressions are a powerful and fast tool for
modifying text or data records or finding exactly what you
want in any text
5 Perl associative arrays can be tied to an external database
object that persists even when the Perl script has finished
executing.
|
List 6.1.1. Some biomedical informatics tasks that can be accomplished with Perl.
1 Statistics
2 Mathematical Computations
3 Mathematical modeling
4 Web protocols (e.g., http and ftp)
5 Cryptographic techniques
6 Integrating data
7 Glue functions (e.g., calling subroutines written in C)
8 Digital Signal Processing (including Image analysis)
9 Bioinformatics methods (e.g. interfacing to Blast)
10 Database interfaces
11 Remote procedure calls and distributed computing
12 Middleware (see Glossary); software agents (via web
services, GRID, SOAP (see Glossary), or related protocols)
13 Transformations to and from XML
14 XML data queries
15 Logical annotation of data (e.g., RDF)
|
List 6.2.1. Creating an MD_5 one-way hash value for any provided string.
1 #!/usr/local/bin/perl
2 use MD5;
3 print "What words would you like to digest?\n";
4 $holdstring = <STDIN>;
5 chomp($holdstring); #a bare chomp acts on $_, not on $holdstring
6 $hexhashstring = MD5->hexhash($holdstring);
7 print "md_5 hexhash => $hexhashstring\n";
8 exit;
|
List 6.2.2. Three executions of the MD_5 algorithm. 1 Execution 1: 2 c:\ftp>perl md5_word.pl 3 What words would you like to digest? 4 Jules Berman 5 md_5 hexhash => 0ab7ad79962fd2ea036cc8dbaade6f2a |
List 6.2.3. Creating an MD_5 one-way hash for a file.
1 #!/usr/local/bin/perl
2 use MD5;
3 print "What file would you like to digest?\n";
4 $holdfile = <STDIN>;
5 chomp($holdfile); #a bare chomp acts on $_, not on $holdfile
6 open (TEXT, $holdfile)||die"Cannot open $holdfile";
7 $context = new MD5;
8 $context->addfile(TEXT);
9 $digest = $context->digest();
10 print (unpack ("H*", $digest));
11 exit;
|
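On Perl installations where the older MD5 module used in Lists 6.2.1 and 6.2.3 is not available, the same two tasks can be sketched with Digest::MD5, which ships with current Perl distributions. This is an alternative under that assumption, not the listing's own method; file_md5_hex is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: returns the MD5 digest of a file's raw bytes
# as a 32-character hexadecimal string.
sub file_md5_hex {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);                                   # digest bytes, not text
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $hex;
}

# Digest of a string; "abc" is one of the standard MD5 test inputs.
print md5_hex("abc"), "\n";   # 900150983cd24fb0d6963f7d28e17f72

# Digest of this script's own file, as a self-contained demonstration.
print file_md5_hex($0), "\n";
```

Because a one-way hash changes completely when a single character changes, whether the input string still carries its trailing newline matters; the string and the string plus a newline produce unrelated digests.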
List 6.3.1. Simple Perl script for computing the mean from an array of numbers.
1 #!/usr/bin/perl
2 #mean.pl
3 #computes the mean of an array of numbers
4 @numbersarray = (1,2,3,4,5,6,7,8,9,10);
5 $arraysize = scalar(@numbersarray);
6 print "The number of elements in our array is $arraysize\n";
7 $sum = 0;
8 foreach $value(@numbersarray)
9 {
10 $sum = $sum + $value;
11 }
12 $mean = $sum / $arraysize;
13 print "Your population number is $arraysize\n";
14 print "The array mean is $mean\n";
15 exit;
|
List 6.3.2. General method of building an array that can be used in a statistical or mathematical Perl routine.
1 Open the file containing your records
2 Go through the file, one line (record) at a time
3 From a complex record, pick out the number you want using
Regex
4 Add that number to your array variable (using the Perl push
command)
5 Calculate the mean (or any other statistical test) on the
array variable.
|
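The five steps of List 6.3.2 can be sketched as follows. The record layout (pipe-delimited fields with an "age=" entry) and the name mean_from_records are invented for illustration; an in-memory string stands in for the record file so that the example is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reads records from a filehandle, pulls one number out of each record
# with a regular expression, and returns the mean of the numbers found
# (undef if no record matched).
sub mean_from_records {
    my ($fh, $pattern) = @_;
    my @numbers;
    while (my $record = <$fh>) {          # one line (record) at a time
        if ($record =~ $pattern) {        # pick out the wanted number
            push(@numbers, $1);           # add it to the array
        }
    }
    return undef unless @numbers;
    my $sum = 0;
    $sum += $_ foreach @numbers;
    return $sum / scalar(@numbers);
}

# Hypothetical records; the third one lacks an age and is skipped.
my $records = "patient=A|age=40|dx=adenoma\n"
            . "patient=B|age=50|dx=carcinoma\n"
            . "patient=C|dx=lipoma\n"
            . "patient=D|age=60|dx=adenoma\n";
open(my $fh, '<', \$records) or die "Cannot open in-memory file";
my $mean = mean_from_records($fh, qr/\bage=([0-9]+)/);
close($fh);
print "The mean age is $mean\n";   # The mean age is 50
```

Keeping the regular expression as a parameter means the same routine can be pointed at a different field without changing the loop.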
List 6.3.3. Computing the mean of an array entered at keyboard.
1 #!/usr/bin/perl
2 #mean2.pl
3 #computes the mean of an array of numbers entered at keyboard
4 print "Type a bunch of numbers, pressing the return key\n";
5 print "after each number. Decimal numbers are allowed\n\n";
6 $number = " ";
7 until ($number eq "")
8 {
9 $number = <STDIN>;
10 $number =~ s/\n//o; #deletes the newline character
11 if ($number eq "")
12 {
13 next;
14 }
15 if ($number !~ /[0-9]+/) #the entry must contain at least one digit
16 {
17 print "You're only allowed to enter numbers...";
18 print " We just won't count this entry\n";
19 next;
20 } [ if ($number !~ /^[0-9
|
List 6.3.4. Output of mean2.pl.
1 C:\ftp>perl mean2.pl
2 Type a bunch of numbers, pressing the return key after each
number. Decimal numbers are allowed
|
List 6.4.1. Some of the available Perl statistics modules.
1 Statistics-Basic, Statistics-ChisqIndep
2 Statistics-ChiSquare, Statistics-Contingency
3 Statistics-ConwayLife, Statistics-DEA
4 Statistics-DependantTTest, Statistics-Descriptive
5 Statistics-Descriptive-Discrete, Statistics-Distributions
6 Statistics-Frequency, Statistics-GammaDistribution
7 Statistics-KruskalWallis, Statistics-LineFit
8 Statistics-Lite, Statistics-LogRank
9 Statistics-LSNoHistory, Statistics-LTU
10 Statistics-OLS, Statistics-RankCorrelation
11 Statistics-RankOrder, Statistics-Regression
12 Statistics-ROC, Statistics-SerialCorrelation
13 Statistics-Shannon, Statistics-Simpson
14 Statistics-Table-F, Statistics-Test
15 Statistics-TTest

Before you can use these tests, you must download the
appropriate module into your Perl installation. A sample
installation of Statistics-Descriptive (by Colin Kuskie,
Andrea Spinelli and Jason Kastner) through the ActiveState
package manager is shown here. Only the first line is input:

ppm> install statistics-descriptive
====================
Install 'statistics-descriptive' version 2.6 in ActivePerl 5.8.7.815.
====================
Downloaded 10294 bytes.
Extracting 5/5: blib/arch/auto/Statistics/Descriptive/.exists
Installing C:\activepl\html\site\lib\Statistics\Descriptive.html
Installing C:\activepl\site\lib\Statistics\Descriptive.pm
Successfully installed statistics-descriptive version 2.6 in
ActivePerl 5.8.7.815.
|
List 6.4.2. Perl script for calculating variance. 1 #!/usr/local/bin/perl 2 use Statistics::Descriptive; 3 $stat = Statistics::Descriptive::Full->new(); 4 $stat->add_data(1,2,3,4,5,6,7,8,9,10); 5 $mean = $stat->mean(); 6 $var = $stat->variance(); 7 print "mean $mean\nvariance $var\n"; 8 exit; |
List 6.4.3. Output of statistics script. 1 c:\ftp>perl stat.pl 2 mean 5.5 3 variance 9.16666666666667 |
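The variance reported above (9.1667 for the numbers 1 through 10) is the sample variance, in which the sum of squared deviations is divided by n - 1 rather than n; dividing by n would give 8.25. The following sketch recomputes the figure without any module, as a check on the arithmetic. The name sample_variance is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample variance: sum of squared deviations from the mean,
# divided by (n - 1). Returns undef for fewer than two values.
sub sample_variance {
    my @data = @_;
    my $n = scalar(@data);
    return undef if $n < 2;
    my $sum = 0;
    $sum += $_ foreach @data;
    my $mean = $sum / $n;
    my $squares = 0;
    $squares += ($_ - $mean) ** 2 foreach @data;
    return $squares / ($n - 1);
}

# For 1..10 the mean is 5.5 and the squared deviations sum to 82.5,
# so the sample variance is 82.5 / 9.
printf("variance %.4f\n", sample_variance(1 .. 10));   # variance 9.1667
```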
List 6.4.4. Perl script for computing the ChiSquare statistic.
1 #!/usr/bin/perl
2 use Statistics::ChiSquare;
3 print chisquare([1, 9, 1, 15, 4, 7]), "\n";
4 print chisquare([20, 20, 20, 30, 20, 20, 30 ]),
"\n";
5 exit;
|
List 6.4.5. Output of chi.pl.
1 C:\ftp>perl chi.pl
2 There's a <1% chance that this data is random
3 There's a >50% chance, and a <70% chance, that this
data is random.
|
List 6.5.1. Types of statistical errors.
1 Type 1 error. Rejecting the null hypothesis when the null
hypothesis is correct (i.e., seeing an effect when there was
none)
2 Type 2. Accepting the null hypothesis when the null
hypothesis is false (i.e., seeing no effect when there was
one)
3 Type 3. Rejecting the null hypothesis correctly, but for the
wrong reason, leading to an erroneous interpretation of the
data in favor of an incorrect affirmative statement
4 Type 4. Erroneous conclusion based on performing the wrong
statistical test. The type 4 error is the most embarrassing
and the least excusable. You cannot blame a type 4 error on
the data. It's all on you.

Considering the rich variety of exotic statistical tests
available to the novice, the opportunities for type 4 errors
are endless. One way of avoiding type 4 errors is to have a
dedicated statistician analyze your data. For those
informaticians who have access to the services of a
trustworthy statistician, this may actually be the best and
most practical solution. There is, however, an alternative
approach: resampling. Resampling is a type of statistical
analysis that uses computers to model experiments and then
repeats the experiments thousands or millions of times to
determine the occurrence frequencies for particular sets of
data. This area of statistics was popularized by Bradley
Efron, and may have particular interest for readers of this
book (see List).

List. Reasons why resampling statistics are of interest to
biomedical informaticians
1 Does not require any knowledge of statistical tests
2 Applicable to a wide range of problems, including clinical
trial design and decision analyses
3 Easy to understand
4 Easy to program with Perl
|
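As an illustration of how little code a resampling analysis needs, here is a minimal sketch of one common resampling procedure, the bootstrap, used to put a rough 95% interval around a mean. The data values, the number of trials, and the name bootstrap_mean_interval are all assumptions made for the example; the results vary slightly from run to run because the resamples are random.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Draws $trials resamples (with replacement) from the data, computes the
# mean of each resample, and returns the 2.5th and 97.5th percentiles of
# those means as an approximate 95% interval.
sub bootstrap_mean_interval {
    my ($data, $trials) = @_;
    my $n = scalar(@$data);
    my @means;
    for my $trial (1 .. $trials) {
        my $sum = 0;
        for my $draw (1 .. $n) {
            $sum += $data->[ int(rand($n)) ];   # pick one value at random
        }
        push(@means, $sum / $n);
    }
    @means = sort { $a <=> $b } @means;
    my $low  = $means[ int(0.025 * $trials) ];
    my $high = $means[ int(0.975 * $trials) - 1 ];
    return ($low, $high);
}

# Hypothetical measurements
my @values = (12, 15, 9, 22, 17, 14, 11, 19, 16, 13);
my ($low, $high) = bootstrap_mean_interval(\@values, 10000);
print "Approximate 95% interval for the mean: $low to $high\n";
```

No distributional formula is consulted at any point; the interval comes entirely from repeating the sampling experiment.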
List 6.6.1. Randtest.pl, a Perl script that simulates 600,000 casts of the die.
1 #!/usr/bin/perl
2 #randtest.pl
3 #Simulation of a throw of a die
4 $count = 0;
5 while ($count < 600000)
6 {
7 $count++;
8 $one_of_six = (int(rand(6))+1);
9 $hash{$one_of_six}++;
10 }
11 while(($key, $value) = each (%hash))
12 {
13 print "$key => $value\n";
14 }
15 exit;
|
List 6.6.2. Output of first test of randtest.pl. 1 C:\ftp>perl randtest.pl 2 1 => 100002 3 2 => 99902 4 3 => 99997 5 4 => 100103 6 5 => 99926 7 6 => 100070 |
List 6.6.3. Output of second test of randtest.pl. 1 C:\ftp>perl randtest.pl 2 1 => 100766 3 2 => 99515 4 3 => 100157 5 4 => 99570 6 5 => 100092 7 6 => 99900 |
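One way to judge whether counts like those in the two outputs above are consistent with a fair die is to compute the chi-square statistic against the expected count of 100,000 per face. The sketch below does that arithmetic directly; chi_square_uniform is a hypothetical helper name, and the comparison value of 11.07 is the standard 5% critical value for 5 degrees of freedom.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Chi-square statistic for observed counts against the hypothesis that
# every category is equally likely.
sub chi_square_uniform {
    my @observed = @_;
    my $total = 0;
    $total += $_ foreach @observed;
    my $expected = $total / scalar(@observed);
    my $chi = 0;
    $chi += ($_ - $expected) ** 2 / $expected foreach @observed;
    return $chi;
}

# Counts taken from the first and second runs of randtest.pl
printf("%.5f\n", chi_square_uniform(100002, 99902, 99997, 100103, 99926, 100070));  # 0.30602
printf("%.5f\n", chi_square_uniform(100766, 99515, 100157, 99570, 100092, 99900));  # 10.49994
```

Both values fall below 11.07, so neither run gives grounds, at the 5% level, for rejecting the assumption that the simulated die is fair, even though the second run looks noticeably more uneven than the first.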
List 6.6.4. Ranfile.pl, a Perl script that assigns random names to newly created files.
1 #!/usr/bin/perl
2 #ranfile.pl
3 #Makes 10 random filenames, each with 8 leading characters,
4 #a period and three trailing characters
5 while ($count < 10)
6 {
7 $count++;
8 &ranfile;
9 }
10 sub ranfile
11 {
12 my @listchar;
13 my $count;
14 for ($count = 1; $count <= 12; $count++)
15 {
16 push(@listchar, chr(int(rand(26))+65));
17 }
18 $listchar[8]= ".";
19 my $randomfilename = join("",@listchar);
20 print "Your filename is $randomfilename\n";
21 return $randomfilename;
22 }
23 exit;
|
List 6.6.5. Output of ranfile.pl. 1 C:\ftp>perl ranfile.pl 2 Your filename is EKDUFKBR.YNX 3 Your filename is QVDKUVBY.QUI 4 Your filename is FNZXNKEE.MLV 5 Your filename is NRTXEHQI.VFX 6 Your filename is GWMOLKMX.AYU 7 Your filename is LZAKZQDW.RYR 8 Your filename is PRUAONQQ.OSJ 9 Your filename is XDEDHLKD.GAY 10 Your filename is RUSLNSXI.XVR 11 Your filename is IEPGAWDP.LEH |
List 6.7.1. Ai.pl, a Perl script that simulates clonal tumor growth.
1 #!/usr/bin/perl
2 #ai.pl
3 #Simulates the growth of a tumor from a single cell, with
4 #a cell death probability per generation as provided by the user
5 print "Enter the death probability for your simulation\n";
6 print "Number must be between zero and one.\n";
7 print "Most realistic numbers are .45 to .50\n";
8 $value = <STDIN>;
9 $value =~ s/\n//o;
10 if ($value < 0 || $value > 1)
{
11 print "Exiting... you must pick a number between zero and one\n";
12 exit;
13 }
14 print "THE CELL DEATH PROBABILITY FOR THIS SIMULATION IS $value\n\n";
15 my $roundnumber = 1; #initiate the generation counter
16 &cycle;
|
List 6.7.2. Output of ai.pl.
1 C:\ftp>perl ai.pl
2 Enter the death probability for your simulation
Number must be between zero and one.
Most realistic numbers are .45 to .50
.46
THE CELL DEATH PROBABILITY FOR THIS SIMULATION IS .46
Starting with a single malignant cell, let's watch the
clonal growth. Tumor terminated...good!
3 Starting with a single malignant cell, let's watch the
clonal growth. 1 Tumor terminated...good!
4 Starting with a single malignant cell, let's watch the
clonal growth. 2 1 4 2 1 1 1 Tumor terminated...good!
5 Starting with a single malignant cell, let's watch the
clonal growth. 2 1 5 6 8 8 8 12 15 18 18 20 19 27 32 30 31
20 14 16 23 30 30 36 38 34 50 52 67 75 97 114 133 143 150
156 159 178 200 254 302 292 329 336 382 441 489 603 630 701
770 862 923 1056 1084 1210 1369 1473 1664 1776 1959 2196
2475 2862 3098 3327 3740 4095 4634 Bad news. Let's stop
watching this malignancy
6 Starting with a single malignant cell, let's watch the
clonal growth. Tumor terminated...good!
7 Starting with a single malignant cell, let's watch the
clonal growth. Tumor terminated...good!
8 Starting with a single malignant cell, let's watch the
clonal growth. 3 1 3 5 3 1 1 Tumor terminated...good!
9 Starting with a single malignant cell, let's watch the
clonal growth. 4 6 5 6 3 3 1 Tumor terminated...good!
10 Starting with a single malignant cell, let's watch the
clonal growth. 2 2 5 3 2 2 6 5 6 5 4 2 1 Tumor
terminated...good!
11 Starting with a single malignant cell, let's watch the
clonal growth. 3 5 3 7 6 3 3 1 1 Tumor terminated...good!
I've seen enough!
|
List 6.7.3. Perl snippet showing the algorithm that repeatedly assigns probabilistic outcomes to an event.
1 while ($i < $sum +1)
2 {
3 $i++;
4 $randnum = int( |
List 6.8.1. Run.pl, a resampling script in Perl, that simulates runs of errors.
1 #!/usr/local/bin/perl
2 $errorno = 0;
3 while ($count < 100000)
4 {
5 $count++;
6 $x = rand(100);
7 if ($x < 2)
8 #simulates a 2% error rate
9 {
10 $errorno++;
11 }
12 else
13 {
14 $errorno = 0;
15 }
16 if ($errorno == 3)
17 {
18 print "Uh oh. 3 consecutive errors\n";
19 $errorno = 0;
20 }
21 }
22 exit;

The Perl script simulates 100,000 diagnoses, which is a fair
estimate of the total number of diagnoses a pathologist might
render in an entire career (at 4,000 diagnoses per year over
25 years of service). Each diagnosis is assigned a random
number from 0 up to (but not including) 100. The "diagnosis"
loop is repeated 100,000 times. In each loop, if the randomly
assigned number is less than 2, the pathologist's error
number is incremented by 1. If the next diagnosis is randomly
assigned a number of 2 or greater, the error number is
dropped back down to 0 (i.e., the diagnosis is correct and
the run of errors is broken). If an error occurs on 3
consecutive occasions, the event is printed to the computer
monitor (see List).

List. Output of run.pl
23 c:\ftp>perl run.pl
24 Uh oh. 3 consecutive errors
25 Uh oh. 3 consecutive errors
|
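The simulated result can be compared with a back-of-the-envelope estimate. With a 2% error rate, any particular three consecutive diagnoses are all wrong with probability 0.02 x 0.02 x 0.02 = 0.000008, and a career of 100,000 diagnoses contains 99,998 such triplets. The sketch below does this arithmetic; expected_runs is a hypothetical helper, and the estimate ignores the small effect of overlapping runs, so it is an approximation rather than an exact expectation for run.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Approximate expected number of runs of $run consecutive errors among
# $n independent diagnoses, each wrong with probability $p.
sub expected_runs {
    my ($p, $n, $run) = @_;
    return ($n - $run + 1) * $p ** $run;
}

printf("%.3f\n", expected_runs(0.02, 100000, 3));   # 0.800
```

An expected count of about 0.8 per career means that seeing one or two such runs, as in the output above, is unremarkable for a pathologist whose true error rate is 2%.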
List 6.9.1. Snippet of Perl code to determine unbiased random selection.
1 open
(HOLD,">holder.txt")||die"cannot";
2 while ($n < 1000000)
3 {
4 $x = |
List 6.10.1. Output of montesw.pl. 1 C:\ftp>perl montesw.pl 2 6598 3 C:\ftp>perl monteno.pl 4 3408 |
List 6.11.1. Ceil.pl, calling a POSIX function from a Perl script.
1 #!/usr/local/bin/perl
2 use POSIX qw(ceil floor);
3 $num = 11.3;
4 print "Floor is ", floor($num),
"\n";
5 print "Ceil is ", ceil($num), "\n";
6 exit;
|
List 6.11.2. Output of ceil.pl. 1 c:\ftp>perl ceil.pl 2 Floor is 11 3 Ceil is 12 |
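The difference between these functions and Perl's built-in int() shows up with negative numbers: floor always rounds toward negative infinity and ceil toward positive infinity, while int simply drops the fractional part. A short sketch, using -11.3 as an arbitrary example value:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(ceil floor);

foreach my $num (11.3, -11.3) {
    print "$num: floor ", floor($num), ", ceil ", ceil($num),
          ", int ", int($num), "\n";
}
# 11.3: floor 11, ceil 12, int 11
# -11.3: floor -12, ceil -11, int -11
```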
List 6.12.1. Using the ActiveState Programmer's Package Manager. 1 c:\ftp>ppm 2 ppm - programmer's package manager version 3.3 3 copyright (c) 2001 activestate corp. all rights reserved 4 activestate is a division of sophos. |
List 6.12.2. Simple example script for the Fast Fourier Transform Module.
1 #!/usr/local/bin/perl
use Math::FFT;
2 my $PI = 3.1415926536;
3 my $N = 8; #N can be any power of 2, such as 4,8,16,64
4 $series = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]; #could be anything
5 print "series " . join(" ",@$series). "\n";
6 my $fft = new Math::FFT($series);
7 my $coeff = $fft->rdft();
8 print "coefficients \n @{$coeff}\n\n";
9 my $spectrum = $fft->spctrm;
10 print "spectrum \n @{$spectrum}\n";
11 exit;
|
List 6.12.3. Output of Fast Fourier Transform script 1 C:\FTP>perl fft.pl 2 series 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
List 6.13.1. A full concordance in 10 commands.
1 #!/usr/local/bin/perl
2 open (TEXT, "1DFRE10.TXT");
3 open (OUT, ">1DFRE10.OUT");
4 $line = " ";
5 while ($line ne "")
6 {
7 $cumline = "";
8 |
List 6.13.2. A full indexing script in 10 commands.
1 #!/usr/local/bin/perl
2 open (TEXT, "1DFRE10.TXT")||die"cannot";
3 open (OUT, ">1DFRE10.OUT")||die"cannot";
4 $line = " ";
5 @indextermarray = ("gaul","roman empire","emperor","village","england");
6 while ($line ne "")
7 {
8 $cumline = "";
9 |
List 6.13.3. An excerpted output for the indexing program, listing only the terms "england" and "village" and the pages on which they are found. 1 c:\ftp>perl indexer.pl 2 3 4 5 england = 5 14 23 134 207 208 229 277 6 7 8 9 village = 43 77 81 94 128 141 147 184 185 225 226 244 |
List 6.13.4. Problems with human-based indexing.
1 Incredibly labor-intensive and time consuming
2 The index cannot be built until the book is in final form
and the page numbers are known, delaying the publication of
the book until the indexing is completed
3 If important phrases are omitted completely or if one or
more of their locations are omitted, no one will likely
catch the error
4 The indexing effort needs to be repeated if there are book
revisions and pagination changes.
|
List 6.13.5. Extracting candidate index phrases from a test file.
1 #!/usr/local/bin/perl
@stop = qw(
2 a about absent absence again all almost also although always
among an
3 and another any are as at be because been before being
between both
4 but by can could cm did do does done due during each either
enough
5 especially etc for found from further had has have having
here how
6 however i if in into is it its itself just kg km made mainly
make may
7 mg might ml mm most mostly must nearly neither no nor
observed
8 obtained often on our overall perhaps present presence quite
rather
9 really regarding seem seen several should show showed shown
shows
10 significantly since so some such than that the their theirs
them then
11 there therefore these they this those through thus to upon
use used
12 using various very was we were what when which while with
within
13 without would or can't doesn't not
14 );
15 open (TEXT,"1DFRE10.TXT")||die"Cannot";
16 open (OUT,">1DFRE10.OUT")||die"Cannot";
17 undef($/);
18 $phrase = <TEXT>;
19 $phrase =~ s/\n/ /g;
20 $phrase = lc($phrase);
21 $phrase =~ s/[^a-z \']/ /g;
22 foreach $stopword (@stop)
23 {
24 $phrase =~ s/ $stopword / \# /g;
25 }
26 $phrase =~ s/[\s]+/ /g;
27 $phrase =~ s/ ?\# ?/\#/g;
28 @phraselist = sort (split("#",$phrase));
29 @phraselist = grep
30 {$i{$_}++;(($i{$_}==2)&&(scalar(split(" ",$_))>1));}@phraselist;
31 print OUT join("\n",@phraselist);
exit;

List. First 9 lines of output from phrase.pl script
32 abate fortis
33 abbe foucher
34 abdication of diocletian
35 abilities of
36 able leader
37 abolition of
38 absolute power
39 abuse of
40 academy of inscriptions
|
List 6.14.1. Algorithm for regular expression searches of text files.
1 Asks you for a regular expression to search a file. If
you're not adept at regular expressions, just enter any
word. Remember, a word or phrase is always the simplest
regular expression. In the output example, we'll search for
the word "adenocarcinoma"
2 If you enter the return-key without entering a regular
expression, it simply exits the script
3 Asks Perl to give you the current epoch time (the number of
seconds elapsed since a fixed reference point, which on most
systems is midnight UTC on January 1, 1970)
4 Opens an enormous publicly available file (138 Mbytes)
named MRCON (we'll learn a lot about this file in Biomedical
Perl)
5 Reads every line of MRCON (about 2 million of them),
testing each line to see if it contains a substring that
matches the regular expression that you provided (step 1)
6 If it finds a match, it adds the line number and the line
to an external file named regexout.txt
7 When it's finished reading the file, it asks Perl again
for the epoch time, and determines the script execution time
by subtracting the script's beginning time from the script's
end time
8 It prints to the monitor the time spent executing the
script, as well as the filename containing the output of all
the lines from the MRCON file that matched your provided
regular expression.
|
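The core of the algorithm above (steps 3 and 5 through 8) can be sketched as follows. This is an illustration under stated assumptions, not the author's perlfind.pl: search_lines is a hypothetical name, the match is made case-insensitive, and a three-line in-memory string with invented content stands in for the MRCON file so that the example runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copies every line of $in_fh that matches $regex
# to $out_fh, prefixed with its line number. Returns the match count.
sub search_lines {
    my ($regex, $in_fh, $out_fh) = @_;
    my $re = qr/$regex/i;        # compile once; case-insensitive (assumption)
    my $matches = 0;
    my $line_number = 0;
    while (my $line = <$in_fh>) {
        $line_number++;
        if ($line =~ $re) {
            print {$out_fh} "$line_number\t$line";
            $matches++;
        }
    }
    return $matches;
}

# Invented sample text standing in for a large file
my $sample = "adenocarcinoma of lung\n"
           . "squamous cell carcinoma\n"
           . "Adenocarcinoma, mucinous\n";
open(my $in, '<', \$sample) or die "Cannot open in-memory file";

my $start = time();                                   # epoch time before
my $count = search_lines("adenocarcinoma", $in, \*STDOUT);
my $elapsed = time() - $start;                        # end minus beginning
close($in);
print "$count matching lines; retrieval time is $elapsed seconds\n";
```

Compiling the pattern once with qr, outside the loop, avoids recompiling it for each of the roughly 2 million lines.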
List 6.14.2. Perl script for regular expression searches of text files.
1 #!/usr/bin/perl
2 #perlfind.pl
3 #11/20/01
4 #this will pull out all the matching lines for a prompted
5 #regular expression from any text file. This short script is incredibly
6 #powerful, but it requires the user to have facility creating
7 #regular expressions
8 open (OUT, ">regexout.txt")||die"Can't open file regexout.txt";
9 $filename = "regexout\.txt";
10 print "What's your search regex?\n";
11 $regex = <STDIN>;
12 $regex =~ s/\n//o;
13 if ($regex eq "")
14 {
15 close TEXT;
16 close OUT;
17 print "\nYou didn't give a regex...Goodbye\n"; exit;
18 }
19 #$re = qr/$regex/oi;
20 $start = time();
21 &searchsub;
22 $end = |
List 6.14.3. Output of regular expression search. 1 C:\ftp>perl perlfind.pl 2 What's your search regex? 3 adenocarcinoma 4 Retrieval time is 5 seconds 5 Your search results are in file regexout.txt. |
List 6.15.1. A short script that performs a binary search on a file.
1 #!/usr/local/bin/perl
2 open (TEXT, "find_bin.txt");
3 seek(TEXT, 0, 2);
4 print "What word would you like to find?\n";
5 $findword = <STDIN>;
6 $findword =~ s/\n$//o;
7 $filesize = tell (TEXT);
8 |
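Since the listing above is cut short, the following is a separate, minimal sketch of how a binary search over a sorted text file can be written with seek and tell. It is not a reconstruction of the author's script. It assumes one word per line, sorted in the byte order used by Perl's lt operator; file_contains_word is a hypothetical name, and a temporary file with invented words stands in for find_bin.txt.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Returns 1 if $word is a line of the sorted file behind $fh, else 0.
sub file_contains_word {
    my ($fh, $word) = @_;
    seek($fh, 0, 2);                      # jump to the end of the file
    my ($min, $max) = (0, tell($fh));     # search window, in bytes
    while ($max - $min > 1) {
        my $mid = int(($min + $max) / 2);
        seek($fh, $mid, 0);
        my $skipped = <$fh>;              # discard the (partial) line at $mid
        my $line = <$fh>;                 # first complete line after $mid
        chomp($line) if defined($line);
        if (defined($line) && $line lt $word) {
            $min = $mid;                  # target lies later in the file
        }
        else {
            $max = $mid;                  # target lies at or before this line
        }
    }
    seek($fh, $min, 0);                   # finish with a short linear scan
    if ($min > 0) { my $skipped = <$fh>; }
    while (defined(my $line = <$fh>)) {
        chomp($line);
        return 1 if $line eq $word;
        return 0 if $line gt $word;
    }
    return 0;
}

# Build a small sorted file of invented words
my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh);
print {$fh} "$_\n" foreach sort qw(liver lung heparin adenocarcinoma kidney);

foreach my $word (qw(heparin spleen)) {
    print "$word: ", (file_contains_word($fh, $word) ? "found" : "not found"), "\n";
}
# heparin: found
# spleen: not found
```

Each pass halves the byte window, so the number of reads grows with the logarithm of the file size rather than with the number of lines.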
List 6.16.1. Cluster.pl, a Perl script demonstrating clustering algorithm. 1 #!/usr/local/bin/perl 2 use Algorithm::Cluster; |
List 6.16.2. Output of cluster.pl script. 1 c:\ftp>perl cluster.pl 2 Row0 => Cluster 1 3 Row1 => Cluster 2 4 Row2 => Cluster 2 5 Row3 => Cluster 1 6 Row4 => Cluster 0 7 Row5 => Cluster 0 |
List 6.17.1. Example of a very simple program using the LWP (Library for WWW in Perl). 1 #!/usr/bin/perl 2 use LWP::Simple; 3 print (get "http://www.nih.gov"); 4 exit; |
List 6.18.1. Some Perl books in bioinformatics (a very different field from biomedical informatics)
1 Beginning Perl for Bioinformatics, by James Tisdall
2 Mastering Perl for Bioinformatics, by James Tisdall
3 Genomic Perl: From Bioinformatics Basics to Working Code by
Rex A. Dwyer
4 Perl Programming for Biologists, by D. Curtis Jamison
5 Developing Bioinformatics Computer Skills by Per Jambeck and
Cynthia Gibas
6 Bioinformatics Biocomputing and Perl: An Introduction to
Bioinformatics Computing Skills, by Michael Moorhouse and
Paul Barry
|
List 6.18.2. A simple DNA palindrome, GAATTC. |
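A DNA palindrome is a sequence that equals its own reverse complement: reading the opposite strand in the opposite direction gives the same letters. The short sketch below checks this for GAATTC; reverse_complement is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reverses a DNA sequence and swaps each base for its complement
# (A with T, C with G).
sub reverse_complement {
    my ($sequence) = @_;
    my $result = reverse($sequence);   # scalar context: reverses the string
    $result =~ tr/ACGT/TGCA/;
    return $result;
}

my $site = "GAATTC";
print "$site reversed and complemented is ", reverse_complement($site), "\n";
# GAATTC reversed and complemented is GAATTC
```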
List 6.18.3. Perl script for finding palindromes in a gene sequence.
1 #!/usr/bin/perl
2 $filename = "sample";
3 open (TEXT, "sample")||die"Cannot";
4 $line = " ";
5 $count = 0;
6 for $n (5..20)
7 {
8 $re = qr /[CAGT]{$n}/;
9 $regexes[$n-5]= $re;
10 }
11 NEXTLINE: while ($count < 1000)
12 {
13 $line = <TEXT> ;
14 $count++;
15 foreach my $value (@regexes)
16 {
17 $start = 0;
18 while ($line =~ /$value/g)
19 {
20 $endline = $';
21 $match = $&;
22 $revmatch = reverse($match);
23 $revmatch =~ tr/CAGT/GTCA/;
24 if ($endline =~ /^([CAGT]{0,15})($revmatch)/)
25 {
26 $start = 1;
27 $palindrome = $match . "*" . $1 . "*" . $2;
28 $palhash{$palindrome}++;
29 }
30 }
31 if ($start == 0)
32 {
33 goto NEXTLINE;
34 }
35 }
36 }
37 close TEXT;
38 while(($key, $value) = each (%palhash))
39 {
40 print "$key => $value\n";
41 }
42 exit;
|
List 6.18.4. Input of sample.pl (line-breaks omitted from original file). 1 ATGAGCGAAGAAAGCTTATTCGAGTCTTCTCCACAGAAGATGGAGTACGAAATTACAAAC 2 TACTCAGAAAGACATACAGAACTTCCAGGTCATTTCATTGGCCTCAATACAGTAGATAAA 3 4 5 6 7 AAGATCAGAAGCGACCATGACAATGCTATTGATGGATTATCTGAAGTTATCAAGATGTTA 8 TCTACCGATGATAAAGAAAAATTGTTGAAGACTTTGAAATAA |
List 6.18.5. Output of sample.pl.
1 (* separates the spacer region from the flanking palindromic
regions)
2 C:\FTP>perl sample.pl
3 CTTTG*TCAGGATGGGC*CAAAG => 1
4 AGTAT*T*ATACT => 1
5 GAAATC**GATTTC => 1
6 AGTTT*GGCATCC*AAACT => 1
7 CCTTA*CCCTGT*TAAGG => 1
8 CTTCT*GGAGATTGAGA*AGAAG => 1
9
10
11
12 GATGG*ATTCAAG*CCATC => 1
13 GTTTGG*CAT*CCAAAC => 1
14 CTTCT*CCAC*AGAAG => 1
|
List 6.21.1. Examples of software utility functions.
1 Archiving utilities
2 Calculator utility
3 Compression/decompression utilities
4 Conversion utilities - Converts files (text, images, sound,
video) to and from different formats
5 Database utilities
6 Directory searching
7 Email service
8 Encryption/decryption utilities
9 File copying utilities
10 File reading and parsing utilities
11 FTP file retrieval
12 Indexing utilities
13 Sorting utilities
14 Searching utilities
15 Telnet remote computer access
16 Text editing
17 Web retrieval utilities
|
List 6.22.1. Types of software of possible interest to the FDA.
1 Software used as a component, part, or accessory of a
medical device
2 Software that is itself a medical device (e.g., blood
establishment software)
3 Software used in the production of a device (e.g.,
programmable logic controllers in manufacturing equipment)
4 Software used in implementation of the device manufacturer's
quality system (e.g., software that records and maintains
the device history record).
|
List 6.22.2. Features of software that buyers want.
1 Easy installation
2 Simple instructions and documentation
3 Friendly graphic user interface
4 Functionality that supports the user's goals
5 Transparency (no need for user to understand the underlying
assumptions, algorithms and data structures upon which the
functionality of the software is based)
6 Compatibility with operating system and other software
residing on the user's computer
7 Good user support services
|
List 6.22.3. Features of good software (that serious biomedical informaticians need).
1 Extensibility. The functionality of the software and the
data can be modified and expanded
2 Scalability. Should work with any size of inputs
3 Standardization of all data (input and output)
4 Open source code
5 Open access data
6 Self-describing software
7 Cross-platform functionality. Software should operate in
multiple operating systems
8 Interoperability
9 Availability of updates
10 Full documentation of methods and algorithms
|
List 6.22.4. Some properties of valid software, modeled on FDA Principles of |