Biomedical books by Jules J. Berman, Ph.D., M.D.


Berman JJ.  Concept-Match Medical Data Scrubbing: How 
pathology datasets can be used in research. 
Arch Pathol Lab Med, 127:680-686, 2003. 

Jules J. Berman, PhD, MD 
Accepted January 23, 2003 

Context.-In the normal course of activity, pathologists create and
archive immense data sets of scientifically valuable information.
Researchers need pathology-based data sets, annotated with clinical
information and linked to archived tissues, to discover and validate
new diagnostic tests and therapies. Pathology records can be used for
research purposes (without obtaining informed patient consent for each
use of each record), provided the data are rendered harmless. Large
data sets can be made harmless through 3 computational steps: (1)
deidentification, the removal or modification of data fields that can
be used to identify a patient (name, social security number, etc); (2)
rendering the data ambiguous, ensuring that every data record in a
public data set has a nonunique set of characterizing data; and (3)
data scrubbing, the removal or transformation of words in free text
that can be used to identify persons or that contain information that
is incriminating or otherwise private. This article addresses the
problem of data scrubbing.  

Objective.-To design and implement a general algorithm that scrubs
pathology free text, removing all identifying or private information.  

Methods.-The Concept-Match algorithm steps through confidential text.
When a medical term matching a standard nomenclature term is
encountered, the term is replaced by a nomenclature code and a synonym
for the original term. When a high-frequency "stop" word, such as a,
an, the, or for, is encountered, it is left in place. When any other
word is encountered, it is blocked and replaced by asterisks. This
produces a scrubbed text. An open-source implementation of the
algorithm is freely available.  

Results.-The Concept-Match scrub method transformed pathology free
text into scrubbed output that preserved the sense of the original
sentences, while it blocked terms that did not match terms found in
the Unified Medical Language System (UMLS). The scrubbed product is
safe, in the restricted sense that the output retains only standard
medical terms. The software implementation scrubbed more than half a
million surgical pathology report phrases in less than an hour.  

Conclusions.-Computerized scrubbing can render the textual portion of
a pathology report harmless for research purposes. Scrubbing and
deidentification methods allow pathologists to create and use large
pathology databases to conduct medical research.  

The biomedical research community is increasingly aware that pathology
records and archived specimens are absolutely vital to discover,
validate, and monitor new diagnostic tests and treatment regimens.
Virtually every Progress Review Group commissioned by the National
Cancer Institute has urged the creation of databases that link
pathologic and clinical data with archived specimens
(www.nci.nih.gov/research_programs/priorities/).  

Pathologists are the members of the medical community who are best
qualified to search through archived reports to select blocks, slides,
and data for basic and applied medical research. Pathologists are in
the best position to create very large databases of pathology reports
merged with data from multiple institutions, thus creating distributed
data sets linked to millions of specimens. The seemingly limitless
opportunities for pathologists have been diminished of late by a
mistaken perception that enacted federal legislation will effectively
block pathologists from using their own reports for medical research.
The 2 federal regulations that apply are the Protection of Human
Subjects, Common Rule1 and the Standards for Privacy of Individually
Identifiable Health Information, Final Rule (usually referred to under
the broader act, the Health Insurance Portability and Accountability
Act [HIPAA]).2 Both regulations permit the use of preexisting records
for human subject research, so long as the research can be conducted
in a way that does not harm patients. Both regulations specify that
deidentified records can be used for research purposes without
obtaining informed consent from patients.  

Patient-record data sets are rendered harmless by successfully
employing 3 computational procedures.3-8  


1. Deidentification of data elements that specifically characterize the
patient (name, social security number, hospital number, address, age,
date of procedure, etc).2 Computational deidentification may permit
researchers to continuously add outcome data to deidentified records
without violating the record's deidentified status.5,7,8  

2. Rendering the data set ambiguous, ensuring that patients cannot be
identified by data records containing a unique set of characterizing
information.3  

3. Free-text data scrubbing, removing words and phrases that can uniquely
identify a patient or other persons, or reveal any information of a
private nature contained within the free-text portion of reports.3  
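The second procedure, rendering the data set ambiguous, can be
illustrated with a minimal sketch. It is written in Python purely for
illustration and is not part of the software distributed with this
article; the field names and the sample records are invented. The
sketch flags every record whose combination of characterizing fields
occurs only once in the data set, since such records would need to be
generalized or withheld before release.

```python
from collections import Counter

def unique_records(records, fields):
    """Return the records whose combination of characterizing fields
    occurs exactly once in the data set."""
    def key(record):
        return tuple(record[f] for f in fields)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]

# Invented sample records; field names are assumptions for illustration.
records = [
    {"age_group": "60-69", "sex": "F", "diagnosis": "C0007134"},
    {"age_group": "60-69", "sex": "F", "diagnosis": "C0007134"},
    {"age_group": "20-29", "sex": "M", "diagnosis": "C0007134"},
]

flagged = unique_records(records, ["age_group", "sex", "diagnosis"])
for record in flagged:
    print("unique record, not yet ambiguous:", record)
```

In this toy data set only the third record is flagged, because the
first two records share the same characterizing values.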

These 3 computational steps have recently been discussed by me and
others.3-8 The purpose of this article is to describe a novel approach
to free-text data scrubbing that is safe in the restricted sense that
the scrubbed product contains only those terms that match a standard
nomenclature. The algorithm steps through confidential text. When a
medical term matching a standard nomenclature term is encountered, the
term is replaced by a code and a synonym for the original term. When a
high frequency "stop" word, such as a, an, the, or for, is
encountered, it is left in place. When any other word is encountered,
it is blocked and replaced by asterisks. This produces a scrubbed
text.  



MATERIALS AND METHODS 

The algorithm for the Concept-Match scrubber is simple to list, but
each of the steps is a complex computational task.  

1. Parse all input into sentences.9,10 

2. Parse each sentence into words. 

3. Each stop word (a high-frequency word, such as a preposition or a
common adjective) is preserved in its original place within each
sentence.  

4. Intervening words and phrases are mapped to a standard
nomenclature.11-14 This step requires breaking phrases into all
possible ordered concatenations of words. For instance, "Margins free
of tumor" would become "margins free of tumor, margins free of, free
of tumor, margins free, free of, of tumor, margins, free, of, tumor."
Each member of the derivative list is matched against the entire
database of Unified Medical Language System (UMLS) terms to determine
if a code exists for the term. Large terms subsume smaller substring
terms.  

5. Each coded term is replaced by an alternate term that maps to the same
concept code, if an alternate term exists. For instance, the term
renal cell carcinoma appearing in the text would be replaced by
C0007134 (the UMLS Concept Unique Identifier for renal cell carcinoma)
and by a different term that maps to the same code (such as rcc,
hypernephroma, hypernephroid carcinoma, or Grawitz tumor). This step
produces an output containing a different set of words than the
original text.  

6. All other words are replaced by a blocking symbol (consisting of 3
asterisks).  
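The steps listed above can be sketched in a few lines of code. The
following is a simplified illustration in Python, not the Perl module
(Parse.PM) distributed with this article. The stop list and the
term-to-code table are tiny stand-ins, the code X0000001 is a made-up
placeholder, and sentence parsing is omitted; only the code C0007134
and its synonyms come from the text above.

```python
import re

# Toy stand-ins (assumptions): a real run would load the stop list and
# the term-to-code table from a nomenclature such as UMLS.
STOP_WORDS = {"a", "an", "the", "of", "for", "is", "and", "with"}
TERM_TO_CODE = {
    "renal cell carcinoma": "C0007134",
    "hypernephroma": "C0007134",
    "tumor": "X0000001",  # placeholder code, not a real identifier
}

def ordered_concatenations(words):
    """All contiguous runs of words, longest first."""
    n = len(words)
    return [" ".join(words[i:i + size])
            for size in range(n, 0, -1)
            for i in range(n - size + 1)]

def synonym_for(term):
    """Return (alternate term, code); fall back to the term itself."""
    code = TERM_TO_CODE[term]
    alternates = sorted(t for t, c in TERM_TO_CODE.items()
                        if c == code and t != term)
    return (alternates[0] if alternates else term), code

def scrub_sentence(sentence):
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    out, i = [], 0
    while i < len(words):
        # Try the longest run starting at position i first, so that
        # large terms subsume their smaller substring terms.
        for size in range(len(words) - i, 0, -1):
            term = " ".join(words[i:i + size])
            if term in TERM_TO_CODE:
                alternate, code = synonym_for(term)
                out.append(f"({alternate} = {code})")
                i += size
                break
        else:
            # No nomenclature match: keep stop words, block the rest.
            out.append(words[i] if words[i] in STOP_WORDS else "***")
            i += 1
    return " ".join(out)

print(ordered_concatenations("margins free of tumor".split()))
print(scrub_sentence("Mr. Smith has a renal cell carcinoma of the kidney"))
# prints: *** *** *** a (hypernephroma = C0007134) of the ***
```

The sketch looks up each candidate run directly in a hash table rather
than materializing the full derivative list for every phrase; the list
is generated separately only to show the ordering described in step 4.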


Nomenclatures 

The algorithm is nomenclature-independent. Any standard medical
terminology that provides codes for medical concepts and synonyms for
all the terms belonging to a medical concept can be used for the
purposes of data scrubbing. UMLS was used in this study. This
nomenclature was chosen because it is comprehensive, professionally
curated, and freely available from the US government.  

UMLS is a metathesaurus built from approximately 100 different
nomenclatures. Its file of unique medical concepts (MRCON) is the
world's largest and most comprehensive listing of medical concepts,
codes and synonymous terms. The May 2002 release of UMLS contains
2072040 medical terms encompassed by 871585 different medical
concepts. Because UMLS contains some proprietary nomenclatures, its
use for commercial purposes may carry restrictions. Users of UMLS
should carefully review the UMLS license. Those who wish to distribute
UMLS-annotated data may consider using a subset of UMLS that consists
of thesauruses chosen for their nonrestrictive licenses. A Perl script
for extracting a public-domain (unrestricted) subset of the UMLS is
now available.15 More information about UMLS is available at:
www.nlm.nih.gov/research/umls/.  

The Medical Subject Headings (MESH) is another nomenclature suitable
for use with the Concept-Match protocol. MESH was developed by the
National Library of Medicine. Long known as a compendium of general
medical subjects used to store and retrieve medical information, MESH
has grown into a multipurpose hierarchical terminology containing a
rich and comprehensive collection of pathology terms. Information
about MESH is available at: www.nlm.nih.gov/mesh/meshhome.html.  

The Systematized Nomenclature of Medicine (SNOMED) will be technically
suitable for use in the proposed data scrubber implementation once the
US-wide license is finalized. One of the purposes of this article is
to provide an open-source solution to data scrubbing, in which the
results could be duplicated, examined, criticized, and improved by the
scientific community.  

Implementation 

The software implementation is written entirely in the Perl
programming language as an object-oriented (class) module (Parse.PM).
The stop words (high-frequency structural components of sentences,
including adverbs, general nominatives, and common adjectives) are
provided by the National Library of Medicine and are incorporated
directly into the Perl source code (Parse.PM). A surgical pathology
text corpus was obtained from JHARCOLL, a public domain collection of
more than half a million different phrases extracted from actual
pathology reports (surgical pathology, cytopathology, and autopsy).
This "phraseology" probably represents the most comprehensive source
of pathology text in existence. The JHARCOLL phrase collection is 
available at http://www.julesberman.info/jharcoll.tar.gz

Hardware 

All tests were performed using a desktop computer with 480 MB of RAM,
running at 1.6 GHz.  

Availability of Materials 

All software (including source code), free-text documents, and
nomenclatures used in the preparation of this manuscript are available
at no cost.  

Perl is a platform-independent, open-source, free programming
language. To run a Perl program, users must obtain and install Perl.
Perl comes preinstalled on most Linux and Unix operating systems. A
Windows version of Perl can be downloaded at www.activestate.com. Perl
versions for virtually all other operating systems are available from
the Comprehensive Perl Archive Network, at www.cpan.org.  

A gzipped tarball of the Perl class library containing all of the
methods used in this article can be downloaded at
http://www.julesberman.info/parse.tar.gz 

A 1-MB file of JHARCOLL-scrubbed output is included in the collection
of files, to permit easy review of the scrubber's sample output.
Interested persons are free to create all of the output files
described in the "Results."  

UMLS and MESH are available at no cost from the National Library of
Medicine. The May 2002 release of UMLS was used in this article. The
UMLS file used as the source of terms and codes (Concept Unique
Identifiers) is MRCON. UMLS can be obtained at umlsks.nlm.nih.gov.  


RESULTS 

The scrubber behaved perfectly, in the limited sense that its output
consisted only of high-frequency stop words, codes, and standard
synonyms for the medical terms originally contained in the sentence.  

The scrubber was tested using JHARCOLL. The JHARCOLL file is greater
than 14 MB in length and contains 567921 different pathology phrases
extracted from a variety of pathology free-text sources, including
archived surgical pathology and cytology reports. It is a supplemental
file for the Johns Hopkins Autopsy Resource.16 Each phrase consists of
fragments of original pathology text that were flanked by stop words.
To provide an example of how the JHARCOLL file was created, consider
the sentence, "Malignant melanoma is a highly malignant tumor." This
sentence can be broken by the stop words is a into 2 phrases,
"malignant melanoma" and "highly malignant tumor." If these 2 phrases
were not already found in JHARCOLL, they could be added to the list.
The final list consists entirely of different phrases (no 2 the same).
In theory, all of the text in all of the surgical pathology reports
used to produce the JHARCOLL could be fully reassembled by connecting
JHARCOLL phrases and stop words.  
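The way a phrase collection of this kind is built can be sketched as
follows. This is an illustrative Python fragment, not the program that
produced JHARCOLL; the stop list is a small invented subset, and the
second sentence is made up to show that repeated phrases are stored
only once.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "of", "for"}  # toy list (assumption)

def phrases_between_stop_words(sentence):
    """Split a sentence into the word runs that are flanked by stop words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    phrases, current = [], []
    for word in words:
        if word in STOP_WORDS:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases

def add_to_collection(collection, sentence):
    """Add each phrase of the sentence; a set keeps no 2 phrases the same."""
    collection.update(phrases_between_stop_words(sentence))

collection = set()
add_to_collection(collection, "Malignant melanoma is a highly malignant tumor.")
add_to_collection(collection, "The malignant melanoma is a tumor.")
print(sorted(collection))
# prints: ['highly malignant tumor', 'malignant melanoma', 'tumor']
```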

All 567921 pathology phrases in JHARCOLL were scrubbed in 2968 seconds
(nearly 200 phrases per second). The speed is fast enough to make the
program usable even for a very large corpus of text.  

An example demonstrates how the algorithm works. The phrase "area of
resolving ischemic injury" was scrubbed as follows:  

(area = C0805132) of (resolved = C0750521) (ischaemic = C0475224) (injurie = C0175677)

The first word of the phrase, area, is given a UMLS code and is
translated as itself. This means that the UMLS did not have alternate
synonyms that could be exchanged for the word area. The next word, of,
is a stop word that is left in place, without translation. The next
word, resolving, is mapped to the UMLS code C0750521 and is assigned an
alternate word, resolved. Similarly, ischemic and injury are mapped
and translated. In the case of ischemic, the translation is an
alternate (British) orthography. The word injury is translated to the
alternate form, injuries, from which the trailing "s" is truncated,
leaving injurie. The output phrase is fully encoded and translated to
alternate words.  

Thirty-six consecutive scrubbed phrases from the 43+-MB scrubbed
JHARCOLL output file are shown in Table 1. These phrases demonstrate
the scope and accuracy of the scrubber. Multiword terms, such as
splenic marginal zone lymphoma, small cell carcinoma, spindle cell
lipoma, and solitary rectal ulcer, were mapped to their specific
codes. Terms that may have been errors in the original report were
mapped in a reasonable and consistent fashion (eg, small carcinoma →
minute malignant epithelial neoplasm). There was one error that
appeared in several of the output phrases; the word syndrome was
mapped to wolff parkinson white syndrome. Inspection of the output
phrases indicated that whenever the word syndrome appeared outside of
a term that UMLS autocoder recognized as a syndrome, the autocoder
translated the general term, syndrome, to the specific term, wolff
parkinson white syndrome. This example demonstrates how autocoding
errors may appear in output. The author specifically included the
error in the output table to indicate that autocoding is not perfect,
but it is consistent. It is relatively easy to scan the output,
determine the error, and rewrite the autocoding algorithm to exclude
the error for the next deidentification run. It is worth remembering
that autocoding is always more consistent than manual coding and is
capable of producing output that is a vast improvement over manual
coding with respect to accuracy, completeness, and speed.12  

Table 2 contains sentences that were invented for this study to test
the performance of the autocoder under "worst-case" conditions. The
sentences contain incriminating remarks, patient names, names of
fictitious persons, and misspellings, all of which were removed by the
scrubber. Scrubbed sentences whose content has little to do with
medicine tend to consist largely of blocked output.

  
Table 1. A Sampling of Scrubbed Output Performed on JHARCOLL (38 Consecutive Entries in the 43+-MB Scrubbed File)

Table 2. Concept-Match Scrubbing on "Invented" Free Text

COMMENT

Federal laws are written to remove impediments to medical research
when it can be demonstrated that the research poses no risk to
patients. Patient consent is only required for research that poses
some level of risk to patients. Data scrubbing helps render medical
records harmless, so that patient consent becomes unnecessary.
Obtaining patient consent and monitoring patient consent status is an
enormous societal expense. It would seem self-evident that scientists
have an ethical obligation to conduct research in a manner that
produces the greatest societal benefit at the least societal expense.  

Value of Concept-Match Scrubbing 

The Concept-Match approach to data scrubbing provides many advantages. 

It produces an output devoid of phrases that do not map to a reference
terminology.  

It substitutes synonymous medical terms for the original terms
contained in the text, thus making it difficult for someone with
access to diagnostic terms found in the original report to match text
in the output record (another type of attack on confidentiality).  

It maintains the original order of terms in sentences, preserving
standard stop words. This integrity allows readers (and computer
parsers) of scrubbed text to construct grammatical (logical)
relationships between output terms in scrubbed sentences.  

It provides an output stripped of nonmedical and extraneous
information, in keeping with HIPAA recommendations that covered
entities restrict transfers of medical information to the minimum
necessary to accomplish its purpose.2  

It provides the terminology code for each medical term included in the
sentence, making it possible to index terms and to relate terms to
ancestor and descendant terms listed in biomedical ontologies.  

It does its job quickly. High-throughput techniques are required to
handle large volumes of data.  
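Because each coded term in the scrubbed output carries its concept
code, an index of codes can be built directly from scrubbed text. The
following Python sketch assumes the "(term = code)" layout shown in the
example in the Results; the function name and the second and third
sample lines are invented for illustration.

```python
import re
from collections import defaultdict

# Matches "(term = C0000000)" as it appears in the scrubbed output.
CODED_TERM = re.compile(r"\(([^()=]+) = (C\d{7})\)")

def build_code_index(scrubbed_lines):
    """Map each concept code to the line numbers on which it occurs."""
    index = defaultdict(list)
    for line_number, line in enumerate(scrubbed_lines):
        for _term, code in CODED_TERM.findall(line):
            index[code].append(line_number)
    return dict(index)

scrubbed_lines = [
    "(area = C0805132) of (resolved = C0750521) "
    "(ischaemic = C0475224) (injurie = C0175677)",
    "*** a (hypernephroma = C0007134) of the ***",
    "(rcc = C0007134)",
]

index = build_code_index(scrubbed_lines)
print(index["C0007134"])
# prints: [1, 2]
```

The last two lines use different synonyms for the same concept, yet
both are retrieved by a single code.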


Limitations of Prior Art 

Essentially all text scrubbers, until now, have made use of lists of
forbidden words or word combinations, directly excluding offending
text. Hospitals with lists of all their patients and staff can exclude
names derived from the list. In a recent contribution, Miller et al17
suggested a scrubbing scheme that requires institutions to prepare a
dictionary of all the names and all the misspellings occurring in
their medical reports. It would seem plausible for a medical center to
simply provide a look-up table containing all the names of persons
registered in the hospital system. Names appearing in free-text could
be blocked if they also appeared in the patient master list.
Unfortunately, there is no way to account for name misspellings (eg,
Keith, Kieth, Debra, Deborah, Bergman, Bergmann) and for alternate
representations of the same name (eg, nicknames). Variant spelling is
particularly irksome in instances in which names contain accent
markings (eg, umlauts, hyphens) or spacings (de Souza or deSouza).
Names that require an extended alphabet are typically typed by
approximation to standard keyboard elements. Mistakes that fail to
remove names often permit the unique reidentification of records. In
addition, many names are common words that have a rightful place in
medical text. It is not unusual to encounter Mr or Ms Page, Book,
Gram, Curtain, Lamps, or Fields. Chinese names are often 2 or 3
letters in length, and can be confused with some of the most commonly
used words and acronyms (eg, An, The, So, Go, He, No). It is not
practical or even advantageous to block all of the short common words
simply because each may rarely occur as a patient's name. It is
sometimes argued that names can be found by their distinctive location
(after an honorific, such as Mr, Sir, Ms, or Dr) and by their
uppercase first letter followed by lowercase trailing letters. Alas,
pathologic free text in the diagnostic field often appears entirely as
uppercase text, and honorifics are often omitted entirely.  

Regarding misspellings, it is simply impossible to find all
misspellings in text. The number of correctly spelled words is
finite. The number of entries in the Oxford English
Dictionary, second edition, is 291500 (Facts about the Oxford English
Dictionary www.oed.com/public/inside/funfacts.htm). This is a
manageable number for modern computers. However, the number of
possible misspelled words is essentially infinite. Furthermore,
misspellings are often words in their own right. If the word will is
misspelled as kill, automatic spell checkers will fail to detect a
misspelling, even though the occurrence of the word kill may be an
embarrassing inclusion in a medical report. There are many instances
of medical words that are misspelled as proper words of incorrect
meaning. Examples include arteritis/arthritis, auxiliary/axillary,
brachial/branchial, callus/callous, coitus/colitis, decease/disease,
dyskaryosis/dyskeratosis, facial/fascial, facies/faeces,
firearm/forearm, hallux/helicis, hydatid/hydatidiform,
ileitis/iliitis, ileum/ilium, isotope/isotrope, kerasin/keratin,
keratosis/ketosis, lipoma/lymphoma, malleolus/malleus, mucus/mucous,
myelofibrosis/myofibrosis, palette/palate, palpation/palpitation,
penal/penile, pleural/plural, porphyria/porphyruria,
prostrate/prostate, rachischisis/rachitis, ret/rett, rosacea/rosea,
semantic/somatic, silicon/silicone, taenia/tinea, thecoma/thekeoma,
trichinosis/trichosis, ureteral/urethral, vagitis/vaginitis.  

The Concept-Match scrubber deals with the issues of personal names and
of misspellings with one broad stroke. Most proper names are blocked
by the scrubber unless they fall into the special realm of disease
eponyms, in which case the name is coded as a disease. Furthermore,
when an alternate noneponymous form of the disease is found, it is
substituted for the name. All other proper names are blocked. Words
that are misspelled as nonwords or as words that do not appear in the
text as part of a proper term in a standard nomenclature are blocked.  
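The difference between the two strategies can be shown with a small
sketch. It is an illustration in Python under invented data: the name
list, stop list, and nomenclature below are toy stand-ins, and the
nomenclature is reduced to single words for brevity. Scrubbing by a
list of forbidden names lets misspelled names through, whereas
scrubbing by permitted terms blocks them because they match nothing.

```python
NAME_LIST = {"keith", "bergman"}        # hypothetical patient master list
STOP_WORDS = {"a", "an", "the", "of", "for", "by"}
NOMENCLATURE = {"biopsy", "lipoma"}     # toy single-word nomenclature

def scrub_by_name_list(words):
    """Prior-art style: block only words found on the forbidden list."""
    return ["***" if w.lower() in NAME_LIST else w for w in words]

def scrub_by_concept_match(words):
    """Concept-Match style: keep only stop words and nomenclature terms."""
    allowed = STOP_WORDS | NOMENCLATURE
    return [w if w.lower() in allowed else "***" for w in words]

words = "biopsy of lipoma requested by Kieth Bergmann".split()

print(" ".join(scrub_by_name_list(words)))
# prints: biopsy of lipoma requested by Kieth Bergmann
print(" ".join(scrub_by_concept_match(words)))
# prints: biopsy of lipoma *** by *** ***
```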

Data Scrubbers as Data Censors 

Discussion of medical data scrubbers tends to center on the necessity
of removing patient identifiers. My own experience has been that
scientists who attempt to scrub medical records understand that they
have an obligation to remove all private text contained in reports,
even when the text does not identify a patient. What is "private"
text? Basically, private text is text that is nobody's business and
that does not enhance the intended use of the deidentified patient
record. There is a general understanding that when medical data are
shared for the purposes of conducting research, there is an implied
ethical obligation to share only that portion of the patient record
that is actually needed to conduct the research.  

In many cases, private text is written by hospital personnel with the
expectation that it will only be shared among the persons directly
responsible for the care of the patient. This may include notes
documenting errors, misjudgments, warnings, and complaints. Most
hospital personnel are expected to exclude information of an
incriminating nature from the patient's medical records. Incident
reports and quality assurance reports exist for this purpose. In
reality, medical records often contain information that is best
removed from shared data sets. An exception list of offending terms
that must be removed, when present in text, is a first-pass remedy. A
few of the most egregious terms are blocked by the Concept-Match
scrubber. More importantly, the Concept-Match scrubber blocks all
terms that do not match against a medical nomenclature. Most
nonmedical terms will not match against the nomenclature and will not
appear in scrubbed data.  

Computational Feasibility of the Scrubber 

The Concept-Match scrubber would have been impossible just a few years
ago. Before then, computers simply did not have the memory or speed to
scrub a report using the algorithm described in this article. In 1980,
an algorithm that matched terms from pathology reports against a table
of 5500 codes required about 20 seconds to map a final diagnosis
phrase to a code.18 Until recently, pathologists were advised to use
microglossaries (specialized subsets of a nomenclature) to facilitate
coding.19 It is my perception that most, if not all, autocoders
embedded into commercial laboratory information systems contain
truncated versions of full nomenclatures. The Concept-Match scrubber
is capable of using large reference vocabularies, so that encountered
medical phrases will have the optimal chance of mapping to a code, and
so that consistency in scrubbing can be achieved by different
laboratories using the same scrubbing algorithm and the same
nomenclature. Before now, memory and speed considerations would have
precluded the scrubbing algorithm used in this report.  

Limitations of Concept-Match Scrubbing 

There are 3 important limitations to the value of the protocol, but all
3 have compensations or remedies. One problem is that when words are
removed or changed from an original sentence, the meaning of the
sentence is almost always changed and sometimes lost. Another problem
is that autocoding is not perfect. Since the Concept-Match protocol
depends on autocoding, it is certain that some terms will be miscoded.
A third problem is that if the reference nomenclature contains
identifying or incriminating concepts, these concepts may appear in
the scrubbed output.  

It is best to think of scrubbed output exclusively in terms of its
scientific purpose. Outputting a sentence as a collection of standard
terms separated by stop words is a way of condensing a sentence to its
essential concepts and providing a grammatic structure that preserves
the logical relationship between concepts found in the original
sentence. Although the esthetic and literate value of text is damaged
by the scrubber, the textual concepts and their order are preserved.
The concepts are reduced to codes, which can be organized and
subsequently searched in databases. The codes are far superior to
terms for the purpose of data organization because a single code will
identify a concept by any of its word forms. The scrubber replaces the
original terms with other terms that happen to match to the same code,
thus providing a layer of ambiguity to the output text. The alternate
terms are really nothing more than placeholders allowing humans to
read the scrubbed output file (because humans cannot understand
7-digit codes). The alternate terms will never be used for purposes of
data organization or data retrieval (that is why we have codes).  

Does scrubbing obscure the pathologist's original meaning of the text?
The answer is yes, but probably no more than we do whenever we
interpret the text. A recent study showed that the pathologist's
intended meaning is commonly misinterpreted by clinicians. In that
study, the authors observed that surgeons misunderstood pathology
reports 30% of the time. Furthermore, attempts to streamline the
reports only made matters worse.20  

A computer translator that maps terms to a standard nomenclature will
contain errors. A major reason is the occurrence of the same term with
more than 1 concept; for example, the term COLD is linked to the
concept code meaning coryza, and also to the concept code meaning
chronic obstructive lung disease. The scrubber requires terms
extracted from the original free text to be automatically coded, a
process that will introduce some errors. As has been reported, quality
assurance measures will enhance the performance of autocoders.12  
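One simple quality assurance measure is to list, in advance, the terms
that map to more than 1 concept code, so that they can be reviewed or
handled specially. The Python sketch below assumes the nomenclature is
available as (term, code) pairs; the rows and the codes X0000001 and
X0000002 are placeholders invented for illustration, not real UMLS
identifiers.

```python
from collections import defaultdict

# Hypothetical (term, code) rows; the codes are placeholders.
ROWS = [
    ("cold", "X0000001"),
    ("coryza", "X0000001"),
    ("cold", "X0000002"),
    ("chronic obstructive lung disease", "X0000002"),
]

def ambiguous_terms(rows):
    """Return the terms linked to more than one concept code."""
    codes_by_term = defaultdict(set)
    for term, code in rows:
        codes_by_term[term.lower()].add(code)
    return {term: sorted(codes)
            for term, codes in codes_by_term.items()
            if len(codes) > 1}

print(ambiguous_terms(ROWS))
# prints: {'cold': ['X0000001', 'X0000002']}
```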

The third limitation of the Concept-Match protocol is that terms
present in reference terminologies are not always safe. Fortunately,
most terminologies lack common expletives. However, comprehensive
terminologies, such as the UMLS, will contain terms such as sexual
abuse and homicide. These terms, when present in the original text,
may or may not be desirable in the scrubbed output. It is unlikely
that a pathology department would want to preserve such terms in a
cancer tissue database constructed from archived surgical pathology
reports. However, a forensic laboratory constructing an autopsy index
may find these terms critical to their mission. A report containing
the term smith may refer to a patient by that name, in which case it
would be unwise to include the UMLS term smith in the scrubbed output.
Alternatively, smith may appear in a comment on the patient's
occupational history (alluding to the intended meaning of smith in
UMLS). The list of problem terms would include all eponymous diseases.  

There are several possible approaches to this problem. The easiest
approach is to ignore the problem. The HIPAA Final Rule acknowledges
that uses or disclosures that are incidental to an otherwise permitted
use or disclosure may occur. Such incidental uses or disclosures are
not considered a violation of the Rule provided that the covered
entity has met the reasonable safeguards and minimum necessary
requirements.2  

The rare occurrence of a UMLS code that happens to match a name
(Cushing, Osler, Smith) somewhere in the free-text portion of a report
may fall under this exemption. In addition, the Final Rule permits
the creation and dissemination of a limited data set (that does not
include directly identifiable information) for research, public
health, and health care operations... the final Rule conditions
disclosure of the limited data set on a covered entity and the
recipient entering into a data-use agreement, in which the recipient
would agree to limit the use of the data set for the purposes for
which it was given, and to ensure the security of the data, as well as
not to identify the information or use it to contact any individual.  

The limited data set described in HIPAA has a relaxed privacy standard
because its distribution is restricted to parties entering a data-use
agreement, and this would need to be approved by an institutional
review board. Unmodified Concept-Match scrubbing may be an ideal
solution in a setting in which the data set is distributed to a small
set of collaborators.  

A second solution to the problem of unsafe terms occurring in
reference terminologies involves either (1) pruning unsafe terms from
a reference terminology or (2) creating a home-brew reference
terminology. Both solutions are actually feasible. Free-text terms are
essentially infinite in number. No one is ever certain what words and
phrases may exist in any report. But terminologies are finite. It is
actually feasible to inspect every term in a nomenclature, looking for
eponyms or other objectionable concepts. Although UMLS is large, the
unencumbered public subset of UMLS terms consists of about 189000
terms.15 MESH has approximately 19000 concepts, excluding its
supplemental concept listings. Although tedious, an institution could
review each term, removing those it deems objectionable.  

A third approach may involve including a list of words to be excluded
from nomenclature matches. The list would be used in the software
implementation of the Concept-Match protocol. This approach would be a
hybrid of the Concept-Match algorithm and standard scrubbing
algorithms that exclude specific words. The current Perl
implementation of the Concept-Match protocol includes a short list of
words that are always excluded from UMLS matches.  
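A hybrid of this kind can be sketched as a pruning pass over the
term-to-code table before scrubbing begins. The fragment below is an
illustration in Python; the exclusion list, the sample terms, and the
placeholder codes are invented, and the list actually used by the Perl
implementation is not reproduced here.

```python
# Hypothetical exclusion list; an institution would supply its own.
EXCLUDED_WORDS = {"smith", "cushing", "osler", "homicide"}

def prune_nomenclature(term_to_code, excluded_words):
    """Drop every term that contains an excluded word."""
    return {term: code
            for term, code in term_to_code.items()
            if not (set(term.lower().split()) & excluded_words)}

# Toy table with placeholder codes (not real identifiers).
term_to_code = {
    "smith": "X0000001",
    "cushing syndrome": "X0000002",
    "lipoma": "X0000003",
}

pruned = prune_nomenclature(term_to_code, EXCLUDED_WORDS)
print(pruned)
# prints: {'lipoma': 'X0000003'}
```

Words belonging to pruned terms no longer match the nomenclature, so a
scrubber using the pruned table blocks them like any other unmatched
word.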

In summary, the Concept-Match scrubbing protocol is a novel approach
to medical free-text scrubbing. The scrubber was tested using a
publicly available corpus of pathology terms constituting the text of
more than a quarter million surgical pathology reports. Output of the
Concept-Match scrubber on this corpus is publicly available in files
included with the Perl implementation. Perhaps the most valuable
contribution in this article is the open source distribution of all of
the software and nomenclatures required to recreate the Concept-Match
scrubber. Pathologists who wish to participate in research using
archived pathology records will need to select, implement, and
evaluate confidentiality protocols.  


References 

1.  Protection of human subjects: common rule, 56 Federal Register
28003-28032 (1991) (codified at 45 CFR 46).   

2.  Standards for privacy of individually identifiable health
information: final rule, 67 Federal Register 53181-53273 (2002)
(codified at 45 CFR 160 and 164).   

3. Sweeney L. Computational Disclosure Control: A Primer on Data
Privacy Protection (PS) [PhD thesis]. Cambridge, Mass: Massachusetts
Institute of Technology; 2001.   

4. Moore GW, Berman JJ. Anatomic pathology data mining. In: Cios KJ,
ed. Medical Data Mining and Knowledge Discovery. New York, NY:
Springer-Verlag; 2000.   

5. Bouzelat H, Quantin C, Dusserre L. Extraction and anonymity
protocol of medical file. Proc AMIA Annu Fall Symp. 1996:323-327.   

6. Berman JJ, Moore GW, Hutchins GM. Maintaining patient
confidentiality in the public domain Internet Autopsy Database (IAD).
Proc AMIA Annu Fall Symp. 1996:328-332.   

7. Quantin C, Bouzelat H, Allaert FA, Benhamiche AM, Faivre J,
Dusserre L. Automatic record hash coding and linkage for
epidemiological follow-up data confidentiality. Methods Inf Med
1998;37:271-277.

8. Berman JJ. Confidentiality for medical data miners. Artificial
Intelligence Med 2002;26:25-36.

9. Berman JJ. Improved medical sentence parser [abstract]. Arch Pathol
Lab Med. In press.   

10. Berman JJ. Medical sentence parsing in Perl [abstract]. Arch
Pathol Lab Med 2002;126:781   

11. Moore GW, Berman JJ. Automatic SNOMED coding. Proc Annu Symp
Comput Appl Med Care. 1994:225-229.   

12. Moore GW, Berman JJ. Performance analysis of manual and automated
Systemized Nomenclature of Medicine (SNOMED) coding. Am J Clin Pathol
1994;101:253-256.

13. Berman JJ, Moore GW. SNOMED-encoded surgical pathology databases:
a tool for epidemiologic investigation. Mod Pathol 1996;9:944-950.

14. Berman JJ. Threshold protocol for the exchange of confidential
medical data. BMC Med Res Methodol 2002;2:12   

15. Berman JJ. A tool for sharing annotated research data: the
"Category 0" UMLS (Unified Medical Language System) vocabularies. BMC
Med Inform Decis Mak 2003;3:6.

16. Berman JJ, Moore GW, Hutchins GM. Internet autopsy database. Hum
Pathol 1997;28:393-394.

17. Miller RE, Boitnott JK, Moore GW. Web-based free-text query system
for surgical pathology reports with automatic case de-identification
[abstract]. Arch Pathol Lab Med 2001;125:1011   

18. Foulis PR, Norbut AM, Mendelow H, Kessler GF. Pathology
accessioning and retrieval system with encoding by computer (PARSEC).
Am J Clin Pathol 1980;73:748-753. 

19. Cote RA, Robboy S. Progress in medical information management:
Systematized Nomenclature of Medicine. JAMA 1980;243:756-762.

20. Powsner SM, Costa J, Homer RJ. Clinicians are from Mars and
pathologists are from Venus. Arch Pathol Lab Med 2000;124:1040-1046.   



Last modified: January 6, 2008


