Books by Jules J. Berman, covers

Pathology Abbreviated: A Long Review of Short Terms
Archives of Pathology and Laboratory Medicine
128:347-352, 2004.
Jules J. Berman, PhD, MD

Context. Abbreviations are used frequently in pathology reports and
medical records. Efforts to identify and organize free-text concepts
must correctly interpret medical abbreviations. During the past
decade, the author has collected more than 12000 medical
abbreviations, concentrating on terms used or interpreted by
pathologists.  

Objective. The purpose of the study is to provide readers with a
listing of abbreviations. The listing of abbreviations is reviewed for
the purpose of determining the variety of ways that long forms are
shortened.  

Design. Abbreviations fell into different classes. These classes
seemed amenable to distinct algorithmic approaches to their correct
expansions. A discussion of these abbreviation classes was included to
assist informaticians who are searching for ways to write software
that expands abbreviations found in medical text. Classes were
separated by the algorithmic approaches that could be used to map
abbreviations to their correct expansions. A Perl implementation was
developed to automatically match expansions with Unified Medical
Language System concepts.  

Measurements. The abbreviation list contained 12097 terms; 5772
abbreviations had unique expansions. There were 6325 polysemous
abbreviation/expansion pairs. The expansions of 8599 abbreviations
mapped to Unified Medical Language System concepts. Three hundred
twenty-four abbreviations could be confused with unabbreviated words.
Two hundred thirteen abbreviations had different expansions depending
on whether the American or the British spellings were used. Nine
hundred seventy abbreviations ended in the letter s.  

Results. There were 6 nonexclusive groups of abbreviations classed by
expansion algorithm, as follows: (1) ephemeral; (2) hyponymous; (3)
monosemous; (4) polysemous; (5) masqueraders of common words; and (6)
fatal (abbreviations whose incorrect expansions could easily result in
clinical errors).  

Conclusion. Collecting and classifying abbreviations creates a logical
approach to the development of class-specific algorithms designed to
expand abbreviations. A large listing of medical abbreviations is
placed into the public domain. The most current version is available
at http://www.julesberman.info/abbtwo.htm  

Expanding and removing the ambiguity from abbreviations is one of the
more challenging issues in natural language parsing. Collecting and
classifying abbreviations is a necessary exercise. It is the first
step toward developing algorithmic strategies to assign the correct
expansions for medical abbreviations parsed from text.  
Most medical abbreviations were no doubt invented as time-savers for
health professionals. These same abbreviations can waste the time of
those persons tasked with organizing, indexing, searching, or
interpreting medical text. Those reading a medical record are expected
to know that ga occurring within an obstetric history usually means
gestational age, while ga occurring within a dermatologic history
usually means granuloma annulare. Automatic indexers and machine
translators of medical text require large and accurate lists of
medical abbreviations and accurate algorithms to find abbreviations
within text and to map the abbreviations to their correct expansions.
Also required is an understanding of the types of problems encountered
when abbreviations are expanded.  

A PubMed search on biomedical AND abbreviations reveals that most
literature contributions to the field consist of outraged editorials
and letters decrying the inappropriate use of abbreviations.
Strangely, the problem of correctly expanding free-text abbreviations
into standard terminology has received very little serious attention
in the medical informatics literature.1–3 In an early study in which
the words from medical reports were enumerated, there was no mention
whatsoever of abbreviations.4 Recently, Liu et al3 studied a set of
abbreviations that could be algorithmically extracted from the Unified
Medical Language System (UMLS), extracting 163666
abbreviation/expansion pairs.3 These authors noted that many
abbreviations have multiple different expansions and that the
expansions cannot always be disambiguated based on separating
abbreviations by their knowledge domain. Liu et al3 indicated that
methods are needed to remove the ambiguity from ambiguous
abbreviations. The current study involves a hand-annotated listing of
about 12000 abbreviations, many of which were encountered in pathology
reports and pathology literature.  

Medical abbreviations come in 2 forms: acronyms and shortened words.
Acronyms are character strings usually composed from the first letters
of a text phrase. Many acronyms are noun phrases. A straightforward
example is CABG, which stands for coronary artery bypass graft.  
There are relatively few abbreviations for adjectives. Examples
include the following: AP = anterior-posterior; L = left. There are
almost no abbreviations for verbs.  

A shortened form is a subset of letters taken from a word; the letters
almost always maintain the same relative order as their original
appearance in the word, and usually they are taken from the beginning
of the word. An example is ceph, which stands for cephalosporin.  
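
The two forms defined above can be told apart mechanically. The following Python sketch is an illustration only, not the author's Perl script; the function names is_acronym and is_shortened_form are hypothetical. It assumes an acronym is spelled exactly by the first letters of the words in the phrase, and that a shortened form is an in-order subset of the letters of a single word.

```python
def is_acronym(abbr, expansion):
    """True if abbr is spelled by the first letters of the expansion's words."""
    # Hyphenated words are treated as separate words (anterior-posterior -> AP).
    words = expansion.replace("-", " ").split()
    initials = "".join(word[0] for word in words)
    return abbr.lower() == initials.lower()


def is_shortened_form(abbr, expansion):
    """True if the letters of abbr appear, in order, within a single longer word."""
    abbr, word = abbr.lower(), expansion.lower()
    if " " in word or not abbr or len(abbr) >= len(word):
        return False
    remaining = iter(word)
    # Each membership test consumes the iterator, which enforces letter order.
    return all(letter in remaining for letter in abbr)


print(is_acronym("CABG", "coronary artery bypass graft"))
print(is_shortened_form("ceph", "cephalosporin"))
```

Under these assumptions a single-letter abbreviation of a single word passes both tests, and an abbreviation that contains a letter absent from its expansion passes neither.
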
In this article, I provide a useful resource to pathology
informaticians, a listing of more than 12000 abbreviations organized
by polysemy. Upon review of the abbreviations, the author noted that
the abbreviations could be classified by the algorithmic approaches
needed for their correct interpretation. Such a classification may be
of value to informaticians who are designing software to expand
medical abbreviations. The review of abbreviations and the discussion
of the classes of abbreviations are provided with the expectation that
they may have value for programmers in the field. The implementations
of classed algorithms will be the subject of future efforts.  

MATERIALS AND METHODS  

All abbreviations (circa 12000) collected for this article, along with
the Perl script that itemizes the abbreviations, are available at
http://www.julesberman.info/abbtwo.htm

This resource is placed in the public domain by the author with no 
implied or expressed warranties.
  
The UMLS is available (at no cost) from the National Library of
Medicine's web site (http://www.nlm.nih.gov/research/umls/). The 2001 UMLS
was used, and a valid user's license was obtained. The MRCON file,
containing about 1.5 million concepts, was used to match against the
expansions of abbreviations listed in the author's abbreviation file.  
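
The matching step can be pictured with a short Python sketch. It is an assumption-laden illustration, not the author's Perl implementation: the concept strings are passed in as a plain list rather than read from the MRCON file, the normalization rule (lowercase, hyphens to spaces, collapsed whitespace) is a guess at a reasonable choice, and the names normalize and map_expansions are hypothetical.

```python
import re


def normalize(term):
    """Lowercase, turn hyphens into spaces, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", term.lower().replace("-", " ")).strip()


def map_expansions(pairs, concept_strings):
    """Return the (abbreviation, expansion) pairs whose expansion matches a concept."""
    concepts = {normalize(s) for s in concept_strings}
    return [(abbr, exp) for abbr, exp in pairs if normalize(exp) in concepts]


# Hypothetical stand-ins for concept strings and abbreviation/expansion pairs.
concept_strings = ["Gestational Age", "Granuloma annulare"]
pairs = [
    ("ga", "gestational age"),
    ("ga", "granuloma annulare"),
    ("cabg", "coronary artery bypass graft"),
]
print(map_expansions(pairs, concept_strings))
```
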

RESULTS  

Counting abbreviations can be very revealing. Table 1 summarizes the
feature data of the supplemental list of annotated abbreviations. The
following observations from the list illustrate the difficulties that
would be encountered by any direct algorithmic approach to predicting
abbreviations from expansions, or vice versa.  

Table 1. Summary of the List of Abbreviations (http://www.julesberman.info/abbtwo.htm)

Abbreviations That Are Neither Acronyms nor Shortened Forms of Expansions 

Sometimes short forms contain letters not found in the long form of
the abbreviated word. For example, the short form of the word
diagnosis is dx, although no x is contained therein. The same
applies to the x in tx, the abbreviation for therapy, but not
the x in the TX that stands for Texas. For that matter, the
short form of times is an x, relating to the notation for the
multiplication operator. The Roman numerals I, V, X, L, and M are
abbreviations for words assigned to numbers, but they are not
characters included in the expanded words. EKG is the abbreviation
for electrocardiogram, a word bereft of the letter K. The K comes
from the German spelling, Elektrokardiogramm. There is no letter q in subcutaneous, but
the abbreviation for the word is sometimes subq and never subc.  
Medical records borrow heavily from the Periodic Table. Is KCl an
abbreviation for potassium chloride or is it the full term for the
chemical symbol? What about Calc for calcium? Calc is indeed a short
form, but any chemist would tell you that it is the wrong form (forme
fruste). What about EtOH? Nurses and doctors perform alchemy (not
chemistry) to convert ethanol to etoh. Certainly EtOH is a type of
abbreviation, but the linguistic method employed to create the
abbreviation is obscure.  

Abbreviations That Are Sometimes Both Acronyms and Shortened Forms 

The letter l (for the word left) is both an acronym and a shortened
form. This is true for almost all single-letter abbreviations. Another
term that is difficult to assign as either acronym or short form is
DNA (deoxyribonucleic acid). DNA may well be an acronym, because the D
is the first letter of deoxyribonucleic and the A is the first letter
of acid, but the N falls in the middle of deoxyribonucleic. The
letter N is the first letter of a word that could stand as an
individual word (nucleic), even though it does not in this case. DNA
can also be thought of as a simple shortened form of a long word, the
same way that cmpd is a shortened form of the word compound. In
both, the letters are pulled from their order of appearance in the
full word but are chosen from scattered sites within the word.  
An example of a mixed acronym/abbreviation is dsv, representing the
dermatome of the fifth sacral nerve. Here a preposition, an article,
and a noun (of, the, nerve) have been dropped for the abbreviation,
the order of the acronym components has been changed (dermatome sacral
fifth), an ordinal has been changed to a cardinal (fifth changed to
five), and the cardinal has been shortened to its roman numeral
equivalent (v).  

Prepositions and Articles Retained in an Acronym 

When forming an acronym from a phrase, it is difficult to guess when
to use or abandon prepositions. Many acronyms exclude prepositions and
articles. CAP is the acronym for College of American Pathologists,
snubbing the of. Other abbreviations are not so snobbish (DOB = date
OF birth). The abbreviation NIH (for National Institutes of Health)
denies any generosity to its sole preposition. Sometimes both forms
are accepted abbreviations (eg, edd = estimated date of delivery and
edod = estimated date of delivery).  

Single Expansions With Multiple Abbreviations 

Just as abbreviations can map to many different expansions, the
reverse can occur. For instance, high-grade squamous intraepithelial
lesion can be abbreviated as HGSIL or HSIL. Xanthogranulomatous
pyelonephritis can be abbreviated as xgp or xgpn.
Angioimmunoblastic lymphadenopathy can be abbreviated as abl, ail,
or aiml.  

Nonsense Abbreviations 

ANNL and ANLL represent the phrase acute nonlymphoblastic leukemia. It
is impossible to imagine how the term ANNL ever became the
abbreviation for a phrase in which only one word begins with the
letter N, but this abbreviation appears occasionally in the pathology
literature.  
The term PT-LPD represents the phrase posttransplantation
lymphoproliferative disorders. The only location for a hyphen in the
expansion is between the letters p and t. Why does the acronym move
the hyphen? 
 
The term GNU (GNU's Not Unix) is an example of a self-referring
acronym. Fully expanded, this acronym is of infinite length. Although
the N and the U expand to words (not Unix), the letter G is
forever inscrutable.  
The expansion for OK is simply okay, the phonetic spelling of the
sound made by the pronunciation of the abbreviation. Neither the
abbreviation nor the expansion has any obvious etymologic derivation.  

Common Usage That Confounds Meaning 

The term TREC is the acronym for text retrieval conference. However,
it seems that whenever TREC appears in a sentence, it occurs in the
phrase TREC conference (trec.nist.gov/). Clearly, the word
conference is redundant in this example. Apparently people would
rather attend a TREC Conference than either a TRE Conference or a
TREC.
  
Sometimes straightforward abbreviations adopt phonetic forms with
features of shibboleths. For instance, the term peripheral
neuroectodermal tumors is abbreviated as PNET, but PNET sounds like
peanut, and peanut is now the abbreviated form used in conversation
and literature for these tumors.5 Examples of other phonetic
expansions are cabbage for the phrase coronary artery bypass graft,
and the term Tobasco, used for the phrase total abdominal hysterectomy
and bilateral oophorectomy.
  
At times a word's abbreviation looks enough like an expansion to goad
a spell-checker into action. The word cameleon is an abbreviation for
the phrase cytosine arabinoside, high-dose methotrexate, leucovorin,
oncovin. Spell-checkers should not replace the abbreviation with a
chameleon.
  
Though not a medical abbreviation, the following practice exemplifies
the horrors of recursive abbreviations. The term SMETE is the
abbreviation for the phrase science, math, engineering, and technology
education. The term NSDL is the abbreviation for the term National
SMETE Digital Library community (found at www.osti.gov/speeches/asist.html). 
Assuming that the term requires an abbreviation, wouldn't the form 
of the abbreviation holding a clue to the identity of the expansion be 
NSMETEDL?  

Pejorative Abbreviations 

Pejorative or disrespectful terms should never appear in the medical
record. When they occur, it would be better if they were not expanded:
flk = funny looking kid and gomer = get out of my emergency
room. Pejorative abbreviations have been omitted from the author's
file.  

Locale-Dependent Abbreviations 

Americans sometimes forget that most of the English-speaking countries
use British English. Americans contribute a minor share of English
free-text. So TOF makes no sense as an abbreviation of
tracheo-esophageal fistula here in Bethesda, Md, but this abbreviation
makes perfect sense in London, where patients may have
tracheo-oesophageal fistulas. The term GERD (representing the phrase
gastroesophageal reflux disease) makes perfect sense to Americans, but
it must be confusing to Australians.  

COMMENT  

Classifying Abbreviations by Their Expansion Algorithms 

Different types of abbreviations create different types of
interpretive problems. When a document is parsed into words, it is
relatively easy to determine whether a given word string matches a
term in a long list of abbreviation/expansion pairs. However, an
algorithm is needed to determine if the word string is correctly
mapped to its intended expansion. The algorithm used to perform this
task may depend on the context of the parsed document word and on the
class or classes of abbreviations matching the parsed word. The
following classification of abbreviations is chosen to separate
abbreviations by the algorithmic tasks required for their accurate
selection and expansion.  

Ephemeral Abbreviations 

The most common form of abbreviation is the ephemeral abbreviation.
The ephemeral abbreviation is invented on the fly by a writer and is
intended to exist within a single document. These are the
abbreviations that are usually found early in an article as an
uppercase string within a parenthetical expression following the first
appearance of the expansion. Elsewhere in the article they appear as
stand-alone uppercase character strings. Ephemeral abbreviations are
typically highly coordinated noun terms that appear sufficiently often
within a particular document to justify their creation. For example, a
pathology article may contain many references to an unidentified
eosinophilic nodule of basement membrane-like material (abbreviated as
UENBMM). The author probably has no intention of incorporating the
abbreviation into the permanent medical literature. It is easy to
build a parser that automatically extracts such terms from text
because they are almost always introduced in a structured way (ie,
expansion immediately followed by parenthetical abbreviation).  
The ephemeral abbreviation exists only within a specific document. An
algorithm might look for an uppercase string (often enclosed by
parentheses) preceded by or following a text phrase, the first letters
of which equal or approximate the uppercase string. This text phrase
would be the expansion of the ephemeral abbreviation. Whenever the
same uppercase word appears later in the same document, it could be
tagged with a metadata tag, indicating that the uppercase string is an
abbreviation and that its expansion is the previously determined text
phrase. The abbreviation and its expansion would disappear at the end
of the document. An algorithm for expanding ephemeral abbreviations
has been discussed by Liu et al.3  
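
A minimal Python sketch of the approach just described follows. It is not the algorithm of Liu et al, and it is deliberately narrow: it finds only an uppercase string in parentheses that is immediately preceded by a phrase whose word initials spell the string exactly. Abbreviations that skip prepositions or only approximate the initials are missed. The function name find_ephemeral is hypothetical.

```python
import re


def find_ephemeral(text):
    """Map each parenthetical uppercase string to the preceding phrase
    whose word initials spell it (exact first-letter matches only)."""
    found = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        # Words that precede the parenthesis, in document order.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(abbr):]
        if len(candidate) == len(abbr) and all(
            word[0].upper() == letter for word, letter in zip(candidate, abbr)
        ):
            found[abbr] = " ".join(candidate)
    return found


report = ("The patient received a coronary artery bypass graft (CABG). "
          "The CABG was uneventful.")
print(find_ephemeral(report))
```

The returned dictionary would be discarded at the end of each document, because the abbreviation and its expansion are meant to exist only within that document.
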

Hyponymous Abbreviations 

The entity A is a hyponym or subordinate of B if A is a specific kind
of B. So poodle is a hyponym of dog. The term HSIL (representing the
phrase high-grade squamous intraepithelial lesion) is a hyponym of
SIL. The term AIDH (representing the phrase atypical intraductal
hyperplasia) is a hyponym of IDH (intraductal hyperplasia). In many
instances, there are abbreviations for the hyponym, but no
abbreviation for the more general term. For example, the term DVT
expands to deep vein thrombosis, but there is no medical
abbreviation for the phrase venous thrombosis of undetermined depth
(ie, no VT). PE stands for the term pulmonary embolus, but E is
not in use as an abbreviation for the word embolus.
  
The most common hyponym examples relate to singular/plural forms.
After all, every singular form is a hyponym of its plural. So, the
term rbc represents the phrase red blood cell. Some people use rbc
to refer to either the singular or the plural (because C expands to
cell or to cells). But some people prefer to turn the abbreviation
into a familiar plural form, rbcs. In many cases, when a plural is
added to an abbreviation, people will demarcate the plural form from
the singleton by an awkward use of uppercase and lowercase characters.
So, erythrocytes may be abbreviated as RBCs. It is also common to
engage the possessive form when converting an abbreviation into its
plural form (eg, RBC's). This is grammatically the wrong thing to do
when the single form does not end in an s. Occasionally, the plural
form of an abbreviation is used, even when it defies rational
analysis. So, a man with withdrawal symptoms may have the DTs, even
though he is only suffering from one case of delirium tremens.  
What do you do when the single form properly ends with a word that
begins with s? The abbreviation for the phrase Hospital Information
System is HIS. If you wish to refer to multiple systems, is the
plural HISs, HIS, or HISes? One may surmise that all 3 forms occur
in nature.  

Unfortunately, unless the plural abbreviation comes in the form of an
uppercase acronym followed by a lowercase s, confusion may arise
with acronyms whose last expanded word is syndrome. So, how would
you otherwise distinguish Lesch-Nyhan Syndrome (lns) from the plural
of the abbreviation of the phrase lymph nodes (lns)? In the
supplemental abbreviation file, there were 970 abbreviations ending
with s and 245 expansions that included the word syndrome.  
Single hyponyms of plural forms that do not end with an s are really
not a problem. Nobody will care whether a parser expands rbc to red
blood cell when the intended expansion was red blood cells. There
may be some minor annoyance when tia is expanded to transient
ischemic attack when it should have been expanded to transient
ischemic attacks. A smart parser can take its contextual cues from
the word preceding the abbreviation. Three tia in 24 hours should be
mapped to the plural form, while a tia should be mapped to the
singular.  

How do you deal with parsed abbreviations that end with the letter
s? Abbreviation hyponyms that have a plural form ending with s can
all be put into a single list. If the parser determines that the
abbreviation was optimally formatted, with uppercase letters for the
abbreviation and a lowercase s at the end, then the parser should
only match against the singular hyponym (ie, match TIAs against TIA).
In other cases, the parser algorithm may choose to determine from the
context of the sentence whether the abbreviation is a plural form. If
so, it can look for a match among the list of abbreviations whose
plural form ends with an s. If there is a match, that may be
sufficient. If there is not a match, the s can be truncated and
matches should be sought in the large list of abbreviations not ending
with a plural form designated by s.  
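
The steps above can be sketched in Python as follows. This is one possible reading of the procedure, not a tested implementation. The two dictionaries, the context_is_plural flag (which a real parser would have to compute from the sentence), and the function name expand_s_abbreviation are all hypothetical.

```python
import re


def expand_s_abbreviation(token, main_list, plural_list, context_is_plural=False):
    """Expand a token that ends in s.

    main_list maps abbreviations to expansions; plural_list maps
    abbreviations whose plural form ends in s to plural expansions.
    """
    key = token.lower()
    if not key.endswith("s"):
        return main_list.get(key)
    if re.fullmatch(r"[A-Z]+s", token):
        # Optimal format (eg, TIAs): match only against the singular form.
        return main_list.get(key[:-1])
    if context_is_plural:
        if key in plural_list:
            return plural_list[key]
        # No plural match: truncate the s and try the main list.
        return main_list.get(key[:-1])
    return main_list.get(key)


# Hypothetical miniature lists.
main_list = {
    "tia": "transient ischemic attack",
    "ln": "lymph node",
    "lns": "Lesch-Nyhan syndrome",
}
plural_list = {"lns": "lymph nodes"}

print(expand_s_abbreviation("TIAs", main_list, plural_list))
print(expand_s_abbreviation("lns", main_list, plural_list, context_is_plural=True))
print(expand_s_abbreviation("lns", main_list, plural_list))
```
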

Monosemous Abbreviations 

The monosemous abbreviation has a unique expansion. Therefore, it is
relatively simple to write algorithms that correctly match expansions
against abbreviations parsed from medical text. Fortunately, about
half of abbreviations (5772 in the supplemental abbreviation file)
seem to be monosemous. In general, the longer the abbreviation, the
more likely it will be to have a unique expansion.  
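
Separating monosemous from polysemous abbreviations is a counting exercise. The Python sketch below is illustrative only and assumes the list is available as (abbreviation, expansion) pairs with case ignored; the function name split_by_polysemy is hypothetical and the sample pairs are a tiny stand-in for the full file.

```python
from collections import defaultdict


def split_by_polysemy(pairs):
    """Group expansions by abbreviation and separate unique from multiple."""
    expansions = defaultdict(set)
    for abbr, expansion in pairs:
        expansions[abbr.lower()].add(expansion)
    monosemous = {a: next(iter(e)) for a, e in expansions.items() if len(e) == 1}
    polysemous = {a: sorted(e) for a, e in expansions.items() if len(e) > 1}
    return monosemous, polysemous


pairs = [
    ("cabg", "coronary artery bypass graft"),
    ("ga", "gestational age"),
    ("ga", "granuloma annulare"),
]
monosemous, polysemous = split_by_polysemy(pairs)
print(monosemous)
print(polysemous)
```

A monosemous abbreviation can then be expanded with a single dictionary lookup, with no need to examine context.
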

Polysemous Abbreviations 

Polysemy is the condition whereby a single term has multiple meanings.
The most polysemous abbreviation is PA, which has 41 different
expansions. There are many different algorithmic approaches to the
problem of assigning a correct expansion to a polysemous abbreviation.  
An algorithm can simply use a frequency of occurrence list for the
different possible expansions, choosing the most frequently encountered
expansion as the correct expansion for any abbreviation. The term
PA appearing in a radiology report is much more likely to expand to
posterior-anterior than to propionic acid. However, a good
algorithm may need to reckon with pulmonary artery as a reasonable
alternative.  
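
A frequency-based choice can be sketched in a few lines of Python. The counts below are invented purely for illustration and do not come from any corpus, and the function name rank_expansions is hypothetical. Returning the full ranking rather than only the top entry keeps reasonable alternatives visible.

```python
def rank_expansions(abbr, frequency):
    """Return the candidate expansions of abbr, most frequent first."""
    counts = frequency.get(abbr.lower(), {})
    return sorted(counts, key=counts.get, reverse=True)


# Invented counts for one document type (radiology reports).
radiology_frequency = {
    "pa": {
        "posterior-anterior": 120,
        "pulmonary artery": 45,
        "propionic acid": 1,
    }
}
print(rank_expansions("PA", radiology_frequency))
```
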

Another algorithm may use the nonabbreviated words found in the
paragraph or sentence containing the abbreviation as clues to the
abbreviation's intended expansion. UMLS contains long lists of
concepts that relate to other concepts. Choosing an expansion (from a
list of expansions matching an abbreviation) on a relatedness index
is certainly a reasonable approach to dealing with polysemous
abbreviations.  
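
One simple form of relatedness index is word overlap. The Python sketch below assumes that each candidate expansion comes with a set of cue words. In practice these might be derived from related UMLS concepts, but here they are hand-picked, hypothetical values. The function name choose_by_context is also hypothetical.

```python
import re


def choose_by_context(sentence, candidates):
    """Pick the expansion whose cue words overlap most with the sentence.

    candidates maps each expansion to a set of lowercase cue words.
    Returns None when no candidate shares a word with the sentence.
    """
    if not candidates:
        return None
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {exp: len(words & cues) for exp, cues in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hand-picked cue words for the two expansions of ga.
ga_candidates = {
    "gestational age": {"pregnancy", "weeks", "obstetric", "fetus"},
    "granuloma annulare": {"skin", "rash", "lesion", "dermatologic"},
}
print(choose_by_context("Fetus measured at 32 weeks ga", ga_candidates))
```
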

Abbreviations Masquerading as Words 

Particularly irksome are abbreviations spelled the same as often-used
general words, such as those for axillary node dissection (AND), acute
lymphocytic leukemia (ALL), optic neuritis (ON), and acanthosis
nigricans (AN). The most difficult abbreviations are spelled the same
as commonly used medical terms, such as Acquired Immune Deficiency
Syndrome (AIDS), Bornholm Eye Disease (BED), and Expired Air
Resuscitation (EAR).  
Many acronyms will almost always appear as uppercase strings or as
strings internally punctuated by periods. For instance, the phrase
United States is often abbreviated as US or as U.S., thus
distinguishing it from us. But health professionals will not always
play by the rules. A pin sometimes lurks in a diaper and sometimes
lurks in a prostate (prostatic intraepithelial neoplasia). Nadkarni
and coworkers2 noted that some abbreviations are words. They also
found that these could not reliably be distinguished based on the
uppercase format of the abbreviation, it being too inconsistent to be
relied on in medical notes.  

In the supplemental abbreviation file, there are at least 321
abbreviations that are also common words. A sampling of these
masqueraders is listed in Table 2. The full list of masqueraders
would be needed by any algorithm that parses medical text. Certainly,
if a word appears in all uppercase letters in a sentence (that is
otherwise lowercase), it seems reasonable to assume that the word is
an abbreviation. Abbreviations can be matched directly against the
list of expansions for the abbreviations that masquerade as words.  
If a word parsed from medical text matches an abbreviation from the
list of abbreviations that masquerade as words, and if the word has no
distinguishing format, then an algorithm may be designed to consider
the frequency of occurrence of the expansion compared to the frequency
of occurrence of the nonabbreviated word. For instance, and will
appear more often than axillary node dissection, although ash, the
abbreviation for atrial septal hypertrophy, may occur more often
than ash, the crumbly black material in the tray. As in the
algorithms created for the polysemous abbreviations, it is feasible to
look for relatedness between the considered expansion and the words
and concepts found in the vicinity of the parsed word.  
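
The two rules discussed above, format first and then relative frequency, can be combined in a short Python sketch. The frequency tables are invented for illustration, and the function name interpret_masquerader is hypothetical. A fuller version would add the relatedness check described for polysemous abbreviations.

```python
def interpret_masquerader(token, sentence, masqueraders, word_freq, expansion_freq):
    """Return the expansion if the token is judged an abbreviation, else the token."""
    key = token.lower()
    if key not in masqueraders:
        return token
    # An all-uppercase token in a sentence that is not all uppercase
    # is assumed to be an abbreviation.
    if token.isupper() and not sentence.isupper():
        return masqueraders[key]
    # Otherwise compare how often the expansion and the plain word occur.
    if expansion_freq.get(key, 0) > word_freq.get(key, 0):
        return masqueraders[key]
    return token


masqueraders = {"and": "axillary node dissection", "ash": "atrial septal hypertrophy"}
# Invented frequencies for a hypothetical collection of medical notes.
word_freq = {"and": 1000, "ash": 2}
expansion_freq = {"and": 3, "ash": 9}

print(interpret_masquerader("and", "nodes and margins are clear",
                            masqueraders, word_freq, expansion_freq))
print(interpret_masquerader("AND", "patient underwent AND last year",
                            masqueraders, word_freq, expansion_freq))
```
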

Table 2. Abbreviations That Masquerade as Words: Sampling From the A's (http://www.julesberman.info/abbtwo.htm)

Fatal Abbreviations: Innocent Victims of Abbreviation Drift 

It is tempting to assume that abbreviations can be expanded whenever
their context is known. For instance, the term cea would expand to the
phrase carcinoembryonic antigen in a blood test for a patient who is
status postcolectomy for colon cancer. The term CEA would expand to
the phrase carotid endarterectomy in a patient whose carotids were
being duplex-scanned for occlusive vascular disease.  
Table 3 contains many instances of abbreviations whose different
expansions could not be easily distinguished based on context.
Excluded from this list are indistinguishable expansions whose
meanings are virtually equivalent (eg, ich = intracranial hemorrhage
or intracerebral hemorrhage). Fatal abbreviations probably devolved
through imprecise use (a phenomenon I call abbreviation drift).
Unfortunately, these expansions are virtually impossible to
disambiguate, even by human experts in the knowledge domain. In the
case of the fatal abbreviation, it seems appropriate for algorithms
not to try to pick the correct expansion but to display an output that
lists all the different expansions for the term. For example: "The
patient has a history of aha; possible expansions are acquired
hemolytic anemia OR autoimmune hemolytic anemia." Of all the
abbreviations collected in the master list, it is the author's opinion
that the list of fatal abbreviations is the most important to map.
Once a programmer has a list of these abbreviations and their
alternate expansions, it is exceedingly easy to write a program that
will parse the abbreviations from medical text and append a listing of
all the possible expansions that might be correctly or mistakenly
applied.  
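
Such a program might look like the following Python sketch. The one-entry dictionary is a stand-in for the full list of fatal abbreviations, and the bracketed output format and the function name annotate_fatal are assumptions made for illustration.

```python
import re


def annotate_fatal(text, fatal):
    """Append every possible expansion after each fatal abbreviation in text."""
    def add_note(match):
        token = match.group(0)
        options = fatal.get(token.lower())
        if not options:
            return token
        # No attempt is made to choose; all alternatives are listed.
        return f"{token} [possible expansions: {' OR '.join(options)}]"
    return re.sub(r"[A-Za-z]+", add_note, text)


fatal = {"aha": ["acquired hemolytic anemia", "autoimmune hemolytic anemia"]}
print(annotate_fatal("The patient has a history of aha.", fatal))
```
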

Table 3. Fatal Abbreviations: Victims of Abbreviation Drift (http://www.julesberman.info/abbtwo.htm)

In summary, abbreviations can be classified according to the different
algorithmic protocols needed to assign an expansion. This study lays
the logical foundation for future work that collects the annotated
abbreviations into object classes whose methods are the software
implementations of the algorithmic approaches described herein. Any
future efforts will need to take special account of the so-called
fatal abbreviations (Table 3). When expanded incorrectly, these
abbreviations may lead to medical errors.  

References  

1. Berman JJ. Survey of medical abbreviations in pathology text. Arch
Pathol Lab Med 2002;126:781-802. (abstract section).
2. Nadkarni P, Chen R, Brandt C. UMLS concept indexing for production
databases. JAMIA 2001;8:80-91.
3. Liu H, Lussier YA, Friedman C. A study of abbreviations in the
UMLS. Proc AMIA Symp 2001;393-397.
4. Wong RL, Reno JD, Hain TC, Platt RC, Gaynon PS, Joseph DM. Profile
of a dictionary compiled from scanning over a million words of
surgical pathology narrative text. Comp Biomed Res 1980;13:382-388.
5. Kretschmar CS. Ewing's sarcoma and the "peanut" tumors. New Engl J
Med 1994;331:325-327.


Last modified: April 7, 2014