Berman JJ, Moore GW, Donnelly WH, Massey JK, Craig B.
A SNOMED analysis of three years' accessioned cases
(40,124) of a surgical pathology department: implications
for pathology-based demographic studies. Journal of the
American Medical Informatics Association (JAMIA),
Symposium Supplement 1994 and the Proceedings of the
18th Annual Symposium for Computer Applications in
Medicine (SCAMC), pp 188-192, 1994
This study was done in 1994. SNOMED has gone through
several versions since that time, and the findings in this paper
would be somewhat different if the study were repeated today
with the latest SNOMED version.
Jules J. Berman, January 4, 2007
A SNOMED ANALYSIS OF THREE YEARS' ACCESSIONED CASES (40,124)
OF A GENERAL SURGICAL PATHOLOGY DEPARTMENT:
IMPLICATIONS FOR PATHOLOGY-BASED DEMOGRAPHIC STUDIES
Jules J. Berman, G. William Moore, William H. Donnelly, James K. Massey, Brian Craig
Running Title: ANALYSIS OF 3 YEARS' SURGICAL PATHOLOGY SPECIMENS
ABSTRACT
Pathology departments devote considerable energy toward
indexing lesions. To date, there have been no detailed tabulations
of the results of these efforts. We have thoroughly analyzed three
years' surgical pathology reports (40,124) generated for 29,127
different patients from the University of Florida at Gainesville
between January 1, 1990 and December 31, 1992. The reports contained
64,921 SNOMED code entries (averaging 1.6 codes per specimen and
1.4 specimens per patient), which were accounted for by 1,998
distinct SNOMED morphologies. A
mere 21 entities accounted for 50% of the code occurrences. 265
entities accounted for 90% of the code occurrences, indicating that
the diagnostic efforts of pathology departments are primarily devoted
to a small fraction of the many thousands of described pathologic
entities. In this study, SNOMED information was stratified on the
basis of patient age and sex using data fields present in the reports
and was age adjusted to provide incidence profiles in a form comparable
to that used by the National Cancer Institute in its SEER program.
These data represent the first public analysis of SNOMED data from a
large pathology service and demonstrate how data can be compiled in
a usable form that preserves patient privacy.
Keywords: SNOMED, ICD-9, nomenclature, epidemiology, indexing, coding
INTRODUCTION
Before the advent of computerized laboratory information
systems, pathologists were severely limited in the way that they
could obtain information related to the scope of their activities.
A pathologist could easily determine the specimens received for a
specific patient, but there was no practical way of summarizing
collected pathologic data. In the past, when pathologists were
asked to comment on the incidence of a lesion, they might quote
a published statistic (from a report reflecting the experience of
another hospital in another geographic and social environment) or
they might offer a recollection from their own experience
(e.g., "I've seen half a dozen of these things and they
seem to occur in older people").
Despite the fact that modern pathology information systems all index
their reports under retrievable and universally recognized diagnostic
codes (e.g., the International Classification of Diseases (ICD), or
the Systematized Nomenclature of Medicine (SNOMED)), few services
take the step of analyzing their own surgical pathology data. The
reason for this is simple. Modern laboratory information systems
are designed to answer queries related to a designated report,
patient or diagnosis. Most information systems can give you all
the reports on a given patient, or all the reports with a specific
SNOMED number. But they cannot support a query that calls for all
the data on all the patients and all the data on all the diagnoses.
Such an undertaking would consume all of the computational resources
of the institution, would require additional programming effort and
would provide a service of no direct clinical necessity to any
specific patient. Furthermore, coding efforts may vary greatly
in their accuracy. The reliability of indexed data related to
diagnosis has received very little discussion in the medical
literature. Hall and Lemoine, in one of the few available studies,
found errors in more than 10% of indexed codes. Currently, many
pathology departments have employed automatic coding software and
thus relieved themselves from the time-consuming burden of manual
coding. In a recent study, we have shown that accurate automatic
coding can only be achieved by constantly monitoring the quality
of the coded output and adding appropriate changes in the code
look-up dictionary and in the manner that reports are written.
Perhaps the most telling indicator of the hazards of analyzing
SNOMED databases resides in the virtual absence of published
reports that include organized global data summaries (i.e., data
that encompass all of the diagnostic entities encountered during
a studied time period, as opposed to reports dealing with the
retrieval of reports related to one or several chosen diagnostic
entities). The lack of such studies underscores the failure of
pathology departments to satisfy the intended goals of indexed
coding. According to Cote and Robboy, our systems of disease
nomenclature and classification are directly descended from
earlier classifications (beginning with the London Bills of
Mortality in the early 1700's) addressing the need of any society
to understand the prevalence of diseases in its population. Cote
and Robboy, both principals in the development of SNOMED, note that
disease coding serves the needs of the entire health care and
maintenance system, providing a number of advantages including
epidemiologic studies and medical audit.
We have analyzed three years' data obtained from a general hospital
in Florida. This study is of value to pathologists, epidemiologists
and administrators for several reasons: 1) it provides for the first
time a quantitative description of what pathologists see and do; 2)
it offers a sample database to illustrate the values and limitations
of surgical pathology databases and to serve as a baseline for
comparison with databases from other pathology services; and 3) it
specifies the steps involved in preparing a database summary
from raw data retrieved from the electronic files of a laboratory
information system.
MATERIALS AND METHODS
We examined data from 40,124 cases accessioned at the Shands Hospital,
Gainesville, FL, between January 1, 1990, and December 31, 1992,
inclusive. These cases came from 29,127 patients with complete
demographics, of whom 304 had incompletely coded reports,
leaving a total of 28,823 patients with complete reports. Shands
Hospital is a general hospital, the teaching hospital for the
University of Florida College of Medicine in Gainesville, Florida,
which covers all major areas of medicine and surgery. The
consultation cases were primarily referrals from oncology patients.
Approximately 90% of cases were coded by pathology residents,
the remainder by faculty members. All coders participated in a
two-hour tutorial course on SNOMED coding, taught by one of us
(WHD). All coding was performed by referring to paper publications,
namely the complete book of SNOMED codes [2], sometimes referred to as
SNOMED-II, the most widely used version of SNOMED. SNOMED
microglossaries (subsets of SNOMED) for surgical pathology and
pediatric pathology were also used. As a rule, each specimen
received one topography code and one morphology code per diagnosis.
Redundant coding (assigning more than one morphology code to a specimen)
was performed only for special cases, such as unusual tumors.
Approximately 30 minutes per day was spent on coding.
On a daily basis, during the computer session in which the pathologist
electronically signs, or 'releases', reports for general hospital access,
the pathologist enters terms into the various SNOMED fields. Although
all six SNOMED fields are accessible to the pathologist, only topography
and morphology fields are default selected by the computer system, and
the pathologist must request special access to the fields for etiology,
function, procedure, and disease, through a cumbersome user interface.
Nearly all cases signed out in the department have topography,
morphology, and procedure codes. When the pathologist enters a
term at the prompt, the computer selects a match and displays the
match term and its corresponding SNOMED code. Only one match is
selected per lesion (i.e. a specimen is never multiple-coded). The
pathologist is given an opportunity to delete the SNOMED code offered,
if desired.
The computer used for the present study was an IBM PC/AT-compatible
computer (COMTEX, 80386 microprocessor, 25MHz, 330 Mb Priam hard disk),
programmed with American National Standard MUMPS (MGlobal, Inc., Houston,
TX), and the public-domain File Manager (FileMan) database management
system of the United States Department of Veterans Affairs used routinely
in 169 VA medical centers.
Reports were obtained as a raw ASCII file of the MUMPS global variable
that contained all of the textual material and data fields for every
surgical pathology report. The file was downloaded from the mainframe
computer at the Shands Hospital, Gainesville, FL, and contained the
complete text of all consecutive surgical pathology reports obtained
between January 1, 1990, and December 31, 1992. The entire contents of
each report,
including patient demographics, date and time of accessioning and
signout, specimen source, gross description, final microscopic diagnosis,
pathologist's identification, and manually-entered SNOMED codes, were
passed into the ASCII file, a total of 24 Megabytes. All routines were
written with MGlobal (Houston, TX) M.
RESULTS
The distribution frequency of patients by age is shown in
Table 1. The average age of patients who contributed tissue
to surgical pathology was 35.8 years. The ability
to stratify the ages of the population is extremely important,
as it permits comparison of the data to
other data sets for which the age distributions are known
(i.e., age adjustment).
Table 1. Age distribution of patients contributing surgical
pathology material
Age group           Number of patients
0-10 years old       3,096
10-20 years old      2,596
20-30 years old      5,038
30-40 years old      4,578
40-50 years old      3,301
50-60 years old      2,881
60-70 years old      3,958
70-80 years old      2,971
80-90 years old        665
>90 years old           43
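As an illustration of how a tabulation like Table 1 can be produced, the following minimal Python sketch assigns ages to decade groups and counts them. It is not the routine used in the study (which was written in M). The function names are hypothetical, the sample ages are invented, and the handling of boundary ages (an age of exactly 10 is placed in the 10-20 group, and 90 or older in the >90 group) is an assumption, since the table's group labels overlap at their boundaries.

```python
from collections import Counter


def age_bin(age_years: int) -> str:
    """Map an age in whole years to a decade label like those in Table 1.

    Assumption: groups are half-open, so age 10 falls in '10-20', not '0-10',
    and ages of 90 or more fall in '>90'.
    """
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years >= 90:
        return ">90"
    low = (age_years // 10) * 10
    return f"{low}-{low + 10}"


def age_distribution(ages):
    """Count patients per decade group."""
    return Counter(age_bin(a) for a in ages)


# Hypothetical ages, for illustration only (not data from the study).
sample_ages = [3, 10, 27, 29, 64, 91]
print(age_distribution(sample_ages))
```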
One of the most difficult problems in extracting
epidemiologically useful data from a SNOMED database is
data redundancy. For instance, a single patient may have
many basal cell carcinomas of the skin removed from
various skin sites. A simple count of coded specimens
may provide a false impression of the prevalence of basal
cell carcinoma in the population. For epidemiologic
purposes, the total number of people with basal cell carcinoma
would, in general, be more useful than the total
number of basal cell carcinoma specimens. The frequency
distribution of the number of specimens submitted per patient is
shown in Table 2.
Table 2. Frequency distribution of the number of specimens submitted
per patient
Specimens submitted    Number of patients with the specified
                       number of submitted specimens
 1                     22206
 2                      4378
 3                      1318
 4                       462
 5                       186
 6                        90
 7                        56
 8                        18
 9                        20
10                        19
11                        16
12                        10
13                         6
14                        13
15                         4
16                         9
17                         4
18                         1
19                         3
20                         3
21                         1
Among the patients who had tissue submitted to
pathology, there were, on average, 1.37 specimens per
patient. The greatest number of specimens submitted
for any patient in the 3-year study period was 21.
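The summary figures can be checked against Table 2 with a few lines of arithmetic. The Python sketch below uses only the values printed in the table; the variable names are ours. The rows sum to 28,823 patients and 40,124 specimens, which agree with the totals given in Materials and Methods. The mean computed over the patients listed in the table is close to, though not identical with, the reported 1.37; the exact denominator used for the reported figure is not stated.

```python
# Table 2: number of specimens submitted -> number of patients.
specimens_per_patient = {
    1: 22206, 2: 4378, 3: 1318, 4: 462, 5: 186, 6: 90, 7: 56,
    8: 18, 9: 20, 10: 19, 11: 16, 12: 10, 13: 6, 14: 13,
    15: 4, 16: 9, 17: 4, 18: 1, 19: 3, 20: 3, 21: 1,
}

# Total patients is the sum of the right-hand column.
total_patients = sum(specimens_per_patient.values())

# Total specimens weights each row by its specimen count.
total_specimens = sum(n * count for n, count in specimens_per_patient.items())

# Largest number of specimens submitted for any one patient.
max_specimens = max(specimens_per_patient)

print(total_patients)    # 28823
print(total_specimens)   # 40124
print(max_specimens)     # 21
print(round(total_specimens / total_patients, 2))  # 1.39
```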
The total number of morphology codes in the database is
64,921. Redundant codes for patients were eliminated by
preparing a list of all of the topography and morphology
codes for each patient and eliminating topography/morphology
pairs that shared the same first two digits of both their
topography and morphology codes. The reason for matching only the
first two digits of each code was to allow for differences among
pathologists in their choice of code (i.e.,
idiosyncratic differences in the last three digits).
Considering the example of basal cell carcinomas in the
patient population, the tumors may all have different
topography codes (skin of face T02120, skin of neck
T02300, skin of forearm T02630, etc.) and they may have
different morphology codes (basal cell carcinoma M80903,
morphea type basal cell carcinoma M80923,
basosquamous carcinoma M80943).
But for this example, any of the topography/morphology
code-pair permutations deriving
from the different topography and morphology listings will
have the same pair of 2-digit leading strings (in this case T02/M80).
Using matches in the first 2 digits of
topography/morphology code pairs effectively catches
most redundancies due to coding idiosyncrasy.
Coding idiosyncrasy is a commonly occurring phenomenon [5].
It occurs when the
same lesion is coded differently by different coders (e.g., one
coder's basal cell carcinoma is another coder's basosquamous carcinoma).
After elimination of redundancies (defined as two or more
topography/morphology pairs for a patient that are identical in the
first two digits of each code) there were a total of 58,712
topography/morphology pairs. The ability to perform this
elimination reliably is an essential step in SNOMED database
interpretation.
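The redundancy rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the M routine used in the study: the function names are hypothetical, codes are assumed to be written as an axis letter followed by five digits (as in T02120 and M80903), and the first pair encountered for each leading-string key is the one retained. The final pair in the sample list is an invented, unrelated lesion included only to show that non-matching pairs are kept.

```python
def pair_key(topography: str, morphology: str) -> tuple[str, str]:
    """Reduce a code pair to its leading strings, e.g. ('T02', 'M80').

    Assumes each code is an axis letter followed by five digits, so the
    first three characters are the letter plus the first two digits.
    """
    return topography[:3], morphology[:3]


def deduplicate(pairs):
    """Keep the first pair seen for each two-digit leading-string key."""
    seen = set()
    kept = []
    for topography, morphology in pairs:
        key = pair_key(topography, morphology)
        if key not in seen:
            seen.add(key)
            kept.append((topography, morphology))
    return kept


# One hypothetical patient: three skin tumors plus one unrelated lesion.
patient_pairs = [
    ("T02120", "M80903"),  # skin of face, basal cell carcinoma
    ("T02300", "M80923"),  # skin of neck, morphea type basal cell carcinoma
    ("T02630", "M80943"),  # skin of forearm, basosquamous carcinoma
    ("T59100", "M81403"),  # invented unrelated pair, for illustration
]
print(deduplicate(patient_pairs))
```

In this sketch the three skin tumors collapse to a single T02/M80 entry, which is the intended behavior for counting patients rather than specimens. The same rule would also merge genuinely separate lesions that share leading strings, a drawback taken up in the Discussion.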
An interesting finding was that a very small number of
morphologic entities account for the majority of morphology
and topography codes. As shown in Tables 3 and 4, the 'median
morphology code' (i.e., the 50th-percentile morphology code,
representing the halfway point in the morphology code ranking)
for manual coding occurs at rank 21. This means that at least
50% of all morphology codes are covered by the 21 most
frequent (i.e., highest-ranking) diagnoses. 90% of all manual
morphology codes are covered by the 265 most frequent diagnoses.
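The rank at which a given share of code occurrences is reached can be found by sorting the code tallies in descending order and accumulating them. The Python sketch below is one way to do this; the function name and the toy tallies are hypothetical and are not taken from the study data.

```python
def rank_covering(counts, fraction):
    """Return the smallest rank k such that the k most frequent codes
    account for at least `fraction` of all code occurrences."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no code occurrences")
    running = 0
    ranked = sorted(counts.values(), reverse=True)
    for rank, n in enumerate(ranked, start=1):
        running += n
        if running >= fraction * total:
            return rank
    return len(ranked)


# Invented tallies for five placeholder codes (100 occurrences in all).
toy_counts = {"code_a": 45, "code_b": 20, "code_c": 15, "code_d": 12, "code_e": 8}

print(rank_covering(toy_counts, 0.5))  # 2 (45 + 20 = 65 of 100)
print(rank_covering(toy_counts, 0.9))  # 4 (45 + 20 + 15 + 12 = 92 of 100)
```

Applied to the full set of morphology tallies, a function of this kind would yield the 50% and 90% ranks summarized in Table 3.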
Table 3. Summary of coded morphologies for 40,124 consecutive
specimens accessioned between January 1, 1990 and December 31, 1992
Total number of morphology codes                       64921
Number of disease entities accounting for 50%             21
   of the coded morphologies
Number of disease entities accounting for 90%            265
   of the coded morphologies
Number of disease entities accounting for 100%          1998
   of the coded morphologies
Average number of coded morphologies                     1.6
   per accessioned specimen
Number of uniquely coded entities                        865
   (entities coded only once in the accession year)
Table 4 shows a distribution of the 21 most common
morphologies and their occurrences, ranked in descending
frequency of occurrence, and accounting for 50% of all
diagnoses made in the period of study. Non-diagnostic and non-specific
morphologic codes account for the bulk of the high frequency
morphologies (e.g., normal tissue, no evidence of malignancy,
inflammation).
Table 4. List of 21 entities accounting for 50% of all
morphology codes
Morphology                          Number of cases
Normal tissue morphology            8712
Acute and chronic inflammation      2797
Chronic inflammation                2545
No evidence of malignancy           1774
Acute inflammation                  1745
Adenocarcinoma                      1441
Condyloma acuminatum                1315
Squamous cell carcinoma             1314
Protein deposition                  1193
Fibrosis                            1063
Inflammation                         968
Necrosis                             882
Basal cell hyperplasia               871
Calcium deposition                   864
Edema                                710
Mild dysplasia                       658
Products of conception               628
Proliferative endometrium            588
Ulcer                                587
Severe dysplasia                     550
As shown in Table 5, the 'median topography code' (i.e., the
50th-percentile topography code, representing the halfway point in the
topography code ranking) for manual coding occurs at rank 24.
This means that at least 50% of all manual topography
codes are covered by the 24 most frequent (i.e., highest-ranking)
anatomic sites. 90% of all manual topography codes are covered by
the 213 most frequent sites.
Table 5. Summary of coded topographies for 40,124 consecutive
specimens accessioned between January 1, 1990 and December 31, 1992
Total number of topography codes                       64921
Number of anatomic sites accounting for 50%               24
   of the coded topographies
Number of anatomic sites accounting for 90%              213
   of the coded topographies
Number of anatomic sites occurring once only             933
Number of anatomic sites accounting for 100%            1554
   of the coded topographies
Average number of coded topographies                     1.6
   per accessioned specimen
Number of uniquely coded entities                        621
   (entities coded only once in the accession year)
The distribution frequencies for any topographic code or
for any leading string of topographic code could be assessed
by age or by sex or both. Table 6 is an example of the age
distribution of all pancreatic neoplastic lesions encountered
in the 3-year period of study. A pancreas topography code
was considered to be any topographic code that began with the
two-digit numeric string 59. This would capture
T59000, Pancreas NOS (not otherwise specified), as well
as head of pancreas (T59100), pancreatic duct (T59010), etc.
Just as Table 6 demonstrates the age distribution for all
pancreatic lesions, a similar distribution could be produced
for lesions of any specified morphology code or leading
numeric string of morphology codes. All pancreatic
neoplastic morphology codes were accounted for by 8 sets
of 2-digit leading strings (M80, M81, M82, M83, M84, M88, M89 and
M93).
A table could be compiled that lists the age/sex distribution for all
lesions of all topographic sites, but a single topographic site was
selected due to limitations of space.
Table 6. Distribution frequency of all
pancreatic neoplastic lesions by age
Age group           Number of lesions
0-10 years old        0
10-20 years old       1
20-30 years old       3
30-40 years old       4
40-50 years old       8
50-60 years old       4
60-70 years old      16
70-80 years old      11
80-90 years old       1
>90 years old         0
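A selection of the kind summarized in Table 6 can be sketched in Python as a filter on leading strings followed by a count per age group. The sketch is illustrative and is not the M routine used in the study. The topography prefix and the eight morphology prefixes are those named in the text; the function names, the record layout (age, topography code, morphology code), the boundary handling of age groups, and the sample records are all assumptions made for the example.

```python
from collections import Counter

PANCREAS_PREFIX = "T59"
NEOPLASM_PREFIXES = ("M80", "M81", "M82", "M83", "M84", "M88", "M89", "M93")


def decade(age: int) -> str:
    """Decade label; assumes half-open groups and '>90' for ages 90 and over."""
    if age >= 90:
        return ">90"
    low = (age // 10) * 10
    return f"{low}-{low + 10}"


def pancreatic_neoplasms_by_age(records):
    """Count records whose codes match the pancreas and neoplasm prefixes.

    `records` is an iterable of (age_years, topography, morphology).
    """
    counts = Counter()
    for age, topography, morphology in records:
        # str.startswith accepts a tuple and matches any of its members.
        if topography.startswith(PANCREAS_PREFIX) and morphology.startswith(
            NEOPLASM_PREFIXES
        ):
            counts[decade(age)] += 1
    return counts


# Invented records, for illustration only.
sample_records = [
    (67, "T59100", "M81403"),  # pancreas, neoplastic prefix: counted
    (72, "T59000", "M80003"),  # pancreas, neoplastic prefix: counted
    (45, "T59010", "M40000"),  # pancreas, non-neoplastic prefix: skipped
    (63, "T02120", "M80903"),  # skin, not pancreas: skipped
]
print(pancreatic_neoplasms_by_age(sample_records))
```

Changing the two prefix constants would give the corresponding distribution for any other site or group of morphologies.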
DISCUSSION
A pathologist's understanding of the incidence of diseases is
determined by how often a lesion is encountered. This frame of reference
is inherently biased and can lead to misleading impressions. For
instance, in the 3-year database of the University of Florida, there
were 415 hernia sacs and 26 cases of hemorrhoids. Hemorrhoids occur
much more frequently than inguinal hernias, but a pathologist's
experience would indicate otherwise. Actually, surgery is almost
always performed for inguinal hernias, whereas patients with
hemorrhoids seldom seek surgical relief. Thus, we should not use a
surgical pathology database to determine the relative incidences of
diagnosis or treatment for diseases that may not involve surgery.
Surgical pathology databases are good sources of data pertaining to
lesions that must have biopsy confirmation or surgical treatment.
We can probably get a reasonably good idea of the incidence of
clinically detected hernias in the patient population,
because 1) a hernia repair is a general surgical procedure
performed at virtually every medical center (i.e. patients
do not cluster toward a few facilities that specialize in
hernia repair); 2) a procedure is performed on the majority of
patients with an inguinal hernia; and 3) tissue is received on
almost every hernia repair.
Another error that results from gauging disease incidence
by frequency of occurrence in a surgical pathology database relates
to the multiplicity of biopsies associated with a disease process
in a single patient. For instance, a single patient with chronic
lymphocytic leukemia (CLL) may, over a period of several years,
have the SNOMED morphology for CLL entered when a blood smear is
assessed, when a lymph node is biopsied, when a skin infiltrate
is sampled, when a spleen is removed, etc. For this reason, any
analysis of disease frequencies must be able to represent data in
a form where repeat morphologies for a patient are eliminated.
In this study, redundant specimens for a patient were eliminated by searching for repeated
topography/morphology code-pairs listed for a patient. However,
this solution to the problem of specimen redundancy has its own drawbacks, and may not be
appropriate for all types of studies. For instance, patients may develop
separate lesions of the same morphologic type over a period of time (e.g.,
bilateral breast cancer), and an epidemiologist interested in this phenomenon may need to account
for both tumors in a valid analysis of the incidence of cancers occurring in a population.
Partly as a result of these difficulties, commercial laboratory information
systems do not lend themselves to direct epidemiologic analysis, and database
queries must be carefully designed to produce useful
results.
In an effort to ensure that diagnoses can be retrieved from
databases, a variety of coding systems have been developed, all with
the intention of representing each disease entity by a unique number.
Thus a renal cell carcinoma, which may appear on a report as renal
cell adenocarcinoma, hypernephroma, clear cell carcinoma, kidney
carcinoma, kidney adenocarcinoma, adenocarcinoma of kidney, or even
as Grawitz tumor, can all be coded under the same, unique morphology
and topography codes. Reports written in English, French,
German, or any language may all use the same code numbers to index
their reports. Unfortunately,
coding efforts may vary greatly in their accuracy. The reliability
of indexed data related to diagnosis has received very little
discussion in the medical literature. Hall and Lemoine, in one of
the few available studies,
found errors in more than 10% of indexed codes [5]. Currently,
many pathology departments have employed automatic coding software
and thus relieved themselves from the time-consuming burden of manual
coding. In a recent study, we have shown that accurate automatic
coding can only be achieved by monitoring the quality of the coded
output and adding appropriate changes in the code look-up dictionary
and in the manner that reports are written [6]. Furthermore,
automatic coding can potentially produce databases with codes
chosen in a uniform and predictable way optimized to support
epidemiologic studies [6].
In the current study, the Shands Hospital laboratory information
system was used only as a source of raw data files, not as a database
engine supporting queries. Commercial laboratory information systems
cannot budget their computational resources (the amount of
computer time required to respond to a query) to perform in-depth
database analyses. It is our observation that departments desiring
full query access to their databases must acquire a dedicated database
application and then query their raw database file with their own
programs written in a database-specific language
(e.g., SQL, Structured Query Language).
Using routines written in the M programming environment, we have
shown that a SNOMED database can be fully analyzed, that
the problem of code redundancy can be overcome, and that analyses
relating the frequency of SNOMED morphology and topography entries to
patient demographics (age and sex) can be performed. SNOMED databases
are among the fastest growing and most comprehensive databases, in
that all U.S. hospitals seeking accreditation
by the College of American Pathologists or the Joint Commission on
Accreditation of Healthcare Organizations must index all surgical
pathology cases. In the last decade, most of the medical centers that
had previously indexed their cases using card filing systems have
switched to electronic coding.
SNOMED (specifically SNOMED version II) is, in our estimation, the
most commonly used surgical pathology indexing system. A formidable
amount of SNOMED data is accruing daily, and it would be a terrible
waste if these data were not shared and analyzed. Unlike
tumor registry data, which only provide cancer statistics, the SNOMED
databases produced by surgical pathology departments cover every
aspect of medicine. Prepared in the manner described in this study,
SNOMED data can be tabulated as listings of topography and morphology
codes, devoid of patient identifiers.
Each record in a distributable database might consist of: 1) a unique
patient identifier number that can be linked to a specific patient name by
the contributing medical center only; 2) a list of topography
and morphology code-pairs that describe all the different lesions
biopsied for the patient exclusive of lesion redundancies; 3) the
date of birth of the patient and 4) the sex of the patient.
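The four-part record described above might be represented as in the following Python sketch. The class name, field names and types are hypothetical choices made for illustration; the text specifies only what each record should contain, not how it should be stored. The sample values are invented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DistributableRecord:
    # 1) Identifier that only the contributing center can link to a name.
    patient_key: int
    # 2) (topography, morphology) pairs, with lesion redundancies removed.
    code_pairs: tuple[tuple[str, str], ...]
    # 3) Date of birth and 4) sex of the patient.
    date_of_birth: date
    sex: str

    def age_on(self, when: date) -> int:
        """Age in whole years on a given date, for age stratification."""
        had_birthday = (when.month, when.day) >= (
            self.date_of_birth.month,
            self.date_of_birth.day,
        )
        return when.year - self.date_of_birth.year - (0 if had_birthday else 1)


# Invented example record.
record = DistributableRecord(
    patient_key=1001,
    code_pairs=(("T02120", "M80903"),),
    date_of_birth=date(1925, 6, 15),
    sex="F",
)
print(record.age_on(date(1992, 12, 31)))  # 67
```

Storing the date of birth rather than a precomputed age lets a recipient stratify by age at any reference date. A contributor with stricter privacy requirements might instead release only an age group, which would be a departure from the record layout proposed in the text.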
REFERENCES
1. The International Classification of Diseases, 9th Revision: ICD-9CM, Second Edition, U.S. Department of Health and Human Services, Public Health Service, Health Care Financing Administration, U.S. Government Printing Office, 1980.
2. College of American Pathologists. Systematized nomenclature of medicine (SNOMED), Skokie: College of American Pathologists, 1976.
3. Cote RA, Robboy S: Progress in Medical Information Management: systematized nomenclature of medicine (SNOMED). JAMA 243:756, 1980
4. Davis RG: FileMan: A User Manual. National Association of VA Physicians, Bethesda, 1987
5. Hall PA, Lemoine NR: Comparison of manual data coding errors in two hospitals. J Clin Pathol 39:622, 1986
6. Moore GW, Berman JJ: Performance analysis of manual and automated Systemized Nomenclature of Medicine (SNOMED) coding. Am J Clin Pathol 101:253, 1994
Last modified: January 4, 2007