Biomedical books by Jules J. Berman, Ph.D., M.D.


Berman JJ, Moore GW, Donnelly WH, Massey JK, Craig B. 
A SNOMED analysis of three years' accessioned cases 
(40,124) of a surgical pathology department: implications 
for pathology-based demographic studies. Journal of the 
American Medical Informatics Association (JAMIA), 
Symposium Supplement 1994 and the Proceedings of the 
18th Annual Symposium on Computer Applications in 
Medical Care (SCAMC), pp 188-192, 1994 

This study was done in 1994.  SNOMED has gone through
several versions since that time, and the findings in this paper
would be somewhat different if the study were repeated today
with the latest SNOMED version.

Jules J. Berman, January 4, 2007

A SNOMED ANALYSIS OF THREE YEARS' ACCESSIONED CASES (40,124) OF A GENERAL SURGICAL PATHOLOGY DEPARTMENT: IMPLICATIONS FOR PATHOLOGY-BASED DEMOGRAPHIC STUDIES

Jules J. Berman, G. William Moore, William H. Donnelly, James K. Massey, Brian Craig

Running Title: ANALYSIS OF 3 YEARS' SURGICAL PATHOLOGY SPECIMENS

ABSTRACT

Pathology departments devote considerable energy toward indexing lesions. To date, there have been no detailed tabulations of the results of these efforts. We have thoroughly analyzed three years' surgical pathology reports (40,124) generated for 29,127 different patients from the University of Florida at Gainesville between January 1, 1990 and December 31, 1992. The reports carried 64,921 SNOMED code entries (averaging 1.6 codes per specimen and 1.4 specimens per patient), which were accounted for by 1,998 distinct SNOMED morphologies. A mere 21 entities accounted for 50% of the code occurrences, and 265 entities accounted for 90% of the code occurrences, indicating that the diagnostic efforts of pathology departments are primarily devoted to a small fraction of the many thousands of described pathologic entities. In this study, SNOMED information was stratified on the basis of patient age and sex using data fields present in the reports and was age adjusted to provide incidence profiles in a form comparable to that used by the National Cancer Institute in its SEER program. These data represent the first public analysis of SNOMED data from a large pathology service and demonstrate how data can be compiled in a usable form that preserves patient privacy.

Keywords: SNOMED, ICD-9, nomenclature, epidemiology, indexing, coding

INTRODUCTION

Before the advent of computerized laboratory information systems, pathologists were severely limited in the way that they could obtain information related to the scope of their activities. A pathologist could easily determine the specimens received for a specific patient, but there was no practical way of summarizing collected pathologic data. In the past, when pathologists were asked to comment on the incidence of a lesion, they might quote a published statistic (from a report reflecting the experience of another hospital in another geographic and social environment) or they might offer a recollection from their own experience (e.g., "I've seen half a dozen of these things and they seem to occur in older people").

Despite the fact that modern pathology information systems all index their reports under retrievable and universally recognized diagnostic codes (e.g., the International Classification of Diseases (ICD), or the Systematized Nomenclature of Medicine (SNOMED)), few services take the step of analyzing their own surgical pathology data. The reason for this is simple. Modern laboratory information systems are designed to answer queries related to a designated report, patient or diagnosis. Most information systems can give you all the reports on a given patient, or all the reports with a specific SNOMED number. But they cannot support a query that calls for all the data on all the patients and all the data on all the diagnoses. Such an undertaking would consume all of the computational resources of the institution, would require additional programming effort and would provide a service of no direct clinical necessity to any specific patient. Furthermore, coding efforts may vary greatly in their accuracy. The reliability of indexed data related to diagnosis has received very little discussion in the medical literature. Hall and Lemoine, in one of the few available studies, found errors in more than 10% of indexed codes. Currently, many pathology departments have employed automatic coding software and thus relieved themselves from the time-consuming burden of manual coding. In a recent study, we have shown that accurate automatic coding can only be achieved by constantly monitoring the quality of the coded output and making appropriate changes in the code look-up dictionary and in the manner that reports are written.

Perhaps the most telling indicator of the hazards of analyzing SNOMED databases resides in the virtual absence of published reports that include organized global data summaries (i.e., data that encompass all of the diagnostic entities encountered during a studied time period, as opposed to reports dealing with the retrieval of reports related to one or several chosen diagnostic entities). The lack of such studies underscores the failure of pathology departments to satisfy the intended goals of indexed coding. According to Cote and Robboy, our systems of disease nomenclature and classification are directly descended from earlier classifications (beginning with the London Bills of Mortality in the early 1700's) addressing the need of any society to understand the prevalence of diseases in its population. According to Cote and Robboy, both principals in the development of SNOMED, disease coding serves the needs of the entire health care and maintenance system, providing a number of advantages including epidemiologic studies and medical audit.

We have analyzed three years' data obtained from a general hospital in Florida. This study is of value to pathologists, epidemiologists and administrators for several reasons: 1) it provides for the first time a quantitative description of what pathologists see and do; 2) it offers a sample database to illustrate the values and limitations of surgical pathology databases and to serve as a baseline for comparison with databases from other pathology services; and 3) it specifies the steps involved in preparing a database summary from raw data retrieved from the electronic files of a laboratory information system.

MATERIALS AND METHODS

We examined data from 40,124 cases accessioned at the Shands Hospital, Gainesville, FL, between January 1, 1990, and December 31, 1992, inclusive. From these, there were 29,127 patients with complete demographics and 304 patients with incompletely coded reports, for a total of 28,823 patients with complete reports. Shands Hospital is a general hospital, the teaching hospital for the University of Florida College of Medicine in Gainesville, Florida, which covers all major areas of medicine and surgery. The consultation cases were primarily referrals from oncology patients.

Approximately 90% of cases were coded by pathology residents, the remainder by faculty members. All coders participated in a two-hour tutorial course on SNOMED coding, taught by one of us (WHD). All coding was performed by referring to paper publications, namely the complete book of SNOMED codes [2], sometimes referred to as SNOMED-II, the most widely used version of SNOMED. SNOMED microglossaries (subsets of SNOMED) for surgical pathology and pediatric pathology were also used. As a rule, each specimen received one topography code and one morphology code per diagnosis. Redundant coding (assigning more than one morphology code to a specimen) was performed only for special cases, such as unusual tumors. Approximately 30 minutes per day was spent on coding.

On a daily basis, during the computer session in which the pathologist electronically signs, or `releases' reports for general hospital access, the pathologist enters terms into the various SNOMED fields. Although all six SNOMED fields are accessible to the pathologist, only topography and morphology fields are default selected by the computer system, and the pathologist must request special access to the fields for etiology, function, procedure, and disease, through a cumbersome user interface. Nearly all cases signed out in the department have topography, morphology, and procedure codes. When the pathologist enters a term at the prompt, the computer selects a match and displays the match term and its corresponding SNOMED code. Only one match is selected per lesion (i.e. a specimen is never multiple-coded). The pathologist is given an opportunity to delete the SNOMED code offered, if desired.

The computer used for the present study was an IBM PC/AT-compatible computer (COMTEX, 80386 microprocessor, 25 MHz, 330 MB Priam hard disk), programmed with American National Standard MUMPS (MGlobal, Inc., Houston, TX), and the public-domain File Manager (FileMan) database management system of the United States Department of Veterans Affairs, used routinely in 169 VA medical centers.

Reports were obtained as a raw ASCII file of the MUMPS global variable that contained all of the textual material and data fields for every surgical pathology report. The file was downloaded from the mainframe computer at the Shands Hospital, Gainesville, FL, and contained the complete text of all consecutive surgical pathology reports obtained between January 1, 1990, and December 31, 1992. The entire contents of each report, including patient demographics, date and time of accessioning and signout, specimen source, gross description, final microscopic diagnosis, pathologist's identification, and manually-entered SNOMED codes, were passed into the ASCII file, a total of 24 megabytes. All routines were written with MGlobal (Houston, TX) M.

RESULTS

The distribution frequency of patients by age is shown in Table 1. The average age of patients who contributed tissue to surgical pathology was 35.8 years. The ability to stratify the ages of the population is extremely important, as it permits comparison of the data to other data sets for which the age distributions are known (i.e., age adjustment).

Table 1. Age distribution of patients contributing surgical pathology material

Age                Number of patients

0-10 years old     3,096
10-20 years old    2,596
20-30 years old    5,038
30-40 years old    4,578
40-50 years old    3,301
50-60 years old    2,881
60-70 years old    3,958
70-80 years old    2,971
80-90 years old      665
>90 years old         43
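
Age adjustment is commonly carried out by direct standardization, in which each age stratum's rate is weighted by that stratum's share of a reference population. The sketch below illustrates the arithmetic only. The source does not state which method or reference population was used, so the function name, the rates, and the weights are all hypothetical.

```python
def directly_standardized_rate(stratum_rates, standard_weights):
    """Weighted mean of age-specific rates, weighted by a reference population.

    Assumption: direct standardization; the paper does not name its method.
    """
    if len(stratum_rates) != len(standard_weights):
        raise ValueError("one weight is needed per age stratum")
    total_weight = sum(standard_weights)
    weighted = sum(r * w for r, w in zip(stratum_rates, standard_weights))
    return weighted / total_weight


# Hypothetical example: three age strata, rates per 100,000 persons.
rates = [10.0, 50.0, 200.0]
weights = [5, 3, 2]  # hypothetical reference-population sizes per stratum
adjusted = directly_standardized_rate(rates, weights)
print(adjusted)
```

Two populations standardized against the same weights can then be compared without their differing age structures distorting the result.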

One of the most difficult problems in extracting epidemiologically useful data from a SNOMED database is data redundancy. For instance, a single patient may have many basal cell carcinomas of the skin removed from various skin sites. A simple count of coded specimens may provide a false impression of the prevalence of basal cell carcinoma in the population. For epidemiologic purposes, the total number of people with basal cell carcinoma would, in general, be more useful than the total number of basal cell carcinoma specimens. The frequency distribution of the number of specimens submitted per patient is shown in Table 2.

Table 2. Frequency distribution of the number of specimens submitted per patient

Specimens submitted    Number of patients

1                      22206
2                      4378
3                      1318
4                      462
5                      186
6                      90
7                      56
8                      18
9                      20
10                     19
11                     16
12                     10
13                     6
14                     13
15                     4
16                     9
17                     4
18                     1
19                     3
20                     3
21                     1



Among the patients who had tissue submitted to pathology, there were, on average, 1.37 specimens per patient. The greatest number of specimens submitted for any patient in the 3-year study period was 21.
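
The totals behind these summary figures can be recomputed from Table 2. The following is a minimal sketch; the variable and function names are illustrative and are not taken from the original M routines. Summing the table gives 28,823 patients (the number with complete reports) and 40,124 specimens.

```python
# Table 2 as a mapping: specimens submitted -> number of patients.
SPECIMENS_TO_PATIENTS = {
    1: 22206, 2: 4378, 3: 1318, 4: 462, 5: 186, 6: 90, 7: 56,
    8: 18, 9: 20, 10: 19, 11: 16, 12: 10, 13: 6, 14: 13,
    15: 4, 16: 9, 17: 4, 18: 1, 19: 3, 20: 3, 21: 1,
}


def summarize(distribution):
    """Return patient total, specimen total, and the largest specimen count."""
    patients = sum(distribution.values())
    specimens = sum(k * n for k, n in distribution.items())
    return {
        "patients": patients,
        "specimens": specimens,
        "max_specimens": max(distribution),
    }


summary = summarize(SPECIMENS_TO_PATIENTS)
print(summary)
```

Dividing the specimen total by a patient count gives the mean number of specimens per patient; the result depends on whether the 28,823 tabulated patients or all 29,127 patients are used as the denominator.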

The total number of morphology codes in the database is 64,921. Redundant codes for patients were eliminated by preparing a list of all of the topography and morphology codes for each patient and eliminating topography/morphology pairs that shared the same first two digits of both their topography and morphology codes. The reason for matching only the first two digits of each code was to allow for differences among pathologists in their choice of a morphology code (i.e., idiosyncratic differences in the last three digits). Considering the example of basal cell carcinomas in the patient population, the tumors may all have different topography codes (skin of face T02120, skin of neck T02300, skin of forearm T02630, etc.) and they may have different morphology codes (basal cell carcinoma M80903, morphea type basal cell carcinoma M80923, basosquamous carcinoma M80943). But for this example, any of the topography/morphology code-pair permutations deriving from the different topography and morphology listings will have the same pair of 2-digit leading strings (in this case T02/M80). Using matches in the first 2 digits of topography/morphology code pairs effectively catches most redundancies due to coding idiosyncrasy. Code idiosyncrasy is a commonly occurring phenomenon [5]. It occurs when the same lesion is coded differently by different coders (e.g., one coder's basal cell carcinoma is another coder's basosquamous carcinoma). After elimination of redundancies (defined as two or more topography/morphology pairs for a patient that are identical in the first two digits of each code) there were a total of 58,712 topography/morphology pairs. The ability to perform this elimination reliably is an essential step in SNOMED database interpretation.
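
The redundancy rule can be sketched as follows. This is an illustration in Python, not the original M routine; it assumes each code is a string made of an axis letter followed by five digits, as the codes are written in the text, and the patient identifiers are hypothetical.

```python
from collections import defaultdict


def leading_pair(topography, morphology):
    """Reduce a code pair to its leading strings, e.g. ('T02', 'M80')."""
    return topography[:3], morphology[:3]


def deduplicate(entries):
    """Collapse each patient's code pairs to distinct leading-string pairs.

    entries: iterable of (patient_id, topography_code, morphology_code).
    Returns {patient_id: set of (topography_prefix, morphology_prefix)}.
    """
    unique = defaultdict(set)
    for patient_id, topo, morph in entries:
        unique[patient_id].add(leading_pair(topo, morph))
    return dict(unique)


# Hypothetical patients; the codes are the basal cell examples from the text.
entries = [
    ("P1", "T02120", "M80903"),  # skin of face, basal cell carcinoma
    ("P1", "T02300", "M80923"),  # skin of neck, morphea type
    ("P1", "T02630", "M80943"),  # skin of forearm, basosquamous
    ("P2", "T02120", "M80903"),
]
result = deduplicate(entries)
print(result)
```

The three entries for the first patient collapse to the single pair T02/M80, while the second patient keeps a separate entry, so counting the pairs that remain counts affected patients rather than specimens.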

An interesting finding was that a very small number of morphologic entities account for the majority of morphology and topography codes. As shown in Tables 3 and 4, the 'median morphology code' (i.e., the 50th-percentile morphology code, representing the halfway point in the morphology code ranking) for manual coding occurs at rank 21. This means that at least 50% of all morphology codes are covered by the 21 most frequent (i.e., highest-ranking) diagnoses. 90% of all manual morphology codes are covered by the 265 most frequent diagnoses.
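
The rank at which a given share of code occurrences is reached can be computed from a list of per-code counts. The sketch below shows one way to do this; the function name is illustrative and the counts are a toy example, not data from this study.

```python
def coverage_rank(counts, percent):
    """Smallest number of top-ranked codes covering `percent` of occurrences.

    counts: occurrences per distinct code, in any order.
    percent: integer percentage, e.g. 50 or 90.
    """
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    running = 0
    for rank, count in enumerate(ordered, start=1):
        running += count
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return rank
    return len(ordered)


toy_counts = [40, 25, 15, 10, 5, 3, 2]  # hypothetical counts, total 100
print(coverage_rank(toy_counts, 50))
print(coverage_rank(toy_counts, 90))
```

In the toy example the two most frequent codes cover half of all occurrences and the top four cover 90%, the same kind of concentration reported for the study data.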

Table 3. Summary of coded morphologies for 40,124 consecutive 
specimens accessioned between Jan 1, 1990 and Dec 31, 1992

Total number of morphology codes            64921

Number of disease entities                     21
accounting for 50% of the
coded morphologies

Number of disease entities                    265
accounting for 90% of the
coded morphologies

Number of disease entities                   1998
accounting for 100% of the
coded morphologies

Average number of coded morphologies            1.6
per accessioned specimen

Number of uniquely coded entities             865
(entities coded only once in the
accession year)



Table 4 shows a distribution of the 21 most common morphologies and their occurrences, ranked in descending frequency of occurrence, and accounting for 50% of all diagnoses made in the period of study. Non-diagnostic and non-specific morphologic codes account for the bulk of the high frequency morphologies (e.g., normal tissue, no evidence of malignancy, inflammation).

Table 4. List of 21 entities accounting for 50% of all 
morphology codes

                                   Number of
                                    cases

Normal tissue morphology             8712
Acute and chronic inflammation       2797
Chronic inflammation                 2545
No evidence of malignancy            1774
Acute inflammation                   1745
Adenocarcinoma                       1441
Condyloma acuminatum                 1315
Squamous cell carcinoma              1314
Protein deposition                   1193
Fibrosis                             1063
Inflammation                          968
Necrosis                              882
Basal cell hyperplasia                871
Calcium deposition                    864
Edema                                 710
Mild dysplasia                        658
Products of conception                628
Proliferative Endometrium             588
Ulcer                                 587
Severe dysplasia                      550


As shown in Table 5, the 'median topography code' (i.e., the 50th-percentile topography code, representing the halfway point in the topography code ranking) for manual coding occurs at rank 24. This means that at least 50% of all manual topography codes are covered by the 24 most frequent (i.e., highest-ranking) sites. 90% of all manual topography codes are covered by the 213 most frequent sites.

Table 5. Summary of coded topographies for 40,124 consecutive 
specimens accessioned between Jan 1, 1990 and Dec 31, 1992

Total number of topography codes          64921

Number of anatomic sites                     24
accounting for 50% of the
coded topographies

Number of anatomic sites                    213
accounting for 90% of the
coded topographies

Number of anatomic sites occurring          933
once only

Number of anatomic sites                   1554
accounting for 100% of the
coded topographies

Average number of coded topographies        1.6
per accessioned specimen

Number of uniquely coded entities           621
(entities coded only once in the
accession year)

The distribution frequencies for any topographic code or for any leading string of topographic code could be assessed by age or by sex or both. Table 6 is an example of the age distribution of all pancreatic neoplastic lesions encountered in the 3-year period of study. A pancreas topography code was considered to be any topographic code that began with the two-digit numeric string 59. This would capture T59000, Pancreas NOS (not otherwise specified), as well as head of pancreas (T59100), pancreatic duct (T59010), etc. Just as Table 6 demonstrates the age distribution for all pancreatic lesions, a similar distribution could be obtained for lesions of any specified morphology code or leading numeric string of morphology codes. All pancreatic neoplastic morphology codes were accounted for by 8 sets of 2-digit leading strings (M80, M81, M82, M83, M84, M88, M89 and M93). A table could be compiled that lists the age/sex distribution for all lesions of all topographic sites, but a single topographic site was selected due to limitations of space.

Table 6. Distribution frequency of all pancreatic neoplastic lesions by age

Age                Number of lesions

0-10 years old      0
10-20 years old     1
20-30 years old     3
30-40 years old     4
40-50 years old     8
50-60 years old     4
60-70 years old    16
70-80 years old    11
80-90 years old     1
>90 years old       0
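
A tabulation like Table 6 can be produced by filtering on leading strings and binning by decade. The sketch below is illustrative only: the record layout and sample records are hypothetical, and because the published bins share their boundaries (0-10, 10-20), it assumes that an age on a boundary falls in the upper bin.

```python
PANCREAS_PREFIX = "T59"
NEOPLASM_PREFIXES = {"M80", "M81", "M82", "M83", "M84", "M88", "M89", "M93"}
DECADE_LABELS = [f"{lo}-{lo + 10}" for lo in range(0, 90, 10)] + [">90"]


def decade_label(age):
    """Assumption: a boundary age (e.g. 10) is placed in the upper bin."""
    return DECADE_LABELS[min(age // 10, 9)]


def age_distribution(records, topo_prefix, morph_prefixes):
    """Count matching lesions per age bin.

    records: iterable of (age_in_years, topography_code, morphology_code).
    """
    tally = {label: 0 for label in DECADE_LABELS}
    for age, topo, morph in records:
        if topo.startswith(topo_prefix) and morph[:3] in morph_prefixes:
            tally[decade_label(age)] += 1
    return tally


# Hypothetical records for illustration only.
records = [
    (64, "T59100", "M81403"),  # pancreatic site, neoplastic prefix: counted
    (47, "T59000", "M81403"),  # counted
    (64, "T59010", "M40000"),  # pancreatic site, non-neoplastic: skipped
    (30, "T02120", "M80903"),  # neoplastic, but not pancreas: skipped
]
table = age_distribution(records, PANCREAS_PREFIX, NEOPLASM_PREFIXES)
print(table)
```

Passing a different topography prefix or set of morphology prefixes yields the corresponding distribution for any other site or lesion group.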



DISCUSSION

A pathologist's understanding of the incidence of diseases is determined by how often a lesion is encountered. This frame of reference is inherently biased and can lead to misleading impressions. For instance, in the 3-year database of the University of Florida, there were 415 hernia sacs and 26 cases of hemorrhoids. Hemorrhoids occur much more frequently than inguinal hernias, but a pathologist's experience would indicate otherwise. Actually, surgery is almost always performed for inguinal hernias, whereas patients with hemorrhoids seldom seek surgical relief. Thus, we should not use a surgical pathology database to determine the relative incidences of diagnosis or treatment for diseases that may not involve surgery. Surgical pathology databases are good sources of data pertaining to lesions that must have biopsy confirmation or surgical treatment. We can probably get a reasonably good idea of the incidence of clinically detected hernias in the patient population, because 1) a hernia repair is a general surgical procedure performed at virtually every medical center (i.e. patients do not cluster toward a few facilities that specialize in hernia repair); 2) a procedure is performed on the majority of patients with an inguinal hernia; and 3) tissue is received on almost every hernia repair.

Another error that results from gauging disease incidence by frequency of occurrence in a surgical pathology database relates to the multiplicity of biopsies associated with a disease process in a single patient. For instance, a single patient with chronic lymphocytic leukemia (CLL) may, over a period of several years, have the SNOMED morphology for CLL entered when a blood smear is assessed, when a lymph node is biopsied, when a skin infiltrate is sampled, when a spleen is removed, etc. For this reason, any analysis of disease frequencies must be able to represent data in a form where repeat morphologies for a patient are eliminated. In this study, redundant specimens for a patient were eliminated by searching for repeated topography/morphology code-pairs listed for a patient. However, this solution to the problem of specimen redundancy has its own drawbacks, and may not be appropriate for all types of studies. For instance, patients may develop separate lesions of the same morphologic type over a period of time (e.g., bilateral breast cancer), and an epidemiologist interested in this phenomenon may need to account for both tumors in a valid analysis of the incidence of cancers occurring in a population. Partly as a result of these difficulties, commercial laboratory information systems do not lend themselves to direct epidemiologic analysis, and database queries must be carefully designed to produce useful results.

In an effort to ensure that diagnoses can be retrieved from databases, a variety of coding systems have been developed, all with the intention of assigning each disease entity a unique number. Thus a renal cell carcinoma, which may appear on a report as renal cell adenocarcinoma, hypernephroma, clear cell carcinoma, kidney carcinoma, kidney adenocarcinoma, adenocarcinoma of kidney, or even as Grawitz tumor, can all be coded under the same, unique morphology and topography codes. Reports written in English, French, German, or any language may all use the same code numbers to index their reports. Unfortunately, coding efforts may vary greatly in their accuracy. The reliability of indexed data related to diagnosis has received very little discussion in the medical literature. Hall and Lemoine, in one of the few available studies, found errors in more than 10% of indexed codes [5]. Currently, many pathology departments have employed automatic coding software and thus relieved themselves from the time-consuming burden of manual coding. In a recent study, we have shown that accurate automatic coding can only be achieved by monitoring the quality of the coded output and making appropriate changes in the code look-up dictionary and in the manner that reports are written [6]. Furthermore, automatic coding can potentially produce databases with codes chosen in a uniform and predictable way optimized to support epidemiologic studies [6]. In the current study, the Shands Hospital laboratory information system was used only as a source of raw data files, not as a database engine supporting queries. Commercial laboratory information systems cannot budget their computational resources (the amount of computer time required to respond to a query) to perform in-depth database analyses.

It is our observation that departments desiring full query access to their databases must acquire a dedicated database application and then query their raw database file with their own programs written in a database-specific language (e.g., SQL, Structured Query Language). Using routines written in the M programming environment, we have shown that a SNOMED database can be fully analyzed, that the problem of code redundancy can be overcome, and that analyses relating the frequency of SNOMED morphology and topography entries to patient demographics (age and sex) can be performed.

SNOMED databases are among the fastest growing and most comprehensive databases, in that all U.S. hospitals seeking accreditation by the College of American Pathologists or the Joint Commission on Accreditation of Healthcare Organizations must index all surgical pathology cases. In the last decade, most of the medical centers that had previously indexed their cases using card filing systems have switched to electronic coding. SNOMED (specifically SNOMED version II) is, in our estimation, the most commonly used surgical pathology indexing system. A formidable amount of SNOMED data is accruing daily, and it would be a terrible waste if these data were not shared and analyzed. Unlike tumor registry data, which only provides cancer statistics, the SNOMED databases produced by surgical pathology departments cover every aspect of medicine. Prepared in the manner described in this study, SNOMED data can be tabulated as listings of topography and morphology codes, devoid of patient identifiers. Each record in a distributable database might consist of: 1) a unique patient identifier number that can be linked to a specific patient name by the contributing medical center only; 2) a list of topography and morphology code-pairs that describe all the different lesions biopsied for the patient exclusive of lesion redundancies; 3) the date of birth of the patient and 4) the sex of the patient.
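
The four-field record proposed above might be represented as in the following sketch. The class name, field names, and sample values are hypothetical; the paper specifies only the content of the fields, not a storage format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DistributableRecord:
    # Opaque key; only the contributing center can map it to a patient name.
    patient_key: str
    # (topography, morphology) code pairs with lesion redundancies removed.
    code_pairs: tuple
    date_of_birth: date
    sex: str

    def age_on(self, when):
        """Age in completed years on a given date."""
        years = when.year - self.date_of_birth.year
        birthday = (self.date_of_birth.month, self.date_of_birth.day)
        if (when.month, when.day) < birthday:
            years -= 1
        return years


# Hypothetical record for illustration only.
record = DistributableRecord(
    patient_key="A0001",
    code_pairs=(("T02120", "M80903"),),
    date_of_birth=date(1930, 6, 15),
    sex="F",
)
print(record.age_on(date(1992, 12, 31)))
```

Storing the date of birth rather than a fixed age lets a recipient compute the patient's age for whatever reference date a study requires.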

REFERENCES

1. The International Classification of Diseases, 9th Revision: ICD-9CM, Second Edition, U.S. Department of Health and Human Services, Public Health Service, Health Care Financing Administration, U.S. Government Printing Office, 1980.

2. College of American Pathologists. Systematized nomenclature of medicine (SNOMED), Skokie: College of American Pathologists, 1976.

3. Cote RA, Robboy S: Progress in Medical Information Management: systematized nomenclature of medicine (SNOMED). JAMA 243:756, 1980

4. Davis RG: FileMan: A User Manual. National Association of VA Physicians, Bethesda, 1987

5. Hall PA, Lemoine NR: Comparison of manual data coding errors in two hospitals. J Clin Pathol 39:622, 1986

6. Moore GW, Berman JJ: Performance analysis of manual and automated Systematized Nomenclature of Medicine (SNOMED) coding. Am J Clin Pathol 101:253, 1994

Last modified: January 4, 2007


