    Workshop: SA35

    Data Mining and Data Warehousing

    2000 ASCP/CAP Annual Meeting

    Anatomic Pathology Data Mining

    J Harrison, S Moser, JJ Berman

    Jules J. Berman, Ph.D., M.D. *

    Program Director for Pathology Informatics

    Cancer Diagnosis Program

    National Cancer Institute

    National Institutes of Health

    bermanj@mail.nih.gov

    Hand-out Notes*

    *This hand-out is not copyrighted and is in the public domain. The contents are exclusively the personal opinions of the author and do not represent the authorized statements of any federal agency. None of the contents should be considered legal or medical advice.

    1.0 Workshop Overview

    It is now technically feasible to perform analysis on large datasets containing hundreds of thousands of surgical pathology reports, all linked to clinical data residing in integrated hospital information systems. Such datasets could support many types of research, including data mining, hypothesis generation, marker development and validation, and outcomes analysis. Since surgical pathology reports are linked to archived tissue blocks, large institutional pathology datasets could be used as resource locators for tissue samples (i.e., virtual tissue archives). When large numbers of tissue samples annotated with demographic and clinical information become available to researchers, it will be possible to design and implement previously prohibitive studies on stratified patient populations and on rare tumors.

    Three very important factors that have limited the usefulness of hospital-based pathology archives are: 1) the confidential nature of medical information, which imposes restrictions on how data can be shared; 2) the free-text format of pathology reports, which makes it difficult to prepare common data elements representing the conceptual information contained in the electronic medical record; and 3) the indeterminate data quality of many anatomic pathology datasets.

    This workshop will cover these issues and provide an example of a successful web-based anatomic pathology data mining resource and a soon-to-be-developed NCI-sponsored pathology informatics network.

    2.0 The Data Domain

    Approximately 40 million anatomic pathology reports are issued each year in the United States. Large academic institutions and testing laboratories now exist that contain hundreds of thousands of anatomic pathology records in electronic form, each with diagnostic, demographic and clinical data. Internet technology offers the possibility of distributing data mining queries to multiple institutions and returning summary data for literally millions of specimens.

    The principal areas for potential application of anatomic pathology data mining include:

    2.1 Epidemiology

    Using data mining tools, the pathologist can determine the exact occurrence of every entity encountered in the pathology department, and can compare these data with published data, as a means of detecting inconsistent or anomalous data. For instance, upon review of all neoplasm diagnoses, it may be noted that 50% of the adenocarcinomas of lung are designated as bronchioloalveolar type. Comparison with published data would suggest that bronchioloalveolar carcinoma should account for only 4-8% of lung cancer, prompting a review of the microscopic specimens. Specimen review might indicate that the diagnosis of bronchioloalveolar carcinoma is being overused, and that applying stricter diagnostic criteria would bring the incidence down to levels reported at other institutions. On the other hand, if a review of cases shows that the departmental diagnoses are accurate, then a high incidence of bronchioloalveolar carcinomas might become a valid public health issue. In the past, the observation of a higher-than-expected occurrence of a particular type of cancer received by Pathology has prompted important epidemiologic discoveries, exemplified by angiosarcoma of liver in tire plant workers and clear cell carcinoma of vagina in women exposed in utero to DES. Data mining efforts directed toward analyzing the incidence of all pathologic entities stratified by demographic and other clinical data may lead to many advances in disease epidemiology.
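
    The comparison described above amounts to a range check on a proportion. The following is a minimal sketch in Python, not a validated epidemiologic method. The diagnosis counts and the function name flag_anomalous_share are hypothetical, chosen so that half of the adenocarcinomas are bronchioloalveolar; the 4-8% reference range is the published figure quoted above.

```python
def flag_anomalous_share(counts, entity, expected_low, expected_high):
    """Return (share, is_anomalous) for one diagnosis among all counted diagnoses."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no diagnoses to analyze")
    share = counts.get(entity, 0) / total
    # Anomalous means the observed share falls outside the published range.
    return share, not (expected_low <= share <= expected_high)


# Hypothetical departmental counts of lung cancer diagnoses (illustrative only).
lung_cancer_counts = {
    "bronchioloalveolar carcinoma": 150,
    "adenocarcinoma, other types": 150,
    "squamous cell carcinoma": 120,
    "small cell carcinoma": 80,
}

share, anomalous = flag_anomalous_share(
    lung_cancer_counts, "bronchioloalveolar carcinoma", 0.04, 0.08
)
print(f"share of lung cancer diagnoses: {share:.1%}; review needed: {anomalous}")
```

    A flag raised this way says only that the numbers disagree with the literature. Deciding whether the cause is overdiagnosis or a genuine excess still requires review of the slides.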

    2.2 Evidence-based research and outcomes analysis

    Much of medical practice has developed as an empiric art. Treatment protocols are often accepted as standard of care, with no objective evidence. Radical mastectomies for breast cancer, hiatal hernia repairs, laser endarterectomies for advanced atherosclerotic disease, bone marrow transplants for advanced breast carcinoma, and tonsillectomies for throat infections are all examples of medical procedures whose uses have been reduced as the result of clinical studies showing general ineffectiveness, or effectiveness restricted to a subset of the diseased populations. The medical community and the public are demanding that the practice of medicine be evidence-based. This means that the validity of a treatment or test must be based on well-designed studies yielding statistically significant results. Unfortunately, such studies are extremely expensive and time-consuming. Considering the enormous number of old, new, and developing medical tests and procedures, it would be impossible to implement expensive prospective studies in every case. However, retrospective studies using large collections of pathology data, linked to clinical data and specimens, may sometimes yield evidence related to the value of medical procedures, treatments, and tests.

    2.3 Monitoring and Improving Patient Care

    There is an enormous potential for monitoring and improving individual patient care with mined anatomic pathology data. Currently the only universal purpose for issuing an anatomic pathology report is to provide a correct and timely diagnosis for the individual patient. However, when analyzed as aggregate data, anatomic pathology datasets may help identify potential risks to individual patients, and may uncover managerial and administrative problems in medical institutions. Data mining procedures could automatically generate lists of problem cases that might require additional medical attention. Through an automated process, pathology departments could alert clinicians to potential problem cases.

    2.4 Developing and validating disease markers

    See 4.1

    3.0 Problems in the Data Domain

    3.1 Confidentiality of the anatomic pathology data

    All anatomic pathology data is confidential medical information. Although pathologists have free access to their own data for the purposes of patient care, pathologists do not have free access to their own data for purposes of research.

    For most pathologists who work in academic institutions, research performed with their own records is regulated by 45CFR46 (Title 45 of the Code of Federal Regulations, part 46), the so-called Common Rule that deals with protections for human subjects involved in research. As a generalization, if you are a pathologist hoping to do a data mining project that is publishable, then 45CFR46 will probably apply to you.

    The fact that you may be using your own department's data (and no tissues) will probably not exempt you from 45CFR46. Recent opinions from the National Bioethics Advisory Commission essentially equate research on tissue-related data with research on tissues. In both instances there is an attachable patient identifier and thus the potential for human subject risk.

    Performing human subject research regulated by 45CFR46 without IRB (Institutional Review Board) review can get you and your institution into a lot of trouble. In particular, it puts your institution at risk of losing current and future federal research funding.

    The issue of patient confidentiality relates to almost every aspect of anatomic pathology data mining and is perceived by many to be the most formidable obstacle to sharing and publishing data.

    All pathologists interested in Anatomic Pathology Data Mining must be familiar with confidentiality and privacy issues.

    A good place to start is the publication, "Research on Human Specimens," published by NIH and available at:

    http://www-cdp.ims.nci.nih.gov/policy.html

    The act of making large quantities of pathology data confidential, either by anonymization or by deidentification, is itself a complex computational task requiring cryptographic and statistical expertise.

    3.2 Free-text nature of anatomic pathology data

    The free-text nature of anatomic pathology diagnoses immediately separates clinical pathology data mining from anatomic pathology data mining.

    Over the past several decades, there have been numerous attempts to represent free-text data as coded nomenclature entities. These attempts have NOT yielded consistently coded large anatomic pathology datasets that can be merged and analyzed with data mining tools.

    Pathologist-encoded datasets are inconsistent, idiosyncratic, and prone to errors. Studies of coding accuracy show human coding error rates in the range of 10%-15%. These studies have divided manual coding errors into five types:

    (1) Factually correct but unhelpful codes (e.g., coding all benign lesions as "negative for tumor");

    (2) Inconsistent codes (coding "dysplasia" on Monday and "atypia" on Tuesday);

    (3) Idiosyncratic codes (using a mnemonic for a lesion, often inscrutable to other people, such as coding all fungal infections as "fungus ball" under the morphology axis, rather than taking the time to assign a specific code from the infection axis, and then having to remember that the now private code "fungus ball" must be used for any future fungal searches);

    (4) Entry errors (e.g., entering "lipoma" when one intends to enter "lymphoma" and accepting the wrong code matched by the software);

    (5) Incomplete coding due to impatience or laziness.

    The successful translation of anatomic pathology free-text into encoded datasets that accurately retain the meaning of the free-text report is one of the most important tasks of anatomic pathology data mining.

    3.3 There is very little quality control for anatomic pathology data

    The most unforgivable data entry sin is patient misidentification. If a patient is entered separately as Robert and Bob, or if a woman's unique identity is not preserved when she marries and provides a different married name, the dataset becomes essentially unusable. Many of the early laboratory information systems did not have in place safeguards to ensure that patients are entered as unique dataset entities. When a pathology dataset contains merged records from different patients with similar names and multiple records from a person who uses several names, virtually every medical issue in a hospital including patient care, quality control, confidentiality, and billing is impacted negatively.

    The completeness of reports is another recurrent data integrity issue. The most unstructured part of the pathology report, and perhaps the part in greatest need of accurate data mining, is the microscopic diagnosis. At a minimum, each microscopic diagnosis should contain the disease name and body site. At this time, there is virtually no control on data-entry integrity in the microscopic diagnosis, beyond the fastidiousness of the pathologist and the competence of the typist. Sentences may be written with convoluted grammatical constructions, including run-on sentences and vaguely placed negations. There may be spelling errors, idiosyncratic abbreviations, or ambiguous terms.

    There is also the issue of whether reports contain all of the information needed for a data mining effort. If there is prostate cancer, is the diagnosis made on a needle core or on a prostatectomy? What is the Gleason score? Is the Gleason score entered using a consistent format that can be parsed? Is there a descriptor for the amount of tumor in the biopsy? Is this descriptor internally consistent, and is it externally consistent with an authoritative standard for prostate reporting?
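
    Whether a Gleason score "can be parsed" is something a data miner can test directly. The following Python sketch is illustrative only: the regular expression covers a few assumed phrasings (such as "Gleason score 3+4=7"), not an authoritative reporting standard, and the name parse_gleason is hypothetical. It extracts the primary and secondary patterns and checks that any stated sum is internally consistent.

```python
import re

# Assumed phrasings: "Gleason 3+4", "Gleason score 3 + 4 = 7", "Gleason grade: 4+3".
GLEASON = re.compile(
    r"gleason\s*(?:score|sum|grade)?\s*:?\s*(\d)\s*\+\s*(\d)(?:\s*=\s*(\d{1,2}))?",
    re.IGNORECASE,
)


def parse_gleason(text):
    """Return the parsed Gleason patterns, or None if no parsable score is found."""
    match = GLEASON.search(text)
    if match is None:
        return None
    primary, secondary = int(match.group(1)), int(match.group(2))
    stated = int(match.group(3)) if match.group(3) else None
    return {
        "primary": primary,
        "secondary": secondary,
        "total": primary + secondary,
        # Consistent when no sum is stated or the stated sum matches the patterns.
        "consistent": stated is None or stated == primary + secondary,
    }


for line in [
    "Prostate, needle core biopsy: adenocarcinoma, Gleason score 3+4=7",
    "Prostate: adenocarcinoma, Gleason 4 + 3 = 8",
    "Prostate: adenocarcinoma, moderately differentiated",
]:
    print(parse_gleason(line))
```

    Reports that return None or an inconsistent result are candidates for manual review, which gives a rough measure of how much of a dataset is usable as entered.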

    Determining the quality of anatomic pathology data is one of the most important tasks that data miners must perform.

    4.0 Special importance of the data domain: Anatomic Pathology Data is linked to tissues

    The linkage between anatomic pathology data and microscope slides and tissue blocks that contain lesions suitable for further analysis greatly amplifies the importance of anatomic pathology data mining efforts. No other area of medicine can associate data with enormous archives of incarnate research material extending over decades or even centuries.

    The task of the pathology informatician, more often than not, includes the identification and retrieval of archived tissue, selected for specified criteria related to diagnosis, stage of disease, patient demographics, treatment status, and clinical history.

    4.1 Example: Tissue Microarrays

    A tissue microarray is a composite tissue-block, containing as many as one thousand small sections of tissue. Each microscope slide cut from this block can be stained in a single procedure. The advantage of a tissue microarray is that it permits a scientist to perform the equivalent of hundreds of experiments at once, on a single slide, using only a small amount of reagent. Since one tissue microarray block can be used to produce up to several hundred near-identical glass slides, different laboratories can compare their results obtained from slides all obtained from the same tissue-block (i.e., from the same set of tissues).

    Each tissue microarray requires an informatics effort, in which large archives of pathology data are mined to find the tissues and associated data needed for the microarray. Informatics is needed to store and access terabytes of image and other data associated with a single microarray block. Once a tissue microarray is used in an experiment, the subsequent analysis will involve quantitating each microarray spot and comparing the results obtained at different laboratories. The informatics aspects of such an undertaking are enormous and unprecedented in medical research.

    5.0 Creating datasets from Anatomic Pathology Data

    5.1 Mapping free-text to a standard nomenclature

    Widely used medical nomenclatures are: SNOMED, ICD, and UMLS.

    UMLS provides a uniform, integrated distribution format from over fifty biomedical vocabularies and classifications. The 1999 UMLS Metathesaurus contains 625,530 biomedical concepts and 1,362,823 different concept-names. UMLS is updated annually and made available to registered users, with a complete listing of Concept Unique Identifiers (CUIs) and synonym-names. Since the UMLS CUIs are available cost-free to researchers, it makes a lot of sense to develop pathology data mining applications in UMLS.

    As an example from the UMLS nomenclature, excluding upper-case-lower-case repeats, there are 40 terms that map to C0017155, the CUI for HYPERTROPHIC GASTRITIS, as follows: adenopapillomatosis gastrica; chronic hypertrophic gastritis; disease, menetrier; disease, menetrier's; gastric hyperplasia; gastric mucosal hyperplasia; gastric mucosal hypertrophy; gastritides, hypertrophic; gastritis giant hypertrophic; gastritis hypertroph gigantica; gastritis hypertrophic; gastritis hypertrophic gigantica; gastritis hypertrophica gigant; gastritis hypertrophica gigantea; gastritis hypertrophica gigantica; gastritis, giant hypertrophic; gastritis, hypertrophic; gastropathy mucous cell type; giant hypertrophic gastritis; giant rugal gastritis; giant rugal hypertrophy; giant rugal hypertrophy of stomach; giant rugal hypertrophy stom; hyperplastic gastropathy; hyperplastic gastropathy of mucous cell type; hypertrophic bulbous gastritis; hypertrophic gastritides; hypertrophic gastritis; hypertrophic gastropathy; hypertrophic prolif gastritis; hypertrophic proliferative gastritis; hypertrophy, gastric mucosa; hypertrophy, giant rugal; massive hypertrophic gastritis; menetrier disease; menetrier's disease; polypoid swell gastric mucosa; polypoid swelling of gastric mucous membrane; prolif chr hypertr gastritis; proliferative chronic hypertrophic gastritis.

    When a free-text term is parsed and mapped to a UMLS CUI, not only is it transformed into a standard data element, but it becomes retrievable through any of the synonymous terms associated with the unique CUI.
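
    The effect of mapping synonyms to a single CUI can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm used by any particular system: the lookup table holds only a handful of the synonyms listed above (a real table would be built from the UMLS Metathesaurus files), and the names TERM_TO_CUI, normalize, and find_cuis are hypothetical.

```python
import re

# A few of the synonyms listed above for CUI C0017155 (illustrative subset).
TERM_TO_CUI = {
    "hypertrophic gastritis": "C0017155",
    "menetrier's disease": "C0017155",
    "menetrier disease": "C0017155",
    "giant rugal hypertrophy": "C0017155",
    "giant hypertrophic gastritis": "C0017155",
}


def normalize(text):
    """Lower-case the text and reduce punctuation and whitespace to single spaces."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def find_cuis(report_text):
    """Return the set of CUIs whose synonyms occur as whole phrases in the text."""
    padded = f" {normalize(report_text)} "
    return {cui for term, cui in TERM_TO_CUI.items() if f" {term} " in padded}


print(find_cuis("Stomach, biopsy: Menetrier's disease."))
print(find_cuis("STOMACH: GIANT RUGAL HYPERTROPHY"))
```

    Both reports yield the same CUI even though they share no diagnostic words, so a search on either synonym retrieves both. Simple phrase matching of this kind ignores negation and word order, which a working system would have to handle.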

    5.2 Finding common data elements

    The basic data elements of a report need to have a consistent, canonical form. In order for data to be merged between different pathology laboratories, there needs to be agreement on the common data elements included in a report, including their precise definition and the way that they can be used.

    For instance, will a patient's date of birth be labeled birthdate, DOB, D.O.B., birthday, date of birth, date_of_birth, or something else? Will the date be captured as October 17, 2000, 10/17/2000, 10/17/00, 17/10/00, 17/10/2000, 10,17,00, or in some other form?
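
    One way to see what agreement on a common data element involves is to write the normalization down. The Python sketch below is an assumption-laden illustration: the canonical label date_of_birth, the ISO 8601 output form, and the function names are choices made here, not an established standard. Note that a form such as 10/17/00 cannot be told apart from 17/10/00 by inspection alone, so each contributing site has to declare whether it writes the day first.

```python
import re
from datetime import datetime

# Label variants from the text, reduced to lower-case letters and digits only.
FIELD_ALIASES = {
    "birthdate": "date_of_birth",
    "dob": "date_of_birth",
    "birthday": "date_of_birth",
    "dateofbirth": "date_of_birth",
}

# %B matches English month names under the default C/English locale.
MONTH_FIRST = ["%B %d, %Y", "%m/%d/%Y", "%m/%d/%y", "%m,%d,%y"]
DAY_FIRST = ["%B %d, %Y", "%d/%m/%Y", "%d/%m/%y", "%d,%m,%y"]


def canonical_field(label):
    """Map a site-specific label to the agreed element name (KeyError if unknown)."""
    return FIELD_ALIASES[re.sub(r"[^a-z0-9]", "", label.lower())]


def normalize_date(text, day_first=False):
    """Convert a date string to ISO 8601 (YYYY-MM-DD) using the site's convention."""
    for fmt in DAY_FIRST if day_first else MONTH_FIRST:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


print(canonical_field("D.O.B."), normalize_date("10/17/00"))
print(canonical_field("date of birth"), normalize_date("17/10/2000", day_first=True))
```

    Two-digit years remain a hazard: Python's %y maps 00-68 to 2000-2068, which is wrong for most birth dates recorded as, say, 10/17/40. A shared standard would need to require four-digit years.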

    5.3 Finding a common report format

    How will the common data elements be assembled? Will they be assembled in XML format, or HL7, or an XML representation of HL7, or X12, etc? Will HIPAA guidelines tell pathologists how they must report anatomic pathology data?

    If they are assembled in XML, what descriptors will be XML tags, and what descriptors will be attributes within XML tags? What elements will have flexible descriptors (SNOMED or LOINC or UMLS for diagnostic codes)? How will the common data element tags be nested? Who will create the standard DTD (document type definition) for the XML standard? Will LIS (laboratory information system) vendors support the new standards?
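
    The tag-versus-attribute question becomes concrete once a report is actually built. The Python sketch below uses the standard xml.etree.ElementTree module; every tag and attribute name in it (pathology_report, site, diagnosis, nomenclature, code) is hypothetical and is not drawn from any published DTD. Here the diagnosis text is element content, while the choice of nomenclature and the code are attributes, which is one way to allow a flexible descriptor.

```python
import xml.etree.ElementTree as ET

# Build a small report; all tag and attribute names are illustrative assumptions.
report = ET.Element("pathology_report", {"report_type": "surgical"})
ET.SubElement(report, "site").text = "stomach"
diagnosis = ET.SubElement(
    report, "diagnosis", {"nomenclature": "UMLS", "code": "C0017155"}
)
diagnosis.text = "hypertrophic gastritis"

# Serialize to text, as a laboratory would when sending the report.
xml_text = ET.tostring(report, encoding="unicode")
print(xml_text)

# A receiving laboratory parses the text and reads the coded diagnosis.
parsed = ET.fromstring(xml_text)
received = parsed.find("diagnosis")
print(received.get("nomenclature"), received.get("code"), received.text)
```

    A receiver can read the code without knowing anything about the sender's free-text conventions, provided both sides agree on the tag names and nesting. That agreement is exactly what a shared DTD would record.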

    6.0 Data Mining Example: The Johns Hopkins Autopsy Resource

    The Johns Hopkins Autopsy Resource (JHAR) is an Internet website, founded as an institutional database in 1980, and posted publicly in 1995, that lists over 50,000 autopsy facesheets, on patients born over a span of two centuries. An autopsy facesheet is the summary of final diagnoses, typically appearing as the first page in an autopsy report. The JHAR corresponds to an estimated one million tissue blocks, predominantly formalin-fixed and paraffin-embedded, which may be obtained as part of collaborative research investigations.

    Over 1300 publications in scholarly journals have resulted from the cases listed in the JHAR, and all citations, many with PubMed hyperlinks, are available on the JHAR website. Studies based upon data mining the JHAR include case reports, large autopsy case series, and even linguistic studies of medical text.

    Under 45CFR46, a subject must be living in order to be a human subject protected under the Common Rule. Consequently, autopsy data and tissues from deceased patients may be used freely by the research community. Although autopsy datasets are exempted from human subjects regulation, the designers of the Johns Hopkins Autopsy Resource employed a number of novel procedures to protect the anonymity of the deceased patients included in the resource.

    Moore and Berman have demonstrated the use of a doubly-encrypted brokered tissue database, applying it to the Johns Hopkins Autopsy Resource. In this model, providers of pathology data encrypt patient identifiers and send their data to a database administrator. The database administrator then performs a second encryption on the identifiers, before releasing the data to the scientific community. Data miners could freely use the data in such a pathology database, without access to the identity of the patients included in the database. Suppose, for some reason, a data miner needs to collect additional data on certain patient records (e.g., survival or treatment data). The researcher would send a request for additional information to the database administrator, along with the relevant records, each containing doubly-encrypted patient identifiers. The database administrator would perform a single decryption of the patient identifiers, and forward the request to the institution that contributed the data-record. The provider IRB then reviews the request, and determines whether it would be legal and ethical to perform the final decryption step linking the data-record with the patient. This decision might involve obtaining patient consent. If the final decryption is approved, then the patient is identified. The additional information is returned to the database administrator, after tagging the data-record with the re-encrypted identifier. The database administrator then sends the records to the researcher, after tagging the data with the doubly-encrypted identifier. Throughout the process, the database administrator and the researcher never learn the identity of the patient. It is interesting that the field of cryptography provides the solution (brokered double-encryption) to the greatest legal and ethical obstacle to progress in the field of pathology informatics.

    Further, the text included in the autopsy facesheets was translated to UMLS concepts through an algorithm developed by William Moore. The result of this process is that the autopsy report is essentially converted into a text-free document, consisting of demographic elements and diagnostic codes. Removing prose virtually eliminates the possibility of extracting confidential information that might have been inadvertently left in the report (e.g., names of doctors, addresses, hospital services, or dates).

    7.0 The Shared Pathology Informatics Network

    In April 2001, the National Cancer Institute will launch a new initiative, the Shared Pathology Informatics Network. The objective of this initiative is to create a single Web-based system capable of requesting and receiving clinical data related to archived specimens at multiple institutions. Searchable databases with patient data exist at hospitals and medical institutions. The Network will employ available software tools that can facilitate communications among disparate computer systems, even among those that employ different architectures and search strategies. The Shared Pathology Informatics Network will employ Internet protocols to process researcher-initiated queries on multiple institutional databases. Search engines that are integral to the pathology informatics systems at each institution will interrogate the database and produce query replies. The query replies will then be translated into a structured reply format, standardized for all the institutions in the Network. These replies will be merged into a single document that will be sent electronically to the researcher who placed the query.
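
    The query-and-merge cycle described above can be outlined in Python. This sketch is purely illustrative and does not describe the Network's actual design, which had not been built at the time of writing: each institution is modeled as a local function, the records are invented, and the reply fields (site, term, count) and function names are hypothetical.

```python
def make_site(name, records):
    """Build a stand-in for one institution's local search engine."""

    def search(term):
        # Each site searches its own records and returns a structured reply.
        hits = sum(1 for diagnosis in records if term.lower() in diagnosis.lower())
        return {"site": name, "term": term, "count": hits}

    return search


def run_network_query(term, sites):
    """Send one query to every site and merge the replies into one document."""
    replies = [search(term) for search in sites]
    return {
        "term": term,
        "total": sum(reply["count"] for reply in replies),
        "replies": replies,
    }


# Invented records for two hypothetical institutions.
site_a = make_site("Institution A", ["Prostate: adenocarcinoma", "Lung: squamous cell carcinoma"])
site_b = make_site("Institution B", ["Colon: adenocarcinoma", "Prostate: adenocarcinoma", "Skin: nevus"])

merged = run_network_query("adenocarcinoma", [site_a, site_b])
print(merged)
```

    Only summary counts leave each site in this sketch, so the merged document contains no patient-level data.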

    8.0 Conclusion

    Five developments over the past several years have enormously expanded the potential for anatomic pathology data mining:

    8.1 The accumulation of millions of pathology records in electronic form. Approximately 40 million new anatomic pathology reports are created and stored each year in the USA.

    8.2 The emergence of legislative guidelines that ensure patient confidentiality, while establishing ethical and legal paradigms by which researchers can acquire pathology data for legitimate research needs.

    8.3 The availability of comprehensive common medical terminologies (including SNOMED, ICD, Read, and UMLS), to allow translation of diagnostic information into standard coded terms.

    8.4 The availability of standard document structures, such as XML. Electronic reports will contain data, along with information that describes the data, using community-standard data tags (metadata).

    8.5 The availability of Internet technology that allows the rapid and secure transfer of medical data.

    9.0 Useful Resources

    9.1 Anatomic Pathology Data Mining Resource

    http://www.netautopsy.org

    9.2 U. S. Code of Federal Regulations, 45 CFR 46, the so-called Common Rule

    http://www.nih.gov/grants/oprr/humansubjects/45cfr46.htm

    9.3 National Library of Medicine, Unified Medical Language System.

    http://www.nlm.nih.gov/research/umls/

    9.4 U. S. Office for Protection from Research Risks (OPRR).

    http://grants.nih.gov/grants/oprr/oprr.htm

    9.5 Berman, J.J., Moore, G.W. and Hutchins, G.M. 1996. Maintaining patient confidentiality in the public domain Internet Autopsy Database (IAD). Proc AMIA Annu Fall Symp. 328-332.

    9.6 Moore, G.W. and Berman, J.J. 1994. Performance Analysis of Manual and Automated Systematized Nomenclature of Medicine (SNOMED) Coding. Am J Clin Pathol 101:253-256.

    9.7 Schneier, B. 1996. Applied Cryptography, Second Edition: Protocols, Algorithms, and Source Code in C. John Wiley & Sons. Simpson, A. 1996.

    9.8 45 CFR Parts 160 - 164. Standards for Privacy of Individually Identifiable Health Information; Proposed Rule. Department of Health and Human Services. Office of the Secretary. Federal Register. 64:59917-60065.

    http://aspe.hhs.gov/admnsimp/

    9.9 U. S. Health Insurance Portability and Accountability Act. 1996. (HIPAA, Kennedy-Kassebaum Bill, H.R. 3103 of 104th U. S. Congress). U. S. Government Documents at URL: http://thomas.loc.gov

    9.10 U. S. National Bioethics Advisory Commission (NBAC). 1995. Executive Order 12975, October 3, 1995. Federal Register. 60:52063-52065.

    http://bioethics.gov/general.html

    9.11 The Request for Applications (deadline passed July 24, 2000) for the Shared Pathology Informatics Network

    http://grants.nih.gov/grants/guide/rfa-files/RFA-CA-01-006.HTML


