
Berman JJ. Racing to share pathology data. Am J Clin Pathol. 2004 Feb;121(2):169-171 PMID: 14983928 DOI: 10.1309/F7B40JMQ4F8VPDG6.

Editorial

Racing to Share Pathology Data

Jules J. Berman, Ph.D., M.D.
Cancer Diagnosis Program, National Cancer Institute, National Institutes of Health, Bethesda, MD
email: bermanj@mail.nih.gov

Disclaimers: This editorial contains the opinions of the author and does not represent policy of the NIH or of any federal agency.

Article:

The NIH policy on data sharing took effect October 1, 2003. The policy specifies that "all investigator-initiated applications with direct costs greater than $500,000 in any single year will be expected to address data sharing in their application."1 The reasons for this policy appeared in the NIH draft statement on data sharing.2

"Sharing data reinforces open scientific inquiry, encourages diversity of analysis and opinion, promotes new research, makes possible the testing of new or alternative hypotheses and methods of analysis, supports studies on data collection methods and measurement, facilitates the education of new researchers, enables the exploration of topics not envisioned by the initial investigators, and permits the creation of new data sets when data from multiple sources are combined. By avoiding the duplication of expensive data collection activities, the NIH is able to support more investigators than it could if similar data had to be collected de novo by each applicant."2

The NIH policy is similar to journal publication guidelines recently released by the National Academy of Sciences and endorsed by many journal editors.3

Data sharing policies reflect a new data-centric research paradigm. In the past, research results were embodied in manuscripts. Primary research data was considered supplemental information. This situation has been largely reversed with the emergence of large publicly available databases.
Many research projects explore data collections that were created to support a wide range of studies. The resulting manuscripts can be thought of as supplemental works derived from the [central] data collection. This is the underlying premise for the human genome project, and is the unstated model for many planned efforts that draw their strength from large biomedical datasets.

Pathologists hold dominion over the greatest medical data set in existence: the aggregate collection of archived human diseases. Virtually every Progress Review Group commissioned by the National Cancer Institute has urged the creation of databases linking pathology records to archived specimens [http://www.nci.nih.gov/research_programs/priorities/].

The problem for pathologists is that research conducted using archived specimens and pre-existing reports needs to be conducted in a way that protects patients. In this instance, the potential harm to patients is entirely related to issues of confidentiality and privacy.4

In the United States, restrictions on the research uses and electronic transfer of confidential medical records are covered by two federal regulations: the Common Rule (Title 45 Code of Federal Regulations, Part 46, Protection of Human Subjects) and the Standards for Privacy of Individually Identifiable Health Information, Final Rule (usually referred to under the broader act, the Health Insurance Portability and Accountability Act, HIPAA).5,6

Both HIPAA and the Common Rule permit sharing of confidential medical data when the patient is informed of the research risks and provides consent. When a research project uses a small number of medical records, it may be feasible to obtain patient consent for each record. However, if large numbers of records are needed, the consent option is impractical. Obtaining consent may also be impractical in cases where all the research objectives cannot be adequately described in a consent document [as in data mining studies].
In the absence of patient consent, HIPAA and the Common Rule permit research on pre-existing records that have been stripped of all information that could identify the patient. HIPAA and the Common Rule also permit Institutional Review Boards to waive consent requirements when the research risks to the patients are negligible and the benefits to society are large. This low-risk, high-benefit situation sometimes applies to medical database studies where patient confidentiality is partly protected, but not to an extent that would exempt the study from HIPAA or Common Rule regulations.5,6

For almost all research conducted on large databases of pre-existing pathology reports, methods that can reliably deidentify medical records are essential. Several recent NIH funding announcements clarify the need for data sharing and indicate that NIH is interested in supporting efforts to develop data sharing tools.7,8

Latanya Sweeney was an early proponent of technical approaches to medical record de-identification and has published extensively on the subject.9 Her work formed the foundation for current multi-step approaches to de-identification encompassing the following tasks:

1. Deidentification of data fields that specifically characterize the patient (name, social security number, hospital number, address, age, etc.).
2. Free-text data scrubbing: removing identifiers from the textual portion of medical reports.
3. Rendering the data set ambiguous: ensuring that patients cannot be identified by data records containing a unique set of characterizing information.
4. Free-text data privatizing: removing any information of a private nature that may be contained within the report.
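
As a rough illustration of tasks 1 and 3, the following Python sketch removes direct-identifier fields, coarsens age into a band, and then flags any combination of remaining fields that is shared by too few records. The field names, the ten-year age bands, and the threshold k are all assumptions made for this example; the sketch is not drawn from any of the published methods cited in this editorial.

```python
from collections import Counter

# Hypothetical field names; a real pathology database will use its own schema.
DIRECT_IDENTIFIERS = {"name", "ssn", "hospital_number", "address"}
QUASI_IDENTIFIERS = ("age_band", "sex", "diagnosis")


def deidentify_fields(record):
    """Task 1 (sketch): drop direct identifiers and coarsen age to a ten-year band."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "age"
    }
    if "age" in record:
        low = (record["age"] // 10) * 10
        cleaned["age_band"] = f"{low}-{low + 9}"
    return cleaned


def rare_combinations(records, keys=QUASI_IDENTIFIERS, k=2):
    """Task 3 (sketch): return field-value combinations shared by fewer than k records."""
    counts = Counter(tuple(r.get(key) for key in keys) for r in records)
    return {combo for combo, n in counts.items() if n < k}


# Invented example records, for illustration only.
records = [
    {"name": "A", "ssn": "000-00-0001", "age": 63, "sex": "F", "diagnosis": "adenocarcinoma"},
    {"name": "B", "ssn": "000-00-0002", "age": 67, "sex": "F", "diagnosis": "adenocarcinoma"},
    {"name": "C", "ssn": "000-00-0003", "age": 41, "sex": "M", "diagnosis": "chordoma"},
]

deidentified = [deidentify_fields(r) for r in records]
rare = rare_combinations(deidentified, k=2)
```

In this toy data set, the third record is the only one with its combination of age band, sex, and diagnosis, so it is flagged. A record flagged in this way would need further generalization or suppression before release. Free-text scrubbing (tasks 2 and 4) is a separate problem that this sketch does not address.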
In the current issue of AJCP, Gupta and colleagues evaluate a de-identifier designed to automate de-identification tasks 1 and 2.10 In recent papers, I offered computational methods for tasks 2 and 4 and a method for disassembling confidential records into deidentified parts.11,12 Sweeney has published on methods for task 3.9

No single deidentification method is likely to satisfy everyone. Published articles that describe a specific software product cannot account for the needs of every type of research effort. The following performance issues are worth considering:

1. Product availability. Is the software product freely available and open source? Grant applications that propose proprietary data sharing solutions may receive disparaging reviews in study sections. Reviewers may expect large, multi-institutional efforts to implement open source deidentification algorithms. Conversely, proprietary solutions may be ideal for laboratory personnel who lack the resources to implement and test published algorithms and who prefer turnkey applications.

2. Product speed. Is the deidentification process fast? This becomes important when the research project involves millions of records or requires reprocessing records to satisfy research objectives that change over time or that serve different research protocols. Currently, there are no benchmarks for deidentification speed.

3. Product error rate. There is a trade-off between the accurate preservation of textual information and the successful elimination of all identifiers. Approaches that yield the lowest error rates involve replacing free-text with an index of concepts contained in each report. The Concept-Match technique deletes every word of text with the exception of high-frequency words (such as and, if, not, when, etc.) and text phrases that match terms found in a standard nomenclature.11 This method simultaneously codes and scrubs text but is limited by the accuracy of the autocoder and produces a text of reduced readability. A related method is used by data miners to reduce text to a short concept index that can be used to retrieve individual chunks of text as needed.13 If the research project requires the human review of deidentified reports, it may be necessary to use a deidentification method that preserves as much of the original text as is feasible. Deidentification methods that maximize the preservation of original text will tend to have the highest error rates.

4. Product integration and support issues. Will the deidentification software work with heterogeneous data sources, or is it constrained to work with a specific data input? Will the software permit an interface to the researcher's preferred database, or will the researcher be required to transform the primary data structure to a secondary data structure? If so, will the secondary data structure conform to an open standard, or will it be a proprietary data structure? Will the software be upgraded and will the upgrades be freely available? Can the software be modified without violating license agreements?

Somewhat alarming is the proliferation of patent applications covering fundamental processes in deidentification and data sharing. A visit to the U.S. Patent and Trademark Office web site (http://www.uspto.gov/patft/index.html) indicates that deidentification and anonymization protocols are an area of interest for patent applicants.

Granted patent:

6,397,224 Anonymously linking a plurality of data records

Current patent applications:

0030204418 Instruments and methods for obtaining informed consent to genetic tests
20020169793 Systems and methods for deidentifying entries in a data source
20030220927 System and method of de-identifying data
20030215092 Managing data in compliance with regulated privacy, security, and electronic transaction standards
20030084339 Hiding sensitive information
20030120458 Patient data mining

It would be unfortunate if the well-intentioned act of sharing data engendered legal reprisal. History would suggest that pathologists tend to abandon methodologies encumbered by patent royalties.14,15 The best way of ensuring an open environment for data sharing is by providing opportunities for the publication of data sharing methods. Establishing a wide assortment of published data sharing protocols as "Prior Art" will make it difficult for inventors to include fundamental data sharing processes in their patent claims.

References

1. Final NIH Statement on Sharing Research Data. 2003. http://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html
2. NIH Draft Statement on Sharing Research Data. 2002. http://grants1.nih.gov/grants/guide/notice-files/NOT-OD-02-035.html
3. National Academy of Sciences Report. Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences. The National Academies Press, Washington, D.C., 2003.
4. Berman JJ. Confidentiality for Medical Data Miners. Artificial Intelligence in Medicine. 2002;26:25-36.
5. Title 45 CFR (Code of Federal Regulations), Part 46. Protection of Human Subjects; Common Rule. Federal Register 1991, 56:28003-28032. http://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html
6. Title 45 CFR (Code of Federal Regulations), Parts 160 and 164. Standards for Privacy of Individually Identifiable Health Information; Final Rule. Federal Register 2002, 67:53181-53273.
7. Tools for collaborations that involve data sharing (PAR-03-134). June 4, 2003. http://grants.nih.gov/grants/guide/pa-files/PAR-03-134.html
8. Infrastructure for data sharing and archiving. October 17, 2003. http://grants.nih.gov/grants/guide/rfa-files/RFA-HD-03-032.html
9. Sweeney L. Three computational systems for disclosing medical data in the year 1999. Medinfo. 1998;9:1124-1129.
10. Gupta D, Saul M, Gilbertson J. Evaluation of a De-identification (De-ID) Software Engine To Share Pathology Reports and Clinical Documents for Research. Am J Clin Pathol, in press.
11. Berman JJ. Concept-match medical data scrubbing. How pathology text can be used in research. Arch Pathol Lab Med. 2003;127:680-686.
12. Berman JJ. Threshold protocol for the exchange of confidential medical data. BMC Med Res Methodol. 2002;2:12.
13. Grivell L. Mining the bibliome: searching for a needle in a haystack? EMBO Reports. 2002;3:200-203.
14. Cho MK, Illangasekare S, Weaver MA, Leonard DGB, Merz JF. Effect of patents and licenses on the provision of clinical genetic testing services. J Mol Diag. 2003;5:3-8.
15. Merz JF, Kriss AG, Leonard DG, Cho MK. Diagnostic testing fails the test. Nature. 2002;415:577-579.