Berman JJ. 
Racing to share pathology data.
Am J Clin Pathol. 2004 Feb;121(2):169-171
PMID: 14983928 
DOI: 10.1309/F7B40JMQ4F8VPDG6. 


Racing to Share Pathology Data 

Jules J. Berman, Ph.D., M.D. 

Cancer Diagnosis Program, National Cancer Institute, National
Institutes of Health, Bethesda, MD 



  This editorial contains the opinions of the author and does not
represent policy of the NIH or of any federal agency. 


  The NIH policy on data sharing took effect October 1, 2003. The
policy specifies that "all investigator-initiated applications with
direct costs greater than $500,000 in any single year will be
expected to address data sharing in their application."1 The reasons
for this policy appeared in the NIH draft statement on data sharing.2
"Sharing data reinforces open scientific inquiry, encourages
diversity of analysis and opinion, promotes new research, makes
possible the testing of new or alternative hypotheses and methods of
analysis, supports studies on data collection methods and
measurement, facilitates the education of new researchers, enables
the exploration of topics not envisioned by the initial
investigators, and permits the creation of new data sets when data
from multiple sources are combined. By avoiding the duplication of
expensive data collection activities, the NIH is able to support more
investigators than it could if similar data had to be collected de
novo by each applicant."2 The NIH policy is similar to journal
publication guidelines recently released by the National Academy of
Sciences and endorsed by many journal editors.3 

  Data sharing policies reflect a new data-centric research paradigm.
In the past, research results were embodied in manuscripts. Primary
research data was considered supplemental information. This situation
has been largely reversed with the emergence of large publicly
available databases. Many research projects explore data collections
that were created to support a wide range of studies. The resulting
manuscripts can be thought of as supplemental works derived from the
[central] data collection. This is the underlying premise for the
human genome project, and is the unstated model for many planned
efforts that draw their strength from large biomedical datasets. 

  Pathologists hold dominion over the greatest medical data set in
existence: the aggregate collection of archived human diseases.
Virtually every Progress Review Group commissioned by the National
Cancer Institute has urged the creation of databases linking
pathology records to archived specimens. The problem
for pathologists is that research conducted using archived specimens
and pre-existing reports needs to be conducted in a way that protects
patients. In this instance, the potential harm to patients is
entirely related to issues of confidentiality and privacy.4 In the
United States, restrictions on the research uses and electronic
transfer of confidential medical records are covered by two Federal
Regulations: The Common Rule (Title 45 Code of Federal Regulations,
Part 46, Protection of Human Subjects) and the Standards for Privacy
of Individually Identifiable Health Information, Final Rule (usually
referred to under the broader act, the Health Insurance Portability
and Accountability Act, HIPAA).5,6 

  Both HIPAA and the Common Rule permit sharing of confidential medical
data when the patient is informed of the research risks and provides
consent. When a research project uses a small number of medical
records, it may be feasible to obtain patient consent for each
record. However, if large numbers of records are needed, the consent
option is impractical. Consenting patients may also be impractical in
cases where all the research objectives cannot be adequately
described in a consent document [as in data mining studies]. In the
absence of patient consent, HIPAA and the Common Rule permit research
on pre-existing records that have been stripped of all information
that could identify the patient. HIPAA and the Common Rule also
permit Institutional Review Boards to waive consent requirements
when the research risks to the patients are negligible and the
benefits to society are large. This low-risk, high-benefit situation
sometimes applies to medical database studies where patient
confidentiality is partly protected, but not to an extent that would
exempt the study from HIPAA or Common Rule regulations.5,6 

  For almost all research conducted on large databases of pre-existing
pathology reports, methods that can reliably deidentify
medical records are essential. Several recent NIH funding
announcements clarify the need for data sharing and indicate that NIH
is interested in supporting efforts to develop data sharing tools.7,8
Latanya Sweeney was an early proponent of technical approaches to
medical record de-identification and has published extensively on the
subject.9 Her work formed the foundation for current multi-step
approaches to de-identification encompassing the following tasks: 

  1. Deidentification of data fields that specifically characterize the
patient (name, social security number, hospital number, address, age,
etc.). 

  2. Free-text data scrubbing, removing identifiers from the textual
portion of medical reports. 

  3. Rendering the data set ambiguous, ensuring that patients cannot be
identified by data records containing a unique set of characterizing
data. 

  4. Free-text data privatizing, removing any information of a private
nature that may be contained within the report. 
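Tasks 1 and 2 can be illustrated with a minimal sketch. The field names, regular expressions, and blocking token below are hypothetical stand-ins chosen for illustration; they are not part of any published method, and a production scrubber would need far more thorough pattern coverage.

```python
import re

# Hypothetical set of identifying fields (task 1); a real system would
# cover the full set of regulated identifiers.
IDENTIFYING_FIELDS = {"name", "ssn", "hospital_number", "address", "age"}

def deidentify_fields(record: dict) -> dict:
    """Task 1: blank out data fields that specifically characterize the patient."""
    return {k: ("***" if k in IDENTIFYING_FIELDS else v)
            for k, v in record.items()}

def scrub_free_text(text: str) -> str:
    """Task 2: remove obvious identifiers from the textual portion of a report."""
    text = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "***", text)                # SSN-like numbers
    text = re.sub(r"\b(Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+\b", "***", text)  # titled names
    return text

record = {"name": "John Doe", "ssn": "123-45-6789",
          "diagnosis": "adenocarcinoma of the colon",
          "report": "Dr. Smith reviewed the slides; SSN 123-45-6789."}
clean = deidentify_fields(record)
clean["report"] = scrub_free_text(clean["report"])
# The diagnostic content survives while the structured and free-text
# identifiers are replaced with a blocking token.
```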

  In the current issue of AJCP, Gupta and colleagues evaluate a
de-identifier designed to automate de-identification tasks 1 and 2.10
In recent papers, I offered computational methods for tasks 2 and 4
and a method for disassembling confidential records into deidentified
parts.11,12 Sweeney has published on methods for task 3.9 
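Task 3 (rendering the data set ambiguous) is commonly formalized in Sweeney's work as k-anonymity: every combination of quasi-identifying attributes must be shared by at least k records, so that no record is unique. The following sketch checks that property; the field names and generalized values are illustrative assumptions, not drawn from the cited papers.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k=2):
    """Return True when every combination of quasi-identifying values
    appears in at least k records, so no patient is singled out."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

# Hypothetical records with values already generalized (truncated ZIP,
# binned year of birth) to increase ambiguity.
records = [
    {"zip": "208xx", "year_of_birth": "196x", "sex": "F"},
    {"zip": "208xx", "year_of_birth": "196x", "sex": "F"},
    {"zip": "209xx", "year_of_birth": "195x", "sex": "M"},
]
is_k_anonymous(records, ["zip", "year_of_birth", "sex"], k=2)
# The third record holds a unique combination, so the set fails the check.
```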

  No single deidentification method is likely to satisfy everyone.
Published articles that describe a specific software product cannot
account for the needs of every type of research effort. The following
performance issues are worth considering: 

  1. Product availability. Is the software product freely available and
open source? Grant applications that propose proprietary data sharing
solutions may receive disparaging reviews in study sections.
Reviewers may expect large, multi-institutional efforts to implement
open source deidentification algorithms. Conversely, proprietary
solutions may be ideal for laboratory personnel who lack the resources
to implement and test published algorithms and who prefer turnkey
solutions. 

  2. Product speed. Is the deidentification process fast? This becomes
important when the research project involves millions of records or
requires reprocessing records to satisfy research objectives that
change over time or that serve different research protocols.
Currently, there are no benchmarks for deidentification speed. 

  3. Product error rate. There is a trade-off between the accurate
preservation of textual information and the successful elimination of
all identifiers. Approaches that yield the lowest error rates involve
replacing free-text with an index of concepts contained in each
report. The Concept-Match technique deletes every word of text with
the exception of high-frequency words (such as and, if, not, when,
etc.) and text phrases that match terms found in a standard
nomenclature.11 This method simultaneously codes and scrubs text but
is limited by the accuracy of the autocoder and produces a text of
reduced readability. A related method is used by data miners to
reduce text to a short concept index that can be used to retrieve
individual chunks of text as needed.13 If the research project
requires the human review of deidentified reports, it may be
necessary to use a deidentification method that preserves as much of
the original text as is feasible. Deidentification methods that
maximize the preservation of original text will tend to have the
highest error rates. 

  4. Product integration and support issues. Will the deidentification
software work with heterogeneous data sources, or is it constrained
to work with a specific data input? Will the software permit an
interface to the researcher's preferred database, or will the
researcher be required to transform the primary data structure to a
secondary data structure? If so, will the secondary data structure
conform to an open standard, or will it be a proprietary data
structure? Will the software be upgraded and will the upgrades be
freely available? Can the software be modified without violating
license agreements? 
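The Concept-Match scrubbing technique discussed under issue 3 can be sketched as follows. The stopword list and nomenclature here are tiny stand-ins for the much larger lists a real implementation would use, and the matching is simplified to single words, whereas the published method matches multiword phrases against a standard nomenclature.

```python
# Stand-in lists for illustration only; a real implementation draws on a
# full stopword list and a standard medical nomenclature.
STOPWORDS = {"and", "if", "not", "when", "the", "a", "of", "with"}
NOMENCLATURE = {"adenocarcinoma", "colon", "margin"}

def concept_match_scrub(text: str, blocking_token: str = "*") -> str:
    """Delete every word except high-frequency stopwords and terms found
    in the nomenclature, replacing deleted words with a blocking token."""
    out = []
    for word in text.split():
        w = word.strip(".,;:").lower()
        out.append(word if w in STOPWORDS or w in NOMENCLATURE else blocking_token)
    return " ".join(out)

concept_match_scrub("Mr. Jones has adenocarcinoma of the colon margin")
# → "* * * adenocarcinoma of the colon margin"
```

Because only listed terms survive, no identifier can leak through; the trade-off, as noted above, is reduced readability of the scrubbed text.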

  Somewhat alarming is the proliferation of patent applications
covering fundamental processes in deidentification and data sharing.
A visit to the U.S. Patent and Trademark Office web site indicates that
deidentification and anonymization protocols are an area of interest
for patent applicants. 

Granted patent: 
6,397,224 Anonymously linking a plurality of data records 

Current patent applications: 
20030204418 Instruments and methods for obtaining informed consent to
genetic tests 
20020169793 Systems and methods for deidentifying entries in a data
20030220927 System and method of de-identifying data 
20030215092 Managing data in compliance with regulated privacy,
security, and electronic transaction standards 
20030084339 Hiding sensitive information 
20030120458 Patient data mining 

  It would be unfortunate if the well-intentioned act of sharing data
engendered legal reprisal. History would suggest that pathologists
tend to abandon methodologies encumbered by patent royalties.14,15
The best way of ensuring an open environment for data sharing is by
providing opportunities for the publication of data sharing methods.
Establishing a wide assortment of published data sharing protocols as
"Prior Art" will make it difficult for inventors to include
fundamental data sharing processes in their patent claims. 


  1. Final NIH Statement on Sharing Research Data. 2003. 

  2. NIH Draft Statement on Sharing Research Data. 2002. 

  3. National Academy of Sciences Report. Sharing Publication-Related
Data and Materials: Responsibilities of Authorship in the Life
Sciences. The National Academies Press, Washington, D.C., 2003. 

  4. Berman, JJ: Confidentiality for Medical Data Miners. Artificial
Intelligence in Medicine 2002, 26:25-36. 

  5. Title 45 CFR (Code of Federal Regulations), Part 46. Protection of
Human Subjects; Common Rule. Federal Register 1991, 56:28003-28032. 

  6. Title 45 CFR (Code of Federal Regulations). Parts 160 and 164.
Standards for privacy of Individually Identifiable Health
Information; Final Rule. Federal Register 2002, 67:53181-53273. 

  7. Tools for collaborations that involve data sharing (PAR-03-134).
June 4, 2003. 

  8. Infrastructure for data sharing and archiving. October 17, 2003. 

  9. Sweeney L. Three computational systems for disclosing medical data
in the year 1999. Medinfo. 1998;9:1124-1129. 

  10. Gupta D, Saul M, Gilbertson J. Evaluation of a De-identification
(De-ID) Software Engine To Share Pathology Reports and Clinical
Documents for Research. Am J Clin Pathol, in press. 

  11. Berman JJ. Concept-match medical data scrubbing. How pathology
text can be used in research. Arch Pathol Lab Med. 2003 127:680-686. 

  12. Berman JJ. Threshold protocol for the exchange of confidential
medical data. BMC Med Res Methodol. 2002;2:12. 

  13. Grivell L. Mining the bibliome: searching for a needle in a
haystack? EMBO Reports 2002, 3:200-203. 

  14. Cho MK, Illangasekare S, Weaver MA, Leonard DGB, Merz JF. Effect
of patents and licenses on the provision of clinical genetic testing
services. J Mol Diag. 2003;5:3-8. 

  15. Merz JF, Kriss AG, Leonard DG, Cho MK. Diagnostic testing fails
the test. Nature, 2002;415:577-579.