Abstract submitted to: The Center for Open Source in Government
Conference: Open Source for National and local eGovernment
Programs in the U.S. and EU
January 2, 2003
Title: Open Source Confidentiality Methods
Scientific progress requires the free exchange of research
data. Because medical research is often conducted using
confidential records, medical researchers have historically
refused to share their primary data, thus denying other
scientists the opportunity to use these data sets for
further research.
Pressured by federal regulations restricting the use of
identified medical records (HIPAA and the Common Rule), and
by recent data-sharing proposals from NIH and from
publishers, researchers have devised a variety of innovative
technical solutions that permit them to obtain
and share large data sets derived from medical records
without breaching patient confidentiality. Some of the
methods used are: one-way hashing of patient identification
fields (such as name and social security number), data scrubbing
(removing private information from free text), threshold
splitting (dividing text into multiple files, any one of which
can be shared and used for scientific purposes without breaching
confidentiality), and data ambiguating (ensuring non-uniqueness
of records). With these methods, large medical data sets can
be used safely for research without obtaining patient consent
and can be shared by the scientific community. These methods
and their available open source implementations will be discussed.
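As a rough illustration of two of the methods named above, the
following Python sketch replaces identification fields with one-way
hash values and flags records that remain unique in a data set.
It is not one of the implementations to be discussed in the talk.
The function names, the field names, and the choice of a keyed
HMAC-SHA-256 hash are all assumptions made for this example; a
secret key is assumed because unkeyed hashes of short identifiers
such as social security numbers can be reversed by exhaustive
guessing.

```python
import hashlib
import hmac
from collections import Counter


def one_way_hash(identifier: str, secret_key: bytes) -> str:
    """Return a hex digest that stands in for an identifier.

    Assumption: a keyed hash (HMAC-SHA-256) is used so that the
    digest cannot be recomputed without the secret key.
    """
    # Normalize case and whitespace so trivially different spellings
    # of the same identifier map to the same digest.
    normalized = " ".join(identifier.lower().split())
    return hmac.new(secret_key, normalized.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def deidentify(record: dict, id_fields: set, secret_key: bytes) -> dict:
    """Copy a record, hashing only the fields named in id_fields."""
    return {
        field: one_way_hash(value, secret_key) if field in id_fields else value
        for field, value in record.items()
    }


def unique_records(records: list, fields: list) -> list:
    """Return the records whose values for `fields` occur exactly once.

    These are the records that an ambiguating step would still need
    to generalize or suppress to ensure non-uniqueness.
    """
    def key(rec):
        return tuple(rec[f] for f in fields)

    counts = Counter(key(rec) for rec in records)
    return [rec for rec in records if counts[key(rec)] == 1]
```

In this sketch the same patient hashes to the same value in every
record, so records can still be linked across files by anyone
holding the hashed data, while the original name or number cannot
be read back from the digest. The uniqueness check only reports
which records stand alone; how to make them ambiguous is left open.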
Speaker Biography:
Jules Berman is a pathologist/Perl programmer and program
director for pathology informatics in the National Cancer
Institute's Cancer Diagnosis Program. For the past decade,
he has been developing ways to organize, index and share
free-text medical data and large heterogeneous biomedical
data sets.
Jules J. Berman, Ph.D., M.D.
Program Director, Pathology Informatics
Cancer Diagnosis Program, DCTD, NCI, NIH