Books by Jules J. Berman, covers

Demonstration of a short, quick, and accurate medical scrubber:
Scrubbed output of a public domain text

In the field of biomedical informatics, the term "scrubbing" refers to removing patient identifiers from confidential medical records. Once records have been completely de-identified, with all links to the patient gone, they can be used for research purposes without acquiring informed consent from patients and without violating HIPAA or the Common Rule (in the U.S.). Most countries have similar regulations that permit the unconsented use of de-identified patient records for medical research purposes.

The scrubber uses the doublet method, described in two of my previously published books:

Perl Programming for Medicine and Biology

and

Ruby Programming for Medicine and Biology

A public domain list of doublets is available, but I cannot guarantee that the list is identifier-free or that it is the best list for your purposes. Feel free to modify the list, add to the list, or create your own list of identifier-free doublets. The list of doublets (minus the HTML footer and header) is used as an external text file, doublets.txt, in the script below.
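As a rough sketch of how a custom doublet list might be built, the Perl below extracts every unique pair of adjacent words from a text that you have already judged to be identifier-free. The function name doublets_from_text and the sample sentence are assumptions made for illustration; the word normalization (lowercase, keep only a-z, apostrophe and hyphen) copies the one used in the scrubber script on this page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the sorted, unique doublets (pairs of adjacent words) in $text.
# A token that is empty after normalization (a number, for example)
# breaks the chain, so no doublet is formed across it.
sub doublets_from_text {
    my ($text) = @_;
    my %seen;
    my $prev = "";
    foreach my $token (split ' ', $text) {
        my $word = lc $token;
        $word =~ tr/a-z'\-//cd;    # same normalization as the scrubber
        $seen{"$prev $word"} = 1 if $prev ne "" && $word ne "";
        $prev = $word;
    }
    return sort keys %seen;
}

# Hypothetical sample text, assumed to contain no identifiers.
my @list = doublets_from_text("The placenta does not form a barrier.");
print "$_\n" foreach @list;
```

Redirecting the printed lines to a file gives one doublet per line, which is the layout the scrubber expects in doublets.txt. The result is only as safe as the text it was built from.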

The doublet scrubber is small (just a few dozen lines of code) and fast, scrubbing about 1 Megabyte of text per second on my home computer (2.8 GHz, 512 MByte RAM). At this speed, a 1 GByte file can be parsed in about 17 minutes, and a 1 Terabyte file in about 12 days. Large hospitals produce about 1 Terabyte of data each week, so two modest desktop computers sharing the load can, for now, "keep up" with the vast load of data produced by many hospitals.
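Estimates of this kind are simple division, and they are easy to redo for your own hardware. The sketch below assumes decimal units (1 GByte = 10^9 bytes, 1 Terabyte = 10^12 bytes) and a constant scrubbing rate; the function name scrub_seconds is hypothetical, and the rate should be replaced with one you have measured.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Seconds needed to scrub $bytes bytes at a constant $rate (bytes/second).
sub scrub_seconds {
    my ($bytes, $rate) = @_;
    return $bytes / $rate;
}

my $rate = 1_000_000;    # assumed rate: 1 Megabyte per second

printf "1 GByte:    %.1f minutes\n", scrub_seconds(1e9,  $rate) / 60;
printf "1 Terabyte: %.1f days\n",    scrub_seconds(1e12, $rate) / 86_400;
```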

The only problem that I have found with the doublet scrubber is that it scrubs too much, blocking all doublets not found in the external doublet list. You can be the judge. The output attached here can be used to assess the effectiveness of the doublet method of text scrubbing.
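The blocking rule can be illustrated on a single sentence. The following is a minimal sketch, not the full script: a word is kept only if it forms an approved doublet with the word before it or the word after it, and every other word becomes an asterisk. The four-entry approved list, the function name scrub_sentence, and the name "Dr. Smith" are made up for the example. Unlike the full script, the sketch does not carry over the trailing punctuation of blocked words.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical approved list; a real run would load doublets.txt instead.
my %approved = map { $_ => 1 }
    ("the placenta", "placenta shows", "a real", "real affinity");

# Replace every word that is not part of an approved doublet with "*".
sub scrub_sentence {
    my ($sentence, $approved) = @_;
    my @original = split ' ', $sentence;
    # Normalize each word: lowercase, keep only a-z, apostrophe, hyphen.
    my @norm = map { my $w = lc $_; $w =~ tr/a-z'\-//cd; $w } @original;
    my @keep = (0) x @original;
    for my $i (0 .. $#norm - 1) {
        if (exists $approved->{"$norm[$i] $norm[$i + 1]"}) {
            $keep[$i] = $keep[$i + 1] = 1;    # both members are safe
        }
    }
    return join ' ', map { $keep[$_] ? $original[$_] : '*' } 0 .. $#original;
}

print scrub_sentence("The placenta shows Dr. Smith a real affinity", \%approved), "\n";
# prints: The placenta shows * * a real affinity
```

The name is blocked not because the program recognizes it as a name, but because "shows dr", "dr smith" and "smith a" are absent from the approved list. The same rule blocks innocent words whose doublets happen to be missing.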

To demonstrate the versatility of the doublet method, and to serve as a source of comparison with other de-identifiers, I downloaded a public domain book from Project Gutenberg and posted the de-identified output of the entire book at the following URL:

http://www.julesberman.info/aacom10.htm

Project Gutenberg is a remarkable resource that publishes plain-text versions of literary gems that have passed out of copyright. I used Anomalies and Curiosities of Medicine by George M. Gould and Walter Lytle Pyle. This book has lots of medical terminology and vaguely resembles the kind of text that might be included in a pathology report. Anyone can download the same text from:

http://www.gutenberg.org/etext/747

An example output paragraph is shown below. As expected with the doublet method, there are many blocked words. This is a limitation of the method: if you use the standard list of doublets on any random book, you're bound to block some innocent doublets that weren't included in the "approved" list. The only way around this limitation is to add safe doublets (from the text) to the "approved" list.
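One way to find candidate doublets to add is to count the doublets in the text that are missing from the approved list and review the most frequent ones by hand. The sketch below assumes this workflow; the function name unapproved_doublets, the two-entry approved list and the sample text are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the doublets in $text that are absent from the approved list.
# Returns a hash reference mapping each unapproved doublet to its count.
sub unapproved_doublets {
    my ($text, $approved) = @_;
    my %count;
    my $prev = "";
    foreach my $token (split ' ', $text) {
        my $word = lc $token;
        $word =~ tr/a-z'\-//cd;    # same normalization as the scrubber
        if ($prev ne "" && $word ne "") {
            my $doublet = "$prev $word";
            $count{$doublet}++ unless exists $approved->{$doublet};
        }
        $prev = $word;
    }
    return \%count;
}

# Hypothetical approved list and sample text, for illustration only.
my %approved = map { $_ => 1 } ("the placenta", "of the");
my $text = "Lesions of the fetal liver. The fetal liver of the fetus.";
my $candidates = unapproved_doublets($text, \%approved);

# Most frequent candidates first; ties are broken alphabetically.
foreach my $d (sort { $candidates->{$b} <=> $candidates->{$a} || $a cmp $b }
               keys %$candidates) {
    print "$candidates->{$d}\t$d\n";
}
```

A frequent doublet is not automatically a safe one (a surname can recur many times in one document), so each candidate still needs a human decision before it goes into the approved list.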


In this important *, *, * * some historical *, describes a long series of experiments performed on * in order to * the passage of *, *, *, *, *, *, * * the placenta. The placenta shows a real affinity for * substances; in it * copper and mercury, but *, and it is therefore * it that the * * *; in addition to its *, intestinal, and *, * * glycogen and acts as an * *, and so resembles in its action the liver; * * of the fetus * only a potential *. * up of * in the placenta is not so general as * of them in the liver of the mother. It may be * the placenta does not form a barrier to the passage of * the circulation of the fetus; this would seem to * * *, which was always found in the * never in the fetal organs. In * * lead and * accumulation of the * in the fetal tissues is * in the maternal, perhaps from differences in * * or from greater diffusion. * it is * * barrier to the passage of *, * * * * degree of obstruction: it allows copper and * * *, * with greater difficulty. The * toxic substances in the fetus does not follow the same * * the adult. They * more widely in the fetus. In the * liver is the chief * *. *, which in * * to accumulate in the liver, is in the fetus * in the skin; copper accumulates in the fetal liver, * system, and sometimes in the skin; * which is * in the maternal liver, but also in the skin, has * in the skin, liver, * centers, and elsewhere * *. The frequent presence of * in the fetal * its physiologic importance. It has probably not * * influence on its *. On the * in the placenta and nerve * * * * abortion and the birth of dead *) Copper and lead did not cause *, * * so in two out of six *. Arsenic is a * agent in the *, * * * * *. An important * is that * * is frequently and seriously affected in syphilis, * * the special * for the accumulation of *. * * * * * action in this disease? The * of lead in the central nervous system of the * the frequency and serious character of * lesions. 
The presence of * in the * * * an explanation of the therapeutic results of * of this substance in skin *.
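To put a number on how heavily a paragraph like the one above was blocked, one rough measure is the fraction of tokens that are asterisks. The sketch below assumes that a blocked token is a single "*" optionally followed by punctuation, which matches the output format shown here; the function name blocked_fraction is hypothetical, and the sample string is copied from the paragraph above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fraction of whitespace-separated tokens that are blocked, i.e. that
# consist of "*" optionally followed by punctuation characters.
sub blocked_fraction {
    my ($text) = @_;
    my @tokens = split ' ', $text;
    return 0 unless @tokens;
    my $blocked = grep { /^\*[[:punct:]]*$/ } @tokens;
    return $blocked / @tokens;
}

my $sample = "The placenta shows a real affinity for * substances; "
           . "in it * copper and mercury, but *, and";
printf "%.0f%% of tokens blocked\n", 100 * blocked_fraction($sample);
```

Running the same function before and after adding safe doublets to the approved list shows how much the over-scrubbing has been reduced.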


Output of the scrubber on a collection of 15000 PubMed Citations is available at:
http://www.julesberman.info/pathol5.htm

and (in a script that preserves punctuation)
http://www.julesberman.info/pathol6.htm

In my experience, the doublet method is virtually perfect: I have never encountered a missed identifier in text scrubbed by the doublet method. If you find any identifiers in the de-identified book, please let me know. Finally, the doublet method is simple. The Perl script that I used to scrub the book is shown below, in its entirety. In the script, "aacom10.txt" is the Project Gutenberg file for Anomalies and Curiosities of Medicine.

As with all my distributed scripts, the following disclaimer applies:

The Perl script for deidentifying text using the doublet method is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

The script is distributed under the GNU General Public License (http://www.gnu.org/licenses/gpl-3.0.txt).

#!/usr/local/bin/perl
$begin = time();
open(TEXT,"doublets.txt")||die"cannot open doublets.txt";
# Load each approved doublet (one per line) as a key of %doublethash.
while ($getline = <TEXT>)
  {
  $getline =~ s/\r?\n$//;
  next if ($getline eq "");
  $doublethash{$getline}= "";
  }
$end = time();
$totaltime = $end - $begin;
print STDERR "Time to create ";
print STDERR "the doublet hash is ";
print STDERR "$totaltime seconds.\n\n";
close TEXT;
$begin = time();
$/ = "\n\n";
open(TEXT,"aacom10.txt")||die"cannot";
open(STDOUT,">aacom10.out")||die"cannot";
$line = " "; $oldthing = ""; $state = 0;
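while ($line = <TEXT>)
   {
   # Each record is one paragraph, because $/ is a blank line.
   next if ($line =~ /^\s*$/);
   #print "Original - $line" . "Scrubbed - " ;
   $line =~ s/\s+$//;
   # Turn every internal newline into a space, so that words on
   # adjacent lines of the paragraph are split apart correctly.
   $line =~ s/\s+/ /g;
   my @linearray = split(/ +/,$line);
   # "lastword" is a sentinel that flushes the final word of the paragraph.
   push (@linearray, "lastword");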
   foreach $thing (@linearray)
     {
     $originalthing = $thing;
     $thing = lc($thing);
     $thing =~ tr/a-z\'\-//cd;
     if ($oldthing eq "")
        {
        $oldthing = $thing;
        $originaloldthing = $originalthing;
        next;
        }
     $term = "$oldthing $thing";
     if (exists($doublethash{$term}))
        {
        print "$originaloldthing ";
        $oldthing = $thing;
        $originaloldthing = $originalthing;
        $state = 1;
        next;
        }
     if ($state == 1)
        {
        if ($thing eq "lastword")
          {
          print $originaloldthing;
          print "\n\n";
          $oldthing = "";
          $state = 0;
          next;
          }
        print "$originaloldthing ";
        $oldthing = $thing;
        $originaloldthing = $originalthing;
        $state = 0;
        next;
        }
     if ($state == 0)
        {
        if ($thing eq "lastword")
          {
          print "\*\.\n\n";
          $oldthing = "";
          next;
          }
        $punctuation = substr($originaloldthing,-1,1);
        if ($punctuation =~ /[a-zA-Z0-9]/)
           {
           $punctuation = "";
           }
        print "\*" . "$punctuation ";
        $oldthing = $thing;
        $originaloldthing = $originalthing;
        next;
        }
     }
   }
$end = time();
$totaltime = $end - $begin;
print STDERR "Time following ";
print STDERR "doublet hash creation";
print STDERR " is $totaltime seconds.";
exit;




key words: scrubber, medical scrubber, medical record de-identification, medical record deidentification, medical scrubbing

Last modified April 7, 2014
