biomedical informatics cover Perl Programming for Medicine and Biology Cover

Biomedical Informatics books
by Jules J. Berman

  • Jones & Bartlett sales and informational website for Biomedical Informatics
  • Amazon.com U.S. book site for Biomedical Informatics
  • Full Table of Contents from Library of Congress for Biomedical Informatics
  • List of book-related resources
  • Brief author biography on Association for Pathology Informatics Website
  • Quick link to PubMed listing for Jules J. Berman
  • Full list of Publications for Jules J. Berman
  • Dr. Bruce Friedman's review of Biomedical Informatics
  • Perl Programming for Medicine and Biology companion site.
  • Author's blog on data specifications
  • Contact author



  • Writing Simple Perl Scripts to Create, Convert and Analyze XML Documents


    Presented for:

    APIII - Advancing Practice Instruction and Innovation through Informatics

    Marriott City Center, Pittsburgh, PA

    Friday, October 10, 2003

    Session E2

    Perl and Python Programming Workshop

    Session Organizers: Jules Berman and Jim Harrison


    Jules J. Berman, Ph.D., M.D.*

    Program Director for Pathology Informatics

    Cancer Diagnosis Program

    National Cancer Institute

    National Institutes of Health

    Rockville, MD


    *The contents of this manuscript are free of copyright restrictions and are entered into the public domain.  The software is provided "as is" and has no implied or expressed warranties.  The author is not responsible for any uses or abuses of the software.  All statements in the text are the opinion of the author and do not represent NIH policy.


    Introduction

    Everyone  has heard of XML (eXtensible Markup Language) and knows something about what it is and how it can be used.  XML is one of those useful technologies that can be used on a very simple level or on a very complex level:


    Simple XML

    tags

    enclosed data elements

    nothing else


    Complex XML

    Lots of markup elements (e.g., entities, notation)

    Defined content rules (e.g., DTDs and Schemas)

    XML-specific transformation/Query languages (e.g., XSLT, XQuery)

    Document-Specific presentation methods (e.g., CSS)

    XML-specific parsers and data organizers (e.g., DOM)


    Professional programmers may be well-advised to learn the complexities of XML.  However, many people can get all they need from XML using a few very simple Perl scripts that handle the most common tasks encountered by anyone who uses XML.


    These common tasks are:


    1. Converting an Excel file to an XML file.


    2. Converting an HTML file to an XML file


    3. Converting an XML file to an HTML file (e.g. making an XML file presentable while preserving its information content)


    4. Converting an XML file to a different data structure (e.g. moving XML into a standard database)


    5. Querying an XML file


    6. Querying multiple XML files for related information


    The purpose of the workshop is to teach participants how to use Perl to perform these six tasks.  Simple example files will be used, and the Perl scripts used in the workshop will be available at the following URL:


    http://65.222.228.150/jjb/tutor2.htm


    This web page, entitled "Perl tutorial for pathologists and laboratory scientists," contains more than 700  factoids containing a Perl script or a useful Perl hint (see Table 1).  Perl is a free, open-source programming language that will work on almost any operating system.  That means that the same Perl script that works on a Windows machine will also work on a Linux machine or a Macintosh machine (as long as you don't corrupt your scripts with platform-specific instructions).  The Perl tutorial contains information on how to obtain and install Perl.


    During the workshop, participants will be given a short introduction to Perl, so that they can follow the example scripts from the workshop. The Perl scripts used in workshop can be accessed from the website by querying on the term XML in the searchbox at the top of the webpage.


    The purpose of this paper is to provide a short explanation of XML and to provide an appendix containing  several scripts that the more advanced workshop participant my study at their leisure. The long scripts contained in this text will not be covered, in detail, at the workshop, but will be available in the online Perl tutorial (above).


    What is XML?


    XML is an annotation method that permits you to enclose information in tags that describe the information. Tags occur in pairs, each pair enclosed by angle brackets


    For instance, Brigitte Bardot's birthdate:


    <birthday>September 28, 1950</birthday> is an example of how XML tags can flank a piece of data and provide useful information about the enclosed data.


    XML permits you to organize and atomize data to enhance its meaning and the depth of its relationships to other data elements.


    For instance, more specific information might be contained in the following;


    <birthdate>

    <date>

    <month>September</month>

    <day>28</day>

    <year>1934</year>

    </date>

    <person>

    <first_name>Brigitte</first_name>

    <last_name>Bardot</last_name>

    </person>

    </birthdate>


    Notice that every beginning tag (<something>) has an end-tag marked with a slash (</something>). 


    What is the purpose of XML?

    XML exists so that data in one dataset can be related to data from another dataset.  The key to all this is the well-defined XML tag, which imposes a consistency on data elements and a persistent relationship between different XML tags.  Properly executed, data elements in heterogeneous XML documents can be merged, transformed, and analyzed with ease.


    What is a proper XML tag?

    Since tags are such an important part of XML, people who create XML files should know that not everything enclosed by angle brackets is a proper XML tag.  A valid XML tag must obey the following rules:

    1. Must begin with an underscore or a letter

    2. May contain letters, digits, periods, hyphens, underscores and

    colons, but no other types of characters

    3. Can not contain white space (spaces, tabs, return character)

    4. Case sensitive


    The rules for XML tags are discussed thoroughly in White C, Quin L, Burman L. Mastering XML: Premium edition. Sybex, San Francisco, 2001, p. 44


    I wrote this simple Perl script to check whether a given tag complies with the structural rules for XML tags.  Regarding the colon, I've added one rule of my own.  The colon should only be used as a separator between a namespace token and an XML tag (e.g. dc:title).  There's never any non-confusing reason to include two colons in a single tag name, so I've disallowed the use of two

    or more colons. In the tagcheck.pl perl script, the @elements array contains various terms, some of which could be used in valid tags, others of which could not. The script tells you which are good and which are bad.  In order to understand the script, you need to learn something about Perl regular expressions.  The confusing regular expression in the script is:


    $value =~ /^[a-z\_]{1}[a-z0-9\-\.\_\:]*[a-z0-9\-\.\_]+$/i


    Otherwise, the script has the kind of structure and syntax you might expect in many other languages (C, Python, Ruby, Basic).  I will explain a little bit about Perl regular expressions in the workshop, but eager students should read a good Perl book or study some of the examples in the online Perl Tutorial.


       #!/usr/local/bin/perl

       #tagcheck.pl

       #This will check whether a given xml tag or attribute

       #is proper markup.

       @elements = qw (gene 4gene gene:ncbi gene-autry ge::ne

                       gene&autry -gene _gene gene- gene:

                       :gene ge:n:e  ge:ne:  ge,ne ge.ne);

       foreach $value (@elements)

         {

         if ($value =~ /^[a-z\_]{1}[a-z0-9\-\.\_\:]*[a-z0-9\-\.\_]+$/i)

           {

           if ($value =~ /\:.*\:/)  #can't contain two colons

             {

             print "$value is bad\n";

             }

           else

             {

             print "$value is good\n";

             }

           }

         else

           {

           print "$value is bad\n";

           }

         }


    What is a well-formed XML document?

    A well-formed XML document is a collection of nested valid tags and text.  The rules of nesting are simple: 

    1. Every XML document must have a root element marking the beginning and end of the document's tags.

    2. Every begin-tag must have an end-tag

    3. A tag that is enclosed by another tag must have its end-tag enclosed by the other tag's end-tag.

    e.g  <report><accession_number></accession_number></report> is ok, but

    <report><accession_number></report></accession_number>

    These three rules bring an order and sensibility to XML that is lacking in HTML.  HTML can be written as a well-formed document, but the common use of begin-tags that have no end-tag (e.g. <br>) and the use of overlapped tag-pairs that do not follow simple nesting rules (e.g. <center><b>Hello</center></b>) make it difficult to develop a general tool for HTML to XML conversions.

    However, when an HTML file is well-formed, it is simple to substitute XML tags when you wish to have an XML document.


    <body>

    <h1>hello world</hello world>

    </body>


    A simple Perl script will substitute HTML tags with XML tags.  In this case, the <body> tag is changed to a <document> tag, and the <h1> tag is exchanged for the <sentence> tag.  I could have invented any XML tags for this purpose.  The lesson is that well-formed HTML can be converted to XML when you have a named tag-equivalent for the HTML markup.

    The script below will be reviewed in the workshop.  By modifying this script, anyone will be able to interconvert well-formed HTML and XML files.  


    #!/usr/bin/perl

    #html2xml

    open (TEXT, "html.htm")||die"Cannot";

    $line = " ";

    %dictionary = ("body" => "document", "h1" => "sentence");

    @keysarray = keys(%dictionary);

    while ($line ne "")

      {

      $line = <TEXT>;

      foreach $key (@keysarray)

        {

        $line =~ s/(\<[\/]?)$key/$1$dictionary{$key}/g;

        }

      print $line;

      }

    exit;


    The output of the script is:


    c:\ftp>perl html2xml.pl

    <document>

    <sentence>hello world</hello world>

    </document>


    Using the same strategy, it's very easy to design a Perl script that will convert a well-formed XML document to HTML. 


    Attributes


    In addition, tags may contain attributes.  Attribute are name-value pairs appearing in the begin-tag, with the value item enclosed in quotes.

    <pathology value="immense">

    Attributes will only be discussed briefly in the workshop.  Keep in mind that any attribute(s) can be converted to regular XML tags.


    <pathology>

    <pathology_value>immense</pathology_value>

    </pathology>


    What is a valid XML document?


    A valid XML document has two properties.


    1. It is a well-formed XML document

    2.It conforms to a DTD or Schema that specifies the structure of the XML document










    Appendix


    Appendix 1. Creating an XML file from flat-file data records


    The latest version of the complete dataset file is:

    PRECANLN .TXT       319,030  3-03-02


    Records in the file look like:


    C0009324|DUT|P|L2054117|PF|S2392339|Colitis ulcerosa|3|

    C0009324|DUT|S|L2072717|PF|S2410939|Proctocolitis, hemorragische|3|

    C0009324|DUT|S|L2072718|PF|S2410940|Proctocolitis, ulceratieve|3|

    C0009324|DUT|S|L2073703|PF|S2411925|Rectocolitis, hemorragische|3|

    C0009324|DUT|S|L2073704|PF|S2411926|Rectocolitis, ulceratieve|3|

    C0009324|ENG|P|L0009324|PF|S0026432|Colitis, Ulcerative|0|

    C0009324|ENG|P|L0009324|VCW|S0000599|Ulcerative colitis|0|

    C0009324|ENG|P|L0009324|VCW|S0410942|ULCERATIVE COLITIS|0|

    C0009324|ENG|P|L0009324|VCW|S0427016|ulcerative colitis|0|

    C0009324|ENG|P|L0009324|VC|S0255261|COLITIS ULCERATIVE|0|

    C0009324|ENG|P|L0009324|VC|S0361286|COLITIS, ULCERATIVE|0|

    C0009324|ENG|P|L0009324|VC|S0363932|Colitis, ulcerative|3|

    C0009324|ENG|P|L0009324|VC|S1632158|Colitis ulcerative|3|

    C0009324|ENG|P|L0009324|VC|S1913610|Colitis;ulcerative|3|

    C0009324|ENG|P|L0009324|VO|S1052582|Ulcerative colitis, NOS|3|

    C0009324|ENG|P|L0009324|VP|S0026424|Colitides, Ulcerative|0|

    C0009324|ENG|P|L0009324|VWP|S0096521|Ulcerative Colitides|0|

    C0009324|ENG|P|L0009324|VW|S0411375|Ulcerative Colitis|0|



    The file that has the class assignments for all the different

    cuis in the precanln.txt file is:

    PRESHORT .TXT        26,851  3-03-02


    Records in the file look like this:


    4   =  C0001175 =  AIDS

    3   =  C0001436 =  Adenomatosis

    2   =  C0001815 =  mmm

    2   =  C0002893 =  Erythrodysplasia

    2   =  C0002894 =  RAEB

    4   =  C0004763 =  Barrett Syndrome

    1   =  C0006079 =  Bowen Disease

    9   =  C0007099 =  in situ cancer

    1   =  C0007124 =  DCIS

    4   =  C0007570 =  Sprue

    1   =  C0007868 =  Cervix Dysplasia

    4   =  C0009324 =  uc

    5   =  C0012895 =  HPV DNA Probes

    1   =  C0013403 =  B K Mole Syndrome

    3   =  C0014084 =  Ollier Disease

    5   =  C0014644 =  EBV

    3   =  C0015565 =  FAMILIAL ATYPICAL MULTIPLE MOLE

    5   =  C0015624 =  SPONASTRIME

    3   =  C0017531 =  GLNH

    5   =  C0019196 =  HEP C

    5   =  C0019682 =  HIV


    The script that takes the preshort.txt file and precanln.txt file

    and creates an xml file is:

    PRECXML  .PL          3,062  3-04-02


    #/usr/local/bin/perl

    #prexml.pl

    $type0 = "\<precancer_class\>Precancer superclass and associated modifiers\<\/precancer_class\>";

    $type1 = "\<precancer_class\>Acquired microscopic precancers\<\/precancer_class\>";

    $type2 = "\<precancer_class\>Acquired large lesions with cancerous potential\<\/precancer_class\>";

    $type3 = "\<precancer_class\>Precancers in Inherited hyperplastic syndromes\<\/precancer_class\>";

    $type4 = "\<precancer_class\>Acquired diffuse hyperplasias/metaplasias\<\/precancer_class\>";

    $type5 = "\<precancer_class\>Unclassified\<\/precancer_class\>";

    $type6 = "\<precancer_class\>Possible error\<\/precancer_class\>";

    #C0865124|ENG|P|L1706272|PF|S1924849|Neoplasm of uncertain behavior of gingiva|0|

    #C0865125|ENG|P|L1706269|PF|S1924845|Neoplasm of uncertain behavior of bile ducts|0|

    #J0865125|ENG|P|L1706269|PF|S1924845|Biliary papillomatosis|0|

    #3   =  C0015565 =  FAMILIAL ATYPICAL MULTIPLE MOLE

    #5   =  C0015624 =  SPONASTRIME

    #   =  C0017531 =  GLNH

    #5   =  C0019196 =  HEP C

    #5   =  C0019682 =  HIV

    open (HASHER, "precanln.txt")||die"Can't open file";

    open (TYPE, "preshort.txt")||die"Can't open file";

    $typeline = " ";

    while ($typeline ne "")

      {

      $typeline = <TYPE>;

      $typeline =~ s/ +//g;

      @types = split(/\=/,$typeline);

      $cui = $types[1];

      $cui =~ s/[CJ]//o;

      if ($types[0] == 9)

        {

        $index{$cui} = $type0;

        }

      if ($types[0] == 1)

        {

        $index{$cui} = $type1;

        }

      if ($types[0] == 2)

        {

        $index{$cui} = $type2;

        }

      if ($types[0] == 3)

        {

        $index{$cui} = $type3;

        }

      if ($types[0] == 4)

        {

        $index{$cui} = $type4;

        }

      if ($types[0] == 5)

        {

        $index{$cui} = $type5;

        }

      if ($types[0] == "")

        {

        $index{$cui} = $type6;

        }

      }

    open (OUT, ">presum.xml");


    print OUT "\<\?xml version\=\"1\.0\"\?\>\n";

    print OUT "\<\!DOCTYPE precancer\>\n";

    print OUT "\<precancer\>\n";

    $oldkey = "";

    $line = " ";

    while ($line ne "")

      {

      $line = <HASHER>;

      if ($line eq "")

        {

        next;

        }

      chomp($line);

      $line =~ s/\</\'/o;

      $line =~ s/\>/\'/o;

      $line =~ s/\&/AND/o;

      @linerecord = split(/\|/,$line);

      $key = shift(@linerecord);

      $key =~ s/[CJ]//o;

      $value = \@linerecord;

      if ($key == $oldkey)

        {

        if ($value->[0] ne "ENG")

          {

          print OUT "\n     \<synonym\> \<term\> $value->[5] \<\/term\>  \<language\> $value->[0] \<\/language\> \<\/synonym\> ";

          }

        else

          {

          print OUT "\n    \<synonym\>  \<term\> $value->[5] \<\/term\>  \<\/synonym\> ";

          }

        }

      else

        {

        if ($oldkey eq "")

          {

          print OUT "\n\<concept\>\<cui\>$key \n  \<\/cui\> $index{$key} \<synonym\> \<term\> $value->[5] \<\/term\> \<\/synonym\>";

          }

        else

          {

          print OUT "\<\/concept\>\n\<concept\>\<cui\>$key \n \<\/cui\> $index{$key}\<synonym\> \<term\> $value->[5] \<\/term\> \<\/synonym\> ";

          }

        }

      $oldkey = $key;

      }

    close HASHER;

    print OUT "\n\<\/cui\>";

    print OUT "\n\<\/precancer\>";

    close OUT;

    exit;


    Appendix 2. Converting an XML file to an HTML file


    #!/usr/bin/perl

    #presum.pl 2,012 12-03-02  10:20am  A

    #parses the presum.xml file so that it looks like a nice

    #html file

    #This perl script was created by Jules Berman on November 27, 2002

    #This perl script is entered into the public domain

    #No warranties apply to this perl script

    #

    #

    use XML::Parser;

    $filename = "presum2.htm";

    print  "Your output will be in file $filename\n";

    open (STDOUT, ">$filename");


    my $parser = XML::Parser->new( Handlers => {

      Init => \&handle_doc_start,

      Final => \&handle_doc_end,

      Start => \&handle_elem_start,

      End => \&handle_elem_end,

      Char => \&handle_char_data,

      });


    my $file = "presum.xml";


    $parser -> parsefile($file);



    sub handle_doc_start

    {

    my $header = <<HEADER;

    <html>

    <head>

    <title>

    Precancer Classification

    </title>

    </head>

    <body>

    <center><h1>Precancer Classification</h1></center>

    <br>

    <br>

    HEADER

    print $header;

    }


    sub handle_doc_end

    {

    my $header = <<HEADER;

    <br>

    </body>

    </html>

    HEADER

    print $header;

    }


    my $count = 0;

    my $syncount = 0;


    sub handle_elem_start

      {

      my ($expat, $name, %atts) = @_;

      if ($name eq "concept")

        {

        $count++;

        print "\<br\><font color=\"0000ff\">$name $count</font><ul>\n";

        return;

        }

      if ($name eq "comment")

        {

        print "\<br\>\<hr\>\<pre\>";

        return;

        }

      if ($name eq "synonym")

        {

        $syncount++;

        print "\<br\><font color=\"CC6600\">$name $syncount</font><ul>\n";

        return;

        }

      print "\<li\><font color=\"ff0000\">$name</font>\n";

      }


    sub handle_elem_end

      {

      my($expat, $name) = @_;

      if ($name eq "concept")

        {

        print "<\/ul>\n";

        return;

        }

      if ($name eq "comment")

        {

        print "\<\/pre\>\<hr\>\<br\>";

        return;

        }

      if ($name eq "synonym")

        {

        print "<\/ul>\n";

        return;

        }

      #print "\<br\><font color=\"ff0000\">$name</font><br>\n";

      }


    sub handle_char_data

      {

      my ($expat, $text) = @_;

      $text =~ s/\n/ /g;

      print "$text\n";

      $text = "";

      }

    exit;


    The output file can be viewed at:

    http://65.222.228.150/jjb/presum.htm



    Appendix 3. Determining if a file is a file is a valid Tissue Micorarray (TMA) file, as described in the API's current draft of the TMA Data Exchange Specification


    The Association for Pathology Informatics has published a TMA data exchange specifcation (Berman JJ, Edgerton ME, Friedman BA.  The tissue microarray data exchange specification: A community-based, open source tool for sharing tissue microarray data. BMC Medical Informatics and Decision Making 2003 3:5

    http://www.biomedcentral.com/1472-6947/3/5


    #!/usr/bin/perl

    #validtma.pl

    use XML::Parser;

    $opening = <<"EOF";


    This script was created by Jules J. Berman on Jan 15, 2002.

    The purpose of this script is to validate Tissue Microarray (TMA)

    files that were created using the Common Data Elements (CDEs)

    and sematics rules developed by the Association for Pathology

    Informatics TMA Data Exchange Specification


    This script requires the following two external Perl modules:

    XML::Parser and MD5, both of which can be downloaded from

    CPAN (Comprehensive Perl Archive Network) at www.cpan.org


    The specification and all necessary URIs are listed and described

    in a recently submitted manuscript.  All publications that refer

    to or derive from the TMA Data Exchange Specification should

    cite the following:


    Berman JJ, Edgerton ME, Friedman B.  The Tissue Microarray Data

    Exchange Specification: A Community-based, Open Source Tool for

    Sharing Tissue Microarray Data. In preparation, BMC Medical

    Informatics and Decision Making, 2003.


    This script may be used freely.  Bugs should be reported

    to Jules Berman at bermanj\@mail.nih.gov


    The software is provided "as is", without warranty of any kind,

    express or implied, including but not limited to the warranties

    of merchantability, fitness for a particular purpose and

    noninfringement. in no event shall the authors or copyright

    holders be liable for any claim, damages or other liability,

    whether in an action of contract, tort or otherwise, arising

    from, out of or in connection with the software or the use or

    other dealings in the software.


    Enter a file name you would like to validate.

    The file must be in the same directory as the

    validator script, validtma.pl and must end with the

    extension .xml or .XML


    The first part of the filename, prior to the extension

    may not contain spaces and must be 8 characters

    or less in length


    Enter your filename now


    EOF

    print $opening;

    my %index = &build_cde;

    my $line = " ";

    my $parser = XML::Parser->new( Handlers => {

      Init => \&handle_doc_start,

      Final => \&handle_doc_end,

      Start => \&handle_elem_start,

      End => \&handle_elem_end,

      Char => \&handle_char_data,

      });

    my %filetags;

    $file = <STDIN>;

    $file =~ s/\n//;

    if ($file eq "")

      {

      die "You didn't enter a filename.  Try again later.\n";

      }

    $parser -> parsefile($file);


    sub handle_doc_start

    {

    print "\nBegining to parse $file now\n\n";

    }


    sub handle_doc_end

    {

    exists($filetags{"header"})||die "No header tag. TMA file is invalid.\n";

    exists($filetags{"tma"})||die "No tma tag. TMA file is invalid.\n";

    exists($filetags{"histo"})||die "No histo tag. TMA file is invalid.\n";

    exists($filetags{"slide"})||die "No slide tag. TMA file is invalid.\n";

    exists($filetags{"core"})||die "No core tag. TMA file is invalid.\n";

    exists($filetags{"block"})||die "No block tag. TMA file is invalid.\n";

    print "\nFinished. $file is a valid Tissue Microarray File.\n"

    }


    my @build;

    sub handle_elem_start

      {

      my $count = 0;

      my $previous;

      my ($expat, $name, %atts) = @_;

      $name =~ s/^[0-9a-z\-\_]*\://i;

      if (exists $index{$name})

        {

        $filetags{$name}++;

        print "$name\n";

        if ($name eq "histo")

          {

          if (scalar(@build) > 0)

            {

            die "Your first element must be the root element, histo\n";

            }

          undef @build;

          unshift (@build, $name)

          }

        if ($name eq "tma")

          {

          $previous = shift(@build);

          if ($previous ne "histo")

            {

            die "You have a tma element not preceded by a histo element\n";

            }

          else

            {

            unshift (@build, $name)

            }

          }

        if ($name eq "header")

          {

          $previous = shift(@build);

          if ($previous ne "tma")

            {

            die "You have a header element not preceded by a tma element\n";

            }

          else

            {

            unshift (@build, $name);

            }

          }

        if ($name eq "block")

          {

          $previous = shift(@build);

          if ($previous ne "header")

            {

            die "You have a block element not preceded by a header element\n";

            }

          else

            {

            unshift (@build, $name);

            }

          }

        if ($name eq "slide")

          {

          $previous = shift(@build);

          unshift(@build, $previous); #lets's put this back in the pile

          if (($previous eq "block") || ($previous eq "core")|| ($previous eq "slide"))

            {

            unshift (@build, $name);

            }

          else

            {

            die "You have a slide element not preceded by a block or core or slide element\n";

            }

          }

        if ($name eq "core")

          {

          $previous = shift(@build);

          unshift(@build, $previous);

          if (($previous eq "block") || ($previous eq "slide")|| ($previous eq "core"))

            {

            unshift (@build, $name);

            }

          else

            {

            die "You have a core element not preceded by a block or slide or core element\n";

            }

          }

        if ($name =~ /\_/)

          {

          my @name_array = split(/\_/, $name);

          my $last = pop(@name_array);

          my $next_to_last = pop(@name_array);

          $previous = shift (@build);

          if ($last eq $previous)

             {

             unshift (@build,$previous);  #The element has been repeated

             return;                      #That's ok.  Just put it back in the

             }                            #build array and return

          if ($previous eq $next_to_last)

             {

             unshift (@build, $previous); #the element has a new hierarchical

             unshift (@build, $last);     #extension.  add the previous last

             return;                      #and the new last to build array

             }                            #and return

          $previous = shift (@build);     #let's look back one more in build array

          if ($last eq $previous)

             {

             unshift (@build,$previous);  #The element has stepped back

             return;                      #one in the hierarchy. That's ok. Just put it back in the

             }                            #build array and return

          if ($previous eq $next_to_last)

             {

             unshift (@build, $previous); #the element has backed up one and

             unshift (@build, $last);     #added a new extensions.  Add the

             return;                      #previous and the last to the build array

             }                            #and return

          die "The element $name does not fit in the hierarchy\n";

          }

        }

      }


    sub handle_elem_end

      {

      my($expat, $name) = @_;

      $name =~ s/^[0-9a-z\-\_]*\://i;

      if (exists $index{$name})

        {

        if ($name eq "tma")

          {

          undef @build;

          unshift(@build, $name)

          }

        if ($name =~ /\_/)

          {

          my @name_array = split(/\_/, $name);

          my $last = pop(@name_array);

          $previous = shift (@build);

          unless ($last eq $previous)

             {

             unshift (@build, $previous)

             }

           }

        }

      }


    sub handle_char_data

      {

      my $expat = $_[0];

      $text .= "$_[1]";

      }



    use MD5;

    open (TEXT,"$file");

    $context = new MD5;

    $context->addfile(TEXT);

    $digest = $context->digest();

    $readable_digest = unpack ("H*", $digest);

    print "\nThe one-way hash of your file is $readable_digest";


    sub build_cde

    {

    my %build_index = qw(

    histo 1

    header 1

    Title 1

    Creator 1

    SubjectandKeywords 1

    Description 1

    Publisher 1

    Contributor 1

    Date 1

    ResourceType 1

    Format 1

    ResourceIdentifier 1

    Source 1

    Language 1

    Relation 1

    Coverage 1

    RightsManagement 1

    filename 1

    tma 1

    block 1

    block_purpose 1

    block_identifier 1

    block_description 1

    block_meshnumbers 1

    block_related_blocks 1

    block_creator 1

    block_creation-date 1

    block_protocol 1

    block_number-of-cores 1

    block_core-size 1

    block_core-spacing 1

    block_qa 1

    block_status 1

    block_array-hash 1

    slide 1

    slide_identifier 1

    slide_cut-date 1

    slide_storage 1

    slide_section-thickness 1

    slide_level 1

    slide_sectioning-protocol 1

    slide_test-date 1

    slide_test-category 1

    slide_test-protocol 1

    slide_stain 1

    slide_test-control 1

    core 1

    core_array-id 1

    core_tissue-fixative 1

    core_organism 1

    core_demographics-record 1

    core_anatomic-site 1

    core_organ 1

    core_tissue-origin 1

    core_data-descriptor 1

    core_data-descriptor_QC 1

    core_clinical-info 1

    core_clinical-info_id 1

    core_clinical-info_availability 1

    core_histo-repository 1

    core_histo-repository_specimen-id 1

    core_histo-repository_specimen-diagnosis 1

    core_histo-repository_donor-block 1

    core_histo-repository_donor-block_drill-site 1

    core_histo-repository_donor-block_drill-site_diagnosis 1

    core_results 1

    core_results_slide-id 1

    core_results_method-of-reading-data 1

    core_results_number-of-nuclei-counted 1

    core_results_tissue-intensity 1

    core_results_percent-tissue-staining 1

    core_results_compartment-staining 1

    core_results_staining-pattern 1

    core_image 1

    core_image_slide-id 1

    core_image_identifier 1

    core_image_format 1

    core_image_checksum 1

    core_image_magnification 1

    core_image_pixel_size 1);

    return %build_index;

    }

    exit;



    Typical output:

    Typical screen interaction and output:


    Enter your filename now

    c:\tmacpctr.xml

    Begining to parse c:\tmacpctr.xml now.

    histo

    tma

    header

    filename

    block

    block_purpose

    block_identifier

    block_description

    block_meshnumbers

    block_creator

    block_protocol

    block_qa

    slide

    slide_identifier

    slide_storage

    slide_level

    slide_stain

    core

    Finished. c:\tmacpctr.xml is a valid Tissue Microarray File.

    The one-way hash of your file is e2ad62a75974628b7499bd7d771b82f0


    Appendix 4. Using DOM, the Document Object Model


    The dom.pl script uses the xml::dom module that can be downloaded

    from activestate


    #!/usr/bin/perl

    #dom.pl

    #Creates a reference data structure from the xml file named "presum"

    #and produces the value key for the first concept's first synonym's

    #first term.

    #$VAR1 = {'concept' => [{'synonym' => [{'term' => ' Colitis ulcerosa '

    use XML::DOM;

    #use Data::Dumper;

    #Data::Dumper is not needed for this script, but I like to have it

    #as a back-up way of digging into the data structures

    open (OUT, ">domout.txt");

    open (PRE, ">presum.txt");

    while(<DATA>)

      {

      $line = <DATA>;

      print PRE $line;

      }

    close PRE;

    $infile = "presum.xml";

    my $dom_parser = new XML::DOM::Parser;

    my $doc = $dom_parser->parsefile($infile);

    $termnodes = $doc->getElementsByTagName("term");

    for ($i=1;$i < $termnodes->getLength;$i++)

       {

       $termitem = $termnodes->item($i);

       #the child of term is the text

       @childnodes = $termitem->getChildNodes;

       foreach $value (@childnodes)

          {

          $text = $value->getNodeValue;

          print "$text\n";

          }

       }

    #print OUT Dumper($doc); #If we wanted, this would print out the

    #print Dumper($doc);     #entire doc datastructure

    exit;

    __DATA__

    <?xml version="1.0"?>

    <!DOCTYPE precancer>

    <precancer>

    <concept><cui>0009324

      </cui> <precancer_class>Acquired diffuse hyperplasias/metaplasias</precancer_class> <synonym> <term> Colitis ulcerosa </term> </synonym>

         <synonym> <term> Proctocolitis, hemorragische </term>  <language> DUT </language> </synonym>

         <synonym> <term> Proctocolitis, ulceratieve </term>  <language> DUT </language> </synonym>

         <synonym> <term> Rectocolitis, hemorragische </term>  <language> DUT </language> </synonym>

         <synonym> <term> Rectocolitis, ulceratieve </term>  <language> DUT </language> </synonym>

        <synonym>  <term> Colitis, Ulcerative </term>  </synonym>

        <synonym>  <term> Ulcerative colitis </term>  </synonym>

        <synonym>  <term> ULCERATIVE COLITIS </term>  </synonym>

        <synonym>  <term> ulcerative colitis </term>  </synonym>

        <synonym>  <term> COLITIS ULCERATIVE </term>  </synonym>

        <synonym>  <term> COLITIS, ULCERATIVE </term>  </synonym>

        <synonym>  <term> Colitis, ulcerative </term>  </synonym>

         <synonym> <term> THROMBOZYTHAEMIE </term>  <language> GER </language> </synonym>

         <synonym> <term> Thrombozythaemie </term>  <language> GER </language> </synonym>

         <synonym> <term> Trombocitosi </term>  <language> ITA </language> </synonym>

         <synonym> <term> TROMBOCITOSE </term>  <language> POR </language> </synonym>

         <synonym> <term> TROMBOCITEMIA </term>  <language> POR </language> </synonym>

         <synonym> <term> AUMENTO DO NUMERO DE PLAQUETAS </term>  <language> POR </language> </synonym>

         <synonym> <term> TROMBOTSITOZ </term>  <language> RUS </language> </synonym>

         <synonym> <term> TROMBOTSITEMIIA </term>  <language> RUS </language> </synonym>

         <synonym> <term> TROMBOCITOSIS </term>  <language> SPA </language> </synonym>

         <synonym> <term> TROMBOCITEMIA </term>  <language> SPA </language> </synonym>

         <synonym> <term> PLAQUETAS AUMENTADAS </term>  <language> SPA </language> </synonym>

    </concept>

    </precancer>









    biomedical informatics cover Perl Programming for Medicine and Biology Cover