Writing Simple Perl Scripts to Create, Convert and Analyze XML Documents
Presented for:
APIII - Advancing Practice Instruction and Innovation through Informatics
Marriott City Center, Pittsburgh, PA
Friday, October 10, 2003
Session E2
Perl and Python Programming Workshop
Session Organizers: Jules Berman and Jim Harrison
Jules J. Berman, Ph.D., M.D.*
Program Director for Pathology Informatics
Cancer Diagnosis Program
National Cancer Institute
National Institutes of Health
Rockville, MD
*The contents of this manuscript are free of copyright restrictions and are entered into the public domain. The software is provided "as is" and has no implied or expressed warranties. The author is not responsible for any uses or abuses of the software. All statements in the text are the opinion of the author and do not represent NIH policy.
Introduction
Everyone has heard of XML (eXtensible Markup Language) and knows something about what it is and how it can be used. XML is one of those useful technologies that can be used on a very simple level or on a very complex level:
Simple XML
tags
enclosed data elements
nothing else
Complex XML
Lots of markup elements (e.g., entities, notation)
Defined content rules (e.g., DTDs and Schemas)
XML-specific transformation/Query languages (e.g., XSLT, XQuery)
Document-Specific presentation methods (e.g., CSS)
XML-specific parsers and data organizers (e.g., DOM)
Professional programmers may be well-advised to learn the complexities of XML. However, many people can get all they need from XML using a few very simple Perl scripts that handle the most common tasks encountered by anyone who uses XML.
These common tasks are:
1. Converting an Excel file to an XML file.
2. Converting an HTML file to an XML file
3. Converting an XML file to an HTML file (e.g. making an XML file presentable while preserving its information content)
4. Converting an XML file to a different data structure (e.g. moving XML into a standard database)
5. Querying an XML file
6. Querying multiple XML files for related information
The purpose of the workshop is to teach participants how to use Perl to perform these six tasks. Simple example files will be used, and the Perl scripts used in the workshop will be available at the following URL:
http://65.222.228.150/jjb/tutor2.htm
This web page, entitled "Perl tutorial for pathologists and laboratory scientists," contains more than 700 factoids containing a Perl script or a useful Perl hint (see Table 1). Perl is a free, open-source programming language that will work on almost any operating system. That means that the same Perl script that works on a Windows machine will also work on a Linux machine or a Macintosh machine (as long as you don't corrupt your scripts with platform-specific instructions). The Perl tutorial contains information on how to obtain and install Perl.
During the workshop, participants will be given a short introduction to Perl, so that they can follow the example scripts from the workshop. The Perl scripts used in workshop can be accessed from the website by querying on the term XML in the searchbox at the top of the webpage.
The purpose of this paper is to provide a short explanation of XML and to provide an appendix containing several scripts that the more advanced workshop participant my study at their leisure. The long scripts contained in this text will not be covered, in detail, at the workshop, but will be available in the online Perl tutorial (above).
What is XML?
XML is an annotation method that permits you to enclose information in tags that describe the information. Tags occur in pairs, each pair enclosed by angle brackets
For instance, Brigitte Bardot's birthdate:
<birthday>September 28, 1950</birthday> is an example of how XML tags can flank a piece of data and provide useful information about the enclosed data.
XML permits you to organize and atomize data to enhance its meaning and the depth of its relationships to other data elements.
For instance, more specific information might be contained in the following;
<birthdate>
<date>
<month>September</month>
<day>28</day>
<year>1934</year>
</date>
<person>
<first_name>Brigitte</first_name>
<last_name>Bardot</last_name>
</person>
</birthdate>
Notice that every beginning tag (<something>) has an end-tag marked with a slash (</something>).
What is the purpose of XML?
XML exists so that data in one dataset can be related to data from another dataset. The key to all this is the well-defined XML tag, which imposes a consistency on data elements and a persistent relationship between different XML tags. Properly executed, data elements in heterogeneous XML documents can be merged, transformed, and analyzed with ease.
What is a proper XML tag?
Since tags are such an important part of XML, people who create XML files should know that not everything enclosed by angle brackets is a proper XML tag. A valid XML tag must obey the following rules:
1. Must begin with an underscore or a letter
2. May contain letters, digits, periods, hyphens, underscores and
colons, but no other types of characters
3. Can not contain white space (spaces, tabs, return character)
4. Case sensitive
The rules for XML tags are discussed thoroughly in White C, Quin L, Burman L. Mastering XML: Premium edition. Sybex, San Francisco, 2001, p. 44
I wrote this simple Perl script to check whether a given tag complies with the structural rules for XML tags. Regarding the colon, I've added one rule of my own. The colon should only be used as a separator between a namespace token and an XML tag (e.g. dc:title). There's never any non-confusing reason to include two colons in a single tag name, so I've disallowed the use of two
or more colons. In the tagcheck.pl perl script, the @elements array contains various terms, some of which could be used in valid tags, others of which could not. The script tells you which are good and which are bad. In order to understand the script, you need to learn something about Perl regular expressions. The confusing regular expression in the script is:
$value =~ /^[a-z\_]{1}[a-z0-9\-\.\_\:]*[a-z0-9\-\.\_]+$/i
Otherwise, the script has the kind of structure and syntax you might expect in many other languages (C, Python, Ruby, Basic). I will explain a little bit about Perl regular expressions in the workshop, but eager students should read a good Perl book or study some of the examples in the online Perl Tutorial.
#!/usr/local/bin/perl
#tagcheck.pl
#This will check whether a given xml tag or attribute
#is proper markup.
@elements = qw (gene 4gene gene:ncbi gene-autry ge::ne
gene&autry -gene _gene gene- gene:
:gene ge:n:e ge:ne: ge,ne ge.ne);
foreach $value (@elements)
{
if ($value =~ /^[a-z\_]{1}[a-z0-9\-\.\_\:]*[a-z0-9\-\.\_]+$/i)
{
if ($value =~ /\:.*\:/) #can't contain two colons
{
print "$value is bad\n";
}
else
{
print "$value is good\n";
}
}
else
{
print "$value is bad\n";
}
}
What is a well-formed XML document?
A well-formed XML document is a collection of nested valid tags and text. The rules of nesting are simple:
1. Every XML document must have a root element marking the beginning and end of the document's tags.
2. Every begin-tag must have an end-tag
3. A tag that is enclosed by another tag must have its end-tag enclosed by the other tag's end-tag.
e.g <report><accession_number></accession_number></report> is ok, but
<report><accession_number></report></accession_number>
These three rules bring an order and sensibility to XML that is lacking in HTML. HTML can be written as a well-formed document, but the common use of begin-tags that have no end-tag (e.g. <br>) and the use of overlapped tag-pairs that do not follow simple nesting rules (e.g. <center><b>Hello</center></b>) make it difficult to develop a general tool for HTML to XML conversions.
However, when an HTML file is well-formed, it is simple to substitute XML tags when you wish to have an XML document.
<body>
<h1>hello world</hello world>
</body>
A simple Perl script will substitute HTML tags with XML tags. In this case, the <body> tag is changed to a <document> tag, and the <h1> tag is exchanged for the <sentence> tag. I could have invented any XML tags for this purpose. The lesson is that well-formed HTML can be converted to XML when you have a named tag-equivalent for the HTML markup.
The script below will be reviewed in the workshop. By modifying this script, anyone will be able to interconvert well-formed HTML and XML files.
#!/usr/bin/perl
#html2xml
open (TEXT, "html.htm")||die"Cannot";
$line = " ";
%dictionary = ("body" => "document", "h1" => "sentence");
@keysarray = keys(%dictionary);
while ($line ne "")
{
$line = <TEXT>;
foreach $key (@keysarray)
{
$line =~ s/(\<[\/]?)$key/$1$dictionary{$key}/g;
}
print $line;
}
exit;
The output of the script is:
c:\ftp>perl html2xml.pl
<document>
<sentence>hello world</hello world>
</document>
Using the same strategy, it's very easy to design a Perl script that will convert a well-formed XML document to HTML.
Attributes
In addition, tags may contain attributes. Attribute are name-value pairs appearing in the begin-tag, with the value item enclosed in quotes.
<pathology value="immense">
Attributes will only be discussed briefly in the workshop. Keep in mind that any attribute(s) can be converted to regular XML tags.
<pathology>
<pathology_value>immense</pathology_value>
</pathology>
What is a valid XML document?
A valid XML document has two properties.
1. It is a well-formed XML document
2.It conforms to a DTD or Schema that specifies the structure of the XML document
Appendix
Appendix 1. Creating an XML file from flat-file data records
The latest version of the complete dataset file is:
PRECANLN .TXT 319,030 3-03-02
Records in the file look like:
C0009324|DUT|P|L2054117|PF|S2392339|Colitis ulcerosa|3|
C0009324|DUT|S|L2072717|PF|S2410939|Proctocolitis, hemorragische|3|
C0009324|DUT|S|L2072718|PF|S2410940|Proctocolitis, ulceratieve|3|
C0009324|DUT|S|L2073703|PF|S2411925|Rectocolitis, hemorragische|3|
C0009324|DUT|S|L2073704|PF|S2411926|Rectocolitis, ulceratieve|3|
C0009324|ENG|P|L0009324|PF|S0026432|Colitis, Ulcerative|0|
C0009324|ENG|P|L0009324|VCW|S0000599|Ulcerative colitis|0|
C0009324|ENG|P|L0009324|VCW|S0410942|ULCERATIVE COLITIS|0|
C0009324|ENG|P|L0009324|VCW|S0427016|ulcerative colitis|0|
C0009324|ENG|P|L0009324|VC|S0255261|COLITIS ULCERATIVE|0|
C0009324|ENG|P|L0009324|VC|S0361286|COLITIS, ULCERATIVE|0|
C0009324|ENG|P|L0009324|VC|S0363932|Colitis, ulcerative|3|
C0009324|ENG|P|L0009324|VC|S1632158|Colitis ulcerative|3|
C0009324|ENG|P|L0009324|VC|S1913610|Colitis;ulcerative|3|
C0009324|ENG|P|L0009324|VO|S1052582|Ulcerative colitis, NOS|3|
C0009324|ENG|P|L0009324|VP|S0026424|Colitides, Ulcerative|0|
C0009324|ENG|P|L0009324|VWP|S0096521|Ulcerative Colitides|0|
C0009324|ENG|P|L0009324|VW|S0411375|Ulcerative Colitis|0|
The file that has the class assignments for all the different
cuis in the precanln.txt file is:
PRESHORT .TXT 26,851 3-03-02
Records in the file look like this:
4 = C0001175 = AIDS
3 = C0001436 = Adenomatosis
2 = C0001815 = mmm
2 = C0002893 = Erythrodysplasia
2 = C0002894 = RAEB
4 = C0004763 = Barrett Syndrome
1 = C0006079 = Bowen Disease
9 = C0007099 = in situ cancer
1 = C0007124 = DCIS
4 = C0007570 = Sprue
1 = C0007868 = Cervix Dysplasia
4 = C0009324 = uc
5 = C0012895 = HPV DNA Probes
1 = C0013403 = B K Mole Syndrome
3 = C0014084 = Ollier Disease
5 = C0014644 = EBV
3 = C0015565 = FAMILIAL ATYPICAL MULTIPLE MOLE
5 = C0015624 = SPONASTRIME
3 = C0017531 = GLNH
5 = C0019196 = HEP C
5 = C0019682 = HIV
The script that takes the preshort.txt file and precanln.txt file
and creates an xml file is:
PRECXML .PL 3,062 3-04-02
#/usr/local/bin/perl
#prexml.pl
$type0 = "\<precancer_class\>Precancer superclass and associated modifiers\<\/precancer_class\>";
$type1 = "\<precancer_class\>Acquired microscopic precancers\<\/precancer_class\>";
$type2 = "\<precancer_class\>Acquired large lesions with cancerous potential\<\/precancer_class\>";
$type3 = "\<precancer_class\>Precancers in Inherited hyperplastic syndromes\<\/precancer_class\>";
$type4 = "\<precancer_class\>Acquired diffuse hyperplasias/metaplasias\<\/precancer_class\>";
$type5 = "\<precancer_class\>Unclassified\<\/precancer_class\>";
$type6 = "\<precancer_class\>Possible error\<\/precancer_class\>";
#C0865124|ENG|P|L1706272|PF|S1924849|Neoplasm of uncertain behavior of gingiva|0|
#C0865125|ENG|P|L1706269|PF|S1924845|Neoplasm of uncertain behavior of bile ducts|0|
#J0865125|ENG|P|L1706269|PF|S1924845|Biliary papillomatosis|0|
#3 = C0015565 = FAMILIAL ATYPICAL MULTIPLE MOLE
#5 = C0015624 = SPONASTRIME
# = C0017531 = GLNH
#5 = C0019196 = HEP C
#5 = C0019682 = HIV
open (HASHER, "precanln.txt")||die"Can't open file";
open (TYPE, "preshort.txt")||die"Can't open file";
$typeline = " ";
while ($typeline ne "")
{
$typeline = <TYPE>;
$typeline =~ s/ +//g;
@types = split(/\=/,$typeline);
$cui = $types[1];
$cui =~ s/[CJ]//o;
if ($types[0] == 9)
{
$index{$cui} = $type0;
}
if ($types[0] == 1)
{
$index{$cui} = $type1;
}
if ($types[0] == 2)
{
$index{$cui} = $type2;
}
if ($types[0] == 3)
{
$index{$cui} = $type3;
}
if ($types[0] == 4)
{
$index{$cui} = $type4;
}
if ($types[0] == 5)
{
$index{$cui} = $type5;
}
if ($types[0] == "")
{
$index{$cui} = $type6;
}
}
open (OUT, ">presum.xml");
print OUT "\<\?xml version\=\"1\.0\"\?\>\n";
print OUT "\<\!DOCTYPE precancer\>\n";
print OUT "\<precancer\>\n";
$oldkey = "";
$line = " ";
while ($line ne "")
{
$line = <HASHER>;
if ($line eq "")
{
next;
}
chomp($line);
$line =~ s/\</\'/o;
$line =~ s/\>/\'/o;
$line =~ s/\&/AND/o;
@linerecord = split(/\|/,$line);
$key = shift(@linerecord);
$key =~ s/[CJ]//o;
$value = \@linerecord;
if ($key == $oldkey)
{
if ($value->[0] ne "ENG")
{
print OUT "\n \<synonym\> \<term\> $value->[5] \<\/term\> \<language\> $value->[0] \<\/language\> \<\/synonym\> ";
}
else
{
print OUT "\n \<synonym\> \<term\> $value->[5] \<\/term\> \<\/synonym\> ";
}
}
else
{
if ($oldkey eq "")
{
print OUT "\n\<concept\>\<cui\>$key \n \<\/cui\> $index{$key} \<synonym\> \<term\> $value->[5] \<\/term\> \<\/synonym\>";
}
else
{
print OUT "\<\/concept\>\n\<concept\>\<cui\>$key \n \<\/cui\> $index{$key}\<synonym\> \<term\> $value->[5] \<\/term\> \<\/synonym\> ";
}
}
$oldkey = $key;
}
close HASHER;
print OUT "\n\<\/cui\>";
print OUT "\n\<\/precancer\>";
close OUT;
exit;
Appendix 2. Converting an XML file to an HTML file
#!/usr/bin/perl
#presum.pl 2,012 12-03-02 10:20am A
#parses the presum.xml file so that it looks like a nice
#html file
#This perl script was created by Jules Berman on November 27, 2002
#This perl script is entered into the public domain
#No warranties apply to this perl script
#
#
use XML::Parser;
$filename = "presum2.htm";
print "Your output will be in file $filename\n";
open (STDOUT, ">$filename");
my $parser = XML::Parser->new( Handlers => {
Init => \&handle_doc_start,
Final => \&handle_doc_end,
Start => \&handle_elem_start,
End => \&handle_elem_end,
Char => \&handle_char_data,
});
my $file = "presum.xml";
$parser -> parsefile($file);
sub handle_doc_start
{
my $header = <<HEADER;
<html>
<head>
<title>
Precancer Classification
</title>
</head>
<body>
<center><h1>Precancer Classification</h1></center>
<br>
<br>
HEADER
print $header;
}
sub handle_doc_end
{
my $header = <<HEADER;
<br>
</body>
</html>
HEADER
print $header;
}
my $count = 0;
my $syncount = 0;
sub handle_elem_start
{
my ($expat, $name, %atts) = @_;
if ($name eq "concept")
{
$count++;
print "\<br\><font color=\"0000ff\">$name $count</font><ul>\n";
return;
}
if ($name eq "comment")
{
print "\<br\>\<hr\>\<pre\>";
return;
}
if ($name eq "synonym")
{
$syncount++;
print "\<br\><font color=\"CC6600\">$name $syncount</font><ul>\n";
return;
}
print "\<li\><font color=\"ff0000\">$name</font>\n";
}
sub handle_elem_end
{
my($expat, $name) = @_;
if ($name eq "concept")
{
print "<\/ul>\n";
return;
}
if ($name eq "comment")
{
print "\<\/pre\>\<hr\>\<br\>";
return;
}
if ($name eq "synonym")
{
print "<\/ul>\n";
return;
}
#print "\<br\><font color=\"ff0000\">$name</font><br>\n";
}
sub handle_char_data
{
my ($expat, $text) = @_;
$text =~ s/\n/ /g;
print "$text\n";
$text = "";
}
exit;
The output file can be viewed at:
http://65.222.228.150/jjb/presum.htm
Appendix 3. Determining if a file is a file is a valid Tissue Micorarray (TMA) file, as described in the API's current draft of the TMA Data Exchange Specification
The Association for Pathology Informatics has published a TMA data exchange specifcation (Berman JJ, Edgerton ME, Friedman BA. The tissue microarray data exchange specification: A community-based, open source tool for sharing tissue microarray data. BMC Medical Informatics and Decision Making 2003 3:5
http://www.biomedcentral.com/1472-6947/3/5
#!/usr/bin/perl
#validtma.pl
use XML::Parser;
$opening = <<"EOF";
This script was created by Jules J. Berman on Jan 15, 2002.
The purpose of this script is to validate Tissue Microarray (TMA)
files that were created using the Common Data Elements (CDEs)
and sematics rules developed by the Association for Pathology
Informatics TMA Data Exchange Specification
This script requires the following two external Perl modules:
XML::Parser and MD5, both of which can be downloaded from
CPAN (Comprehensive Perl Archive Network) at www.cpan.org
The specification and all necessary URIs are listed and described
in a recently submitted manuscript. All publications that refer
to or derive from the TMA Data Exchange Specification should
cite the following:
Berman JJ, Edgerton ME, Friedman B. The Tissue Microarray Data
Exchange Specification: A Community-based, Open Source Tool for
Sharing Tissue Microarray Data. In preparation, BMC Medical
Informatics and Decision Making, 2003.
This script may be used freely. Bugs should be reported
to Jules Berman at bermanj\@mail.nih.gov
The software is provided "as is", without warranty of any kind,
express or implied, including but not limited to the warranties
of merchantability, fitness for a particular purpose and
noninfringement. in no event shall the authors or copyright
holders be liable for any claim, damages or other liability,
whether in an action of contract, tort or otherwise, arising
from, out of or in connection with the software or the use or
other dealings in the software.
Enter a file name you would like to validate.
The file must be in the same directory as the
validator script, validtma.pl and must end with the
extension .xml or .XML
The first part of the filename, prior to the extension
may not contain spaces and must be 8 characters
or less in length
Enter your filename now
EOF
print $opening;
my %index = &build_cde;
my $line = " ";
my $parser = XML::Parser->new( Handlers => {
Init => \&handle_doc_start,
Final => \&handle_doc_end,
Start => \&handle_elem_start,
End => \&handle_elem_end,
Char => \&handle_char_data,
});
my %filetags;
$file = <STDIN>;
$file =~ s/\n//;
if ($file eq "")
{
die "You didn't enter a filename. Try again later.\n";
}
$parser -> parsefile($file);
sub handle_doc_start
{
print "\nBegining to parse $file now\n\n";
}
sub handle_doc_end
{
exists($filetags{"header"})||die "No header tag. TMA file is invalid.\n";
exists($filetags{"tma"})||die "No tma tag. TMA file is invalid.\n";
exists($filetags{"histo"})||die "No histo tag. TMA file is invalid.\n";
exists($filetags{"slide"})||die "No slide tag. TMA file is invalid.\n";
exists($filetags{"core"})||die "No core tag. TMA file is invalid.\n";
exists($filetags{"block"})||die "No block tag. TMA file is invalid.\n";
print "\nFinished. $file is a valid Tissue Microarray File.\n"
}
my @build;
sub handle_elem_start
{
my $count = 0;
my $previous;
my ($expat, $name, %atts) = @_;
$name =~ s/^[0-9a-z\-\_]*\://i;
if (exists $index{$name})
{
$filetags{$name}++;
print "$name\n";
if ($name eq "histo")
{
if (scalar(@build) > 0)
{
die "Your first element must be the root element, histo\n";
}
undef @build;
unshift (@build, $name)
}
if ($name eq "tma")
{
$previous = shift(@build);
if ($previous ne "histo")
{
die "You have a tma element not preceded by a histo element\n";
}
else
{
unshift (@build, $name)
}
}
if ($name eq "header")
{
$previous = shift(@build);
if ($previous ne "tma")
{
die "You have a header element not preceded by a tma element\n";
}
else
{
unshift (@build, $name);
}
}
if ($name eq "block")
{
$previous = shift(@build);
if ($previous ne "header")
{
die "You have a block element not preceded by a header element\n";
}
else
{
unshift (@build, $name);
}
}
if ($name eq "slide")
{
$previous = shift(@build);
unshift(@build, $previous); #lets's put this back in the pile
if (($previous eq "block") || ($previous eq "core")|| ($previous eq "slide"))
{
unshift (@build, $name);
}
else
&nbs