Pathology Data Integration with eXtensible Markup Language.
Human Pathology 2005, 36(2):139-145.
Key words: XML, pathology informatics, data integration, data standards
It is impossible to overstate the importance of XML (eXtensible Markup
Language) as a data organization tool. With XML, pathologists can
annotate all of their data (clinical and anatomic) in a format that
can transform every pathology report into a database, without
compromising narrative structure. The purpose of this manuscript is
to provide an overview of XML basics for pathologists. Examples
will demonstrate how pathologists can use XML to annotate the
individual data elements and to structure reports in a common format
that can be merged with other XML files or queried using standard XML
tools. This manuscript gives pathologists a glimpse into how XML
annotation can benefit patients, enhance their ability to compete for
research funding, and reduce their dependence on centralized,
These are interesting times for pathologists charged with managing
laboratory information. Everyone seems to want pathology data, but
federal regulations restrict access to medical records.1-4
The world of information technology encourages open source software
solutions, but large medical centers are opting for megamillion
dollar commercial information systems that lock hospital data into
proprietary systems. Bioinformaticians luxuriate in freely available
standards and databases, while the procedure terminologies and
disease nomenclatures used by pathologists, surgeons and clinicians
are closely guarded intellectual properties.5
Pathologists are told that they must include specific data elements
in their cancer-related reports,6 but they have no
standard methods to utilize this data (i.e., retrieve the data,
integrate the data with related clinical or research datasets, and
merge the data from multiple institutions). The Internet and
broadband telecommunications permit the rapid transmission of huge
amounts of information, but institutional Security Officers seem
devoted to ensuring that pathology information never crosses the
It seems that two distinct tracks of biomedical data have emerged: the
medical track that captures data in proprietary information systems,
and the research track that has developed a startling array of free
and open methods for data organization, data integration and data
Pathologists seem to be unaware of the progress made by biologists in
the field of data annotation and data integration.16-22
Data annotation is the very simple concept that every piece of data
in a record can be annotated with another set of information that
describes the data (so-called metadata). Once data has been
annotated, it can be associated with other, related data [data
integration], even when the other data is found in a seemingly
unrelated database.23 In the past few years, biomedical
informatics has transformed into the science that derives biomedical
value from computations performed on annotated databases.
The purpose of this manuscript is to review some of the advances in
biomedical informatics, identifying specific methods that can be used
in a pathology setting. The manuscript will provide an overview of
XML (eXtensible Markup Language), the most important advance in data
organization since the invention of the book.23-25 XML
documents have properties that permit data integration between
research databases and pathology data. There are mutual advantages
for pathologists who understand and utilize data integrating
technologies and for biomedical researchers who require specific
pathologic data related to human diseases.
The importance of XML as a data organizing tool cannot be overstated.
As a data organizing technology, it is as important as the invention
of written language (circa 3000 BC), or the mass-printed book (circa
1450 AD). At its simplest, XML is a method for marking up files so
that every piece of data is surrounded by bracketed text that
describes the piece of data (e.g. <number>5</number>).
Markup allows us to convey any message as XML (a pathology report, a
radiology image, a genome database, a software program, an email).
XML markup tags are sets of alphanumeric descriptors enclosed by angle
brackets. Each tag is repeated at the beginning and end of the data
element, with the ending tag marked by a slash character "/".
The following are examples of XML markup.
Each of these annotations reveals that XML lets us describe data. When
the data element "25 years" is flanked by an
<age_of_patient> tag, we can be sure that it's not referring to
an anniversary event or a mortgage.
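As a minimal sketch of what this means for software, the Python fragment below parses one annotated element and separates the metadata (the tag) from the data (the text). The tag name and value are the ones discussed above; the use of Python's standard xml.etree.ElementTree module is our assumption, not a tool named in this manuscript.

```python
import xml.etree.ElementTree as ET

# One annotated data element, using the tag and value from the text above.
fragment = "<age_of_patient>25 years</age_of_patient>"

element = ET.fromstring(fragment)

# The tag is the metadata (what the value describes); the text is the data.
metadata = element.tag
data = element.text

print(f"{metadata}: {data}")
```

Because the description travels with the value, a program reading the file does not need a separate data dictionary to know that "25 years" is a patient age.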
XML tags can appear in narrative text. For example:
of colon</pathologic_diagnosis>extending into the
The extra verbiage carried by XML annotations is unsightly, but browsers
can easily remove annotation tags from the text made visible on
computer screens. To everyone concerned, an XML document can display
just like any other text. In fact, I am composing this paper on
AbiWord, a free, open source XML editor.26 As I type, the
editor adds XML formatting tags to the document file, but all those
tags are invisible to me and you.
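The idea that tags can be hidden for display while remaining available for retrieval can be sketched in a few lines of Python. The sentence below is hypothetical: only the pathologic_diagnosis tag name comes from the example above, while the sentence wrapper element and the surrounding wording are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence; the <sentence> wrapper and most of the
# wording are invented. Only <pathologic_diagnosis> comes from the text.
annotated = (
    "<sentence>The biopsy shows "
    "<pathologic_diagnosis>adenocarcinoma of colon</pathologic_diagnosis>"
    " extending into the muscularis propria.</sentence>"
)

root = ET.fromstring(annotated)

# itertext() yields the text between tags in document order, so joining
# the pieces gives the narrative as a reader would see it on screen.
display_text = "".join(root.itertext())

# The annotated value is still retrievable by its tag name.
diagnosis = root.findtext("pathologic_diagnosis")

print(display_text)
print(diagnosis)
```

The same file therefore serves both as a readable report and as a record that can be queried by tag.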
XML documents fall into one of two different types: structural or
data-centric.
Perhaps the earliest published example of an XML specification for
surgical pathology is that of Berman and Moore in 2000.19 It
is an example of a structural XML document. Excerpts are shown, with
clipped text represented by "....":
JOHN Q PATHOLOGIST MD </pathologist>
VETERAN,JOHN Q. </patient_name>
THE SPECIMEN IS RECEIVED FRESH, LABELED WITH THE PATIENT'S NAME, AND
ADDITIONALLY LABELED "LARYNGECTOMY".
SPECIMEN CONSISTS OF A LARYNGECTOMY RESECTION, MEASURING 10.5 X 5.5 X
3.5 CM. THE LARYNX IS EDEMATOUS. THE LARYNX IS OPENED POSTERIORLY, TO
REVEAL AN IRREGULARITY OF APPARENT TUMOR, ON THE SURFACE OF THE LEFT
TRUE VOCAL CORD, MEASURING 3.0 X 1.5 CM. THE TUMOR DOES NOT APPEAR TO
INVOLVE THE SUBGLOTTIS, NOR THE ANTERIOR COMMISSURE. THE SUPERIOR,
INFERIOR, ANTERIOR, AND POSTERIOR MARGINS ARE GROSSLY UNINVOLVED BY
TUMOR REPRESENTATIVE SECTIONS OF TUMOR ARE SUBMITTED,
SQUAMOUS CELL CARCINOMA </disease_concept>
This XML-based pathology report is very easy to read. The XML tags
roughly correspond to the familiar sections of a pathology report.
This structural XML document can be compared with a data-centric XML
document. The following is an excerpt from a tissue microarray data
file that conforms to the recently proposed Tissue Microarray Data
Exchange Specification developed by the Association for Pathology
9, column 18|row 10, column 4</array_locations>
This fragment of XML is relatively easy to read, but it does not provide
an obvious structure for the data elements. It consists of data
flanked by XML tags, and little else. Data-centric XML files are
roughly equivalent to databases or spreadsheets. In fact, it is
exceedingly easy to port data-centric XML files to and from other
Why is XML so important if it serves only one of two simple purposes: 1)
markup that divides chunks of text, or 2) markup that flanks
individual data elements? The value of XML comes from a handful of
Six Special Properties of XML
XML is endowed with a set of six properties that permit XML files to be
self-descriptive (able to describe every aspect of their own content
and organization) and "aware" of their own data and the
data in the internet universe. These are:
Enforced and defined structure (XML rules and schema)
Formal metadata (through ISO11179 specification)
Namespaces (permits sharing of uniquely identifiable CDEs)
Linking of data via the internet (through Uniform Resource Identifiers)
Logic and meaning (the Semantic Web)
Self-awareness (embedded protocols and commands)
Enforced and defined structure (XML rules and schemas)
Two terms describe XML conformance: well-formedness and validity. A
file that contains XML markup is considered an XML file only if it is
well-formed. That is, it should begin with an XML header (recommended,
though XML 1.0 parsers accept files without one); it must
consist of text in a readable form (typically the simple letters and
punctuation found on a keyboard), and it must follow the general
rules for tagging data. The header can vary somewhat, but it
usually looks something like: <?xml version="1.0" ?>.
Tags must have a certain form (e.g. spaces are not permitted within
a tag name), and tags must be properly nested (i.e. no overlapping). For
example, <chapter><chapter_title>Pathologists love
XML</chapter_title></chapter> is nicely nested XML.
<chapter><chapter_title>Pathologists love
XML</chapter></chapter_title> is improperly nested.
Most current browsers parse through files that have a .XML suffix to
determine if they are well-formed. If they break any of the rules,
an error message is generated.
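The same check that a browser performs can be sketched with any XML parser. The fragment below is a minimal illustration using Python's standard library (our choice of tool, not one named in this manuscript); the helper name is_well_formed is hypothetical. It tests only well-formedness, not validity against a schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for overlapping tags, unclosed tags, and similar errors.
        return False


# Properly nested: the inner element closes before the outer one.
nested = "<chapter><chapter_title>Pathologists love XML</chapter_title></chapter>"

# Overlapping: the outer element closes while the inner one is still open.
overlapping = "<chapter><chapter_title>Pathologists love XML</chapter></chapter_title>"

print(is_well_formed(nested))
print(is_well_formed(overlapping))
```

Note that this parser accepts the first fragment even though it carries no XML header, which is typical of XML 1.0 parsers.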
A well-formed XML file must be structured according to either a DTD
(Document Type Definition) or to a schema before it can be considered
a valid XML document. DTDs and schemas are blocks of descriptors
that specify the structure and content of an XML file. Nothing in
the world of XML has engendered as much confusion and acrimony as the
issue of how best to specify the descriptor block. Suffice it to say
that a variety of schema languages have appeared.
Regarding schemas, the only point worth noting is that when an XML file is
described by a schema and parsed by a validating browser (or by a
validating XML parser), you can be certain that the data contained in
the file conforms to a specified structure and content. Files using
the same schema will have the same data organization, and this
greatly facilitates data integration between files. A valid XML file
may contain the schema within the file (always near the start of the
document). Or, a valid XML file may have a linking tag that contains
the unique identification/location of an external schema file.
Formal metadata (through the ISO 11179 specification)
The concept of formalized metadata is quite simple. Unfortunately,
formalized metadata is seldom implemented by XML designers.
Consider the seemingly obvious metadata tag, <date>. Pathologists may
think of this tag as a calendar day. Farmers may think of this tag
as a fruit. Co-eds may think of this as something that is often
Consider the following: <date>09/05/15</date>
An American may think this represents September 5, 1915. An Englishman
may think this is May 15, 1909. Others may interpret this date as
September 5, 2015 or May 15, 2009.
In fact, the International Standards Organization thought it so
important to have standards for the representation of time and dates
that they created ISO 8601 for four metadata tags: date, time,
dateTime and timePeriod.29 Incidentally, the standard
representation for calendar date is YYYY-MM-DD.
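The ambiguity is easy to reproduce. The sketch below, written in Python as an illustration of our own, reads the same string under three plausible field orders and prints each result in the YYYY-MM-DD form. The labels in the readings dictionary are ours, and the century is itself a guess: Python's %y directive maps two-digit years 00 through 68 to 2000 through 2068.

```python
from datetime import datetime

raw = "09/05/15"

# Three plausible field orders for the same string (labels are ours).
readings = {
    "month/day/year": "%m/%d/%y",
    "day/month/year": "%d/%m/%y",
    "year/month/day": "%y/%m/%d",
}

# Parse the string under each assumption and render it as YYYY-MM-DD.
# Note that %y assumes the 2000s for two-digit years 00-68.
iso_dates = {
    label: datetime.strptime(raw, fmt).date().isoformat()
    for label, fmt in readings.items()
}

for label, iso in iso_dates.items():
    print(f"{label}: {iso}")
```

Each reading yields a different calendar day, whereas a value stored as YYYY-MM-DD with a four-digit year admits only one.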
The International Standards Organization has created a standard way of
defining metadata tags (also known as Common Data Elements or CDEs).
This standard, ISO 11179, specifies that metadata should have a
qualified name or identifier, an authority that registers the name, a
versioning history (allowing for modifications), a language of
origin, a statement relating to usage, a data typing statement, and a
definition that is unambiguous.16 XML files should always
include a pointer to the internet location for the metadata
definition file. The creators of the TMA data Exchange Specification
have created a file that lists each of the 80 XML tags used in the
specification, along with the ISO 11179 descriptors for each tag.27
The metadata definition file for the TMA Data Exchange Specification
currently resides at: [http://188.8.131.52/jjb/tma_cde.htm]. An
example is shown for the metadata tag <slide_level>:
Authority: Association for Pathology Informatics
This is an integer corresponding to the level of the block for this
slide, e.g. 25. Comment: Some people may include the level number in
the slide_identifier element (as a suffix). If this is the case,
they should either redundantly include the suffix in this element or
describe how the level may be extracted from the slide_identifier in
Without clear definitions for XML metadata tags, the meaning of XML data is
Namespaces (permits sharing of uniquely identifiable metadata tags)
A word may mean different things to different people, and that's why we
carefully define metadata (using the ISO 11179 specification).
Defining the metadata elements in an XML file ensures that anyone can
understand the use of a tag within the file. A problem arises when
two different XML files use the same tag name to mean
different things. For instance, the farmer's XML file defines the
<date> tag as a fruit, while the astronomer's XML file defines
the <date> tag as something else entirely. XML deals with this
problem by creating protected namespaces for data elements.30
Each data element can be prefixed by a specific namespace (a defined
metadata collection). Consider the header section for the TMA Data
Exchange specification. Within the root element three different
namespaces are announced using the xmlns (XML namespace) attribute.
The tags used in the file are taken from three different external
sources! These sources are the previously described listing of
metadata tags provided by the tissue microarray data exchange
metadata tags provided by the Cooperative Prostate Cancer Tissue
and the metadata tags provided by the Dublin Core, an association of
library scientists who have devised a set of standard header elements
for XML files (xmlns:dc="http://dublincore.org"). The
prefix designated for each source (e.g., "dc" or "cpctr")
is conveyed to the metadata tags used in the XML file.
For instance: <dc:creator>CPCTR</dc:creator>. This indicates
that the "creator" tag is derived from the "dc"
metadata source. Or: <cpctr:slide_stain>FISH</cpctr:slide_stain>.
This indicates that the "slide_stain" is derived from the
"cpctr" metadata source.
In fact, a single XML file can use the "date" metadata to mean
the calendar date or the fruit date, so long as each use of a
metadata element is prefixed with the correct namespace.
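A short Python sketch shows how a parser keeps same-named tags apart once they are prefixed. The dc namespace address is the one given in this manuscript; the record root element, the cpctr and farm namespace addresses (under example.org), and the two date values are hypothetical stand-ins chosen for illustration.

```python
import xml.etree.ElementTree as ET

# The dc URI is taken from the text; the cpctr and farm URIs, the <record>
# root element, and the date values are hypothetical.
document = """\
<record xmlns:dc="http://dublincore.org"
        xmlns:cpctr="http://example.org/cpctr"
        xmlns:farm="http://example.org/farm">
  <dc:creator>CPCTR</dc:creator>
  <cpctr:slide_stain>FISH</cpctr:slide_stain>
  <dc:date>2004-01-15</dc:date>
  <farm:date>Medjool</farm:date>
</record>
"""

root = ET.fromstring(document)

# The parser expands every prefix to its full namespace: {uri}localname.
tags = [child.tag for child in root]

# Look up each "date" through its own namespace.
ns = {"dc": "http://dublincore.org", "farm": "http://example.org/farm"}
calendar_date = root.findtext("dc:date", namespaces=ns)
fruit_date = root.findtext("farm:date", namespaces=ns)

print(tags)
print(calendar_date, fruit_date)
```

The two date elements share a local name but expand to different qualified names, so software never confuses the calendar day with the fruit.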
This simple annotation is a powerful informatics tool. It allows XML
creators to choose metadata from a variety of different sources,
ensuring that every data element is associated with unambiguous
Linking related data via the internet
Data obtained from tissue microarray studies provide experimental data
related to many biopsy cores. The tissue specimen sampled in the
tissue microarray core may have been provided by a tissue bank, and
the tissue bank may have created an XML file that contains data
related to all the tissue specimens in its collection. Snippets from
two files may look something like this:
Tissue banker's XML file:
The same data element, representing the tissue banker's identifier, is
present in both files. This means that it is possible for a software
agent encountering the tissue bank entry in the tissue microarray
file to reach through the internet, locating additional information
related to the same tissue bank sample in the tissue banker's file.
Tissue banks sometimes update their datasets with treatment or
outcome data related to patients with banked tissue samples. The
software agent that interrogates the tissue bank XML database may
retrieve information that enhances the value of the information
contained in the tissue microarray file.
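The lookup that such a software agent performs can be sketched as a join on the shared identifier. In the Python fragment below, every tag name, identifier, and value is hypothetical, and the two files are held as in-memory strings rather than fetched over the internet; it is an illustration of the principle under those assumptions, not the method used by any particular tissue bank.

```python
import xml.etree.ElementTree as ET

# All tag names, identifiers, and values below are hypothetical.
tma_file = """\
<tma>
  <core>
    <tissue_bank_id>TB-0042</tissue_bank_id>
    <array_locations>row 9, column 18|row 10, column 4</array_locations>
  </core>
</tma>
"""

bank_file = """\
<tissue_bank>
  <specimen>
    <tissue_bank_id>TB-0041</tissue_bank_id>
    <outcome>no follow-up recorded</outcome>
  </specimen>
  <specimen>
    <tissue_bank_id>TB-0042</tissue_bank_id>
    <outcome>disease-free at 5 years</outcome>
  </specimen>
</tissue_bank>
"""

# Index the tissue bank's specimens by the shared identifier.
bank_index = {
    specimen.findtext("tissue_bank_id"): specimen
    for specimen in ET.fromstring(bank_file).iter("specimen")
}

# For each microarray core, find the matching specimen and merge the records.
merged = []
for core in ET.fromstring(tma_file).iter("core"):
    bank_id = core.findtext("tissue_bank_id")
    specimen = bank_index.get(bank_id)
    merged.append({
        "tissue_bank_id": bank_id,
        "array_locations": core.findtext("array_locations"),
        "outcome": specimen.findtext("outcome") if specimen is not None else None,
    })

print(merged)
```

If the tissue bank later updates the outcome for a specimen, rerunning the same lookup returns the new value without any change to the microarray file.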
How is this actually accomplished? How does the software agent know
where to go? The topic of internet software agents and the
methodology for connecting XML files is complex. Suffice it to say
that methods for locating and retrieving data from external XML files
are widely used.31 All such methods depend on the notion
of a Uniform Resource Identifier (URI) for every information
document. The type of URI most familiar to web surfers is the web
address, also known as the Uniform Resource Locator (URL).
In the Cooperative Prostate Cancer Tissue Resource's implementation of
the Tissue Microarray Data Exchange Standard, the protocol used to
create the tissue microarray block and the protocol for producing
slide sections are both referenced using a URL.
The URL naming system employed by the World Wide Web is the underlying
standard that permits software to reach data anywhere on the
Logic and meaning (the Semantic Web)
Although the technical methodologies associated with XML can be daunting, the
most difficult issues always relate to the meaning of things. A
variety of formal approaches have been proposed to reach the level of
meaning within the context of XML. By far, the simplest of these is
the Resource Description Framework (RDF).23 This model
proposes that all pairs of data and associated metadata are about
something. If you simply specify a relationship between data,
metadata and the subject of the data, you take a giant step toward
providing meaning to your XML records.
A trivial example demonstrates the basic RDF model. The first line
indicates that the RDF description concerns a unique object specified
by a filename, report1.htm, located at a specified web address. This
is followed by a data/metadata pair indicating that the pathologist
associated with the report is Dr. Tumori. In an actual
implementation, the pathology report may be associated with many
different data/metadata pairs.
The importance of the RDF model is that it binds data and metadata to a
unique object with a web location. Consistent use of the RDF model
assures that data anywhere on the web can always be connected
through unique objects with RDF descriptions. The association of
described data with a unique object confers meaning and greatly
advances our ability to integrate data over the internet.
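The subject, metadata, and data of an RDF description can be pulled apart with an ordinary XML parser. The Python sketch below rebuilds a description like the one discussed above under stated assumptions: the RDF namespace address is the standard W3C one, while the web address of report1.htm and the path namespace for the pathologist tag are hypothetical (both placed under example.org).

```python
import xml.etree.ElementTree as ET

# Standard W3C namespace for RDF syntax.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The report address and the "path" namespace are hypothetical.
rdf_xml = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:path="http://example.org/pathology#">
  <rdf:Description rdf:about="http://example.org/report1.htm">
    <path:pathologist>Dr. Tumori</path:pathologist>
  </rdf:Description>
</rdf:RDF>
"""

root = ET.fromstring(rdf_xml)

# Collect (subject, metadata, data) triples from every description.
triples = []
for description in root.iter(f"{{{RDF_NS}}}Description"):
    subject = description.get(f"{{{RDF_NS}}}about")
    for prop in description:
        triples.append((subject, prop.tag, prop.text))

for triple in triples:
    print(triple)
```

Each triple names the unique object first, so statements about the same report gathered from different files can be matched on that address.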
Self-awareness (embedded protocols and commands)
Unlike databases, which are nothing more than structured collections of
data, XML documents are not limited to descriptions of data elements.
An XML file may contain simple data or it may contain logical
assertions, or queries, or program commands in a designated
programming language. In fact, anything can be expressed in XML. A
variety of methods permit cross-internet communication between XML
files. These technologies make it possible for XML files to attain a
type of intelligence. When an XML file is capable of displaying
autonomous behavior, posing questions to external files, generating
replies to received questions and modifying its own content, it is
usually referred to as a software agent. Needless to say, this is an
exciting area for future work.
It would be misleading to suggest that XML is a panacea. The design of
XML files requires great wisdom. Producing a complex data structure
of well-described data does not always yield benefit.32
Personally, I tend to abandon XML schemas that I cannot understand.
Although software tools can resolve any valid schema, rendering the
contents of an XML file as a computer-ready data structure, I tend
not to trust abstractions that I cannot grasp.
Most pathologists are unaware that two standards now exist for pathology.
DICOM (Digital Imaging and Communications in Medicine) is an ISO standard for
medical images. DICOM was created as a machine standard for the
transfer of electronic radiology images. A visible light version of
DICOM was created so that microscopists and endoscopists can exchange
images according to a standard.33 To my knowledge (based
on considerable investigation) no pathology department in the world
is currently using the DICOM visible light standard for their
histologic images. The standard is written using a technical
specification that is too difficult to implement.
The ASTM (American Society for Testing and Materials)
has created a standard schema for pathology reports.34 I
have never encountered a pathologist who is even aware of this XML
standard. It seems that new standards are created faster than they
can be embraced.
Much of the value of XML derives from the common usage of standard XML
schemas and standard metadata. Successful data integration requires
that colleagues from many different biomedical disciplines adhere to
the rules of XML and apply those rules in a manner that makes sense
to their interdisciplinary colleagues. This is a difficult task.
Most pathologists are uncomfortable with informatics issues. They like to
think of themselves as "docs" and take great pride in their
special diagnostic skills. The world stands in awe when a
pathologist determines a patient's fate by viewing a few cells under
a microscope. One of the unintentional by-products of service
pathology is lots and lots of data. Pathology reports captured by
modern laboratory information systems are richly annotated with such
data elements as transaction time-stamps, bar-code values, encryption
hashes, patient unique identifiers, message headers, and data element
tags. It is tempting to allow others more skilled in these matters to
unburden us from data-intensive tasks. The problem is that
pathologists are the only people who understand pathology data, and
pathologists are the only people who can sensibly integrate pathology
data with related clinical and biological data.
Even the most computer-phobic pathologists can play a vital role if
they understand the important role of pathology data in the grand
scheme of things. Despite the complexities of biomedical data, XML is
a simple artifact whose basic purpose (making data understandable) is
This manuscript represents the opinion of the author and does not
represent official policy of the NIH or of any other federal agency.
References
1. Department of Health and Human Services. 45 CFR (Code of Federal
Regulations), Parts 160 through 164. Standards for Privacy of
Individually Identifiable Health Information (Final Rule). Federal
Register: December 28, 2000 (Volume 65, Number 250), Pages
2. Department of Health and Human Services. 45 CFR (Code of Federal
Regulations), 46. Protection of Human Subjects (Common Rule). 56
Federal Register, June 18, 1991, volume 56, p. 28003
3. Final NIH statement on sharing research data.
4. Berman JJ. Racing to share pathology data. Am J Clin Pathol
5. Galperin MY. The Molecular Biology Database Collection: 2004
update. Nucleic Acids Res 32:D3-D22, 2004
6. Connolly JL, Fletcher CD. What is needed to satisfy the American
College of Surgeons Commission on Cancer (COC) requirements for the
pathologic reporting of cancer specimens? Hum Pathol 34:111, 2003
7. Berman JJ. Concept-Match Medical Data Scrubbing: How pathology
datasets can be used in research. Arch Pathol Lab Med 127:680-686,
8. Berman JJ. Threshold protocol for the exchange of confidential
medical data. BMC Medical Research Methodology 2:12, 2002
9. Berman JJ. Confidentiality for Medical Data Miners. Artif Intell Med
10. Berman JJ. Zero-Check: A Zero-Knowledge Protocol for Reconciling
Patient Identities Across Institutions. Arch Pathol Lab Med
128:344-346, 2004
11. Malin B, Sweeney L. How (not) to protect genomic data privacy in a
distributed network: using trail re-identification to evaluate and
design anonymity protection systems. J Biomed Inform 37:179-92, 2004
12. Sweeney L. Guaranteeing anonymity when sharing medical data, the
Datafly system. Proc AMIA Ann Fall Symp 51-55, 1997
13. Spellman PT, Miller M, Stewart J. Design and implementation of
microarray gene expression markup language (MAGE-ML). Genome Biol
14. Harris MA, Clark J, Ireland A, et al. The Gene Ontology (GO)
database and informatics resource. Nucleic Acids Res. 2004 Jan 1;32
15. Covitz PA, Hartel F, Schaefer C, De Coronado S, Fragoso G, Sahni H,
Gustafson S, Buetow KH. caCORE: a common infrastructure for cancer
16. Solbrig HR. Metadata and the reintegration of clinical information:
ISO 11179. MD Comput 3:25-28, 2000
17. Karasavvas KA, Baldock R, Burger A. Bioinformatics integrations and
agent technology. J Biomed Inform 37:205-219, 2004
18. Cantor MN, Lussier YA. Putting data integration into practice: using
biomedical terminologies to add structure to existing data sources.
Proc AMIA Symp 125-129, 2003
19. Moore GW, Berman JJ. Anatomic Pathology Data Mining. In: Cios KJ,
ed, Medical Data Mining and Knowledge Discovery. Springer-Verlag,
Berlin/Heidelberg, 2000, pp 72-117
20. Paterson GI, Shepherd M, Wang X, Watters C. Using the XML-based
Clinical Document Architecture for exchange of Structured Discharge
Summaries. Proceedings of the 35th Annual Hawaii International
Conference on System Sciences. 2002
21. Baorto DM, Cimino JJ, Parvin CA, Kahn MG. Combining laboratory data
sets from multiple institutions using the logical observation
identifier names and codes (LOINC). Int J Med Inf 51:29-37, 1998
22. Martín-Sanchez F, Maojo V, López-Campos G. Integrating genomics
into health information systems. Methods Inf Med 41:25-30, 2002
23. Ahmed K, Ayers D, Birbeck M, Cousins J, Dodds D, Lubell J, Nic M,
Rivers-Moore D, Watt A, Worden R, Wrightson A. Professional XML Meta
Data. Wrox Press Ltd. Birmingham 2001
24. W3C Architecture Domain. Extensible Markup Language (XML)
25. White C, Quin L, Burman L. Mastering XML: Premium Edition.
San Francisco 2001
26. AbiWord: Word Processing for Everyone. [http://www.abisource.com/]
27. Berman JJ, Edgerton ME, Friedman B. The Tissue Microarray Data
Exchange Specification: A Community-based, Open Source Tool for
Sharing Tissue Microarray Data. BMC Medical Informatics and Decision
Making 3:5, 2003
28. Berman JJ, Datta MW, Kajdacsy-Balla A, Melamed J, Orenstein J, Dobbin
K, Patel A, Dhir R, Becich MJ. Tissue microarray data exchange
specification: implementation by the Cooperative Prostate Cancer
Tissue Resource. BMC Bioinformatics 5:19, 2004
29. Numeric representation of Dates and Time: The ISO solution to a
long-standing source of confusion
30. XML Namespaces. [http://www.w3schools.com/xml/xml_namespaces.asp]
31. W3C XML Pointer, XML Base and XML Linking.
32. Smith CA. Effect of XML Markup on Retrieval of Clinical Documents.
Proc AMIA Symp 614-618, 2003
33. Korman LY, Delvaux M, Bidgood D. Structured reporting in
gastrointestinal endoscopy: integration with DICOM and minimal
J Med Inform 48:201-206, 1998
34. American Society of Testing and Material Data Type Definition (DTD)
Pathology Report version 1.0.