Biomedical Data Integration: Using XML to Link Clinical and Research Datasets
Pre-publication draft for: Berman JJ and Bhatia K. Biomedical Data Integration: Using XML to Link Clinical and Research Datasets. Expert Reviews in Molecular Diagnostics. Expert Rev Mol Diagn. 2005 May;5(3):329-36.
Authors' names and affiliations:
Jules J. Berman1 and Kishor Bhatia2
1 Jules J. Berman, Ph.D., M.D., Program Director, Pathology Informatics, Cancer Diagnosis Program, National Cancer Institute, Bethesda, MD
2 Kishor Bhatia, Ph.D., MRCPath, Program Director, Resources Development Branch, Cancer Diagnosis Program, National Cancer Institute, Bethesda, MD.
Abstract/Summary
Data integration occurs when a query proceeds through multiple datasets, relating diverse data extracted from different data sources. Data integration is particularly important to biomedical researchers because data obtained from experiments on human tissue specimens has little applied value unless it can be combined with medical data (i.e., pathological and clinical information). In the past, research data was correlated with medical data by manually retrieving, reading, assembling, and abstracting patient charts, pathology reports, radiology reports and the results of special tests and procedures. Manual annotation of research data is not feasible when experiments involve hundreds or thousands of tissue specimens. The purpose of this paper is to review how XML (eXtensible Markup Language) provides the fundamental tools that support biomedical data integration. The paper also discusses some of the most important societal challenges that block the widespread availability of annotated biomedical datasets.
Short running title:
Biomedical data integration with XML
Keywords:
XML, data integration, translational research, interoperability, common data elements
Expert Opinion
Overview
Modern biomedical science is data-intensive. Gene, protein and tissue microarrays allow us to look at changes in thousands of variables, all at once. Vast amounts of data are generated from single experiments. Likewise, large hospital information systems routinely collect and store terabytes of patient-related data. A human biopsy sample used in a high throughput array experiment is likely to have a wealth of clinical information stored in one or more hospital databases. Each biopsy has a surgical pathology report describing the specimen and listing pathologic findings. Pathologic findings are often supplemented with clinically important annotations (e.g., grade or stage of the lesion). The surgical pathology report may contain an archived image of the lesion and a variety of special ancillary tests, including immunohistochemistry and cytogenetics findings. The biopsy used in the array experiment may be one of many different biopsies excised from the same patient, and these non-sampled biopsies may contain information pertinent to the experimental study. The patient's entire medical record, with demographic information (age, ethnicity, gender), history, physical examination, treatment, and outcome may reside in one or more hospital databases.
Research data on tissue samples leads to clinical advances only after it is integrated with pathological and clinical data. Traditionally, data integration has involved manually retrieving, assembling, reading and abstracting patient charts, pathology reports, radiology reports and clinical follow-up notes. As research and clinical datasets have grown in size and complexity, manual data integration efforts have become more expensive, more time-consuming, and much less practical.
A recently issued white paper from the Food and Drug Administration (FDA) suggests that the pace of medical progress is slowing [1]. The FDA report notes that despite many achievements in genomics, proteomics and nanotechnology, there has been a downturn in medical product applications at the FDA. Furthermore, the number of new, approved medical diagnostics in the U.S. has declined to just one per year on average. The report also asserts that products recently approved by the FDA have not yet had much impact on patient care. Likewise, a review of serum-based diagnostics by Anderson et al. reported that despite rapid advances in the field of proteomics, few new serum proteins are currently used in routine clinical diagnosis. Their study indicates the rate of introduction of new protein tests has declined over the last decade [2]. In the cancer field, according to Benowitz, only one marker has appeared in the last decade that has modified patient treatment (i.e., HER2 for some cases of breast cancer). Just a few years ago, the expectation of therapeutic options based on genotyping individual patients promised a new era of personalized medicine [3]. The new era of pharmacogenomics has been slow in coming [4]. Researchers are learning that observations need to be validated, and validation often requires the integration of research observations with clinical data on large numbers of patients [5,6]. In many instances, basic researchers simply do not have access to clinical data.
Many research policy analysts have asserted that the integration of biomedical data with clinical data is essential for medical progress [7-19]. It is the opinion of the authors that the scientific community has failed to use readily available informatics technology that can hasten translational medical research by connecting research data with clinical data. The purpose of this manuscript is to describe the fundamental properties of XML (eXtensible Markup Language) that support the integration of all kinds of data, wherever that data may be located, and without creating centralized databases. The manuscript also discusses the social and legal impediments that hamper data integration.
The Technical Solution - XML
Biomedical informatics uses the data produced by research laboratories (sometimes called discovery data) and the data obtained from clinical repositories to obtain clinically useful results (e.g. new discoveries, tests, therapies, services or procedures). Because biomedical informatics translates basic science into clinical reality, it is typically regarded as a translational or applied science.
XML is an informatics technology that allows any data element (e.g., a gene sequence, the weight of a patient, a biopsy diagnosis) to be bound to other data that describes the data element (metadata) [20,21,100]. Surprisingly, this simple relationship between data and the data that describes data is the most powerful innovation in data organization since the invention of the book. Seldom does a technology emerge with all the techniques required for its success, but this seems to be the case for XML.
In XML, data descriptors (known as XML tags) are written inside angle brackets and enclose the data they describe.
<birthdate>September 28, 1950</birthdate>
<birthdate> is the XML tag. The tag and its end-tag enclose a data element, which in this case is the unabbreviated month, beginning with an uppercase letter and followed by lowercase letters, followed by a space, followed by a two-digit numeric for the date of the month, followed by a comma and space, followed by the 4-digit year. The XML tag could have been defined in a separate document detailing the data format of the data element described by the XML tag. ISO-11179 is a standard that tells people how they should specify the properties of Common Data Elements [7]. In this case, the Common Data Element is the XML tag, <birthdate>. If we had chosen, we could have broken the <birthdate> tag into its constituent parts.
<birthdate>
<month_of_birth>September</month_of_birth>
<month_day_of_birth>28</month_day_of_birth>
<year_of_birth>1950</year_of_birth>
</birthdate>
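As an illustration of why the decomposed form is convenient for software, the following minimal Python sketch reads a <birthdate> element built from the three child tags (each with a properly closed end-tag) and converts it to a calendar date. The function name parse_birthdate and the constant names are assumptions made for this example; they are not part of any specification discussed here.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Unabbreviated month names, as used in the <month_of_birth> element.
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

BIRTHDATE_XML = """
<birthdate>
  <month_of_birth>September</month_of_birth>
  <month_day_of_birth>28</month_day_of_birth>
  <year_of_birth>1950</year_of_birth>
</birthdate>
"""

def parse_birthdate(xml_text: str) -> date:
    """Convert a decomposed <birthdate> element into a date object."""
    root = ET.fromstring(xml_text.strip())
    month = MONTHS.index(root.findtext("month_of_birth")) + 1
    day = int(root.findtext("month_day_of_birth"))
    year = int(root.findtext("year_of_birth"))
    return date(year, month, day)

print(parse_birthdate(BIRTHDATE_XML).isoformat())  # 1950-09-28
```

Because each component carries its own tag, the program never has to guess where the month ends and the day begins.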
As described in detail in our prior publication, six properties of XML explain its extraordinary utility [16]. These are:
1. Enforced and defined structure (XML rules and schema)
2. Formal metadata (through ISO11179 specification)
3. Namespaces (permits sharing of uniquely identifiable CDEs)
4. Linking data via the internet (through Uniform Resource Identifiers)
5. Logic and meaning (the Semantic Web)
6. Self-awareness (embedded protocols and commands)
These properties of XML are powerful because they permit us to understand the meaning of data and because they permit us to reach data anywhere on the internet.
Enforced and defined structure (XML rules and schemas)
An XML file is well-formed if it conforms to the basic rules for XML file construction recommended by the W3C (World Wide Web Consortium). This means that it must be a plain-text file, with a header indicating that it is an XML file, and must enclose data elements with metadata tags that declare the start and end of the data element. The tags must conform to certain rules (e.g., alphanumeric strings without intervening spaces) and must also obey the rules for nesting data elements [21,100].
An example of a well-nested set of metadata tags follows:
<name>
<first_name>Jules</first_name>
<last_name>Berman</last_name>
</name>
An example of poorly-formed tagging is shown:
<name>
<first_name>Jules
<last_name>Berman
</first_name>
</last_name>
</name>
A metadata/data pair may be contained within another metadata/data pair (so-called nesting), but a metadata/data pair cannot straddle another metadata/data pair. Most browsers will parse XML files, rejecting files that are not well-formed. The ability to ensure that every XML file conforms to basic rules of metadata tagging and nesting makes it possible to extract XML files as sensible data structures.
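The well-formedness check can be tried with any XML parser. The sketch below uses the parser in Python's standard library (xml.etree.ElementTree) on the two <name> examples; the helper name is_well_formed is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# The properly nested example.
WELL_NESTED = ("<name><first_name>Jules</first_name>"
               "<last_name>Berman</last_name></name>")

# The badly nested example: <first_name> is closed before <last_name>.
BADLY_NESTED = ("<name><first_name>Jules<last_name>Berman"
                "</first_name></last_name></name>")

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(WELL_NESTED))   # True
print(is_well_formed(BADLY_NESTED))  # False
```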
At this level, the ability to invent new tags and the strict adherence to rules of nesting tags are the only properties that distinguish XML from its better-known cousin, HTML (HyperText Markup Language).
A well-formed XML file must be structured according to either a DTD (Document Type Definition) or to a schema before it can be considered a valid XML document. DTDs and schemas are blocks of descriptors that specify the structure and content of an XML file. A variety of schema languages are available [20]. Regarding schemas, when an XML file is described by a schema and parsed by a validating browser (or by a so-called XML parser), you can be certain that the data contained in the file conforms to a specified structure and content. Files using the same schema will have the same data organization, and this greatly facilitates data integration between files.
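The idea of validity can be illustrated with a deliberately simplified stand-in for a schema. Python's standard library parser checks well-formedness but does not validate against DTDs or schemas, so the sketch below uses a hand-written dictionary (TOY_SCHEMA, a hypothetical construct invented for this example) that lists the child elements each element must contain. Real schema languages are far more expressive and are enforced by a validating parser; this shows only the principle.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a schema: each element name maps to the exact
# sequence of child elements it must contain.
TOY_SCHEMA = {
    "name": ["first_name", "last_name"],
    "first_name": [],
    "last_name": [],
}

def conforms(element: ET.Element, schema: dict) -> bool:
    """Return True if the element and all its descendants match the toy schema."""
    expected = schema.get(element.tag)
    if expected is None:
        return False  # element is not declared in the schema
    if [child.tag for child in element] != expected:
        return False  # missing, extra, or misordered children
    return all(conforms(child, schema) for child in element)

valid_doc = ET.fromstring(
    "<name><first_name>Jules</first_name><last_name>Berman</last_name></name>")
# Well-formed, but the required <first_name> element is missing.
invalid_doc = ET.fromstring("<name><last_name>Berman</last_name></name>")

print(conforms(valid_doc, TOY_SCHEMA))    # True
print(conforms(invalid_doc, TOY_SCHEMA))  # False
```

The second document is well-formed yet not valid, which is the distinction drawn in the paragraph above.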
Formal metadata (through the ISO11179 specification)
The concept of formalized metadata is quite simple. Consider the seemingly obvious metadata tag, <date>. Does this designate a day in a calendar, or does it represent a type of fruit, or does it refer to a social event?
The International Organization for Standardization (ISO) has created a standard way of defining metadata tags (also known as Common Data Elements or CDEs). This standard, ISO 11179, specifies that metadata should have a qualified name or identifier, an authority that registers the name, a versioning history (allowing for modifications), a language of origin, a statement relating to usage, a data typing statement, and a definition that is unambiguous [7]. XML files should always include pointers (i.e., links) to the web addresses of the files that contain the definitions of the metadata tags appearing in the XML file.
The creators of the TMA Data Exchange Specification have prepared a file that lists each of the 80 XML tags used in the specification, along with the ISO 11179 descriptors for each tag [22]. This metadata definition file for the TMA data exchange specification resides at: [http://www.pathology.pitt.edu/pdf/cpctr/tma_cde.htm]. An example is shown for the metadata tag <core_organism>:
core_organism
Identifier: core_organism
Version: 1.0
Registration Authority: Association for Pathology Informatics
Language: English (en)
Obligation: Optional
Datatype: Character String representing taxonomy.dat identifier number
followed by an allowable taxonomy.dat name for the identifier number
Maximum Occurrence: Unlimited
Definition: Organism name at species level for organism whose tissue
is represented in the donor block.
Comment: URI for taxonomy.dat is ftp://ftp.ebi.ac.uk/pub/databases/taxonomy/taxonomy.dat
The correct entry for human tissue is "9606 human"
Without fully described and defined XML metadata tags, as prescribed by ISO-11179, XML data has no meaning or value.
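A program can carry the same descriptors alongside the data. The following Python sketch stores the <core_organism> descriptors shown above in a small record and applies a surface-level check to candidate values. The class name CommonDataElement, the function looks_like_core_organism, and the pattern it uses are assumptions made for this illustration; the check only tests the shape "identifier number, space, name" and does not consult taxonomy.dat.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class CommonDataElement:
    """Descriptors for one metadata tag, following the listing above."""
    identifier: str
    version: str
    registration_authority: str
    language: str
    obligation: str
    datatype: str
    maximum_occurrence: str
    definition: str

CORE_ORGANISM = CommonDataElement(
    identifier="core_organism",
    version="1.0",
    registration_authority="Association for Pathology Informatics",
    language="English (en)",
    obligation="Optional",
    datatype=("Character String representing taxonomy.dat identifier number "
              "followed by an allowable taxonomy.dat name for the identifier number"),
    maximum_occurrence="Unlimited",
    definition=("Organism name at species level for organism whose tissue "
                "is represented in the donor block"),
)

# Assumed surface pattern: an integer identifier, one space, then a name.
_ORGANISM_PATTERN = re.compile(r"\d+ \S.*")

def looks_like_core_organism(value: str) -> bool:
    """Shape check only; it does not verify the entry against taxonomy.dat."""
    return _ORGANISM_PATTERN.fullmatch(value) is not None

print(looks_like_core_organism("9606 human"))  # True
print(looks_like_core_organism("human"))       # False
```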
Namespaces (permits sharing of uniquely identifiable metadata tags)
Problems arise when data from different documents are merged. How can we be sure that XML tags in one document will always mean the same thing when found in another document? This is achieved with namespaces. Whenever you use an element taken from another XML document, you should declare the namespace origin of the element [102].
For example,
<table xmlns="http://www.w3.org/TR/html4/">
This xmlns attribute indicates that the <table> element is the same element that is described in the World Wide Web Consortium's HTML recommendation, and provides the URL (Uniform Resource Locator) that defines the <table> element. Thus <table>, as it is used in the data element, has a meaning that cannot be confused with "kitchen table" or "periodic table".
If we had multiple elements from the HTML specification, we might have chosen to list the namespace near the top of the XML document and assign a prefix specific to elements belonging to the HTML namespace. Consider this snippet of XML from the Gene Ontology specification [23]. Two different namespaces are declared using the xmlns (XML namespace) attribute.
<go:go xmlns:go="http://www.geneontology.org/dtds/go.dtd#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<go:version timestamp="Wed Jun 19 12:34:48 1974" />
<rdf:RDF>
<go:term rdf:about="http://www.geneontology.org/go#GO:0003673" n_associations="0">
<go:accession>GO:0003673</go:accession>
<go:name>Gene_Ontology</go:name>
</go:term>
In like manner, a single XML file can use a single metadata tag to mean many different things, so long as each use of a metadata element is prefixed with the correct namespace.
The namespace prefix annotation is a powerful informatics tool, permitting XML creators to recycle metadata from different namespaces. When the same metadata is used by many different researchers, the contained data can often be shared between their databases.
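To show how a parser treats namespace prefixes, the sketch below parses a shortened version of the Gene Ontology snippet with Python's standard library. Closing tags have been added, and the <go:version> element omitted, so that the fragment parses on its own; the variable names are assumptions for this example. Note that the parser replaces each prefix with the full namespace address, which is what makes a tag unambiguous.

```python
import xml.etree.ElementTree as ET

GO_SNIPPET = """
<go:go xmlns:go="http://www.geneontology.org/dtds/go.dtd#"
       xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:RDF>
    <go:term rdf:about="http://www.geneontology.org/go#GO:0003673" n_associations="0">
      <go:accession>GO:0003673</go:accession>
      <go:name>Gene_Ontology</go:name>
    </go:term>
  </rdf:RDF>
</go:go>
"""

# Prefix-to-namespace map used when searching; the prefixes are local
# shorthand, and only the namespace addresses carry identity.
NS = {
    "go": "http://www.geneontology.org/dtds/go.dtd#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

root = ET.fromstring(GO_SNIPPET.strip())
print(root.tag)  # {http://www.geneontology.org/dtds/go.dtd#}go

term = root.find("rdf:RDF/go:term", NS)
accession = term.findtext("go:accession", namespaces=NS)
about = term.get("{%s}about" % NS["rdf"])
print(accession)  # GO:0003673
print(about)      # http://www.geneontology.org/go#GO:0003673
```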
Linking data via the internet
XML comes with specifications for linking XML documents with other XML documents, or with any external file that has a specific identifier or web location (Uniform Resource Identifier, URI, or Uniform Resource Locator, URL). This means that there is a logical and standard method for linking any XML document or any part of any XML document (including individual data elements) to any other uniquely identified resource (web file) [103,104].
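One such linking mechanism is the W3C XLink recommendation, in which an element carries an href attribute from the XLink namespace. The sketch below is an assumption-laden illustration: the <specimen> and <pathology_report> tags and the example.org address are hypothetical, and the function linked_resources is invented for this example. It simply collects the addresses that a document points to.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical document: a specimen record that points to a pathology
# report stored elsewhere instead of copying the report into the file.
SPECIMEN_XML = """
<specimen xmlns:xlink="http://www.w3.org/1999/xlink">
  <pathology_report xlink:type="simple"
                    xlink:href="http://example.org/reports/report_001.xml#diagnosis"/>
</specimen>
"""

def linked_resources(xml_text: str) -> list:
    """Return every xlink:href value found in the document."""
    root = ET.fromstring(xml_text.strip())
    href_attr = "{%s}href" % XLINK_NS
    return [el.get(href_attr) for el in root.iter() if href_attr in el.attrib]

print(linked_resources(SPECIMEN_XML))
```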
Logic and meaning (the Semantic Web)
Although the technical methodologies associated with XML can be daunting, the most difficult issues always relate to the meaning of things. A variety of formal approaches have been proposed to reach the level of meaning within the context of XML. The simplest of these is the Resource Description Framework (RDF) [21,105].
The importance of the RDF model is that it binds data and metadata to a unique object with a web location. Consistent use of the RDF model assures that data anywhere on the web can always be connected through unique objects using RDF descriptions. The association of described data with a unique object confers meaning and greatly advances our ability to integrate data over the internet.
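In RDF, each statement can be read as a triple: the unique object (subject), the metadata (predicate) and the data (value). The sketch below pulls such triples out of a small RDF/XML fragment using only Python's standard library. It handles just the simplest RDF/XML pattern (an rdf:Description with an rdf:about attribute and literal-valued children) and is not a general RDF parser; the ex: vocabulary, the example.org addresses and the function name extract_triples are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical description of one specimen, identified by a web address.
RDF_XML = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/terms#">
  <rdf:Description rdf:about="http://example.org/specimens/001">
    <ex:diagnosis>mesothelioma</ex:diagnosis>
    <ex:anatomic_site>pleura</ex:anatomic_site>
  </rdf:Description>
</rdf:RDF>
"""

def extract_triples(xml_text: str) -> list:
    """Return (subject, predicate, value) triples from simple RDF/XML."""
    root = ET.fromstring(xml_text.strip())
    triples = []
    for description in root.findall("{%s}Description" % RDF_NS):
        subject = description.get("{%s}about" % RDF_NS)
        for prop in description:
            # ElementTree writes a qualified tag as "{namespace}local";
            # joining the two parts gives the predicate's full address.
            namespace, local = prop.tag[1:].split("}")
            triples.append((subject, namespace + local, (prop.text or "").strip()))
    return triples

for triple in extract_triples(RDF_XML):
    print(triple)
```

Every value is tied to the same uniquely identified subject, so statements about that subject found in other files can be merged with these.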
Self-awareness (embedded protocols and commands)
XML provides the methods for querying XML files. Because XML can be used to describe anything, it can certainly be used to describe a query related to an XML page. Furthermore, it can be used to describe protocols for transferring data, performing web services or describing the programmer interface to databases. It can describe the rules for interoperability for any data process, including peer-to-peer data sharing. When an XML file is capable of displaying autonomous behavior, composing queries, merging replies and transforming its own content, it is usually referred to as a software agent [12].
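A small example of querying may help. Python's standard library supports a limited subset of the XPath query language, which is enough to select elements by attribute value or by the content of a child element. The dataset, tag names and attribute names below are hypothetical and serve only to show the form of a query.

```python
import xml.etree.ElementTree as ET

# Hypothetical specimen dataset.
SPECIMENS_XML = """
<specimens>
  <specimen id="s1" fixation="frozen"><diagnosis>adenocarcinoma</diagnosis></specimen>
  <specimen id="s2" fixation="formalin"><diagnosis>mesothelioma</diagnosis></specimen>
  <specimen id="s3" fixation="frozen"><diagnosis>mesothelioma</diagnosis></specimen>
</specimens>
"""

root = ET.fromstring(SPECIMENS_XML.strip())

# Query 1: diagnoses of all specimens whose fixation attribute is "frozen".
frozen_diagnoses = [d.text for d in
                    root.findall("./specimen[@fixation='frozen']/diagnosis")]
print(frozen_diagnoses)  # ['adenocarcinoma', 'mesothelioma']

# Query 2: identifiers of all specimens diagnosed as mesothelioma.
mesothelioma_ids = [s.get("id") for s in
                    root.findall("./specimen[diagnosis='mesothelioma']")]
print(mesothelioma_ids)  # ['s2', 's3']
```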
Using XML
Relationships between different data elements allow us to make generalizations about classes of data. For example, if an adept XML dataminer notices that several different tumors seem to respond to a certain new treatment (based on a query routed through many different clinical trial XML datasets), the significance of the finding may be enhanced if the researcher determines that each of the responsive tumors belonged to the same tumor class, as designated in an XML file that provides the biological classification of every known tumor type [24]. This finding might lead the researcher to ask whether other tumors of the same class share this treatment sensitivity. Furthermore, if the researcher had access to another XML file that annotates all cancer classes with experimental data collected from many different laboratories, he or she may find a specific genetic marker common to the tumor class, that may be responsible for the heightened sensitivity of the class of tumors to the treatment protocol. This may lead to the development of a new class of drugs with specific activity against a specific class of neoplasms. At this time, such a scenario is fanciful, but only because detailed clinical trial data is not yet publicly available.
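The first step of this scenario, relating trial responses to tumor classes, amounts to a join across two XML files. The sketch below uses entirely hypothetical placeholder data (tumor_A, class_X and so on) and invented tag names; it is meant only to show the mechanics of combining two independently created datasets through a shared identifier.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical clinical trial dataset: one record per treated case.
TRIAL_XML = """
<trial>
  <case tumor="tumor_A" response="yes"/>
  <case tumor="tumor_B" response="yes"/>
  <case tumor="tumor_C" response="no"/>
</trial>
"""

# Hypothetical classification dataset: assigns each tumor to a class.
CLASSIFICATION_XML = """
<classification>
  <tumor name="tumor_A" class="class_X"/>
  <tumor name="tumor_B" class="class_X"/>
  <tumor name="tumor_C" class="class_Y"/>
</classification>
"""

def responders_by_class(trial_xml: str, classification_xml: str) -> dict:
    """Count responding cases per tumor class by joining the two datasets."""
    classification = ET.fromstring(classification_xml.strip())
    tumor_class = {t.get("name"): t.get("class")
                   for t in classification.iter("tumor")}
    counts = Counter()
    for case in ET.fromstring(trial_xml.strip()).iter("case"):
        if case.get("response") == "yes":
            counts[tumor_class[case.get("tumor")]] += 1
    return dict(counts)

print(responders_by_class(TRIAL_XML, CLASSIFICATION_XML))  # {'class_X': 2}
```

The join works only because both files use the same name for the same tumor, which is the role played by shared, well-defined metadata.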
Problems with Research data
Research data is seldom annotated in a way that is both complete (everything needing annotation is annotated) and understandable (annotations are clear and unambiguous). A serial analysis of gene expression (SAGE) database that refers to a specimen as "Brain, normal, pooled," without any description of the specific anatomic brain structure sampled, or the reasons for pooling different brains, or the conditions under which the tissue was obtained (fresh, fresh/frozen, formalin-fixed), is certain to disappoint anatomists [106]. A biological specimen designated as "Primary mesothelioma" without any mention of the primary anatomic site (pleura, peritoneum, or elsewhere) would never satisfy a pathologist [107]. Basic information about the patient (age, gender, ethnicity) is required for any clinical study, but lab researchers sometimes fail to see the necessity of including such data with their experimental results. It is the authors' perception that this profound oversight occurs when scientists confine their experimental analyses to the data generated in their laboratory. Having no part in creating the patient's clinical data, experimentalists mistrust clinical data and sometimes forget it even exists.
Why is annotation so often neglected by biomedical scientists? There may be many reasons. Perhaps the most important is the fondly held notion that the exclusive purpose of a scientific study is to prove or disprove a hypothesis. If the only purpose of collecting data is to test a hypothesis, then the life of the data ends when the hypothesis is proven. Data obtained from a carefully designed study for one particular purpose is seldom intended to have an autonomous existence beyond the published manuscript. It can be very difficult to combine data obtained by different laboratories, even when the researchers think they are using the same techniques on the same experimental platform [25].
Even when researchers intend to share their data, they tend to neglect annotation issues. The typical result of a data-intensive experiment is an Excel spreadsheet, populated by data cells, each cell occupied by a number. Many researchers believe that Excel spreadsheets are a useful de facto standard for scientific data exchange. Spreadsheets are not designed to capture the kinds of annotations that describe the basic details that distinguish one experiment from another (who did the experiment, when was it done, how was it done, the specific protocols used, details describing the tissues, reagents or cell lines used in the experiment, and patient-related descriptors related to tissue specimens). Scientists are mistaken if they believe that the full story of an experiment is told by the spreadsheet values. This perception is sometimes inadvertently strengthened by the statistician, who is comfortable working with columns of raw spreadsheet data.
The field of biomedical informatics rests on the assumption that data collected from many different experiments can be merged to produce scientifically valid results unanticipated by the designers of the original experiments. For this to actually occur, experiments must be fully annotated with information related to methods, reagents and tissues. The analysis of integrated data can only occur when all data elements are annotated with metadata that conforms to common definitions. Until annotations are added to the results of an experiment, there is no valid way to merge experimental data.
The Problems with Medical Data
Issues related to the use of clinical data in medical research generally fall into the areas of data quality and data access. In many instances, the clinical data of greatest importance to researchers is the surgical pathology report that documents and describes the lesions analyzed by the researcher. Surgical pathology reports are typically written as narrative text, and clinicians are often unable to interpret reports with the meaning intended by the pathologist [26]. It sometimes seems that pathologists purposefully obfuscate their diagnoses. Why is this? Pathologists tend to think that every biopsy report is a privately fashioned communication between the pathologist and clinician, conveying a subtle, complex message. Physicians have a fiduciary obligation to guard their patients' confidentiality, and pathologists have traditionally written reports with the understanding that the results will never be shared. Pathologists who see their role as protectors of data are unlikely to spend much time making it easy for non-pathologists to access the data. The transformation of medical narrative text into scientifically useful data has received considerable attention [27-28].
The problems of data access are particularly frustrating when one considers the enormous amounts of data collected by modern Hospital Information Systems (HISs). HISs are databases that collect transactional information from laboratories and hospital departments and append them to the patient's [electronic] medical record. The typical HIS does not permit users to aggregate records from different patients based on data type. Most physicians cannot collect all the patients with a glucose exceeding 120, or create a collection of all the pathology reports that were positive for a sexually transmitted disease. Under HIPAA privacy regulations, hospital personnel are generally forbidden to access medical records on groups of patients [29]. They receive information on one patient at a time and are generally denied access to information on patients who are not under their direct care. Of course, there are exceptions to this generalization, but the exceptions are becoming fewer and fewer as federal privacy regulations narrow the opportunities for access to HIS data.
What are the practical implications of this? Large medical centers collect terabytes of clinical data every week, but none of this data is directly accessible for research purposes. To obtain aggregated records for the purpose of conducting medical research, a researcher must submit their research plans to an Institutional Review Board and/or Privacy Board. Institutional Review Boards are designed to protect patients from harms that may be associated with medical research. In the case of research that uses pre-existing patient records (and does not put the patient at physical risk through a medical intervention), the risks to the patient are confined to issues of confidentiality or privacy.
The risks to the institution are violations of federal or state regulations that restrict the uses of patient records for research. The two federal laws that apply to this situation are the so-called Common Rule and the HIPAA Privacy Regulations [29,30]. Both of these rules permit the unrestricted use of patient records when the records are de-identified (i.e., when the data in the record is disengaged from any links to the patient). A variety of technical solutions permit researchers to create and use large numbers of de-identified medical records [31-36]. Decades before current U.S. privacy regulations, researchers sought and found computational solutions to patient confidentiality issues [37]. Unfortunately, researchers and institutions have not been quick to adopt any of these technical solutions. De-identification protocols often have a "smoke and mirrors" quality conferred by their reliance on exotic mathematical constructions (e.g., one-way hash algorithms, prime number factorization and zero-knowledge protocols). It is the perception of the authors, based on many interactions with researchers, that the reluctance to use available de-identification algorithms will fade after many institutions safely implement these protocols, without any serious legal challenge. This will likely require at least five more years of collected experience.
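The one-way hash idea is less exotic than it may sound. The sketch below replaces a patient identifier with a keyed one-way hash (HMAC with SHA-256 from Python's standard library), so that records from the same patient still share a code while the code itself reveals nothing legible about the patient. The choice of HMAC-SHA-256, the function name pseudonymize, the sample identifiers and the key are all assumptions made for illustration; this is not the protocol of any cited work, and a hash step alone does not make a record de-identified under the regulations discussed above.

```python
import hashlib
import hmac

def pseudonymize(patient_identifier: str, secret_key: bytes) -> str:
    """Return a fixed-length code derived from the identifier and a secret key.

    The same identifier and key always give the same code, but the code
    cannot feasibly be reversed to recover the identifier.
    """
    digest = hmac.new(secret_key, patient_identifier.encode("utf-8"),
                      hashlib.sha256)
    return digest.hexdigest()

# Hypothetical values, for demonstration only. In practice the key would be
# held by the institution and never released with the data.
key = b"institutional-secret-key"
code_a = pseudonymize("MRN-0001", key)
code_b = pseudonymize("MRN-0001", key)
code_c = pseudonymize("MRN-0002", key)

print(code_a == code_b)  # True: same patient, same code
print(code_a == code_c)  # False: different patients, different codes
```

The secret key matters: an unkeyed hash of a short identifier could be reversed by hashing every possible identifier and comparing results.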
Benefits to the data annotators
Everyone who collects data needs to know that the considerable time, energy and intellectual effort devoted to careful annotation is not wasted [38]. There needs to be a clear understanding that this work will benefit patients, advance the professional goals of the people involved in the effort, and serve the corporate interests of the entity that funds these activities.
When researchers are asked to forsake their familiar spreadsheets and migrate to XML, they will pose a series of predictable questions:
1. Creating valid XML files with all my experimental data will require many additional hours of time invested by a person who has specialized technical expertise. Who will pay for this? If the scientific study is not designed until after the data has been collected and annotated, how can I hope to attract funding now for work that has no currently existing hypothesis?
2. Do I have any control over my own data once it has been "integrated?" Will I have access to the integrated dataset in support of my own scientific interests? How can I be sure that it will not be misinterpreted? How can I be certain that the uses of this data will not result in erroneous conclusions or in harm to patients or in the violation of laws? Can I or my institution be held responsible for the misconduct of researchers outside our institution who access my data?
3. After my XML file has been used by scientists who had no part in generating the data, how will I be compensated? How can I maintain control over my own data and ensure that I receive fair professional credit for my efforts? How can I be certain that I receive financial compensation when it is warranted, and how can I ascertain that my compensation is fair?
To a certain degree, scientists are asked to take a leap of faith, trusting that the arguments in support of XML technology are sufficiently persuasive to justify a commitment in the absence of any visible reward model. The authors urge new XML converts to adopt the recommendations listed in Table 1.
Five-year view
The authors predict that in the next 5 years, there will be many successful projects integrating clinical data with research data within research institutions. However, these important data resources will be available only to a small set of institutional researchers. The few researchers who can successfully integrate complex biomedical data will make remarkable advances in translational research. The public will reap the benefits of these advances and will eventually applaud the expanded use of de-identified medical records in research.
For the next five years, persistent social and legal factors will severely limit public access to collections of clinically annotated research data. The level of confidence in de-identification protocols will remain low until successful implementations are published that withstand the scrutiny of regulatory agencies and rebuff attacks by would-be violators of patient confidentiality.
Key issues
1. Grim reality would suggest that most experimental data has not been prepared in a manner that can be integrated or merged with other datasets. The same is true for clinical data.
2. XML will permit us to understand and integrate the textual and numeric data held in XML files anywhere on the Internet.
3. Even if all research and clinical data were structured as valid XML documents, we currently lack two important cultural constructs:
a) Compensation (professional and financial) for the efforts needed to prepare data in a format that supports public access to annotated biomedical data
b) Assurances that adverse legal actions will be avoided when the researchers prepare data in a manner that follows standard guidelines for the protection of patient privacy.
Table 1. Recommendations to facilitate data integration
1. If you don't need a centralized, proprietary database, consider creating decentralized XML datasets.
2. Use a consistent approach to data integration for all projects in your institution. Institutions and businesses should develop general data-integration strategies and policies, which can be included in grant proposals and business plans.
3. Work with your IRBs or Privacy Boards to develop general data use and data sharing protocols that are safe or low-risk. Use published privacy/confidentiality protocols when they are appropriate for your purposes.
4. Use publicly available data exchange specifications, schemas, common data elements (metadata), vocabularies, classifications and ontologies when feasible. Use namespace notation to avoid collisions between data descriptors borrowed from multiple sources.
5. Avoid duplicating data that resides in pre-existing XML datasets. Use pointers to pre-existing data rather than importing data from one XML file to another.
6. Strive toward creating fully self-describing documents. Use the Dublin Core header elements in every XML file to ensure that everyone who has access to your XML file understands the file properties (who created and owns the file, when the file was created, the purpose of the file, and any restrictions on the use of the file) [108].
7. Use RDF or another semantic-level XML technology to confer meaning to contained data.
8. Place your XML files on public web locations whenever feasible.
9. Submit your XML files as supporting materials accompanying your journal submissions.
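As an illustration of recommendation 6, the sketch below adds a small header of Dublin Core elements to an XML document using Python's standard library. The creator, date, description and rights elements belong to the Dublin Core element set; the enclosing <header> tag, the <tma_dataset> root, the function name add_dublin_core_header and the sample values are hypothetical choices made for this example.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)  # serialize with the conventional dc: prefix

def add_dublin_core_header(root, creator, date, description, rights):
    """Insert a header of Dublin Core elements as the first child of root."""
    header = ET.Element("header")  # hypothetical wrapper element
    fields = (("creator", creator), ("date", date),
              ("description", description), ("rights", rights))
    for name, value in fields:
        child = ET.SubElement(header, "{%s}%s" % (DC_NS, name))
        child.text = value
    root.insert(0, header)
    return root

# Hypothetical dataset and header values.
dataset = ET.Element("tma_dataset")
add_dublin_core_header(
    dataset,
    creator="Example Laboratory",
    date="2005-01-01",
    description="Tissue microarray results, de-identified",
    rights="Research use only",
)
xml_text = ET.tostring(dataset, encoding="unicode")
print(xml_text)
```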
References
[1]. Innovation or Stagnation: Challenge and Opportunity on the Critical Path to New Medical Products. U.S. Department of Health and Human Services, Food and Drug Administration (2004).
[2]. Anderson NL, Anderson NG. The human plasma proteome: history, character and diagnostic prospects. Mol. Cell. Proteomics 1, 845-867 (2002).
[3]. Benowitz S. Biomarker boom slowed by validation concerns. J. Natl. Cancer Inst. 96(18), 1356-1357 (2004).
[4]. Evans WE, Relling MV. Pharmacogenomics: translating functional genomics into rational therapeutics. Science. 286, 487-491 (1999).
[5]. McCarthy JJ, Hilfiker R. The use of single-nucleotide polymorphism maps in pharmacogenomics. Nat. Biotechnol. 18(5), 505-508 (2000).
[6]. Mancinelli L, Cronin M, Sadee W. Pharmacogenomics: the promise of personalized medicine. AAPS PharmSci. 2(1), E4 (2000).
[7]. Solbrig HR. Metadata and the reintegration of clinical information: ISO 11179. MD Comput. 3, 25-28 (2000).
[8]. Cantor MN, Lussier YA. Putting data integration into practice: using biomedical terminologies to add structure to existing data sources. Proc. AMIA Symp. 125-129 (2003).
[9]. Baorto DM, Cimino JJ, Parvin CA, Kahn MG. Combining laboratory data sets from multiple institutions using the logical observation identifier names and codes (LOINC). Int. J. Med. Inf. 51(1), 29-37 (1998).
[10]. Martín-Sanchez F, Maojo V, López-Campos G. Integrating genomics into health information systems. Methods Inf. Med. 41, 25-30 (2002).
[11]. Sensmeier J. Advancing the state of data integration in healthcare. J. Healthc. Inf. Manag. 17(4), 58-61 (2003).
[12]. Karasavvas KA, Baldock R, Burger A. Bioinformatics integrations and agent technology. J. Biomed. Inform. 37, 205-219 (2004).
[13]. Sujansky W. Heterogeneous database integration in biomedicine. J. Biomed. Inform. 34, 285-298 (2001).
[14]. Covitz PA, Hartel F, Schaefer C, De Coronado S, Fragoso G, Sahni H, Gustafson S, Buetow KH. caCORE: a common infrastructure for cancer informatics. Bioinform. 19(18), 2404-2412 (2003).
[15]. Warzel DB, Andonaydis C, McCurry B, Chilukuri R, Ishmukhamedov S, Covitz P. Common Data Element (CDE) Management and Deployment in Clinical Trials. AMIA Annu. Symp. Proc. 1048 (2003).
[16]. Berman JJ. Integrating pathology data with XML. Hum. Pathol. (in press).
[17]. Stein LD. Integrating biological databases. Nature Rev. Genetics 4, 337-345 (2003).
[18]. Mork P, Halevy A, Tarczy-Hornoch P. A model for data integration systems of biomedical data applied to online genetic databases. Proc. AMIA Symp. 473-477 (2001).
[19]. Nagarajan R. Database challenges in the integration of biomedical data sets. Proceedings of the 30th VLDB Conference, Toronto, Canada, 1202-1213 (2004).
[20]. Ahmed K, Ayers D, Birbeck M, Cousins J, Dodds D, Lubell J, Nic M, Rivers-Moore D, Watt A, Worden R, Wrightson A. Professional XML Meta Data. Wrox Press Ltd. Birmingham (2001).
[21]. White C, Quin L, Burman L. Mastering XML: Premium Edition. Sybex, San Francisco (2001).
[22]. Berman JJ, Edgerton ME, Friedman B. The Tissue Microarray Data Exchange Specification: A Community-based, Open Source Tool for Sharing Tissue Microarray Data. BMC Med. Inform. Decis. Mak. 3, 5 (2003).
[23]. Harris MA, Clark J, Ireland A, et al. Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucl. Acids Res. 32, D258-D261 (2004).
[24]. Berman JJ. Tumor classification: molecular analysis meets Aristotle. BMC Cancer 4, 10 (2004).
[25]. Hwang KB, Kong SW, Greenberg SA, Park PJ. Combining gene expression data from different generations of oligonucleotide arrays. BMC Bioinform. 5, 159 (2004).
[26]. Powsner SM, Costa J, Homer RJ. Clinicians are from Mars and pathologists are from Venus. Arch. Pathol. Lab. Med. 124, 1040-1046 (2000).
[27]. Berman JJ. Doublet method for very fast autocoding. BMC Med. Inform. Decis. Mak. 4(1), 16 (2004).
[28]. Berman JJ. Resources for comparing the speed and performance of medical autocoders. BMC Med. Inform. Decis. Mak. 4(1), 8 (2004).
[29]. Department of Health and Human Services. 45 CFR (Code of Federal Regulations), Parts 160 through 164. Standards for Privacy of Individually Identifiable Health Information (Final Rule). Federal Register: December 28, Volume 65, Number 250, 82461-82510 (2000).
[30]. Department of Health and Human Services. 45 CFR (Code of Federal Regulations), 46. Protection of Human Subjects (Common Rule). Federal Register, June 18, Volume 56, 28003 (1991).
[31]. Berman JJ. Zero-Check: A Zero-Knowledge Protocol for Reconciling Patient Identities Across Institutions. Arch. Pathol. Lab. Med. 128, 344-346 (2004).
[32]. Berman JJ. Concept-Match Medical Data Scrubbing: How pathology datasets can be used in research. Arch. Pathol. Lab. Med. 127, 680-686 (2003).
[33]. Berman JJ. Threshold protocol for the exchange of confidential medical data. BMC Med. Res. Methodol. 2, 12 (2002).
[34]. Berman JJ. Racing to share pathology data. Am. J. Clin. Pathol. 121(2), 169-171 (2004).
[35]. Malin B, Sweeney L. How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems. J. Biomed. Inform. 37(3), 179-192 (2004).
[36]. Sweeney L. Three computational systems for disclosing medical data in the year 1999. Medinfo. 9(Pt 2), 1124-1129 (1998).
[37]. Biskup J, Bleumer G. Cryptographic Protection of Health Information: Cost and Benefit. Int. J. Bio-Med. Comput. 43, 61-67 (1996).
[38]. Kohane IS. Bioinformatics and Clinical Informatics: The Imperative to Collaborate. JAMIA 7, 512-516 (2000).
[100]. W3C Architecture Domain. Extensible Markup Language (XML) [http://www.w3c.org/XML/]
[101]. Numeric representation of Dates and Time: The ISO solution to a long-standing source of confusion [http://www.iso.ch/iso/en/prods-services/popstds/datesandtime.html]
[102]. XML Namespaces. [http://www.w3schools.com/xml/xml_namespaces.asp].
[103]. W3C XML Pointer, XML Base and XML Linking. [http://www.w3.org/XML/Linking].
[104]. XML Path Language (XPath). [http://www.w3.org/TR/xpath].
[105]. Resource Description Framework (RDF). [http://www.w3.org/RDF/].
[106]. Sample GSM763. [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM763].
[107]. Sample GSM727. [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM727].
[108]. Dublin Core Metadata Initiative. [http://dublincore.org/].