


    RDF data specifications: the simple alternative to complex data standards


    Version 1.0, July 14, 2006



    A new version of this manuscript is already planned for release.  This version will have downloadable Perl files and links to the most recent versions of CDE documents and RDF Schemas described in the manuscript. Please visit the Association for Pathology Informatics resources site for updates.  http://www.pathologyinformatics.org/informatics_r.htm


    Copyright 2006 Jules J. Berman


    Prepared for APIII

    August 16-18, 2006

    Vancouver, British Columbia


    Jules J. Berman, Ph.D., M.D.

    President, Association for Pathology Informatics

    Co-chair, Laboratory Digital Imaging Project

    jjberman@alum.mit.edu


    G. William Moore, M.D., Ph.D.

    Department of Pathology

    VA Medical Center

    Baltimore, Maryland

    George.Moore4@va.gov




    Key words: ldip, laboratory digital imaging project, rdf, resource description framework, rdf schema, iso11179, iso 11179, xml, xsd, cde, common data element, specification, data standard, ontology, ontologies, semantic web, use-case, metadata, binary image file, image, xml schema, class, classes, property, properties, domain, range, pattern, regex, regular expression, regular expressions, urn, uri, url, lsid, oid, datatype, datatyping, data type







    Overview of manuscript

    The Resource Description Framework (RDF) provides a simple method for specifying information as data triples. The authors believe that much of the time and expense associated with developing and deploying data standards can be eliminated by a consistent implementation of recommended RDF data specification practices.


    Necessary background subjects:

    1. Meaning in informatics

    2. Triples

    3. Identifiers

    4. Datatyping

    5. Classes and Properties

    6. Instantiating Classes


    Necessary informatics techniques:

    1. RDF syntax (specifying data as class instance-property-data triples)

    2. RDF schema (formal dictionary for classes and properties)

    3. XSD (to constrain data to a defined datatype) 


    The only implementation tools you really need are your head and a text editor such as Notepad or Emacs.  Optional tools may include:

    1. Programming modules for parsing XML and RDF (available soon as short Perl scripts distributed with this manuscript).

    2. Publicly available RDF schemas


    This manuscript provides a simplified review of the principles and practices of RDF data specification.  A use-case from the Laboratory Digital Imaging Project (LDIP), involving the annotation of pathology images, is described.   


    The authors hope that readers will be able to adopt these principles and methods to specify their own datasets in an efficient and simple manner that supports data integration and software interoperability.


    Background


    The definition of Meaning

    In informatics, assertions have meaning whenever a pair of metadata and data (the descriptor for the data and the data itself) is assigned to a specific subject.


    Triples consist of: Specified subject then metadata then data


    Some triples found in a medical dataset


    “Jules Berman” “blood glucose level” “85”

    “Mary Smith” “blood glucose level” “90”

    “Samuel Rice” “blood glucose level” “200”

    “Jules Berman” “eye color” “brown”

    “Mary Smith” “eye color” “blue”

    “Samuel Rice” “eye color” “green”



    Some triples found in a haberdasher's dataset


    “Juan Valdez” “hat size” “8”

    “Jules Berman” “hat size” “9”

    “Homer Simpson” “hat size” “9”

    “Homer Simpson” “hat_type” “bowler”



    Triples collected from both datasets whose subject is "Jules Berman"


    “Jules Berman” “blood glucose level” “85”

    “Jules Berman” “eye color” “brown”

    “Jules Berman” “hat size” “9”


    Triples can port their meaning between different databases because they bind described data to a specified subject. This supports data integration of heterogeneous data and facilitates the design of software agents.  A software agent, as used here, is a program that can interrogate multiple RDF documents on the web, initiating its own actions based on inferences yielded from retrieved triples.
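    The merging step above can be sketched in a few lines of Python. This is only an illustration: the tuple representation and the function name triples_about are assumptions of the sketch, and the values are copied from the two example datasets.

```python
# Each triple is (subject, metadata, data), copied from the example datasets.
medical = [
    ("Jules Berman", "blood glucose level", "85"),
    ("Mary Smith", "blood glucose level", "90"),
    ("Samuel Rice", "blood glucose level", "200"),
    ("Jules Berman", "eye color", "brown"),
    ("Mary Smith", "eye color", "blue"),
    ("Samuel Rice", "eye color", "green"),
]
haberdasher = [
    ("Juan Valdez", "hat size", "8"),
    ("Jules Berman", "hat size", "9"),
    ("Homer Simpson", "hat size", "9"),
    ("Homer Simpson", "hat_type", "bowler"),
]


def triples_about(subject, *datasets):
    """Collect every triple with the given subject from any number of datasets."""
    return [t for dataset in datasets for t in dataset if t[0] == subject]


merged = triples_about("Jules Berman", medical, haberdasher)
for triple in merged:
    print(triple)
```

    Because each triple carries its own subject, no knowledge of either database's layout is needed to combine them. The sketch also makes the first of the questions below concrete: matching on the bare name "Jules Berman" is only safe if the name identifies exactly one person.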


    RDF (Resource Description Framework) is a syntax for writing computer-parsable triples.  For RDF to serve as a general method for describing data objects, we need to answer the following four questions:


    1. How does the triple convey the unique identity of its subject?  In the triple, “Jules Berman” “blood glucose level” “85”, the name "Jules Berman" is not unique and may apply to several different people.


    2. How do we convey the meaning of metadata terms?  Perhaps one person's definition of a metadata term is different from another person's.  For example, is "hat size" the diameter of the hat, or the distance from ear to ear on the person who is intended to wear the hat, or a digit selected from a pre-defined scale?


    3. How can we constrain the values described by metadata to a specific datatype?  Can a person have an eye color of 8?  Can a person have an eye color of "chartreuse"?


    4. How can we indicate that a unique object is a member of a class and can be described by metadata shared by all the members of a class? 


    Much of the remainder of the background section will be devoted to answering these  four questions. 


    Introduction to RDF syntax: RDF triples

    RDF is a specialized XML syntax for creating computer-parsable files consisting of triples. The subject of the RDF triple is invoked with the rdf:about attribute.  Following the subject is a metadata/data pair. 


    Let us create an RDF triple whose subject is the jpeg image file specified as:

    http://www.gwmoore.org/ldip/ldip2103.jpg.  The metadata is <dc:title> and the data value is "Normal Lung".


    <rdf:Description                     

        rdf:about="http://www.gwmoore.org/ldip/ldip2103.jpg">

        <dc:title>Normal Lung</dc:title>

      </rdf:Description>


    An example of three triples in proper RDF syntax is:


    <rdf:Description                     

        rdf:about="http://www.gwmoore.org/ldip/ldip2103.jpg">

        <dc:title>Normal Lung</dc:title>

      </rdf:Description>

    <rdf:Description                     

        rdf:about="http://www.gwmoore.org/ldip/ldip2103.jpg">   

        <dc:creator>Bill Moore</dc:creator>

      </rdf:Description>

    <rdf:Description                     

        rdf:about="http://www.gwmoore.org/ldip/ldip2103.jpg">

        <dc:date>2006-06-28</dc:date>

      </rdf:Description>


    RDF permits you to collapse multiple triples that apply to a single subject.  The following rdf:Description statement is equivalent to the three prior triples:


    <rdf:Description                     

        rdf:about="http://www.gwmoore.org/ldip/ldip2103.jpg">

        <dc:title>Normal Lung</dc:title>

        <dc:creator>Bill Moore</dc:creator>

        <dc:date>2006-06-28</dc:date>

    </rdf:Description>
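
    The collapsing rule can be sketched as a small Python function that groups triples by subject and writes one rdf:Description per subject. The function name collapse and the tuple layout are assumptions made for illustration; the output is a fragment that still needs the enclosing rdf:RDF element and namespace declarations.

```python
from collections import defaultdict
from xml.sax.saxutils import escape, quoteattr

IMAGE = "http://www.gwmoore.org/ldip/ldip2103.jpg"

# (subject, property tag, value) triples, as in the three statements above.
triples = [
    (IMAGE, "dc:title", "Normal Lung"),
    (IMAGE, "dc:creator", "Bill Moore"),
    (IMAGE, "dc:date", "2006-06-28"),
]


def collapse(triples):
    """Emit one rdf:Description per subject, holding all of its properties."""
    by_subject = defaultdict(list)
    for subject, prop, value in triples:
        by_subject[subject].append((prop, value))
    blocks = []
    for subject, pairs in by_subject.items():
        lines = [f"<rdf:Description rdf:about={quoteattr(subject)}>"]
        lines += [f"  <{prop}>{escape(value)}</{prop}>" for prop, value in pairs]
        lines.append("</rdf:Description>")
        blocks.append("\n".join(lines))
    return "\n".join(blocks)


collapsed = collapse(triples)
print(collapsed)
```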


    An example of a short but well-formed RDF image specification document is:


    <?xml version="1.0"?>

    <rdf:RDF 

          xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

          xmlns:dc="http://purl.org/dc/elements/1.1/">

      <rdf:Description                     

        rdf:about="http://www.gwmoore.org/ldip/ldip2103.jpg">

        <dc:title>Normal Lung</dc:title>

        <dc:creator>Bill Moore</dc:creator>

        <dc:date>2006-06-28</dc:date>

      </rdf:Description>

    </rdf:RDF>

    The first line tells you that the document is XML.  The second line tells you that the XML document is an RDF resource.  The third and fourth lines declare the namespaces that are referenced within the document (more about this later).  Following that is the RDF statement that we have already seen.
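
    To show that such a document really is computer-parsable, here is a minimal Python sketch that recovers the three triples using only the standard library. It is not the Perl code planned for distribution with this manuscript, and the function name extract_triples is an assumption. It handles only rdf:Description elements that carry rdf:about and literal-valued children, which is all this example uses.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

document = """<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description
      rdf:about="http://www.gwmoore.org/ldip/ldip2103.jpg">
    <dc:title>Normal Lung</dc:title>
    <dc:creator>Bill Moore</dc:creator>
    <dc:date>2006-06-28</dc:date>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(rdf_xml):
    """Return (subject, property, value) triples from simple RDF/XML."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree expands prefixes: dc:title becomes "{namespace}title".
            triples.append((subject, prop.tag, (prop.text or "").strip()))
    return triples


triples = extract_triples(document)
for subject, prop, value in triples:
    print(subject, prop, value)
```

    Note that the parser reports each property by its full namespace rather than by the dc: prefix. The prefix is only shorthand; the namespace is what identifies the metadata term.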

    Believe it or not, the manuscript thus far covers 95% of what you need to know to specify your data with RDF.  If you seek the simplest approach to specifying data as RDF documents, you can skip down to the section titled, "Use-case examples" and go to the first subsection, "RDF file with pointers to jpeg image file." This is the simplest use-case for specifying images.  By modifying this use-case for your own purposes, you can write RDF documents that adequately specify almost any biomedical data.

    Common Data Elements (CDEs)

    The term, "common data element" is a misnomer.  Most people, when they first encounter this term, assume that a data element holds data.  Actually, a common data element is the metadata that describes a datum in a data record.  In XML parlance, a CDE is an XML tag.  The thing that makes a descriptor "common" is its common usage by a scientific community.  The way that CDEs are intended to work is that a scientific community creates a list of CDEs that describe the kinds of data that their members use.  The members of the community will all use the same CDEs (XML tags) to annotate their data files.


    ISO-11179 Specification for CDEs

    One of the most calamitous errors in any CDE project is to assume that everyone who reads a metadata tag will automatically understand its intended meaning.  ISO-11179 is a standard way of defining CDEs with the necessary information for understanding their meanings.


    The most popular CDEs in existence are the Dublin Core CDEs.  These are a set of file descriptors that were prepared by a committee of librarians who convened in Dublin, Ohio.  The Dublin Core includes basic information about electronic documents, such as: the title of the document, the name of the person who created the file, the date that the file was created, the date that the file was modified, a short description of the file.  These are basically the items that a library software agent would need to retrieve if it were building an index of internet documents. The world of informatics would be a better place if everyone who created an HTML, XML or RDF file would remember to include the Dublin Core CDEs.


    These Dublin Core CDEs have been prepared to comply with the ISO-11179 specification.  Every effort to create a data specification for a knowledge domain should begin by collecting the common data elements for the domain and annotating each element with the ISO-11179 CDE descriptors.  The ISO-11179 descriptors for two of the Dublin Core CDEs (Title and Creator) are shown below.


    From: http://dublincore.org/documents/1999/07/02/dces/

    Title

       Identifier: Title

       Version: 1.1

       Registration Authority: Dublin Core Metadata Initiative

       Language: en

       Obligation: Optional

       Datatype: Character String

       Maximum Occurrence: Unlimited

       Definition: A name given to the resource.

       Comment: Typically, a Title will be a name by which the resource is

       formally known.

    Creator

       Identifier: Creator

       Version: 1.1

       Registration Authority: Dublin Core Metadata Initiative

       Language: en

       Obligation: Optional

       Datatype: Character String

       Maximum Occurrence: Unlimited

       Definition: An entity primarily responsible for making the content

       of the resource.

       Comment: Examples of a Creator include a person, an organisation,

       or a service. Typically, the name of a Creator should be used to

       indicate the entity.

    RDF Schemas

    An RDF schema is a dictionary file that lists the classes and the properties that pertain to RDF documents.  In fact, the official long name for RDF Schema is the RDF Vocabulary Description Language.  The classes of an RDF schema are formal definitions for the kinds of subjects that are found in the RDF triples. The properties of an RDF schema are the types of metadata descriptors for the data of the RDF triples.  Elements in RDF schemas may be subclasses of elements in other RDF schemas.


    Things to remember about RDF Schemas


    1. RDF Schemas are written in XML but are completely unlike XML Schemas. 


    2. RDF Schemas contain declarations of the classes and properties that are used in RDF documents.


    3. RDF Schemas, like all RDF documents, have no pre-determined order or composition and consist of statements expressed as triples.  The subject of every triple in an RDF Schema will be either a Class or a Property.


    4. Every RDF Schema can be thought of as a child of the W3C RDF Schema that defines the "super" classes Resource, Class and Property. All RDF Schemas will refer to the document that defines RDF syntax and to the document that defines the top-level schema, and therefore will begin something like this:


    <?xml version='1.0' encoding='ISO-8859-1'?>

      <rdf:RDF 

          xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

          xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">


    5. A typical RDF document consists of triples (subject, metadata, value).  RDF documents usually reference one or more RDF Schemas to instantiate the subject of each triple (i.e., to tell us which class in an RDF schema the subject is an instance of) and to provide subjects with class-appropriate metadata.


    6.  Documents composed of triples whose components are defined by RDF Schemas can be used to completely specify data objects within a knowledge domain. 


    7. By completely specifying data objects in a knowledge domain, RDF specifications achieve the functionality of data standards.



    In a later section, we will demonstrate the simple method for instantiating classes and for associating class instances with appropriate properties and with datatyped values.


    A template for CDEs that can be transformed to RDF Schema elements


    Here is a CDE template that has categories for all the information about a class or a property that is necessary to generate an RDF schema element and that satisfies the minimal recommendations for a CDE under ISO-11179:


    The general format for class elements is:


    Class Label (in standard XML tag format, uppercase 1st letter):

    Registration Authority: Association for Pathology Informatics

    Obligation: optional

    Maximum Occurrence: Unlimited

    Comment (must include detailed definition):

    subClassOf:

    Contributor (your consistent first-name last-name):

    Date of your contribution:



    The general format for property elements is:


    Property Label (in standard XML tag format, lowercase 1st letter):

    Registration Authority: Association for Pathology Informatics

    Obligation: optional

    Maximum Occurrence: Unlimited

    Datatype (can be "Literal", a list, or a regex; default is "Literal"):

    Comment (must include detailed definition):

    Domain (comma-delimited if multiple):

    Range (usually "Literal"):

    Contributor (your consistent first-name last-name):

    Date of your contribution:


    The category "Obligation" should contain the word "required" or the word "optional".  For the kinds of specifications discussed in this manuscript, including any CDE would always be optional. Similarly, for "Maximum Occurrence", we would think any CDE could occur an unlimited number of times in an RDF document. 
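
    A contributed CDE can be checked against the template before it is accepted. The sketch below is one hypothetical way to do this in Python; representing a CDE as a dictionary, the key names, and the function name check_cde are all assumptions, not part of ISO-11179 or of the template itself.

```python
def check_cde(cde, kind):
    """Return a list of problems found in a CDE dictionary.

    kind is "class" or "property". The rules follow the template above:
    label capitalization, the Obligation wording, and a non-empty Comment.
    """
    problems = []
    label = cde.get("Label", "")
    if not label:
        problems.append("missing Label")
    elif kind == "class" and not label[0].isupper():
        problems.append("class label should start with an uppercase letter")
    elif kind == "property" and not label[0].islower():
        problems.append("property label should start with a lowercase letter")
    if cde.get("Obligation") not in ("required", "optional"):
        problems.append('Obligation must be "required" or "optional"')
    if not cde.get("Comment"):
        problems.append("Comment must include a detailed definition")
    if kind == "property" and not cde.get("Domain"):
        problems.append("property needs a Domain")
    return problems


# Illustrative entries; the second one is deliberately malformed.
reagent = {"Label": "Reagent", "Obligation": "optional",
           "Comment": "Chemicals employed in the laboratory."}
bad_property = {"Label": "DateTime", "Obligation": "sometimes", "Comment": ""}

print(check_cde(reagent, "class"))
print(check_cde(bad_property, "property"))
```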


    Classes and Properties

    Here is an example of an ISO-11179-compliant CDE written for a class named "Reagent".



    Class Label:Reagent

    versionInfo (required): 0.1

    Registration Authority: Association for Pathology Informatics

    Obligation:optional

    Maximum Occurrence: Unlimited

    Datatype: Literal

    comment: Histologic_stain_reagents, tissue_fixation_reagents, and other chemicals  employed in the laboratory.  For example: distilled_water, ethanol,  hematoxylin, aluminum_sulphate

    subClassOf:Class

    Contributor:Bill Moore

    Date_of_contribution:05-30-2006



    Once we have the CDE, it is a straightforward job to create an RDF Schema Class element:


    <rdfs:Class rdf:about="http://www.ldip.org/ldip_sch#Reagent">

    <rdfs:label>Reagent</rdfs:label>

    <rdfs:comment>

    Histologic_stain_reagents, tissue_fixation_reagents, and other chemicals  employed in the laboratory.  For example: distilled_water, ethanol,  hematoxylin, aluminum_sulphate

    </rdfs:comment>

    <rdfs:subClassOf rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>

    </rdfs:Class>

                                      

    A few points bear explanation.  All the information needed to generate an  RDF class or property should be contained in an ISO-11179-compliant list of CDEs.  An RDF Schema may consist purely of classes and properties.  Classes are defined in RDF exclusively through their ancestral relation.  Basically, to build a class in RDF Schema, you announce that the element is a Class, you provide a unique locator (such as a URL) or a unique universally understood descriptor (more on this later) for the element, a description of the element, and the name of the father class of the element.
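
    Since the CDE already holds everything the class element needs, the transformation can be automated. The Python sketch below is a hypothetical illustration: the dictionary keys and the function name cde_to_class_element are assumptions, and the parent class is assumed to live in the W3C RDF Schema namespace. It writes the element as rdfs:Class, the name under which the RDF Schema vocabulary defines classes.

```python
from xml.sax.saxutils import escape, quoteattr

RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"


def cde_to_class_element(cde, schema_url):
    """Render a class CDE (a dict) as an RDF Schema class declaration."""
    about = schema_url + "#" + cde["label"]
    # Assumption: the parent class is defined in the W3C RDF Schema namespace.
    parent = RDFS_NS + cde["subClassOf"]
    return "\n".join([
        f"<rdfs:Class rdf:about={quoteattr(about)}>",
        f"  <rdfs:label>{escape(cde['label'])}</rdfs:label>",
        f"  <rdfs:comment>{escape(cde['comment'])}</rdfs:comment>",
        f"  <rdfs:subClassOf rdf:resource={quoteattr(parent)}/>",
        "</rdfs:Class>",
    ])


reagent = {
    "label": "Reagent",
    "comment": "Histologic_stain_reagents, tissue_fixation_reagents, "
               "and other chemicals employed in the laboratory.",
    "subClassOf": "Class",
}
element = cde_to_class_element(reagent, "http://www.ldip.org/ldip_sch")
print(element)
```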


    That's all there is to do for classes.  You don't need to list the subclasses of the class because the subclasses will list the class as their father in their own schema entry.  You don't need to list the properties of the class because the properties will list the classes  whose data they describe.


    Do classes in RDF schema remind you of anything?  The classes in an RDF schema comprise an ontology.  In fact, you can think of an ontology as the "classy" half of an RDF schema.  An ontology is a list of classes and their relationships.  Lists of classes become most useful when they have properties. 



    Creating Properties

    A property is a metadata element that is used to describe the data assigned to one or more class objects.  Here is the CDE for a property named "dateTime".



    Identifier:ldip:dateTime

    Property Label:dateTime

    versionInfo: 0.1

    Registration Authority: Association for Pathology Informatics

    Language:en

    Obligation:optional

    Maximum Occurrence: Unlimited

    Datatype: /[\+\-]{1}[\d]{8}\.[\d]{6}Z[\+\-]{1}[\d]{4}/

    comment: ISO 8601 format of date and time.

    domain:Event

    range: http://www.ldip.org/ldip_xsd.xsd#iso8601

    Contributor:Bill Moore

    Date_of_contribution:05-30-2006



    An RDF Schema declaration for the dateTime property might be:


    <rdf:Property rdf:about="http://www.ldip.org/ldip_sch#dateTime">

      <rdfs:label>dateTime</rdfs:label>

      <rdfs:comment>

       The date and time at which an event occurs, in ISO8601 format   

      </rdfs:comment>

      <rdfs:domain rdf:resource="http://www.ldip.org/ldip_sch#Event"/>

      <rdfs:range rdf:resource="http://www.ldip.org/ldip_xsd.xsd#iso8601"/>

    </rdf:Property>



    Let's look at the dateTime property.  The opening tag announces that we are declaring a Property, and its rdf:about attribute gives the name of the Property (dateTime) and its URL (the current RDF schema document).  The next line provides the label by which we will refer to the Property.  This might come in handy if we had different names for the property in different languages.  The comment includes a definition for the element.


    The next line specifies the domain (class) for the property.  The domain of a property is the class for which the property  may be used.  In this case, the domain class for which the dateTime property applies is Event.  This makes sense.  If you need to describe an event, you would want to include the time that the event occurred.  


    A property for a class may serve as a property for all of the subclasses of the class (because all the subclass instances are members of the ancestor class).  Every Property must have a domain (a class or classes for which the Property may be used) and a Range (a specified kind of data that is described by the Property).  A property may have multiple classes in its domain.  When a property has multiple classes in its domain, all the classes in the domain share the same property (obviously).   This achieves some of the functionality of multi-class inheritance without actually needing to instantiate multiple classes under a single object.  This is a subtle concept, and does not need to be mastered at this time.  Suffice it to say that as you create your own RDF Schemas, you should try to design your Properties to apply to multiple classes, and you should try to instantiate objects under a single class.
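
    The inheritance rule for properties can be sketched in Python as a walk up the class hierarchy. Only Event and dateTime come from the text; the Biopsy and Reagent entries, the two lookup tables, and the function names are assumptions made for illustration.

```python
# Hypothetical schema fragments. Each class maps to its parent class, and
# each property maps to the set of classes in its domain.
PARENT = {"Biopsy": "Event", "Event": "Class", "Reagent": "Class"}
DOMAIN = {"dateTime": {"Event"}}


def ancestors(cls):
    """Yield cls followed by each of its ancestor classes."""
    while cls is not None:
        yield cls
        cls = PARENT.get(cls)


def property_applies(prop, cls):
    """True if cls, or one of its ancestors, is in the domain of prop."""
    return any(c in DOMAIN.get(prop, set()) for c in ancestors(cls))


print(property_applies("dateTime", "Biopsy"))   # Biopsy inherits from Event
print(property_applies("dateTime", "Reagent"))  # Reagent is not an Event
```

    Adding a second class to a property's domain set is all it takes to share the property between classes, without instantiating any object under more than one class.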


    Specifying a datatype from within an RDF Schema Property Element

    Let us continue to examine the dateTime property.  Recall that a triple consists of a subject followed by metadata (the property element) followed by the data.  The property element describes the data.  The range of the property element tells us what kind of data is described.  In RDF schemas, the range of a property is often "Literal", an element defined in the W3C RDF Schema document (rdfs:Literal) that covers literal values such as character strings.  You can see immediately that describing the range of a property as a character string does little to constrain or structure the expected values for a data element.


    In the dateTime property, we want the range of the property to be data that conforms to the ISO8601 date/time format.  How do we convey the datatype of the date/time element in RDF?


    RDF defines essentially no datatypes of its own; it relies on datatypes that are defined elsewhere.  So for our property range, we provide a resource (URL) that specifies an element in an .xsd file that defines the datatype we need.


    The range for the dateTime property is a resource:


    <rdfs:range rdf:resource="http://www.ldip.org/ldip_xsd.xsd#iso8601"/>


    The resource points us to an xsd file on the web, and to a particular element within the xsd file, labeled iso8601.  Let's pretend we visit the file and extract the iso8601 element.  We might find the following:



    <xsd:simpleType name='iso8601'>

    <!-- values of a date_time must contain            -->

    <!-- a plus or minus sign occurring zero or one  -->

    <!-- times followed by 8 digits                          -->

    <!-- followed by a period                                  -->

    <!-- followed by 6 digits                                   -->

    <!-- followed by the letter Z, the letter T, or a space -->

    <!-- followed by a plus or minus sign occurring  -->

    <!-- exactly once, followed by 4 digits         -->

      <xsd:restriction base='xsd:string'>

        <xsd:pattern value='[\+\-]?[\d]{8}\.[\d]{6}[ZT ][\+\-]{1}[\d]{4}'/>

      </xsd:restriction>

    </xsd:simpleType>



    The essence of the datatype is found in the pattern value line:


    <xsd:pattern value='[\+\-]?[\d]{8}\.[\d]{6}[ZT ][\+\-]{1}[\d]{4}'/>


    This line uses a Regular Expression (RegEx) that provides a pattern to which the element must conform.  RegEx is beyond the scope of this manuscript. 


    Don't be intimidated by .xsd and Regex rules.  For most purposes, simply describing the range of a property with the RDF syntax-defined element, "Literal" will be all that you need.


    The .xsd element definition imposes a datatype pattern on the value of the data described by the property. A validating software agent would check an RDF document to determine whether the data described by a property conforms to the datatype that the property's range points to in the .xsd resource.
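
    The check that such a validating agent performs can be sketched in Python. The pattern is copied from the iso8601 element as written; the function name conforms and the sample values are assumptions. Python's regular expression dialect is not identical to the one used by XML Schema, but the two agree on the constructs used in this pattern.

```python
import re

# The pattern from the iso8601 element, copied as written in the text.
ISO8601_PATTERN = r"[\+\-]?[\d]{8}\.[\d]{6}[ZT ][\+\-]{1}[\d]{4}"


def conforms(value, pattern=ISO8601_PATTERN):
    """True if the whole value matches the pattern.

    An XSD pattern must match the entire value, so re.fullmatch is used
    rather than re.search.
    """
    return re.fullmatch(pattern, value) is not None


# Made-up sample values for illustration.
for sample in ["+20060628.143000Z-0500", "20060628.143000T+0000", "2006-06-28"]:
    print(sample, conforms(sample))
```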


    If we want to employ this trick, we'll need to prepare an XSD file that contains elements for all the datatypes referred to by the property ranges included in our RDF Schema.


    Creating the external XSD document to datatype our property ranges


    XSD datatype files are very easy to prepare.  Basically, you just list your datatypes and provide descriptors.  The following generic file contains samples of the kinds of datatypes you will probably need (patterns, inclusive bounds, unions, and enumerations).


    <?xml version="1.0" encoding="UTF-8"?>

    <xsd:schema

    xmlns:xsd="http://www.w3.org/2001/XMLSchema">


    <xsd:simpleType name='sp_pattern'>

    <!-- values of an accession number must contain -->

    <!-- the letters sp followed by a hyphen followed    -->

    <!-- by two digits followed by a hyphen               -->

    <!-- followed by one or more digits                    -->

      <xsd:restriction base='xsd:string'>

        <xsd:pattern value='sp\-[0-9]{2}\-[0-9]+'/>

      </xsd:restriction>

    </xsd:simpleType>



    <xsd:simpleType name="adult_age">

    <!-- an adult is at least 18 years old                 -->

      <xsd:restriction base="xsd:positiveInteger">

         <xsd:minInclusive value="18"/>

      </xsd:restriction>

    </xsd:simpleType>


    <xsd:simpleType name="serialNumber">

      <!--  may be either integers or mixed alphanumeric strings -->

      <xsd:union>

         <xsd:simpleType>

           <xsd:restriction base='xsd:integer'/>

         </xsd:simpleType>

         <xsd:simpleType>

           <xsd:restriction base='xsd:string'/>

         </xsd:simpleType>

      </xsd:union>

    </xsd:simpleType>


    <xsd:simpleType name="EnumerationObjectives">

    <!--  must be one of the listed objective magnifications -->

      <xsd:restriction base="xsd:string">

        <xsd:enumeration value="2.5x"/>

        <xsd:enumeration value="6.3x"/>

        <xsd:enumeration value="20x"/>

        <xsd:enumeration value="40x"/> 

        <xsd:enumeration value="100x"/>

      </xsd:restriction>

    </xsd:simpleType>

    </xsd:schema>
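
    To make the four kinds of datatype concrete, here is a rough Python equivalent of each one. This is a sketch for illustration only, not a substitute for a schema validator; the function names are assumptions, and the age check accepts plain digit strings only, which is slightly narrower than what positiveInteger allows.

```python
import re

OBJECTIVES = {"2.5x", "6.3x", "20x", "40x", "100x"}


def is_sp_pattern(value):
    """Pattern: 'sp', hyphen, two digits, hyphen, one or more digits."""
    return re.fullmatch(r"sp\-[0-9]{2}\-[0-9]+", value) is not None


def is_adult_age(value):
    """Inclusive lower bound: a whole number of at least 18."""
    return re.fullmatch(r"[0-9]+", value) is not None and int(value) >= 18


def is_serial_number(value):
    """Union of integer and string: the string member accepts any text."""
    return isinstance(value, str)


def is_objective(value):
    """Enumeration: the value must be one of the listed strings."""
    return value in OBJECTIVES


print(is_sp_pattern("sp-06-12345"), is_adult_age("17"), is_objective("40x"))
```

    The union example shows why unions deserve care: once a plain string is one of the member types, the datatype no longer constrains the value at all.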


    The differences between Classes and Properties

    The most difficult step in building any schema is determining whether a candidate element is a Class or a Property.  Generalizations do not hold for all cases.  For example, Classes tend to be nouns, while Properties (that describe data) tend to be adjectives.  However, a Property can be a noun (e.g. Time) if its role is to describe a data value (4:00 PM EST).  Furthermore, we sometimes assign active processes to classes (e.g. birth, death), and we cannot assume that classes are always static objects.


    There is a strong tendency to assign subclass status to things that are not examples of their ancestral class.  For instance, if Person is a class, someone may think that Leg is a subclass of Person (because a Leg is in a class of things that are parts of a Person).  No! Leg is never a subclass of Person because a Leg is not a Person.  A subclass of Person must be composed of types of Persons.  So, Patient is a subclass of Person, and Pathologist is a subclass of Person, because they are both examples of Persons and because there are instances of Patients and instances of Pathologists.  Remember, a class is a construct whose chief job is to provide specified instances.


    How about Friend?  Is Friend a subclass of Person?  Yes and no.  Friend can be a subclass of Person if you want to organize Persons based on whether they are Friends or not-Friends.  However, if you think that being a friend is just one of many features of any Person, you would be much better off defining friend as an RDF property. The datatype of the friend property may be a Boolean (true or false).


    Here are some general recommendations for distinguishing RDF Schema Classes and Properties.