

Implementing an RDF Schema for Pathology Images


Title

  Jules J. Berman and G. William Moore
  Workshop
  Implementing an RDF Schema for Pathology Images,
  from the Association for Pathology Informatics.
  APIII, Pittsburgh, PA
  September 10, 2007
  Copyright 2007 Jules J. Berman
  Distributed under GNU Free Documentation license
  http://www.gnu.org/licenses/fdl.txt

How to read this document

This document is written for people who need to annotate their photomicrographs in a manner that binds descriptive data to the image, so that:

1. Collections of photomicrographs can be searched based on their descriptive content or by their image content or both.

2. Individual images can be sent to colleagues, and the person who receives the image can extract from the image file the descriptive text that the sender included with the image.

3. After inserting text inside an image, the person who prepared the image can be certain that years later, after all the clinical and pathologic details associated with the image have been long forgotten, the image will still provide this information.

4. The data included in the image can be prepared in a standard form that is computer parsable and understandable to software agents that search files on the Web.

This document does not contain software applications, nor does it recommend any software applications. Instead, we provide very short scripts (usually about 5 to 20 lines in length) written in Perl or Ruby, that perform the computational tasks described in the text. These scripts are so simple and so short that if you program in another language (e.g., Python, Java, or C), you should have no trouble converting the scripts to your preferred language.

This document is written with the assumption that if you want to achieve self-reliance in pathology informatics, you must acquire some minimal programming skills.

For those readers who want to know the best ways of structuring the data they include with their image, this document provides in-depth discussion of RDF (Resource Description Framework). RDF is a simple technique for bundling all data as triples (the data object, plus a piece of metadata that describes the data value, plus the value assigned to the data object). This simple but powerful technique allows data triples to be shared among heterogeneous datasets and is the basis for the so-called "Semantic Web."

The document arranges techniques by level of difficulty (Levels 1 through 6). Level 1 describes the simplest method for conveying text in images (inserting a free-text description in the header of a JPEG file). Beginners can stop reading after Level 1 and resume reading at a later date, if so inclined.

Conveying textual information with photomicrographs

Pathology images have no value unless they are annotated with information that describes the image.

Important descriptors of an image might include:

  File information
  Image capture information
  Image format information
  Specimen information
  Patient information
  Pathology  information
  Region of interest information

The Laboratory Digital Imaging Project (2004-2007)

The API (Association for Pathology Informatics) wishes to provide anyone using pathology image data with optional methods for annotating any kind of pathology image, in any image format the user prefers.

The API was not interested in creating yet another new standard that obligates people to use a particular image format for their pathology images.

The API sponsored the Laboratory Digital Imaging Project (LDIP) to provide free and open methods for specifying image data that could be used with existing standard image formats (such as DICOM or JPEG).

LDIP, which ran from 2004 to 2007, consisted of API members and imaging software developers.

Conference calls were conducted from 2004 to 2006, and the minutes of the discussions are available:

  • Instructions for submitting CDEs to LDIP [May 8, 2006, ldip_cde.txt]
  • Minutes of April 28, 2006 LDIP Conference Call [4-28-06.txt]
  • Minutes of the March 31, 2006 LDIP Conference Call [03-31-06.txt]
  • Minutes of the February 24, 2006 LDIP Conference Call [2-24-06.txt]
  • Minutes of the January 27, 2006 LDIP Conference Call [1-27-06.txt]
  • Minutes of the December 30, 2005 LDIP Conference Call [12-30-05.txt]
  • Minutes of the September 30, 2005 LDIP Conference Call [9-30-05.txt]
  • Minutes of the July 29, 2005 LDIP Conference Call
  • Minutes of the June 24, 2005 LDIP Conference Call [6-24-05.txt]
  • Dr. Jules Berman's presentation at the API session of Laboratory InfoTech Summit in Las Vegas (March 4, 2005)
  • Minutes of the Feb. 25, 2005 LDIP Conference Call
  • Minutes of the January 28, 2005 LDIP Conference Call
  • Minutes of the December 17, 2004 LDIP Conference Call
  • LDIP Blog for Lab InfoTech Summit, December 7, 2004, by Berman JJ and Friedman BA
  • Minutes of the November 19, 2004 LDIP Conference Call
  • Minutes of the November 2, 2004 LDIP Conference Call with the Veterans Administration DICOM implementers
  • Minutes of the October 29, 2004 LDIP Conference Call
  • Draft of LDIP task document (Updated October 29, 2004)
  • Minutes of second LDIP open workshop (October 6, 2004)
  • Dr. Jules Berman's presentation at APIII LDIP open workshop (October 6, 2004)
  • LDIP Charter document (draft revised January 25, 2005)
  • Minutes of first LDIP Conference Call (May 3, 2004)

    In 2007, after much discussion, the API Council determined that adequate methods for specifying image-related data already existed. What was needed were general instructions for image annotation and a few simple software scripts that could parse, insert, extract, port, and interchange image annotations.

    LDIP was dissolved, and the API Council accepted the primary goal of providing the field of pathology informatics with a document that describes available open annotation methods.

    As a secondary goal, the API would provide a very short RDF Schema that would permit those who prefer RDF annotations to type their metadata under general classes and properties that have particular relevance to pathologists (more about this later).

    This document is the current draft of instructions for image annotation.

    Increasing levels of complexity

    Level 1 - Free-text description inserted into image header

    Level 1. Compose a free-text description of your image, along with any other information you'd like to add, such as your name, and add this text as a Comment field in the header of the image file. The Comment will not alter the binary content of the image or the visual form of the image.

    When the file is copied, it will retain the header comment, and anyone receiving the image can read what you've added, using a simple Perl or Ruby script provided in the document, or using a simple extraction program prepared in any

  • How to do Level 1 - Free-text description inserted into image header
  • Level 2 - Dublin Core descriptors inserted into image header

    The Dublin Core is basic information designed by librarians to provide a minimal set of data to describe the contents of an electronic document. When the file is copied, it will retain the Dublin Core metadata, and anyone receiving the image can read what you've added, using a simple Perl or Ruby program provided in the document, or using a simple extraction program prepared in any

  • How to do Level 2 - Dublin Core descriptors inserted into image header
  • Level 3 - Insert RDF document into image file

    Insert an RDF (Resource Description Framework) document into your image file.

    The RDF document can be extracted, and the triples in the document can be extracted and integrated with other data.

  • How to do Level 3 - Insert RDF document into image file
  • Level 4 - Insert your image into RDF document

    [Figure: Image file with RDF document inserted into image file header]

    Almost all popular image formats contain "header" sections that are not part of the actual image binary. The header sections contain information that is used by image viewing software to properly display the image. Robust imaging software applications are written with subroutines that parse through the different headers of images and extract information such as the height, width, pixel number, pixel size, pixel color, color map index, and so on. Some headers are extensible, allowing software to insert blocks of text into the header without changing the image binary.

  • How to do Level 4 - Insert your image into an RDF document
  • Level 5 - Point to your image file from an RDF document.

    The RDF document and the image file (for example, jpeg) can be separate documents linked by URLs.

  • How to do Level 5 - Point to your image file from an RDF document.
  • Level 6 - Combining of 1 through 5, with multiple images/annotations

    Break up your annotative data and your image binaries into multiple documents that can be pointed to from any of the files, and that can exclude or include RDF or image binary data, as desired.

    The RDF data can be distributed into multiple documents, and each RDF document may point to more than one image file.

  • How to do Level 6 - Combining of 1 through 5, with multiple images/annotations

  • How to do it

    How to acquire Perl and/or Ruby

    Perl and Ruby are free, open source software and can be downloaded from multiple web sites. Linux downloads for either language are ubiquitous. Perl is distributed with most Linux operating system packages.

    For Windows users, if you use Perl, get a free installation from ActiveState at:

      http://www.activestate.com/

    For Ruby, go to:

      http://rubyforge.org/frs/?group_id=167

    Most of the scripts in this document, and many other medical-related Perl and Ruby scripts, are available in Jules Berman's previously published books:

    Biomedical Informatics
    Perl Programming for Medicine and Biology
    Ruby for Medicine and Biology

    How to view a jpeg image

    How to view a jpeg image in Perl

    Here is a short script that lets you look at any JPEG image. In this script, the image that's used is leaf.jpg.

      #!/usr/local/bin/perl
      use Tk;
      use Tk::JPEG;
      my $mw  = MainWindow->new();
      my $file = "c\:\\ftp\\leaf\.jpg";
      my $image = $mw->Photo(-file => $file);
      $mw->Label('-image' => $image, -height=>500, -width=>600)->pack;
      #$mw->Label(-image => $image)->pack();
      MainLoop;
      exit;
    

    How to view a jpeg image in Ruby

    Preliminaries

    In Ruby, you need to install ImageMagick, Tk and RMagick if you want to view and modify images.


    ImageMagick is free software for modifying images.

    An important feature of ImageMagick is that many popular programming languages, including Ruby, provide interfaces to ImageMagick.

    RMagick is Ruby's interface to ImageMagick.

    Tk is a free toolkit for creating GUIs (graphical user interfaces). Tk employs widgets (small windows within the Tk window) for input and display structures.

    If you install ImageMagick, RMagick and Tk onto your computer, you can "require" them into your Ruby scripts and create applications that create, modify, evaluate, and display images. All three applications are available at no cost for users of Windows or Linux/Unix operating systems. Ample instruction is available at the web sites, below. Here are some suggestions for Windows users:


    For Windows users, a good way to install RMagick is to follow these instructions.
    Go to the RubyForge site and download the combined RMagick and ImageMagick binaries.

    1. Go to the RubyForge site:

    http://rubyforge.org/frs/?group_id=12&release_id=8170

    This page has a combined win32 binary package for RMagick and ImageMagick

    Pick the binary that is appropriate for your version of Ruby.

    I use Ruby 1.8.4, so I chose the following binary:

    rmagick-1.13.0-IM-6.2.9-0-win32.zip 12.39 MB

    2. Download the binary (zip file) and expand it.

    This produces the subdirectory:

    rmagick-1.13.0-IM-6.2.9-0-win32

    The subdirectory contains a group of files:

    ImageMagick-6.2.9-0-Q8-windows-dll.exe
    README-RMAGICK.html
    README-RMAGICK.txt
    README.html
    rmagick-1.13.0-win32.gem

    3. Run the ImageMagick .exe file, and it will guide you through its installation.

    4. After ImageMagick is installed, you can install the RMagick gem file by invoking Ruby's gem tool with an install command followed by the name of the gem file (add the full path to the gem file if you are not installing from its current subdirectory)

    c:\ftp>gem install rmagick-1.13.0-win32.gem

    5. All the information you need to start using RMagick from within your own Ruby Scripts is found at:
    http://www.simplesystems.org/RMagick/doc/

    Then install Tcl/Tk by visiting ActiveState and downloading the ActiveTcl binary for Windows users.

    You've finished your installations. Now you're ready to write Ruby scripts that use and display images.

    Displaying a JPEG image in Ruby

    You can load a JPEG file with RMagick and display it in a window with Tk.

    The figure below displays a jpeg image of a leaf (obtained from Wikipedia as a public domain file: http://en.wikipedia.org/wiki/Image:Leaf_1_web.jpg, and renamed leaf.jpg for this page).


    The image of the leaf produced by the Ruby script, and the Ruby script source code that displays the image, are shown below:
      #!/usr/local/bin/ruby
      require 'RMagick'
      include Magick
      leaf = ImageList.new("leaf.jpg").resize!(0.7)
      leaf_copy = leaf.write("leaf.gif")
      require 'tk'
      root = TkRoot.new {title "view"}
      TkButton.new(root) do
        image TkPhotoImage.new{file "leaf.gif"}
        command {exit}
        pack
      end
      Tk.mainloop
      exit
    




    Viewing a JPEG image in Ruby

    Manipulating an image in Ruby


    #!/usr/local/bin/ruby
    #leaf3.rb
    #
    #This Ruby script was created by Jules J. Berman on 7/8/2007
    #and is provided as a public domain document
    #
    #The software is provided "as is", without warranty of any kind,
    #express or implied, including but not limited to the warranties
    #of merchantability, fitness for a particular purpose and
    #noninfringement. in no event shall the authors or copyright
    #holders be liable for any claim, damages or other liability,
    #whether in an action of contract, tort or otherwise, arising
    #from, out of or in connection with the software or the use or
    #other dealings in the software.
    #
    require 'RMagick'
    include Magick
    orig_leaf = ImageList.new("leaf.jpg").resize!(0.4)
    orig_leaf.write("orig.gif")
    leaf = ImageList.new("leaf.jpg").first.crop(50, 310, 300, 300).resize!(0.4)
    leaf.write("new.gif")
    require 'tk'
    root = TkRoot.new {title "view"}
    TkButton.new(root) do
      image TkPhotoImage.new{file "orig.gif"}
      command {exit}
      pack
    end
    TkButton.new(root) do
      image TkPhotoImage.new{file "new.gif"}
      command {exit}
      pack
    end
    Tk.mainloop
    exit
    





    The image of the leaf sits atop the same image, modified by the "crop" method, which extracts a section of the image. The example sends the crop method to the image object, with arguments (50, 310, 300, 300).

    The syntax of the crop method is: crop(x, y, width, height)

    x and y are the offsets from the upper left corner of the image.
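
    To make the crop arguments concrete, here is a minimal pure-Ruby sketch that applies the same (x, y, width, height) convention to a small grid of numbers standing in for pixels. It does not use RMagick; the crop_grid helper and the sample grid are hypothetical and serve only as an illustration.

```ruby
# A tiny "image": 4 rows of 6 pixels, where each pixel is just a number.
grid = [
  [ 0,  1,  2,  3,  4,  5],
  [10, 11, 12, 13, 14, 15],
  [20, 21, 22, 23, 24, 25],
  [30, 31, 32, 33, 34, 35]
]

# Hypothetical helper: x and y are offsets from the upper left corner;
# width and height give the size of the region to keep.
def crop_grid(grid, x, y, width, height)
  grid[y, height].map { |row| row[x, width] }
end

# Keep a region 3 pixels wide and 2 pixels high, starting 2 pixels
# from the left edge and 1 pixel from the top edge.
cropped = crop_grid(grid, 2, 1, 3, 2)
cropped.each { |row| puts row.join(" ") }
```

    The printed rows are the second and third rows of the grid, restricted to the third through fifth columns.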

    How to do Level 1 - Free-text description inserted into image header

    Perl

    Preliminaries

    Download the external module Image::MetaData::JPEG using the Perl Package Manager (if ActiveState Perl is installed on your system, simply enter ppm at the command line and follow the instructions provided by the package manager client).

    Perl example

    Perl script, meta_jpg.pl, to show how metadata can be added to a jpeg file.

      #!/usr/local/bin/perl
      use strict;
      use warnings;
      use Image::MetaData::JPEG;
      my $filename = "leaf.jpg"; #comment:your filename here
      my $file = new Image::MetaData::JPEG($filename);
      die 'Error: ' . Image::MetaData::JPEG::Error() unless $file;
      print "Description of JPEG file\n";
      print $file->get_description();
      # add a free-text comment to the JPEG header and save the file
      my $line = "My this is a nice image of a leaf";
      $file->add_comment($line);
      unlink $filename;
      $file->save($filename);
      # reopen the saved file and print every comment found in its header
      print "\n\nComments in JPEG file\n\n";
      $file = new Image::MetaData::JPEG($filename);
      die 'Error: ' . Image::MetaData::JPEG::Error() unless $file;
      my @comments = $file->get_comments();
      print join("",@comments);
      exit;
      
      
      The comment "My this is a nice image of a leaf" was added 
      to the header of the JPEG file, leaf.jpg
    

    Ruby

    Ruby examples

    Inserting character strings into a JPEG header

    Here is a Ruby script that inserts a Comment and a Label into the JPEG header.

      #!/usr/local/bin/ruby
      require 'RMagick'
      include Magick
      walnut = ImageList.new("c\:\\ftp\\rb\\CT4192~1.JPG")
      walnut.cur_image[:Label] = "hello"
      walnut.cur_image[:Comment] = "<html><title>me</title></html>"
      walnut.properties{|name, value| print "#{name} #{value}\n"}
      walnut_copy = ImageList.new
      walnut_copy = walnut.cur_image.copy
      walnut_copy.write("c\:\\ftp\\rb\\out.JPG")
      walnut_copy.properties{|name, value| print "#{name} #{value}\n"}
      exit
      
      Output:
      Comment <html><title>me</title></html>
      JPEG-Colorspace 2
      JPEG-Sampling-factors 2x2,1x1,1x1
      Label hello
      Comment <html><title>me</title></html>
      JPEG-Colorspace 2
      JPEG-Sampling-factors 2x2,1x1,1x1
      Label hello
    

    Inserting a text file into a JPEG header

    Here is the sample text for a text file.

    "The image is a squamous cell carcinoma of the floor of the mouth. It was taken by Jules Berman, on February 2, 2002. The microscope was an Olympus model 3453. The lens objective was 40x. The camera was a Sony model 342. The image is jpeg and has dimensions of 524 by 429 pixels. The microscope and camera were not calibrated. The specimen is Baltimore Hospital Center S-3456-2001, specimen 2, block 3. The specimen was logged in 8/15/01 and processed using the standard protocol for H&E that was in place for that day. The patient is Sam Someone, medical identifier 4357. The tissue was received in formalin. The specimen shows a moderately differentiated, invasive squamous cell carcinoma. The patient has a 30 year history of oral tobacco use. The image is kept in a jpeg file named y49w3p2.jpg and kept in the pathology subdirectory of the hospital's server. Its URL is https://baltohosp.org/pathology/y49w3p2.jpg. The image file has an md_5 hash value of 84027730gjsj350489. The image has no watermark. Copyright is held by Baltimore Hospital Center, and all rights are reserved."

    You can put this text into a file, named "addtext.txt" and use the jpeg_add.rb Ruby script to do the insertion.

      jpeg_add.rb, inserts plain-text file addtext.txt into the JPEG image gwmbw.jpg
      #!/usr/local/bin/ruby
      require 'RMagick'
      include Magick
      # read the whole text file into one string
      text = IO.read("addtext.txt")
      orig_image = ImageList.new("gwmbw.jpg")
      # attach the text to the image as its Comment property
      orig_image.cur_image[:Comment] = text
      print "\nComment added, let's make a file to hold the modifications\n\n"
      # write a copy so that the original file is left untouched
      copy_image = orig_image.cur_image.copy
      copy_image.write("c\:\\ftp\\rb\\gwmout.JPG")
      # list the properties of the copy to confirm the comment is present
      copy_image.properties{|name, value| print "#{name}\n#{value}\n"}
      exit
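
    For readers curious about what happens inside the file, the following is a minimal sketch, in plain Ruby with no external libraries, of how a comment can be stored in a JPEG header. It assumes the standard JPEG layout: a 2-byte start-of-image marker (0xFFD8), followed by header segments that each carry a 2-byte marker and a 2-byte length, with comments held in segments marked 0xFFFE. The method names com_segment, insert_comment and read_comments are hypothetical; they are not part of RMagick or Image::MetaData::JPEG, and the sketch ignores complications that those libraries handle (very long comments, unusual markers).

```ruby
# Build a comment segment: marker 0xFFFE, a 2-byte big-endian length
# that counts itself plus the text, then the text bytes.
def com_segment(text)
  bytes = text.b
  raise ArgumentError, "comment too long for one segment" if bytes.bytesize > 65_533
  "\xFF\xFE".b + [bytes.bytesize + 2].pack("n") + bytes
end

# Insert the comment segment directly after the start-of-image marker.
def insert_comment(jpeg, text)
  jpeg = jpeg.b
  raise ArgumentError, "not a JPEG" unless jpeg.start_with?("\xFF\xD8".b)
  jpeg[0, 2] + com_segment(text) + jpeg[2..-1]
end

# Walk the header segments and collect every comment, stopping at the
# start-of-scan marker (0xFFDA), where the compressed image data begins.
def read_comments(jpeg)
  jpeg = jpeg.b
  comments = []
  pos = 2
  while pos + 4 <= jpeg.bytesize && jpeg.getbyte(pos) == 0xFF
    marker = jpeg.getbyte(pos + 1)
    break if marker == 0xDA
    length = jpeg[pos + 2, 2].unpack1("n")
    comments << jpeg[pos + 4, length - 2] if marker == 0xFE
    pos += 2 + length
  end
  comments
end

# Demonstration on a minimal stand-in for a JPEG header (not a viewable
# image): start-of-image, one short application segment, start-of-scan.
fake = "\xFF\xD8\xFF\xE0\x00\x04ab\xFF\xDA".b
tagged = insert_comment(fake, "nice image of a leaf")
puts read_comments(tagged).inspect
```

    Everything after the start-of-image marker is copied unchanged, which is why adding a comment does not alter the image itself.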
    

    How to do Level 2 - Dublin Core descriptors inserted into image header

    Dublin Core metadata is a special kind of text.

    There are a few problems with simply writing free-text descriptions of your images. Although you might think you've written an adequate description of your image, the likelihood is that you have forgotten to include important information about the file.

    The Dublin Core consists of about 15 data elements, selected by a group of librarians, that specify the kind of file information a librarian might use to describe a file, index the described file, and retrieve files based on included information.

    There are many publicly available documents that describe the Dublin Core elements:

    http://www.ietf.org/rfc/rfc2731.txt

    The Dublin Core elements can be inserted into HTML documents, simple XML documents, or RDF documents. A public document explains exactly how the Dublin Core elements can be used in these file formats:

    http://dublincore.org/documents/usageguide/#rdfxml

    An example of a very simple Dublin Core file description in RDF format is shown below:

      <?xml version="1.0" encoding="UTF-8"?>
      <rdf:RDF 
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="http://www.julesberman.info/rubydisp.htm">
          <dc:creator>Jules Berman</dc:creator>
          <dc:title>Ruby Programming for Medicine and Biology, Jules J. Berman</dc:title>
          <dc:description>Provides instructions for displaying an image in Ruby</dc:description> 
          <dc:date>2007-07-01</dc:date>
      </rdf:Description> 
      </rdf:RDF>

    As you can see, RDF is just a dialect of XML, and it is quite easy to read RDF files without any special instruction. In a later section, we will be explaining the logic and syntax of RDF. For now, all you need to know is:

      The basic features of any file can be described using 
      the Dublin Core common data elements.
     
      These data elements can be represented in HTML, XML or RDF.
      Because HTML, XML and RDF are text files, they can be inserted into
      image files using the techniques described for text strings or text
      files (Level 1).
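
    As an illustration, here is a minimal Ruby sketch that assembles a Dublin Core description in RDF format from a list of element names and values. The resulting text could then be inserted into an image header with the Level 1 techniques. The helper names xml_escape and dublin_core_rdf are hypothetical, and the sketch assumes that each value is plain text.

```ruby
# Escape the characters that have special meaning in XML.
def xml_escape(text)
  text.to_s.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;").gsub('"', "&quot;")
end

# Build an RDF description of one resource from a hash of Dublin Core
# element names (such as "title") and their values.
def dublin_core_rdf(about, elements)
  lines = []
  lines << '<?xml version="1.0" encoding="UTF-8"?>'
  lines << '<rdf:RDF'
  lines << '   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
  lines << '   xmlns:dc="http://purl.org/dc/elements/1.1/">'
  lines << %(<rdf:Description rdf:about="#{xml_escape(about)}">)
  elements.each do |name, value|
    lines << "    <dc:#{name}>#{xml_escape(value)}</dc:#{name}>"
  end
  lines << '</rdf:Description>'
  lines << '</rdf:RDF>'
  lines.join("\n") + "\n"
end

elements = {
  "creator"     => "Jules Berman",
  "title"       => "Ruby Programming for Medicine and Biology, Jules J. Berman",
  "description" => "Provides instructions for displaying an image in Ruby",
  "date"        => "2007-07-01"
}
rdf = dublin_core_rdf("http://www.julesberman.info/rubydisp.htm", elements)
puts rdf
```

    Escaping matters for pathology text in particular, because a value such as "H&E" contains a character that would otherwise make the document unparsable.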

    How to do Level 3 - Insert RDF document into image file

    It is easy to insert an RDF document into the header of a jpeg image file, and it is just as easy to extract the RDF triples. Here's how you do it:

      1. Prepare your RDF document.
      2. Use a script to insert the document into your jpeg header.
      3. Use the new jpeg file (now with RDF comments) to display the 
           image or to send to colleagues.  When displayed, it will look 
           exactly like the file before the contents of the RDF document 
           were added.
      4. Use another script to extract the comments from the header of 
           the jpeg file, as needed.

    The following Perl script will take the jpeg image ldip2103.jpg and add the RDF document from the first use-case to its header. After saving the modified image file, the script extracts and displays the contents of the RDF document.

      #!/usr/local/bin/perl
      use strict;
      use warnings;
      use Image::MetaData::JPEG;
      my $filename = "ldip2103.jpg"; #comment:your filename here
      my $file = new Image::MetaData::JPEG($filename);
      die 'Error: ' . Image::MetaData::JPEG::Error() unless $file;
      print "Description of JPEG file\n";
      print $file->get_description();
      print "\n\nRDF Annotations to JPEG file\n\n";
      open (TEXT, "rdf_desc.xml")||die"cannot"; #the rdf document you'll add
      # add each line of the RDF document as a comment; the loop ends at end-of-file
      while (my $line = <TEXT>)
         {
         $file->add_comment($line);
         }
      close TEXT;
      unlink $filename;
      $file->save($filename);
      # reopen the saved file, then extract and print the comments
      $file = new Image::MetaData::JPEG($filename);
      die 'Error: ' . Image::MetaData::JPEG::Error() unless $file;
      my @comments = $file->get_comments();
      print join("",@comments);
      exit;
    

    This Perl script requires the freely available open source module, Image::MetaData::JPEG. You can download this module from CPAN (Comprehensive Perl Archive Network, www.cpan.org).

    The last few lines extract and print the RDF file from the image.

    This Perl script is functionally equivalent to the Ruby script, jpeg_add.rb, used in Level 1 to insert a text file into a jpeg image.

    What exactly is an RDF file?

    Introduction to RDF

    The Resource Description Framework (RDF) provides a simple method for specifying information as data triples. The authors believe that much of the time and expense associated with developing and deploying data standards can be eliminated by a consistent implementation of recommended RDF data specification practices.

    Necessary background subjects:

    
      1. Meaning in informatics
      2. Triples
      3. Identifiers
      4. Datatyping
      5. Classes and Properties
      6. Instantiating Classes

    Necessary informatics techniques:

      1. RDF syntax (specifying data as class instance-property-data triples)
      2. RDF schema (formal dictionary for classes and properties)
      3. XSD (to constrain data to a defined datatype)

    The only implementation tools you really need are your head and a text editor such as notepad or emacs.

    Background

    The definition of Meaning

    In informatics, assertions have meaning whenever a pair of metadata and data (the descriptor for the data and the data itself) is assigned to a specific subject.

    Triples consist of: Specified subject, then metadata, then data.

    Some triples found in a medical dataset

      "Jules Berman" "blood glucose level" "85"
      "Mary Smith" "blood glucose level" "90"
      "Samuel Rice" "blood glucose level" "200"
      "Jules Berman" "eye color" "brown"
      "Mary Smith" "eye color" "blue"
      "Samuel Rice" "eye color" "green"

    Some triples found in a haberdasher's dataset

      "Juan Valdez" "hat size" "8"
      "Jules Berman" "hat size" "9"
      "Homer Simpson" "hat size" "9"
      "Homer Simpson" "hat_type" "bowler"
      
      
    Triples collected from both datasets whose subject is "Jules Berman"
      "Jules Berman" "blood glucose level" "85"
      "Jules Berman" "eye color" "brown"
      "Jules Berman" "hat size" "9"
      
    Triples can port their meaning between different databases because they 
    bind described data to a specified subject. This supports data integration
    of heterogeneous data and facilitates the design of software agents.  A 
    software agent, as used here, is a program that can interrogate multiple 
    RDF documents on the web, initiating its own actions based on inferences 
    yielded from retrieved triples. 
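
    The following minimal Ruby sketch illustrates this portability, using a few of the triples listed above. Each triple is held as a three-element array; that representation and the variable names are assumptions made for the illustration.

```ruby
# Each triple is [subject, metadata, data].
medical = [
  ["Jules Berman", "blood glucose level", "85"],
  ["Mary Smith",   "blood glucose level", "90"],
  ["Jules Berman", "eye color",           "brown"]
]
haberdasher = [
  ["Juan Valdez",  "hat size", "8"],
  ["Jules Berman", "hat size", "9"]
]

# Merging the datasets is simple concatenation, because every triple
# carries its own subject and its own metadata.
all_triples = medical + haberdasher

# Collect everything asserted about one subject.
about_jules = all_triples.select { |subject, _, _| subject == "Jules Berman" }
about_jules.each { |s, m, d| puts %("#{s}" "#{m}" "#{d}") }
```

    No knowledge of either dataset's original structure is needed to combine them, which is the property that makes triples useful for data integration.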
                   
    RDF (Resource Description Framework) is a syntax for writing computer-parsable
    triples.  For RDF to serve as a general method for describing data objects,
    we need to answer the following four questions:
      1. How does the triple convey the unique identity of its subject?  
         In the triple, "Jules Berman" "blood glucose level" "85", the 
         name "Jules Berman" is not unique and may apply to several different 
         people.
      2. How do we convey the meaning of metadata terms?  Perhaps one person's
         definition of a metadata term is different from another person's.  
         For example, is "hat size" the diameter of the hat, or the distance 
         from ear to ear on the person who is intended to wear the hat, or a 
         digit selected from a pre-defined scale?
      3. How can we constrain the values described by metadata to a specific 
         datatype?  Can a person have an eye color of 8?  Can a person have 
         an eye color of "chartreuse"?
      4. How can we indicate that a unique object is a member of a class and 
         can be described by metadata shared by all the members of a class?

    Much of the remainder of the background section will be devoted to answering these four questions.

    Introduction to RDF syntax: RDF triples

    RDF is a specialized XML syntax for creating computer-parsable files consisting of triples. The subject of the RDF triple is invoked with the rdf:about attribute. Following the subject is a metadata/data pair.

    Let us create an RDF triple whose subject is the jpeg image file specified as: http://www.the_url_here.org/ldip/ldip2103.jpg. The metadata is <dc:title> and the data value is "Normal Lung".

      <rdf:Description                     
          rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
          <dc:title>Normal Lung</dc:title>
        </rdf:Description>
      
      An example of three triples in proper RDF syntax is:
      <rdf:Description                     
          rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
          <dc:title>Normal Lung</dc:title>
        </rdf:Description>
      <rdf:Description                     
          rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">   
          <dc:creator>Bill Moore</dc:creator>
        </rdf:Description>
      <rdf:Description                     
          rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
          <dc:date>2006-06-28</dc:date>
        </rdf:Description>
      
    RDF permits you to collapse multiple triples that apply to a single 
    subject.  The following rdf:Description statement is equivalent to the 
    three prior triples:
      <rdf:Description                     
          rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
          <dc:title>Normal Lung</dc:title>
          <dc:creator>Bill Moore</dc:creator>
          <dc:date>2006-06-28</dc:date>
       </rdf:Description>
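
    The equivalence between the collapsed form and three separate triples can be sketched in Ruby. The code below is a deliberately naive illustration that uses pattern matching rather than a real RDF parser; the method name triples_from is hypothetical, and the sketch assumes that every property is a simple element containing only text.

```ruby
rdf = <<~RDF
  <rdf:Description
      rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
      <dc:title>Normal Lung</dc:title>
      <dc:creator>Bill Moore</dc:creator>
      <dc:date>2006-06-28</dc:date>
  </rdf:Description>
RDF

# Find each rdf:Description block, take its rdf:about value as the
# subject, then read every simple child element as a metadata/data pair.
def triples_from(rdf)
  triples = []
  rdf.scan(%r{<rdf:Description\s+rdf:about="([^"]*)"\s*>(.*?)</rdf:Description>}m) do |subject, body|
    body.scan(%r{<([\w.-]+:[\w.-]+)>([^<]*)</\1>}) do |metadata, data|
      triples << [subject, metadata, data.strip]
    end
  end
  triples
end

triples = triples_from(rdf)
triples.each { |t| puts t.inspect }
```

    The single description yields three triples, each with the same subject.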

    An example of a short but well-formed RDF image specification document is:

      <?xml version="1.0"?>
      <rdf:RDF 
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
        <rdf:Description                     
          rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
          <dc:title>Normal Lung</dc:title>
          <dc:creator>Bill Moore</dc:creator>
          <dc:date>2006-06-28</dc:date>
        </rdf:Description>
      </rdf:RDF>

    The first line tells you that the document is XML. The second line tells you that the XML document is an RDF resource. The third and fourth lines declare the namespaces that are referenced within the document (more about this later). Following that is the RDF statement that we have already seen.
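
    As a brief preview, a namespace declaration lets a short prefixed name such as dc:title stand for a longer, globally unique identifier, formed by joining the namespace address to the local name. The Ruby sketch below shows that expansion; the method name expand is hypothetical, and the table simply copies the two declarations from the document above.

```ruby
# Prefix-to-namespace table, copied from the xmlns declarations.
namespaces = {
  "rdf" => "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  "dc"  => "http://purl.org/dc/elements/1.1/"
}

# Expand a prefixed name such as "dc:title" into its full identifier.
def expand(qname, namespaces)
  prefix, local = qname.split(":", 2)
  base = namespaces.fetch(prefix) { raise KeyError, "undeclared prefix: #{prefix}" }
  base + local
end

puts expand("dc:title", namespaces)
puts expand("rdf:Description", namespaces)
```

    A prefix that has not been declared raises an error in this sketch, which mirrors the rule that every prefix used in the document must have a matching declaration.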

    How to do Level 4 - Insert your image into an RDF document

    Example in Perl

    Though we distinguish text files from binary files, all files are actually binary files. Sequential bytes of 8 bits are converted to ascii equivalents, and if the ascii equivalents are alphanumerics, we call the file a text file. If the ascii values of 8-bit sequential file chunks are non-alphanumeric, we call the files binary files.

    Standard format image files are always binary files. Because RDF syntax is a pure ascii file format, image binaries cannot be directly pasted into an RDF document. However, binary files can be interconverted to and from ascii format, using a simple software utility. This simple Perl script, using the MIME::Base64::Perl module, is all that is necessary to interconvert binary files to Base64.

      #!/usr/bin/perl
      use MIME::Base64::Perl;
      open (TEXT,"c\:\\ftp\\ldip\\ldip2103\.jpg")||die"cannot"; #path to sample file
      binmode TEXT;
      $/ = undef;
      $string = <TEXT>;
      close TEXT;
      $encoded = encode_base64($string);
      open(OUT,">2103.txt");
      print OUT $encoded;
      close OUT;
      
      #$decoded = decode_base64($encoded);
      #open(OUT,">binary.jpg");
      #binmode OUT;
      #print OUT $decoded;
      exit;
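
    For Ruby users, a comparable conversion can be sketched with Ruby's built-in pack and unpack1 methods, which need no external module. To keep the example self-contained, a short byte string stands in for the contents of an image file; in practice the bytes would be read from the file in binary mode.

```ruby
# Stand-in for the bytes of an image file (the first bytes of a JPEG).
binary = "\xFF\xD8\xFF\xE0\x00\x10JFIF\x00".b

# pack("m") produces Base64 text, with a line break after every
# 60 characters of output.
encoded = [binary].pack("m")
puts encoded

# unpack1("m") reverses the conversion and restores the original bytes.
decoded = encoded.unpack1("m")
puts decoded == binary
```

    The encoded text contains only printable ascii characters, so it can be placed inside an RDF document and converted back to the original binary when needed.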
    

    Here is an example of the same RDF document shown in the prior use-case. The only difference is that in addition to pointing to the URL that identifies the image, this document contains the image file converted to base64 ascii.

      <?xml version="1.0"?>
      <rdf:RDF 
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
      
       <rdf:Description 
       rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
          <rdf:type  
       rdf:resource="http://www.the_url_here.org/ldip_sch#Image"/>
          <dc:title>Normal Lung</dc:title>
          <dc:creator>Bill Moore</dc:creator>
          <dc:date>2006-06-21</dc:date>
          <ldip:instrument_id 
       rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/>    
          <ldip:instrument_id 
       rdf:resource="urn:www.the_url_here.org:ldip:Infinity_3_00169344"/>
          <ldip:imageType>photomicrograph</ldip:imageType>
          <ldip:stain>H and E</ldip:stain>
          <ldip:tissue>lung</ldip:tissue>
          <ldip:organism>human</ldip:organism>
          <ldip:objective>10x</ldip:objective>
          <ldip:diagnosis>normal</ldip:diagnosis>
          <ldip:base64File>
      /9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8
      UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDB
      gNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyM
      jIyMjIyMjL/wAARCAYACAADASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAEC
      .
      .
      .
      D1phhQnOMVSxL6nRHEaWZz8tpvGBWdPpMpbO08967AW8YOcU4xIRgqKr6yl0N449x0S
      0OFOkv6Yq1a6W6HJHFdb9miznbStAhB4AzVfWl0KqY3nsjjNSuEtosE9BXC6lrf7whT
      39a7jxLp0xjbah6V5leaZchySh6104PC+196R1zrU6cFZo/9k=
      </ldip:base64File>
      </rdf:Description>
      
      <rdf:Description 
      rdf:about="urn:www.the_url_here.org:ldip:Olympus_BH2_224085">
      <rdf:type 
      rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
      <ldip:instrumentType>Microscope</ldip:instrumentType>
          <ldip:make>Olympus</ldip:make>
          <ldip:model>BH2</ldip:model>
          <ldip:serialNumber>224085</ldip:serialNumber>
      </rdf:Description>
      <rdf:Description 
      rdf:about="urn:www.the_url_here.org:ldip:Infinity_3_00169344">
         <rdf:type 
      rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
         <ldip:instrumentType>Camera</ldip:instrumentType>
         <ldip:make>Infinity</ldip:make>
         <ldip:model>3</ldip:model>
         <ldip:serialNumber>00169344</ldip:serialNumber>
      </rdf:Description>
      </rdf:RDF>
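
    Going the other way, a script that receives such a document can recover the image bytes from the base64File element. The Ruby sketch below is a minimal illustration: the helper name is hypothetical, the sample payload is the word "lung" rather than a real image, and a regular expression is used only to keep the example short (an XML parser would be sturdier).

```ruby
require 'base64'

# Sketch: pull the Base64 text out of an <ldip:base64File> element and decode it.
# embedded_image_bytes is a hypothetical helper name.
def embedded_image_bytes(rdf_text)
  encoded = rdf_text[%r{<ldip:base64File>(.*?)</ldip:base64File>}m, 1]
  return nil if encoded.nil?
  Base64.decode64(encoded)   # decode64 skips the line breaks in the encoded text
end

# A tiny stand-in document; the payload is the word "lung", not a real image.
sample = <<~RDF
  <rdf:Description rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
    <ldip:base64File>
    #{Base64.encode64("lung")}
    </ldip:base64File>
  </rdf:Description>
RDF

bytes = embedded_image_bytes(sample)
puts bytes
# To save real image bytes: File.open("restored.jpg", "wb") { |f| f.write(bytes) }
```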

    Base64 in Ruby

    Ruby script, base64.rb, encodes strings in Base64 notation

      #!/usr/local/bin/ruby
      require 'base64'
      text = "The secret of life"
      encoded = Base64.encode64(text)
      puts("This is the encoded text ... #{encoded}")
      decoded = Base64.decode64(encoded)
      puts("This is the decoded text ... #{decoded}")
      exit
    

    Output of Ruby script base64.rb

         C:\ftp\rb>ruby base64.rb
         This is the encoded text ... VGhlIHNlY3JldCBvZiBsaWZl
         This is the decoded text ... The secret of life

    An entire file can be converted to Base64 using class File's read method.

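      #!/usr/local/bin/ruby
      require 'base64'
      # read the whole image file as raw bytes ("rb" = read, binary mode)
      image_file_string = File.open("walnut.jpg", "rb") { |f| f.read }
      b64 = Base64.encode64(image_file_string)
      puts b64.slice(0,300)   # show the first 300 characters of the Base64 text
      regular = Base64.decode64(b64)
      # write the decoded bytes to a new file; the block form closes the file
      File.open("walnew.jpg", "wb") { |f| f.write(regular) }
      exit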

    Nota bene

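    Base64 enlarges the file sizes of images. Every 3 bytes of binary data become 4 ascii characters, so a Base64 version is about one-third larger than the original binary (slightly more when line breaks are added). If you want to include Base64 versions of large image binaries, then expect to share very large documents.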
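
    The size penalty can be estimated before encoding. The Ruby sketch below assumes only the standard Base64 rule that each group of 3 input bytes is written as 4 characters; the function name and the 3000-byte stand-in for an image are made up for the example.

```ruby
require 'base64'

# Sketch: Base64 writes every 3 input bytes as 4 ASCII characters,
# so the encoded size is 4 * ceil(n / 3) characters before any line breaks.
# base64_length is a hypothetical helper name.
def base64_length(byte_count)
  4 * ((byte_count + 2) / 3)
end

raw = "\x00".b * 3000                    # 3000 arbitrary bytes standing in for an image
puts base64_length(raw.bytesize)         # predicted size without line breaks
puts Base64.strict_encode64(raw).size    # actual size without line breaks
puts Base64.encode64(raw).size           # somewhat larger: encode64 inserts line breaks
```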

    How to do Level 5 - Point to your image file from an RDF document.

    You can prepare an RDF document describing your image, and then simply link to the image from your RDF document, using a pointer.

    The following example provides pathologists with the only method that most will ever use.

      <?xml version="1.0"?>
      <rdf:RDF 
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
      
       <rdf:Description 
       rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
          <rdf:type 
       rdf:resource="http://www.the_url_here.org/ldip_sch#Image"/>
          <dc:title>Normal Lung</dc:title>
          <dc:creator>Bill Moore</dc:creator>
          <dc:date>2006-06-21</dc:date>
          <ldip:instrument_id                   
          rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/>    
          <ldip:instrument_id 
          rdf:resource="urn:www.the_url_here.org:ldip:Infinity_3_00169344"/>
          <ldip:imageType>photomicrograph</ldip:imageType>
          <ldip:stain>H and E</ldip:stain>
          <ldip:tissue>lung</ldip:tissue>
          <ldip:organism>human</ldip:organism>
          <ldip:objective>10x</ldip:objective>
          <ldip:diagnosis>normal</ldip:diagnosis>
      </rdf:Description>
      
      <rdf:Description 
      rdf:about="urn:www.the_url_here.org:ldip:Olympus_BH2_224085">
      <rdf:type 
      rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
      <ldip:instrumentType>Microscope</ldip:instrumentType>
          <ldip:make>Olympus</ldip:make>
          <ldip:model>BH2</ldip:model>
          <ldip:serialNumber>224085</ldip:serialNumber>
      </rdf:Description>
      <rdf:Description 
      rdf:about="urn:www.the_url_here.org:ldip:Infinity_3_00169344">
         <rdf:type 
      rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
         <ldip:instrumentType>Camera</ldip:instrumentType>
         <ldip:make>Infinity</ldip:make>
         <ldip:model>3</ldip:model>
         <ldip:serialNumber>00169344</ldip:serialNumber>
      </rdf:Description>
      </rdf:RDF>
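
    A description of this kind can also be generated by a short script, which helps when many images need the same treatment. The following Ruby sketch is one possible approach, not a prescribed method: the helper names are hypothetical, and it writes only the rdf:Description block for an image, leaving the enclosing rdf:RDF element and namespace declarations to the caller.

```ruby
# Sketch: build a pointer-style rdf:Description block from a hash of ldip properties.
# xml_escape and image_description are hypothetical helper names.
def xml_escape(text)
  text.to_s.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;").gsub('"', "&quot;")
end

def image_description(image_url, properties)
  lines = [%(<rdf:Description rdf:about="#{xml_escape(image_url)}">)]
  lines << %(  <rdf:type rdf:resource="http://www.the_url_here.org/ldip_sch#Image"/>)
  properties.each do |name, value|
    lines << "  <ldip:#{name}>#{xml_escape(value)}</ldip:#{name}>"
  end
  lines << "</rdf:Description>"
  lines.join("\n")
end

puts image_description(
  "http://www.the_url_here.org/ldip/ldip2103.jpg",
  { "stain" => "H and E", "tissue" => "lung", "objective" => "10x" }
)
```

    Escaping the values matters because characters such as & and < are not allowed as plain text inside an XML element.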

    How to do Level 6 - Combining Levels 1 through 5, with multiple images/annotations

    Multiple RDF documents pointing to image file

    There are times when the structural data (i.e., the non-binary data) for a data object's specification must be distributed in multiple files.

    This is one of the most important reasons for using a data specification, rather than a data standard. The specification permits you to create a dynamic object, composed of informational pieces that can be updated, so that the content and value of a specified image object increases over time. A data standard obligates you to compose a static data file. If the standard data file contains information that cannot be shared (due to human subject risks, or to intellectual property encumbrances), the standard file usually cannot be distributed. A specification may consist of multiple files connected by URL pointers. If component files contain privileged information, the data object's specification can be distributed with access restricted to specified files.

      RDF File 1 (Describes an image)
      <?xml version="1.0"?>
      <rdf:RDF 
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
      
       <rdf:Description 
       rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
          <rdf:type 
       rdf:resource="http://www.the_url_here.org/ldip_sch#Image"/>
          <dc:title>Normal Lung</dc:title>
          <dc:creator>Bill Moore</dc:creator>
          <dc:date>2006-06-21</dc:date>
          <ldip:instrument_id                   
       rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/>  
          <ldip:linkedFile rdf:resource="http://www.the_url_here.org/file2"/>
          <ldip:instrument_id 
       rdf:resource="urn:www.the_url_here.org:ldip:Infinity_3_00169344"/>
          <ldip:linkedFile rdf:resource="http://www.the_url_here.org/file3"/>
          <ldip:imageType>photomicrograph</ldip:imageType>
          <ldip:stain>H and E</ldip:stain>
          <ldip:tissue>lung</ldip:tissue>
          <ldip:organism>human</ldip:organism>
          <ldip:objective>10x</ldip:objective>
          <ldip:diagnosis>normal</ldip:diagnosis>
      </rdf:Description>
      </rdf:RDF>
      
      RDF File 2 (Describes a microscope referenced by File 1)
      <?xml version="1.0"?>
      <rdf:RDF 
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description 
      rdf:about="urn:www.the_url_here.org:ldip:Olympus_BH2_224085">
      <rdf:type 
      rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
      <ldip:instrumentType>Microscope</ldip:instrumentType>
          <ldip:make>Olympus</ldip:make>
          <ldip:model>BH2</ldip:model>
          <ldip:serialNumber>224085</ldip:serialNumber>
      </rdf:Description>
      </rdf:RDF>
      
      RDF File 3 (Describes a camera referenced by File 1)
      <?xml version="1.0"?>
      <rdf:RDF 
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description 
      rdf:about="urn:www.the_url_here.org:ldip:Infinity_3_00169344">
         <rdf:type 
      rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
         <ldip:instrumentType>Camera</ldip:instrumentType>
         <ldip:make>Infinity</ldip:make>
         <ldip:model>3</ldip:model>
         <ldip:serialNumber>00169344</ldip:serialNumber>
      </rdf:Description>
      </rdf:RDF>
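
    A script that assembles the full specification needs to know which files an image description points to. The Ruby sketch below collects the URLs named by ldip:linkedFile elements; the helper name is hypothetical, the sample is a trimmed-down stand-in for File 1, and the regular expression is a shortcut that an XML parser could replace.

```ruby
# Sketch: collect the URLs named by ldip:linkedFile elements in an RDF file,
# so a script could fetch the linked files and merge their triples.
# linked_files is a hypothetical helper name.
def linked_files(rdf_text)
  rdf_text.scan(/<ldip:linkedFile\s+rdf:resource\s*=\s*"([^"]*)"/).flatten
end

file1 = <<~RDF
  <rdf:Description rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
    <ldip:linkedFile rdf:resource="http://www.the_url_here.org/file2"/>
    <ldip:linkedFile rdf:resource="http://www.the_url_here.org/file3"/>
  </rdf:Description>
RDF

puts linked_files(file1)
```

    Access restrictions can then be applied file by file: a script simply skips any linked file it is not permitted to retrieve.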

    Multiple RDF files pointing to multiple image files

    Suppose, as in the previous example, that the triples relevant to your image lie in multiple RDF files. Suppose, further, that your image is just one of a set of images that were all obtained during the same session, and that all the images apply to the same patient. This situation is routine for radiologic images, wherein dozens of images transecting the brain, or the abdomen, may form part of the same report.

    How might you annotate this complex set of data files and image binaries? Simply include an RDF assertion for each image.

      RDF File 1 (Describes 2 images)
      <?xml version="1.0"?>
      <rdf:RDF 
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
      
       <rdf:Description 
       rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
          <rdf:type 
       rdf:resource="http://www.the_url_here.org/ldip_sch#Image"/>
          <dc:title>Normal Lung</dc:title>
          <dc:creator>Bill Moore</dc:creator>
          <dc:date>2006-06-21</dc:date>
          <ldip:instrument_id                   
      rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/> 
      <ldip:linkedFile rdf:resource="http://www.the_url_here.org/file2"/>
          <ldip:instrument_id 
      rdf:resource="urn:www.the_url_here.org:ldip:Infinity_3_00169344"/>
      <ldip:linkedFile rdf:resource="http://www.the_url_here.org/file3"/>
          <ldip:imageType>photomicrograph</ldip:imageType>
          <ldip:stain>H and E</ldip:stain>
          <ldip:tissue>lung</ldip:tissue>
          <ldip:organism>human</ldip:organism>
          <ldip:objective>10x</ldip:objective>
          <ldip:diagnosis>normal</ldip:diagnosis>
      </rdf:Description>
      <rdf:Description 
      rdf:about="http://www.the_url_here.org/ldip/ldip2201.jpg">
          <rdf:type 
      rdf:resource="http://www.the_url_here.org/ldip_sch#Image"/>
          <dc:title>Normal Lung</dc:title>
          <dc:creator>Bill Moore</dc:creator>
          <dc:date>2006-06-21</dc:date>
          <ldip:instrument_id                   
      rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/> 
      <ldip:linkedFile rdf:resource="http://www.the_url_here.org/file2"/>
          <ldip:instrument_id 
      rdf:resource="urn:www.the_url_here.org:ldip:Infinity_3_00169344"/>
      <ldip:linkedFile rdf:resource="http://www.the_url_here.org/file3"/>
          <ldip:imageType>photomicrograph</ldip:imageType>
          <ldip:stain>H and E</ldip:stain>
          <ldip:tissue>lung</ldip:tissue>
          <ldip:organism>human</ldip:organism>
          <ldip:objective>2.5x</ldip:objective>
          <ldip:diagnosis>squamous cell carcinoma</ldip:diagnosis>
      </rdf:Description>
      </rdf:RDF>
      
      
      RDF File 2 (Describes a microscope referenced by File 1)
      <?xml version="1.0"?>
      <rdf:RDF 
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description 
      rdf:about="urn:www.the_url_here.org:ldip:Olympus_BH2_224085">
      <rdf:type 
      rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
      <ldip:instrumentType>Microscope</ldip:instrumentType>
          <ldip:make>Olympus</ldip:make>
          <ldip:model>BH2</ldip:model>
          <ldip:serialNumber>224085</ldip:serialNumber>
      </rdf:Description>
      </rdf:RDF>
      
      RDF File 3 (Describes a camera referenced by File 1)
      <?xml version="1.0"?>
      <rdf:RDF 
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description 
      rdf:about="urn:www.the_url_here.org:ldip:Infinity_3_00169344">
         <rdf:type 
      rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
         <ldip:instrumentType>Camera</ldip:instrumentType>
         <ldip:make>Infinity</ldip:make>
         <ldip:model>3</ldip:model>
         <ldip:serialNumber>00169344</ldip:serialNumber>
      </rdf:Description>
      </rdf:RDF>

    The same approach can be used to reference multiple images within a single image file. The rdf:about attribute can point to any file block or file element that contains the part of the image that is the intended subject of the triples (e.g., region of interest, thumbnail, tile, waveform, color map, and so on).
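
    Returning to the two-image example, the value of one assertion per image becomes clear when the document is read back as triples: each triple carries its own subject, so the two images never get confused. The Ruby sketch below is a simplified illustration that handles only literal properties of the form <prefix:name>text</prefix:name>; the helper name and the shortened sample are assumptions.

```ruby
# Sketch: turn each rdf:Description block into <subject, property, value> triples.
# Only simple literal properties are handled; literal_triples is a hypothetical name.
def literal_triples(rdf_text)
  triples = []
  rdf_text.scan(%r{<rdf:Description\s+rdf:about="([^"]*)">(.*?)</rdf:Description>}m) do |subject, body|
    body.scan(%r{<([\w.-]+:[\w.-]+)>([^<]*)</\1>}) do |property, value|
      triples << [subject, property, value.strip]
    end
  end
  triples
end

two_images = <<~RDF
  <rdf:Description rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
    <ldip:objective>10x</ldip:objective>
    <ldip:diagnosis>normal</ldip:diagnosis>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.the_url_here.org/ldip/ldip2201.jpg">
    <ldip:objective>2.5x</ldip:objective>
    <ldip:diagnosis>squamous cell carcinoma</ldip:diagnosis>
  </rdf:Description>
RDF

literal_triples(two_images).each { |triple| puts triple.join(" | ") }
```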


    Dealing with DICOM

    What is DICOM?

    In the field of biomedicine, DICOM (Digital Imaging and Communications in Medicine) has special significance, because DICOM is the format currently used for almost all radiologic images. DICOM was developed over several decades, to become a multifunctional standard of enormous complexity that uses a model for data storage that is unlike any other image file format. The DICOM standard includes a set of protocols for transferring information through networks, and for communicating between different radiologic devices or different parts of a single device (e.g., between CT machine and CT workstation). It creates a unique syntax and semantics for information and produces a file that contains a large amount of descriptive information (including patient information and diagnostic information), and a binary representation of one or more images.

    One of the best descriptions of the DICOM file format is available at:

      http://www.dclunie.com/medical-image-faq/html/part1.html 
      http://www.dclunie.com/medical-image-faq/html/part2.html

    DICOM Working Group 26, led by Dr. Bruce Beckwith, is attempting to expand DICOM to include metadata appropriate for pathology images. To the best of our knowledge, there are now a handful of pathology departments that format pathology photomicrographs as DICOM images. The VA (U.S. Veterans Administration) seems to have adopted DICOM as their standard format for all medical images. To the best of our knowledge, nobody using DICOM is currently inserting a complete set of pathology descriptors into their DICOM headers, but this may change over the course of time.

    For the purposes of this document, all we need to know is that the header information in a DICOM file can be extracted (with a short Ruby script), and that the binary portion of a DICOM file can be converted to a JPEG file. The header data from the DICOM file can be re-inserted into the header of a JPEG file, or it can be included in a special XML file that "points" back to the original DICOM file or to the JPEG file that contains the image representation.

    Standards are inter-convertible

    The nice thing about electronic standards, and the thing that sets them apart from standards created for physical objects, is that they are interconvertible.

    Though there are hundreds of standard image formats, robust image software can do a pretty good job at converting any format into any other format.

    We like to work with JPEG images, because they are the most popular web images. Our philosophy is that if you like to work in DICOM, you should work in DICOM. If you like to work in JPEG, you should work in JPEG. There are simple ways of interconverting the two formats.

    The DICOM header

    You can find many DICOM images at:

      ftp://ftp.erl.wustl.edu/pub/dicom/images/version3/RSNA95/

    These images can be used as practice files for some of the scripts that will follow.

    DICOM has a header that can be extracted from the DICOM image file, and which contains textual descriptive information about the image.

    Here is a sample DICOM image header.

         0002,0000,File Meta Elements Group Len=122
         0002,0001,File Meta Info Version=1
         0002,0002,Media Storage SOP Class UID=1.2.840.10008.5.1.4.1.1.7.
         0002,0003,Media Storage SOP Inst UID=9999.20070123103417.100.10
         0002,0010,Transfer Syntax UID=1.2.840.10008.1.2.1.
         0002,0012,Implementation Class UID=960051513
         0008,0008,Image Type=
         0008,0012,Instance Creation Date=20070123
         0008,0013,Instance Creation Time=103417
         0008,0016,SOP Class UID=1.2.840.10008.5.1.4.1.1.7.
         0008,0018,SOP Instance UID=9999.20070123103417.100.10
         0008,0020,Study Date=20070123
         0008,0030,Study Time=103417
         0008,0050,Accession Number=
         0008,0060,Modality=OT
         0008,0064,Conversion Type=WSD.
         0008,0090,Referring Physician's Name=
         0010,0010,Patient's Name=gwmbw.jpg.
         0010,0020,Patient ID=0.
         0010,0030,Patient Date of Birth=
         0010,0040,Patient Sex=M 
         0010,1010,Patient Age=0.
         0020,000D,Study Instance UID=9999.20070123103417.100.20
         0020,000E,Series Instance UID=9999.20070123103417.100.30
         0020,0010,Study ID= 0
         0020,0011,Series Number=0
         0020,0013,Image Number=0
         0020,0020,Patient Orientation=
         0028,0002,Samples Per Pixel=1
         0028,0004,Photometric Interpretation=MONOCHROME2
         0028,0010,Rows=1536
         0028,0011,Columns=2048
         0028,0100,Bits Allocated=8
         0028,0101,Bits Stored=8
         0028,0102,High Bit=7
         0028,0103,Pixel Representation=0
         7FE0,0010,Pixel Data=3145728
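
    Because each line of this header text has the same shape (group, element, name, value), it is easy to load into a lookup table. The Ruby sketch below assumes the "gggg,eeee,Name=Value" layout seen in the sample; the function name is hypothetical, and the four sample lines are copied from the header shown here.

```ruby
# Sketch: parse a text header dump into a hash keyed by "group,element".
# Each line is assumed to look like "gggg,eeee,Name=Value".
# parse_dicom_header is a hypothetical helper name.
def parse_dicom_header(text)
  header = {}
  text.each_line do |line|
    match = line.strip.match(/\A(\h{4}),(\h{4}),([^=]*)=(.*)\z/)
    next unless match
    tag = "#{match[1]},#{match[2]}"
    header[tag] = { name: match[3].strip, value: match[4].strip }
  end
  header
end

sample = <<~HEADER
  0008,0060,Modality=OT
  0028,0010,Rows=1536
  0028,0011,Columns=2048
  0008,0090,Referring Physician's Name=
HEADER

header = parse_dicom_header(sample)
puts header["0008,0060"][:value]
# Rows times columns; for this 8-bit, one-sample-per-pixel image
# the product equals the Pixel Data length listed in the header.
puts header["0028,0010"][:value].to_i * header["0028,0011"][:value].to_i
```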

    Manipulating DICOM with Ruby

    DICOM To JPEG Conversion - ezDicom And DCM2JPG And JPEG2DCM.

    ezDICOM (Copyright 2002, Wolfgang Krug and Chris Rorden) is a medical viewer for DICOM images. It is distributed along with dcm2jpg, a command-line application that can convert DICOM images into standard bitmap file formats (JPEG, PNG, and BMP). In addition, it will convert a DICOM image to its textual header information.

    The sample command-line is:

      dcm2jpg -f p -o C:\TEMP -z 1.5 C:\DICOM\input1.dcm C:\input2.dcm

    This command-line may contain information for brightness, contrast, format of output, output target directory, input files, etc.

    If you simply invoke:

      dcm2jpg <dicom filename and path if file is not in current directory>

    then the DICOM file will be converted to a .jpg file in the directory that holds the DICOM file.

    The .exe file is:

      dcm2jpg.exe 218,112 bytes - converts to jpeg by default

    If you wish, you can simply rename the .exe file to change the default conversion behavior.

      dcm2bmp.exe 218,112 bytes - converts to bmp by default 
      dcm2png.exe 218,112 bytes - converts to png by default 
      dcm2txt.exe 218,112 bytes - converts to text header by default

    Any JPEG file can be converted to a DICOM file with jpeg2dcm. This free software by CharruaSoft can be downloaded from:

      http://www.charruasoft.com/downen.htm

    It is a simple exe file (jpeg2dcm.exe 511,488 bytes) that can operate from a command-line.

    It will accept an input file, such as gems.jpg, and convert it to a DICOM file:

      gems.jpg 132,135 bytes
      gems.dcm 1,980,712 bytes

    Ruby scripts for converting DICOM to JPEG

    We will use:

    dcm2jpg.exe 218,112 bytes - converts to jpeg by default

    
    dcm2txt.exe 218,112 bytes - converts to text header by default

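    Ruby can call a command-line application from within a script, using the exec method or the system method. The exec method replaces the running Ruby script with the called application, so no later line of the script is executed; the system method runs the application and then returns control to the script.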

    Ruby script, dcm2jpg.rb, converts a DICOM file into a JPEG file:

      #!/usr/local/bin/ruby
      exec("dcm2jpg.exe c\:\\ftp\\picker\\dicom\\CT4174\~1")
      exit

    Screen output of dcm2jpg.rb script

      c:\ftp>ruby dcm2jpg.rb
      1 Creating: c:\ftp\picker\dicom\CT4174~1.jpg
      1

    Ruby script, dcmsplit.rb, converts a DICOM file into a JPEG, and a text file for the header

      #!/usr/local/bin/ruby
      system("dcm2jpg.exe c\:\\ftp\\picker\\dicom\\CT4174\~1")
      system("dcm2txt.exe c\:\\ftp\\picker\\dicom\\CT4174\~1")
      exit

    Screen output of dcmsplit.rb script

      c:\ftp>ruby dcmsplit.rb
      1 Creating: c:\ftp\picker\dicom\CT4174~1.jpg
      1
      1 Creating: c:\ftp\picker\dicom\CT4174~1.txt
      1

    The just-created JPEG file lacks the clinical information contained in the DICOM file, but this information is now available to us in our newly created text file. In the next section, we will see how any textual information can be inserted back into a JPEG file, using RMagick.
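
    When a script depends on two external conversions, it is worth checking that each one succeeded before going on. The Ruby sketch below wraps the system method in a small helper; the helper name is hypothetical, and the Ruby interpreter itself stands in for dcm2jpg.exe so that the example can run on any machine.

```ruby
require 'rbconfig'

# Sketch: run an external command with Kernel#system and report whether it worked.
# system returns true on a zero exit status, false on a nonzero status,
# and nil when the command could not be started at all.
# run_step is a hypothetical helper name.
def run_step(*command)
  result = system(*command)
  warn "could not run: #{command.join(' ')}" if result.nil?
  result == true
end

# The Ruby interpreter stands in for dcm2jpg.exe so the example runs anywhere.
puts run_step(RbConfig.ruby, "-e", "exit 0")
puts run_step(RbConfig.ruby, "-e", "exit 3")
```

    In a conversion script, the two calls could be chained, for example run_step("dcm2jpg.exe", path) && run_step("dcm2txt.exe", path), so that the header is extracted only if the image conversion succeeded.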

    Extracting And Inserting Jpeg And Dicom Image Headers.

    Ruby script, jpeg_add.rb, inserts textual information into a JPEG image:

         #!/usr/local/bin/ruby
         require 'RMagick'   # newer RMagick releases expect require 'rmagick'
         include Magick
         text = IO.read("gwmbw.txt")   # textual DICOM header
         orig_image = ImageList.new("gwmbw.jpg")
         orig_image.cur_image[:Comment] = text
         print "\nComment added, let's make a file to hold the modifications\n\n"
         copy_image = orig_image.cur_image.copy
         copy_image.write("c\:\\ftp\\rb\\gwmout.JPG")
         copy_image.properties{|name, value| print "#{name}\n#{value}\n"}
         exit
    

    Output of jpeg_add.rb script:

         C:\ftp\rb>ruby jpeg_add.rb
              Comment added, let's make a file to hold the modifications  
        Comment
         0002,0000,File Meta Elements Group Len=122
         0002,0001,File Meta Info Version=1
         0002,0002,Media Storage SOP Class UID=1.2.840.10008.5.1.4.1.1.7.
         0002,0003,Media Storage SOP Inst UID=9999.20070123103417.100.10
         0002,0010,Transfer Syntax UID=1.2.840.10008.1.2.1.
         0002,0012,Implementation Class UID=960051513
         0008,0008,Image Type=
         0008,0012,Instance Creation Date=20070123
         0008,0013,Instance Creation Time=103417
         0008,0016,SOP Class UID=1.2.840.10008.5.1.4.1.1.7.
         0008,0018,SOP Instance UID=9999.20070123103417.100.10
         0008,0020,Study Date=20070123
         0008,0030,Study Time=103417
         0008,0050,Accession Number=
         0008,0060,Modality=OT
         0008,0064,Conversion Type=WSD.
         0008,0090,Referring Physician's Name=
         0010,0010,Patient's Name=gwmbw.jpg.
         0010,0020,Patient ID=0.
         0010,0030,Patient Date of Birth=
         0010,0040,Patient Sex=M
         0010,1010,Patient Age=0.
         0020,000D,Study Instance UID=9999.20070123103417.100.20
         0020,000E,Series Instance UID=9999.20070123103417.100.30
         0020,0010,Study ID= 0
         0020,0011,Series Number=0
         0020,0013,Image Number=0
         0020,0020,Patient Orientation=
         0028,0002,Samples Per Pixel=1
         0028,0004,Photometric Interpretation=MONOCHROME2
         0028,0010,Rows=1536
         0028,0011,Columns=2048
         0028,0100,Bits Allocated=8
         0028,0101,Bits Stored=8
         0028,0102,High Bit=7
         0028,0103,Pixel Representation=0
         7FE0,0010,Pixel Data=3145728
         JPEG-Colorspace
         1
         JPEG-Sampling-factors
         1x1

    The output JPEG image file (gwmout.jpg) contains the binary representation of the same image in gwmbw.dcm and in gwmbw.jpg. In addition, it contains a Comment field consisting of the contents of file gwmbw.txt, the textual representation of the original DICOM header. We can extract the Comment field from the JPEG header whenever we wish, using RMagick's properties method.

    RDF image specification contrasted with DICOM

    DICOM is a complex and highly specialized standard developed over several decades. It is intended to negotiate a variety of network transactions in addition to encapsulating an image binary. DICOM was created at a time prior to the development of XML and prior to the development of the web transfer protocol, http. It depends on a variety of data control devices that some may find anachronistic (e.g., prescribed byte locations, DICOM-specific data transfer and negotiation protocols). Mastering the technical aspects of DICOM may require months of analyzing the many volumes of DICOM technical reports. With few exceptions, DICOM is something that serves radiologists through proprietary software written for imaging devices that may cost millions of dollars in outlay plus maintenance.

    Though DICOM has proven itself to be an excellent standard for radiology images, it is not currently a popular standard for pathology images.

    Pathologists should have the option of using DICOM, or using some other image format. Regardless of the image format chosen, it is important to annotate images with textual data in a computer parsable form, that can be ported between alternate image formats.

    Pathologists typically use off-the-shelf cameras and software, and the raw pixel data is usually provided in a popular image format, such as jpeg, png, bmp, or tiff. By specifying images in RDF, pathologists have a simple, inexpensive and easily implemented method for annotating images with useful descriptive data, and for binding these annotations to the images.

    Porting between an RDF specification and a data standard

    Because all the data in a specification is fully described, it is very easy to write software that will port a specification into a data standard. To have full compatibility between a data specification and data standard, the specification must contain all of the required data elements of the data standard. This usually involves:

      1. Studying the data standard, and writing an RDF Schema (or supplementing 
         an existing RDF Schema) with classes and properties appropriate for 
         the data standard.  
      
      2. Preparing the RDF document with knowledge of the classes and 
         properties that are required by the data standard. RDF 
         specifications, unlike data standards, do not place requirements 
         on the classes and properties that need to be included in the 
         document. The list of required elements can be provided as an 
         external document, or as triples added to a schema, or as a 
         feature of the [textual] CDE document. At present, no single 
         method has emerged as the preferred strategy.
      
      3. Writing software that will parse the RDF document in which the data 
         object is specified (trivial), and transform the triples into the 
         document that conforms to the structure and content of the data 
         standard.
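
    The second and third steps can be sketched in a few lines of Ruby. This is only an outline under stated assumptions: REQUIRED_FIELDS is a made-up list, not taken from any real data standard, the "field=value" output layout is invented for the example, and the triples are supplied as a plain array rather than parsed from an RDF file.

```ruby
# Sketch: check an image's triples against the elements a data standard requires,
# then write the values out in the standard's field order.
# REQUIRED_FIELDS is a made-up list for illustration, not from any real standard.
REQUIRED_FIELDS = ["dc:title", "dc:date", "ldip:tissue"].freeze

# Return the required properties that no triple supplies.
def missing_fields(triples)
  present = triples.map { |_subject, property, _value| property }
  REQUIRED_FIELDS - present
end

# Write one "field=value" line per required field (hypothetical output layout).
def to_record(triples)
  values = triples.to_h { |_subject, property, value| [property, value] }
  REQUIRED_FIELDS.map { |field| "#{field}=#{values[field]}" }.join("\n")
end

subject = "http://www.the_url_here.org/ldip/ldip2103.jpg"
triples = [
  [subject, "dc:title", "Normal Lung"],
  [subject, "dc:date", "2006-06-21"],
  [subject, "ldip:stain", "H and E"]
]

puts missing_fields(triples).inspect   # properties the standard needs but the document lacks
puts to_record(triples + [[subject, "ldip:tissue", "lung"]])
```

    Properties that the standard does not require (ldip:stain in this example) are simply left out of the ported record; they remain available in the RDF document.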


    Advanced RDF tutorial (Can be skipped unless you have a special interest in the subject)

    Common Data Elements (CDEs)

    The term "common data element" is a misnomer. Most people, when they first encounter this term, assume that a data element holds data. Actually, a common data element is the metadata that describes a datum in a data record. In XML parlance, a CDE is an XML tag. The thing that makes a descriptor "common" is its common usage by a scientific community. The way that CDEs are intended to work is that a scientific community creates a list of CDEs that describe the kinds of data that their members use. The members of the community will all use the same CDEs (XML tags) to annotate their data files.

    ISO-11179 Specification for CDEs

    One of the most calamitous errors in any CDE project is to assume that everyone who reads a metadata tag will automatically understand its intended meaning. ISO-11179 is a standard way of defining CDEs with the necessary information for understanding their meanings.

    The most popular CDEs in existence are the Dublin Core CDEs. These are a set of file descriptors that were prepared by a committee of librarians, who convened in Dublin, Ohio. The Dublin Core includes basic information about electronic documents, such as: the title of the document, the name of the person who created the file, the date that the file was created, the date that the file was modified, and a short description of the file. These are basically the items that a library software agent would need to retrieve, if it were building an index of internet documents. The world of informatics would be a better place if everyone who created an HTML, XML or RDF file would remember to include the Dublin Core CDEs.

    These Dublin Core CDEs have been prepared to comply with the ISO-11179 specification. Every effort to create a data specification for a knowledge domain should begin by collecting the common data elements for the domain, and annotating each element with the ISO-11179 CDE descriptors. The ISO-11179 descriptors for two of the Dublin Core CDEs (Title and Creator) are shown below.

    From: http://dublincore.org/documents/1999/07/02/dces/

      Title 
         Identifier: Title
         Version: 1.1
         Registration Authority: Dublin Core Metadata Initiative
         Language: en
         Obligation: Optional
         Datatype: Character String
         Maximum Occurrence: Unlimited
         Definition: A name given to the resource.
         Comment: Typically, a Title will be a name by which the resource is
         formally known.
      
      Creator 
         Identifier: Creator
         Version: 1.1
         Registration Authority: Dublin Core Metadata Initiative
         Language: en
         Obligation: Optional
         Datatype: Character String
         Maximum Occurrence: Unlimited
         Definition: An entity primarily responsible for making the content
         of the resource.
         Comment: Examples of a Creator include a person, an organisation,
         or a service. Typically, the name of a Creator should be used to
         indicate the entity.
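
    Descriptor blocks in this "Field: value" layout are simple to read into a program. The Ruby sketch below is one way to do it, assuming that each descriptor starts with a capitalized field name followed by a colon and that any other non-blank line continues the previous value; the function name is hypothetical, and the sample is an abridged copy of the Title descriptors.

```ruby
# Sketch: read a block of "Field: value" descriptor lines into a hash.
# Lines without a field name are treated as continuations of the previous value.
# parse_cde is a hypothetical helper name.
def parse_cde(text)
  cde = {}
  current = nil
  text.each_line do |line|
    stripped = line.strip
    if (match = stripped.match(/\A([A-Z][A-Za-z ]*):\s*(.*)\z/))
      current = match[1]
      cde[current] = match[2]
    elsif current && !stripped.empty?
      cde[current] = "#{cde[current]} #{stripped}"
    end
  end
  cde
end

title_cde = <<~CDE
  Identifier: Title
  Version: 1.1
  Obligation: Optional
  Definition: A name given to the resource.
  Comment: Typically, a Title will be a name by which the resource is
  formally known.
CDE

cde = parse_cde(title_cde)
puts cde["Identifier"]
puts cde["Comment"]
```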

    RDF Schemas

    An RDF schema is a dictionary file that lists the classes and the properties that pertain to RDF documents. In fact, the official long name for RDF Schema is the RDF Vocabulary Description Language. The classes of an RDF schema are formal definitions for the kinds of subjects that are found in the RDF triples. The properties of an RDF schema are the types of metadata descriptors for the data of the RDF triples. Elements in RDF schemas may be subclasses of elements in other RDF schemas.

    Things to remember about RDF Schemas

      1. RDF Schemas are written in XML, but are completely unlike XML Schemas.  
      
      2. RDF Schemas contain declarations of the classes and properties 
        that are used in RDF documents. 
      
      3. RDF Schemas, like all RDF documents, have no pre-determined order or 
         composition, and consist of statements expressed as triples.  The 
         subject of every triple in an RDF Schema will be either Class or 
         Property.
      
      4. Every RDF Schema can be thought of as a child of the W3C RDF Schema 
         that defines the "super" classes Resource, Class and Property. All 
         RDF Schemas will refer to the document that defines RDF syntax and 
         to the document that defines the top-level schema, and therefore 
         will begin something like this:
        <?xml version='1.0' encoding='ISO-8859-1'?>
        <rdf:RDF 
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
      
      5. A typical RDF document consists of triples <subject, metadata, value>.
         RDF documents usually reference one or more RDF Schemas to instantiate
         the subject of each triple (i.e., to tell us which class in an RDF 
         schema the subject is an instance of) and to provide subjects with 
         class-appropriate metadata.
      6. Documents composed of triples whose components are defined by RDF 
         Schemas can be used to completely specify data objects within a 
         knowledge domain.  
      
      7. By completely specifying data objects in a knowledge domain, RDF 
         specifications achieve the functionality of data standards.

    In a later section, we will demonstrate the simple method for instantiating classes and for associating class instances with appropriate properties and with datatyped values.

    A template for CDEs that can be transformed to RDF Schema elements

    Here is a CDE template that satisfies the minimal recommendations for a CDE under ISO-11179, and that provides all the information for a class or a property in an RDF schema.

    The general format for class elements is:

      Class Label (in standard XML tag format, uppercase first letter):
      Registration Authority: Association for Pathology Informatics
      Obligation: optional
      Maximum Occurrence: Unlimited
      Comment(must include detailed definition):
      subClassOf:
      Contributor (your consistent first-name last-name):
      Date of your contribution:

    The general format for property elements is:

      Property Label (in standard XML tag format, lowercase first letter):
      Registration Authority: Association for Pathology Informatics
      Obligation: optional
      Maximum Occurrence: Unlimited
      Datatype (can be "Literal", a list, or a regex; default is "Literal"):
      Comment(must include detailed definition):
      Domain (comma-delimited if multiple):
      Range (usually "Literal"):
      Contributor (your consistent first-name last-name):
      Date of your contribution:
    

    The category "Obligation" should contain the word "required" or the word "optional". For the kinds of specifications discussed in this manuscript, including any CDE would always be optional. Similarly, for "Maximum Occurrence", we would think any CDE could occur an unlimited number of times in an RDF document.
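
    A CDE list written in this template is easy to handle with a short script. The following Ruby sketch is ours and is not part of ISO-11179. It assumes each CDE is written one field per line as "Field: value", and the method names parse_cde and cde_problems are hypothetical. It reads one record into a hash and checks that Obligation holds an allowed word and that a comment is present.

```ruby
# Parse a CDE record written as "Field: value" lines into a Hash.
# Assumption: one field per line; the field names follow the template above.
def parse_cde(text)
  text.each_line.with_object({}) do |line, cde|
    key, value = line.strip.split(':', 2)
    next if key.nil? || value.nil?   # skip blank or malformed lines
    cde[key.strip] = value.strip
  end
end

# Return a list of complaints about a parsed CDE (empty when none are found).
def cde_problems(cde)
  problems = []
  unless %w[required optional].include?(cde['Obligation'].to_s.downcase)
    problems << 'Obligation must be "required" or "optional"'
  end
  problems << 'Comment is empty' if cde['Comment'].to_s.empty?
  problems
end

record = <<~CDE
  Class Label: Reagent
  Registration Authority: Association for Pathology Informatics
  Obligation: optional
  Maximum Occurrence: Unlimited
  Comment: Chemicals employed in the laboratory
  subClassOf: Class
CDE

cde = parse_cde(record)
puts cde['Class Label']
puts cde_problems(cde).inspect
```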

    Classes and Properties

    Here is an example of an ISO-11179-compliant CDE written for a class named "Reagent".

      Class Label:Reagent
      versionInfo (required): 0.1
      Registration Authority: Association for Pathology Informatics
      Obligation:optional
      Maximum Occurrence: Unlimited
      Datatype: Literal
      comment: Histologic_stain_reagents, tissue_fixation_reagents, and other chemicals  employed in the laboratory.  For example: distilled_water, ethanol,  hematoxylin, aluminum_sulphate
      subClassOf:Class
      Contributor:Bill Moore
      Date_of_contribution:05-30-2006

    Once we have the CDE, it is a straightforward job to create an RDF Schema Class element:

      <rdfs:Class rdf:about="http://www.the_url_here.org/ldip_sch#Reagent">
      <rdfs:label>Reagent</rdfs:label>
      <rdfs:comment>
      Histologic_stain_reagents, tissue_fixation_reagents, and other chemicals  employed in the laboratory.  For example: distilled_water, ethanol,  hematoxylin, aluminum_sulphate
      </rdfs:comment>
      <rdfs:subClassOf rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
      </rdfs:Class>
                                         
      
    A few points bear explanation.  All the information needed to generate an  RDF class or property should be contained in an ISO-11179-compliant list of CDEs.  An RDF Schema may consist purely of classes and properties.  Classes are defined in RDF exclusively through their ancestral relation.  Basically, to build a class in RDF Schema, you announce that the element is a Class, you provide a unique locator (such as a URL) or a unique universally understood descriptor (more on this later) for the element, a description of the element, and the name of the father class of the element.

    That's all there is to do for classes. You don't need to list the subclasses of the class because the subclasses will list the class as their father in their own schema entry. You don't need to list the properties of the class because the properties will list the classes whose data they describe.
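
    The conversion from a class CDE to a schema entry can be scripted. Here is a minimal Ruby sketch, assuming the CDE fields have already been read into variables. SCHEMA_URL is the placeholder address used in this document, class_element and escape_xml are hypothetical helper names, and the element is written as rdfs:Class (the form used in the full schema later in this document).

```ruby
# Build an rdfs:Class element from the fields of a class CDE.
SCHEMA_URL = 'http://www.the_url_here.org/ldip_sch'   # placeholder address
RDFS = 'http://www.w3.org/2000/01/rdf-schema#'

# Escape the characters that would break XML element content.
def escape_xml(text)
  text.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
end

# label:   the Class Label field of the CDE
# comment: the comment (definition) field of the CDE
# parent:  URI of the parent class; defaults to the top-level Class
def class_element(label:, comment:, parent: RDFS + 'Class')
  about = SCHEMA_URL + '#' + label
  [
    %(<rdfs:Class rdf:about="#{about}">),
    "  <rdfs:label>#{escape_xml(label)}</rdfs:label>",
    "  <rdfs:comment>#{escape_xml(comment)}</rdfs:comment>",
    %(  <rdfs:subClassOf rdf:resource="#{parent}"/>),
    '</rdfs:Class>'
  ].join("\n")
end

puts class_element(
  label: 'Reagent',
  comment: 'Histologic stain reagents and other chemicals employed in the laboratory'
)
```

    Notice that the script never lists subclasses or properties, for the reasons given above.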

    Do classes in RDF schema remind you of anything? The classes in an RDF schema comprise an ontology. An ontology is a list of classes and their relationships. You can think of an ontology as the "classy" half of an RDF schema. Classes become most useful when they have Properties (the other half of the RDF schema).

    Creating Properties

    A property is a metadata element that is used to describe the data assigned to one or more class objects. Here is the CDE for a property named "dateTime".

      Identifier:ldip:dateTime
      Property Label:dateTime
      versionInfo: 0.1
      Registration Authority: Association for Pathology Informatics
      Language:en
      Obligation:optional
      Maximum Occurrence: Unlimited
      Datatype: /[\+\-]{1}[\d]{8}\.[\d]{6}Z[\+\-]{1}[\d]{4}/
      comment: ISO 8601 format of date and time.
      domain:Event
      range: http://www.the_url_here.org/ldip_xsd.xsd#iso8601
      Contributor:Bill Moore
      Date_of_contribution:05-30-2006
    

    An RDF Schema declaration for the dateTime property might be:

      <rdf:Property rdf:about="http://www.the_url_here.org/ldip_sch#dateTime">
        <rdfs:label>dateTime</rdfs:label> 
        <rdfs:comment>
         The date and time at which an event occurs, in ISO8601 format   
        </rdfs:comment>
        <rdfs:domain rdf:resource="http://www.the_url_here.org/ldip_sch#Event"/>
        <rdfs:range rdf:resource="http://www.the_url_here.org/ldip_xsd.xsd#iso8601"/>
      </rdf:Property>

    Let's look at the dateTime property. The first line announces that we are declaring a Property, and gives the name of the Property (dateTime) and its URL (the current RDF schema document). The second line provides the label by which we will refer to the Property. This might come in handy if we had different names for the property in different languages. The comment includes a definition for the element.

    The next line specifies the domain (class) for the property. The domain of a property is the class for which the property may be used. In this case, the domain class for which the dateTime property applies is Event. This makes sense. If you need to describe an event, you would want to include the time that the event occurred.

    A property for a class may serve as a property for all of the subclasses of the class (because all the subclass instances are members of the ancestor class). Every Property must have a domain (a class or classes for which the Property may be used) and a Range (a specified kind of data that is described by the Property). A property may have multiple classes in its domain. When a property has multiple classes in its domain, all the classes in the domain share the same property (obviously). This achieves some of the functionality of multi-class inheritance, without actually needing to instantiate multiple classes under a single object. This is a subtle concept, and does not need to be mastered at this time. Suffice it to say that as you create your own RDF Schemas, you should try to design your Properties to apply to multiple classes, and you should try to instantiate objects under a single class.
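
    To make the inheritance rule concrete, here is a small Ruby sketch. The PARENT and DOMAIN tables are a hypothetical in-memory stand-in for a schema (the "accession" property is invented for illustration), and the test follows the reading used in this document: a property may be used with any class listed in its domain, and with any descendant of such a class.

```ruby
# Hypothetical schema tables: each class maps to its parent class,
# and each property maps to the classes in its domain.
PARENT = {
  'Report' => 'Class',
  'Surgical_Pathology_Report' => 'Report',
  'Event' => 'Class',
  'Image' => 'Class'
}
DOMAIN = {
  'dateTime' => ['Event'],
  'accession' => ['Report', 'Image']   # one property, two domain classes
}

# Return the class followed by all of its ancestors.
def lineage(klass)
  chain = []
  while klass
    chain << klass
    klass = PARENT[klass]
  end
  chain
end

# A property applies if the class or one of its ancestors is in the domain.
def property_applies?(property, klass)
  domain = DOMAIN.fetch(property, [])
  lineage(klass).any? { |ancestor| domain.include?(ancestor) }
end

puts property_applies?('accession', 'Surgical_Pathology_Report')  # true
puts property_applies?('dateTime', 'Image')                       # false
```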

    Specifying a datatype from within an RDF Schema Property Element

    Let us continue to examine the dateTime property. Recall that a triple consists of a subject, followed by metadata (the property element), followed by the data. The property element describes the data. The range of the property element tells us what kind of data is described. In RDF schemas, the range of a property is often "Literal", an element defined in the top-level W3C RDF Schema that refers to any character string. You can see immediately that describing the range of a property as a character string does little to constrain or structure the expected values for a data element.

    In the dateTime property, we want the range of the property to be data that conforms to the ISO8601 date/time format. How do we convey the datatype of the date/time element in RDF?

    RDF has no intrinsic datatyping facility. So for our property range, we provide a resource (URL) that specifies an element in an .xsd file, that defines the datatype we need.

    The range for the dateTime property is a resource:

    <rdfs:range rdf:resource="http://www.the_url_here.org/ldip_xsd.xsd#iso8601"/>

    The resource points us to an xsd file on the web, and to a particular element within the xsd file, labeled iso8601. Let's pretend we visit the file and extract the iso8601 element. We might find the following:

      <xsd:simpleType name='iso8601'>
      <!-- values of a date_time must contain              -->
      <!-- a plus or minus sign occurring zero or one      -->
      <!-- times followed by 8 digits                      -->
      <!-- followed by a period                            -->
      <!-- followed by 6 digits                            -->
      <!-- followed by the letter Z, the letter T          -->
      <!-- or a space                                      -->
      <!-- followed by a plus or minus sign occurring      -->
      <!-- exactly once, followed by 4 digits              -->
        <xsd:restriction base='xsd:string'>
          <xsd:pattern value='[\+\-]?[\d]{8}\.[\d]{6}[ZT ][\+\-]{1}[\d]{4}'/>
        </xsd:restriction>
      </xsd:simpleType>

    The essence of the datatype is found in the pattern value line:

      <xsd:pattern value='[\+\-]?[\d]{8}\.[\d]{6}[ZT ][\+\-]{1}[\d]{4}'/>

    This line uses a Regular Expression (RegEx) that provides a pattern to which the element must conform. RegEx is beyond the scope of this manuscript.

    Don't be intimidated by .xsd and RegEx rules. For most purposes, simply describing the range of a property with the element "Literal" will be all that you need.

    The .xsd element definition imposes a datatype pattern on the value of the data described by the property. A validating software agent would check an RDF document to determine if the data described by a property conform to the range of the property element, as defined by the element in the .xsd resource for the property range.
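
    The core of such a validating agent is a one-line pattern test. The Ruby sketch below is our own translation of the iso8601 pattern and is not a general XSD validator. The name valid_date_time? is hypothetical, and the sample values are invented.

```ruby
# Ruby translation of the iso8601 pattern from the .xsd element:
# optional sign, 8 digits, a period, 6 digits, one of Z, T or space,
# a sign, then 4 digits.  \A and \z anchor the match to the whole
# string, because an XSD pattern must match the entire value.
ISO8601_PATTERN = /\A[+\-]?\d{8}\.\d{6}[ZT ][+\-]\d{4}\z/

def valid_date_time?(value)
  ISO8601_PATTERN.match?(value)
end

puts valid_date_time?('+20070910.093000Z-0400')   # true
puts valid_date_time?('September 10, 2007')       # false
```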

    If we want to employ this trick, we'll need to prepare an XSD file that contains elements for all the datatypes referred to by the property ranges included in our RDF Schema.

    Creating the external XSD document to datatype our property ranges

    XSD datatype files are very easy to prepare. Basically, you just list your datatypes and provide descriptors. The following generic file contains samples of the kinds of datatypes you will probably need (patterns, inclusive values, unions, and enumerations).

      <?xml version="1.0" encoding="UTF-8"?>
      <xsd:schema 
      xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      
      <xsd:simpleType name='sp_pattern'>
      <!-- values of an accession number must contain      -->
      <!-- the letters sp followed by a hyphen followed    -->
      <!-- by two digits followed by a hyphen              -->
      <!-- followed by one or more digits                  -->
        <xsd:restriction base='xsd:string'>
          <xsd:pattern value='sp\-[0-9]{2}\-[0-9]+'/>
        </xsd:restriction>
      </xsd:simpleType>
    
      <xsd:simpleType name="adult_age">
      <!-- an adult is at least 18 years old               -->
        <xsd:restriction base="xsd:positiveInteger">
           <xsd:minInclusive value="18"/>
        </xsd:restriction>
      </xsd:simpleType>
      <xsd:simpleType name="serialNumber">
        <!--  may be either integers or mixed alphanumeric strings -->
        <xsd:union>
           <xsd:simpleType>
             <xsd:restriction base='xsd:integer'/>
           </xsd:simpleType>
           <xsd:simpleType>
             <xsd:restriction base='xsd:string'/>
           </xsd:simpleType>
        </xsd:union>
      </xsd:simpleType>
      
      <xsd:simpleType name="EnumerationObjectives">
      <!--  must be one of the listed objective magnifications -->
        <xsd:restriction base="xsd:string">
          <xsd:enumeration value="2.5x"/>
          <xsd:enumeration value="6.3x"/>
          <xsd:enumeration value="20x"/>
          <xsd:enumeration value="40x"/> 
          <xsd:enumeration value="100x"/>
        </xsd:restriction>
      </xsd:simpleType>
      </xsd:schema>
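
    If you would rather test values in a script than with an XSD-aware validator, each of the four sample datatypes reduces to a short test. The Ruby sketch below is a hand-written stand-in for the sample file and does not read the .xsd. The CHECKS table and the conforms? method are hypothetical names, and all values are assumed to arrive as strings.

```ruby
# Ruby stand-ins for the four simple types in the sample XSD file.
OBJECTIVES = %w[2.5x 6.3x 20x 40x 100x]

CHECKS = {
  # pattern: "sp", hyphen, two digits, hyphen, one or more digits
  'sp_pattern' => ->(v) { v.match?(/\Asp-[0-9]{2}-[0-9]+\z/) },
  # positive integer with a minimum inclusive value of 18
  'adult_age' => ->(v) { v.match?(/\A\+?[0-9]+\z/) && v.to_i >= 18 },
  # union of integer and string: any string value is acceptable
  'serialNumber' => ->(v) { v.is_a?(String) },
  # enumeration: the value must be one of the listed objectives
  'EnumerationObjectives' => ->(v) { OBJECTIVES.include?(v) }
}

def conforms?(datatype, value)
  check = CHECKS.fetch(datatype) { raise ArgumentError, "unknown datatype #{datatype}" }
  check.call(value)
end

puts conforms?('sp_pattern', 'sp-07-12345')
puts conforms?('adult_age', '17')
puts conforms?('EnumerationObjectives', '40x')
```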

    The differences between Classes and Properties

    The most difficult step in building any schema is determining whether a candidate element is a Class or a Property. Generalizations do not hold for all cases. For example, Classes tend to be nouns, while Properties (that describe data) tend to be adjectives. However, a Property can be a noun (e.g. Time) if its role is to describe a data value (4:00 PM EST). Furthermore, we sometimes assign active processes to classes (e.g. birth, death), and we cannot assume that classes are always static objects.

    There is a strong tendency to assign subclass status to things that are not examples of their ancestral class. For instance, if Person is a class, someone may think that Leg is a subclass of Person (because a Leg is in a class of things that are parts of a Person). No! Leg is never a subclass of Person because a Leg is not a Person. A subclass of Person must be composed of types of Persons. So, Patient is a subclass of Person, and Pathologist is a subclass of Person, because they are both examples of Persons and because there are instances of Patients and instances of Pathologists. Remember, a class is a construct whose chief job is to provide specified instances.

    How about Friend? Is Friend a subclass of Person? Yes and no. Friend can be a subclass of person if you want to organize Persons based on whether they are Friends or not-Friends. However, if you think that being a friend is just one of many features of any Person, you would be much better off defining friend as an RDF property. The data-type of the friend property may be a Boolean (true or false).

    Here are some general recommendations for distinguishing RDF Schema Classes and Properties.

      1. If something has instances of itself, it is almost always a class.
      
      2. If a candidate class is a subclass of more than one class lineage 
         (so-called multiple inheritance), think very hard before making it a 
         class.  In most cases, you will be better off if it is assigned as a 
         Property, or if it is excluded from the RDF Schema.
      
      3. Every class must be a subclass of a class. To be a subclass of a 
         class the subclass must qualify as a member of the father class.  
      
      4. A class is fully specified when you know its definition and you know 
         its ancestor class.
      
      5. Properties describe data. If something has a specific datatype that 
         includes numerics, it is almost always a property.

    Creating instances of classes

    The purpose of a class is to support the creation of subclasses and class instances. If we have a Report class, we might also have a Surgical_Pathology_Report class which is a subClassOf Report. Elsewhere_General_S06_4352 may be one unique instance of the class Surgical_Pathology_Report. As an instance of the class, the data in the report can be described using the properties specified in an RDF schema to have the Surgical_Pathology_Report domain.

    The way to create an instance of a class in an RDF document is with the RDF "type" primitive.

      <rdf:Description rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
         <rdf:type rdf:resource="http://www.the_url_here.org/ldip_sch#Image"/>
      </rdf:Description>

    Whenever we wish to create an instance of a class belonging to any RDF Schema we choose, we will add an RDF statement much like the one shown above. We begin by specifying the subject of the statement (with the "about" declaration), then we indicate that the "type" of the subject is a class listed in an RDF Schema. An object may be an instance of more than one class, and a proper RDF statement may list numerous type/class pairs, but we caution that doing so adds complexity to your document.
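
    Because every instance statement has the same shape, it is easily produced by a script. The following Ruby sketch is one way to do it. The method name typed_description is hypothetical, and the URLs are the placeholders used in this document.

```ruby
# Render one rdf:Description that types a subject under one or more classes.
# class_uris may be a single URI string or an array of URI strings.
def typed_description(subject, class_uris)
  lines = [%(<rdf:Description rdf:about="#{subject}">)]
  Array(class_uris).each do |uri|
    lines << %(  <rdf:type rdf:resource="#{uri}"/>)
  end
  lines << '</rdf:Description>'
  lines.join("\n")
end

puts typed_description(
  'http://www.the_url_here.org/ldip/ldip2103.jpg',
  'http://www.the_url_here.org/ldip_sch#Image'
)
```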

    Preserving namespaces for Classes and Properties

    One of the most important features of RDF Schemas is that you can mix and match different elements (classes and properties) from different schemas in a single document. This is done using a simple namespace notation that is common to all XML documents.

      <?xml version='1.0' encoding='ISO-8859-1'?>
        <rdf:RDF 
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:schem1="http://www.someplace.org/#"
            xmlns:schem2="http://www.someplace_else.com/#">
          <rdf:Description rdf:about="http://www.the_url_here.org/ldip/ldip2201.jpg">
            <dc:creator>Bill Moore</dc:creator>
            <schem1:camera>yes</schem1:camera>
            <schem2:camera>Olympus</schem2:camera>
            <schem1:format>jpeg</schem1:format>
            <schem2:format>jpg</schem2:format>
          </rdf:Description>
        </rdf:RDF>

    Notice that the elements camera and format appear twice, but on each occasion, they are prefixed by different namespaces (schem1 and schem2). The namespaces preserve metadata individuality.
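
    The reason this works is that a prefixed tag is shorthand for a full URI. A short Ruby sketch makes the point. The NAMESPACES table mirrors the example header (with the Dublin Core namespace written as http://purl.org/dc/elements/1.1/), and the method name expand is hypothetical.

```ruby
# Prefix-to-URI table mirroring the namespace declarations in the example.
NAMESPACES = {
  'dc' => 'http://purl.org/dc/elements/1.1/',
  'schem1' => 'http://www.someplace.org/#',
  'schem2' => 'http://www.someplace_else.com/#'
}

# Expand a qualified name such as "schem1:camera" to its full URI.
def expand(qname)
  prefix, local = qname.split(':', 2)
  base = NAMESPACES.fetch(prefix) { raise KeyError, "undeclared prefix #{prefix}" }
  base + local
end

puts expand('schem1:camera')   # http://www.someplace.org/#camera
puts expand('schem2:camera')   # http://www.someplace_else.com/#camera
puts expand('schem1:camera') == expand('schem2:camera')   # false
```

    The two camera elements share a local name but expand to different URIs, so software never confuses them.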

    Unique subject identifiers

    Once you have made an instance of a class, you need to identify the instance uniquely. Failing this, the metadata-data pairs associated with a class instance have no meaning.

    The typical unique subject identifier in an RDF triple is a URL specifying a unique web location for a data object.

    Failing this, any unique identifier that permanently, unmistakably and uniquely links an object to a character string will suffice.

    There are a number of registry services that provide identifiers for data objects in their domains. Examples are:

      DOI, Digital object identifier.
      PMID, PubMed identification number.
      LSID (Life Science Identifier).
      HL7 OID  (Health Level 7 Object Identifier).
      DICOM (Digital Imaging and Communications in Medicine) identifiers.
      ISSN (International Standard Serial Numbers).
      Social Security Numbers (for U.S. population).
      NPI, National Provider Identifier, for physicians.
      Clinical Trials Protocol Registration System.
      Office of Human Research Protections FederalWide Assurance number.
      Data Universal Numbering System (DUNS) number
      DNS, Domain Name Service.
      
    In the life sciences, the LSID number has achieved some popularity. 
    The LSID resolution protocol has five parts:
      Network Identifier (NID)
      root DNS name of the issuing authority
      namespace chosen by the issuing authority
      object id unique in that namespace and assigned locally
      revision id for storing versioning information (optional)
      
    LSIDs can be used as URNs that uniquely identify items in RDF statements.

    LSID Examples:

      urn:lsid:pdb.org:1AFT:1    
      This is the first version of the 1AFT protein in the Protein Data Bank.
      
      urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434   
      This references a PubMed article.
      
      urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2    
      This refers to the second version of an entry in GenBank
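
      A short Ruby sketch shows how an LSID splits into the five parts listed
      above.  The method name parse_lsid is hypothetical.  The sketch assumes
      that the authority, namespace and object id are all present, and that 
      only the revision id may be omitted.

```ruby
# Split an LSID into the five parts listed in the text.
def parse_lsid(lsid)
  parts = lsid.split(':')
  well_formed = parts.size.between?(5, 6) &&
                parts[0].casecmp?('urn') &&
                parts[1].casecmp?('lsid')
  raise ArgumentError, "not an LSID: #{lsid}" unless well_formed

  {
    nid: parts[1],         # network identifier, always "lsid"
    authority: parts[2],   # root DNS name of the issuing authority
    namespace: parts[3],   # namespace chosen by the authority
    object: parts[4],      # object id, unique within the namespace
    revision: parts[5]     # optional; nil when no revision is given
  }
end

p parse_lsid('urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434')
p parse_lsid('urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2')
```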
      
      HL7 also provides unique identifiers.  An enterprise can obtain an 
      OID at:
      http://www.iana.org/cgi-bin/enterprise.pl
      
      For example, the University of Michigan OID is:
      1.3.6.1.4.1.250
      
      The enterprise OID serves as a prefix for unique data objects within 
        an institution.  
      
      Unique identifiers are used to uniquely specify the subject of a 
        triple (i.e., to specify what a triple is about).
      
      Example:
      
      <rdf:Description rdf:about="urn:lsid:ncbi.nlm.nih.gov:pubmed:8718907">
      <dc:creator>Bill Moore</dc:creator>
      </rdf:Description>
      
      
    Here we have a unique data object specified with an lsid for a 
    PubMed citation.  The number 8718907 is the unique pubmed citation 
    number.  We add a property/value pair consisting of the Dublin Core 
    creator element and the data value, "Bill Moore".  Once we have a 
    unique subject, we can instantiate the element for an appropriate class.

      <rdf:Description rdf:about="urn:lsid:ncbi.nlm.nih.gov:pubmed:8718907">
      <rdf:type rdf:resource="http://www.the_url_here.org/ldip_sch#Document"/>
      <dc:creator>Bill Moore</dc:creator>
      </rdf:Description>

    Here we have a unique data object, instantiated as a member of the Document class. The Document class is defined in an RDF Schema referenced to a URL.

    To summarize, the subject of a triple needs to be identified. The subject of a triple can be in the form of a URL (complete web address) or a URN (Uniform Resource Name).

    URLs and URNs are both forms of URIs (Uniform Resource Identifiers).

    You can create your own uniquely specified data object by appending a unique number to a URN prefix. For instance, a surgical pathology report, or a patient name, or an image file, can be the subject of a triple, if it is identified by the following:

      urn:www.the_url_here.org:ldip:4Ib30fk6J3Y9gWpwMV27

    Here, the prefix is "urn:www.the_url_here.org:ldip:". The alphanumeric suffix, "4Ib30fk6J3Y9gWpwMV27", is a 20-character random string that we have chosen for the object.
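
    An identifier of this kind can be produced with a few lines of Ruby. This is a minimal sketch using Ruby's standard securerandom library. The method name new_identifier is hypothetical, and the prefix is the placeholder used in this document. A random string is only very unlikely to repeat, so a careful implementation would also record the identifiers it has issued.

```ruby
require 'securerandom'

URN_PREFIX = 'urn:www.the_url_here.org:ldip:'   # placeholder prefix

# Append a 20-character random alphanumeric string to the prefix.
def new_identifier(prefix = URN_PREFIX)
  prefix + SecureRandom.alphanumeric(20)
end

puts new_identifier
```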

    There are many ways of providing identifiers for subjects. Once a subject is identified, triples containing the identifier can be merged from multiple RDF documents appearing anywhere on the internet.
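
    Merging is simple precisely because the identifier travels with each triple. In the Ruby sketch below, two documents are represented as plain arrays of [subject, property, value] triples. The method name merge_triples and the sample data are hypothetical, and no RDF parsing is attempted.

```ruby
# Merge triples from any number of documents, grouping them by subject.
# Each triple is a three-element array: [subject, property, value].
def merge_triples(*documents)
  documents.flatten(1).uniq.group_by(&:first).transform_values do |triples|
    triples.map { |_subject, property, value| [property, value] }
  end
end

subject = 'urn:www.the_url_here.org:ldip:4Ib30fk6J3Y9gWpwMV27'
doc_a = [
  [subject, 'dc:creator', 'Bill Moore'],
  [subject, 'dc:format', 'jpeg']
]
doc_b = [
  [subject, 'dc:creator', 'Bill Moore'],   # repeated assertion, kept once
  [subject, 'dc:date', '2007-09-10']
]

merged = merge_triples(doc_a, doc_b)
merged[subject].each { |property, value| puts "#{property} = #{value}" }
```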

    The RDF Schema for pathology images

        RDF Schema document begins
        <?xml version="1.0" encoding="utf-8"?>
        <!-- 
        This RDF Schema document is a bare-bones RDF Schema for image data.
        It contains the following classes
        Person
        Event
        Report
        Specimen
        Instrument
        Image
        Pathology
        Dublin_core (the class of Dublin Core properties)
        This document is a skeletal Schema, that can be used to build 
        more complex RDF Schemas. A full explanation of RDF, RDF 
        Schemas, and the role of RDF in image annotation, can be found at:
        http://www.julesberman.info/spec2img.htm
        The URL of this RDF Schema document is:
        http://www.julesberman.info/img_sch.xml
        This RDF Schema document is a public domain document.
        This RDF Schema document is provided "as is", without warranty of any kind,
        express or implied, including but not limited to the warranties
        of merchantability, fitness for a particular purpose and
        noninfringement. In no event shall the authors or copyright
        holders be liable for any claim, damages or other liability,
        whether in an action of contract, tort or otherwise, arising
        from, out of or in connection with the software or the use or
        other dealings in the document.
        This document was created by Jules J. Berman and G. William Moore
        on September 10, 2007,
        for presentation at the following workshop:
        Implementing an RDF Schema for Pathology Images,
        from the Association for Pathology Informatics.
        APIII, Pittsburgh, PA
        September 10, 2007
        This file was validated by the W3C RDF Validation Service at:
        http://www.w3.org/RDF/Validator/
        on September 2, 2007
        -->
        <rdf:RDF
          xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
          xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
          xmlns:dcterms="http://purl.org/dc/terms/" 
          xmlns:dc="http://purl.org/dc/elements/1.1/" 
        >
           <rdfs:Class rdf:ID="Person">
             <rdfs:subClassOf 
                  rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
           </rdfs:Class>
           <rdfs:Class rdf:ID="Event">
             <rdfs:subClassOf 
                  rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
           </rdfs:Class>
           <rdfs:Class rdf:ID="Report">
             <rdfs:subClassOf 
                  rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
           </rdfs:Class>
           <rdfs:Class rdf:ID="Specimen">
             <rdfs:subClassOf 
                  rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
           </rdfs:Class>
           <rdfs:Class rdf:ID="Instrument">
             <rdfs:subClassOf 
                  rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
           </rdfs:Class>
           <rdfs:Class rdf:ID="Image">
             <rdfs:subClassOf 
                  rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
           </rdfs:Class>
           <rdfs:Class rdf:ID="Pathology">
             <rdfs:subClassOf 
                  rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
           </rdfs:Class>
           <rdfs:Class rdf:ID="Dublin_core">
             <rdfs:subClassOf 
                  rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
           </rdfs:Class>
        </rdf:RDF>
        RDF Schema document ends
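
    A skeletal schema like this one is repetitive enough to be written by a script. The Ruby sketch below regenerates the class declarations from a list of class names. It is an illustration only, and we do not claim the schema above was produced this way. The method name schema_document is hypothetical, and the output omits the explanatory comment and the Dublin Core namespace declarations.

```ruby
RDFS_CLASS = 'http://www.w3.org/2000/01/rdf-schema#Class'
CLASS_NAMES = %w[Person Event Report Specimen Instrument Image Pathology Dublin_core]

# Return a bare-bones RDF Schema document declaring each named class
# as a subclass of the top-level Class.
def schema_document(class_names)
  lines = [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<rdf:RDF',
    '  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"',
    '  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">'
  ]
  class_names.each do |name|
    lines << %(  <rdfs:Class rdf:ID="#{name}">)
    lines << %(    <rdfs:subClassOf rdf:resource="#{RDFS_CLASS}"/>)
    lines << '  </rdfs:Class>'
  end
  lines << '</rdf:RDF>'
  lines.join("\n")
end

puts schema_document(CLASS_NAMES)
```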

    Data Specifications contrasted with Data Standards

    An RDF Schema is a dictionary of classes and properties. An RDF data specification consists of an RDF Schema that describes the classes and properties in its specific knowledge domain, plus an .xsd document that defines the datatypes associated with a property's range. An RDF document that specifies a data object contains instances of classes listed in an RDF Schema that are bound to data described by the properties listed in the RDF Schema.

    RDF data specifications have a number of properties that data standards lack.

      1. RDF data specifications are optional.  The purpose of a data 
         specification is to provide an opportunity for people to describe 
         their data objects.  Data standards, unlike data specifications, are 
         often imposed requirements.
      2. RDF data specifications are self-describing and contain all the 
         information needed to interpret the contained information.  Data 
         files that conform to standards are often inscrutable and not 
         intended to be read by humans.
      3. RDF data specifications can be interrogated by general autonomous 
         software agents.   Competently written general software agents can 
         parse and understand any RDF document.  This is the underlying 
         premise of the Semantic Web.  Files that conform to most data 
         standards can only be parsed by software specifically written to 
         accommodate the data standard.
      4. RDF specifications can reduce the complexity of data.  All data can 
         be described in RDF documents consisting of data triples.  Data 
         standards have no unifying principle of data description.  The 
         presence of competing standards, different versions of standards, and 
         proprietary extensions of standards have contributed to the undesirable
         complexity of electronic information.
      5. The data in a data specification can be distributed over multiple 
         RDF documents.  
      6. The assertions in RDF data specifications have meaning, and the 
         meaning is preserved when the assertion is extracted from the data 
         specification document.  Data standards do not contain meaningful 
         assertions. There is no general way of extracting components of a 
         data standard, and building datasets composed of meaningful assertions.  
      7. A data specification can comply with multiple RDF schemas at once.
      8. A data specification can be written without violating intellectual 
         property, or breaching patient confidentiality.

    Creating your own RDF Schema

    You can easily create an RDF Schema to specify the classes and properties relevant to a knowledge domain.

      1. List the classes and properties that comprehensively describe a 
         knowledge domain.
      2. Determine a class hierarchy.  Every class on your list must be a 
         subclass of something else.  The top level classes may be subclasses 
         of Class.
      3. Everything that is not a class must be a property.  Determine the 
         domain of each property (i.e., the classes to which the property applies) 
         and the range of each property (the kind of data value associated 
         with the property).  The default value for a property's range is 
         "Literal".
      4. Prepare an .xsd datatype declaration for each Property range that 
         requires a constrained data type. 
      5. Prepare a CDE list that includes the basic CDE annotations recommended 
         in ISO-11179, and the basic annotation needed to describe classes, 
         properties, and datatypes for property ranges.  Use the generic LDIP 
         CDEs for classes and properties as a default.
      6. Validate the CDE list (by careful proofreading or through a software 
         utility).
      7. Convert  CDEs to an RDF schema using a software utility
         or "by hand" if the CDE list is short.
      8. Validate the RDF schema.
      9. Vet the RDF schema through the intended user community.
      10. Distribute the RDF schema to the public.
      11. Repeat steps 1-8 ad libitum, providing a new version number to each 
          successive specification.
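
    Step 2 is easy to check by script before the schema is written. The Ruby sketch below takes a hypothetical table that maps each class to its parent, and reports any class that has no parent or that names a parent missing from the list. The method name hierarchy_problems and the sample classes are our own.

```ruby
# parents maps each class name to the name of its parent class.
# The top-level parent is written as "Class".
def hierarchy_problems(parents)
  problems = []
  parents.each do |klass, parent|
    if parent.nil? || parent.empty?
      problems << "#{klass} has no parent class"
    elsif parent != 'Class' && !parents.key?(parent)
      problems << "#{klass} names an undeclared parent: #{parent}"
    end
  end
  problems
end

candidate = {
  'Report' => 'Class',
  'Surgical_Pathology_Report' => 'Report',
  'Patient' => 'Person',   # Person has not been declared in this list
  'Stain' => nil           # no parent given
}

puts hierarchy_problems(candidate)
```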

    Using an RDF Schema in your RDF documents

    General instructions for specifying biomedical data using RDF Schemas

    You can easily create RDF documents that fully describe a data object and that comply with one or more RDF Schemas written for a knowledge domain.

      1. Look at your own available data for the data object.   List your 
         data as triples sequenced as: <subject, metadata, data>.
      2. Create an RDF header that includes the URL of each of the RDF 
         Schemas that you will use in the document.  Be sure to include the 
         Dublin Core RDF Schema.  The Dublin Core schema provides the elements 
         that describe the file (e.g., creator, date of file, type of file), 
         and is used by librarians to index your document. 
      3. Use the RDF Schema(s) appropriate for the knowledge domain of your 
         data object.  Determine which of your listed triples have subjects 
         that are instances of classes in the RDF Schema, and metadata consistent 
         with class-appropriate  properties.  Type these subjects as class 
         instances.  Check that the data values conform to any data value 
         constraints listed in the .xsd datatype file associated with the RDF 
         Schema.
      4. If you have common data elements (classes or properties) that are 
         not part of any public RDF Schema, create your own RDF Schema to 
         accommodate these elements, and use the RDF Schema as the resource 
         for those elements wherever they appear in your RDF specification 
         document.
      5. Contact the curator of the public RDF Schema that is appropriate 
         for the elements you created, and ask if your RDF Schema can be added 
         to the public RDF Schema.
      6. Validate that the document is well-formed XML, that it is 
         syntactically correct RDF, and that all triples conform to the 
         class-property-datatype descriptions from the RDF Schema.
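
    Steps 1 through 3 can be sketched as a short Ruby script. This is a minimal illustration, not a complete RDF writer: the "img" schema URL, the subject URL, the property img:stain, and all data values are placeholders invented for the example. The rdf and dc namespace URIs are the published ones.

```ruby
# Step 2: namespaces for the RDF header. The "img" schema URL is a
# placeholder invented for this sketch.
DOC_NAMESPACES = {
  "rdf" => "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  "dc"  => "http://purl.org/dc/elements/1.1/",
  "img" => "http://www.example.org/image_schema.rdfs#"
}.freeze

# Step 1: the available data, listed as <subject, metadata, data> triples.
# All values are hypothetical.
DOC_TRIPLES = [
  ["http://www.example.org/images/slide_001.jpg", "dc:creator", "A. Pathologist"],
  ["http://www.example.org/images/slide_001.jpg", "dc:date", "2007-09-10"],
  ["http://www.example.org/images/slide_001.jpg", "img:stain", "H&E"]
].freeze

# Step 3: the class (from the hypothetical schema) that each subject instantiates.
DOC_SUBJECT_CLASSES = {
  "http://www.example.org/images/slide_001.jpg" =>
    "http://www.example.org/image_schema.rdfs#Photomicrograph"
}.freeze

# Escape the characters that are special in XML text and attribute values.
def xml_escape(text)
  text.to_s.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;").gsub('"', "&quot;")
end

# Emit one rdf:Description per subject, typed as a class instance,
# with one child element per <metadata, data> pair.
def build_rdf(triples, subject_classes, namespaces)
  xmlns = namespaces.map { |prefix, uri| %(xmlns:#{prefix}="#{uri}") }.join("\n         ")
  out = ['<?xml version="1.0"?>', "<rdf:RDF #{xmlns}>"]
  triples.group_by(&:first).each do |subject, group|
    out << %(  <rdf:Description rdf:about="#{xml_escape(subject)}">)
    klass = subject_classes[subject]
    out << %(    <rdf:type rdf:resource="#{xml_escape(klass)}"/>) if klass
    group.each do |_subject, property, value|
      out << "    <#{property}>#{xml_escape(value)}</#{property}>"
    end
    out << "  </rdf:Description>"
  end
  out << "</rdf:RDF>"
  out.join("\n")
end

puts build_rdf(DOC_TRIPLES, DOC_SUBJECT_CLASSES, DOC_NAMESPACES)
```

    The script only assembles the document. Checking the values against the .xsd datatypes (the end of step 3) and validating the result (step 6) remain separate tasks.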

    A few publicly available RDF schemas

    The metadata (essentially the tags corresponding to Properties) and the Classes in your RDF documents will be selected from pre-existing RDF Schemas (formal vocabularies) found as public Web documents with unique URLs.

    Here are a few:

    A Top-Level Ontology

    http://www-sop.inria.fr/acacia/personnel/phmartin/RDF/phOntology.html

    Gene Ontology (GO)

    http://139.91.183.30:9090/RDF/VRP/Examples/go.rdf

    MGED Ontology

    http://139.91.183.30:9090/RDF/VRP/Examples/mgedontology.rdfs

    As noted above, when you need Classes and Properties that are not available in public RDF Schemas, you can just create your own RDF Schema, put it on the web, and make it available to your own (and anyone else's) RDF documents.

    Validating a specification

    Software validation is one of the more challenging areas in computer science. Dozens of books, hundreds of manuscripts, and many thousands of hours of programmer time have been devoted to this demanding subject. Fortunately, validation in the realm of data specifications can be quite easy.

    Basically, data specifications consist of lists of triples. Triples are valid when the following conditions hold:

      1. The property in a triple is suitable for the subject.
      2. The value of the triple is suitable for the property.

    Validating an RDF document (in the context of this manuscript, a document that specifies a data object or objects) comes down to this:

    
      1. Checking that the document is well-formed XML.
      2. Checking that the document is well-formed RDF.
      3. Checking that the triples are valid.

    The W3C has a web site that will validate the structure of RDF documents. It is available at:

    http://www.w3.org/RDF/Validator/
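
    The third check (that the triples are valid) is the one a structural validator does not cover. A minimal Ruby sketch of the two conditions listed above follows. It rests on loud assumptions: the schema is reduced to a hash of hypothetical properties, each with a single domain class; a regular expression stands in for an .xsd datatype; and subclass inheritance is ignored.

```ruby
# A toy schema. Each hypothetical property names the class it applies to
# (its domain) and a pattern that its values must match. The pattern stands
# in for an .xsd datatype constraint.
CHECK_PROPERTIES = {
  "magnification" => { domain: "Photomicrograph", pattern: /\A[1-9]\d*\z/ },
  "stain"         => { domain: "Photomicrograph", pattern: /\A.+\z/ }
}.freeze

# The class assigned to each (hypothetical) subject.
CHECK_SUBJECT_CLASSES = {
  "image_001"  => "Photomicrograph",
  "patient_17" => "Patient"
}.freeze

# Return a list of problems found in one <subject, property, value> triple.
# An empty list means the triple satisfies both conditions.
def triple_errors(triple, properties, subject_classes)
  subject, property, value = triple
  rule = properties[property]
  return ["unknown property #{property}"] if rule.nil?

  errors = []
  # Condition 1: the property is suitable for the subject.
  # (Exact class match only; subclasses are not considered in this sketch.)
  unless subject_classes[subject] == rule[:domain]
    errors << "#{property} is not suitable for #{subject}"
  end
  # Condition 2: the value is suitable for the property.
  unless rule[:pattern].match?(value.to_s)
    errors << "#{value.inspect} is not a suitable value for #{property}"
  end
  errors
end

[
  ["image_001", "magnification", "40"],
  ["image_001", "magnification", "forty"],
  ["patient_17", "stain", "H&E"]
].each do |triple|
  problems = triple_errors(triple, CHECK_PROPERTIES, CHECK_SUBJECT_CLASSES)
  puts "#{triple.inspect}: #{problems.empty? ? 'valid' : problems.join('; ')}"
end
```

    In practice the domains and datatypes would be read from the RDF Schema and its .xsd file rather than typed into the script.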

    OWL and DAML

    OWL and DAML are extensions of RDF. They extend RDF in a prescribed manner, through well-designed RDF Schemas that build on W3C's top-level RDF Schema. Neither OWL nor DAML is discussed here, but we remind you that all RDF documents may benefit from classes and properties available in public RDF Schemas. We urge readers to visit these URLs for further information:

    http://www.w3.org/TR/owl-ref/

    http://www.daml.org/

    Problems with RDF

    In the meta-reality of informatics, perfect things may be implemented imperfectly. The issues that concern us most regarding RDF are as follows:

      1. Despite the hype, RDF is not widely used.  The web is virtually all 
         HTML, with a minor contribution in XML and a negligible contribution 
         from RDF.  Many internet visionaries predict a bright future for RDF.  
         Now is a good time to master this simple but fascinating model of 
         reality.
      2. RDF is growing in complexity.  The many extensions to RDF (including 
         OWL and DAML) have made it difficult to master every aspect of the 
         subject.  In this manuscript, we have included only selected aspects 
         of RDF that we consider essential for productivity. 
      3. RDF, and ontologies written for RDF, do not restrict multi-class 
         inheritance.  In our opinion, this is a huge problem that can lead 
         to hopelessly complex systems.  We strongly recommend extending 
         Properties to multiple classes, rather than extending multiple 
         classes to data objects. 
      4. Though many magazine articles have been written extolling the 
         virtues of RDF and of the Semantic Web, most of these articles are 
         pitifully superficial.  Not surprisingly, no two persons ever seem 
         to have the same "take" on the subject of RDF data specifications. 
         The best literature seems to come from the W3C, but these Web 
         recommendations are technical reports, and do not focus on the needs 
         of biomedical informaticians. In this manuscript, we have tried 
         to provide one interpretation of the theory and implementation of 
         RDF, as it relates to specifying data.

    Summary of advanced RDF section

      1. Meaning is achieved by binding a metadata-data pair to a specified 
         subject, into a so-called triple. Example: 
         Jules J. Berman (subject) favorite food (metadata) pizza (data)
      2. The subject of a triple needs to be identified as a unique data 
         object.  The metadata needs to be defined, and the data needs to 
         have a specified structure.  These are achieved with identifiers 
         that uniquely specify class instances; with RDF Schemas that assign 
         classes to subjects and assign properties to metadata; and with .xsd 
         datatypes that impose structure on data values.
      3. RDF documents consist of triples.  RDF documents begin with a 
         declaration of the RDF namespace in which the syntactical elements 
         of RDF are defined.  When the RDF document creates instances of 
         classes defined in one or more external  RDF Schema documents, the 
         namespaces of the RDF Schema documents are also listed at the top of 
         the RDF document.
      4. Triples can be collected from heterogeneous RDF datasets, and the 
         data pertaining to a specified subject can easily be merged by RDF 
         parsers.  An RDF parser is a general utility that will work equally 
         well for any RDF document, because all RDF documents conform to the 
         W3C's RDF syntax recommendation.
      5. A specification is a document that describes a data object in a 
         manner that can be understood by humans or by computers.  An RDF 
         Schema is a dictionary of classes and properties that can be used 
         to completely describe a data object in its knowledge domain. There 
         may be many different ways of specifying an object in an RDF document,
         but if the specification is in the form of an RDF document that uses RDF 
         Schemas to create class instances and define metadata and uses XSD 
         datatypes to constrain the value of data, then the specification will be 
         understood by competent general software agents written to interrogate 
         RDF documents. 
      6. Data specifications have many advantages over data standards.  By 
         writing domain-specific RDF Schemas (e.g., the LDIP image 
         specification), we can reduce our dependence on data standards 
         and enhance our ability to integrate data collected from 
         heterogeneous datasets.
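
    Point 4 of the summary, merging triples from heterogeneous datasets, can be illustrated with a short Ruby sketch. The two datasets, the subject identifiers, and the property names are hypothetical, and the triples are assumed to have been extracted already by an RDF parser. The point is only that triples sharing a unique subject identifier combine without any knowledge of the datasets they came from.

```ruby
# Two hypothetical datasets. Each triple is <subject, metadata, data>, and
# both datasets refer to the same image by the same unique identifier.
PATHOLOGY_TRIPLES = [
  ["urn:example:image_001", "stain", "hematoxylin and eosin"],
  ["urn:example:image_001", "diagnosis", "adenocarcinoma"]
].freeze

IMAGING_TRIPLES = [
  ["urn:example:image_001", "magnification", "40"],
  ["urn:example:image_002", "magnification", "100"]
].freeze

# Pool the triples from any number of datasets, drop exact duplicates, and
# collect the <metadata, data> pairs that belong to each subject.
def merge_by_subject(*datasets)
  merged = Hash.new { |hash, subject| hash[subject] = [] }
  datasets.flatten(1).uniq.each do |subject, property, value|
    merged[subject] << [property, value]
  end
  merged
end

merge_by_subject(PATHOLOGY_TRIPLES, IMAGING_TRIPLES).each do |subject, pairs|
  puts subject
  pairs.each { |property, value| puts "  #{property}: #{value}" }
end
```

    The merge works only because the subject is identified in the same way in both datasets, which is why unique identifiers for data objects matter so much.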
    


    About this document

    This document was written as the syllabus for a workshop titled, "Implementing an RDF Schema for Pathology Images, from the Association for Pathology Informatics," presented at the APIII meeting in Pittsburgh, PA, Monday, September 10, 2007, 7:30 am - 8:30 am.

    The workshop summary, as it appears in the APIII 2007 brochure, is:

    The Laboratory Digital Imaging Project (LDIP) was a three year effort (2004-2007) under the direction and sponsorship of the Association for Pathology Informatics (API). The primary goal of LDIP was to develop a free, open source image specification that would convey clinical, histologic, and technical descriptors, along with image files. The purpose of the specification is to facilitate data sharing and to promote the creation of richly annotated pathology images that can be archived, analyzed, or published and that support data integration across heterogeneous biomedical domains. A specification was created in RDF Schema, the technology created by the W3C (World-Wide Web Consortium), as the logical scaffold for the Semantic Web. By annotating images in API's RDF Schema, RDF documents can be created that include objects and metadata from any other RDF Schema or from publicly available biomedical ontologies and can be ported to and from other data standards, including DICOM and OME (Open Microscopy Environment). This workshop will explain RDF, RDF Schemas, namespaces, object classes and data properties, and will describe methods for implementing API's RDF Schema for pathology images. Examples will be provided in which pathology images are annotated with several different RDF Schemas containing classes and properties relevant to pathology data sources. After this workshop, API's RDF Schema will be released as an online public document that can be linked from any RDF document.

    The PowerPoint presentation for the workshop is available at:

    http://www.julesberman.info/jbimage.ppt

    Competing interests

    The subject matter in this manuscript is covered in books that Jules Berman has written, and this document contains links to the Publisher's web site where these books are advertised. Other than linking to these books from this manuscript, I have no competing interests.

    Usage and Intellectual Property issues

    This document is a work of literature and has no purpose other than as a literary work.

    The text of the body of the document is copyrighted to Jules J. Berman, 2007, and is distributed under the GNU Free Documentation License.

      http://www.gnu.org/licenses/fdl.txt

    The Perl scripts and Ruby scripts were written by Jules J. Berman and are distributed under the GNU General Public License

    http://www.gnu.org/copyleft/gpl.html

    
    A disclaimer applies for all of the included scripts.
      The software is provided "as is", without warranty of any kind,
      express or implied, including but not limited to the warranties
      of merchantability, fitness for a particular purpose and
      noninfringement. In no event shall the authors or copyright
      holders be liable for any claim, damages or other liability,
      whether in an action of contract, tort or otherwise, arising
      from, out of or in connection with the software or the use or
      other dealings in the software.

    The Imaging RDF Schema is public domain.

    A pdf version of this manuscript is available at: http://www.julesberman.info/rdfimage.pdf

    An html version of this manuscript is available at: http://www.julesberman.info/spec2img.htm

    Acknowledgements

    The leaf image, leaf.jpg, was obtained from Wikipedia and has the following Rights annotation "This image from PD Photo.org has been released into the public domain by its author and copyright holder, Jon Sullivan."

    We thank Ulysses Balis, co-chair of the Laboratory Digital Imaging Program, for his many useful suggestions, and we thank Robert Leif, for his many helpful criticisms. We thank Kemp Watson for setting up a wiki to support LDIP committee activities. Most importantly, we thank all of the former members of LDIP for their valuable discussions.

    Citation

    Although this document is distributed freely, all public uses of the document or any part of the document must be cited. When citing this document, please use the following text.

    Berman JJ. Moore GW. Implementing an RDF Schema for Pathology Images, from the Association for Pathology Informatics. APIII, 2007.

    About the Authors

    Jules Berman, Ph.D., M.D., has studied cancer for the past 36 years. After receiving two bachelor of science degrees (mathematics and earth sciences from MIT), he entered the graduate program in pathology at Temple University, where he began his thesis work within the Fels Cancer Research Institute. He spent the final year of his graduate studies at the Naylor Dana Institute of the American Health Foundation in Valhalla, New York, before beginning his post-doctoral studies in the Perinatal Carcinogenesis section of the Laboratory of Experimental Pathology at the U.S. National Cancer Institute. He then earned a medical degree from the University of Miami, followed by a pathology residency at George Washington University Medical Center in Washington, D.C. He became Board Certified in Anatomic Pathology and in Cytopathology, and served as the chief of Anatomic Pathology, Surgical Pathology and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland. While working at the Baltimore VA Medical Center, he held appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he became the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the U.S. National Cancer Institute. In 2006, he became President of the Association for Pathology Informatics. Over the course of his career, his name has appeared as a co-author on hundreds of scientific contributions, and he has written, as first author, more than 100 publications. Today, Dr. Berman is a full-time writer.

    He maintains a personal web site and a personal blog.

    G. William Moore, M.D., Ph.D. (Bill Moore) earned a Bachelor's degree in Cell Biology at the University of Michigan at Ann Arbor, a PhD in Biomathematics from North Carolina State University at Raleigh, NC, and an MD from Wayne State University in Detroit. He did his pathology residency at Johns Hopkins. He continued at Hopkins after his residency, and he eventually wound up at the Baltimore VA Hospital, where he now holds joint appointments at Hopkins and at University of Maryland Hospital. During his career, he's been a co-author on over 180 papers, most of which relate, in one way or another to computational pathology and pathology informatics. His earliest work was in the field of computational evolutionary biology, and he was one of the first people to write programs and prove theorems for automated cladistic classification. His early work also involved developing statistical techniques for analyzing medical data, and his "token swap" paper is probably his best contribution in this area. [Moore GW, Hutchins GM, Miller RE. Token swap test of significance for serial medical data bases. Am J Med. 1986 Feb;80(2):182-190.] He has been a long-time advocate for using pathology data in research. Along with Grover Hutchins, M.D., he got an NLM grant and transferred about 50,000 Hopkins autopsies into a database. These autopsy records and associated blocks were used for over 1,000 research projects at Hopkins. Bill worked on a very early image analysis program, all written in Visual Basic. Bill did most of the programming on that project. [Berman JJ, Moore GW. Image analysis software for the detection of preneoplastic and early neoplastic lesions. Cancer Lett. 1994 Mar 15;77(2-3):103-109.] In the past 20 years, Bill has concentrated on the two related fields of indexing and machine translation. He has contributed many papers to the field, and has shown the utility of MESH and UMLS as primary indexing dictionaries. 
    His barrier word method for extracting candidate terms from text (now better known as the "stop" word method) is currently a widely used technique in informatics science. Bill has been an API member since its inception.

    Dr. Moore's Curriculum Vitae:

    http://www.gwmoore.org/gwmcv.htm


    Last modified: August 23, 2008
