Jules J. Berman and G. William Moore
Workshop
Implementing an RDF Schema for Pathology Images, from the Association for Pathology Informatics.
APIII, Pittsburgh, PA
September 10, 2007
Copyright 2007 Jules J. Berman
Distributed under GNU Free Documentation license
http://www.gnu.org/licenses/fdl.txt
This document is written for people who need to annotate their photomicrographs in a manner that binds descriptive data to the image, so that:
1. Collections of photomicrographs can be searched based on their descriptive content or by their image content or both.
2. Individual images can be sent to colleagues, and the person who receives the image can extract, from the image file, the descriptive text that the sender included with the image.
3. After inserting text inside an image, the person who prepared the image can be certain that years later, after all the clinical and pathologic details associated with the image have been long forgotten, the image will still provide this information.
4. The data included in the image can be prepared in a standard form that is computer parsable and understandable to software agents that search files on the Web.
This document does not contain software applications, nor does it recommend any software applications. Instead, we provide very short scripts (usually about 5 to 20 lines in length) written in Perl or Ruby, that perform the computational tasks described in the text. These scripts are so simple and so short that if you program in another language (e.g., Python, Java, or C), you should have no trouble converting the scripts to your preferred language.
This document is written with the assumption that if you want to achieve self-reliance in pathology informatics, you must acquire some minimal programming skills.
For those readers who want to know the best ways of structuring the data they include with their image, this document provides in-depth discussion of RDF (Resource Description Framework). RDF is a simple technique for bundling all data as triples (the data object plus a piece of metadata that describes the data value plus the value assigned to the data object). This simple but powerful technique allows data triples to be shared among heterogeneous datasets and is the basis for the so-called "Semantic Web."
The document arranges techniques by level of difficulty (Levels 1 through 6). Level 1 describes the simplest method for conveying text in images (inserting a free-text description in the header of a JPEG file). Beginners can stop reading after Level 1 and resume reading at a later date, if so inclined.
Pathology images have no value unless they are annotated with information that describes the image.
Important descriptors of an image might include:
File information
Image capture information
Image format information
Specimen information
Patient information
Pathology information
Region of interest information
The API (Association for Pathology Informatics) wishes to provide anyone using pathology image data with optional methods for annotating any kind of pathology image, in any image format the user prefers.
The API was not interested in creating yet another new standard that obligates people to use a particular image format for their pathology images.
The API sponsored the Laboratory Digital Imaging Project (LDIP) to provide free and open methods for specifying image data that could be used with existing standard image formats (such as DICOM or JPEG).
From 2004 to 2007, the API sponsored LDIP, which consisted of API members and imaging software developers.
Conference calls were conducted from 2004 to 2006, and the minutes of the discussions are available:
In 2007, after much discussion, the API Council determined that there were, in existence, adequate methods for specifying image-related data. What was needed were general instructions for image annotation and a few simple software scripts that could parse, insert, extract, port, and interchange image annotations.
LDIP was dissolved, and the API Council accepted the primary goal of providing the field of pathology informatics with a document that describes available open annotation methods.
As a secondary goal, the API would provide a very short RDF Schema that would permit those who prefer RDF annotations to type their metadata under general classes and properties that have particular relevance to pathologists (more about this later).
This document is the current draft of instructions for image annotation.
Level 1. Compose a free-text description of your image, along with any other information you'd like to add (such as your name), and add it as a Comment field in the header of the image file. The Comment will not alter the binary content of the image or the visual form of the image.
When the file is copied, it will retain the header comment, and anyone receiving the image can read what you've added, using a simple Perl or Ruby script provided in the document, or using a simple extraction program prepared in any
The Dublin Core is basic information designed by librarians to provide a minimal set of data to describe the contents of an electronic document. When the file is copied, it will retain the Dublin Core metadata, and anyone receiving the image can read what you've added, using a simple Perl or Ruby program provided in the document, or using a simple extraction program prepared in any
Insert an RDF (Resource Description Framework) document into your image file.
The RDF document can be extracted, and the triples in the document can be extracted and integrated with other data.
Image file with RDF document inserted into image file header
Almost all popular image formats contain "header" sections that are not part of the actual image binary. The header sections contain information that is used by image viewing software to properly display the image. Robust imaging software applications are written with subroutines that parse through the different headers of images and extract information such as the height, width, pixel number, pixel size, pixel color, color map index, and so on. Some headers are extensible, allowing software to insert blocks of text into the header without changing the image binary.
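The extensible header can be illustrated without any imaging library. The following is a minimal sketch in Ruby, assuming a JPEG file in which comments are stored as COM segments (marker bytes FF FE followed by a two-byte length). The method names insert_jpeg_comment and jpeg_comments are hypothetical, chosen for this illustration, and the four-byte stub used in the demonstration is not a viewable image. For real work, a tested library is the safer choice.

```ruby
# Minimal sketch: write and read JPEG comment (COM) segments in pure Ruby.
# Method names are hypothetical; only the marker values come from the JPEG format.

# Return a copy of the JPEG bytes with one comment segment placed directly
# after the start-of-image marker. The rest of the file is left untouched.
def insert_jpeg_comment(jpeg, text)
  data = jpeg.b
  raise ArgumentError, "not a JPEG file" unless data.start_with?("\xFF\xD8".b)
  payload = text.b
  # The 2-byte length field counts itself, so the payload limit is 65533 bytes.
  raise ArgumentError, "comment too long" if payload.bytesize > 65_533
  segment = [0xFF, 0xFE, payload.bytesize + 2].pack("CCn") + payload
  data[0, 2] + segment + data[2..-1]
end

# Walk the header segments and collect the text of every comment segment.
def jpeg_comments(jpeg)
  data = jpeg.b
  raise ArgumentError, "not a JPEG file" unless data.start_with?("\xFF\xD8".b)
  comments = []
  pos = 2
  while pos + 4 <= data.bytesize
    prefix, marker, length = data[pos, 4].unpack("CCn")
    break unless prefix == 0xFF
    break if marker == 0xDA || marker == 0xD9 # start of scan or end of image
    comments << data[pos + 4, length - 2] if marker == 0xFE
    pos += 2 + length
  end
  comments
end

# Demonstration on a stub holding only the start and end markers.
stub = "\xFF\xD8\xFF\xD9".b
annotated = insert_jpeg_comment(stub, "Normal lung, H and E, 10x")
puts jpeg_comments(annotated)
```

To try the sketch on a real file, read the bytes with File.binread and write the result with File.binwrite, keeping a copy of the original image.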
The RDF document and the image file (for example, jpeg) can be separate documents linked by URLs.
Break up your annotative data and your image binaries into multiple documents that can be pointed to from any of the files, and that can exclude or include RDF or image binary data, as desired.
The RDF data can be distributed into multiple documents, and each RDF document may point to more than one image file.
Perl and Ruby are free, open source software and can be downloaded from multiple web sites. Linux downloads for either language are ubiquitous. Perl is distributed with most Linux operating system packages.
For Windows users, if you use Perl, get a free installation from ActiveState at:
http://www.activestate.com/
For Ruby, go to:
http://rubyforge.org/frs/?group_id=167
Most of the scripts in this document, and many other medical-related Perl and Ruby scripts, are available in Jules Berman's previously published books:
Biomedical Informatics
Here is a short script that lets you look at any JPEG image. In this script, the image that's used is leaf.jpg.
#!/usr/local/bin/perl
use Tk;
use Tk::JPEG;
my $mw = MainWindow->new();
my $file = "c\:\\ftp\\leaf\.jpg";
my $image = $mw->Photo(-file => $file);
$mw->Label('-image' => $image, -height=>500, -width=>600)->pack;
#$mw->Label(-image => $image)->pack();
MainLoop;
exit;
In Ruby, you need to install ImageMagick, Tk and RMagick if you want to view and modify images.
#!/usr/local/bin/ruby
require 'RMagick'
include Magick
leaf = ImageList.new("leaf.jpg").resize!(0.7)
leaf_copy = leaf.write("leaf.gif")
require 'tk'
root = TkRoot.new {title "view"}
TkButton.new(root) do
image TkPhotoImage.new{file "leaf.gif"}
command {exit}
pack
end
Tk.mainloop
exit
#!/usr/local/bin/ruby
#leaf3.rb
#
#This Ruby script was created by Jules J. Berman on 7/8/2007
#and is provided as a public domain document
#
#The software is provided "as is", without warranty of any kind,
#express or implied, including but not limited to the warranties
#of merchantability, fitness for a particular purpose and
#noninfringement. in no event shall the authors or copyright
#holders be liable for any claim, damages or other liability,
#whether in an action of contract, tort or otherwise, arising
#from, out of or in connection with the software or the use or
#other dealings in the software.
#
require 'RMagick'
include Magick
orig_leaf = ImageList.new("leaf.jpg").resize!(0.4)
orig_leaf.write("orig.gif")
leaf = ImageList.new("leaf.jpg").first.crop(50, 310, 300, 300).resize!(0.4)
leaf.write("new.gif")
require 'tk'
root = TkRoot.new {title "view"}
TkButton.new(root) do
image TkPhotoImage.new{file "orig.gif"}
command {exit}
pack
end
TkButton.new(root) do
image TkPhotoImage.new{file "new.gif"}
command {exit}
pack
end
Tk.mainloop
exit
Download the external module Image::MetaData::JPEG using the Perl Package Manager (if ActiveState Perl is installed on your system, simply enter ppm at your command line and follow the instructions of the package manager client).
Perl script, meta_jpg.pl, to show how metadata can be added to a jpeg file.
#!/usr/local/bin/perl
use Image::MetaData::JPEG;
my $filename = "leaf.jpg"; #comment:your filename here
my $file = new Image::MetaData::JPEG($filename);
die 'Error: ' . Image::MetaData::JPEG::Error() unless $file;
print "Description of JPEG file\n";
print $file->get_description();
print "\n\nRDF Annotations to JPEG file\n\n";
my $line = "My this is a nice image of a leaf";
$file->add_comment($line);
unlink $filename;
$file->save($filename);
#reopen the saved file and print the comments it now holds
$file = new Image::MetaData::JPEG($filename);
my @comments = $file->get_comments();
print join("",@comments);
exit;
The comment "My this is a nice image of a leaf" was added to the header of the JPEG file, leaf.jpg.
Here is a Ruby script that inserts a Comment and a Label into the JPEG header.
#!/usr/local/bin/ruby
require 'RMagick'
include Magick
walnut = ImageList.new("c\:\\ftp\\rb\\CT4192~1.JPG")
walnut.cur_image[:Label] = "hello"
walnut.cur_image[:Comment] = "<html><title>me</title></html>"
walnut.properties{|name, value| print "#{name} #{value}\n"}
walnut_copy = ImageList.new
walnut_copy = walnut.cur_image.copy
walnut_copy.write("c\:\\ftp\\rb\\out.JPG")
walnut_copy.properties{|name, value| print "#{name} #{value}\n"}
exit
Output:
Comment <html><title>me</title></html>
JPEG-Colorspace 2
JPEG-Sampling-factors 2x2,1x1,1x1
Label hello
Comment <html><title>me</title></html>
JPEG-Colorspace 2
JPEG-Sampling-factors 2x2,1x1,1x1
Label hello
Here is the sample text for a text file.
"The image is a squamous cell carcinoma of the floor of the mouth. It was taken by Jules Berman, on February 2, 2002. The microscope was an Olympus model 3453. The lens objective was 40x. The camera was a Sony model 342. The image is jpeg and has dimensions of 524 by 429 pixels. The microscope and camera were not calibrated. The specimen Baltimore Hospital Center S-3456-2001, specimen 2, block 3. The specimen was logged in 8/15/01 and processed using the standard protocol for H&E that was in place for that day. The patient is Sam Someone, medical identifier 4357. The tissue was received in formalin. The specimen shows a moderately differentiated, invasive squamous cell carcinoma. The patient has a 30 year history of oral tobacco use. The image is kept in a jpeg file named y49w3p2.jpg and kept in the pathology subdirectory of the hospital's server. Its URL is https://baltohosp.org/pathology/y49w3p2.jpg. The image file has an md_5 hash value of 84027730gjsj350489. The image has no watermark. Copyright is held by Baltimore Hospital Center, and all rights are reserved."
You can put this text into a file, named "addtext.txt" and use the jpeg_add.rb Ruby script to do the insertion.
jpeg_add.rb, inserts the plain-text file addtext.txt into the JPEG image gwmbw.jpg
#!/usr/local/bin/ruby
require 'RMagick'
include Magick
text = IO.read("addtext.txt")
orig_image = ImageList.new("gwmbw.jpg")
orig_image.cur_image[:Comment] = text
print "\nComment added, let's make a file to hold the modifications\n\n"
copy_image = ImageList.new
copy_image = orig_image.cur_image.copy
copy_image.write("c\:\\ftp\\rb\\gwmout.JPG")
copy_image.properties{|name, value| print "#{name}\n#{value}\n"}
exit
There are a few problems with simply writing free-text descriptions of your images. Although you might think you've written an adequate description of your image, the likelihood is that you have forgotten to include important information about the file.
The Dublin Core consists of about 15 data elements, selected by a group of librarians, that specify the kind of file information a librarian might use to describe a file, index the described file, and retrieve files based on included information.
There are many publicly available documents that describe the Dublin Core elements:
http://www.ietf.org/rfc/rfc2731.txt
The Dublin Core elements can be inserted into HTML documents, simple XML documents, or RDF documents. A public document explains exactly how the Dublin Core elements can be used in these file formats:
http://dublincore.org/documents/usageguide/#rdfxml
An example of a very simple Dublin Core file description in RDF format is shown below:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://www.julesberman.info/rubydisp.htm">
<dc:creator>Jules Berman</dc:creator>
<dc:title>Ruby Programming for Medicine and Biology, Jules J. Berman</dc:title>
<dc:description>Provides instructions for displaying an image in Ruby</dc:description>
<dc:date>2007-07-01</dc:date>
</rdf:Description>
</rdf:RDF>
As you can see, RDF is just a dialect of XML, and it is quite easy to read RDF files without any special instruction. In a later section, we will be explaining the logic and syntax of RDF. For now, all you need to know is:
The basic features of any file can be described using the Dublin Core common data elements. These data elements can be represented in HTML, XML or RDF.
Because HTML, XML and RDF are text files, they can be inserted into image files using the techniques described for text strings or text files (Level 1).
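As an illustration of how little is needed to produce such a description, here is a minimal sketch in Ruby that builds a Dublin Core RDF document from a hash and reports which of the fifteen elements were left out. The method names (dublin_core_rdf, missing_elements, xml_escape) are hypothetical, and the sample record simply reuses values that appear in this document. The resulting string can be saved to a file and inserted into an image header like any other text.

```ruby
# Minimal sketch: build a Dublin Core description in RDF/XML from a Ruby hash.
# Method names and the sample record are illustrative assumptions.

# The fifteen elements of the Dublin Core Metadata Element Set, version 1.1.
def dublin_core_elements
  %w[title creator subject description publisher contributor date type
     format identifier source language relation coverage rights]
end

# Escape the characters that have special meaning in XML text.
def xml_escape(text)
  text.to_s.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;").gsub('"', "&quot;")
end

# Return an RDF/XML document describing the resource identified by +about+.
def dublin_core_rdf(about, record)
  unknown = record.keys.map(&:to_s) - dublin_core_elements
  raise ArgumentError, "not Dublin Core: #{unknown.join(', ')}" unless unknown.empty?
  lines = []
  lines << '<?xml version="1.0" encoding="UTF-8"?>'
  lines << '<rdf:RDF'
  lines << 'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
  lines << 'xmlns:dc="http://purl.org/dc/elements/1.1/">'
  lines << %(<rdf:Description rdf:about="#{xml_escape(about)}">)
  record.each do |element, value|
    lines << "<dc:#{element}>#{xml_escape(value)}</dc:#{element}>"
  end
  lines << '</rdf:Description>'
  lines << '</rdf:RDF>'
  lines.join("\n") + "\n"
end

# Elements absent from a record, as a reminder of what might still be added.
def missing_elements(record)
  dublin_core_elements - record.keys.map(&:to_s)
end

record = { title: "Normal Lung", creator: "Bill Moore", date: "2006-06-28" }
puts dublin_core_rdf("http://www.the_url_here.org/ldip/ldip2103.jpg", record)
puts "Not yet supplied: #{missing_elements(record).join(', ')}"
```

Rejecting unknown keys is a design choice of this sketch; it keeps a misspelled element name from silently producing an element that no Dublin Core reader would recognize.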
It is easy to insert an RDF document into the header of a jpeg image file, and it is just as easy to extract the RDF triples. Here's how you do it:
1. Prepare your RDF document.
2. Use a script to insert the document into your jpeg header.
3. Use the new jpeg file (now with RDF comments) to display the
image or to send to colleagues. When displayed, it will look
exactly like the file before the contents of the RDF document
were added.
4. Use another script to extract the comments from the header of
the jpeg file, as needed.
The following Perl script will take the jpeg image ldip2103.jpg and add the RDF document from the first use-case to its header. Once the annotated file has been saved, the script extracts and displays the contents of the RDF document.
#!/usr/local/bin/perl
use Image::MetaData::JPEG;
my $filename = "ldip2103.jpg"; #comment:your filename here
my $file = new Image::MetaData::JPEG($filename);
die 'Error: ' . Image::MetaData::JPEG::Error() unless $file;
print "Description of JPEG file\n";
print $file->get_description();
print "\n\nRDF Annotations to JPEG file\n\n";
open (TEXT, "rdf_desc.xml")||die"cannot"; #the rdf document you'll add
#add each line of the RDF document as a comment, stopping at end of file
while (defined(my $line = <TEXT>))
  {
  $file->add_comment($line);
  }
close TEXT;
unlink $filename;
$file->save($filename);
#reopen the saved file and print the comments it now holds
$file = new Image::MetaData::JPEG($filename);
my @comments = $file->get_comments();
print join("",@comments);
exit;
This Perl script requires the freely available open source module, Image::MetaData::JPEG. You can download this module from CPAN (Comprehensive Perl Archive Network, www.cpan.org).
The last few lines extract and print the RDF document from the image.
This Perl script is functionally equivalent to the Ruby script used in Level 2 to insert a Dublin Core RDF file into a jpeg image.
What exactly is an RDF file?
The Resource Description Framework (RDF) provides a simple method for specifying information as data triples. The authors believe that much of the time and expense associated with developing and deploying data standards can be eliminated by a consistent implementation of recommended RDF data specification practices.
Necessary background subjects:
1. Meaning in informatics
2. Triples
3. Identifiers
4. Datatyping
5. Classes and Properties
6. Instantiating Classes
Necessary informatics techniques:
1. RDF syntax (specifying data as class instance-property-data triples)
2. RDF schema (formal dictionary for classes and properties)
3. XSD (to constrain data to a defined datatype)
The only implementation tools you really need are your head and a text editor such as notepad or emacs.
In informatics, assertions have meaning whenever a pair of metadata and data (the descriptor for the data and the data itself) is assigned to a specific subject.
Triples consist of: Specified subject, then metadata, then data.
Some triples found in a medical dataset
"Jules Berman" "blood glucose level" "85"
"Mary Smith" "blood glucose level" "90"
"Samuel Rice" "blood glucose level" "200"
"Jules Berman" "eye color" "brown"
"Mary Smith" "eye color" "blue"
"Samuel Rice" "eye color" "green"
Some triples found in a haberdasher's dataset
"Juan Valdez" "hat size" "8"
"Jules Berman" "hat size" "9"
"Homer Simpson" "hat size" "9"
"Homer Simpson" "hat_type" "bowler"
Triples collected from both datasets whose subject is "Jules Berman"
"Jules Berman" "blood glucose level" "85"
"Jules Berman" "eye color" "brown"
"Jules Berman" "hat size" "9"
Triples can port their meaning between different databases because they
bind described data to a specified subject. This supports data integration
of heterogeneous data and facilitates the design of software agents. A
software agent, as used here, is a program that can interrogate multiple
RDF documents on the web, initiating its own actions based on inferences
yielded from retrieved triples.
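The merging of the two datasets can be sketched in a few lines of Ruby. This is only an illustration: each triple is held as a three-element array, the method name triples_about is hypothetical, and subjects are matched by name alone, which sidesteps the question of unique identity.

```ruby
# Minimal sketch: collect triples about one subject from unrelated datasets.
# Each triple is [subject, metadata, data]; the method name is hypothetical.

medical = [
  ["Jules Berman", "blood glucose level", "85"],
  ["Mary Smith",   "blood glucose level", "90"],
  ["Samuel Rice",  "blood glucose level", "200"],
  ["Jules Berman", "eye color", "brown"],
  ["Mary Smith",   "eye color", "blue"],
  ["Samuel Rice",  "eye color", "green"]
]

haberdasher = [
  ["Juan Valdez",   "hat size", "8"],
  ["Jules Berman",  "hat size", "9"],
  ["Homer Simpson", "hat size", "9"],
  ["Homer Simpson", "hat_type", "bowler"]
]

# Return every triple, from any number of datasets, whose subject matches.
def triples_about(subject, *datasets)
  datasets.flatten(1).select { |s, _metadata, _data| s == subject }
end

triples_about("Jules Berman", medical, haberdasher).each do |triple|
  puts triple.map { |part| %("#{part}") }.join(" ")
end
```

Because every triple carries its own subject, no knowledge of either dataset's layout is needed to combine them.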
RDF (Resource Description Framework) is a syntax for writing computer-parsable
triples. For RDF to serve as a general method for describing data objects,
we need to answer the following four questions:
1. How does the triple convey the unique identity of its subject?
In the triple, "Jules Berman" "blood glucose level" "85", the
name "Jules Berman" is not unique and may apply to several different
people.
2. How do we convey the meaning of metadata terms? Perhaps one person's
definition of a metadata term is different from another person's.
For example, is "hat size" the diameter of the hat, or the distance
from ear to ear on the person who is intended to wear the hat, or a
digit selected from a pre-defined scale?
3. How can we constrain the values described by metadata to a specific
datatype? Can a person have an eye color of 8? Can a person have
an eye color of "chartreuse"?
4. How can we indicate that a unique object is a member of a class and
can be described by metadata shared by all the members of a class?
Much of the remainder of the background section will be devoted to answering these four questions.
RDF is a specialized XML syntax for creating computer-parsable files consisting of triples. The subject of the RDF triple is invoked with the rdf:about attribute. Following the subject is a metadata/data pair.
Let us create an RDF triple whose subject is the jpeg image file specified as: http://www.the_url_here.org/ldip/ldip2103.jpg. The metadata is <dc:title> and the data value is "Normal Lung".
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
</rdf:Description>
An example of three triples in proper RDF syntax is:
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
</rdf:Description>
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:creator>Bill Moore</dc:creator>
</rdf:Description>
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:date>2006-06-28</dc:date>
</rdf:Description>
RDF permits you to collapse multiple triples that apply to a single
subject. The following rdf:Description statement is equivalent to the
three prior triples:
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-28</dc:date>
</rdf:Description>
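The collapsing step is mechanical, as the following minimal Ruby sketch suggests. It assumes triples held as three-element arrays whose metadata is already a prefixed element name such as dc:title; the method names collapse_triples and xml_escape are hypothetical.

```ruby
# Minimal sketch: emit one rdf:Description per subject from a list of triples.
# Method names are hypothetical; triples are [subject, property, value].

# Escape the characters that have special meaning in XML text.
def xml_escape(text)
  text.to_s.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;").gsub('"', "&quot;")
end

# Group triples by subject and write each group as a single statement.
def collapse_triples(triples)
  triples.group_by { |subject, _property, _value| subject }.map do |subject, group|
    lines = ["<rdf:Description", %(rdf:about="#{xml_escape(subject)}">)]
    group.each do |_subject, property, value|
      lines << "<#{property}>#{xml_escape(value)}</#{property}>"
    end
    lines << "</rdf:Description>"
    lines.join("\n")
  end.join("\n")
end

image = "http://www.the_url_here.org/ldip/ldip2103.jpg"
puts collapse_triples([
  [image, "dc:title", "Normal Lung"],
  [image, "dc:creator", "Bill Moore"],
  [image, "dc:date", "2006-06-28"]
])
```

The output is a fragment; to form a complete RDF document it still needs the enclosing rdf:RDF element with its namespace declarations.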
An example of a short but well-formed RDF image specification document is:
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-28</dc:date>
</rdf:Description>
</rdf:RDF>
The first line tells you that the document is XML. The second line tells you that the XML document is an RDF resource. The third and fourth lines declare the namespaces that are referenced within the document (more about this later). Following that is the RDF statement that we have already seen.
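Going the other way, from document back to triples, can be sketched with simple pattern matching. The Ruby below is a naive illustration that assumes the flat layout used in this example (one rdf:about per statement, plain text values, no nested elements); it is not a general RDF parser, and the method name naive_triples is hypothetical.

```ruby
# Minimal sketch: pull triples out of a flat RDF/XML document with patterns.
# Works only for the simple layout shown in this document; not a real parser.

rdf = <<~XML
  <?xml version="1.0"?>
  <rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description
  rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
  <dc:title>Normal Lung</dc:title>
  <dc:creator>Bill Moore</dc:creator>
  <dc:date>2006-06-28</dc:date>
  </rdf:Description>
  </rdf:RDF>
XML

# Return an array of [subject, property, value] triples.
def naive_triples(rdf_text)
  triples = []
  statement = %r{<rdf:Description\s+rdf:about="([^"]+)"\s*>(.*?)</rdf:Description>}m
  rdf_text.scan(statement) do |subject, body|
    # Each property element holds plain text between matching tags.
    body.scan(%r{<([\w:]+)>([^<]*)</\1>}) do |property, value|
      triples << [subject, property, value.strip]
    end
  end
  triples
end

naive_triples(rdf).each { |triple| puts triple.join(" | ") }
```

Properties that point to another resource through an rdf:resource attribute are ignored by this sketch, since they have no text content.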
Though we distinguish text files from binary files, all files are actually binary files. Sequential bytes of 8 bits are converted to ascii equivalents, and if the ascii equivalents are alphanumerics, we call the file a text file. If the ascii values of 8-bit sequential file chunks are non-alphanumeric, we call the files binary files.
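The distinction can be made concrete with a rough test. The following Ruby sketch is a heuristic only: it treats a byte string as text when every byte is a printable ASCII character, a tab, or a line ending, which is a slightly broader rule than "alphanumeric". The method name text_like? is hypothetical.

```ruby
# Minimal sketch: a rough heuristic for telling text bytes from binary bytes.
# The rule (printable ASCII plus tab and line endings) is an assumption.

def text_like?(bytes)
  bytes.b.each_byte.all? do |byte|
    byte == 9 || byte == 10 || byte == 13 || (32..126).cover?(byte)
  end
end

puts text_like?("<dc:title>Normal Lung</dc:title>\n") # an RDF line
puts text_like?("\xFF\xD8\xFF\xE0".b)                  # first bytes of a typical JPEG file
```

The first call prints true and the second prints false, because the JPEG marker bytes fall outside the printable ASCII range.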
Standard format image files are always binary files. Because RDF syntax is a pure ascii file format, image binaries cannot be directly pasted into an RDF document. However, binary files can be interconverted to and from ascii format, using a simple software utility. This simple Perl script, using the MIME::Base64::Perl module, is all that is necessary to interconvert binary files to Base64.
#!/usr/bin/perl
use MIME::Base64::Perl;
open (TEXT,"c\:\\ftp\\ldip\\ldip2103\.jpg")||die"cannot"; #path to sample file
binmode TEXT;
$/ = undef;
$string = <TEXT>;
close TEXT;
$encoded = encode_base64($string);
open(OUT,">2103.txt");
print OUT $encoded;
close OUT;
#$decoded = decode_base64($encoded);
#open(OUT,">binary.jpg");
#binmode OUT;
#print OUT $decoded;
exit;
Here is an example of the same RDF document shown in the prior use-case. The only difference is that in addition to pointing to the URL that identifies the image, this document contains the image file converted to base64 ascii.
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Image"/>
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-21</dc:date>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Infinity_3_00169344"/>
<ldip:imageType>photomicrograph</ldip:imageType>
<ldip:stain>H and E</ldip:stain>
<ldip:tissue>lung</ldip:tissue>
<ldip:organism>human</ldip:organism>
<ldip:objective>10x</ldip:objective>
<ldip:diagnosis>normal</ldip:diagnosis>
<ldip:base64File>
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8
UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDB
gNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyM
jIyMjIyMjL/wAARCAYACAADASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAEC
.
.
.
D1phhQnOMVSxL6nRHEaWZz8tpvGBWdPpMpbO08967AW8YOcU4xIRgqKr6yl0N449x0S
0OFOkv6Yq1a6W6HJHFdb9miznbStAhB4AzVfWl0KqY3nsjjNSuEtosE9BXC6lrf7whT
39a7jxLp0xjbah6V5leaZchySh6104PC+196R1zrU6cFZo/9k=
</ldip:base64File>
</rdf:Description>
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Olympus_BH2_224085">
<rdf:type
rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Microscope</ldip:instrumentType>
<ldip:make>Olympus</ldip:make>
<ldip:model>BH2</ldip:model>
<ldip:serialNumber>224085</ldip:serialNumber>
</rdf:Description>
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Infinity_3_00169344">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Camera</ldip:instrumentType>
<ldip:make>Infinity</ldip:make>
<ldip:model>3</ldip:model>
<ldip:serialNumber>00169344</ldip:serialNumber>
</rdf:Description>
</rdf:RDF>
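Recovering the image from such a document is the reverse operation. The following minimal Ruby sketch assumes the complete Base64 text is present (the listing above is abbreviated with dots, so it would not decode as printed). The method name embedded_image_bytes is hypothetical, the element name ldip:base64File is taken from the example, and the sample bytes are a stand-in rather than a real image.

```ruby
# Minimal sketch: extract and decode an image embedded as Base64 in RDF text.
# The method name is hypothetical; the sample bytes are not a real image.

# Return the decoded bytes of the first ldip:base64File element, or nil.
def embedded_image_bytes(rdf_text)
  match = rdf_text.match(%r{<ldip:base64File>(.*?)</ldip:base64File>}m)
  return nil unless match
  # Remove line breaks and indentation before strict Base64 decoding.
  match[1].gsub(/\s+/, "").unpack1("m0")
end

image_bytes = "\xFF\xD8 stand-in for real image bytes \xFF\xD9".b
rdf = "<ldip:base64File>\n" + [image_bytes].pack("m") + "</ldip:base64File>\n"

restored = embedded_image_bytes(rdf)
puts restored == image_bytes
```

The decoded bytes can be written to disk with File.binwrite to reproduce the original image file.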
Ruby script, base64.rb, encodes strings in Base64 notation
#!/usr/local/bin/ruby
require 'base64'
text = "The secret of life"
encoded = Base64.encode64(text)
puts("This is the encoded text ... #{encoded}")
decoded = Base64.decode64(encoded)
puts("This is the decoded text ... #{decoded}")
exit
Output of Ruby script base64.rb
C:\ftp\rb>ruby base64.rb
This is the encoded text ... VGhlIHNlY3JldCBvZiBsaWZl
This is the decoded text ... The secret of life
An entire file can be converted to Base64 using class File's read method.
#!/usr/local/bin/ruby
require 'base64'
image_file = File.open("walnut.jpg").binmode
image_file_string = image_file.read
b64 = Base64.encode64(image_file_string)
puts b64.slice(0,300)
regular = Base64.decode64(b64)
out_file = File.open("walnew.jpg", "w").binmode
out_file.write(regular)
exit
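Base64 enlarges the file sizes of images: every 3 bytes of binary data become 4 ASCII characters, so the encoded text is about one third larger than the original file. If you want to include Base64 versions of large image binaries, then expect to share very large documents.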
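The size penalty is easy to measure. This short Ruby sketch encodes 300 arbitrary bytes using Ruby's built-in pack directive for Base64 without line breaks ("m0") and compares the sizes; the method name base64_length is hypothetical.

```ruby
# Minimal sketch: measure how much Base64 encoding enlarges binary data.

# Number of Base64 characters needed for a given number of bytes
# (4 characters for every 3 bytes, rounded up to a whole group).
def base64_length(byte_count)
  4 * ((byte_count + 2) / 3)
end

data = (0...300).map { |i| i % 256 }.pack("C*") # 300 arbitrary bytes
encoded = [data].pack("m0")                     # Base64 without line breaks

puts "original: #{data.bytesize} bytes"    # original: 300 bytes
puts "Base64:   #{encoded.bytesize} bytes" # Base64:   400 bytes
puts format("ratio:    %.2f", encoded.bytesize.to_f / data.bytesize) # ratio:    1.33
```

Encoders that break the output into lines, as many do by default, add one more character per line on top of this.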
You can prepare an RDF document describing your image, and then simply link to the image from your RDF document, using a pointer.
The following example provides pathologists with the only method that most will ever use.
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Image"/>
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-21</dc:date>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Infinity_3_00169344"/>
<ldip:imageType>photomicrograph</ldip:imageType>
<ldip:stain>H and E</ldip:stain>
<ldip:tissue>lung</ldip:tissue>
<ldip:organism>human</ldip:organism>
<ldip:objective>10x</ldip:objective>
<ldip:diagnosis>normal</ldip:diagnosis>
</rdf:Description>
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Olympus_BH2_224085">
<rdf:type
rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Microscope</ldip:instrumentType>
<ldip:make>Olympus</ldip:make>
<ldip:model>BH2</ldip:model>
<ldip:serialNumber>224085</ldip:serialNumber>
</rdf:Description>
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Infinity_3_00169344">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Camera</ldip:instrumentType>
<ldip:make>Infinity</ldip:make>
<ldip:model>3</ldip:model>
<ldip:serialNumber>00169344</ldip:serialNumber>
</rdf:Description>
</rdf:RDF>
There are times when the structural data (i.e., the non-binary data) for a data object's specification must be distributed in multiple files.
This is one of the most important reasons for using a data specification, rather than a data standard. The specification permits you to create a dynamic object, composed of informational pieces that can be updated, so that the content and value of a specified image object increases over time. A data standard obligates you to compose a static data file. If the standard data file contains information that cannot be shared (due to human subject risks, or to intellectual property encumbrances), the standard file usually cannot be distributed. A specification may consist of multiple files connected by URL pointers. If component files contain privileged information, the data object's specification can be distributed with access restricted to specified files.
RDF File 1 (Describes an image)
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Image"/>
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-21</dc:date>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/>
<ldip:linkedFile rdf:resource="http://www.the_url_here.org/file2"/>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Infinity_3_00169344"/>
<ldip:linkedFile rdf:resource="http://www.the_url_here.org/file3"/>
<ldip:imageType>photomicrograph</ldip:imageType>
<ldip:stain>H and E</ldip:stain>
<ldip:tissue>lung</ldip:tissue>
<ldip:organism>human</ldip:organism>
<ldip:objective>10x</ldip:objective>
<ldip:diagnosis>normal</ldip:diagnosis>
</rdf:Description>
</rdf:RDF>
RDF File 2 (Describes a microscope referenced by File 1)
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Olympus_BH2_224085">
<rdf:type
rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Microscope</ldip:instrumentType>
<ldip:make>Olympus</ldip:make>
<ldip:model>BH2</ldip:model>
<ldip:serialNumber>224085</ldip:serialNumber>
</rdf:Description>
</rdf:RDF>
RDF File 3 (Describes a camera referenced by File 1)
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Infinity_3_00169344">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Camera</ldip:instrumentType>
<ldip:make>Infinity</ldip:make>
<ldip:model>3</ldip:model>
<ldip:serialNumber>00169344</ldip:serialNumber>
</rdf:Description>
</rdf:RDF>
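How a script might assemble one description from the three files above, while honoring access restrictions, can be sketched as follows. This Ruby illustration makes several assumptions: each file is represented in memory as a list of triples transcribed by hand from the listings, pointers are followed through the ldip:instrument_id values rather than by fetching URLs, and the names sample_files, gather, "file1", "file2" and "file3" are hypothetical.

```ruby
# Minimal sketch: gather an object's triples from several files by following
# pointers, skipping files the reader is not permitted to see.

# Triples transcribed by hand from the three RDF files (a subset only).
def sample_files
  image  = "http://www.the_url_here.org/ldip/ldip2103.jpg"
  scope  = "urn:www.the_url_here.org:ldip:Olympus_BH2_224085"
  camera = "urn:www.the_url_here.org:ldip:Infinity_3_00169344"
  {
    "file1" => [[image, "dc:title", "Normal Lung"],
                [image, "ldip:instrument_id", scope],
                [image, "ldip:instrument_id", camera]],
    "file2" => [[scope, "ldip:instrumentType", "Microscope"],
                [scope, "ldip:serialNumber", "224085"]],
    "file3" => [[camera, "ldip:instrumentType", "Camera"],
                [camera, "ldip:serialNumber", "00169344"]]
  }
end

# Collect every triple reachable from +subject+. An object that is itself a
# subject in a readable file is followed; +seen+ guards against cycles.
def gather(subject, files, restricted = [], seen = [])
  return [] if seen.include?(subject)
  seen << subject
  readable = files.reject { |name, _triples| restricted.include?(name) }.values.flatten(1)
  own = readable.select { |s, _property, _object| s == subject }
  own + own.flat_map { |_s, _property, object| gather(object, files, restricted, seen) }
end

image = "http://www.the_url_here.org/ldip/ldip2103.jpg"
puts "All files readable: #{gather(image, sample_files).length} triples"
puts "Camera file restricted: #{gather(image, sample_files, ['file3']).length} triples"
```

With all three files readable the sketch returns seven triples; with the camera file withheld it returns five, and the image description still names the camera by its identifier.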
Suppose, as in the previous example, that the triples relevant to your image lie in multiple RDF files. Suppose, further, that your image is just one of a set of images that were all obtained during the same session, and that all the images apply to the same patient. This situation is routine for radiologic images, wherein dozens of images transecting the brain, or the abdomen, may form part of the same report.
How might you annotate this complex set of data files and image binaries? Simply include an RDF assertion for each image.
RDF File 1 (Describes 2 images)
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Image"/>
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-21</dc:date>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/>
<ldip:linkedFile rdf:resource="http://www.the_url_here.org/file2"/>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Infinity_3_00169344"/>
<ldip:linkedFile rdf:resource="http://www.the_url_here.org/file3"/>
<ldip:imageType>photomicrograph</ldip:imageType>
<ldip:stain>H and E</ldip:stain>
<ldip:tissue>lung</ldip:tissue>
<ldip:organism>human</ldip:organism>
<ldip:objective>10x</ldip:objective>
<ldip:diagnosis>normal</ldip:diagnosis>
</rdf:Description>
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2201.jpg">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Image"/>
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-21</dc:date>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/>
<ldip:linkedFile rdf:resource="http://www.the_url_here.org/file2"/>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Infinity_3_00169344"/>
<ldip:linkedFile rdf:resource="http://www.the_url_here.org/file3"/>
<ldip:imageType>photomicrograph</ldip:imageType>
<ldip:stain>H and E</ldip:stain>
<ldip:tissue>lung</ldip:tissue>
<ldip:organism>human</ldip:organism>
<ldip:objective>2.5x</ldip:objective>
<ldip:diagnosis>squamous cell carcinoma</ldip:diagnosis>
</rdf:Description>
</rdf:RDF>
RDF File 2 (Describes a microscope referenced by File 1)
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Olympus_BH2_224085">
<rdf:type
rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Microscope</ldip:instrumentType>
<ldip:make>Olympus</ldip:make>
<ldip:model>BH2</ldip:model>
<ldip:serialNumber>224085</ldip:serialNumber>
</rdf:Description>
</rdf:RDF>
RDF File 3 (Describes a camera referenced by File 1)
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Infinity_3_00169344">
<rdf:type
rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Camera</ldip:instrumentType>
<ldip:make>Infinity</ldip:make>
<ldip:model>3</ldip:model>
<ldip:serialNumber>00169344</ldip:serialNumber>
</rdf:Description>
</rdf:RDF>
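Once the triples are spread over several images and files, it helps to see how little code is needed to pull them back out. The following is a minimal sketch, not a definitive implementation. It uses REXML, the XML library distributed with Ruby, and a shortened copy of RDF File 1 held in a string. The method name triples_from_rdf is our own invention, and the sketch assumes that every subject is introduced by an rdf:Description element carrying an rdf:about attribute.

```ruby
require 'rexml/document'

# A shortened copy of RDF File 1. In practice the text would be read from
# a file, for example with IO.read.
RDF_TEXT = <<~XML
  <?xml version="1.0"?>
  <rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
      <dc:title>Normal Lung</dc:title>
      <ldip:instrument_id
        rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/>
      <ldip:objective>10x</ldip:objective>
    </rdf:Description>
    <rdf:Description rdf:about="http://www.the_url_here.org/ldip/ldip2201.jpg">
      <dc:title>Normal Lung</dc:title>
      <ldip:objective>2.5x</ldip:objective>
    </rdf:Description>
  </rdf:RDF>
XML

# Returns an array of [subject, property, value] triples (hypothetical helper).
def triples_from_rdf(text)
  doc = REXML::Document.new(text)
  triples = []
  doc.root.each_element do |description|
    next unless description.expanded_name == 'rdf:Description'
    subject = description.attributes['rdf:about']
    description.each_element do |property|
      # A property holds either a pointer to another resource or literal text.
      value = property.attributes['rdf:resource'] || property.text.to_s.strip
      triples << [subject, property.expanded_name, value]
    end
  end
  triples
end

triples_from_rdf(RDF_TEXT).each do |subject, property, value|
  puts "#{subject}  #{property}  #{value}"
end
```

Because each triple carries its own subject, the triples for the two images can be separated, merged with triples from File 2 and File 3, or searched, without regard to the file in which they were found.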
The same approach can be used to reference multiple images within a single image file. The rdf:about attribute can point to any file block or file element that contains the part of the image that is the intended subject of the triples (e.g., region of interest, thumbnail, tile, waveform, color map, and so on).
In the field of biomedicine, DICOM (Digital Imaging and Communications in Medicine) has special significance, because DICOM is the format currently used for almost all radiologic images. DICOM was developed over several decades, to become a multifunctional standard of enormous complexity that uses a model for data storage that is unlike any other image file format. The DICOM standard includes a set of protocols for transferring information through networks, and for communicating between different radiologic devices or different parts of a single device (e.g., between CT machine and CT workstation). It creates a unique syntax and semantics for information and produces a file that contains a large amount of descriptive information (including patient information and diagnostic information), and a binary representation of one or more images.
One of the best descriptions of the DICOM file format is available at:
http://www.dclunie.com/medical-image-faq/html/part1.html
http://www.dclunie.com/medical-image-faq/html/part2.html
DICOM Working Group 26, led by Dr. Bruce Beckwith, is attempting to expand DICOM to include metadata appropriate for pathology images. To the best of our knowledge, there are now a handful of pathology departments that format pathology photomicrographs as DICOM images. The VA (U.S. Veterans Administration) seems to have adopted DICOM as their standard format for all medical images. To the best of our knowledge, nobody using DICOM is currently inserting a complete set of pathology descriptors into their DICOM headers, but this may change over the course of time.
For the purposes of this document, all we need to know is that the header information in a DICOM file can be extracted (with a short Ruby script), and that the binary portion of a DICOM file can be converted to a JPEG file. The header data from the DICOM file can be re-inserted into the header of a JPEG file, or it can be included in a special XML file that "points" back to the original DICOM file or to the JPEG file that contains the image representation.
The nice thing about electronic standards, which sets them apart from standards created for physical objects, is that they are interconvertible.
Though there are hundreds of standard image formats, robust image software can do a pretty good job at converting any format into any other format.
We like to work with JPEG images, because they are the most popular web images. Our philosophy is that if you like to work in DICOM, you should work in DICOM. If you like to work in JPEG, you should work in JPEG. There are simple ways of interconverting the two formats.
You can find many DICOM images at:
ftp://ftp.erl.wustl.edu/pub/dicom/images/version3/RSNA95/
These images can be used as practice files for some of the scripts that will follow.
DICOM has a header that can be extracted from the DICOM image file, and which contains textual descriptive information about the image.
Here is a sample DICOM image header.
0002,0000,File Meta Elements Group Len=122
0002,0001,File Meta Info Version=1
0002,0002,Media Storage SOP Class UID=1.2.840.10008.5.1.4.1.1.7.
0002,0003,Media Storage SOP Inst UID=9999.20070123103417.100.10
0002,0010,Transfer Syntax UID=1.2.840.10008.1.2.1.
0002,0012,Implementation Class UID=960051513
0008,0008,Image Type=
0008,0012,Instance Creation Date=20070123
0008,0013,Instance Creation Time=103417
0008,0016,SOP Class UID=1.2.840.10008.5.1.4.1.1.7.
0008,0018,SOP Instance UID=9999.20070123103417.100.10
0008,0020,Study Date=20070123
0008,0030,Study Time=103417
0008,0050,Accession Number=
0008,0060,Modality=OT
0008,0064,Conversion Type=WSD.
0008,0090,Referring Physician's Name=
0010,0010,Patient's Name=gwmbw.jpg.
0010,0020,Patient ID=0.
0010,0030,Patient Date of Birth=
0010,0040,Patient Sex=M
0010,1010,Patient Age=0.
0020,000D,Study Instance UID=9999.20070123103417.100.20
0020,000E,Series Instance UID=9999.20070123103417.100.30
0020,0010,Study ID= 0
0020,0011,Series Number=0
0020,0013,Image Number=0
0020,0020,Patient Orientation=
0028,0002,Samples Per Pixel=1
0028,0004,Photometric Interpretation=MONOCHROME2
0028,0010,Rows=1536
0028,0011,Columns=2048
0028,0100,Bits Allocated=8
0028,0101,Bits Stored=8
0028,0102,High Bit=7
0028,0103,Pixel Representation=0
7FE0,0010,Pixel Data=3145728
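Each line of the header listing has the same shape: a group number, an element number, a description, an equals sign, and a value. A few lines of Ruby can turn such a listing into a lookup table. This is a minimal sketch that assumes the header text always follows the layout of the sample above. The method name parse_dicom_header is hypothetical, and only a handful of the sample lines are repeated here.

```ruby
# Parses lines of the form "gggg,eeee,Description=value" into a hash keyed
# by the "gggg,eeee" tag. Lines that do not fit the layout are skipped.
def parse_dicom_header(text)
  header = {}
  text.each_line do |line|
    match = line.strip.match(/\A(\h{4},\h{4}),([^=]*)=(.*)\z/)
    next unless match
    tag, name, value = match.captures
    header[tag] = { name: name, value: value.strip }
  end
  header
end

sample = <<~TEXT
  0008,0020,Study Date=20070123
  0008,0060,Modality=OT
  0010,0010,Patient's Name=gwmbw.jpg.
  0020,0010,Study ID= 0
  0028,0010,Rows=1536
  0028,0011,Columns=2048
TEXT

header = parse_dicom_header(sample)
puts header["0008,0060"][:value]
# Rows times columns gives the number of pixels in the image.
puts header["0028,0010"][:value].to_i * header["0028,0011"][:value].to_i
```

For this 8-bit, single-sample image, the product of Rows and Columns (1536 x 2048 = 3145728) agrees with the Pixel Data length reported on the last line of the header.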
ezDICOM (Copyright 2002, Wolfgang Krug and Chris Rorden) is a medical viewer for DICOM images. It is distributed along with dcm2jpg, a command-line application that can convert DICOM images into standard bitmap file formats (JPEG, PNG, and BMP). In addition, it will convert a DICOM image to its textual header information.
The sample command-line is:
dcm2jpg -f p -o C:\TEMP -z 1.5 C:\DICOM\input1.dcm C:\input2.dcm
This command-line may contain information for brightness, contrast, format of output, output target directory, input files, etc.
If you simply invoke:
dcm2jpg <dicom filename and path if file is not in current directory>
the DICOM file will be converted to a .jpg file in the directory that holds the DICOM file.
The .exe file is:
dcm2jpg.exe 218,112 bytes - converts to jpeg by default
If you wish, you can simply rename the .exe file to change the default conversion behavior.
dcm2bmp.exe 218,112 bytes - converts to bmp by default
dcm2png.exe 218,112 bytes - converts to png by default
dcm2txt.exe 218,112 bytes - converts to text header by default
Any JPEG file can be converted to a DICOM file with jpeg2dcm. This free software from CharruaSoft can be downloaded from:
http://www.charruasoft.com/downen.htm
It is a simple exe file (jpeg2dcm.exe 511,488 bytes) that can operate from a command line.
It will accept an input file, such as gems.jpg, and convert it to a DICOM file:
gems.jpg 132,135 bytes
gems.dcm 1,980,712 bytes
We will use:
dcm2jpg.exe 218,112 bytes - converts to jpeg by default
dcm2txt.exe 218,112 bytes - converts to text header by default
Ruby can simply call a command-line application from within a script, using the exec method.
Ruby script, dcm2jpg.rb, converts a DICOM file into a JPEG file:
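#!/usr/local/bin/ruby
# exec replaces the running Ruby process with dcm2jpg.exe, so any
# statement placed after this line would never run.
exec("dcm2jpg.exe c:\\ftp\\picker\\dicom\\CT4174~1")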
Screen output of dcm2jpg.rb script
c:\ftp>ruby dcm2jpg.rb
1
Creating: c:\ftp\picker\dicom\CT4174~1.jpg
1
Ruby script, dcmsplit.rb, converts a DICOM file into a JPEG, and a text file for the header
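#!/usr/local/bin/ruby
# system runs each command and then returns control to the script,
# so both conversions are carried out in turn.
system("dcm2jpg.exe c:\\ftp\\picker\\dicom\\CT4174~1")
system("dcm2txt.exe c:\\ftp\\picker\\dicom\\CT4174~1")
exit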
Screen output of dcmsplit.rb script
c:\ftp>ruby dcmsplit.rb
1
Creating: c:\ftp\picker\dicom\CT4174~1.jpg
1
1
Creating: c:\ftp\picker\dicom\CT4174~1.txt
1
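The two scripts differ in one respect worth noting. Ruby's exec hands the process over to the external program and never comes back, while system waits for the program to finish and reports how it went. The sketch below wraps system in a small helper (run_tool is a hypothetical name) so that a script can tell whether a conversion succeeded. So that the sketch runs on a machine without dcm2jpg, the Ruby interpreter itself stands in for the external tool.

```ruby
require 'rbconfig'

# Runs an external command and reports whether it succeeded.
# Kernel#system returns true for a zero exit status, false for a nonzero
# status, and nil if the command could not be started at all.
def run_tool(*command)
  ok = system(*command)
  warn "command failed: #{command.join(' ')}" unless ok
  ok == true
end

# Stand-in commands. In practice the arguments would be "dcm2jpg.exe"
# followed by the path of the DICOM file.
ruby = RbConfig.ruby
run_tool(ruby, "-e", "exit 0")   # succeeds
run_tool(ruby, "-e", "exit 3")   # fails, but the script carries on
puts "still running after both commands"
```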
The just-created JPEG file lacks the clinical information contained in the DICOM file, but this information is now available to us in our newly created text file. In the next section, we will see how any textual information can be inserted back into a JPEG file, using RMagick.
Ruby script, jpeg_add.rb, inserts textual information into a JPEG image:
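#!/usr/local/bin/ruby
# Newer RMagick releases are loaded with require 'rmagick' (lowercase).
require 'RMagick'
include Magick
# Read the DICOM header text that was extracted earlier.
text = IO.read("gwmbw.txt")
orig_image = ImageList.new("gwmbw.jpg")
# Store the header text in the image's Comment property.
orig_image.cur_image[:Comment] = text
print "\nComment added, let's make a file to hold the modifications\n\n"
# Copy the annotated image and write it to a new JPEG file.
copy_image = orig_image.cur_image.copy
copy_image.write("c:\\ftp\\rb\\gwmout.JPG")
# List every property stored in the new image, including the Comment.
copy_image.properties{|name, value| print "#{name}\n#{value}\n"}
exit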
Output of jpeg_add.rb script:
C:\ftp\rb>ruby jpeg_add.rb
Comment added, let's make a file to hold the modifications
Comment
0002,0000,File Meta Elements Group Len=122
0002,0001,File Meta Info Version=1
0002,0002,Media Storage SOP Class UID=1.2.840.10008.5.1.4.1.1.7.
0002,0003,Media Storage SOP Inst UID=9999.20070123103417.100.10
0002,0010,Transfer Syntax UID=1.2.840.10008.1.2.1.
0002,0012,Implementation Class UID=960051513
0008,0008,Image Type=
0008,0012,Instance Creation Date=20070123
0008,0013,Instance Creation Time=103417
0008,0016,SOP Class UID=1.2.840.10008.5.1.4.1.1.7.
0008,0018,SOP Instance UID=9999.20070123103417.100.10
0008,0020,Study Date=20070123
0008,0030,Study Time=103417
0008,0050,Accession Number=
0008,0060,Modality=OT
0008,0064,Conversion Type=WSD.
0008,0090,Referring Physician's Name=
0010,0010,Patient's Name=gwmbw.jpg.
0010,0020,Patient ID=0.
0010,0030,Patient Date of Birth=
0010,0040,Patient Sex=M
0010,1010,Patient Age=0.
0020,000D,Study Instance UID=9999.20070123103417.100.20
0020,000E,Series Instance UID=9999.20070123103417.100.30
0020,0010,Study ID= 0
0020,0011,Series Number=0
0020,0013,Image Number=0
0020,0020,Patient Orientation=
0028,0002,Samples Per Pixel=1
0028,0004,Photometric Interpretation=MONOCHROME2
0028,0010,Rows=1536
0028,0011,Columns=2048
0028,0100,Bits Allocated=8
0028,0101,Bits Stored=8
0028,0102,High Bit=7
0028,0103,Pixel Representation=0
7FE0,0010,Pixel Data=3145728
JPEG-Colorspace
1
JPEG-Sampling-factors
1x1
The output JPEG image file (gwmout.jpg) contains the binary representation of the same image in gwmbw.dcm and in gwmbw.jpg. In addition, it contains a Comment field consisting of the contents of file gwmbw.txt, the textual representation of the original DICOM header. We can extract the Comment field from the JPEG header whenever we wish, using RMagick's properties method.
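It can be reassuring to confirm that the annotation really lives inside the image file and does not depend on any one library. The sketch below reads comment text from a JPEG using plain Ruby. It rests on two assumptions: that the Comment property was written as a standard JPEG COM segment (marker bytes FF FE), and that the comment sits before the compressed image data. The method name jpeg_comments is hypothetical.

```ruby
# Returns the text of every COM (comment) segment found before the image data.
def jpeg_comments(data)
  data = data.b   # work on raw bytes
  raise ArgumentError, "not a JPEG file" unless data.byteslice(0, 2) == "\xFF\xD8".b
  comments = []
  pos = 2
  while pos + 4 <= data.bytesize && data.getbyte(pos) == 0xFF
    marker = data.getbyte(pos + 1)
    break if marker == 0xDA || marker == 0xD9   # start of scan, or end of image
    # The two-byte segment length is big-endian and counts itself.
    length = data.byteslice(pos + 2, 2).unpack1("n")
    comments << data.byteslice(pos + 4, length - 2) if marker == 0xFE
    pos += 2 + length
  end
  comments
end

# A tiny hand-built byte string stands in for a real file here. With a real
# image, one would pass File.binread("gwmout.jpg") instead.
fake_jpeg = "\xFF\xD8".b + "\xFF\xFE".b + [2 + 11].pack("n") + "Modality=OT" + "\xFF\xD9".b
puts jpeg_comments(fake_jpeg).first
```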
DICOM is a complex and highly specialized standard developed over several decades. It is intended to negotiate a variety of network transactions in addition to encapsulating an image binary. DICOM was created at a time prior to the development of XML and prior to the development of the web transfer protocol, http. It depends on a variety of data control devices that some may find anachronistic (e.g., prescribed byte locations, DICOM-specific data transfer and negotiation protocols). Mastering the technical aspects of DICOM may require months of analyzing the many volumes of DICOM technical reports. With few exceptions, DICOM is something that serves radiologists through proprietary software written for imaging devices that may cost millions of dollars in outlay plus maintenance.
Though DICOM has proven itself to be an excellent standard for radiology images, it is not currently a popular standard for pathology images.
Pathologists should have the option of using DICOM, or using some other image format. Regardless of the image format chosen, it is important to annotate images with textual data in a computer parsable form, that can be ported between alternate image formats.
Pathologists typically use off-the-shelf cameras and software, and the raw pixel data is usually provided in a popular image format, such as jpeg, png, bmp, or tiff. By specifying images in RDF, pathologists have a simple, inexpensive and easily implemented method for annotating images with useful descriptive data, and for binding these annotations to the images.
Because all the data in a specification is fully described, it is very easy to write software that will port a specification into a data standard. To have full compatibility between a data specification and data standard, the specification must contain all of the required data elements of the data standard. This usually involves:
1. Studying the data standard, and writing an RDF Schema (or supplementing
an existing RDF Schema) with classes and properties appropriate for
the data standard.
2. Preparing the RDF document with knowledge of the classes and
properties that are required by the data standard. RDF specifications,
unlike data standards, do not place requirements on the classes and
properties that need to be included in the document. The requirements
can be provided as an external document, or as triples added to a
schema, or as a feature of the [textual] CDE document. At present,
no single method has emerged as the preferred strategy.
3. Writing software that will parse the RDF document in which the data
object is specified (trivial), and transform the triples into the
document that conforms to the structure and content of the data
standard.
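The second step above, knowing which elements the data standard requires, lends itself to a simple check before any porting is attempted. The following is a sketch under stated assumptions. The list REQUIRED_PROPERTIES is invented for illustration and does not come from any actual data standard, and the triples are held as plain Ruby arrays.

```ruby
# Hypothetical list of the elements that a data standard requires for each image.
REQUIRED_PROPERTIES = %w[dc:title dc:creator dc:date ldip:stain ldip:tissue]

# Triples are held as [subject, property, value] arrays. Returns the
# required properties that have not been asserted for the given subject.
def missing_properties(triples, subject, required = REQUIRED_PROPERTIES)
  present = triples.select { |s, _, _| s == subject }.map { |_, property, _| property }
  required - present
end

image = "http://www.the_url_here.org/ldip/ldip2103.jpg"
triples = [
  [image, "dc:title", "Normal Lung"],
  [image, "dc:creator", "Bill Moore"],
  [image, "ldip:stain", "H and E"]
]

missing = missing_properties(triples, image)
puts missing.empty? ? "ready to port" : "missing: #{missing.join(', ')}"
```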
The term "common data element" is a misnomer. Most people, when they first encounter this term, assume that a data element holds data. Actually, a common data element is the metadata that describes a datum in a data record. In XML parlance, a CDE is an XML tag. The thing that makes a descriptor "common" is its common usage by a scientific community. The way that CDEs are intended to work is that a scientific community creates a list of CDEs that describe the kinds of data that their members use. The members of the community will all use the same CDEs (XML tags) to annotate their data files.
One of the most calamitous errors in any CDE project is to assume that everyone who reads a metadata tag will automatically understand its intended meaning. ISO-11179 is a standard way of defining CDEs with the necessary information for understanding their meanings.
The most popular CDEs in existence are the Dublin Core CDEs. These are a set of file descriptors that were prepared by a committee of librarians, who convened in Dublin, Ohio. The Dublin Core includes basic information about electronic documents, such as: the title of the document, the name of the person who created the file, the date that the file was created, the date that the file was modified, and a short description of the file. These are basically the items that a library software agent would need to retrieve, if it were building an index of internet documents. The world of informatics would be a better place if everyone who created an HTML, XML or RDF file would remember to include the Dublin Core CDEs.
These Dublin Core CDEs have been prepared to comply with the ISO-11179 specification. Every effort to create a data specification for a knowledge domain should begin by collecting the common data elements for the domain, and annotating each element with the ISO-11179 CDE descriptors. The ISO-11179 descriptors for two of the Dublin Core CDEs (Title and Creator) are shown below.
From: http://dublincore.org/documents/1999/07/02/dces/
Title
Identifier: Title
Version: 1.1
Registration Authority: Dublin Core Metadata Initiative
Language: en
Obligation: Optional
Datatype: Character String
Maximum Occurrence: Unlimited
Definition: A name given to the resource.
Comment: Typically, a Title will be a name by which the resource is
formally known.
Creator
Identifier: Creator
Version: 1.1
Registration Authority: Dublin Core Metadata Initiative
Language: en
Obligation: Optional
Datatype: Character String
Maximum Occurrence: Unlimited
Definition: An entity primarily responsible for making the content
of the resource.
Comment: Examples of a Creator include a person, an organisation,
or a service. Typically, the name of a Creator should be used to
indicate the entity.
An RDF schema is a dictionary file that lists the classes and the properties that pertain to RDF documents. In fact, the official long name for RDF Schema is the RDF Vocabulary Description Language. The classes of an RDF schema are formal definitions for the kinds of subjects that are found in the RDF triples. The properties of an RDF schema are the types of metadata descriptors for the data of the RDF triples. Elements in RDF schemas may be subclasses of elements in other RDF schemas.
Things to remember about RDF Schemas
1. RDF Schemas are written in XML, but are completely unlike XML Schemas.
2. RDF Schemas contain declarations of the classes and properties
that are used in RDF documents.
3. RDF Schemas, like all RDF documents, have no pre-determined order or
composition, and consist of statements expressed as triples. The
subject of every triple in an RDF Schema will be either Class or
Property.
4. Every RDF Schema can be thought of as a child of the W3C RDF Schema
that defines the "super" classes Resource, Class and Property. All
RDF Schemas will refer to the document that defines RDF syntax and
to the document that defines the top-level schema, and therefore
will begin something like this:
<?xml version='1.0' encoding='ISO-8859-1'?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
5. A typical RDF document consists of triples <subject, metadata, value>.
RDF documents usually reference one or more RDF Schemas to instantiate
the subject of each triple (i.e., to tell us which class in an RDF
schema the subject is an instance of) and to provide subjects with
class-appropriate metadata.
6. Documents composed of triples whose components are defined by RDF
Schemas can be used to completely specify data objects within a
knowledge domain.
7. By completely specifying data objects in a knowledge domain, RDF
specifications achieve the functionality of data standards.
In a later section, we will demonstrate the simple method for instantiating classes and for associating class instances with appropriate properties and with datatyped values.
Here is a CDE template that satisfies the minimal recommendations for a CDE under ISO-11179, and that provides all the information for a class or a property in an RDF schema.
The general format for class elements is:
Class Label (in standard XML tag format, uppercase first letter):
Registration Authority: Association for Pathology Informatics
Obligation: optional
Maximum Occurrence: Unlimited
Comment (must include detailed definition):
subClassOf:
Contributor (your consistent first-name last-name):
Date of your contribution:
The general format for property elements is:
Property Label (in standard XML tag format, lowercase first letter):
Registration Authority: Association for Pathology Informatics
Obligation: optional
Maximum Occurrence: Unlimited
Datatype (can be "Literal", a list, or a regex; default is "Literal"):
Comment (must include detailed definition):
Domain (comma-delimited if multiple):
Range (usually "Literal"):
Contributor (your consistent first-name last-name):
Date of your contribution:
The category "Obligation" should contain the word "required" or the word "optional". For the kinds of specifications discussed in this manuscript, including any CDE would always be optional. Similarly, for "Maximum Occurrence", we would think any CDE could occur an unlimited number of times in an RDF document.
Here is an example of an ISO-11179-compliant CDE written for a class named "Reagent".
Class Label: Reagent
versionInfo (required): 0.1
Registration Authority: Association for Pathology Informatics
Obligation: optional
Maximum Occurrence: Unlimited
Datatype: Literal
comment: Histologic_stain_reagents, tissue_fixation_reagents, and other chemicals employed in the laboratory. For example: distilled_water, ethanol, hematoxylin, aluminum_sulphate
subClassOf: Class
Contributor: Bill Moore
Date_of_contribution: 05-30-2006
Once we have the CDE, it is a straightforward job to create an RDF Schema Class element:
<rdfs:Class rdf:about="http://www.the_url_here.org/ldip_sch#Reagent">
<rdfs:label>Reagent</rdfs:label>
<rdfs:comment>
Histologic_stain_reagents, tissue_fixation_reagents, and other chemicals employed in the laboratory. For example: distilled_water, ethanol, hematoxylin, aluminum_sulphate
</rdfs:comment>
<rdfs:subClassOf rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>
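Because every field of the schema entry comes straight from the CDE, the conversion can be automated. The short script below is a sketch, not a finished tool. It assumes the CDE has already been read into a Ruby hash, and the key names (:label, :comment, :sub_class_of) and the method names are our own. It also assumes that a parent of "Class" means the top-level class of the W3C schema, and that any other parent is a class in our own schema.

```ruby
SCHEMA_URL  = "http://www.the_url_here.org/ldip_sch#"
RDFS_URL    = "http://www.w3.org/2000/01/rdf-schema#"
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }

# Replaces the characters that have a special meaning in XML.
def xml_escape(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Builds the schema entry for one class from its CDE fields.
# Class is defined in the RDF Schema namespace, so the element is
# written as rdfs:Class.
def class_element(cde)
  parent = cde[:sub_class_of] == "Class" ? RDFS_URL + "Class" : SCHEMA_URL + cde[:sub_class_of]
  <<~XML
    <rdfs:Class rdf:about="#{xml_escape(SCHEMA_URL + cde[:label])}">
      <rdfs:label>#{xml_escape(cde[:label])}</rdfs:label>
      <rdfs:comment>#{xml_escape(cde[:comment])}</rdfs:comment>
      <rdfs:subClassOf rdf:resource="#{xml_escape(parent)}"/>
    </rdfs:Class>
  XML
end

reagent = {
  label: "Reagent",
  comment: "Histologic_stain_reagents, tissue_fixation_reagents, " \
           "and other chemicals employed in the laboratory.",
  sub_class_of: "Class"
}
puts class_element(reagent)
```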
A few points bear explanation. All the information needed to generate an RDF class or property should be contained in an ISO-11179-compliant list of CDEs. An RDF Schema may consist purely of classes and properties. Classes are defined in RDF exclusively through their ancestral relation. Basically, to build a class in RDF Schema, you announce that the element is a Class, you provide a unique locator (such as a URL) or a unique universally understood descriptor (more on this later) for the element, a description of the element, and the name of the father class of the element.
That's all there is to do for classes. You don't need to list the subclasses of the class because the subclasses will list the class as their father in their own schema entry. You don't need to list the properties of the class because the properties will list the classes whose data they describe.
Do classes in RDF schema remind you of anything? The classes in an RDF schema comprise an ontology. An ontology is a list of classes and their relationships. You can think of an ontology as the "classy" half of an RDF schema. Classes become most useful when they have Properties (the other half of the RDF schema).
A property is a metadata element that is used to describe the data assigned to one or more class objects. Here is the CDE for a property named "dateTime".
Identifier: ldip:dateTime
Property Label: dateTime
versionInfo: 0.1
Registration Authority: Association for Pathology Informatics
Language: en
Obligation: optional
Maximum Occurrence: Unlimited
Datatype: /[\+\-]{1}[\d]{8}\.[\d]{6}Z[\+\-]{1}[\d]{4}/
comment: ISO 8601 format of date and time.
domain: Event
range: http://www.the_url_here.org/ldip_xsd.xsd#iso8601
Contributor: Bill Moore
Date_of_contribution: 05-30-2006
An RDF Schema declaration for the dateTime property might be:
<rdf:Property rdf:about="http://www.the_url_here.org/ldip_sch#dateTime">
<rdfs:label>dateTime</rdfs:label>
<rdfs:comment>
The date and time at which an event occurs, in ISO8601 format
</rdfs:comment>
<rdfs:domain rdf:resource="http://www.the_url_here.org/ldip_sch#Event"/>
<rdfs:range rdf:resource="http://www.the_url_here.org/ldip_xsd.xsd#iso8601"/>
</rdf:Property>
Let's look at the dateTime property. The opening tag announces that we are declaring a Property, and its rdf:about attribute gives the name of the Property (dateTime) and its URL (the current RDF schema document). The next line provides the label by which we will refer to the Property. This might come in handy if we had different names for the property in different languages. The comment includes a definition for the element.
The next line specifies the domain (class) for the property. The domain of a property is the class for which the property may be used. In this case, the domain class for which the dateTime property applies is Event. This makes sense. If you need to describe an event, you would want to include the time that the event occurred.
A property for a class may serve as a property for all of the subclasses of the class (because all the subclass instances are members of the ancestor class). Every Property must have a domain (a class or classes for which the Property may be used) and a Range (a specified kind of data that is described by the Property). A property may have multiple classes in its domain. When a property has multiple classes in its domain, all the classes in the domain share the same property (obviously). This achieves some of the functionality of multi-class inheritance, without actually needing to instantiate multiple classes under a single object. This is a subtle concept, and does not need to be mastered at this time. Suffice it to say that as you create your own RDF Schemas, you should try to design your Properties to apply to multiple classes, and you should try to instantiate objects under a single class.
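The rule that a property applies to a class, and to every subclass of that class, can be expressed in a few lines. The sketch below is only an illustration. Event, Reagent and dateTime come from the examples in this document, while the Surgery and Biopsy classes and the method names are invented for the purpose.

```ruby
# Hypothetical fragment of a schema: each class names its parent class,
# and each property names the classes in its domain.
SUBCLASS_OF = {
  "Event"   => "Class",
  "Surgery" => "Event",
  "Biopsy"  => "Surgery",
  "Reagent" => "Class"
}
DOMAIN = { "dateTime" => ["Event"] }

# Returns the class itself followed by its chain of ancestors.
def ancestors_of(klass)
  chain = [klass]
  chain << SUBCLASS_OF[chain.last] while SUBCLASS_OF.key?(chain.last)
  chain
end

# A property may be used with a class if the class, or any ancestor of the
# class, appears in the property's domain.
def property_applies?(property, klass)
  (DOMAIN.fetch(property, []) & ancestors_of(klass)).any?
end

puts property_applies?("dateTime", "Biopsy")    # Biopsy descends from Event
puts property_applies?("dateTime", "Reagent")   # Reagent does not
```

Adding a second class to the domain of dateTime would make the property available to that class and its descendants as well, without changing how any object is instantiated.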
Let us continue to examine the dateTime property. Recall that a triple consists of a subject, followed by metadata (the property element), followed by the data. The property element describes the data. The range of the property element tells us what kind of data is described. In RDF schemas, the range of a property is often "Literal", an element defined in the RDF syntax document that refers to any character string. You can see immediately that describing the range of a property as a character string does little to constrain or structure the expected values for a data element.
In the dateTime property, we want the range of the property to be data that conforms to the ISO8601 date/time format. How do we convey the datatype of the date/time element in RDF?
RDF has no intrinsic datatyping facility. So for our property range, we provide a resource (URL) that specifies an element in an .xsd file, that defines the datatype we need.
The range for the dateTime property is a resource:
<rdfs:range rdf:resource="http://www.the_url_here.org/ldip_xsd.xsd#iso8601"/>
The resource points us to an xsd file on the web, and to a particular element within the xsd file, labeled iso8601. Let's pretend we visit the file and extract the iso8601 element. We might find the following:
<xsd:simpleType name="iso8601">
<!-- values of a date_time must contain -->
<!-- a plus or minus sign occurring zero or one -->
<!-- times followed by 8 digits -->
<!-- followed by a period -->
<!-- followed by 6 digits -->
<!-- followed by the letter Z, T or a space -->
<!-- followed by a plus or minus sign occurring -->
<!-- exactly once, followed by 4 digits -->
<xsd:restriction base="xsd:string">
<xsd:pattern value="[\+\-]?[\d]{8}\.[\d]{6}[ZT ][\+\-]{1}[\d]{4}"/>
</xsd:restriction>
</xsd:simpleType>
The essence of the datatype is found in the pattern value line:
<xsd:pattern value="[\+\-]?[\d]{8}\.[\d]{6}[ZT ][\+\-]{1}[\d]{4}"/>
This line uses a Regular Expression (RegEx), that provides a pattern to which the element must conform. RegEx is beyond the scope of this manuscript.
Don't be intimidated by .xsd and Regex rules. For most purposes, simply describing the range of a property with the RDF syntax-defined element, "Literal" will be all that you need.
The .xsd element definition imposes a datatype pattern on the value of the data described by the property. A validating software agent would check an RDF document to determine if the data described by a property conform to the range of the property element, as defined by the element in the .xsd resource for the property range.
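To give a feel for what such a validating agent does, here is a minimal Ruby sketch that tests values against the iso8601 pattern. It is not a real XSD validator. The pattern has simply been copied by hand into a Ruby regular expression, and the method name conforms? is hypothetical. An XSD pattern must match the whole value, so the Ruby version is anchored at both ends.

```ruby
# Hand translation of the iso8601 pattern: optional sign, 8 digits, a period,
# 6 digits, one of Z, T or space, a sign, and 4 digits. \A and \z force the
# pattern to cover the entire value.
ISO8601_PATTERN = /\A[+-]?\d{8}\.\d{6}[ZT ][+-]\d{4}\z/

def conforms?(value)
  ISO8601_PATTERN.match?(value)
end

["20070123.103417Z+0000", "2007-01-23 10:34"].each do |value|
  puts "#{value}: #{conforms?(value) ? 'conforms' : 'does not conform'}"
end
```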
If we want to employ this trick, we'll need to prepare an XSD file that contains elements for all the datatypes referred to by the property ranges included in our RDF Schema.
XSD datatype files are very easy to prepare. Basically, you just list your datatypes and provide descriptors. The following generic file contains samples of the kinds of datatypes you will probably need (patterns, inclusive values, unions, and enumerations).
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:simpleType name="sp_pattern">
<!-- values of an accession number must contain -->
<!-- the letters sp followed by a hyphen followed -->
<!-- by two digits followed by a hyphen -->
<!-- followed by one or more digits -->
<xsd:restriction base="xsd:string">
<xsd:pattern value="sp\-[0-9]{2}\-[0-9]+"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="adult_age">
<!-- an adult is at least 18 years old -->
<xsd:restriction base="xsd:positiveInteger">
<xsd:minInclusive value="18"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="serialNumber">
<!-- may be either integers or mixed alphanumeric strings -->
<xsd:union>
<xsd:simpleType>
<xsd:restriction base="xsd:integer"/>
</xsd:simpleType>
<xsd:simpleType>
<xsd:restriction base="xsd:string"/>
</xsd:simpleType>
</xsd:union>
</xsd:simpleType>
<xsd:simpleType name="EnumerationObjectives">
<!-- must be one of the listed objective magnifications -->
<xsd:restriction base="xsd:string">
<xsd:enumeration value="2.5x"/>
<xsd:enumeration value="6.3x"/>
<xsd:enumeration value="20x"/>
<xsd:enumeration value="40x"/>
<xsd:enumeration value="100x"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:schema>
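The datatypes in this file are simple enough that their rules can be restated directly in Ruby, which is a handy way to test sample values before committing to a datatype. The sketch below restates three of them by hand. It does not read the XSD file, and the names DATATYPE_CHECKS and valid_value? are our own. The serialNumber union is left out because a union of integer and string accepts any character string.

```ruby
OBJECTIVES = %w[2.5x 6.3x 20x 40x 100x]

# One check per datatype, each taking the value as a string.
DATATYPE_CHECKS = {
  # the letters sp, a hyphen, two digits, a hyphen, one or more digits
  "sp_pattern"            => ->(v) { v.match?(/\Asp-[0-9]{2}-[0-9]+\z/) },
  # a positive integer that is at least 18
  "adult_age"             => ->(v) { v.match?(/\A\+?[0-9]+\z/) && v.to_i >= 18 },
  # one of the listed objective magnifications
  "EnumerationObjectives" => ->(v) { OBJECTIVES.include?(v) }
}

def valid_value?(datatype, value)
  check = DATATYPE_CHECKS.fetch(datatype) { raise ArgumentError, "unknown datatype #{datatype}" }
  check.call(value.to_s) ? true : false
end

puts valid_value?("sp_pattern", "sp-06-12345")
puts valid_value?("adult_age", "17")
puts valid_value?("EnumerationObjectives", "10x")
```

Note that the 10x objective recorded for the first image in RDF File 1 is not among the values listed in EnumerationObjectives, so a validating agent working from this sample file would reject it. An enumeration is only as useful as its list is complete.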
The most difficult step in building any schema is determining whether a candidate element is a Class or a Property. Generalizations do not hold for all cases. For example, Classes tend to be nouns, while Properties (that describe data) tend to be adjectives. However, a Property can be a noun (e.g. Time) if its role is to describe a data value (4:00 PM EST). Furthermore, we sometimes assign active processes to classes (e.g. birth, death), and we cannot assume that classes are always static objects.
There is a strong tendency to assign subclass status to things that are not examples of their ancestral class. For instance, if Person is a class, someone may think that Leg is a subclass of Person (because a Leg is in a class of things that are parts of a Person). No! Leg is never a subclass of Person because a Leg is not a Person. A subclass of Person must be composed of types of Persons. So, Patient is a subclass of Person, and Pathologist is a subclass of Person, because they are both examples of Persons and because there are instances of Patients and instances of Pathologists. Remember, a class is a construct whose chief job is to provide specified instances.
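Programmers may find it helpful to compare this rule with class inheritance in Ruby, where the same "is a kind of" test applies. The sketch below is only an analogy. Ruby classes are not RDF classes, and the class names simply repeat the examples in the text.

```ruby
class Person; end

# Patients and Pathologists are kinds of Persons, so they are subclasses.
class Patient < Person; end
class Pathologist < Person; end

# A Leg is a part of a Person, not a kind of Person, so it stands alone.
class Leg; end

puts Patient.ancestors.include?(Person)   # true: every Patient is a Person
puts Leg.ancestors.include?(Person)       # false: a Leg is not a Person
puts Pathologist.new.is_a?(Person)        # true: instances inherit membership
```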
How about Friend? Is Friend a subclass of Person? Yes and no. Friend can be a subclass of person if you want to organize Persons based on whether they are Friends or not-Friends. However, if you think that being a friend i