Jules J. Berman and G. William Moore
Workshop
Implementing an RDF Schema for Pathology Images, from the Association for Pathology Informatics.
APIII, Pittsburgh, PA
September 10, 2007
Copyright 2007 Jules J. Berman
Distributed under GNU Free Documentation license
http://www.gnu.org/licenses/fdl.txt
This document is written for people who need to annotate their photomicrographs in a manner that binds descriptive data to the image, so that:
1. Collections of photomicrographs can be searched based on their descriptive content or by their image content or both.
2. Individual images can be sent to colleagues, and the person who receives the image can extract from the image file the descriptive text that the sender included with the image.
3. After inserting text inside an image, the person who prepared the image can be certain that years later, after all the clinical and pathologic details associated with the image have been long forgotten, the image will still provide this information.
4. The data included in the image can be prepared in a standard form that is computer parsable and understandable to software agents that search files on the Web.
This document does not contain software applications, nor does it recommend any software applications. Instead, we provide very short scripts (usually about 5 to 20 lines in length) written in Perl or Ruby, that perform the computational tasks described in the text. These scripts are so simple and so short that if you program in another language (e.g., Python, Java, or C), you should have no trouble converting the scripts to your preferred language.
This document is written with the assumption that if you want to achieve self-reliance in pathology informatics, you must acquire some minimal programming skills.
For those readers who want to know the best ways of structuring the data they include with their image, this document provides in-depth discussion of RDF (Resource Description Framework). RDF is a simple technique for bundling all data as triples (the data object plus a piece of metadata that describes the data value plus the value assigned to the data object). This simple but powerful technique allows data triples to be shared among heterogeneous datasets and is the basis for the so-called "Semantic Web."
The document arranges techniques by level of difficulty (Levels 1 through 6). Level 1 describes the simplest method for conveying text in images (inserting a free-text description in the header of a JPEG file). Beginners can stop reading after Level 1 and resume reading at a later date, if so inclined.
Pathology images have no value unless they are annotated with information that describes the image.
Important descriptors of an image might include:
File information
Image capture information
Image format information
Specimen information
Patient information
Pathology information
Region of interest information
The API (Association for Pathology Informatics) wishes to provide anyone using pathology image data with optional methods for annotating any kind of pathology image, in any image format the user prefers.
The API was not interested in creating yet another new standard that obligates people to use a particular image format for their pathology images.
The API sponsored the Laboratory Digital Imaging Project (LDIP) to provide free and open methods for specifying image data that could be used with existing standard image formats (such as DICOM or JPEG).
From 2004-2007, the API sponsored LDIP, the Laboratory Digital Imaging Project, which consisted of API members and imaging software developers.
Conference calls were conducted from 2004 to 2006, and the minutes of the discussions are available:
In 2007, after much discussion, the API Council determined that there were, in existence, adequate methods for specifying image-related data. What was needed were general instructions for image annotation and a few simple software scripts that could parse, insert, extract, port, and interchange image annotations.
LDIP was dissolved, and the API Council accepted the primary goal of providing the field of pathology informatics with a document that describes available open annotation methods.
As a secondary goal, the API would provide a very short RDF Schema that would permit those who prefer RDF annotations to type their metadata under general classes and properties that have particular relevance to pathologists (more about this later).
This document is the current draft of instructions for image annotation.
Level 1. Simply composing a free-text description of your image and any other information you'd like to add, such as your name, and adding the information as a Comment field in the header of the image file. The Comment will not alter the binary content of the image or the visual form of the image.
When the file is copied, it will retain the header comment, and anyone receiving the image can read what you've added, using a simple Perl or Ruby script provided in the document, or using a simple extraction program prepared in any
The Dublin Core is basic information designed by librarians to provide a minimal set of data to describe the contents of an electronic document. When the file is copied, it will retain the Dublin Core metadata, and anyone receiving the image can read what you've added, using a simple Perl or Ruby program provided in the document, or using a simple extraction program prepared in any
Insert an RDF (Resource Description Framework) document into your image file.
The RDF document can be extracted, and the triples in the document can be extracted and integrated with other data.
Image file with RDF document inserted into image file header
Almost all popular image formats contain "header" sections that are not part of the actual image binary. The header sections contain information that is used by image viewing software to properly display the image. Robust imaging software applications are written with subroutines that parse through the different headers of images and extract information such as the height, width, pixel number, pixel size, pixel color, color map index, and so on. Some headers are extensible, allowing software to insert blocks of text into the header without changing the image binary.
The RDF document and the image file (for example, jpeg) can be separate documents linked by URLs.
Break up your annotative data and your image binaries into multiple documents that can be pointed to from any of the files, and that can include or exclude RDF or image binary data, as desired.
The RDF data can be distributed into multiple documents, and each RDF document may point to more than one image file.
Perl and Ruby are free, open source software and can be downloaded from multiple web sites. Linux downloads for either language are ubiquitous. Perl is distributed with most Linux operating system packages.
For Windows users, if you use Perl, get a free installation from ActiveState at:
http://www.activestate.com/
For Ruby, go to:
http://rubyforge.org/frs/?group_id=167
Most of the scripts in this document, and many other medical-related Perl and Ruby scripts, are available in Jules Berman's previously published books:
Biomedical Informatics
Here is a short script that lets you look at any JPEG image. In this script, the image that's used is leaf.jpg.
#!/usr/local/bin/perl
use Tk;
use Tk::JPEG;
my $mw = MainWindow->new();
my $file = "c\:\\ftp\\leaf\.jpg";
my $image = $mw->Photo(-file => $file);
$mw->Label('-image' => $image, -height=>500, -width=>600)->pack;
#$mw->Label(-image => $image)->pack();
MainLoop;
exit;
In Ruby, you need to install ImageMagick, Tk and Rmagick if you want to view and modify images.
#!/usr/local/bin/ruby
require 'RMagick'
include Magick
leaf = ImageList.new("leaf.jpg").resize!(0.7)
leaf_copy = leaf.write("leaf.gif")
require 'tk'
root = TkRoot.new {title "view"}
TkButton.new(root) do
image TkPhotoImage.new{file "leaf.gif"}
command {exit}
pack
end
Tk.mainloop
exit
#!/usr/local/bin/ruby
#leaf3.rb
#
#This Ruby script was created by Jules J. Berman on 7/8/2007
#and is provided as a public domain document
#
#The software is provided "as is", without warranty of any kind,
#express or implied, including but not limited to the warranties
#of merchantability, fitness for a particular purpose and
#noninfringement. in no event shall the authors or copyright
#holders be liable for any claim, damages or other liability,
#whether in an action of contract, tort or otherwise, arising
#from, out of or in connection with the software or the use or
#other dealings in the software.
#
require 'RMagick'
include Magick
orig_leaf = ImageList.new("leaf.jpg").resize!(0.4)
orig_leaf.write("orig.gif")
leaf = ImageList.new("leaf.jpg").first.crop(50, 310, 300, 300).resize!(0.4)
leaf.write("new.gif")
require 'tk'
root = TkRoot.new {title "view"}
TkButton.new(root) do
image TkPhotoImage.new{file "orig.gif"}
command {exit}
pack
end
TkButton.new(root) do
image TkPhotoImage.new{file "new.gif"}
command {exit}
pack
end
Tk.mainloop
exit
Download the external module Image::MetaData::JPEG with the Perl Package Manager (if ActiveState Perl is installed on your system, simply enter ppm at the command line and follow the instructions in the package manager client).
Perl script, meta_jpg.pl, to show how metadata can be added to a jpeg file.
#!/usr/local/bin/perl
use Image::MetaData::JPEG;
my $filename = "leaf.jpg"; #comment:your filename here
my $file = new Image::MetaData::JPEG($filename);
die 'Error: ' . Image::MetaData::JPEG::Error() unless $file;
print "Description of JPEG file\n";
print $file->get_description();
print "\n\nRDF Annotations to JPEG file\n\n";
my $line = "My this is a nice image of a leaf";
$file->add_comment($line);
unlink $filename;
$file->save($filename);
#reopen the saved file to confirm that the comment was stored
$file = new Image::MetaData::JPEG($filename);
my @comments = $file->get_comments();
print join("",@comments);
exit;
The comment "My this is a nice image of a leaf" was added to the header of the JPEG file, leaf.jpg.
Here is a Ruby script that inserts a Comment and a Label into the JPEG header.
#!/usr/local/bin/ruby
require 'RMagick'
include Magick
walnut = ImageList.new("c\:\\ftp\\rb\\CT4192~1.JPG")
walnut.cur_image[:Label] = "hello"
walnut.cur_image[:Comment] = "<html><title>me</title></html>"
walnut.properties{|name, value| print "#{name} #{value}\n"}
walnut_copy = ImageList.new
walnut_copy = walnut.cur_image.copy
walnut_copy.write("c\:\\ftp\\rb\\out.JPG")
walnut_copy.properties{|name, value| print "#{name} #{value}\n"}
exit
Output:
Comment <html><title>me</title></html>
JPEG-Colorspace 2
JPEG-Sampling-factors 2x2,1x1,1x1
Label hello
Comment <html><title>me</title></html>
JPEG-Colorspace 2
JPEG-Sampling-factors 2x2,1x1,1x1
Label hello
Here is the sample text for a text file.
"The image is a squamous cell carcinoma of the floor of the mouth. It was taken by Jules Berman, on February 2, 2002. The microscope was an Olympus model 3453. The lens objective was 40x. The camera was a Sony model 342. The image is jpeg and has dimensions of 524 by 429 pixels. The microscope and camera were not calibrated. The specimen is Baltimore Hospital Center S-3456-2001, specimen 2, block 3. The specimen was logged in 8/15/01 and processed using the standard protocol for H&E that was in place for that day. The patient is Sam Someone, medical identifier 4357. The tissue was received in formalin. The specimen shows a moderately differentiated, invasive squamous cell carcinoma. The patient has a 30 year history of oral tobacco use. The image is kept in a jpeg file named y49w3p2.jpg and kept in the pathology subdirectory of the hospital's server. Its URL is https://baltohosp.org/pathology/y49w3p2.jpg. The image file has an MD5 hash value of 84027730gjsj350489. The image has no watermark. Copyright is held by Baltimore Hospital Center, and all rights are reserved."
You can put this text into a file, named "addtext.txt" and use the jpeg_add.rb Ruby script to do the insertion.
jpeg_add.rb, inserts the plain-text file addtext.txt into the JPEG image gwmbw.jpg
#!/usr/local/bin/ruby
require 'RMagick'
include Magick
text = IO.read("addtext.txt")
orig_image = ImageList.new("gwmbw.jpg")
orig_image.cur_image[:Comment] = text
print "\nComment added, let's make a file to hold the modifications\n\n"
copy_image = ImageList.new
copy_image = orig_image.cur_image.copy
copy_image.write("c\:\\ftp\\rb\\gwmout.JPG")
copy_image.properties{|name, value| print "#{name}\n#{value}\n"}
exit
There are a few problems with simply writing free-text descriptions of your images. Although you might think you've written an adequate description of your image, the likelihood is that you have forgotten to include important information about the file.
The Dublin Core consists of about 15 data elements, selected by a group of librarians, that specify the kind of file information a librarian might use to describe a file, index the described file, and retrieve files based on included information.
There are many publicly available documents that describe the Dublin Core elements:
http://www.ietf.org/rfc/rfc2731.txt
The Dublin Core elements can be inserted into HTML documents, simple XML documents, or RDF documents. A public document explains exactly how the Dublin Core elements can be used in these file formats:
http://dublincore.org/documents/usageguide/#rdfxml
An example of a very simple Dublin Core file description in RDF format is shown below:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://www.julesberman.info/rubydisp.htm">
<dc:creator>Jules Berman</dc:creator>
<dc:title>Ruby Programming for Medicine and Biology, Jules J. Berman</dc:title>
<dc:description>Provides instructions for displaying an image in Ruby</dc:description>
<dc:date>2007-07-01</dc:date>
</rdf:Description>
</rdf:RDF>
As you can see, RDF is just a dialect of XML, and it is quite easy to read RDF files without any special instruction. In a later section, we will be explaining the logic and syntax of RDF. For now, all you need to know is:
The basic features of any file can be described using the Dublin Core common data elements. These data elements can be represented in HTML, XML or RDF.
Because HTML, XML and RDF are text files, they can be inserted into image files using the techniques described for text strings or text files (Level 1).
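Because a free-text note tends to leave things out, it can help to check a description against the Dublin Core element names before inserting it. The following Ruby sketch lists the fifteen elements of the Dublin Core Metadata Element Set and reports which ones a description has not filled in. The method names and the hash layout are our own illustration, not part of any Dublin Core tool.

```ruby
# The fifteen element names of the Dublin Core Metadata Element Set.
DC_ELEMENTS = %w[
  title creator subject description publisher contributor date type
  format identifier source language relation coverage rights
].freeze

# Return the Dublin Core elements that a description has not filled in.
# `description` is a hypothetical Hash of element name => value.
def missing_dc_elements(description)
  supplied = description.reject { |_, v| v.to_s.strip.empty? }.keys.map(&:to_s)
  DC_ELEMENTS - supplied
end

# Return keys that are not Dublin Core element names (often typos).
def unknown_dc_elements(description)
  description.keys.map(&:to_s) - DC_ELEMENTS
end

# Values taken from the Dublin Core example shown earlier in this section.
leaf = {
  "creator"     => "Jules Berman",
  "title"       => "Ruby Programming for Medicine and Biology, Jules J. Berman",
  "description" => "Provides instructions for displaying an image in Ruby",
  "date"        => "2007-07-01"
}
puts "Missing: #{missing_dc_elements(leaf).join(', ')}"
puts "Unknown: #{unknown_dc_elements(leaf).join(', ')}"
```

None of the elements is mandatory, so a "missing" report is only a reminder, not an error.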
It is easy to insert an RDF document into the header of a jpeg image file, and it is just as easy to extract the RDF triples. Here's how you do it:
1. Prepare your RDF document.
2. Use a script to insert the document into your jpeg header.
3. Use the new jpeg file (now with RDF comments) to display the
image or to send to colleagues. When displayed, it will look
exactly like the file before the contents of the RDF document
were added.
4. Use another script to extract the comments from the header of
the jpeg file, as needed.
The following Perl script will take the jpeg image ldip2103.jpg and add the RDF document from the first use-case to its header. Once the annotated image file has been saved, the script extracts and displays the contents of the RDF document.
#!/usr/local/bin/perl
use Image::MetaData::JPEG;
my $filename = "ldip2103.jpg"; #comment:your filename here
my $file = new Image::MetaData::JPEG($filename);
die 'Error: ' . Image::MetaData::JPEG::Error() unless $file;
print "Description of JPEG file\n";
print $file->get_description();
print "\n\nRDF Annotations to JPEG file\n\n";
open (TEXT, "rdf_desc.xml")||die"cannot"; #the rdf document you'll add
#add each line of the RDF document as a comment; stop at end of file
while (defined(my $line = <TEXT>))
{
  $file->add_comment($line);
}
close TEXT;
unlink $filename;
$file->save($filename);
#reopen the saved file and print the comments it now holds
$file = new Image::MetaData::JPEG($filename);
my @comments = $file->get_comments();
print join("",@comments);
exit;
This Perl script requires the freely available open source module, Image::MetaData::JPEG. You can download this module from CPAN (Comprehensive Perl Archive Network, www.cpan.org).
The last few lines extract and print the RDF file from the image.
This Perl script is functionally equivalent to the Ruby script used in Level 2 to insert a Dublin Core RDF file into a jpeg image.
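For readers curious about what such insertion and extraction scripts do to the file, here is a rough pure-Ruby sketch. It assumes the standard JPEG segment layout: a two-byte start marker (0xFFD8), followed by header segments that each begin with 0xFF, a marker byte, and a two-byte big-endian length that counts itself. Comments live in segments whose marker byte is 0xFE. The helper names each_segment, add_comment, and get_comments are our own and are not the API of Image::MetaData::JPEG or RMagick; the sketch ignores rarer cases such as fill bytes between segments.

```ruby
SOI = "\xFF\xD8".b   # start-of-image marker
COM = 0xFE           # marker byte of a comment segment
SOS = 0xDA           # start-of-scan: compressed image data follows

# Yield [marker_byte, offset, total_segment_bytes] for each header segment.
def each_segment(jpeg)
  raise ArgumentError, "not a JPEG" unless jpeg.byteslice(0, 2) == SOI
  pos = 2
  while pos + 4 <= jpeg.bytesize
    raise ArgumentError, "bad marker at #{pos}" unless jpeg.getbyte(pos) == 0xFF
    marker = jpeg.getbyte(pos + 1)
    length = jpeg.byteslice(pos + 2, 2).unpack1("n") # includes the 2 length bytes
    yield marker, pos, length + 2
    break if marker == SOS
    pos += length + 2
  end
end

# Return a copy of the JPEG bytes with one comment segment added after any
# leading APPn segments. The image data itself is left untouched.
def add_comment(jpeg, text)
  jpeg = jpeg.b
  text = text.b
  raise ArgumentError, "comment too long" if text.bytesize > 65_533
  insert_at = 2
  each_segment(jpeg) do |marker, pos, total|
    break unless (0xE0..0xEF).cover?(marker) # stop at first non-APPn segment
    insert_at = pos + total
  end
  segment = [0xFF, COM, text.bytesize + 2].pack("CCn") + text
  jpeg.byteslice(0, insert_at) + segment +
    jpeg.byteslice(insert_at, jpeg.bytesize - insert_at)
end

# Return the payload of every comment segment found in the header.
def get_comments(jpeg)
  jpeg = jpeg.b
  comments = []
  each_segment(jpeg) do |marker, pos, total|
    comments << jpeg.byteslice(pos + 4, total - 4) if marker == COM
  end
  comments
end

# Typical use (file names are placeholders):
#   bytes = File.binread("ldip2103.jpg")
#   File.binwrite("annotated.jpg", add_comment(bytes, File.read("rdf_desc.xml")))
#   puts get_comments(File.binread("annotated.jpg"))
```

A single comment segment holds at most 65,533 bytes, so a long RDF document has to be split across several comments, which is one reason the Perl script above adds the document line by line.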
What exactly is an RDF file?
The Resource Description Framework (RDF) provides a simple method for specifying information as data triples. The authors believe that much of the time and expense associated with developing and deploying data standards can be eliminated by a consistent implementation of recommended RDF data specification practices.
Necessary background subjects:
1. Meaning in informatics
2. Triples
3. Identifiers
4. Datatyping
5. Classes and Properties
6. Instantiating Classes
Necessary informatics techniques:
1. RDF syntax (specifying data as class instance-property-data triples)
2. RDF schema (formal dictionary for classes and properties)
3. XSD (to constrain data to a defined datatype)
The only implementation tools you really need are your head and a text editor such as notepad or emacs.
In informatics, assertions have meaning whenever a pair of metadata and data (the descriptor for the data and the data itself) is assigned to a specific subject.
Triples consist of: Specified subject, then metadata, then data.
Some triples found in a medical dataset
"Jules Berman" "blood glucose level" "85"
"Mary Smith" "blood glucose level" "90"
"Samuel Rice" "blood glucose level" "200"
"Jules Berman" "eye color" "brown"
"Mary Smith" "eye color" "blue"
"Samuel Rice" "eye color" "green"
Some triples found in a haberdasher's dataset
"Juan Valdez" "hat size" "8"
"Jules Berman" "hat size" "9"
"Homer Simpson" "hat size" "9"
"Homer Simpson" "hat_type" "bowler"
Triples collected from both datasets whose subject is "Jules Berman"
"Jules Berman" "blood glucose level" "85"
"Jules Berman" "eye color" "brown"
"Jules Berman" "hat size" "9"
Triples can port their meaning between different databases because they
bind described data to a specified subject. This supports data integration
of heterogeneous data and facilitates the design of software agents. A
software agent, as used here, is a program that can interrogate multiple
RDF documents on the web, initiating its own actions based on inferences
yielded from retrieved triples.
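The merge described above can be sketched in a few lines of Ruby. Each triple is held as a three-element array; the method name triples_about is our own, and the data values are copied from the two example datasets.

```ruby
# Each triple is [subject, metadata, data].
medical = [
  ["Jules Berman", "blood glucose level", "85"],
  ["Mary Smith",   "blood glucose level", "90"],
  ["Samuel Rice",  "blood glucose level", "200"],
  ["Jules Berman", "eye color", "brown"],
  ["Mary Smith",   "eye color", "blue"],
  ["Samuel Rice",  "eye color", "green"]
]

haberdasher = [
  ["Juan Valdez",   "hat size", "8"],
  ["Jules Berman",  "hat size", "9"],
  ["Homer Simpson", "hat size", "9"],
  ["Homer Simpson", "hat_type", "bowler"]
]

# Collect every triple about one subject, whatever dataset it came from.
def triples_about(subject, *datasets)
  datasets.flatten(1).select { |s, _, _| s == subject }
end

triples_about("Jules Berman", medical, haberdasher).each do |triple|
  puts triple.map { |part| "\"#{part}\"" }.join(" ")
end
```

The merge needs no knowledge of either dataset's layout, only that each assertion carries its own subject. It does rely on "Jules Berman" naming the same person in both datasets, which is the identifier problem raised in the first question below.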
RDF (Resource Description Framework) is a syntax for writing computer-parsable
triples. For RDF to serve as a general method for describing data objects,
we need to answer the following four questions:
1. How does the triple convey the unique identity of its subject?
In the triple, "Jules Berman" "blood glucose level" "85", the
name "Jules Berman" is not unique and may apply to several different
people.
2. How do we convey the meaning of metadata terms? Perhaps one person's
definition of a metadata term is different from another person's.
For example, is "hat size" the diameter of the hat, or the distance
from ear to ear on the person who is intended to wear the hat, or a
digit selected from a pre-defined scale?
3. How can we constrain the values described by metadata to a specific
datatype? Can a person have an eye color of 8? Can a person have
an eye color of "chartreuse"?
4. How can we indicate that a unique object is a member of a class and
can be described by metadata shared by all the members of a class?
Much of the remainder of the background section will be devoted to answering these four questions.
RDF is a specialized XML syntax for creating computer-parsable files consisting of triples. The subject of the RDF triple is invoked with the rdf:about attribute. Following the subject is a metadata/data pair.
Let us create an RDF triple whose subject is the jpeg image file specified as: http://www.the_url_here.org/ldip/ldip2103.jpg. The metadata is <dc:title> and the data value is "Normal Lung".
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
</rdf:Description>
An example of three triples in proper RDF syntax is:
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
</rdf:Description>
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:creator>Bill Moore</dc:creator>
</rdf:Description>
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:date>2006-06-28</dc:date>
</rdf:Description>
RDF permits you to collapse multiple triples that apply to a single
subject. The following rdf:Description statement is equivalent to the
three prior triples:
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-28</dc:date>
</rdf:Description>
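The collapsing rule can be sketched in Ruby: group the triples by subject, then write one rdf:Description block per subject. The method names collapse and xml_escape are our own, and the sketch handles only plain-text values, not rdf:resource pointers.

```ruby
# Escape the characters that have special meaning in XML text and attributes.
def xml_escape(text)
  text.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;").gsub('"', "&quot;")
end

# Turn a list of [subject, qualified_property, value] triples into one
# rdf:Description block per distinct subject.
def collapse(triples)
  blocks = triples.group_by(&:first).map do |subject, group|
    lines = ["<rdf:Description", "rdf:about=\"#{xml_escape(subject)}\">"]
    group.each do |_, property, value|
      lines << "<#{property}>#{xml_escape(value)}</#{property}>"
    end
    lines << "</rdf:Description>"
    lines.join("\n")
  end
  blocks.join("\n")
end

image = "http://www.the_url_here.org/ldip/ldip2103.jpg"
triples = [
  [image, "dc:title",   "Normal Lung"],
  [image, "dc:creator", "Bill Moore"],
  [image, "dc:date",    "2006-06-28"]
]
puts collapse(triples)
```

Because all three triples share one subject, the output is a single rdf:Description block containing the three property elements.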
An example of a short but well-formed RDF image specification document is:
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-28</dc:date>
</rdf:Description>
</rdf:RDF>
The first line tells you that the document is XML. The second line tells you that the XML document is an RDF resource. The third and fourth lines declare the namespaces that are referenced within the document (more about this later). Following that is the RDF statement that we have already seen.
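To see that such a document really is a list of triples, it can be read back with an XML parser. The Ruby sketch below uses REXML, which we assume is available (it ships with standard Ruby distributions). The method name triples_from_rdf is our own, and the sketch covers only the simple form used in this document: rdf:Description elements carrying rdf:about, with property elements that hold either text or an rdf:resource pointer.

```ruby
require 'rexml/document'

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Return [subject, property_uri, value] for every property element found
# inside a top-level rdf:Description element.
def triples_from_rdf(xml)
  doc = REXML::Document.new(xml)
  triples = []
  doc.root.each_element do |desc|
    next unless desc.namespace == RDF_NS && desc.name == "Description"
    subject = desc.attributes["rdf:about"]
    desc.each_element do |prop|
      # The namespace URI plus the local name gives the full property name.
      value = prop.attributes["rdf:resource"] || prop.text.to_s.strip
      triples << [subject, prop.namespace + prop.name, value]
    end
  end
  triples
end

SAMPLE_RDF = <<~XML
  <?xml version="1.0"?>
  <rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description
  rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
  <dc:title>Normal Lung</dc:title>
  <dc:creator>Bill Moore</dc:creator>
  <dc:date>2006-06-28</dc:date>
  </rdf:Description>
  </rdf:RDF>
XML

triples_from_rdf(SAMPLE_RDF).each { |triple| puts triple.join("  ") }
```

Expanding the dc: prefix into its namespace URI is what lets a title in one document be recognized as the same property in another, regardless of which prefix each author chose.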
Though we distinguish text files from binary files, all files are actually binary files. Sequential bytes of 8 bits are converted to ascii equivalents, and if the ascii equivalents are alphanumerics, we call the file a text file. If the ascii values of 8-bit sequential file chunks are non-alphanumeric, we call the files binary files.
Standard format image files are always binary files. Because RDF syntax is a pure ascii file format, image binaries cannot be directly pasted into an RDF document. However, binary files can be converted to and from ascii format, using a simple software utility. This simple Perl script, using the MIME::Base64::Perl module, is all that is necessary to convert a binary file to Base64.
#!/usr/bin/perl
use MIME::Base64::Perl;
open (TEXT,"c\:\\ftp\\ldip\\ldip2103\.jpg")||die"cannot"; #path to sample file
binmode TEXT;
$/ = undef;
$string = <TEXT>;
close TEXT;
$encoded = encode_base64($string);
open(OUT,">2103.txt");
print OUT $encoded;
close OUT;
#$decoded = decode_base64($encoded);
#open(OUT,">binary.jpg");
#binmode OUT;
#print OUT $decoded;
exit;
Here is an example of the same RDF document shown in the prior use-case. The only difference is that in addition to pointing to the URL that identifies the image, this document contains the image file converted to base64 ascii.
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Image"/>
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-21</dc:date>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Infinity_3_00169344"/>
<ldip:imageType>photomicrograph</ldip:imageType>
<ldip:stain>H and E</ldip:stain>
<ldip:tissue>lung</ldip:tissue>
<ldip:organism>human</ldip:organism>
<ldip:objective>10x</ldip:objective>
<ldip:diagnosis>normal</ldip:diagnosis>
<ldip:base64File>
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8
UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDB
gNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyM
jIyMjIyMjL/wAARCAYACAADASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAEC
.
.
.
D1phhQnOMVSxL6nRHEaWZz8tpvGBWdPpMpbO08967AW8YOcU4xIRgqKr6yl0N449x0S
0OFOkv6Yq1a6W6HJHFdb9miznbStAhB4AzVfWl0KqY3nsjjNSuEtosE9BXC6lrf7whT
39a7jxLp0xjbah6V5leaZchySh6104PC+196R1zrU6cFZo/9k=
</ldip:base64File>
</rdf:Description>
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Olympus_BH2_224085">
<rdf:type
rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Microscope</ldip:instrumentType>
<ldip:make>Olympus</ldip:make>
<ldip:model>BH2</ldip:model>
<ldip:serialNumber>224085</ldip:serialNumber>
</rdf:Description>
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Infinity_3_00169344">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Camera</ldip:instrumentType>
<ldip:make>Infinity</ldip:make>
<ldip:model>3</ldip:model>
<ldip:serialNumber>00169344</ldip:serialNumber>
</rdf:Description>
</rdf:RDF>
Ruby script, base64.rb, encodes strings in Base64 notation
#!/usr/local/bin/ruby
require 'base64'
text = "The secret of life"
encoded = Base64.encode64(text)
puts("This is the encoded text ... #{encoded}")
decoded = Base64.decode64(encoded)
puts("This is the decoded text ... #{decoded}")
exit
Output of Ruby script base64.rb
C:\ftp\rb>ruby base64.rb
This is the encoded text ... VGhlIHNlY3JldCBvZiBsaWZl
This is the decoded text ... The secret of life
An entire file can be converted to Base64 using class File's read method.
#!/usr/local/bin/ruby
require 'base64'
#read the whole image in binary mode; the block form closes the file
image_file_string = File.open("walnut.jpg", "rb") {|f| f.read}
b64 = Base64.encode64(image_file_string)
puts b64.slice(0,300)
regular = Base64.decode64(b64)
#write the decoded bytes back out in binary mode
File.open("walnew.jpg", "wb") {|f| f.write(regular)}
exit
Base64 represents every 3 bytes of binary data as 4 ASCII characters, so it enlarges image files by about one third. If you want to include Base64 versions of large image binaries, then expect to share very large documents.
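The size penalty is easy to estimate before you encode anything. Base64 turns each group of 3 input bytes into 4 output characters and pads the final group, so the encoded length is 4 times the number of groups. The Ruby sketch below checks that arithmetic with Ruby's built-in pack("m0"), which produces Base64 without line breaks; the method name base64_size is our own.

```ruby
# Predicted length, in characters, of the Base64 form of raw_bytes bytes
# (no line breaks counted).
def base64_size(raw_bytes)
  4 * ((raw_bytes + 2) / 3)
end

raw = "\x00".b * 3000        # stand-in for 3000 bytes of image data
encoded = [raw].pack("m0")   # Base64 with no line breaks

puts "raw bytes:       #{raw.bytesize}"
puts "predicted size:  #{base64_size(raw.bytesize)}"
puts "actual size:     #{encoded.bytesize}"
```

Base64.encode64, used in the scripts above, also inserts a line break after every 60 encoded characters, so its output is slightly larger than this estimate.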
You can prepare an RDF document describing your image, and then simply link to the image from your RDF document, using a pointer.
The following example provides pathologists with the only method that most will ever use.
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Image"/>
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-21</dc:date>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Infinity_3_00169344"/>
<ldip:imageType>photomicrograph</ldip:imageType>
<ldip:stain>H and E</ldip:stain>
<ldip:tissue>lung</ldip:tissue>
<ldip:organism>human</ldip:organism>
<ldip:objective>10x</ldip:objective>
<ldip:diagnosis>normal</ldip:diagnosis>
</rdf:Description>
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Olympus_BH2_224085">
<rdf:type
rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Microscope</ldip:instrumentType>
<ldip:make>Olympus</ldip:make>
<ldip:model>BH2</ldip:model>
<ldip:serialNumber>224085</ldip:serialNumber>
</rdf:Description>
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Infinity_3_00169344">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Camera</ldip:instrumentType>
<ldip:make>Infinity</ldip:make>
<ldip:model>3</ldip:model>
<ldip:serialNumber>00169344</ldip:serialNumber>
</rdf:Description>
</rdf:RDF>
There are times when the structural data (i.e., the non-binary data) for a data object's specification must be distributed in multiple files.
This is one of the most important reasons for using a data specification, rather than a data standard. The specification permits you to create a dynamic object, composed of informational pieces that can be updated, so that the content and value of a specified image object increases over time. A data standard obligates you to compose a static data file. If the standard data file contains information that cannot be shared (due to human subject risks, or to intellectual property encumbrances), the standard file usually cannot be distributed. A specification may consist of multiple files connected by URL pointers. If component files contain privileged information, the data object's specification can be distributed with access restricted to specified files.
RDF file 1 (Describes an image)
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Image"/>
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-21</dc:date>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/>
<ldip:linkedFile rdf:resource="http://www.the_url_here.org/file2"/>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Infinity_3_00169344"/>
<ldip:linkedFile rdf:resource="http://www.the_url_here.org/file3"/>
<ldip:imageType>photomicrograph</ldip:imageType>
<ldip:stain>H and E</ldip:stain>
<ldip:tissue>lung</ldip:tissue>
<ldip:organism>human</ldip:organism>
<ldip:objective>10x</ldip:objective>
<ldip:diagnosis>normal</ldip:diagnosis>
</rdf:Description>
</rdf:RDF>
RDF File 2 (Describes a microscope referenced by File 1)
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Olympus_BH2_224085">
<rdf:type
rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Microscope</ldip:instrumentType>
<ldip:make>Olympus</ldip:make>
<ldip:model>BH2</ldip:model>
<ldip:serialNumber>224085</ldip:serialNumber>
</rdf:Description>
</rdf:RDF>
RDF File 3 (Describes a camera referenced by File 1)
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Infinity_3_00169344">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Camera</ldip:instrumentType>
<ldip:make>Infinity</ldip:make>
<ldip:model>3</ldip:model>
<ldip:serialNumber>00169344</ldip:serialNumber>
</rdf:Description>
</rdf:RDF>
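The benefit of splitting a specification across files can be sketched in Ruby. Here the triples that the three files above would yield are held in a hash keyed by file; the keys "file1" to "file3", the constant names, and the method describe are our own shorthand, and only a few of the properties are shown. A pointer is followed only when the file that describes its target is among the files the reader may access.

```ruby
IMAGE      = "http://www.the_url_here.org/ldip/ldip2103.jpg"
MICROSCOPE = "urn:www.the_url_here.org:ldip:Olympus_BH2_224085"
CAMERA     = "urn:www.the_url_here.org:ldip:Infinity_3_00169344"

# file name => list of [subject, property, value-or-pointer]
FILES = {
  "file1" => [
    [IMAGE, "dc:title", "Normal Lung"],
    [IMAGE, "ldip:instrument_id", MICROSCOPE],
    [IMAGE, "ldip:instrument_id", CAMERA]
  ],
  "file2" => [
    [MICROSCOPE, "ldip:instrumentType", "Microscope"],
    [MICROSCOPE, "ldip:make", "Olympus"]
  ],
  "file3" => [
    [CAMERA, "ldip:instrumentType", "Camera"],
    [CAMERA, "ldip:make", "Infinity"]
  ]
}.freeze

# Describe a subject using only the files listed in `readable`. A value that
# points to a subject described in a readable file is replaced by that
# subject's own properties; otherwise the pointer is left as it is.
def describe(subject, files, readable)
  pool = files.values_at(*readable).compact.flatten(1)
  by_subject = pool.group_by(&:first)
  (by_subject[subject] || []).map do |_, property, value|
    linked = by_subject[value]
    linked ? [property, linked.map { |_, p, v| [p, v] }.to_h] : [property, value]
  end
end

puts "All three files readable:"
describe(IMAGE, FILES, %w[file1 file2 file3]).each { |pair| p pair }
puts "Camera file withheld:"
describe(IMAGE, FILES, %w[file1 file2]).each { |pair| p pair }
```

When the camera file is withheld, the image description is still complete and valid; it simply carries an unresolved identifier where the camera details would have been.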
Suppose, as in the previous example, that the triples relevant to your image lie in multiple RDF files. Suppose, further, that your image is just one of a set of images that were all obtained during the same session, and that all the images apply to the same patient. This situation is routine for radiologic images, wherein dozens of images transecting the brain, or the abdomen, may form part of the same report.
How might you annotate this complex set of data files and image binaries? Simply include an RDF assertion for each image.
RDF File 1 (Describes 2 images)
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Image"/>
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-21</dc:date>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/>
<ldip:linkedFile rdf:resource="http://www.the_url_here.org/file2"/>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Infinity_3_00169344"/>
<ldip:linkedFile rdf:resource="http://www.the_url_here.org/file3"/>
<ldip:imageType>photomicrograph</ldip:imageType>
<ldip:stain>H and E</ldip:stain>
<ldip:tissue>lung</ldip:tissue>
<ldip:organism>human</ldip:organism>
<ldip:objective>10x</ldip:objective>
<ldip:diagnosis>normal</ldip:diagnosis>
</rdf:Description>
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2201.jpg">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Image"/>
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-21</dc:date>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/>
<ldip:linkedFile rdf:resource="http://www.the_url_here.org/file2"/>
<ldip:instrument_id
rdf:resource="urn:www.the_url_here.org:ldip:Infinity_3_00169344"/>
<ldip:linkedFile rdf:resource="http://www.the_url_here.org/file3"/>
<ldip:imageType>photomicrograph</ldip:imageType>
<ldip:stain>H and E</ldip:stain>
<ldip:tissue>lung</ldip:tissue>
<ldip:organism>human</ldip:organism>
<ldip:objective>2.5x</ldip:objective>
<ldip:diagnosis>squamous cell carcinoma</ldip:diagnosis>
</rdf:Description>
</rdf:RDF>
RDF File 2 (Describes a microscope referenced by File 1)
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Olympus_BH2_224085">
<rdf:type
rdf:resource="http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Microscope</ldip:instrumentType>
<ldip:make>Olympus</ldip:make>
<ldip:model>BH2</ldip:model>
<ldip:serialNumber>224085</ldip:serialNumber>
</rdf:Description>
</rdf:RDF>
RDF File 3 (Describes a camera referenced by File 1)
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ldip="http://www.the_url_here.org/ldip_sch#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="urn:www.the_url_here.org:ldip:Infinity_3_00169344">
<rdf:type
rdf:resource= "http://www.the_url_here.org/ldip_sch#Instrument"/>
<ldip:instrumentType>Camera</ldip:instrumentType>
<ldip:make>Infinity</ldip:make>
<ldip:model>3</ldip:model>
<ldip:serialNumber>00169344</ldip:serialNumber>
</rdf:Description>
</rdf:RDF>
The same approach can be used to reference multiple images within a single image file. The rdf:about attribute can point to any file block or file element that contains the part of the image that is the intended subject of the triples (e.g., region of interest, thumbnail, tile, waveform, color map, and so on).
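As a rough illustration of how a script might gather the triples for each image in a file like RDF File 1, here is a short Ruby sketch that scans RDF/XML text with regular expressions. It assumes the simple layout used in this document (one rdf:Description block per subject, literal values as element text, and self-closed rdf:resource properties). The method name extract_triples and the abbreviated sample are ours, chosen for illustration; anything more elaborate would call for a real RDF parser.

```ruby
#!/usr/local/bin/ruby
# Abbreviated sample in the style of RDF File 1 (two images).
RDF_SAMPLE = <<~XML
  <rdf:Description rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
  <dc:title>Normal Lung</dc:title>
  <ldip:instrument_id rdf:resource="urn:www.the_url_here.org:ldip:Olympus_BH2_224085"/>
  <ldip:objective>10x</ldip:objective>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.the_url_here.org/ldip/ldip2201.jpg">
  <dc:title>Normal Lung</dc:title>
  <ldip:objective>2.5x</ldip:objective>
  </rdf:Description>
XML

# Returns an array of [subject, property, value] triples.
def extract_triples(rdf_text)
  triples = []
  description = %r{<rdf:Description\s+rdf:about="([^"]+)"\s*>(.*?)</rdf:Description>}m
  rdf_text.scan(description) do |subject, body|
    # Properties whose value is plain text between the tags
    body.scan(%r{<([\w:]+)>([^<]*)</\1>}) do |property, value|
      triples << [subject, property, value.strip]
    end
    # Properties whose value is another resource (rdf:resource="...")
    body.scan(%r{<([\w:]+)\s+rdf:resource\s*=\s*"([^"]+)"\s*/>}) do |property, resource|
      triples << [subject, property, resource]
    end
  end
  triples
end

extract_triples(RDF_SAMPLE).each do |subject, property, value|
  puts "#{subject} | #{property} | #{value}"
end
```

Because every triple carries its own subject, the triples for the two images can be pooled, sorted, or merged with triples from the microscope and camera files without losing track of which image they describe.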
In the field of biomedicine, DICOM (Digital Imaging and Communications in Medicine) has special significance, because DICOM is the format currently used for almost all radiologic images. DICOM was developed over several decades, to become a multifunctional standard of enormous complexity that uses a model for data storage that is unlike any other image file format. The DICOM standard includes a set of protocols for transferring information through networks, and for communicating between different radiologic devices or different parts of a single device (e.g., between CT machine and CT workstation). It creates a unique syntax and semantics for information and produces a file that contains a large amount of descriptive information (including patient information and diagnostic information), and a binary representation of one or more images.
One of the best descriptions of the DICOM file format is available at:
http://www.dclunie.com/medical-image-faq/html/part1.html
http://www.dclunie.com/medical-image-faq/html/part2.html
DICOM Working Group 26, led by Dr. Bruce Beckwith, is attempting to expand DICOM to include metadata appropriate for pathology images. To the best of our knowledge, there are now a handful of pathology departments that format pathology photomicrographs as DICOM images. The VA (U.S. Veterans Administration) seems to have adopted DICOM as their standard format for all medical images. To the best of our knowledge, nobody using DICOM is currently inserting a complete set of pathology descriptors into their DICOM headers, but this may change over the course of time.
For the purposes of this document, all we need to know is that the header information in a DICOM file can be extracted (with a short Ruby script), and that the binary portion of a DICOM file can be converted to a JPEG file. The header data from the DICOM file can be re-inserted into the header of a JPEG file, or it can be included in a special XML file that "points" back to the original DICOM file or to the JPEG file that contains the image representation.
The nice thing about electronic standards, which sets them apart from standards created for physical objects, is that they are interconvertible.
Though there are hundreds of standard image formats, robust image software can do a pretty good job at converting any format into any other format.
We like to work with JPEG images, because they are the most popular web images. Our philosophy is that if you like to work in DICOM, you should work in DICOM. If you like to work in JPEG, you should work in JPEG. There are simple ways of interconverting the two formats.
You can find many DICOM images at:
ftp://ftp.erl.wustl.edu/pub/dicom/images/version3/RSNA95/
These images can be used as practice files for some of the scripts that will follow.
DICOM has a header that can be extracted from the DICOM image file, and which contains textual descriptive information about the image.
Here is a sample DICOM image header.
0002,0000,File Meta Elements Group Len=122
0002,0001,File Meta Info Version=1
0002,0002,Media Storage SOP Class UID=1.2.840.10008.5.1.4.1.1.7.
0002,0003,Media Storage SOP Inst UID=9999.20070123103417.100.10
0002,0010,Transfer Syntax UID=1.2.840.10008.1.2.1.
0002,0012,Implementation Class UID=960051513
0008,0008,Image Type=
0008,0012,Instance Creation Date=20070123
0008,0013,Instance Creation Time=103417
0008,0016,SOP Class UID=1.2.840.10008.5.1.4.1.1.7.
0008,0018,SOP Instance UID=9999.20070123103417.100.10
0008,0020,Study Date=20070123
0008,0030,Study Time=103417
0008,0050,Accession Number=
0008,0060,Modality=OT
0008,0064,Conversion Type=WSD.
0008,0090,Referring Physician's Name=
0010,0010,Patient's Name=gwmbw.jpg.
0010,0020,Patient ID=0.
0010,0030,Patient Date of Birth=
0010,0040,Patient Sex=M
0010,1010,Patient Age=0.
0020,000D,Study Instance UID=9999.20070123103417.100.20
0020,000E,Series Instance UID=9999.20070123103417.100.30
0020,0010,Study ID= 0
0020,0011,Series Number=0
0020,0013,Image Number=0
0020,0020,Patient Orientation=
0028,0002,Samples Per Pixel=1
0028,0004,Photometric Interpretation=MONOCHROME2
0028,0010,Rows=1536
0028,0011,Columns=2048
0028,0100,Bits Allocated=8
0028,0101,Bits Stored=8
0028,0102,High Bit=7
0028,0103,Pixel Representation=0
7FE0,0010,Pixel Data=3145728
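Each line of the header text shown above has the same shape: a group number, an element number, a descriptive name, an equals sign, and a value. As a minimal sketch, assuming the header has been saved in exactly this comma-and-equals layout, a few lines of Ruby can load it into a hash keyed by the group,element tag. The method name parse_dicom_header is ours, and the sample is a subset of the header lines listed above.

```ruby
#!/usr/local/bin/ruby
# A few lines copied from the sample header above.
HEADER_SAMPLE = <<~TXT
  0008,0020,Study Date=20070123
  0008,0060,Modality=OT
  0010,0030,Patient Date of Birth=
  0028,0010,Rows=1536
  0028,0011,Columns=2048
TXT

# Returns a hash such as { "0028,0010" => { name: "Rows", value: "1536" } }.
def parse_dicom_header(text)
  elements = {}
  text.each_line do |line|
    # group (4 hex digits), element (4 hex digits), name, "=", value
    match = line.chomp.match(/\A(\h{4}),(\h{4}),([^=]*)=(.*)\z/)
    next unless match # skip anything that is not a header line
    tag = "#{match[1]},#{match[2]}"
    elements[tag] = { name: match[3].strip, value: match[4].strip }
  end
  elements
end

header = parse_dicom_header(HEADER_SAMPLE)
puts "#{header['0028,0010'][:name]}: #{header['0028,0010'][:value]}"
puts "#{header['0028,0011'][:name]}: #{header['0028,0011'][:value]}"
```

Elements with no value (such as Patient Date of Birth in the sample) are kept with an empty string, so the hash still records that the element was present in the header.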
ezDICOM (Copyright 2002, Wolfgang Krug and Chris Rorden) is a medical viewer for DICOM images. It is distributed along with dcm2jpg, a command-line application that can convert DICOM images into standard bitmap file formats (JPEG, PNG, and BMP). In addition, it will convert a DICOM image to its textual header information.
The sample command-line is:
dcm2jpg -f p -o C:\TEMP -z 1.5 C:\DICOM\input1.dcm C:\input2.dcm
This command-line may contain information for brightness, contrast, format of output, output target directory, input files, etc.
If you simply invoke:
dcm2jpg <dicom filename and path if file is not in current directory>
the DICOM file will be converted to a .jpg file in the directory that holds the DICOM file.
The .exe file is:
dcm2jpg.exe 218,112 bytes - converts to jpeg by default
If you wish, you can simply rename the .exe file to change the default conversion behavior.
dcm2bmp.exe 218,112 bytes - converts to bmp by default
dcm2png.exe 218,112 bytes - converts to png by default
dcm2txt.exe 218,112 bytes - converts to text header by default
Any JPEG file can be converted to a DICOM file, with jpeg2dcm. This free software by CharruaSoft software can be downloaded from:
http://www.charruasoft.com/downen.htm
It is a simple exe file (jpeg2dcm.exe 511,488 bytes) that can operate from a command-line.
It will accept an input file, such as gems.jpg, and convert it to a DICOM file:
gems.jpg 132,135 bytes
gems.dcm 1,980,712 bytes
We will use:
dcm2jpg.exe 218,112 bytes - converts to jpeg by default
dcm2txt.exe 218,112 bytes - converts to text header by default
Ruby can simply call a command-line application from within a script, using the exec method.
Ruby script, dcm2jpg.rb, converts a DICOM file into a JPEG file:
#!/usr/local/bin/ruby
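# exec replaces the running Ruby process with dcm2jpg.exe,
# so no Ruby statement placed after it would ever run.
exec("dcm2jpg.exe c:\\ftp\\picker\\dicom\\CT4174~1")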
Screen output of dcm2jpg.rb script
c:\ftp>ruby dcm2jpg.rb 1 Creating: c:\ftp\picker\dicom\CT4174~1.jpg 1
Ruby script, dcmsplit.rb, converts a DICOM file into a JPEG file and a text file for the header:
#!/usr/local/bin/ruby
# system runs each command and then returns control to the script,
# so both conversions are performed in turn.
system("dcm2jpg.exe c:\\ftp\\picker\\dicom\\CT4174~1")
system("dcm2txt.exe c:\\ftp\\picker\\dicom\\CT4174~1")
exit
Screen output of dcmsplit.rb script
c:\ftp>ruby dcmsplit.rb 1 Creating: c:\ftp\picker\dicom\CT4174~1.jpg 1 1 Creating: c:\ftp\picker\dicom\CT4174~1.txt 1
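When a script shells out to a converter, it is worth checking whether each call succeeded before moving on. The following is a minimal sketch, not part of ezDICOM: the method names dicom_commands and run_commands are ours, and the only command lines used are the two already shown (dcm2jpg.exe and dcm2txt.exe followed by a file path). Ruby's system method returns true when a command succeeds, false when it exits with an error status, and nil when the command cannot be run at all.

```ruby
#!/usr/local/bin/ruby
# Build the two command lines used to split one DICOM file.
def dicom_commands(dicom_path)
  ["dcm2jpg.exe #{dicom_path}", "dcm2txt.exe #{dicom_path}"]
end

# Run each command in turn, stopping at the first failure.
# runner defaults to Kernel#system; a substitute can be passed in
# so the logic can be tried on a machine without the converters.
def run_commands(commands, runner = ->(command) { system(command) })
  commands.each do |command|
    succeeded = runner.call(command)
    return "failed: #{command}" unless succeeded
  end
  "ok"
end

# Typical use (requires dcm2jpg.exe and dcm2txt.exe on the path):
# puts run_commands(dicom_commands('c:\ftp\picker\dicom\CT4174~1'))
```

Stopping at the first failure avoids producing a text header for a DICOM file whose image could not be converted.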
The just-created JPEG file lacks the clinical information contained in the DICOM file, but this information is now available to us in our newly created text file. In the next section, we will see how any textual information can be inserted back into a JPEG file, using RMagick.
Ruby script, jpeg_add.rb, inserts textual information into a JPEG image:
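#!/usr/local/bin/ruby
# Newer RMagick releases expect the lowercase form: require 'rmagick'
require 'RMagick'
include Magick
# Read the DICOM header text produced by dcm2txt
text = IO.read("gwmbw.txt")
orig_image = ImageList.new("gwmbw.jpg")
# Store the header text in the image's Comment property
orig_image.cur_image[:Comment] = text
print "\nComment added, let's make a file to hold the modifications\n\n"
# Copy the annotated image and write the copy to a new file
copy_image = orig_image.cur_image.copy
copy_image.write("c:\\ftp\\rb\\gwmout.JPG")
# List every property (name, then value) stored in the new image
copy_image.properties{|name, value| print "#{name}\n#{value}\n"}
exit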
Output of jpeg_add.rb script:
C:\ftp\rb>ruby jpeg_add.rb
Comment added, let's make a file to hold the modifications
Comment
0002,0000,File Meta Elements Group Len=122
0002,0001,File Meta Info Version=1
0002,0002,Media Storage SOP Class UID=1.2.840.10008.5.1.4.1.1.7.
0002,0003,Media Storage SOP Inst UID=9999.20070123103417.100.10
0002,0010,Transfer Syntax UID=1.2.840.10008.1.2.1.
0002,0012,Implementation Class UID=960051513
0008,0008,Image Type=
0008,0012,Instance Creation Date=20070123
0008,0013,Instance Creation Time=103417
0008,0016,SOP Class UID=1.2.840.10008.5.1.4.1.1.7.
0008,0018,SOP Instance UID=9999.20070123103417.100.10
0008,0020,Study Date=20070123
0008,0030,Study Time=103417
0008,0050,Accession Number=
0008,0060,Modality=OT
0008,0064,Conversion Type=WSD.
0008,0090,Referring Physician's Name=
0010,0010,Patient's Name=gwmbw.jpg.
0010,0020,Patient ID=0.
0010,0030,Patient Date of Birth=
0010,0040,Patient Sex=M
0010,1010,Patient Age=0.
0020,000D,Study Instance UID=9999.20070123103417.100.20
0020,000E,Series Instance UID=9999.20070123103417.100.30
0020,0010,Study ID= 0
0020,0011,Series Number=0
0020,0013,Image Number=0
0020,0020,Patient Orientation=
0028,0002,Samples Per Pixel=1
0028,0004,Photometric Interpretation=MONOCHROME2
0028,0010,Rows=1536
0028,0011,Columns=2048
0028,0100,Bits Allocated=8
0028,0101,Bits Stored=8
0028,0102,High Bit=7
0028,0103,Pixel Representation=0
7FE0,0010,Pixel Data=3145728
JPEG-Colorspace
1
JPEG-Sampling-factors
1x1
The output JPEG image file (gwmout.jpg) contains the binary representation of the same image in gwmbw.dcm and in gwmbw.jpg. In addition, it contains a Comment field consisting of the contents of file gwmbw.txt, the textual representation of the original DICOM header. We can extract the Comment field from the JPEG header whenever we wish, using RMagick's properties method.
DICOM is a complex and highly specialized standard developed over several decades. It is intended to negotiate a variety of network transactions in addition to encapsulating an image binary. DICOM was created prior to the development of XML and prior to the development of the web transfer protocol, HTTP. It depends on a variety of data control devices that some may find anachronistic (e.g., prescribed byte locations, DICOM-specific data transfer and negotiation protocols). Mastering the technical aspects of DICOM may require months of analyzing the many volumes of DICOM technical reports. With few exceptions, DICOM is something that serves radiologists through proprietary software written for imaging devices that may cost millions of dollars in outlay plus maintenance.
Though DICOM has proven itself to be an excellent standard for radiology images, it is not currently a popular standard for pathology images.
Pathologists should have the option of using DICOM, or using some other image format. Regardless of the image format chosen, it is important to annotate images with textual data in a computer parsable form, that can be ported between alternate image formats.
Pathologists typically use off-the-shelf cameras and software, and the raw pixel data is usually provided in a popular image format, such as jpeg, png, bmp, or tiff. By specifying images in RDF, pathologists have a simple, inexpensive and easily implemented method for annotating images with useful descriptive data, and for binding these annotations to the images.
Because all the data in a specification is fully described, it is very easy to write software that will port a specification into a data standard. To have full compatibility between a data specification and data standard, the specification must contain all of the required data elements of the data standard. This usually involves:
1. Studying the data standard, and writing an RDF Schema (or supplementing
an existing RDF Schema) with classes and properties appropriate for
the data standard.
2. RDF specifications, unlike data standards, do not place requirements
on the classes and properties that need to be included in the document.
To create an RDF specification that can be ported to a data standard,
the RDF document must be prepared with knowledge of the classes and
properties that are required by the data standard. This information
can be provided as an external document, or as triples added to a
schema, or as a feature of the [textual] CDE document. At present,
no single method has emerged as the preferred strategy.
3. Writing software that will parse the RDF document in which the data
object is specified (trivial), and transform the triples into the
document that conforms to the structure and content of the data
standard.
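The second step above, knowing which properties a data standard requires, lends itself to a simple automated check. The sketch below is only an illustration: the list REQUIRED_PROPERTIES is hypothetical (it does not come from any real data standard), the method name missing_properties is ours, and the triples are written as plain Ruby arrays of subject, property, and value.

```ruby
#!/usr/local/bin/ruby
# Hypothetical list of the properties that a target data standard requires.
REQUIRED_PROPERTIES = ["dc:title", "dc:date", "ldip:tissue"].freeze

# triples is an array of [subject, property, value] arrays.
# Returns the required properties that the subject does not yet have.
def missing_properties(triples, subject, required = REQUIRED_PROPERTIES)
  present = triples.select { |s, _property, _value| s == subject }
                   .map { |_s, property, _value| property }
  required - present
end

image = "http://www.the_url_here.org/ldip/ldip2103.jpg"
triples = [
  [image, "dc:title", "Normal Lung"],
  [image, "ldip:tissue", "lung"]
]

missing = missing_properties(triples, image)
if missing.empty?
  puts "All required properties are present"
else
  puts "Missing: #{missing.join(', ')}"
end
```

In this example the image has a title and a tissue but no date, so the script reports dc:date as missing. The same check could be run before the porting software of step 3 is invoked.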
The term, "common data element" is a misnomer. Most people, when they first encounter this term, assume that a data element holds data. Actually, a common data element is the metadata that describes a datum in a data record. In XML parlance, a CDE is an XML tag. The thing that makes a descriptor "common" is its common usage by a scientific community. The way that CDEs are intended to work is that a scientific community creates a list of CDEs that describe the kinds of data that their members use. The members of the community will all use the same CDEs (XML tags) to annotate their data files.
One of the most calamitous errors in any CDE project is to assume that everyone who reads a metadata tag will automatically understand its intended meaning. ISO-11179 is a standard way of defining CDEs with the necessary information for understanding their meanings.
The most popular CDEs in existence are the Dublin Core CDEs. These are a set of file descriptors that were prepared by a committee of librarians, who convened in Dublin, Ohio. The Dublin Core includes basic information about electronic documents, such as: the title of the document, the name of the person who created the file, the date that the file was created, the date that the file was modified, and a short description of the file. These are basically the items that a library software agent would need to retrieve, if it were building an index of internet documents. The world of informatics would be a better place if everyone who created an HTML, XML or RDF file would remember to include the Dublin Core CDEs.
These Dublin Core CDEs have been prepared to comply with the ISO-11179 specification. Every effort to create a data specification for a knowledge domain should begin by collecting the common data elements for the domain, and annotating each element with the ISO-11179 CDE descriptors. The ISO-11179 descriptors for two of the Dublin Core CDEs (Title and Creator) are shown below.
From: http://dublincore.org/documents/1999/07/02/dces/
Title
Identifier: Title
Version: 1.1
Registration Authority: Dublin Core Metadata Initiative
Language: en
Obligation: Optional
Datatype: Character String
Maximum Occurrence: Unlimited
Definition: A name given to the resource.
Comment: Typically, a Title will be a name by which the resource is
formally known.
Creator
Identifier: Creator
Version: 1.1
Registration Authority: Dublin Core Metadata Initiative
Language: en
Obligation: Optional
Datatype: Character String
Maximum Occurrence: Unlimited
Definition: An entity primarily responsible for making the content
of the resource.
Comment: Examples of a Creator include a person, an organisation,
or a service. Typically, the name of a Creator should be used to
indicate the entity.
An RDF schema is a dictionary file that lists the classes and the properties that pertain to RDF documents. In fact, the official long name for RDF Schema is the RDF Vocabulary Description Language. The classes of an RDF schema are formal definitions for the kinds of subjects that are found in the RDF triples. The properties of an RDF schema are the types of metadata descriptors for the data of the RDF triples. Elements in RDF schemas may be subclasses of elements in other RDF schemas.
Things to remember about RDF Schemas
1. RDF Schemas are written in XML, but are completely unlike XML Schemas.
2. RDF Schemas contain declarations of the classes and properties
that are used in RDF documents.
3. RDF Schemas, like all RDF documents, have no pre-determined order or
composition, and consist of statements expressed as triples. The
subject of every triple in an RDF Schema will be either Class or
Property.
4. Every RDF Schema can be thought of as a child of the W3C RDF Schema
that defines the "super" classes Resource, Class and Property. All
RDF Schemas will refer to the document that defines RDF syntax and
to the document that defines the top-level schema, and therefore
will begin something like this:
<?xml version='1.0' encoding='ISO-8859-1'?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
5. A typical RDF document consists of triples <subject, metadata, value>.
RDF documents usually reference one or more RDF Schemas to instantiate
the subject of each triple (i.e., to tell us which class in an RDF
schema the subject is an instance of) and to provide subjects with
class-appropriate metadata.
6. Documents composed of triples whose components are defined by RDF
Schemas can be used to completely specify data objects within a
knowledge domain.
7. By completely specifying data objects in a knowledge domain, RDF
specifications achieve the functionality of data standards.
In a later section, we will demonstrate the simple method for instantiating classes and for associating class instances with appropriate properties and with datatyped values.
Here is a CDE template that satisfies the minimal recommendations for a CDE under ISO-11179, and that provides all the information for a class or a property in an RDF schema.
The general format for class elements is:
Class Label (in standard XML tag format, uppercase first letter):
Registration Authority: Association for Pathology Informatics
Obligation: optional
Maximum Occurrence: Unlimited
Comment (must include detailed definition):
subClassOf:
Contributor (your consistent first-name last-name):
Date of your contribution:
The general format for property elements is:
Property Label (in standard XML tag format, lowercase first letter):
Registration Authority: Association for Pathology Informatics
Obligation: optional
Maximum Occurrence: Unlimited
Datatype (can be "Literal", a list, or a regex; default is "Literal"):
Comment (must include detailed definition):
Domain (comma-delimited if multiple):
Range (usually "Literal"):
Contributor (your consistent first-name last-name):
Date of your contribution:
The category "Obligation" should contain the word "required" or the word "optional". For the kinds of specifications discussed in this manuscript, including any CDE would always be optional. Similarly, for "Maximum Occurrence", we would think any CDE could occur an unlimited number of times in an RDF document.
Here is an example of an ISO-11179-compliant CDE written for a class named "Reagent".
Class Label:Reagent
versionInfo (required): 0.1
Registration Authority: Association for Pathology Informatics
Obligation:optional
Maximum Occurrence: Unlimited
Datatype: Literal
comment: Histologic_stain_reagents, tissue_fixation_reagents, and other chemicals employed in the laboratory. For example: distilled_water, ethanol, hematoxylin, aluminum_sulphate
subClassOf:Class
Contributor:Bill Moore
Date_of_contribution:05-30-2006
Once we have the CDE, it is a straightforward job to create an RDF Schema Class element:
<rdfs:Class rdf:about="http://www.the_url_here.org/ldip_sch#Reagent">
<rdfs:label>Reagent</rdfs:label>
<rdfs:comment>
Histologic_stain_reagents, tissue_fixation_reagents, and other chemicals employed in the laboratory. For example: distilled_water, ethanol, hematoxylin, aluminum_sulphate
</rdfs:comment>
<rdfs:subClassOf rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>
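Because the CDE already holds every piece of the class element, the conversion can be done by a script. Here is a minimal Ruby sketch, assuming the CDE has been loaded into a hash; the hash keys (:label, :comment, :sub_class_of) and the method name class_element are ours. The sketch writes the element as rdfs:Class, which is the name under which Class is defined in the RDF Schema vocabulary, and it treats a subClassOf value of "Class" as a reference to that top-level class.

```ruby
#!/usr/local/bin/ruby
SCHEMA_URL = "http://www.the_url_here.org/ldip_sch#"
RDFS_URL = "http://www.w3.org/2000/01/rdf-schema#"

# cde is a hash holding the Class Label, comment, and subClassOf fields.
def class_element(cde)
  # "Class" refers to the top-level RDF Schema class; any other
  # parent is assumed to be declared in our own schema.
  parent = if cde[:sub_class_of] == "Class"
             "#{RDFS_URL}Class"
           else
             "#{SCHEMA_URL}#{cde[:sub_class_of]}"
           end
  <<~XML
    <rdfs:Class rdf:about="#{SCHEMA_URL}#{cde[:label]}">
    <rdfs:label>#{cde[:label]}</rdfs:label>
    <rdfs:comment>
    #{cde[:comment]}
    </rdfs:comment>
    <rdfs:subClassOf rdf:resource="#{parent}"/>
    </rdfs:Class>
  XML
end

reagent = {
  label: "Reagent",
  comment: "Histologic_stain_reagents, tissue_fixation_reagents, " \
           "and other chemicals employed in the laboratory.",
  sub_class_of: "Class"
}
puts class_element(reagent)
```

The sketch does no XML escaping, so it assumes that labels and comments contain no characters such as < or & that would need it.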
A few points bear explanation. All the information needed to generate an RDF class or property should be contained in an ISO-11179-compliant list of CDEs. An RDF Schema may consist purely of classes and properties. Classes are defined in RDF exclusively through their ancestral relation. Basically, to build a class in RDF Schema, you announce that the element is a Class, you provide a unique locator (such as a URL) or a unique universally understood descriptor (more on this later) for the element, a description of the element, and the name of the father class of the element.
That's all there is to do for classes. You don't need to list the subclasses of the class because the subclasses will list the class as their father in their own schema entry. You don't need to list the properties of the class because the properties will list the classes whose data they describe.
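To see why a class never needs to list its own subclasses, consider the following Ruby sketch. Each class is stored with nothing but the name of its father class, yet both the subclasses of a class and the full lineage of a class can be recovered on demand. The class names are hypothetical examples, and the method names subclasses_of and lineage are ours.

```ruby
#!/usr/local/bin/ruby
# Each entry records only what a schema class element records: its father.
PARENT_OF = {
  "Person" => "Class",
  "Patient" => "Person",
  "Pathologist" => "Person",
  "Report" => "Class",
  "Surgical_Pathology_Report" => "Report"
}.freeze

# Every class that names class_name as its father.
def subclasses_of(class_name, parent_of = PARENT_OF)
  parent_of.select { |_child, parent| parent == class_name }.keys
end

# The chain from a class up to the top-level class.
def lineage(class_name, parent_of = PARENT_OF)
  chain = [class_name]
  chain << parent_of[chain.last] while parent_of.key?(chain.last)
  chain
end

puts subclasses_of("Person").join(", ")
puts lineage("Surgical_Pathology_Report").join(" -> ")
```

Adding a new subclass requires one new entry that names its father; no existing entry has to change.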
Do classes in RDF schema remind you of anything? The classes in an RDF schema comprise an ontology. An ontology is a list of classes and their relationships. You can think of an ontology as the "classy" half of an RDF schema. Classes become most useful when they have Properties (the other half of the RDF schema).
A property is a metadata element that is used to describe the data assigned to one or more class objects. Here is the CDE for a property named "dateTime".
Identifier:ldip:dateTime
Property Label:dateTime
versionInfo: 0.1
Registration Authority: Association for Pathology Informatics
Language:en
Obligation:optional
Maximum Occurrence: Unlimited
Datatype: /[\+\-]{1}[\d]{8}\.[\d]{6}Z[\+\-]{1}[\d]{4}/
comment: ISO 8601 format of date and time.
domain:Event
range: http://www.the_url_here.org/ldip_xsd.xsd#iso8601
Contributor:Bill Moore
Date_of_contribution:05-30-2006
An RDF Schema declaration for the dateTime property might be:
<rdf:Property rdf:about="http://www.the_url_here.org/ldip_sch#dateTime">
<rdfs:label>dateTime</rdfs:label>
<rdfs:comment>
The date and time at which an event occurs, in ISO8601 format
</rdfs:comment>
<rdfs:domain rdf:resource="http://www.the_url_here.org/ldip_sch#Event"/>
<rdfs:range rdf:resource="http://www.the_url_here.org/ldip_xsd.xsd#iso8601"/>
</rdf:Property>
Let's look at the dateTime property. The opening tag announces that we are declaring a Property, and its rdf:about attribute gives the name of the Property (dateTime) and its URL (the current RDF schema document). The next line provides the label by which we will refer to the Property. This might come in handy if we had different names for the property in different languages. The comment includes a definition for the element.
The next line specifies the domain (class) for the property. The domain of a property is the class for which the property may be used. In this case, the domain class for which the dateTime property applies is Event. This makes sense. If you need to describe an event, you would want to include the time that the event occurred.
A property for a class may serve as a property for all of the subclasses of the class (because all the subclass instances are members of the ancestor class). Every Property must have a domain (a class or classes for which the Property may be used) and a Range (a specified kind of data that is described by the Property). A property may have multiple classes in its domain. When a property has multiple classes in its domain, all the classes in the domain share the same property (obviously). This achieves some of the functionality of multi-class inheritance, without actually needing to instantiate multiple classes under a single object. This is a subtle concept, and does not need to be mastered at this time. Suffice it to say that as you create your own RDF Schemas, you should try to design your Properties to apply to multiple classes, and you should try to instantiate objects under a single class.
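The rule that a property declared for a class also serves its subclasses can be expressed in a few lines of Ruby. In this sketch, the Event class and the dateTime property come from the example above; the Biopsy subclass, the two tables, and the method name property_applies? are hypothetical and serve only to illustrate the lookup.

```ruby
#!/usr/local/bin/ruby
# Father class of each class (nil marks a class with no listed father).
SUPERCLASS_OF = {
  "Event" => nil,
  "Biopsy" => "Event", # hypothetical subclass of Event
  "Image" => nil
}.freeze

# Classes named in the domain of each property.
PROPERTY_DOMAINS = {
  "dateTime" => ["Event"]
}.freeze

# True if class_name, or any of its ancestors, is in the property's domain.
def property_applies?(property, class_name)
  domains = PROPERTY_DOMAINS.fetch(property, [])
  current = class_name
  while current
    return true if domains.include?(current)
    current = SUPERCLASS_OF[current] # climb to the father class
  end
  false
end

puts property_applies?("dateTime", "Biopsy") # true: Biopsy is a kind of Event
puts property_applies?("dateTime", "Image")  # false: Image is not an Event
```

A property with several classes in its domain is handled by the same lookup, because the domain is stored as a list.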
Let us continue to examine the dateTime property. Recall that a triple consists of a subject, followed by metadata (the property element), followed by the data. The property element describes the data. The range of the property element tells us what kind of data is described. In RDF schemas, the range of a property is often "Literal", an element defined in the RDF syntax document that refers to any character string. You can see immediately that describing the range of a property as a character string does little to constrain or structure the expected values for a data element.
In the dateTime property, we want the range of the property to be data that conforms to the ISO8601 date/time format. How do we convey the datatype of the date/time element in RDF?
RDF has no intrinsic datatyping facility. So for our property range, we provide a resource (URL) that specifies an element in an .xsd file, that defines the datatype we need.
The range for the dateTime property is a resource:
<rdfs:range rdf:resource="http://www.the_url_here.org/ldip_xsd.xsd#iso8601"/>
The resource points us to an xsd file on the web, and to a particular element within the xsd file, labeled iso8601. Let's pretend we visit the file and extract the iso8601 element. We might find the following:
<xsd:simpleType name='iso8601'>
<!-- values of a date_time must contain -->
<!-- a plus or minus sign occurring zero or one -->
<!-- times followed by 8 digits -->
<!-- followed by a period -->
<!-- followed by 6 digits -->
<!-- followed by the letter Z, the letter T, or a space -->
<!-- followed by a plus or minus sign occurring -->
<!-- exactly once, followed by 4 digits -->
<xsd:restriction base='xsd:string'>
<xsd:pattern value="[\+\-]?[\d]{8}\.[\d]{6}[ZT ][\+\-]{1}[\d]{4}"/>
</xsd:restriction>
</xsd:simpleType>
The essence of the datatype is found in the pattern value line:
<xsd:pattern value="[\+\-]?[\d]{8}\.[\d]{6}[ZT ][\+\-]{1}[\d]{4}"/>
This line uses a Regular Expression (RegEx), that provides a pattern to which the element must conform. RegEx is beyond the scope of this manuscript.
Don't be intimidated by .xsd and Regex rules. For most purposes, simply describing the range of a property with the RDF syntax-defined element, "Literal" will be all that you need.
The .xsd element definition imposes a datatype pattern on the value of the data described by the property. A validating software agent would check an RDF document to determine if the data described by a property conform to the range of the property element, as defined by the element in the .xsd resource for the property range.
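The check performed by such a validating agent can be sketched in Ruby. The regular expression below is the pattern from the iso8601 datatype, with the anchors \A and \z added because an XSD pattern must match the whole value. The method name valid_date_time? and the test values are ours, chosen only to exercise the pattern.

```ruby
#!/usr/local/bin/ruby
# Optional sign, 8 digits, a period, 6 digits, then Z, T or a space,
# then a required sign and 4 digits.
ISO8601_PATTERN = /\A[+-]?\d{8}\.\d{6}[ZT ][+-]\d{4}\z/

# True if the value conforms to the range of the dateTime property.
def valid_date_time?(value)
  ISO8601_PATTERN.match?(value)
end

puts valid_date_time?("20060621.103417Z-0500") # true
puts valid_date_time?("2006-06-21")            # false
```

A value that fails the check is not necessarily a bad date; it only fails to match the datatype that the schema author chose for the property.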
If we want to employ this trick, we'll need to prepare an XSD file that contains elements for all the datatypes referred to by the property ranges in our RDF Schema.
XSD datatype files are very easy to prepare. Basically, you just list your datatypes and provide descriptors. The following generic file contains samples of the kinds of datatypes you will probably need (patterns, inclusive values, and unions).
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:simpleType name='sp_pattern'>
<!-- values of an accession number must contain -->
<!-- the letters sp followed by a hyphen followed -->
<!-- by two digits followed by a hyphen -->
<!-- followed by any number of digits -->
<xsd:restriction base='xsd:string'>
<xsd:pattern value='sp\-[0-9]{2}\-[0-9]+'/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="adult_age">
<!-- an adult is at least 18 years old -->
<xsd:restriction base="xsd:positiveInteger">
<xsd:minInclusive value="18"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="serialNumber">
<!-- may be either integers or mixed alphanumeric strings -->
<xsd:union>
<xsd:simpleType>
<xsd:restriction base='xsd:integer'/>
</xsd:simpleType>
<xsd:simpleType>
<xsd:restriction base='xsd:string'/>
</xsd:simpleType>
</xsd:union>
</xsd:simpleType>
<xsd:simpleType name="EnumerationObjectives">
<!-- an objective must be one of the listed magnifications -->
<xsd:restriction base="xsd:string">
<xsd:enumeration value="2.5x"/>
<xsd:enumeration value="6.3x"/>
<xsd:enumeration value="20x"/>
<xsd:enumeration value="40x"/>
<xsd:enumeration value="100x"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:schema>
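For readers who are more comfortable with scripts than with XSD, three of the datatypes in the generic file can be restated as Ruby checks. This is only a sketch of what each datatype permits; the method names are ours, and the values are passed as strings, as they would be when read from an RDF document.

```ruby
#!/usr/local/bin/ruby
# sp_pattern: "sp", a hyphen, two digits, a hyphen, one or more digits
SP_PATTERN = /\Asp-[0-9]{2}-[0-9]+\z/

# EnumerationObjectives: the only values allowed
OBJECTIVES = %w[2.5x 6.3x 20x 40x 100x].freeze

def valid_accession?(value)
  SP_PATTERN.match?(value)
end

# adult_age: a whole number that is at least 18
def adult_age?(value)
  /\A\d+\z/.match?(value) && value.to_i >= 18
end

def valid_objective?(value)
  OBJECTIVES.include?(value)
end

puts valid_accession?("sp-06-4352") # true
puts adult_age?("17")               # false
puts valid_objective?("40x")        # true
puts valid_objective?("10x")        # false
```

Note that an enumeration is strict: a 10x objective is rejected because it is not on the list, so the list must be kept complete for the instruments in use.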
The most difficult step in building any schema is determining whether a candidate element is a Class or a Property. Generalizations do not hold for all cases. For example, Classes tend to be nouns, while Properties (that describe data) tend to be adjectives. However, a Property can be a noun (e.g. Time) if its role is to describe a data value (4:00 PM EST). Furthermore, we sometimes assign active processes to classes (e.g. birth, death), and we cannot assume that classes are always static objects.
There is a strong tendency to assign subclass status to things that are not examples of their ancestral class. For instance, if Person is a class, someone may think that Leg is a subclass of Person (because a Leg is in a class of things that are parts of a Person). No! Leg is never a subclass of Person because a Leg is not a Person. A subclass of Person must be composed of types of Persons. So, Patient is a subclass of Person, and Pathologist is a subclass of Person, because they are both examples of Persons and because there are instances of Patients and instances of Pathologists. Remember, a class is a construct whose chief job is to provide specified instances.
How about Friend? Is Friend a subclass of Person? Yes and no. Friend can be a subclass of person if you want to organize Persons based on whether they are Friends or not-Friends. However, if you think that being a friend is just one of many features of any Person, you would be much better off defining friend as an RDF property. The data-type of the friend property may be a Boolean (true or false).
Here are some general recommendations for distinguishing RDF Schema Classes and Properties.
1. If something has instances of itself, it is almost always a class.
2. If a candidate class is a subclass of more than one class lineage
(so-called multiple inheritance), think very hard before making it a
class. In most cases, you will be better off if it is assigned as a
Property, or if it is excluded from the RDF Schema.
3. Every class must be a subclass of a class. To be a subclass of a
class the subclass must qualify as a member of the father class.
4. A class is fully specified when you know its definition and you know
its ancestor class.
5. Properties describe data. If something has a specific datatype that
includes numerics, it is almost always a property.
The purpose of a class is to support the creation of subclasses and class instances. If we have a Report class, we might also have a Surgical_Pathology_Report class which is a subClassOf Report. Elsewhere_General_S06_4352 may be one unique instance of the class Surgical_Pathology_Report. As an instance of the class, the data in the report can be described using the properties specified in an RDF schema to have the Surgical_Pathology_Report domain.
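The "is a" test described above can be sketched in a few lines of Ruby. This is only an illustration, not part of the LDIP schema: the parent table, and the method names lineage and kind_of_class?, are our own assumptions. Each class names exactly one parent, so a candidate such as Leg, which has no lineage leading to Person, fails the test.

```ruby
# Hypothetical parent table: each class names the single class it is a
# subclass of. "Class" stands for the top-level RDF Schema class.
PARENT = {
  "Person"                    => "Class",
  "Patient"                   => "Person",
  "Pathologist"               => "Person",
  "Report"                    => "Class",
  "Surgical_Pathology_Report" => "Report"
}.freeze

# Walk up the parent table and return the list of ancestors of a class.
def lineage(class_name)
  chain = []
  current = class_name
  while PARENT.key?(current)
    current = PARENT[current]
    chain << current
  end
  chain
end

# A class is a kind of another class only if that class is in its lineage.
def kind_of_class?(class_name, ancestor)
  lineage(class_name).include?(ancestor)
end

puts lineage("Surgical_Pathology_Report").inspect
puts kind_of_class?("Patient", "Person")
puts kind_of_class?("Leg", "Person")
```

Because every entry has a single parent, the table cannot express multiple inheritance, which is in keeping with the recommendation to avoid it.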
The way to create an instance of a class in an RDF document is with the RDF "type" primitive.
<rdf:Description rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<rdf:type rdf:resource="http://www.the_url_here.org/ldip_sch#Image"/>
</rdf:Description>
Whenever we wish to create an instance of a class belonging to any RDF Schema we choose, we will add an RDF statement much like the one shown above. We begin by specifying the subject of the statement (with the "about" declaration), then we indicate that the "type" of the subject is the class listed in an RDF Schema. An object may be an instance of more than one class, and a proper RDF statement may list numerous type/class pairs, but we caution that doing so adds complexity to your document.
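Since every typing statement has the same shape, it can be generated by a short script. The following Ruby sketch is one possible way to do it; the method name type_statement is our own, and the sketch assumes that the two URIs contain no characters that need XML escaping.

```ruby
# Build an rdf:Description block that types a subject as an instance of a
# class. Both URIs are supplied by the caller and are assumed to contain
# no characters that require XML escaping.
def type_statement(subject_uri, class_uri)
  [
    %(<rdf:Description rdf:about="#{subject_uri}">),
    %(<rdf:type rdf:resource="#{class_uri}"/>),
    %(</rdf:Description>)
  ].join("\n")
end

puts type_statement(
  "http://www.the_url_here.org/ldip/ldip2103.jpg",
  "http://www.the_url_here.org/ldip_sch#Image"
)
```

To type one subject as an instance of several classes, the rdf:type line would simply be repeated, once per class, inside the same rdf:Description block.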
One of the most important features of RDF Schemas is that you can mix and match different elements (classes and properties) from different schemas in a single document. This is done using a simple namespace notation that is common to all XML documents.
<?xml version='1.0' encoding='ISO-8859-1'?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:schem1="http://www.someplace.org/#"
xmlns:schem2="http://www.someplace_else.com/#">
<rdf:Description rdf:about="http://www.the_url_here.org/ldip/ldip2201.jpg">
<dc:creator>Bill Moore</dc:creator>
<schem1:camera>yes</schem1:camera>
<schem2:camera>Olympus</schem2:camera>
<schem1:format>jpeg</schem1:format>
<schem2:format>jpg</schem2:format>
</rdf:Description>
</rdf:RDF>
Notice that the elements camera and format appear twice, but on each occasion, they are prefixed by different namespaces (schem1 and schem2). The namespaces preserve metadata individuality.
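The reason the two camera elements do not collide becomes obvious when each prefixed name is expanded to its full name. Here is a minimal Ruby sketch, using only the two placeholder schema namespaces from the example; the method name expand is our own.

```ruby
# Prefix table for the two placeholder schemas declared in the example.
NAMESPACES = {
  "schem1" => "http://www.someplace.org/#",
  "schem2" => "http://www.someplace_else.com/#"
}.freeze

# Expand a prefixed name such as "schem1:camera" into a full URI.
def expand(qname)
  prefix, local = qname.split(":", 2)
  base = NAMESPACES.fetch(prefix) { raise ArgumentError, "unknown prefix: #{prefix}" }
  base + local
end

puts expand("schem1:camera")
puts expand("schem2:camera")
```

The two expanded names differ, so software that merges data from both schemas treats them as two distinct properties.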
Once you have made an instance of a class, you need to identify the instance uniquely. Failing this, the metadata-data pairs associated with a class instance have no meaning.
The typical unique subject identifier in an RDF triple is a URL specifying a unique web location for a data object.
Failing this, any unique identifier that permanently, unmistakably and uniquely links an object to a character string will suffice.
There are a number of registry services that provide identifiers for data objects in their domains. Examples are:
DOI, Digital Object Identifier.
PMID, PubMed identification number.
LSID, Life Science Identifier.
HL7 OID, Health Level 7 Object Identifier.
DICOM (Digital Imaging and Communications in Medicine) identifiers.
ISSN, International Standard Serial Number.
Social Security Numbers (for U.S. population).
NPI, National Provider Identifier, for physicians.
Clinical Trials Protocol Registration System.
Office for Human Research Protections FederalWide Assurance number.
Data Universal Numbering System (DUNS) number.
DNS, Domain Name System.
In the life sciences, the LSID number has achieved some popularity. The LSID resolution protocol has five parts:
1. Network Identifier (NID)
2. root DNS name of the issuing authority
3. namespace chosen by the issuing authority
4. object id, unique in that namespace and assigned locally
5. revision id for storing versioning information (optional)
LSIDs can be used as URNs that uniquely identify items in RDF statements.
urn:lsid:pdb.org:1AFT:1
This is the first version of the 1AFT protein in the Protein Data Bank.
urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434
This references a PubMed article.
urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2
This refers to the second version of an entry in GenBank.
HL7 also provides unique identifiers. An enterprise can obtain an OID at:
http://www.iana.org/cgi-bin/enterprise.pl
For example, the University of Michigan OID is:
1.3.6.1.4.1.250
The enterprise OID serves as a prefix for unique data objects within
an institution.
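Returning to the LSID, its five-part layout is easy to take apart with a script. The Ruby sketch below assumes that the parts are separated by colons, that the identifier begins with "urn", and that the optional revision is the sixth field; the method name parse_lsid and the field names are our own.

```ruby
# Split an LSID into its named parts. Assumes colon-separated fields,
# a leading "urn", and an optional sixth field holding the revision.
def parse_lsid(text)
  parts = text.split(":")
  well_formed = parts.length.between?(5, 6) &&
                parts[0].downcase == "urn" &&
                parts[1].downcase == "lsid"
  raise ArgumentError, "not an LSID: #{text}" unless well_formed
  {
    nid:       parts[1],
    authority: parts[2],
    namespace: parts[3],
    object:    parts[4],
    revision:  parts[5] # nil when no revision is given
  }
end

puts parse_lsid("urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2").inspect
puts parse_lsid("urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434").inspect
```

The PubMed identifier carries no revision, so its revision field comes back empty.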
Unique identifiers are used to uniquely specify the subject of a
triple (i.e., to specify what a triple is about).
Example:
<rdf:Description rdf:about="urn:lsid:ncbi.nlm.nih.gov:pubmed:8718907">
<dc:creator>Bill Moore</dc:creator>
</rdf:Description>
Here we have a unique data object specified with an LSID for a
PubMed citation. The number 8718907 is the unique PubMed citation
number. We add a property/value pair consisting of the Dublin Core
creator element and the data value, "Bill Moore". Once we have a
unique subject, we can declare it to be an instance of an appropriate class.
<rdf:Description rdf:about="urn:lsid:ncbi.nlm.nih.gov:pubmed:8718907">
<rdf:type rdf:resource="http://www.the_url_here.org/ldip_sch#Document"/>
<dc:creator>Bill Moore</dc:creator>
</rdf:Description>
Here we have a unique data object, instantiated as a member of the Document class. The Document class is defined in an RDF Schema referenced to a URL.
To summarize, the subject of a triple needs to be identified. The subject of a triple can be in the form of a URL (complete web address) or a URN (Uniform Resource Name).
URLs and URNs are both forms of URIs (Uniform Resource Identifiers).
You can create your own uniquely specified data object by appending a unique number to a URN prefix. For instance, a surgical pathology report, or a patient name, or an image file, can be the subject of a triple, if it is identified by the following:
urn:www.the_url_here.org:ldip:4Ib30fk6J3Y9gWpwMV27
Here, the prefix is "urn:www.the_url_here.org:ldip:". The alphanumeric suffix, "4Ib30fk6J3Y9gWpwMV27", is a 20-character random string that we have chosen for the object.
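An identifier of this kind can be produced with a few lines of Ruby. The sketch below uses Ruby's standard SecureRandom library; the constant URN_PREFIX and the method name new_identifier are our own, and the prefix is the placeholder used in this document, to be replaced with your own.

```ruby
require "securerandom"

# Placeholder prefix used in this document; substitute your own.
URN_PREFIX = "urn:www.the_url_here.org:ldip:"

# Append a random alphanumeric suffix (20 characters by default) to the prefix.
def new_identifier(prefix = URN_PREFIX, length = 20)
  prefix + SecureRandom.alphanumeric(length)
end

puts new_identifier
```

A random string of this length is very unlikely to repeat, but randomness alone is not a guarantee of uniqueness. A careful implementation would also record each issued identifier and check new ones against that record.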
There are many ways of providing identifiers for subjects. Once a subject is identified, triples containing the identifier can be merged from multiple RDF documents appearing anywhere on the internet.
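Merging is simple because a shared identifier is all that is needed to bring triples together. In the following Ruby sketch, two small lists of triples stand in for triples that have already been parsed out of two separate RDF documents; the constants DOC_A and DOC_B and the method name merge_triples are our own.

```ruby
# Triples as [subject, property, value] arrays, standing in for triples
# already parsed from two separate RDF documents.
DOC_A = [
  ["urn:lsid:ncbi.nlm.nih.gov:pubmed:8718907", "dc:creator", "Bill Moore"]
].freeze
DOC_B = [
  ["urn:lsid:ncbi.nlm.nih.gov:pubmed:8718907", "rdf:type",
   "http://www.the_url_here.org/ldip_sch#Document"],
  ["http://www.the_url_here.org/ldip/ldip2201.jpg", "dc:creator", "Bill Moore"]
].freeze

# Merge any number of triple lists, dropping duplicate triples and
# collecting the property/value pairs under each subject.
def merge_triples(*documents)
  merged = Hash.new { |hash, subject| hash[subject] = [] }
  documents.flatten(1).uniq.each do |subject, property, value|
    merged[subject] << [property, value]
  end
  merged
end

merge_triples(DOC_A, DOC_B).each do |subject, pairs|
  puts subject
  pairs.each { |property, value| puts "  #{property} = #{value}" }
end
```

The PubMed citation appears in both lists, so its creator and its class assignment end up together under one subject.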
RDF Schema document begins
<?xml version="1.0" encoding="utf-8"?>
<!--
This RDF Schema document is a bare-bones RDF Schema for image data.
It contains the following classes
Person
Event
Report
Specimen
Instrument
Image
Pathology
Dublin_core (the class of Dublin Core properties)
This document is a skeletal Schema, that can be used to build
more complex RDF Schemas. A full explanation of RDF, RDF
Schemas, and the role of RDF in image annotation, can be found at:
http://www.julesberman.info/spec2img.htm
The URL of this RDF Schema document is:
http://www.julesberman.info/img_sch.xml
This RDF Schema document is a public domain document.
This RDF Schema document is provided "as is", without warranty of any kind,
express or implied, including but not limited to the warranties
of merchantability, fitness for a particular purpose and
noninfringement. In no event shall the authors or copyright
holders be liable for any claim, damages or other liability,
whether in an action of contract, tort or otherwise, arising
from, out of or in connection with the software or the use or
other dealings in the document.
This document was created by Jules J. Berman and G. William Moore,
on September 10, 2007,
for presentation at the following workshop:
Implementing an RDF Schema for Pathology Images,
from the Association for Pathology Informatics.
APIII, Pittsburgh, PA
September 10, 2007
This file was validated by the W3C RDF Validation Service at:
http://www.w3.org/RDF/Validator/
on September 2, 2007
-->
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
>
<rdfs:Class rdf:ID="Person">
<rdfs:subClassOf
rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Event">
<rdfs:subClassOf
rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Report">
<rdfs:subClassOf
rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Specimen">
<rdfs:subClassOf
rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Instrument">
<rdfs:subClassOf
rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Image">
<rdfs:subClassOf
rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Pathology">
<rdfs:subClassOf
rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Dublin_core">
<rdfs:subClassOf
rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>
</rdf:RDF>
RDF Schema document ends
An RDF Schema is a dictionary of classes and properties. An RDF data specification consists of an RDF Schema that describes the classes and properties in its specific knowledge domain, plus an .xsd document that defines the datatypes associated with a property's range. An RDF document that specifies a data object contains instances of classes listed in an RDF Schema that are bound to data described by the properties listed in the RDF Schema.
RDF data specifications have a number of properties that data standards lack.
1. RDF data specifications are optional. The purpose of a data
specification is to provide an opportunity for people to describe
their data objects. Data standards, unlike data specifications, are
often imposed requirements.
2. RDF data specifications are self-describing and contain all the
information needed to interpret the contained information. Data
files that conform to standards are often inscrutable and not
intended to be read by humans.
3. RDF data specifications can be interrogated by general autonomous
software agents. Competently written general software agents can
parse and understand any RDF document. This is the underlying
premise of the Semantic Web. Files that conform to most data
standards can only be parsed by software specifically written to
accommodate the data standard.
4. RDF specifications can reduce the complexity of data. All data can
be described in RDF documents consisting of data triples. Data
standards have no unifying principle of data description. The
presence of competing standards, different versions of standards, and
proprietary extensions of standards has contributed to the undesirable
complexity of electronic information.
5. The data in a data specification can be distributed over multiple
RDF documents.
6. The assertions in RDF data specifications have meaning, and the
meaning is preserved when the assertion is extracted from the data
specification document. Data standards do not contain meaningful
assertions. There is no general way of extracting components of a
data standard, and building datasets composed of meaningful assertions.
7. A data specification can comply with multiple RDF schemas at once.
8. A data specification can be written without violating intellectual
property, or breaching patient confidentiality.
You can easily create an RDF Schema to specify the classes and properties relevant to a knowledge domain.
1. List the classes and properties that comprehensively describe a
knowledge domain.
2. Determine a class hierarchy. Every class on your list must be a
subclass of something else. The top level classes may be subclasses
of Class.
3. Everything that is not a class must be a property. Determine the
domain of each property (i.e., the classes to which the property applies)
and the range of each property (the kind of data value associated
with the property). The default value for a property's range is
"Literal".
4. Prepare an .xsd datatype declaration for each Property range that
requires a constrained data type.
5. Prepare a CDE list that includes the basic CDE annotations recommended
in ISO-11179, and the basic annotation needed to describe classes,
properties, and datatypes for property ranges. Use the generic LDIP
CDEs for classes and properties as a default.
6. Validate the CDE list (by careful proofreading or through a software
utility).
7. Convert CDEs to an RDF schema using a software utility
or "by hand" if the CDE list is short.
8. Validate the RDF schema.
9. Vet the RDF schema through the intended user community.
10. Distribute the RDF schema to the public.
11. Repeat steps 1-8 ad libitum, providing a new version number to each
successive specification.
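The conversion in step 7 can be sketched in Ruby. This is not the LDIP conversion utility. It assumes a deliberately reduced list in which each entry holds only a class name and the URI of its parent class, and the names CLASS_LIST, class_declaration and schema_document are our own. A full CDE list would carry definitions and property annotations as well.

```ruby
# Hypothetical, reduced class list: [class name, URI of its parent class].
# "#Report" is a relative reference to a class declared in the same document.
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"
CLASS_LIST = [
  ["Person", RDFS_CLASS],
  ["Report", RDFS_CLASS],
  ["Surgical_Pathology_Report", "#Report"]
].freeze

# Emit one rdfs:Class declaration.
def class_declaration(name, parent_uri)
  <<~XML
    <rdfs:Class rdf:ID="#{name}">
      <rdfs:subClassOf rdf:resource="#{parent_uri}"/>
    </rdfs:Class>
  XML
end

# Wrap all class declarations in an rdf:RDF element with the two
# namespaces that the declarations use.
def schema_document(class_list)
  header = <<~XML
    <?xml version="1.0" encoding="utf-8"?>
    <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  XML
  body = class_list.map { |name, parent| class_declaration(name, parent) }.join
  header + body + "</rdf:RDF>\n"
end

puts schema_document(CLASS_LIST)
```

The output of a script like this should still be passed through an RDF validator (step 8) before it is shown to anyone else.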
General instructions for specifying biomedical data using RDF Schemas
You can easily create RDF documents that fully describe a data object and that comply with one or more RDF Schemas written for a knowledge domain.
1. Look at your own available data for the data object. List your
data as triples sequenced as: <subject, metadata, data>.
2. Create an RDF header that includes the URL of each of the RDF
Schemas that you will use in the document. Be sure to include the
Dublin Core RDF Schema. This document includes the elements that
describe the file (i.e., creator, date of file, type of file, etc.),
and is used by librarians to index your document.
3. Use the RDF Schema(s) appropriate for the knowledge domain of your
data object. Determine which of your listed triples have subjects
that are instances of classes in the RDF Schema, and metadata consistent
with class-appropriate properties. Type these subjects as class
instances. Check that the data values conform to any data value
constraints listed in the .xsd datatype file associated with the RDF
Schema.
4. If you have common data elements (classes or properties) that are
not part of any public RDF Schema, create your own RDF Schema to
accommodate these elements, and use the RDF Schema as the resource
for those elements wherever they appear in your RDF specification
document.
5. Contact the curator of the public RDF Schema that is appropriate
for the elements you created, and ask if your RDF Schema can be added
to the public RDF Schema.
6. Validate that the document is well-formed XML, syntactically correct
RDF and that all triples conform to class-property-datatype
descriptions from the RDF Schema.
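Steps 1 and 2 can be sketched in Ruby as follows. The sketch lists two triples for one image and writes them out as a small RDF document. It uses only Dublin Core properties, so only the RDF and Dublin Core namespaces are declared. The constant IMAGE_TRIPLES and the method names xml_escape and rdf_document are our own, and the data values are made up for illustration.

```ruby
# Hypothetical triples for one image, listed as [subject, property, value].
IMAGE_TRIPLES = [
  ["http://www.the_url_here.org/ldip/ldip2201.jpg", "dc:creator", "Bill Moore"],
  ["http://www.the_url_here.org/ldip/ldip2201.jpg", "dc:format", "jpeg"]
].freeze

# Escape the characters that are special in XML text and attribute values.
def xml_escape(text)
  text.to_s.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;").gsub('"', "&quot;")
end

# Write the triples as an RDF document, one rdf:Description per subject.
def rdf_document(triples)
  lines = [
    %(<?xml version="1.0" encoding="utf-8"?>),
    %(<rdf:RDF),
    %(xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"),
    %(xmlns:dc="http://purl.org/dc/elements/1.1/">)
  ]
  triples.group_by(&:first).each do |subject, group|
    lines << %(<rdf:Description rdf:about="#{xml_escape(subject)}">)
    group.each do |_subject, property, value|
      lines << %(<#{property}>#{xml_escape(value)}</#{property}>)
    end
    lines << %(</rdf:Description>)
  end
  lines << %(</rdf:RDF>)
  lines.join("\n")
end

puts rdf_document(IMAGE_TRIPLES)
```

Properties drawn from other RDF Schemas would need their own xmlns declarations added to the header, as described in step 2.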
The metadata (essentially the tags corresponding to Properties) and the Classes in your RDF documents will be selected from pre-existing RDF Schemas (formal vocabularies) found as public Web documents with unique URLs.
Here are a few:
A Top-Level Ontology
http://www-sop.inria.fr/acacia/personnel/phmartin/RDF/phOntology.html
Gene Ontology (GO)
http://139.91.183.30:9090/RDF/VRP/Examples/go.rdf
MGED Ontology
http://139.91.183.30:9090/RDF/VRP/Examples/mgedontology.rdfs
As noted above, when you need Classes and Properties that are not available in public RDF Schemas, you can just create your own RDF Schema, put it on the web, and make it available to your own (and anyone else's) RDF documents.
Software validation is one of the more challenging areas in computer science. Dozens of books, hundreds of manuscripts, and many thousands of hours of programmer time have been devoted to this demanding subject. Fortunately, validation in the realm of data specifications can be quite easy.
Basically, data specifications consist of lists of triples. Triples are valid when the following conditions hold:
1. The property in a triple is suitable for the subject.
2. The value of the triple is suitable for the property.
Validating an RDF document (in the context of this manuscript, a document that specifies a data object or objects) comes down to this:
1. Checking that the document is well-formed XML.
2. Checking that the document is well-formed RDF.
3. Checking that the triples are valid.
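The third check, triple validity, can be sketched in Ruby. Everything in the sketch is an assumption made for illustration: the two properties, their domains, and the patterns that stand in for .xsd datatypes are hypothetical and are not taken from the LDIP schema. The sketch also ignores subclasses, so a subject must belong to exactly the class named as the property's domain.

```ruby
# Hypothetical schema fragment: each property lists its domain (a class
# name) and a pattern standing in for the datatype of its range.
SCHEMA_PROPERTIES = {
  "magnification" => { domain: "Image",  range: /\A\d+(\.\d+)?\z/ },
  "report_date"   => { domain: "Report", range: /\A\d{4}-\d{2}-\d{2}\z/ }
}.freeze

# Hypothetical class assignments, as would be established by rdf:type.
SUBJECT_CLASS = {
  "http://www.the_url_here.org/ldip/ldip2103.jpg" => "Image"
}.freeze

# A triple is valid when the property suits the subject's class and the
# value suits the property's range.
def valid_triple?(subject, property, value)
  rule = SCHEMA_PROPERTIES[property]
  return false if rule.nil?                                   # property not in schema
  return false unless SUBJECT_CLASS[subject] == rule[:domain] # condition 1
  rule[:range].match?(value.to_s)                             # condition 2
end

image = "http://www.the_url_here.org/ldip/ldip2103.jpg"
puts valid_triple?(image, "magnification", "400")
puts valid_triple?(image, "magnification", "high")
puts valid_triple?(image, "report_date", "2007-09-10")
```

Only the first triple passes. The second has a value that does not fit the range, and the third applies a Report property to an Image.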
The W3C has a web site that will validate the structure of RDF documents. It is available at:
http://www.w3.org/RDF/Validator/
OWL and DAML are extensions of RDF. They extend RDF in a prescribed manner, through well-designed RDF Schemas that build on W3C's top-level RDF Schema. Neither OWL nor DAML is discussed here, but we remind you that all RDF documents may benefit from classes and properties available in publicly available RDF Schemas. We urge readers to visit these URLs for further information:
http://www.w3.org/TR/owl-ref/
http://www.daml.org/
In the meta-reality of informatics, perfect things may be implemented imperfectly. The issues that concern us most regarding RDF are as follows:
1. Despite the hype, RDF is not widely used. The web is virtually all
HTML, with a minor contribution in XML and a negligible contribution
from RDF. Many internet visionaries predict a bright future for RDF.
Now is a good time to master this simple but fascinating model of
reality.
2. RDF is growing in complexity. The many extensions to RDF (including
OWL and DAML) have made it difficult to master every aspect of the
subject. In this manuscript, we have included only selected aspects
of RDF that we consider essential for productivity.
3. RDF, and ontologies written for RDF, do not restrict multi-class
inheritance. In our opinion, this is a huge problem that can lead
to hopelessly complex systems. We strongly recommend extending
Properties to multiple classes, rather than extending multiple
classes to data objects.
4. Though many magazine articles have been written extolling the
virtues of RDF and of the Semantic Web, most of these articles are
pitifully superficial. Not surprisingly, no two persons ever seem
to have the same "take" on the subject of RDF data specifications.
The best literature seems to come from the W3C, but these Web
recommendations are technical reports, and do not focus on the needs
of biomedical informaticians. In this manuscript, we have tried
to provide one interpretation of the theory and implementation of
RDF, as it relates to specifying data.
1. Meaning is achieved by binding a metadata-data pair to a specified subject, into a so-called triple. Example: Jules J. Berman (subject), favorite food (metadata), pizza (data).
2. The subject of a triple needs to be identified as a unique data object. The metadata need to be defined, and the data need to have a specified structure. These are achieved with identifiers that uniquely specify class instances; with RDF Schemas that assign classes to subjects and assign properties to metadata; and with .xsd datatypes that impose structure on data values.
3. RDF documents consist of triples. RDF documents begin with a declaration of the RDF namespace in which the syntactical elements of RDF are defined. When the RDF document creates instances of classes defined in one or more external RDF Schema documents, the namespaces of the RDF Schema documents are also listed at the top of the RDF document.
4. Triples can be collected from heterogeneous RDF datasets, and the data pertaining to a specified subject can easily be merged by RDF parsers. An RDF parser is a general utility that will work equally well for any RDF document, because all RDF documents conform to the W3C's RDF syntax recommendation.
5. A specification is a document that describes a data object in a manner that can be understood by humans or by computers. An RDF Schema is a dictionary of classes and properties that can be used to completely describe a data object in its knowledge domain. There may be many different ways of specifying an object in an RDF document, but if the specification is in the form of an RDF document that uses RDF Schemas to create class instances and define metadata and uses XSD datatypes to constrain the value of data, then the specification will be understood by competent general software agents written to interrogate RDF documents.
6. Data specifications have many advantages over data standards.
By writing domain-specific RDF Schemas (e.g., the LDIP image specification), we can reduce our dependence on data standards and enhance our ability to integrate data collected from heterogeneous datasets.
This document was written as the syllabus for a workshop titled, "Implementing an RDF Schema for Pathology Images, from the Association for Pathology Informatics," presented at the APIII meeting in Pittsburgh, PA, Monday, September 10, 2007, 7:30 am - 8:30 am.
The workshop summary, as it appears in the APIII 2007 brochure, is:
The Laboratory Digital Imaging Project (LDIP) was a three year effort (2004-2007) under the direction and sponsorship of the Association for Pathology Informatics (API). The primary goal of LDIP was to develop a free, open source image specification that would convey clinical, histologic, and technical descriptors, along with image files. The purpose of the specification is to facilitate data sharing and to promote the creation of richly annotated pathology images that can be archived, analyzed, or published and that support data integration across heterogeneous biomedical domains. A specification was created in RDF Schema, the technology created by the W3C (World-Wide Web Consortium), as the logical scaffold for the Semantic Web. By annotating images in API's RDF Schema, RDF documents can be created that include objects and metadata from any other RDF Schema or from publicly available biomedical ontologies and can be ported to and from other data standards, including DICOM and OME (Open Microscopy Environment). This workshop will explain RDF, RDF Schemas, namespaces, object classes and data properties, and will describe methods for implementing API's RDF Schema for pathology images. Examples will be provided in which pathology images are annotated with several different RDF Schemas containing classes and properties relevant to pathology data sources. After this workshop, API's RDF Schema will be released as an online public document that can be linked from any RDF document.
The PowerPoint presentation for the workshop is available at:
http://www.julesberman.info/jbimage.ppt
The subject matter in this manuscript is covered in books that Jules Berman has written, and this document contains links to the Publisher's web site where these books are advertised. Other than linking to these books from this manuscript, I have no competing interests.
This document is a work of literature and has no purpose other than as a literary work.
The text of the body of the document is copyrighted to Jules J. Berman, in 2007 and distributed under the GNU Free Documentation License.
http://www.gnu.org/licenses/fdl.txt
The Perl scripts and Ruby scripts were written by Jules J. Berman and are distributed under the GNU General Public License
http://www.gnu.org/copyleft/gpl.html
A disclaimer applies for all of the included scripts.
The software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.
The Imaging RDF Schema is public domain.
A pdf version of this manuscript is available at: http://www.julesberman.info/rdfimage.pdf
An html version of this manuscript is available at: http://www.julesberman.info/spec2img.htm
The leaf image, leaf.jpg, was obtained from Wikipedia and has the following Rights annotation "This image from PD Photo.org has been released into the public domain by its author and copyright holder, Jon Sullivan."
We thank Ulysses Balis, co-chair of the Laboratory Digital Imaging Program, for his many useful suggestions, and we thank Robert Leif, for his many helpful criticisms. We thank Kemp Watson for setting up a wiki to support LDIP committee activities. Most importantly, we thank all of the former members of LDIP for their valuable discussions.
Although this document is distributed freely, all public uses of the document or any part of the document must be cited. When citing this document, please use the following text.
Berman JJ, Moore GW. Implementing an RDF Schema for Pathology Images, from the Association for Pathology Informatics. APIII, 2007.
Jules Berman, Ph.D., M.D., has studied cancer for the past 36 years. After receiving two bachelor of science degrees (mathematics and earth sciences from MIT), he entered the graduate program in pathology at Temple University, where he began his thesis work within the Fels Cancer Research Institute. He spent the final year of his graduate studies at the Naylor Dana Institute of the American Health Foundation in Valhalla, New York, before beginning his post-doctoral studies in the Perinatal Carcinogenesis section of the Laboratory of Experimental Pathology at the U.S. National Cancer Institute. He then earned a medical degree from the University of Miami, followed by a pathology residency at George Washington University Medical Center in Washington, D.C. He became Board Certified in Anatomic Pathology and in Cytopathology, and served as the chief of Anatomic Pathology, Surgical Pathology and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland. While working at the Baltimore VA Medical Center, he held appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he became the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the U.S. National Cancer Institute. In 2006, he became President of the Association for Pathology Informatics. Over the course of his career, his name has appeared as a co-author on hundreds of scientific contributions, and he has written, as first author, more than 100 publications. Today, Dr. Berman is a full-time writer.
He maintains a personal web site and a personal blog.
G. William Moore, M.D., Ph.D. (Bill Moore) earned a Bachelor's degree in Cell Biology at the University of Michigan at Ann Arbor, a PhD in Biomathematics from North Carolina State University at Raleigh, NC, and an MD from Wayne State University in Detroit. He did his pathology residency at Johns Hopkins. He continued at Hopkins after his residency, and he eventually wound up at the Baltimore VA Hospital, where he now holds joint appointments at Hopkins and at University of Maryland Hospital. During his career, he's been a co-author on over 180 papers, most of which relate, in one way or another, to computational pathology and pathology informatics. His earliest work was in the field of computational evolutionary biology, and he was one of the first people to write programs and prove theorems for automated cladistic classification. His early work also involved developing statistical techniques for analyzing medical data, and his "token swap" paper is probably his best contribution in this area. [Moore GW, Hutchins GM, Miller RE. Token swap test of significance for serial medical data bases. Am J Med. 1986 Feb;80(2):182-190.] He has been a long-time advocate for using pathology data in research. Along with Grover Hutchins, M.D., he got an NLM grant and transferred about 50,000 Hopkins autopsies into a database. These autopsy records and associated blocks were used for over 1,000 research projects at Hopkins. Bill worked on a very early image analysis program, all written in Visual Basic. Bill did most of the programming on that project. [Berman JJ, Moore GW. Image analysis software for the detection of preneoplastic and early neoplastic lesions. Cancer Lett. 1994 Mar 15;77(2-3):103-109.] In the past 20 years, Bill has concentrated on the two related fields of indexing and machine translation. He has contributed many papers to the field, and has shown the utility of MeSH and UMLS as primary indexing dictionaries.
His barrier word method for extracting candidate terms from text (now better known as the "stop" word method) is currently a widely used technique in informatics science. Bill has been an API member since its inception.
Dr. Moore's Curriculum Vitae:
http://www.gwmoore.org/gwmcv.htm