Tissue Microarray Workshop
Sponsored by API, APIII
and
The National Cancer Institute
Held October 6, 2001
Marriott City Center, Pittsburgh
Summary
Acknowledgements
First, we would like to thank the individuals who contributed to the workshop. This includes the
speakers: Olli Kallioniemi (National Human Genome Research Institute), Chris Chute (Mayo
Clinic), Paul Spellman (UC Berkeley) and Richard Lieberman (U of Michigan). This also
includes those members who contributed data and pilot data elements: John Gilbertson
(University of Pittsburgh), David Romer (Ohio State University) and Leona Ayers (Ohio
State University). We would also like to thank Frederico Monzon (Jefferson University Medical
School), who contributed a TMA data structure prototype as a meeting abstract. We would also
like to thank Mike Becich (U of Pittsburgh), Bruce Friedman (U of Michigan), Sandy Wolman
and Wally Hendricks (Cleveland Clinic) who encouraged the workshop and who devoted their
resources to making the workshop possible.
Morning Lectures
The workshop began with a series of lectures. Dr. Olli Kallioniemi, one of the inventors of the
tissue microarray, provided an overview of tissue microarray technology and its
applications. Dr. Chris Chute, one of the leading figures in clinical data standards, reviewed the
history of standards development and utilization in epidemiology. Dr. Paul Spellman, one of the
group leaders of MGED, described his ongoing efforts to provide data standards for gene
microarrays (a technology with obvious relationships to the tissue microarrays). Dr. Richard
Lieberman provided a working model for managing tissue microarray data as implemented at the
University of Michigan.
Afternoon Workshop
The afternoon workshop was led by Dr. Mary Edgerton. Two documents were distributed.
One was the general guidelines for a tissue microarray standard discussed at the first workshop
(held May 30, 2001, Ann Arbor, MI). In brief, those attending the May workshop recommended
that there be a sustained effort by the TMA community to create a public domain
exchange standard for tissue microarray data. The primary working vehicle for the effort will be
the tma_users listserver:
The May 30, 2001 workshop summary is attached (wkshpmi.txt).
Drafts of the Standard (lists of data elements, DTDs, sample XML files) will be circulated to the
listserver for open community comment and contributions. Once or twice a year, there will be a
TMA workshop to which all the TMA users will be invited. Eventually, a community standard
will emerge, and the TMA community will be in charge of updates or modifications to the
standards.
The second document was the working draft of the suggested TMA elements.
The group reviewed the elements and made comments regarding elements that should be
withdrawn, added, or modified. The group had an opportunity to better define many of the
tagged elements, for the purpose of achieving consistent usage.
Dr. Edgerton has updated the file of data elements based on the workshop discussions, and this
file is attached (Proposed common data elements.doc).
Some of the most difficult issues discussed relate to how the standard will be implemented. In
our opinion, the most frequent point of confusion related to the distinction between a data
exchange standard and a data format.
A data exchange standard is a common way of expressing information that can be mapped from
any TMA data format into any other TMA data format. A good example of a data exchange standard
is RTF (Rich Text Format), which is used to exchange word processor files between incompatible
proprietary applications. WordPerfect and MS-Word are commercial applications that
use their own proprietary algorithms to embed text, images, and detailed formatting information
in a non-ASCII file format. MS-Word cannot "open" WordPerfect files and WordPerfect cannot
"open" MS-Word files. But both applications can translate their files into RTF files, which
are simple ASCII files with annotations that capture all the formatting information from the
original proprietary files. When you want to send a colleague a file that any word processor
can "open", use your application's "save as" option to convert the file to
an RTF file, and send the RTF version. Your colleague can then "open" the RTF
file with any word processor, and the RTF file will emerge as a fully formatted document in
that software's own format.
The RTF format is a "middle" or translational format. By intention, RTF format is not used as
the native format of any proprietary word processor. However, if a software designer wanted to
design a word processor that used RTF natively, that would be possible.
We are designing the TMA data exchange standard to be an open format that anyone can use as a
native data structure for TMA data or as a "middle" data structure for exchanging data between
otherwise incompatible systems. Our TMA standard will work for TMAs the same way that RTF
works for word processing.
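The "middle" format idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two lab formats, their field names, and the exchange element names (block_id, core_count) are hypothetical and are not taken from the draft standard.

```python
# Hypothetical field mappings. Each local format only needs to know how to
# translate to and from the shared exchange vocabulary.
LAB_A_TO_EXCHANGE = {"BlockNo": "block_id", "Cores": "core_count"}
LAB_B_TO_EXCHANGE = {"tma_block": "block_id", "n_cores": "core_count"}


def to_exchange(record, mapping):
    """Rename one lab's local field names to exchange-standard names."""
    return {mapping[key]: value for key, value in record.items()}


def from_exchange(record, mapping):
    """Rename exchange-standard names back to one lab's local names."""
    reverse = {exchange: local for local, exchange in mapping.items()}
    return {reverse[key]: value for key, value in record.items()}


# Lab A exports to the exchange format; Lab B imports from it.
lab_a_record = {"BlockNo": "B-17", "Cores": 120}
exchange_record = to_exchange(lab_a_record, LAB_A_TO_EXCHANGE)
lab_b_record = from_exchange(exchange_record, LAB_B_TO_EXCHANGE)
print(exchange_record)
print(lab_b_record)
```

With this arrangement, each system needs a single mapping to the exchange format rather than a separate converter for every other system it might exchange data with.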
This means that the data elements must be self-describing, so that anyone or any machine that
opens the file can recreate in their own system the intended meaning of the data. Data that
describes data is called metadata, and much of the effort of the project will be spent on choosing
and defining the metadata for the TMA file. A well-described metadata dictionary will allow
developers of proprietary TMA data systems to successfully map their data elements to the data
exchange standard.
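As a rough sketch of what "self-describing" could mean in practice, the Python below attaches metadata from a small dictionary to a data element when it is written as XML. The element name core_diameter, its definition, and the unit and type attributes are hypothetical choices made for illustration; the actual elements and DTD are still being drafted.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata dictionary: one entry per data element, giving the
# information a receiving system needs to interpret the value.
METADATA = {
    "core_diameter": {
        "definition": "Diameter of a single tissue core",
        "unit": "mm",
        "type": "float",
    },
}


def describe(name, value):
    """Build an XML element that carries its own unit and type."""
    meta = METADATA[name]
    element = ET.Element(name, {"unit": meta["unit"], "type": meta["type"]})
    element.text = str(value)
    return element


xml_text = ET.tostring(describe("core_diameter", 0.6), encoding="unicode")
print(xml_text)

# A receiving system can recover the value and its meaning from the file alone.
parsed = ET.fromstring(xml_text)
print(parsed.tag, float(parsed.text), parsed.get("unit"))
```

A developer of a proprietary system would map each local field to an entry in such a dictionary once, and the mapping would then apply to every file exchanged.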
Another issue that arose related to the re-use of certain groups of TMA data elements. For
instance, a single TMA paraffin block may be used to prepare more than 100 TMA slides. If 100
slides prepared from a single block were used in 100 TMA experiments, there would be 100
different TMA files created. However, all of these files would share lots of identical information
related to the block from which they all were cut. Also, many (maybe all) of the slides may be
used by a single laboratory, in which case all of the files would contain identical information
describing the laboratory that owned the TMA data file. Finally, the pathologic or clinical data
contained in the TMA files may all be abstracted from a single database source that contains data
annotating a banked tissue specimen (i.e. tissue repository data). In other words, a single TMA
may lead to an enormous amount of redundant information.
Redundant information creates many problems in addition to the obvious problem of increasing
file size. "To err is human", and the most important problem with data redundancy relates to the
difficulty of "correcting" errors that may occur in redundant files. It requires tracking down
every file that contains the redundant information (almost always an impossible task). Typically,
you end up with some files that are "corrected" and others that aren't, giving rise to the problem
of not knowing which data is "really" different or just not corrected.
How do you deal with this problem? One way is to use namespace pointers to data that can
reside outside the file. A good example of a namespace pointer is a publication reference or a
web address (URL). When you cite a publication in a paper, you simply provide a pointer to its
whereabouts (authors, title, journal name, volume, pages, date), and the reader can find and
extract any necessary information from the original data source. The same can be done in any
dataset. You simply point to the location of a data source and provide the precise information
that will bring you to the data object of particular interest. Using this approach, data belonging to
a TMA file can be retrieved as needed from another dataset (tissue bank, gene array, surgical
pathology report) provided you have a good pointer to good data. The term
namespace is often applied to this kind of pointer because the pointer name must have sufficient
information to distinguish the desired data object from all other data objects in the universe,
including data objects that may have the same name (like John Smith, Lassie, and
Tissue_Diagnosis). This may enable us to integrate TMA files with other clinical and/or
bioinformatics files.
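The pointer approach described above might look like the following Python sketch. Everything in it is an assumption made for illustration: the source names (tissue_bank, surgical_pathology), the specimen identifier, and the placeholder values are hypothetical, and a real pointer would more likely be a URL or similar global identifier.

```python
# Hypothetical external data sources, each acting as its own namespace.
# The same element name (Tissue_Diagnosis) can exist in more than one source.
SOURCES = {
    "tissue_bank": {"S-001": {"Tissue_Diagnosis": "original entry"}},
    "surgical_pathology": {"S-001": {"Tissue_Diagnosis": "report entry"}},
}


def resolve(pointer):
    """Follow a (namespace, object id, element name) pointer to its value."""
    namespace, object_id, element = pointer
    return SOURCES[namespace][object_id][element]


# Three TMA files from slides cut from the same block. Each stores a pointer
# to the diagnosis instead of its own copy of the diagnosis.
pointer = ("tissue_bank", "S-001", "Tissue_Diagnosis")
tma_files = [{"slide": n, "diagnosis_ref": pointer} for n in (1, 2, 3)]

# A correction is made once, at the source.
SOURCES["tissue_bank"]["S-001"]["Tissue_Diagnosis"] = "corrected entry"

values = [resolve(tma_file["diagnosis_ref"]) for tma_file in tma_files]
print(values)
```

Because the files hold only the pointer, a single correction at the source is seen by every file, and the namespace part of the pointer keeps the tissue bank's Tissue_Diagnosis distinct from the one in the surgical pathology source.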
The third problem that came up at the workshop relates to two general approaches to data
structure: richness versus parsimony. We can design a TMA structure that includes every
imaginable data element that anyone might possibly want to include in a TMA data file. As a
group, we can determine which of these elements should be required in a TMA file and which
should be optional.
We could also take a parsimonious approach and try to determine the minimal set of data
elements that would be needed to produce a TMA data file. Vendors and labs that need
additional data elements can just add them as they like, just so long as their file maintains the
general organization of the standard TMA and includes the minimal set of data elements.
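A minimal sketch of how the parsimonious approach could be checked in Python follows. The three required element names are hypothetical stand-ins, since the group has not yet agreed on a minimal set, and scanner_model is an invented example of a vendor addition.

```python
# Hypothetical minimal set of required data elements (not from the draft).
REQUIRED_ELEMENTS = {"block_id", "core_position", "tissue_diagnosis"}


def missing_required(record):
    """Return the required element names that a record lacks, sorted."""
    return sorted(REQUIRED_ELEMENTS - set(record))


def extra_elements(record):
    """Return element names a vendor or lab added beyond the minimal set."""
    return sorted(set(record) - REQUIRED_ELEMENTS)


vendor_record = {
    "block_id": "B-17",
    "core_position": "A1",
    "tissue_diagnosis": "placeholder diagnosis",
    "scanner_model": "placeholder model",  # vendor-specific addition
}
incomplete_record = {"block_id": "B-17"}

print(missing_required(vendor_record))      # nothing missing, so the file conforms
print(extra_elements(vendor_record))        # additions are allowed, just reported
print(missing_required(incomplete_record))  # names the elements still needed
```

Under the rich approach, the same check would still apply to the required elements, but the optional elements would be drawn from an agreed list rather than added freely.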
The group did not come to any conclusion regarding this fundamental issue, and the issue will be
left to future meetings.
Current Status of Tissue Microarray Data Exchange Standard
Jules Berman is finishing a draft DTD. It will include many of the structural innovations
discussed in the workshop summary. He will submit it to Mary for initial comment and revision,
and then Jules and Mary will submit it to the group. Hopefully, it can be made public in the next
few days.
Future Actions
The next scheduled TMA workshop will be held in conjunction with the University of
Michigan's AIMCL, summer 2002.
Jules J. Berman, Ph.D., M.D.
Program Director, Pathology Informatics
Cancer Diagnosis Program, DCTD, NCI, NIH
EPN - Room 6028
6130 Executive Blvd.
Rockville, MD 20892
email: bermanj@mail.nih.gov
voice: 301-496-7147
fax: 301-402-7819
Mary E. Edgerton, M.D., Ph.D.
Director of Pathology Informatics
Vanderbilt University Medical Center
Department of Pathology
Nashville, Tennessee
Mary.Edgerton@mcmail.vanderbilt.edu