Tissue Microarray Workshop

Sponsored by API, APIII

and

The National Cancer Institute

Held October 6, 2001

Marriott City Center, Pittsburgh

Summary



Acknowledgements

First, we would like to thank the individuals who contributed to the workshop. This includes the speakers: Olli Kallioniemi (National Human Genome Research Institute), Chris Chute (Mayo Clinic), Paul Spellman (UC Berkeley), and Richard Lieberman (University of Michigan). It also includes those members who contributed data and pilot data elements: John Gilbertson (University of Pittsburgh), David Romer (Ohio State University), and Leona Ayers (Ohio State University). We would also like to thank Frederico Monzon (Jefferson University Medical School), who contributed a TMA data structure prototype as a meeting abstract. Finally, we thank Mike Becich (University of Pittsburgh), Bruce Friedman (University of Michigan), Sandy Wolman, and Wally Hendricks (Cleveland Clinic), who encouraged the workshop and devoted their resources to making it possible.

Morning Lectures

The workshop began with a series of lectures. Dr. Olli Kallioniemi, one of the inventors of the tissue microarray, provided an overview of tissue microarray technology and its applications. Dr. Chris Chute, one of the leading figures in clinical data standards, reviewed the history of standards development and utilization in epidemiology. Dr. Paul Spellman, one of the group leaders of MGED, described his ongoing efforts to provide data standards for gene microarrays (a technology with obvious relationships to tissue microarrays). Dr. Richard Lieberman provided a working model for managing tissue microarray data as implemented at the University of Michigan.

Afternoon Workshop

The afternoon workshop was led by Dr. Mary Edgerton. Two documents were distributed.

One was the general guidelines for a tissue microarray standard discussed at the first workshop (held May 30, 2001, Ann Arbor, MI). In brief, those attending the May workshop recommended that there be a sustained effort from the TMA community to create a public domain exchange standard for tissue microarray data. The primary working vehicle for the effort will be the tma_users listserver:

tmausers@list.vanderbilt.edu

The May 30, 2001 workshop summary is attached (wkshpmi.txt).

Drafts of the Standard (lists of data elements, DTDs, sample XML files) will be circulated to the listserver for open community comment and contributions. Once or twice a year, there will be a TMA workshop to which all the TMA users will be invited. Eventually, a community standard will emerge, and the TMA community will be in charge of updates or modifications to the standards.

The second document was the working draft of the suggested TMA elements.

The group reviewed the elements and made comments regarding elements that should be withdrawn, added, or modified. The group had an opportunity to better define many of the tagged elements, for the purpose of achieving consistent usage.

Dr. Edgerton has updated the file of data elements based on the workshop discussions, and this file is attached (Proposed common data elements.doc).

Some of the most difficult issues discussed relate to how the standard will be implemented. In our opinion, the most frequent point of confusion related to the distinction between a data exchange standard and a data format.

A data exchange standard is a common way of expressing information that can be mapped from any TMA data format into any other TMA data format. A good example of a data exchange standard is RTF (Rich Text Format), used to exchange word processor files between incompatible proprietary applications. WordPerfect and MS-Word are commercial applications that use their own proprietary algorithms to embed text, images, and detailed formatting information in a non-ASCII file format. MS-Word cannot "open" WordPerfect files and WordPerfect cannot open MS-Word files. But both applications can translate their files into RTF files, which are simple ASCII files with annotations that capture all the formatting information from the original proprietary files. When you want to send a colleague a file that any word processor can "open", you use your application's "save as" option to convert the file to RTF and send the RTF version. Your colleague can then "open" the RTF file with any proprietary software, and it will emerge as a fully formatted document in that software's native format.

The RTF format is a "middle", or translational, format. By intention, RTF is not used as the native format of any proprietary word processor. However, if a software designer wanted to design a word processor that used RTF natively, that would be possible.

We are designing the TMA data exchange standard to be an open format that anyone can use as a native data structure for TMA data or as a "middle" data structure for exchanging data between otherwise incompatible systems. Our TMA standard will work for TMAs the same way that RTF works for word processing.

This means that the data elements must be self-describing, so that anyone, or any machine, that opens the file can recreate the intended meaning of the data in their own system. Data that describes data is called metadata, and much of the effort of the project will be spent on choosing and defining the metadata for the TMA file. A well-described metadata dictionary will allow developers of proprietary TMA data systems to successfully map their data elements to the data exchange standard.
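
As a purely illustrative sketch (the element names and values below are hypothetical, not part of any agreed draft), a self-describing XML record might look like the following, with each tagged element carrying its meaning in its name and attributes:

    <?xml version="1.0"?>
    <!-- hypothetical TMA fragment; element names are placeholders for illustration only -->
    <tma_block>
      <block_identifier>BLOCK-2001-001</block_identifier>
      <block_construction_date>2001-09-15</block_construction_date>
      <core_diameter units="mm">0.6</core_diameter>
      <core row="1" column="1">
        <tissue_diagnosis>infiltrating ductal carcinoma</tissue_diagnosis>
      </core>
    </tma_block>

A receiving system that has never seen the sending laboratory's database can still map each element to its own fields, because the metadata dictionary defines what elements such as block_identifier and core_diameter are intended to mean.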

Another issue that arose related to the re-use of certain groups of TMA data elements. For instance, a single TMA paraffin block may be used to prepare more than 100 TMA slides. If 100 slides prepared from a single block were used in 100 TMA experiments, 100 different TMA files would be created. However, all of these files would share a great deal of identical information about the block from which they were all cut. Also, many (perhaps all) of the slides may be used by a single laboratory, in which case all of the files would contain identical information describing the laboratory that owns the TMA data file. Finally, the pathologic or clinical data contained in the TMA files may all be abstracted from a single database source that annotates a banked tissue specimen (i.e., tissue repository data). In other words, a single TMA may lead to an enormous amount of redundant information.
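
To make the redundancy concrete, consider a hypothetical fragment (element names invented for illustration) that would be copied verbatim into every one of the 100 slide files cut from the same block:

    <!-- identical block and laboratory description repeated in slide file 1, 2, ... 100 -->
    <tma_slide>
      <slide_identifier>SLIDE-042</slide_identifier>
      <block_description>
        <block_identifier>BLOCK-2001-001</block_identifier>
        <block_construction_date>2001-09-15</block_construction_date>
        <laboratory_name>Example TMA Laboratory</laboratory_name>
      </block_description>
    </tma_slide>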

Redundant information creates many problems beyond the obvious one of increased file size. "To err is human", and the most important problem with data redundancy is the difficulty of correcting errors that occur in redundant files. Doing so requires tracking down every file that contains the redundant information (almost always an impossible task). Typically, you end up with some files that are corrected and others that are not, giving rise to the problem of not knowing which data are "really" different and which are simply uncorrected.

How do you deal with this problem? One way is to use namespace pointers to data that can reside outside the file. A good example of a namespace pointer is a publication reference or a web address (URL). When you cite a publication in a paper, you simply provide a pointer to its whereabouts (authors, title, journal name, volume, pages, date), and the reader can find and extract any necessary information from the original data source. The same can be done in any dataset. You simply point to the location of a data source and provide the precise information that will bring you to the data object of particular interest. Using this approach, data belonging to a TMA file can be retrieved as needed from another dataset (tissue bank, gene array, surgical pathology report), provided you have a good pointer to good data. The reason that the term namespace is often applied to this kind of pointer is that the pointer name must contain sufficient information to distinguish the desired data object from all other data objects in the universe, including data objects that may have the same name (like John Smith, Lassie, and Tissue_Diagnosis). This may enable us to integrate TMA files with other clinical and/or bioinformatics files.
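
As a hedged sketch of the pointer idea (the element and attribute names here are assumptions, not part of the draft standard), a slide file could reference the shared block record and the tissue repository entry instead of copying them:

    <!-- hypothetical use of pointers in place of embedded, redundant data -->
    <tma_slide>
      <slide_identifier>SLIDE-042</slide_identifier>
      <!-- points to the single, shared description of the block -->
      <block_reference href="http://tma.example.org/blocks/BLOCK-2001-001"/>
      <!-- points to the clinical annotation held by the tissue repository -->
      <specimen_reference repository="Example Tissue Bank" accession="TB-1998-55512"/>
    </tma_slide>

An error in the block record is then corrected in one place, and every file that points to it sees the correction.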

The third issue that came up at the workshop relates to two general approaches to data structure: richness versus parsimony. We could design a TMA structure that includes every imaginable data element that anyone might possibly want to include in a TMA data file. As a group, we could then determine which of these elements should be required in a TMA file and which should be optional.

We could also take a parsimonious approach and try to determine the minimal set of data elements needed to produce a TMA data file. Vendors and laboratories that need additional data elements could add them as they like, so long as their files maintain the general organization of the standard TMA file and include the minimal set of data elements.
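
A hypothetical illustration of the parsimonious approach (the choice of "minimal" elements here is an assumption made for the sketch, not a decision of the group) would keep the required set small and let a laboratory append its own elements without breaking the general organization:

    <!-- minimal required elements (hypothetical) plus an optional laboratory extension -->
    <tma_core>
      <core_identifier>A-01</core_identifier>
      <tissue_diagnosis>normal breast</tissue_diagnosis>
      <lab_extension>
        <stain>estrogen receptor immunostain</stain>
      </lab_extension>
    </tma_core>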

The group did not come to any conclusion regarding this fundamental issue, and the issue will be left to future meetings.

Current Status of Tissue Microarray Data Exchange Standard

Jules Berman is finishing a draft DTD. It will include many of the structural innovations discussed in the workshop summary. He will submit it to Mary for initial comment and revision, and then Jules and Mary will submit it to the group. Hopefully, it can be made public in the next few days.

Future Actions

The next scheduled TMA workshop will be held in conjunction with the University of Michigan's AIMCL in the summer of 2002.


Jules J. Berman, Ph.D., M.D.
Program Director, Pathology Informatics
Cancer Diagnosis Program, DCTD, NCI, NIH
EPN - Room 6028
6130 Executive Blvd.
Rockville, MD 20892
email: bermanj@mail.nih.gov
voice: 301-496-7147
fax: 301-402-7819


Mary E. Edgerton, M.D., Ph.D.
Director of Pathology Informatics
Vanderbilt University Medical Center
Department of Pathology
Nashville, Tennessee
Mary.Edgerton@mcmail.vanderbilt.edu