Bringing order to data chaos

November 2003

CAP TODAY (Excerpt)
Feature Story

Eric Skjei

Pressure to adopt standards for structuring and communicating pathology data is mounting, and the move to create the standards is gaining momentum.

"A few months ago the NIH came out with a requirement that all grants where the request is larger than $500,000 per year must include a section in the application that describes how the authors plan to share their data," says Jules J. Berman, MD, PhD, program director for pathology informatics at the National Cancer Institute. The good grant applications, he says, will do more than just say they will make their data publicly available. "They will show they have a realistic way of organizing their data so that it can be merged into other data sets," says Dr. Berman.

Those who write gene expression array-related grants, he notes, already are aware of the need to share data and typically use the Microarray Gene Expression Databases or the Minimum Information About a Microarray Experiment specifications for their gene expression array data.

"I think this will probably be the case for any kind of research funded by the federal government," he says. "The data must be presented in a way that can be understood by humans and by computers. Research files should be self-describing and prepared in a way that meets the general rules for the standard exchange of data. Taking these steps will enhance the chances that a grant will be reviewed favorably."

Many journals now require that those who publish papers include the data that support the assertions made in the paper, he says, which means authors will submit supplemental data files that will appear on the journal's Web site. And to be usable, these files will have to adopt standard formats. "Just about anything anyone does will have to be exchangeable, the exchanges are going to be done electronically, and they will need to be done in such a way that any given data set can be integrated with every other data set of the same type," Dr. Berman says.

Not easy, of course, but we're moving in that direction. A modest milestone in the history of standards development in pathology was achieved May 23 with the publication of "The Tissue Microarray Data Exchange Specification: A Community-Based, Open Source Tool for Sharing Tissue Microarray Data." Authored by Dr. Berman and pathologists Mary E. Edgerton, MD, PhD, and Bruce A. Friedman, MD, the specification is the result of more than two years of effort and the first such work published by the Association for Pathology Informatics. (It can be found at www.biomedcentral.com/472-6947/3/5.)

Tissue microarrays, or TMAs, are a relatively recent technology that allows researchers to include and study hundreds of samples on a single slide. To facilitate data exchange among researchers and data inclusion in scientific journals, leaders in the standards area realized that a TMA data-exchange specification was needed. Work began on one in 2001 under the auspices of the API with funding from the National Cancer Institute. The resulting specification describes an XML document with four required sections: header, block, slide, and core. Eighty common data elements constitute the universe of XML tags used in the TMA specification. In addition, six simple semantic rules (for example, "Every TMA file must consist of well-formed XML") guide the development of TMA data-exchange specifications. (A common data element is a generic type of data, such as lab name, that is likely to show up in any TMA file. XML, or Extensible Markup Language, is a high-powered, more flexible cousin of HTML that allows users to create their own tags, among other things, and is rapidly becoming the de facto tool for creating medical informatics standards and specifications of a variety of types.)
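To make the four-section structure and the well-formedness rule concrete, here is a minimal sketch in Python. Only the section names (header, block, slide, core) come from the specification as described above; the root tag tma_file, the lab_name tag, the flat layout of the sample, and both function names are assumptions made for illustration and are not taken from the published specification.

```python
import xml.etree.ElementTree as ET

# The four required sections named in the article.
REQUIRED_SECTIONS = ("header", "block", "slide", "core")

# Hypothetical sample file: root tag, lab_name tag, and layout are illustrative only.
SAMPLE = """<tma_file>
  <header><lab_name>Example Lab</lab_name></header>
  <block/>
  <slide/>
  <core/>
</tma_file>"""


def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def missing_sections(xml_text):
    """Return the required section names that appear nowhere in the document.

    Raises ET.ParseError if the text is not well-formed XML.
    """
    root = ET.fromstring(xml_text)
    present = {element.tag for element in root.iter()}
    return [name for name in REQUIRED_SECTIONS if name not in present]


if __name__ == "__main__":
    print(is_well_formed(SAMPLE))
    print(missing_sections(SAMPLE))
```

A check like this covers only structure. It says nothing about whether the tags inside each section are drawn from the specification's common data elements.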

Like an RTF file

An easy way to understand the significance of the TMA specification is to refer to a commonly used file exchange function of most commercial word-processing programs, the RTF (Rich Text Format) file. Take, for example, Microsoft Word and WordPerfect. A person writing in Word cannot send a Word file to someone working in WordPerfect (or vice versa) and have the recipient open it and read it. An intermediate file, functioning something like a lingua franca among all mainstream word-processor files, must be used. RTF serves this purpose well, preserving text, format instructions, and image placement as the original writer intended. In other words, RTF serves as a gateway through which data, in this case word-processing data, can be exchanged among heterogeneous or "foreign" applications.

The new TMA data-exchange specification serves precisely this purpose for those working with TMA data. In most cases, users of the specification will also require a script, similar to the "save as" option found in most word-processing programs, to convert their own specific TMA data files into the published TMA data-exchange format. This script is not part of the TMA data-exchange specification. Again, the specification is analogous to the RTF format, but it does not include the simple programming tools needed to convert proprietary TMA data formats into the XML-based specification format.
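The kind of "save as" script described here might look something like the following sketch, which converts a small local table of core data into an XML document. It assumes the local data is a CSV export; the column names (core_id, diagnosis), the root tag tma_file, the nesting of slide inside block and core inside slide, and the function name are all hypothetical and are not part of the published specification.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical local export; real laboratory formats will differ.
LOCAL_CSV = "core_id,diagnosis\nA1,adenocarcinoma\nA2,normal\n"


def csv_to_exchange_xml(csv_text):
    """Convert CSV rows (one row per core) into an XML string.

    The nesting used here is an assumption for illustration only.
    """
    root = ET.Element("tma_file")
    ET.SubElement(root, "header")
    block = ET.SubElement(root, "block")
    slide = ET.SubElement(block, "slide")
    for row in csv.DictReader(io.StringIO(csv_text)):
        core = ET.SubElement(slide, "core")
        # Each CSV column becomes a child tag of the core element.
        for column, value in row.items():
            ET.SubElement(core, column).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(csv_to_exchange_xml(LOCAL_CSV))
```

Because the output carries its own tags, a recipient's script can walk the document and map each element back into a local database without knowing how the sender stored the data.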

"The purpose of a data-exchange specification," Dr. Berman says, "is so that anybody who has their own way of making TMAs can exchange that data with others, regardless of what company they bought their arrayer from, what image-analysis software they're using, or how they've set up their database." The assumption, he says, is that individuals are going to create their TMA files in their own way for their own purposes. "What we wanted to do was create a format that was so plastic that anyone could port their own data into a very general data-exchange envelope that would accept almost any kind of TMA data but that would be self-describing, so that a script could reinterpret it into someone else's own system."

To date, Dr. Berman says, the specification has been well received and appears to be serving the intended need. The API plans to continue to develop the specification as needed, based on user input, over the next several years.



Eric Skjei is a writer in Stinson Beach, Calif.