MESH TREES (MEDICAL SUBJECT HEADING TREES): A MISNOMER
MeSH (Medical Subject Headings) is a wonderful nomenclature of medical
terms available from the U.S. National Library of Medicine.
The download site is:
http://www.nlm.nih.gov/mesh/filelist.html
MeSH is one of the greatest gifts provided by the U.S. National Library of Medicine
and can be used freely for a variety of projects involving indexing,
tagging, searching, retrieving, coding, analyzing, merging, and sharing
biomedical text. In my opinion, there are many projects that rely
on commercial and legally encumbered nomenclatures that would be
better served by MeSH.
My only quibble with MeSH is that it is incorrectly described as a tree structure.
Here is the official word on MeSH trees, from the NLM website:
http://www.nlm.nih.gov/mesh/intro_trees2007.html
"Because of the branching structure of the hierarchies, these lists are sometimes referred to as "trees". Each MeSH descriptor appears in at least one place in the trees, and may appear in as many additional places as may be appropriate. Those who index articles or catalog books are instructed to find and use the most specific MeSH descriptor that is available to represent each indexable concept."
When you look at individual entries in MeSH, you find that a single
entry may be assigned multiple MeSH numbers.
For example, the MeSH term, "Family" is assigned two MeSH numbers,
MN = F01.829.263
MN = I01.880.225
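As a minimal sketch, the two MN lines can be pulled out of a record with a pair of regular expressions. The record below is a reduced fragment assembled for illustration (real records in the ASCII file carry many more fields), and the variable names are my own choices.

```perl
use strict;
use warnings;

# A fragment of a MeSH record, reduced to the two field types used here.
# This is an illustrative assumption, not a complete record.
my $record = "MH = Family\nMN = F01.829.263\nMN = I01.880.225\n";

# The term name is on the "MH = " line.
my ($name) = $record =~ /^MH = (.+)$/m;

# Every "MN = " line contributes one MeSH number.
my @numbers = $record =~ /^MN = (\S+)/mg;

print "$name has ", scalar(@numbers), " MeSH numbers: @numbers\n";
```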
For each MeSH number, there is a separate hierarchy.
The parent number for any MeSH number is found by removing the last period-delimited group of digits.
For example:
F01.829.263 MeSH name, Family
F01.829 MeSH name, Psychology, Social
F01 MeSH name, Behavior and Behavior Mechanisms
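The parent rule can be sketched in a few lines of Perl. The subroutine name "ancestors" is a hypothetical choice made for this illustration; it is not part of MeSH.

```perl
use strict;
use warnings;

# Return the ancestor MeSH numbers of $number, nearest parent first.
# The name "ancestors" is hypothetical, chosen for this illustration.
sub ancestors {
    my ($number) = @_;
    my @lineage;
    # Strip the last period-delimited group of digits until none remains.
    while ($number =~ s/\.[0-9]+$//) {
        push @lineage, $number;
    }
    return @lineage;
}

print join("\n", ancestors("F01.829.263")), "\n";
```

For F01.829.263 this prints F01.829 and then F01, matching the hierarchy listed above.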
It is tempting
to think of each hierarchy for each number as a tree (then MeSH could
be envisioned as a dense forest), but each parent term could be
assigned multiple MeSH numbers, each producing a multi-branching
hierarchy.
Because each MeSH term (including the ancestral terms for a MeSH term) may
be assigned multiple MeSH numbers, each with its own hierarchy, the MeSH data structure
is more accurately thought of as a complex ontology, with terms existing in
multiple classes, and with specified relationships between each class and its
parent classes.
The tree metaphor breaks down because branches and nodes within a branch
can be connected to other branches and to other nodes. In a true tree,
every node other than the root has exactly one parent.
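To make the point concrete, here is a small sketch that counts the distinct parent terms of "Macrophages". The MeSH numbers and names are copied from the sample output further down this page; the variable names are my own, and the table holds only the handful of entries needed here.

```perl
use strict;
use warnings;

# MeSH numbers for "Macrophages" and the names of their parent numbers,
# copied from the sample output further down this page.
my @macrophage_numbers = qw(
    A11.329.372 A11.627.624.480 A11.733.397
    A15.382.680.397 A15.382.812.522
);
my %name_of = (
    "A11.329"     => "Connective Tissue Cells",
    "A11.627.624" => "Monocytes",
    "A11.733"     => "Phagocytes",
    "A15.382.680" => "Phagocytes",
    "A15.382.812" => "Reticuloendothelial System",
);

# Collect the distinct parent terms of "Macrophages".
my %parent_terms;
foreach my $number (@macrophage_numbers) {
    (my $parent = $number) =~ s/\.[0-9]+$//;
    $parent_terms{ $name_of{$parent} } = 1;
}

# A node in a true tree would have exactly one parent.
my @parents = sort keys %parent_terms;
print scalar(@parents), " parent terms: ", join("; ", @parents), "\n";
```

The term turns out to have four distinct parent terms, which a tree does not allow.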
It is possible to write a script that parses through every MeSH entry,
finds all of the MeSH numbers for the entry, determines the parent
terms for the MeSH numbers, determines all of the alternate MeSH
numbers for the parent terms, then finds all of the grandparent terms
for all of the parent terms, etc., until all of the hierarchical
terms for the term are found.
Here is the Perl script. This Perl script is provided "as is", by
its creator, Jules J. Berman,
without warranty of any kind, express or implied, including but
not limited to the warranties of merchantability, fitness for a
particular purpose and noninfringement. In no event shall the authors
or copyright holders be liable for any claim, damages or other liability,
whether in an action of contract, tort or otherwise, arising from, out
of or in connection with the software or the use or other dealings in
the software.
#!/usr/local/bin/perl
use strict;
use warnings;

# D2007.BIN is the raw ASCII MeSH descriptor file
open(my $mesh, "<", "D2007.BIN") or die "Cannot open D2007.BIN: $!";
open(my $out, ">", "mesh.out") or die "Cannot open mesh.out: $!";

$/ = "\n\n*NEWRECORD\n";   # read one MeSH record at a time
my @cumlist;               # working list of MeSH numbers for the current term
my %numberhash;            # MeSH number => MeSH name
my %namehash;              # MeSH name => space-separated MeSH numbers

# Pass 1: collect the name (MH) and every MeSH number (MN) of each record.
while (defined(my $line = <$mesh>)) {
    next unless $line =~ /\nMH = ([^\n]+)\n/;
    my $name = $1;
    my @numbers;
    while ($line =~ /\nMN ?= ?([^\n]+)(?=\n)/g) {
        my $number = $1;
        $number =~ s/^ +//;
        $number =~ s/ +$//;
        $numberhash{$number} = $name;
        push(@numbers, $number);
    }
    $namehash{$name} = join(" ", @numbers);
}
close($mesh);

# Pass 2: for each term, repeatedly add the ancestors of its numbers and
# the alternate numbers of every term reached so far.
while (my ($key, $value) = each(%namehash)) {
    print $out "\nTERM LINEAGE FOR " . uc($key) . "\n";
    @cumlist = split(/ /, $value);
    splitlist(@cumlist);
    for (1..30) {   # a fixed number of passes, more than the lineages need
        @cumlist = unique(@cumlist);
        allmeshnums(@cumlist);
        @cumlist = unique(@cumlist);
        splitlist(@cumlist);
    }
    @cumlist = unique(@cumlist);
    foreach my $thing (@cumlist) {
        my $label = defined $numberhash{$thing} ? $numberhash{$thing} : "";
        print $out "$thing $label\n";
    }
}
close($out);
exit;

# Push every ancestor of each MeSH number onto @cumlist. An ancestor is
# obtained by removing the last period-delimited group of digits.
sub splitlist {
    my @valuelist = @_;
    foreach my $meshno (@valuelist) {
        while ($meshno =~ s/\.[0-9]+$//) {
            push(@cumlist, $meshno);
        }
    }
}

# For each MeSH number, find its term and prepend all of that term's
# MeSH numbers to @cumlist.
sub allmeshnums {
    my @meshnumber = @_;
    foreach my $thing (@meshnumber) {
        my $name = $numberhash{$thing};
        next unless defined $name;
        my @valuelist = split(/ /, $namehash{$name});
        @cumlist = (@valuelist, @cumlist);
    }
}

# Return the list with duplicates removed, keeping first occurrences.
sub unique {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}
The output file, mesh.out, is over 9 megabytes in length.
Here is an example of one entry in the output file, mesh.out:
TERM LINEAGE FOR GIANT CELLS, FOREIGN-BODY
A11.118.637 Leukocytes
A15.145.229.637 Leukocytes
A15.382.490 Leukocytes
A11.118.637.555 Leukocytes, Mononuclear
A15.145.229.637.555 Leukocytes, Mononuclear
A15.382.490.555 Leukocytes, Mononuclear
A15.378 Hematopoietic System
A11.148 Bone Marrow Cells
A15.378.316 Bone Marrow Cells
A12.207.152 Blood
A15.145 Blood
A11.118 Blood Cells
A15.145.229 Blood Cells
A11.329.372.376 Giant Cells, Foreign-Body
A11.502.376 Giant Cells, Foreign-Body
A11.627.624.480.376 Giant Cells, Foreign-Body
A11.733.397.376 Giant Cells, Foreign-Body
A15.382.680.397.376 Giant Cells, Foreign-Body
A15.382.812.522.376 Giant Cells, Foreign-Body
A11.329 Connective Tissue Cells
A11 Cells
A11.502 Giant Cells
A11.118.637.555.652 Monocytes
A11.148.580 Monocytes
A11.627.624 Monocytes
A11.733.547 Monocytes
A15.145.229.637.555.652 Monocytes
A15.378.316.580 Monocytes
A15.382.490.555.652 Monocytes
A15.382.680.547 Monocytes
A15.382.812.547 Monocytes
A11.627 Myeloid Cells
A11.733 Phagocytes
A15.382.680 Phagocytes
A15.382 Immune System
A15 Hemic and Immune Systems
A11.329.372 Macrophages
A11.627.624.480 Macrophages
A11.733.397 Macrophages
A15.382.680.397 Macrophages
A15.382.812.522 Macrophages
A15.382.812 Reticuloendothelial System
A12.207 Body Fluids
A12 Fluids and Secretions
When we examine the multi-lineage ancestry of "Giant Cells, Foreign-Body",
we see that MeSH is not a tree hierarchy. The MeSH data structure is
highly complex, and fully exploring all the term relationships requires
some computational know-how.
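As one way to explore these relationships, the lineage of a term can be computed with a worklist that stops when no new terms turn up, instead of a fixed number of passes. This is a sketch of a swapped-in technique, not the script above. The table is a small subset of MeSH copied from the sample output above, and "lineage", "numbers_of" and "name_of" are hypothetical names.

```perl
use strict;
use warnings;

# A small subset of MeSH, copied from the sample output above.
my %numbers_of = (
    "Blood Cells"              => [qw(A11.118 A15.145.229)],
    "Blood"                    => [qw(A12.207.152 A15.145)],
    "Cells"                    => [qw(A11)],
    "Hemic and Immune Systems" => [qw(A15)],
    "Body Fluids"              => [qw(A12.207)],
    "Fluids and Secretions"    => [qw(A12)],
);

# Invert the table: MeSH number => term name.
my %name_of;
while (my ($name, $numbers) = each %numbers_of) {
    $name_of{$_} = $name for @$numbers;
}

# Return the term itself plus every term reachable by moving from any of
# a term's MeSH numbers to the term that owns the parent number.
sub lineage {
    my ($term) = @_;
    my %seen  = ($term => 1);
    my @queue = ($term);
    while (defined(my $current = shift @queue)) {
        foreach my $number (@{ $numbers_of{$current} || [] }) {
            # Top-level numbers such as A11 have no parent.
            (my $parent = $number) =~ s/\.[0-9]+$// or next;
            my $parent_term = $name_of{$parent};
            next if !defined $parent_term || $seen{$parent_term}++;
            push @queue, $parent_term;
        }
    }
    return sort keys %seen;
}

print "$_\n" for lineage("Blood Cells");
```

Starting from "Blood Cells", the walk reaches "Fluids and Secretions" only by way of an alternate MeSH number for "Blood", which is exactly the kind of cross-branch link that a tree cannot express.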
Page written by Jules J. Berman, April 17, 2008
MeSH and other open source nomenclatures are described in my book, Biomedical Informatics.
Page last modified: April 17, 2008