MESH TREES (MEDICAL SUBJECT HEADING TREES): A MISNOMER

MeSH (Medical Subject Headings) is a wonderful nomenclature of medical terms available from the U.S. National Library of Medicine.

The download site is:
http://www.nlm.nih.gov/mesh/filelist.html

MeSH is one of the greatest gifts provided by the U.S. National Library of Medicine and can be used freely for a variety of projects involving indexing, tagging, searching, retrieving, coding, analyzing, merging, and sharing biomedical text. In my opinion, many projects that rely on commercial, legally encumbered nomenclatures would be better served by MeSH.

My only quibble with MeSH is that it is incorrectly described as a tree structure.

Here is the official word on MeSH trees, from the NLM website (http://www.nlm.nih.gov/mesh/intro_trees2007.html):

"Because of the branching structure of the hierarchies, these lists are sometimes referred to as "trees". Each MeSH descriptor appears in at least one place in the trees, and may appear in as many additional places as may be appropriate. Those who index articles or catalog books are instructed to find and use the most specific MeSH descriptor that is available to represent each indexable concept."

When you look at individual entries in MeSH, you find that a single entry may be assigned multiple MeSH numbers.

For example, the MeSH term "Family" is assigned two MeSH numbers:
MN = F01.829.263
MN = I01.880.225
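
As a minimal sketch of how the numbers for one entry might be collected, the snippet below pulls the MH (term name) and MN (MeSH number) lines out of a record. The record text is a hand-made excerpt rather than a complete MeSH entry, and the helper name mesh_numbers is hypothetical.

```perl
use strict;
use warnings;

# Hand-made excerpt of one record (illustrative only, not the full entry)
my $record = "*NEWRECORD\nMH = Family\nMN = F01.829.263\nMN = I01.880.225\n";

# Return the term name followed by every MeSH number found in a record
sub mesh_numbers {
    my ($text) = @_;
    my ($name) = $text =~ /^MH = (.+)$/m;      # first "MH = " line
    my @numbers = $text =~ /^MN = (\S+)/mg;    # every "MN = " line
    return ($name, @numbers);
}

my ($name, @numbers) = mesh_numbers($record);
print "$name has ", scalar(@numbers), " MeSH numbers: @numbers\n";
```

One term name paired with a list of numbers is the shape of data that the rest of the discussion depends on.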

For each MeSH number, there is a separate hierarchy.

The parent number for any MeSH number is found by removing the last period-delimited group of digits.

For example:
F01.829.263 MeSH name, Family
F01.829 MeSH name, Psychology, Social
F01 MeSH name, Behavior and Behavior Mechanisms
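
The truncation rule is easy to express in code. The following is a small sketch, not part of any MeSH tooling; the function names parent_number and lineage are my own, chosen for illustration.

```perl
use strict;
use warnings;

# Return the parent of a MeSH number, or undef for a top-level number
sub parent_number {
    my ($number) = @_;
    return $number =~ /^(.+)\.[0-9]+$/ ? $1 : undef;
}

# Walk upward from a number to the top of its own hierarchy
sub lineage {
    my ($number) = @_;
    my @chain = ($number);
    while (defined(my $parent = parent_number($chain[-1]))) {
        push @chain, $parent;
    }
    return @chain;
}

print join(" -> ", lineage("F01.829.263")), "\n";
```

Note that this walks only one hierarchy. It says nothing about the other numbers that the term, or any of its ancestors, may carry.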

It is tempting to think of the hierarchy for each number as a tree (MeSH could then be envisioned as a dense forest), but each parent term may itself be assigned multiple MeSH numbers, each producing its own multi-branching hierarchy.

Because each MeSH term (including the ancestral terms for a MeSH term) may be assigned multiple MeSH numbers, each with its own hierarchy, the MeSH data structure is more accurately thought of as a complex ontology, with terms existing in multiple classes, with specified relationships among any class and its parent classes.

The tree metaphor breaks down because branches, and nodes within a branch, can be connected to other branches and to other nodes. Trees do not do this kind of thing: in a true tree, every node has only one parent.
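
One way to see the problem is to count the distinct parent terms of a single term. The sketch below does this for four number/name pairs copied from the sample output further down this page; the table is deliberately tiny and the function name parent_terms is hypothetical.

```perl
use strict;
use warnings;

# Four number => name pairs copied from the sample output on this page
my %name_of = (
    "A11.329.372.376" => "Giant Cells, Foreign-Body",
    "A11.502.376"     => "Giant Cells, Foreign-Body",
    "A11.329.372"     => "Macrophages",
    "A11.502"         => "Giant Cells",
);

# Return the distinct names of the immediate parents of a term
sub parent_terms {
    my ($term) = @_;
    my %parents;
    for my $number (sort keys %name_of) {
        next unless $name_of{$number} eq $term;
        my $parent = $number;
        next unless $parent =~ s/\.[0-9]+$//;   # drop the last group of digits
        $parents{ $name_of{$parent} } = 1 if exists $name_of{$parent};
    }
    return sort keys %parents;
}

my @parents = parent_terms("Giant Cells, Foreign-Body");
print "Parents: ", join("; ", @parents), "\n";
print "More than one parent, so not a tree node\n" if @parents > 1;
```

Even with this small table, the term sits under both "Giant Cells" and "Macrophages".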

It is possible to write a script that parses through every MeSH entry, finds all of the MeSH numbers for the entry, determines the parent terms for the MeSH numbers, determines all of the alternate MeSH numbers for the parent terms, then finds all of the grandparent terms for all of the parent terms, etc., until all of the hierarchical terms for the term are found.
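
The procedure just described can be sketched on a toy table before tackling the whole MeSH file. This is an illustration under stated assumptions, not the script that follows: it uses a work queue rather than repeated passes, the ten number/name pairs are a subset copied from the sample output on this page, and lineage_numbers is a hypothetical name.

```perl
use strict;
use warnings;

# Toy table: a subset of number => name pairs from the sample output on this page
my %name_of = (
    "A11"             => "Cells",
    "A11.502"         => "Giant Cells",
    "A11.502.376"     => "Giant Cells, Foreign-Body",
    "A11.733"         => "Phagocytes",
    "A11.733.397"     => "Macrophages",
    "A11.733.397.376" => "Giant Cells, Foreign-Body",
    "A15"             => "Hemic and Immune Systems",
    "A15.382"         => "Immune System",
    "A15.382.680"     => "Phagocytes",
    "A15.382.680.397" => "Macrophages",
);

# Invert the table: name => list of every number assigned to that name
my %numbers_of;
push @{ $numbers_of{ $name_of{$_} } }, $_ for sort keys %name_of;

# Collect every number reachable from a term by alternating two steps:
# switch to the other numbers of the same term, and truncate to the parent
sub lineage_numbers {
    my ($term) = @_;
    my %seen;
    my @queue = @{ $numbers_of{$term} || [] };
    while (defined(my $number = shift @queue)) {
        next if $seen{$number} or !exists $name_of{$number};
        $seen{$number} = 1;
        push @queue, @{ $numbers_of{ $name_of{$number} } };   # alternate numbers
        my $parent = $number;
        push @queue, $parent if $parent =~ s/\.[0-9]+$//;     # parent number
    }
    return sort keys %seen;
}

print "$_ $name_of{$_}\n" for lineage_numbers("Giant Cells, Foreign-Body");
```

In this toy table, neither number for "Giant Cells, Foreign-Body" begins with A15, yet the A15 branch is reached because "Macrophages" and "Phagocytes" carry alternate numbers there.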

Here is the Perl script.

This Perl script is provided "as is", by its creator, Jules J. Berman, without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.
#!/usr/local/bin/perl
use strict;
use warnings;

# D2007.BIN is the file name for the raw ascii MeSH version
open(my $mesh, "<", "D2007.BIN") or die "Cannot open D2007.BIN: $!";
open(my $out, ">", "mesh.out") or die "Cannot open mesh.out: $!";
my %numberhash;   # MeSH number => term name
my %namehash;     # term name => space-separated list of its MeSH numbers
my @cumlist;      # MeSH numbers accumulated for the current term
{
  local $/ = "\n\n*NEWRECORD\n";   # read one MeSH record at a time
  while (my $line = <$mesh>)
    {
    next unless $line =~ /\nMH = ([^\n]+)\n/;
    my $name = $1;
    my @numbers;
    while ($line =~ /\nMN ?= ?([^\n]+)(?=\n)/g)
      {
      my $number = $1;
      $number =~ s/^ +| +$//g;     # trim leading and trailing spaces
      $numberhash{$number} = $name;
      push(@numbers, $number);
      }
    $namehash{$name} = join(" ", @numbers);
    }
}
close($mesh);
while (my ($key, $value) = each %namehash)
   {
   print $out "\nTERM LINEAGE FOR " . uc($key) . "\n";
   @cumlist = split(/ /, $value);
   splitlist(@cumlist);
   # 30 passes is far more than the depth of any MeSH hierarchy,
   # so the list stops growing well before the loop ends
   for (1..30)
     {
     @cumlist = unique(@cumlist);
     allmeshnums(@cumlist);
     @cumlist = unique(@cumlist);
     splitlist(@cumlist);
     }
   @cumlist = unique(@cumlist);
   foreach my $thing (@cumlist)
      {
      my $name = exists $numberhash{$thing} ? $numberhash{$thing} : "";
      print $out "$thing $name\n";
      }
   }
close($out);
exit;

# Return the arguments with duplicates removed, keeping first occurrences
sub unique
   {
   my %marked;
   return grep { !$marked{$_}++ } @_;
   }

# Append to @cumlist every ancestor number of each MeSH number passed in
sub splitlist
   {
   my @valuelist = @_;   # copy, so that @cumlist entries are not altered
   foreach my $meshno (@valuelist)
     {
     # each removal of a trailing ".digits" group yields the next ancestor
     while ($meshno =~ s/\.[0-9]+$//)
       {
       push(@cumlist, $meshno);
       }
     }
   }

# Prepend to @cumlist every MeSH number assigned to the term behind
# each MeSH number passed in
sub allmeshnums
  {
  my @meshnumber = @_;
  foreach my $thing (@meshnumber)
    {
    my $name = $numberhash{$thing};
    next unless defined $name;
    unshift(@cumlist, split(/ /, $namehash{$name}));
    }
  }
The output file, mesh.out, is over 9 megabytes in length.

Here is an example of one entry in the output file, mesh.out:
TERM LINEAGE FOR GIANT CELLS, FOREIGN-BODY
A11.118.637 Leukocytes
A15.145.229.637 Leukocytes
A15.382.490 Leukocytes
A11.118.637.555 Leukocytes, Mononuclear
A15.145.229.637.555 Leukocytes, Mononuclear
A15.382.490.555 Leukocytes, Mononuclear
A15.378 Hematopoietic System
A11.148 Bone Marrow Cells
A15.378.316 Bone Marrow Cells
A12.207.152 Blood
A15.145 Blood
A11.118 Blood Cells
A15.145.229 Blood Cells
A11.329.372.376 Giant Cells, Foreign-Body
A11.502.376 Giant Cells, Foreign-Body
A11.627.624.480.376 Giant Cells, Foreign-Body
A11.733.397.376 Giant Cells, Foreign-Body
A15.382.680.397.376 Giant Cells, Foreign-Body
A15.382.812.522.376 Giant Cells, Foreign-Body
A11.329 Connective Tissue Cells
A11 Cells
A11.502 Giant Cells
A11.118.637.555.652 Monocytes
A11.148.580 Monocytes
A11.627.624 Monocytes
A11.733.547 Monocytes
A15.145.229.637.555.652 Monocytes
A15.378.316.580 Monocytes
A15.382.490.555.652 Monocytes
A15.382.680.547 Monocytes
A15.382.812.547 Monocytes
A11.627 Myeloid Cells
A11.733 Phagocytes
A15.382.680 Phagocytes
A15.382 Immune System
A15 Hemic and Immune Systems
A11.329.372 Macrophages
A11.627.624.480 Macrophages
A11.733.397 Macrophages
A15.382.680.397 Macrophages
A15.382.812.522 Macrophages
A15.382.812 Reticuloendothelial System
A12.207 Body Fluids
A12 Fluids and Secretions
When we examine the multi-lineage ancestry of "Giant Cells, Foreign-Body", we see that MeSH is not a tree hierarchy: this one term carries six MeSH numbers, and its ancestors fall under three top-level categories (A11, A12, and A15). This means that the MeSH data structure is highly complex and requires some computational know-how to fully explore all the term relationships.

Page written by Jules J. Berman, April 17, 2008

MeSH and other open source nomenclatures are described in my book, Biomedical Informatics

Page last modified: April 17, 2008

tags: Perl programming for medicine and biology, nomenclature, thesaurus, nlm, medical subject headings, open source, medical indexing, medical data retrieval, medical informatics, biomedical informatics