Books by Jules J. Berman, covers


Ruby, Perl and Python scripts
Extracting a phylogenetic hierarchy for each individual organism listed
in the European Bioinformatics Institute data file, taxonomy.dat

Taxonomy.dat is a large, publicly available list of organisms. The file is available from the European Bioinformatics Institute (EBI). It contains over 400,000 species:

[A sample record in Taxonomy.dat]
[ID : 50]
[PARENT ID : 49]
[RANK: genus]
[GC ID : 11]
[SCIENTIFIC NAME : Chondromyces]
[SYNONYM : Polycephalum]
[SYNONYM : Myxobotrys]
[SYNONYM : Chondromyces Berkeley and Curtis 1874]
[SYNONYM : "Polycephalum" Kalchbrenner and Cooke 1880]
[SYNONYM : "Myxobotrys" Zukal 1896]
[MISSPELLING : Chrondromyces]
The taxonomy.dat file exceeds 100 megabytes in length.
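
As a rough illustration of how one record can be taken apart, here is a minimal Python sketch. It assumes one "FIELD NAME : value" pair per line, as in the sample record above; the function name parse_record and the shortened SAMPLE_RECORD text are my own illustrative choices and are not part of taxonomy.dat or of the scripts further down this page.

```python
import re
from collections import defaultdict

# Assumed layout: an upper-case field name, a colon, then the value.
FIELD_LINE = re.compile(r"^\s*([A-Z][A-Z ]*?)\s*:\s*(.+?)\s*$")

def parse_record(text):
    """Return a dict mapping each field name to the list of values seen for it."""
    fields = defaultdict(list)
    for line in text.splitlines():
        match = FIELD_LINE.match(line)
        if match:
            fields[match.group(1)].append(match.group(2))
    return dict(fields)

# Shortened copy of the sample record (hypothetical spacing).
SAMPLE_RECORD = """\
ID : 50
PARENT ID : 49
RANK : genus
GC ID : 11
SCIENTIFIC NAME : Chondromyces
SYNONYM : Polycephalum
SYNONYM : Myxobotrys
MISSPELLING : Chrondromyces
"""

record = parse_record(SAMPLE_RECORD)
print(record["ID"])               # ['50']
print(record["SCIENTIFIC NAME"])  # ['Chondromyces']
print(record["SYNONYM"])          # ['Polycephalum', 'Myxobotrys']
```

Every value is kept in a list because fields such as SYNONYM can repeat within one record.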

The taxonomy.dat file is available for public download through anonymous ftp.

[ftp://ftp.ebi.ac.uk/pub/databases/taxonomy/]

Information about the taxonomy.dat file is found at:

[http://www.ebi.ac.uk/msd-srv/docs/dbdoc/ref_taxonomy.html]

Notice that the sample entry (above) provides an ID number for the entry organism, Chondromyces, and for its parent class. Since every organism and class has a parent, you can write a script that reconstructs the full phylogenetic lineage for any entry in taxonomy.dat.
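
The idea of following parent IDs upward can be sketched in a few lines of Python. The two small dictionaries below are toy data: the IDs and names are copied from the Homo sapiens sample output on this page and cut off after Hominidae. The function name lineage and the use of a "seen" set to guard against a self-referencing parent are my own assumptions, not something taken from taxonomy.dat.

```python
# Toy child -> parent and ID -> name maps (truncated, for illustration only).
PARENT_OF = {"9606": "9605", "9605": "207598", "207598": "9604"}
NAME_OF = {
    "9606": "Homo sapiens",
    "9605": "Homo",
    "207598": "Homo/Pan/Gorilla group",
    "9604": "Hominidae",
}

def lineage(taxon_id, parent_of, name_of):
    """Return (id, name) pairs from taxon_id upward until no known ancestor remains."""
    path = []
    seen = set()
    # Stop when the ID has no name on record or has already been visited.
    while taxon_id in name_of and taxon_id not in seen:
        seen.add(taxon_id)
        path.append((taxon_id, name_of[taxon_id]))
        taxon_id = parent_of.get(taxon_id)
    return path

for taxon_id, name in lineage("9606", PARENT_OF, NAME_OF):
    print(f"{taxon_id:<10}{name}")
```

With the full child-parent map built from taxonomy.dat, the same walk would continue up to the topmost entry.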

On this web page, I include equivalent Ruby, Python and Perl scripts that parse through taxonomy.dat, build a hash of all the child-parent relationships, then re-parse the file, building the phylogenetic lineage of each organism using the child-parent hash that was built in the first pass.

The software was created by Jules J. Berman on April 10, 2008. The software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

Here is the Ruby script:
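#!/usr/local/bin/ruby
# Pass 1: record each taxon's parent ID and scientific name, keyed by taxon ID.
parenthash = Hash.new
namehash = Hash.new
File.open("taxonomy.dat", "r") do |intext|
  # Records in taxonomy.dat end with "//", so read one record at a time.
  intext.each_line("//") do |record|
    # ^ anchors at any line start, so the first record in the file also matches.
    next unless record =~ /^ID\s+:\s*([0-9]+)\s*$/
    child_id = $1
    parenthash[child_id] = $1 if record =~ /^PARENT ID\s+:\s*([0-9]+)\s*$/
    namehash[child_id] = $1.strip if record =~ /^SCIENTIFIC NAME\s+:\s*([^\n]+)$/
  end
end
# Pass 2: copy each record to taxo.txt, followed by its lineage.
File.open("taxo.txt", "w") do |outtext|
  File.open("taxonomy.dat", "r") do |intext|
    intext.each_line("//") do |record|
      next if record.strip.empty?   # skip trailing whitespace after the last "//"
      outtext.puts(record.sub("//", ""), "HIERARCHY")
      id_name = (record =~ /^ID\s+:\s*([0-9]+)\s*$/) ? $1 : nil
      seen = {}
      # Climb the parent links until an ID is unknown or repeats.
      # (A fixed cap of 30 steps would cut off lineages deeper than 30 levels.)
      while namehash.key?(id_name) && !seen[id_name]
        seen[id_name] = true
        outtext.puts(namehash[id_name])
        id_name = parenthash[id_name]
      end
      outtext.print("//")
    end
  end
end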
Here is the equivalent Perl script:
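#!/usr/local/bin/perl
use strict;
use warnings;

# Pass 1: record each taxon's parent ID and scientific name, keyed by taxon ID.
my (%parenthash, %namehash);
$/ = "//";    # records in taxonomy.dat end with "//"
open(my $taxo, "<", "taxonomy.dat") or die "Cannot open taxonomy.dat: $!";
while (my $line = <$taxo>)
  {
  next unless $line =~ /^ID +: *([0-9]+) *$/m;
  my $id_name = $1;
  $parenthash{$id_name} = $1 if $line =~ /^PARENT ID +: *([0-9]+) *$/m;
  $namehash{$id_name} = $1 if $line =~ /^SCIENTIFIC NAME +: *([^\n]+?) *$/m;
  }
close($taxo);

# Pass 2: copy each record to taxo.txt, followed by its lineage.
open($taxo, "<", "taxonomy.dat") or die "Cannot open taxonomy.dat: $!";
open(my $out, ">", "taxo.txt") or die "Cannot open taxo.txt: $!";
while (my $line = <$taxo>)
  {
  next unless $line =~ /\S/;    # skip trailing whitespace after the last "//"
  (my $getline = $line) =~ s/\/\///;
  print $out $getline . "HIERARCHY\n";
  my $id_name = ($line =~ /^ID +: *([0-9]+) *$/m) ? $1 : undef;
  my %seen;
  # Climb the parent links until an ID is unknown or repeats, so the
  # topmost entry is printed too, as in the Ruby and Python versions.
  while (defined $id_name && exists $namehash{$id_name} && !$seen{$id_name}++)
    {
    print $out "$namehash{$id_name}\n";
    $id_name = $parenthash{$id_name};
    }
  print $out "//";
  }
close($taxo);
close($out);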
Here is the equivalent Python script:
#!/usr/local/bin/python
# Written for Python 3.
import re

# ^ with re.MULTILINE keeps "ID" from matching inside "PARENT ID".
child_match = re.compile(r'^ID\s+:\s*(\d+)\s*$', re.MULTILINE)
parent_match = re.compile(r'^PARENT ID\s+:\s*(\d+)\s*$', re.MULTILINE)
name_match = re.compile(r'^SCIENTIFIC NAME\s+:\s*([^\n]+?)\s*$', re.MULTILINE)

def records(path):
    """Yield each record of the file as one string, without its // terminator."""
    cum_line = ""
    # latin-1 maps every byte to a character, so unexpected bytes cannot
    # raise a decoding error.
    with open(path, "r", encoding="latin-1") as intext:
        for line in intext:
            if "//" in line:
                yield cum_line
                cum_line = ""
            else:
                cum_line = cum_line + line

# Pass 1: record each taxon's parent ID and scientific name, keyed by taxon ID.
parenthash = {}
namehash = {}
for record in records("taxonomy.dat"):
    m = child_match.search(record)
    if not m:
        continue
    childnumber = m.group(1)
    x = parent_match.search(record)
    if x:
        parenthash[childnumber] = x.group(1)
    y = name_match.search(record)
    if y:
        namehash[childnumber] = y.group(1)

# Pass 2: copy each record to taxo.txt, followed by its lineage.
with open("taxo.txt", "w", encoding="latin-1") as outtext:
    for record in records("taxonomy.dat"):
        print(record + "HIERARCHY", file=outtext)
        z = child_match.search(record)
        id_name = z.group(1) if z else None
        seen = set()
        # Climb the parent links until an ID is unknown or repeats.
        while id_name in namehash and id_name not in seen:
            seen.add(id_name)
            print(namehash[id_name], file=outtext)
            id_name = parenthash.get(id_name)
        print("//", file=outtext)
These scripts produce an output file, taxo.txt, that exceeds 224 megabytes in length. The output consists of the taxonomic entries from taxonomy.dat, along with the phylogenetic lineage for each organism.
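
If you want to read the output back in, each output record can be split into its original entry and its lineage. The Python sketch below assumes the layout the scripts write: the entry lines, a line containing only HIERARCHY, one ancestor name per line, then "//". The function name split_output_record and the shortened SAMPLE_OUTPUT text (whose lineage is cut off after Hominidae) are hypothetical.

```python
def split_output_record(text):
    """Split one taxo.txt record into (entry_lines, hierarchy_names)."""
    entry, hierarchy = [], []
    target = entry
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped == "//":
            continue  # ignore blank lines and the record terminator
        if stripped == "HIERARCHY":
            target = hierarchy  # everything after this marker is lineage
            continue
        target.append(stripped)
    return entry, hierarchy

# Shortened, illustrative output record.
SAMPLE_OUTPUT = """\
ID : 9605
PARENT ID : 207598
SCIENTIFIC NAME : Homo
HIERARCHY
Homo
Homo/Pan/Gorilla group
Hominidae
//
"""

entry, hierarchy = split_output_record(SAMPLE_OUTPUT)
print(entry)
print(hierarchy)
```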

It takes under a minute to execute these scripts on a desktop computer running at 2.6 GHz with 512 MB of RAM.

Sample output phylogenetic hierarchy for Homo sapiens:
9606      Homo sapiens 
9605      Homo 
207598    Homo/Pan/Gorilla group 
9604      Hominidae 
314295    Hominoidea 
9526      Catarrhini 
314293    Simiiformes 
376913    Haplorrhini 
9443      Primates 
314146    Euarchontoglires 
9347      Eutheria 
32525     Theria 
40674     Mammalia 
32524     Amniota 
32523     Tetrapoda 
8287      Sarcopterygii 
117571    Euteleostomi 
117570    Teleostomi 
7776      Gnathostomata 
7742      Vertebrata 
89593     Craniata 
7711      Chordata 
33511     Deuterostomia 
33316     Coelomata 
33213     Bilateria 
6072      Eumetazoa 
33208     Metazoa 
33154     Fungi/Metazoa group 
2759      Eukaryota 
131567    cellular organisms 
1         root 
//
A web site that automatically generates the phylogenetic lineage of any entered species (listed in taxonomy.dat) is available:

http://www.julesberman.info/post.htm



Last modified: January 20, 2010
