Ruby, Perl and Python scripts
Extracting a phylogenetic hierarchy for each individual organism listed
in the European Bioinformatics Institute data file, taxonomy.dat
Taxonomy.dat is a large, publicly available list of organisms
distributed by the European Bioinformatics Institute (EBI).
It contains over 400,000 species:
[A sample record in Taxonomy.dat]
[ID : 50]
[PARENT ID : 49]
[RANK: genus]
[GC ID : 11]
[SCIENTIFIC NAME : Chondromyces]
[SYNONYM : Polycephalum]
[SYNONYM : Myxobotrys]
[SYNONYM : Chondromyces Berkeley and Curtis 1874]
[SYNONYM : "Polycephalum" Kalchbrenner and Cooke 1880]
[SYNONYM : "Myxobotrys" Zukal 1896]
[MISSPELLING : Chrondromyces]
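As a rough illustration of how a record like the one above might be read, here is a minimal Python sketch that splits each line at its first colon and collects repeated fields (such as SYNONYM) into lists. The function name parse_record is hypothetical and is not part of the scripts on this page. The sample text is an abbreviated copy of the record above, typed without the surrounding brackets, and it assumes each record ends with a "//" line; spacing in the real file may differ.

```python
def parse_record(text):
    """Turn one taxonomy.dat-style record into a dict of field -> list of values.

    parse_record is a hypothetical helper, used here for illustration only.
    """
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank lines and the "//" record terminator
        key, value = line.split(":", 1)  # split at the first colon only
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


# Abbreviated copy of the sample record (assumed layout, brackets removed).
sample = """ID : 50
PARENT ID : 49
RANK : genus
GC ID : 11
SCIENTIFIC NAME : Chondromyces
SYNONYM : Polycephalum
SYNONYM : Myxobotrys
MISSPELLING : Chrondromyces
//"""

record = parse_record(sample)
print(record["ID"][0], record["PARENT ID"][0], record["SCIENTIFIC NAME"][0])
print(record["SYNONYM"])
```

Every value is stored in a list, so single-valued fields such as ID are read with index 0.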
The taxonomy.dat file exceeds 100 megabytes in length.
It is available for public download through anonymous ftp:
[ftp://ftp.ebi.ac.uk/pub/databases/taxonomy/]
Information about the taxonomy.dat file is found at:
[http://www.ebi.ac.uk/msd-srv/docs/dbdoc/ref_taxonomy.html]
Notice that the sample entry (above) provides an ID number for
the entry organism, Chondromyces, and for its parent class.
Since every organism and class has a parent, you can write a
script that reconstructs the full phylogenetic lineage for any
entry in taxonomy.dat.
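The idea can be sketched in a few lines of Python: given one table mapping each ID to its parent ID and another mapping each ID to its name, keep following the parent links until an ID is unknown. This is only a sketch under stated assumptions. The function name lineage, its max_depth parameter and the two small tables are hypothetical; the IDs and names are the first four entries of the Homo sapiens sample output shown further down this page, so the chain is deliberately cut short at Hominidae.

```python
def lineage(taxon_id, parent_of, name_of, max_depth=100):
    """Return the names from taxon_id upward, following parent links.

    Hypothetical helper for illustration. Stops when an ID has no name,
    when an ID repeats (an entry that is its own parent), or at max_depth.
    """
    names = []
    seen = set()
    while taxon_id in name_of and taxon_id not in seen and len(names) < max_depth:
        names.append(name_of[taxon_id])
        seen.add(taxon_id)
        taxon_id = parent_of.get(taxon_id)  # None when no parent is recorded
    return names


# Toy tables: a truncated excerpt of the Homo sapiens lineage.
parent_of = {"9606": "9605", "9605": "207598", "207598": "9604", "9604": "314295"}
name_of = {
    "9606": "Homo sapiens",
    "9605": "Homo",
    "207598": "Homo/Pan/Gorilla group",
    "9604": "Hominidae",
}

for name in lineage("9606", parent_of, name_of):
    print(name)
```

The set of visited IDs is a design choice that guards against a cycle in the parent links, so the walk ends even if an entry lists itself as its parent.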
On this web page, I include equivalent Ruby, Perl and Python scripts
that compute the full ancestral lineage for each organism included in
taxonomy.dat. Each script makes two passes over the file. The first
pass builds a hash of all the child-parent relationships. The second
pass re-parses the file and builds the phylogenetic lineage of each
organism from the child-parent hash built in the first pass.
The software was created by Jules J. Berman on April 10, 2008.
The software is provided "as is", without warranty of any kind,
express or implied, including but not limited to the warranties
of merchantability, fitness for a particular purpose and
noninfringement. In no event shall the authors or copyright
holders be liable for any claim, damages or other liability,
whether in an action of contract, tort or otherwise, arising
from, out of or in connection with the software or the use or
other dealings in the software.
Here is the Ruby script:
#!/usr/local/bin/ruby
# First pass: record each entry's parent ID and scientific name, keyed by ID.
intext = File.open("taxonomy.dat", "r")
outtext = File.open("taxo.txt", "w")
parenthash = Hash.new()
namehash = Hash.new()
intext.each_line("//") do |line|
  line =~ /\nID\s+\:\s*([0-9]+)\s*\n/
  child_id = $1
  line =~ /\nPARENT ID\s+\:\s*([0-9]+)\s*\n/
  parent_id = $1
  parenthash[child_id] = parent_id
  line =~ /\nSCIENTIFIC NAME\s+\:\s*([^\n]+)\s*\n/
  scientific_name = $1
  namehash[child_id] = scientific_name
end
intext.close
# Second pass: copy each record, then list its ancestors under "HIERARCHY".
intext = File.open("taxonomy.dat", "r")
intext.each_line("//") do |line|
  getline = line
  getline.sub!(/\/\//,"")
  outtext.puts(getline, "HIERARCHY")
  line =~ /\nID\s+\:\s*([0-9]+)\s*\n/
  id_name = $1
  # Climb at most 30 levels, stopping when the next ancestor has no name.
  (1..30).each do
    outtext.puts(namehash[id_name])
    id_name = parenthash[id_name]
    break if namehash[id_name].nil?
  end
  outtext.print("//")
end
intext.close
outtext.close
exit
Here is the equivalent Perl script:
#!/usr/local/bin/perl
# First pass: record each entry's parent ID and scientific name, keyed by ID.
open(TAXO, "taxonomy.dat") or die "Cannot open taxonomy.dat: $!";
open(OUT, ">taxo.txt") or die "Cannot open taxo.txt: $!";
$/ = "//";    # read one "//"-terminated record at a time
while (defined($line = <TAXO>))
  {
  $line =~ /\nID +\: *([0-9]+) *\n/;
  $id_name = $1;
  $line =~ /\nPARENT ID +\: *([0-9]+) *\n/;
  $parent_id_name = $1;
  $parenthash{$id_name} = $parent_id_name;
  $line =~ /\nSCIENTIFIC NAME +\: *([^\n]+) *\n/;
  $scientific_name = $1;
  $namehash{$id_name} = $scientific_name;
  }
close(TAXO);
# Second pass: copy each record, then list its ancestors under "HIERARCHY".
open(TAXO, "taxonomy.dat") or die "Cannot open taxonomy.dat: $!";
while (defined($line = <TAXO>))
  {
  $getline = $line;
  $getline =~ s/\/\///o;
  print OUT $getline . "HIERARCHY\n";
  $line =~ /\nID +\: *([0-9]+) *\n/;
  $id_name = $1;
  # Climb at most 30 levels, stopping when the next ancestor is "root".
  for(1..30)
    {
    print OUT "$namehash{$id_name}\n";
    $id_name = $parenthash{$id_name};
    last if ($namehash{$id_name} eq "root");
    }
  print OUT "//";
  }
close(TAXO);
close(OUT);
exit;
Here is the equivalent Python script:
#!/usr/local/bin/python
import re
# First pass: record each entry's parent ID and scientific name, keyed by ID.
intext = open("taxonomy.dat", "r")
outtext = open("taxo.txt", "w")
parenthash = {}
namehash = {}
cum_line = ""
childnumber = ""
parentnumber = ""
child_match = re.compile(r'ID\s+\:\s*(\d+)\s*')
parent_match = re.compile(r'PARENT ID\s+\:\s*(\d+)\s*')
name_match = re.compile(r'SCIENTIFIC NAME\s+\:\s*([^\n]+)\s*')
end_match = re.compile(r'\/\/')
for line in intext:
    p = end_match.search(line)
    if p:
        # "//" ends a record; cum_line holds the lines accumulated so far.
        m = child_match.search(cum_line)
        if m:
            childnumber = m.group(1)
        x = parent_match.search(cum_line)
        if x:
            parentnumber = x.group(1)
            parenthash[childnumber] = parentnumber
        y = name_match.search(cum_line)
        if y:
            scientific_name = y.group(1)
            namehash[childnumber] = scientific_name
        cum_line = ""
        continue
    else:
        cum_line = cum_line + line
cum_line = ""
intext.close()
# Second pass: copy each record, then list its ancestors under "HIERARCHY".
intext = open("taxonomy.dat", "r")
for line in intext:
    p = end_match.search(line)
    if p:
        outtext.write(cum_line + "HIERARCHY\n")
        z = child_match.search(cum_line)
        id_name = z.group(1) if z else None
        # Climb at most 30 levels, stopping once an ID has no recorded name.
        for i in range(30):
            if id_name not in namehash:
                break
            outtext.write(namehash[id_name] + "\n")
            parent = parenthash.get(id_name)
            if parent == id_name:
                break  # an entry listed as its own parent would repeat
            id_name = parent
        outtext.write("//\n")
        cum_line = ""
        continue
    else:
        cum_line = cum_line + line
cum_line = ""
intext.close()
outtext.close()
These scripts produce an output file, taxo.txt, that exceeds 224 megabytes
in length. The output consists of the taxonomic entries from taxonomy.dat,
along with the phylogenetic lineage for each organism.
It takes under a minute to execute these scripts on a desktop
computer running at 2.6 GHz with 512 MB of RAM.
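Because the output file is large, it may be convenient to read it one entry at a time rather than loading it whole. The following Python sketch does that, assuming the layout the scripts write: the original record lines, a line reading HIERARCHY, one ancestor name per line, and a closing "//" line. The function name read_entries is hypothetical, and the demonstration text is a small hand-made excerpt with shortened lineages, not real output.

```python
import io


def read_entries(handle):
    """Yield (record_lines, lineage_names) for each entry in a taxo.txt-style stream.

    Hypothetical helper; assumes "HIERARCHY" and "//" each sit on their own line.
    """
    record, names, in_hierarchy = [], [], False
    for raw in handle:
        line = raw.strip()
        if line == "//":                 # end of the current entry
            if record or names:
                yield record, names
            record, names, in_hierarchy = [], [], False
        elif line == "HIERARCHY":        # lineage names follow this marker
            in_hierarchy = True
        elif line:
            (names if in_hierarchy else record).append(line)


# Hand-made excerpt with shortened lineages, for demonstration only.
demo_text = (
    "ID : 9605\nPARENT ID : 207598\nSCIENTIFIC NAME : Homo\n"
    "HIERARCHY\nHomo\nHomo/Pan/Gorilla group\nHominidae\n//\n"
    "ID : 207598\nPARENT ID : 9604\nSCIENTIFIC NAME : Homo/Pan/Gorilla group\n"
    "HIERARCHY\nHomo/Pan/Gorilla group\nHominidae\n//\n"
)

for record, names in read_entries(io.StringIO(demo_text)):
    print(record[0], "->", " > ".join(names))
```

To read the real file, the io.StringIO object would be replaced by an open file handle for taxo.txt.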
Sample output phylogenetic hierarchy for Homo sapiens:
9606 Homo sapiens
9605 Homo
207598 Homo/Pan/Gorilla group
9604 Hominidae
314295 Hominoidea
9526 Catarrhini
314293 Simiiformes
376913 Haplorrhini
9443 Primates
314146 Euarchontoglires
9347 Eutheria
32525 Theria
40674 Mammalia
32524 Amniota
32523 Tetrapoda
8287 Sarcopterygii
117571 Euteleostomi
117570 Teleostomi
7776 Gnathostomata
7742 Vertebrata
89593 Craniata
7711 Chordata
33511 Deuterostomia
33316 Coelomata
33213 Bilateria
6072 Eumetazoa
33208 Metazoa
33154 Fungi/Metazoa group
2759 Eukaryota
131567 cellular organisms
1 root
//
A web site that automatically generates the phylogenetic lineage of any entered
species (listed in taxonomy.dat) is available:
http://www.julesberman.info/post.htm
key words: ruby programming, perl programming, python programming, bioinformatics, taxonomy, phylogeny,
ancestral lineage, class hierarchy, tree of life, nomenclature, species, jules berman
Last modified: January 20, 2010