Here is an outline of the overall program flow.
Read the ranks file and use it to
build an instance of the Hier class.
Create an empty Txny instance to hold
the taxonomic tree. Create an empty AbTab instance to hold the symbol table for
form codes.
Read the .std file. Each line in this
file represents at least one taxon, and if the rank
of that taxon is not found in the Hier instance,
we can ignore it.
Although there are no lines representing genera, the first species in each genus is the line that causes creation of a genus-level taxon in the taxonomic tree, so such a line effectively defines two taxa.
If the tree is empty, this must be the first line
of the .std file; add its taxon as
the new root of the taxonomic tree.
If the tree is not empty, find the parent taxon of the current line. To do this, traverse the tree from the root, always choosing the last child of each taxon, until we reach a taxon that has no children, or whose children are at the same taxonomic level as the new taxon to be added.
Having found the appropriate parent taxon, add the new taxon (or taxa, in the case of a new genus) as that parent's new last child.
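The parent search described above can be sketched as follows. This is only an illustration, not the program's actual code: the Taxon class, its depth attribute (0 for the highest rank, larger numbers for lower ranks), and the find_parent function are hypothetical names, and for simplicity the genus is added as a separate taxon rather than together with its first species.

```python
class Taxon:
    """Hypothetical tree node; the real Txny classes may differ."""
    def __init__(self, name, depth):
        self.name = name
        self.depth = depth      # assumed: 0 = highest rank, larger = lower
        self.children = []

def find_parent(root, new_depth):
    """Follow the last child at each level until reaching a taxon that
    has no children, or whose children are at the new taxon's level."""
    node = root
    while node.children and node.children[-1].depth != new_depth:
        node = node.children[-1]
    return node

# Build a small tree in file order, always appending as the last child.
root = Taxon("Aves", 0)
for name, depth in [
    ("Gaviiformes", 1),
    ("Gaviidae", 2),
    ("Gavia", 3),
    ("Gavia stellata", 4),
    ("Gavia arctica", 4),
    ("Podicipediformes", 1),
]:
    parent = find_parent(root, depth)
    parent.children.append(Taxon(name, depth))
```

Because the input is in taxonomic order, only the rightmost path of the tree is ever examined, so each insertion costs at most one step per rank.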
If the current line of the .std
file defines a species, add its form code to the
symbol table. The form code is derived from the
English name by applying the standard rules,
unless an explicit disambiguation code follows
the scientific name field on the line. The
symbol table logic will signal an error if this
code duplicates one already added.
Read the .alt file. Each line defines
one form code, which is added to the symbol table after
various error checks.
We check every entry in the symbol table for validity. Every referenced symbol must be defined. There must be no cycles in the “references” relation; that is, no cases where, for example, code A is the equivalent of code B which is the equivalent of code C which is the equivalent of code A.
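One way to perform both checks is sketched below. It assumes, purely for illustration, that the symbol table can be reduced to a dictionary mapping each form code to the single code it refers to, or to None if it refers to nothing; the real AbTab structure may be richer, and check_references is a hypothetical name.

```python
def check_references(refs):
    """Return (undefined, cyclic): the referenced codes that were never
    defined, and the codes that lie on a reference cycle."""
    undefined = sorted(
        {t for t in refs.values() if t is not None and t not in refs}
    )
    cyclic = []
    for start in sorted(refs):
        seen = {start}
        code = refs[start]
        # Follow the chain until it ends, leaves the table, or repeats.
        while code is not None and code in refs:
            if code in seen:
                if code == start:       # the chain came back to its origin
                    cyclic.append(start)
                break
            seen.add(code)
            code = refs[code]
    return undefined, cyclic

# Placeholder codes: A, B and C form a cycle; E refers to a missing code.
refs = {"A": "B", "B": "C", "C": "A", "D": None, "E": "ZZZZ", "F": "A"}
undefined, cyclic = check_references(refs)
```

Here F refers into the cycle but is not itself on it, so only A, B and C are reported as cyclic.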
Write the four product files. The .tre file is produced by a depth-first preorder
traversal of the taxonomic tree: the root taxon
first, then its first child, then its first child's
first child, and so forth.
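A depth-first preorder traversal is short to express as a recursive generator. The sketch below is illustrative only: the Taxon class and the indented output are assumptions, and the actual .tre line format is not shown here.

```python
class Taxon:
    """Hypothetical tree node holding a name and an ordered child list."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

def preorder(taxon, depth=0):
    """Yield (depth, taxon) for the taxon itself, then for each subtree
    in child order: parent first, then first child, and so on."""
    yield depth, taxon
    for child in taxon.children:
        yield from preorder(child, depth + 1)

tree = Taxon("root", [
    Taxon("a", [Taxon("a1"), Taxon("a2")]),
    Taxon("b"),
])
# Indentation stands in for whatever the real output format uses.
lines = ["  " * depth + taxon.name for depth, taxon in preorder(tree)]
```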
A single pass through the symbol table is sufficient
to produce both the .ab6 and .col files. We visit the entries in
ascending order by form code, writing a line to the
.ab6 file if the entry is a valid code, or
writing a line to the .col file if the entry
is a collision form.
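The single pass might look like the following sketch. The table layout (a dictionary from form code to a kind and a text field), the "valid" and "collision" labels, the placeholder codes, and the output line format are all assumptions made for illustration; they are not the program's actual data structures or file formats.

```python
import io

def write_codes(table, ab6_file, col_file):
    """Visit entries in ascending code order, routing each one to the
    file that matches its kind."""
    for code in sorted(table):
        kind, text = table[code]
        if kind == "valid":
            print(f"{code} {text}", file=ab6_file)
        elif kind == "collision":
            print(f"{code} {text}", file=col_file)

# In-memory files stand in for the real .ab6 and .col outputs.
table = {
    "CCCCCC": ("valid", "third name"),
    "AAAAAA": ("valid", "first name"),
    "BBBBBB": ("collision", "ambiguous"),
}
ab6, col = io.StringIO(), io.StringIO()
write_codes(table, ab6, col)
```

Sorting once and dispatching on the entry kind keeps both output files in code order without a second traversal.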
The XML output file is generated by similar techniques.
The four major sections of this file correspond to the
Hier instance, the taxonomic tree, the
valid entries in the symbol table, and the collision
entries in the symbol table. For the XML generation
technology, refer to Python XML processing with lxml.