John W. Shipman, john@nmt.edu
Zoological Data Processing
507 Fitch Avenue NW
Socorro, NM 87801
(505) 835-0235
Homepage: http://www.nmt.edu/~shipman
This document describes a system for representing bird phylogenies, that is, taxonomic arrangements of bird types, as computer files.
A taxonomic arrangement is represented as a set of three text files: a ranks file, a standard forms file (f.std), and an alternate forms file (f.alt).
The next sections describe the format of these raw files. A final section describes a program to combine these files into a set of product files representing the entire taxonomy and all the codes.
The AOU Check-List defines many more taxonomic ranks than most applications need, so the ranks file lets the application specify which ranks are of interest. To prepare this file, use a text editor to list the ranks in descending order, starting with the rank of the root taxon of the arrangement.
Each line defines one rank. Enter these items in order:
Here is a sample ranks file; the "_" character represents a space.
c__1Class
o__2Order
f__2Family
-f?1Subfamily
g__2Genus
s__2Species
x__2Form
The numbers in this example allow for up to 99 orders per class, 99 families per order, 9 subfamilies per family, and so on.
The last three lines of this file use codes that do not actually appear in the standard forms file:
Applications that do not wish to track subspecific forms should use a version of the ranks file that does not contain the x line. Omitting the g and s lines is not recommended.
The order is important---always order the ranks from largest to smallest, as in the example above. The program doesn't know anything about taxonomic traditions. If you would like to create your own new ranks, like Infrasupertribes, go right ahead.
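The sample ranks file above can be read with a few lines of Python. This is only a sketch: the column layout (a two-character rank code, one flag character, one digit giving the key width, then the rank name) is inferred from the sample, and the names Rank and parse_ranks are hypothetical, not part of the system described here. The flag character is stored but not interpreted.

```python
from typing import List, NamedTuple


class Rank(NamedTuple):
    code: str    # two-character rank code, e.g. "c " or "-f"
    flag: str    # third character, stored as-is (meaning not assumed)
    digits: int  # number of key digits given for this rank
    name: str    # rank name, e.g. "Subfamily"


def parse_ranks(text: str) -> List[Rank]:
    """Parse ranks-file text, assuming the fixed layout inferred above."""
    ranks = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        ranks.append(Rank(line[0:2], line[2], int(line[3]), line[4:].strip()))
    return ranks


# The sample file from the text, with "_" written as real spaces.
SAMPLE = "\n".join([
    "c  1Class",
    "o  2Order",
    "f  2Family",
    "-f?1Subfamily",
    "g  2Genus",
    "s  2Species",
    "x  2Form",
])

ranks = parse_ranks(SAMPLE)

# A rank with n key digits can number at most 10**n - 1 siblings,
# since the all-zero value is used for "no taxon at this rank".
capacity = {r.name: 10 ** r.digits - 1 for r in ranks[1:]}
```

Under these assumptions, capacity reproduces the limits stated above: 99 orders per class, 99 families per order, and 9 subfamilies per family.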
The first input file you must prepare is the standard forms file. This file enumerates all the taxa defined in your preferred standard arrangement. Give this file a name of the form f.std where f is some name suggesting the name of the authority. For example, a file containing names from the AOU Check-List, 6th ed., including all supplements through the 40th, might be named aou640.std.
Place each taxon on a separate line in the standard forms file. The taxa appear in the order in which they are presented in a checklist. The highest taxon appears first, followed by the first contained taxon, and so on down to the first species. The remaining species in that genus follow; then come the other genera in the family, and so on.
Some taxa are defined implicitly. In particular, there is no separate line for genera, since species are identified by binomials---genera are declared implicitly by their first use in a binomial.
There are two types of record in the standard forms file:
Records in the standard forms file start with three fixed columns, with the remainder of the record in a variable-length format:
c_ Class
-c Subclass
+o Superorder
o_ Order
-o Suborder
+f Superfamily
f_ Family
-f Subfamily
t_ Tribe
__ Species
The exact structure of the ``tail'' of the record (that is, the variable-length part that follows the first three columns) depends on whether the record describes a higher taxon or a species.
Place each higher taxon on a separate line, following these steps:
Here are some examples of higher-taxon records:
c Aves/Birds
-c Neornithes/True Birds
+o Neognathae/Typical Birds
o Gaviiformes/Loons
f Gaviidae/Loons
-o Pelecani/Boobies, Pelicans, Cormorants and Darters
For each species record, enter these fields on one line:
Here are some examples of species lines. The last two show the disambiguation of the collision for code BLAWAR.
___Anas strepera/Gadwall
___Anas penelope/Wigeon, Eurasian
___Haliaeetus pelagicus/Sea-Eagle, Steller's
__+Camptorhynchus labradorius/Labrador Duck
__?Aerodramus vanikorensis/Gray Swiftlet
___Cygnus (Olor) buccinator/Trumpeter Swan
___Cygnus (Cygnus) olor/Swan, Mute
___Dendroica fusca/Warbler, Blackburnian/BKBWAR
___Dendroica striata/Blackpoll Warbler/BKPWAR
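As an illustration, the species lines above can be split into their parts as follows. This is a sketch based only on the examples shown: it assumes three fixed leading columns (the third is kept as an uninterpreted flag), then a scientific name with an optional subgenus in parentheses, a slash, the English name, and an optional slash plus substitute code. The names Species, parse_species, and genera_in_order are hypothetical. The second function shows how genera can be collected implicitly from their first use in a binomial.

```python
import re
from typing import List, NamedTuple, Optional


class Species(NamedTuple):
    flag: str                # third fixed column, stored uninterpreted
    genus: str
    subgenus: Optional[str]  # e.g. "Olor" in "Cygnus (Olor) buccinator"
    epithet: str
    english: str
    code: Optional[str]      # explicit substitute code, if given

# Genus, optional "(Subgenus)", then the specific epithet.
_SCI = re.compile(r"^(\S+)(?: \((\S+)\))? (\S+)$")


def parse_species(line: str) -> Species:
    """Parse one species line; "_" in the first three columns means a space."""
    prefix = line[:3].replace("_", " ")
    parts = line[3:].split("/")
    if len(parts) < 2:
        raise ValueError("expected scientific/English names: %r" % line)
    match = _SCI.match(parts[0].strip())
    if match is None:
        raise ValueError("unrecognized scientific name: %r" % parts[0])
    genus, subgenus, epithet = match.groups()
    code = parts[2].strip() if len(parts) > 2 else None
    return Species(prefix[2], genus, subgenus, epithet, parts[1].strip(), code)


def genera_in_order(lines: List[str]) -> List[str]:
    """Return each genus once, in order of its first use in a binomial."""
    seen = {}
    for line in lines:
        seen.setdefault(parse_species(line).genus, None)
    return list(seen)
```

For example, parse_species("___Cygnus (Olor) buccinator/Trumpeter Swan") yields genus Cygnus, subgenus Olor, and epithet buccinator, with no explicit code.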
Because field records do not always use the latest names, and because the reported forms are not always standard species, you must prepare an ``alternate forms'' file enumerating all the forms that have a six-letter code but which are not standard species names.
You must prepare an .alt file for each .std file, reflecting the exact lumps, splits, and names of the standard arrangement. The file must be named f.alt, where f is the same prefix as that of the .std file.
For example, if the standard file for the AOU Check-List, 6th ed., including supplements through the 40th, is called aou640.std, the corresponding alternate names file must be called aou640.alt.
In the .alt file you will place several different types of records. Each line starts with the six-letter code being defined, followed by a record type code, and a variable length tail.
For each form above species rank in the hierarchy, enter a line of this format:
In the optional TeX name field, two TeX macros are used:
\def\sp#1{\itc{#1}\ sp.}%
\def\itc#1{{\it #1\/}}%
Here are some complete examples of higher-taxon records.
albatr Diomedeidae/albatross sp.
accipi Accipiter/Accipiter sp./\sp{Accipiter}
laracc Accipiter/large Accipiter sp./large \itc{Accipiter}\ sp.
For each non-standard code that is the exact equivalent of a standard code, create a record in the alternate forms file with this format:
Examples of direct-equivalent records:
amboys=blkoys Oystercatcher, American Black
amewid=amewig Widgeon, American
watpip=amepip Pipit, Water
Note: the form after the equal sign must be defined elsewhere in the standard or alternate forms file.
There are several reasons for assigning codes to forms that are a subset of a standard species:
So we use the term ``subspecific form'' loosely, to mean any identifiable form that refers to some subset of a standard species. For each such code, enter a line with this format:
Examples of subspecific form lines:
agpchi<grpchi Attwater's Greater Prairie-Chicken
agwtea<gnwtea Teal, American Green-winged
alcgoo<cangoo (Aleutian) Canada Goose
axetea<gnwtea teal, (American x European) Green-winged
blugoo<snogoo Blue Goose
branth<brant Brant (hrota)/Brant (\itc{hrota})
In order to record all the known collisions---that is, cases where two or more names encode to the same six-letter abbreviation according to the rules for abbreviation formation---you must add to the alternate forms file one line for each collision. Each such line enumerates all the disambiguations, that is, the substitute form codes that are preferred:
Examples of collision records:
barowl?brdowl:cobowl
belspa?bldspa:bllspa
columb?colba :colbid:colbin
The first example shows that two names collide for the code barowl. The forms are Barred Owl (which is given the substitute code brdowl in the standard forms file) and Barn Owl, which is an obsolete name that is equivalent to cobowl, the code for Common Barn-Owl. The last example shows a three-way collision for code columb between the codes for Columba, family Columbidae, and subfamily Columbinae. Note that a collision record may refer to forms other than standard taxa.
The substitute codes referred to may be defined elsewhere in the .alt file, or defined implicitly in the .std file.
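The record types in the .alt file can be told apart by the character that follows the six-letter code. The sketch below is inferred from the examples in this section, not from a formal specification: it treats "=" as a direct equivalent, "<" as a subspecific form, "?" as a collision, and anything else as a higher-taxon record. It also assumes that the referenced code in "=" and "<" records ends at the first whitespace. AltRecord and parse_alt are hypothetical names.

```python
from typing import NamedTuple, Tuple


class AltRecord(NamedTuple):
    code: str                 # the six-letter code being defined
    kind: str                 # "higher", "equivalent", "subform", "collision"
    targets: Tuple[str, ...]  # codes this record refers to, if any
    tail: str                 # remaining name text, left unparsed

_KINDS = {"=": "equivalent", "<": "subform", "?": "collision"}


def parse_alt(line: str) -> AltRecord:
    """Classify one .alt line by the type character in column 7."""
    code = line[:6].strip()
    kind = _KINDS.get(line[6:7], "higher")
    rest = line[7:]
    if kind == "collision":
        # Substitute codes are separated by ":" and may be space-padded.
        targets = tuple(c.strip() for c in rest.split(":"))
        return AltRecord(code, kind, targets, "")
    if kind == "higher":
        return AltRecord(code, kind, (), rest.strip())
    # "=" and "<" records: referenced code, whitespace, then the name.
    pieces = rest.split(None, 1)
    target = pieces[0] if pieces else ""
    tail = pieces[1].strip() if len(pieces) > 1 else ""
    return AltRecord(code, kind, (target,), tail)
```

A caller could use the targets field to check the rule noted earlier, that every referenced code must be defined somewhere in the standard or alternate forms file.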
Once you have prepared all the input files, you can compile them into a set of standard product files. These product files are all ``flat files'' that give the same information in a form more immediately usable in database applications.
The nombuild program checks the various input files and compiles them into a set of standard product files (described below).
To run this program, change to the directory containing all the input files and type the command:
nombuild
If there are any problems with the input files, the program will
produce error messages on the standard output stream, and also
produce a duplicate listing of these errors in file nombuild.log.
If there are no problems, all the product files will be written. These files are:
The tree file defines all the different scientific names used in the input. Here is the format of that file:
Dunlin
Loon, Red-throated
grebe sp.
bird sp.
bird, large sp.
teal, Blue-winged x Cinnamon
Junco, (Gray-headed x Slate-colored) Dark-Eyed
The taxonomic key number can be used to sort records into taxonomic order. It contains one or more digits for each rank (except for the root rank). The number of digits for each rank is determined by the third column in the ranks file.
For example, if your ranks file looks like the example given above (2-digit order, 2-digit family, 1-digit subfamily, 2-digit genus, 2-digit species, and 2-digit form), each taxonomic key number would have these components:
For example, code daejun (Dark-eyed Junco) might have a taxonomic key number of 21 24 3 47 01 00 (the spaces here are for clarity---they are not actually present in the record). This key would mean that this form is in the 21st order, in the 24th family within that order, in the 3rd subfamily within that family, in the 47th genus within that subfamily, and that it is the first species within that genus.
Other forms that are included within Dark-eyed Junco will have keys 21 24 3 47 01 01, 21 24 3 47 01 02, and so on. Examples of such forms include races such as Gray-headed Junco, hybrids among the different races (e.g., ``Gray-headed x Slate-colored Junco''), and obsolete names (``Northern Junco'').
Note that the taxonomic key number can be used to deduce relationships between form codes. For example, to find out what genus a species is in, just construct a key number that is the same as the species' key number, but with its species number set to 00. Continuing the example above, suppose Gray-headed Junco has this key number:
21 24 3 47 01 01
Then we can deduce all the higher ranks by substituting zeroes
in the appropriate fields:
21 24 3 47 01 00 is the containing species, Junco hyemalis
21 24 3 47 00 00 is the containing genus, Junco
21 24 3 00 00 00 is the containing subfamily, Emberizinae
21 24 0 00 00 00 is the containing family, Emberizidae
21 00 0 00 00 00 is the containing order, Passeriformes
00 00 0 00 00 00 is the containing class, Aves
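The zero-substitution rule above is easy to express in code. The following sketch assumes the field widths of the sample ranks file (2, 2, 1, 2, 2, 2) and a key stored as a plain digit string with no spaces. The names WIDTHS, split_key, and ancestors are hypothetical.

```python
from typing import List

# Key digits per rank below the root: order, family, subfamily,
# genus, species, form (taken from the sample ranks file).
WIDTHS = [2, 2, 1, 2, 2, 2]


def split_key(key: str, widths: List[int] = WIDTHS) -> List[str]:
    """Cut a taxonomic key number into one digit string per rank."""
    fields, pos = [], 0
    for width in widths:
        fields.append(key[pos:pos + width])
        pos += width
    return fields


def ancestors(key: str, widths: List[int] = WIDTHS) -> List[str]:
    """Return the keys of all containing taxa, nearest first.

    Working from the lowest rank upward, each nonzero field is zeroed
    in turn; fields that are already zero are skipped.
    """
    fields = split_key(key, widths)
    result = []
    for i in range(len(fields) - 1, -1, -1):
        if int(fields[i]) != 0:
            fields[i] = "0" * widths[i]
            result.append("".join(fields))
    return result
```

For the Gray-headed Junco key 21243470101, ancestors returns the six keys listed above, ending with the all-zero key of the root class.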
The .ab6 file defines all the six-letter bird abbreviations. Each abbreviation is specified by its taxon field, which is a link to the tree file. Fields are:
Here are examples of lines from an .ab6 file:
CACGOOBranta canadensis 2 Cackling Goose
CALLINCarpodacus mexicanus California Linnet
The first is for code CACGOO, derived from the name ``Cackling Goose,'' and it is the second subspecific form for Branta canadensis, the Canada Goose. The second line is for code CALLIN, derived from the name ``California Linnet,'' an alternate name for House Finch.
The .col file enumerates all the six-letter form codes that are involved in collisions. Each line has this format:
Here is an example showing three records from a .col file. These three lines document the collision between three names for code PASSER. The preferred substitute codes are PASINA (for Passerina), PASINE (for ``passerine''), and PASR (for Passer):
PASSERPASINA
PASSERPASINE
PASSERPASR__
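A reader for the .col file might look like the sketch below. It assumes, from the three sample records only, that each line holds the colliding code in columns 1-6 and one substitute code in columns 7-12, padded with spaces (shown as "_" in the example). The function name parse_col is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_col(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group substitute codes by the colliding code they disambiguate."""
    substitutes = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        collided = line[:6]
        # Accept either real spaces or the "_" placeholder as padding.
        substitute = line[6:12].replace("_", " ").strip()
        substitutes[collided].append(substitute)
    return dict(substitutes)


sample = ["PASSERPASINA", "PASSERPASINE", "PASSERPASR__"]
collisions = parse_col(sample)
```

Here collisions maps PASSER to the list PASINA, PASINE, PASR, so an application can prompt for a preferred substitute whenever an ambiguous code is entered.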