The analyses with the greatest number of most parsimonious trees are those with the largest matrices, suggesting that the effects of missing data are most pronounced in large data sets. Matrix size can be reduced by employing arbitrary “taxon cutoffs” (Rowe 1988) or “character cutoffs” (de Queiroz & Wimberger 1993). Culled taxa can then be “re-inserted” into the cladogram using synapomorphies discovered in the analysis of more complete taxa (Grande & Bemis 1998; Wilson 2002). These measures equate missing data with uncertainty, but anatomical completeness is not an index of phylogenetic informativeness: a taxon or character with large amounts of missing data can be phylogenetically informative (Wilkinson 1995), and lack of resolution may indicate character conflict (Kearney & Clark 2003). Although we agree that arbitrary completeness thresholds may remove important data from an analysis, not every specimen attributed to the ingroup can be included. We will restrict our analysis to species that are well-diagnosed; that is, those taxa known from sufficient material to be robustly differentiated from other species. Although subjective, this criterion is not arbitrary; it will require detailed alpha-level information about each species and will establish a morphological threshold for terminal taxon selection. We will add to the species listed in Table 1 during the course of our study.
We will also profile our data matrix to identify the taxa and character types most affected by missing data. We will distinguish missing data entries (?) in our matrix from entries that are polymorphic (P), of unresolved homology (*), inapplicable (N), or inaccessible (X) (Grande & Bemis 1998). Missing entries will be tallied for each terminal taxon and each character and then separated by anatomical region. This type of profiling will allow us to bolster character coverage in a particular region or to preferentially include terminal taxa that preserve key anatomical regions.
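The tallying step described above can be illustrated with a minimal sketch. The toy matrix, the taxon names, the region labels, and the function name `profile_missing` are all hypothetical and serve only to show the bookkeeping; the entry codes follow those given in the text.

```python
from collections import Counter

# Hypothetical toy matrix: rows are terminal taxa, columns are characters.
# Codes follow the text: "?" missing, "P" polymorphic, "*" unresolved homology,
# "N" inapplicable, "X" inaccessible; "0" and "1" are observed states.
MATRIX = {
    "Taxon_A": ["0", "1", "?", "N", "1", "?"],
    "Taxon_B": ["?", "?", "1", "0", "X", "?"],
    "Taxon_C": ["1", "P", "0", "*", "0", "1"],
}

# Hypothetical assignment of each character (column) to an anatomical region.
REGIONS = ["skull", "skull", "axial", "axial", "limb", "limb"]


def profile_missing(matrix, regions, code="?"):
    """Tally entries equal to `code` by taxon, by character index, and by region."""
    by_taxon = Counter()
    by_character = Counter()
    by_region = Counter()
    for taxon, row in matrix.items():
        if len(row) != len(regions):
            raise ValueError(f"{taxon}: row length does not match region list")
        for idx, entry in enumerate(row):
            if entry == code:
                by_taxon[taxon] += 1
                by_character[idx] += 1
                by_region[regions[idx]] += 1
    return by_taxon, by_character, by_region


by_taxon, by_character, by_region = profile_missing(MATRIX, REGIONS)
```

Passing a different `code` (for example `"N"` or `"X"`) would tally the other entry types separately, which keeps true missing data distinct from inapplicable or inaccessible cells.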
We will attempt to identify nodes that may be susceptible to the effects of missing data using MERDA (Norell & Wheeler 2003). This method replicates analyses with missing data cells replaced by observed values, allowing the user to examine the universe of possible outcomes. The relative frequency of a particular clade in replicate trials yields an index of its robustness.
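The replacement-and-tally logic can be sketched as follows. This is an illustration under stated assumptions, not the published MERDA implementation: it assumes that each missing cell is filled by a random draw from the states observed for that character in other taxa, and it leaves the tree search for each replicate to an external parsimony program. The function names and the representation of a tree as a set of clades are hypothetical.

```python
import random
from collections import Counter

# Entry codes that are not observed states (codes as given in the text).
NON_STATES = {"?", "P", "*", "N", "X"}


def fill_missing(matrix, rng, missing="?"):
    """Return a copy of `matrix` with each missing cell replaced by a state
    drawn at random from those observed for the same character (assumption)."""
    n_chars = len(next(iter(matrix.values())))
    filled = {taxon: list(row) for taxon, row in matrix.items()}
    for idx in range(n_chars):
        observed = [row[idx] for row in matrix.values()
                    if row[idx] not in NON_STATES]
        if not observed:
            continue  # no observed state to draw from; leave cells as missing
        for row in filled.values():
            if row[idx] == missing:
                row[idx] = rng.choice(observed)
    return filled


def clade_frequencies(replicate_trees):
    """Relative frequency of each clade across replicates.

    Each replicate tree is given as an iterable of clades, and each clade
    is a frozenset of taxon names (hypothetical representation)."""
    counts = Counter()
    for clades in replicate_trees:
        counts.update(set(clades))
    total = len(replicate_trees)
    return {clade: n / total for clade, n in counts.items()}
```

In use, each matrix returned by `fill_missing` would be analyzed separately, the clades of each resulting tree collected, and the full list passed to `clade_frequencies`; a clade recovered in most replicates would be read as relatively insensitive to the missing entries.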