Current support in some applications and data standards

■ Support in DELTA and New DELTA is identical and has already been discussed above.

■ NEXUS (p.18) does not support character dependency. As mentioned above (p.76), methods of phylogenetic inference such as maximum parsimony or maximum likelihood assume that characters are independently and identically distributed. In practice it is known that this requirement is almost never fulfilled and that statistical inferences (e.g., bootstrapping, Felsenstein 1985) will yield only approximately correct results. However, the kind of absolute character dependency discussed here as character applicability does not seem to be foreseen in NEXUS.

■ DiversityDescriptions supports the DELTA “Applicable” and “Inapplicable Characters” directives in import, but internally converts them to inapplicable-if rules (p.329). As discussed above, this causes problems in the context of character evolution if states are added to a character for which the rule would most logically be expressed as an applicable-if rule.

■ CBIT Lucid (p.21) supports a similar model to DELTA.

■ SDD (p.20) supports both applicable-if and inapplicable-if rules. A major innovation in SDD is that the dependencies may be defined as part of normal concept hierarchies (i.e., character trees). They are inherited down the concept hierarchy tree, affecting all characters below a given node (i.e., directly or indirectly through other nodes). The advantage of this design is that whereas in other models the rules are anonymous and often difficult to maintain during the evolution of the terminology, in SDD they are attached to labeled nodes, which in themselves have a logical hierarchy. In addition, these nodes often already exist. For example, in a compositional concept hierarchy the leaf-node will be a natural place for leaf-dependency rules.

SDD up to version 1.1 does not yet support quantitative controlling characters.

In all formats and applications controlling characters must be categorical (i.e., quantitative counts cannot be controlling).
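The inheritance of dependency rules down a concept hierarchy can be illustrated with a minimal Python sketch. The class names `Rule` and `Node`, the method `effective_rules`, and the example labels are hypothetical and chosen only for illustration; they are not taken from the SDD schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    """Hypothetical applicability rule attached to a concept node."""
    kind: str          # "applicable-if" or "inapplicable-if"
    controlling: str   # name of the controlling (categorical) character
    states: frozenset  # controlling states that trigger the rule


@dataclass
class Node:
    """A labeled node in a concept hierarchy (character tree)."""
    label: str
    parent: Optional["Node"] = None
    rules: list = field(default_factory=list)

    def effective_rules(self):
        """Collect the rules of this node and of all its ancestors."""
        node, collected = self, []
        while node is not None:
            collected.extend(node.rules)
            node = node.parent
        return collected


# The rule is stated once, at the labeled node "leaf" ...
leaf = Node("leaf", rules=[Rule("inapplicable-if", "leaves", frozenset({"absent"}))])
blade = Node("leaf blade", parent=leaf)
blade_shape = Node("leaf blade shape", parent=blade)

# ... and is inherited indirectly by every node below it.
inherited = blade_shape.effective_rules()
print([(r.kind, r.controlling, sorted(r.states)) for r in inherited])
```

Because the rule is stored only at the "leaf" node, renaming or restructuring the characters below that node does not require the rule itself to be edited.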

53. Character dependency definitions are important information items for data entry, character management, and analysis purposes.

54. A general form of value-dependency may predict values in another character for some (but not all) values of a controlling character. It may be desirable to implement this, but it has not been pursued in current models.

55. A special form of value-dependency is that some values predict the applicability of another character (character applicability rules). This is highly desirable and implemented in several descriptive models.

56. It is desirable that the controlling character may be of categorical or quantitative type. Current models only implement categorical controlling characters.

57. Because of character evolution issues (adding states to existing characters) and to improve the clarity of expression, both positive and negative character applicability rules (“Applicable-if”, “Inapplicable-if”) are desirable.

58. Combinations of applicable-if and inapplicable-if rules within a data set are desirable. However, any combination of controlling and controlled character may be covered only by one form of the rule.

59. Support for the evaluation of cascading character applicability rules is desirable. This may be expressed in specialized graph structures in the information model, but may also be supported only during evaluation of rules.

(Note: a related topic of method dependency will be discussed later on p.175.)
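Requirements 55 to 59 can be made concrete with a small Python sketch that evaluates both rule forms and cascades through chains of controlling characters at evaluation time. The function `is_applicable`, the rule encoding, and all character and state names are assumptions made for this illustration. Two simplifications are deliberate: each controlled character has at most one rule, and an unscored controlling character is treated as not excluding the controlled character.

```python
def is_applicable(character, description, rules, _seen=frozenset()):
    """Return True if `character` is applicable in `description`.

    `rules` maps a controlled character to a tuple
    (kind, controlling character, set of controlling states).
    """
    if character in _seen:
        raise ValueError(f"cyclic dependency at {character!r}")
    rule = rules.get(character)
    if rule is None:
        return True  # no rule: always applicable
    kind, controlling, states = rule
    # Cascade: an inapplicable controlling character makes this one inapplicable.
    if not is_applicable(controlling, description, rules, _seen | {character}):
        return False
    value = description.get(controlling)
    if value is None:
        return True  # assumption: unscored controlling character excludes nothing
    if kind == "applicable-if":
        return value in states
    return value not in states  # "inapplicable-if"


rules = {
    "leaf hairs": ("inapplicable-if", "leaves", {"absent"}),
    "hair color": ("applicable-if", "leaf hairs", {"present"}),
}
print(is_applicable("hair color", {"leaves": "absent"}, rules))

# Character evolution: both rule forms agree until a state is added.
negative = {"leaf shape": ("inapplicable-if", "leaves", {"absent"})}
positive = {"leaf shape": ("applicable-if", "leaves", {"present"})}
new_state = {"leaves": "scale-like"}  # hypothetical state added later
print(is_applicable("leaf shape", new_state, negative),
      is_applicable("leaf shape", new_state, positive))
```

The last two calls show why the choice of rule form matters: after a state is added to the controlling character, the two forms no longer give the same answer, so the form that expresses the intended logic should be the one stored.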

4.9. Raw data and data aggregation

Introduction

Before discussing various models to record structured descriptions, a final fundamental problem that is independent of the specific model is addressed: The level of data aggregation and the structure of the set of objects that is the subject of the description.

Data aggregation is used here to mean the process of representing multiple individual data points in a transformed, more compact form. Typical aggregation methods are statistical measures (mean, variance) for quantitative data and frequency histograms for categories.
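As a minimal illustration of these two kinds of aggregation, the following Python sketch summarizes a few measurements and categorical observations. All values and variable names are hypothetical.

```python
from collections import Counter
from statistics import mean, variance

# Hypothetical raw data: four leaves of one individual.
leaf_lengths_mm = [41.0, 44.5, 39.0, 47.5]
leaf_shapes = ["ovate", "ovate", "lanceolate", "ovate"]

# Quantitative data: statistical measures replace the individual values.
quantitative_summary = {
    "n": len(leaf_lengths_mm),
    "mean": mean(leaf_lengths_mm),
    "variance": variance(leaf_lengths_mm),  # sample variance (n - 1 denominator)
}

# Categorical data: a frequency histogram replaces the individual observations.
categorical_summary = Counter(leaf_shapes)

print(quantitative_summary)
print(categorical_summary)
```

In both cases four data points are reduced to a compact summary, and the individual values can no longer be recovered from it.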

Depending on perspective, these processes may also be viewed as “summarizing” (Macfarlane 1993a, White 1994, Hagedorn 1999a), “agglomerating” (Maxted & al. 1993), “collating”, “consolidating”, “aggregating” (three terms used frequently during SDD discussions, the last also in Diederich & al. 1997), or “abstracting”, “generalizing”, or “amalgamating” (all three by Pullan & al. 2005) data. The term “aggregation” was selected in SDD discussions as preferable. Pullan & al. (2005) combine the concept of data aggregation methods (where the number of data points changes) with the concept of transformations from quantitative to categorical data (where the number of data points remains constant, see “mappings” p.66) as “levels of abstraction in the description-building process”.

A major reason for data aggregation is the transformation from specimen to taxon descriptions. Lebbe & Vignes (1998) speak of a “double nature of a taxon as a concept and a set of instances”, which makes it possible to have contradictory statements (e.g., “present or absent”) in a taxon description. However, the step from the individual to the taxon is only one of many aggregation steps that may occur in descriptions. For example, even though a simple measurement such as the length or shape of a leaf can only be obtained by measuring a single leaf, values for leaf length or shape may be reported on the following data aggregation levels:

1. repeated measurements of a single leaf (i.e. concrete measurements);

2. multiple basal leaves of a single individual;

3. all basal and stem leaves of a single individual combined;

4. multiple individuals in a single population according to some scope (male individuals, juvenile individuals, etc.);

6. all individuals from a single population (or infraspecific taxon);

7. all sampled populations of a species (or infraspecific taxon);

8. all taxa classified in a higher taxon (creating, e.g., a genus description).

Clearly many more combinations of these aggregation criteria are possible (all male juveniles of a species in Germany…); a fixed sequence of aggregation levels therefore does not seem desirable.

The problem of multiple instances of a part in individuals and aggregation at populations is also noted by Diederich & al. (1997) and Pullan & al. (2005). Ideally an information model for descriptive data should be flexible enough to allow entering both repeated sample data and aggregated data at multiple such levels.
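One way to avoid a fixed sequence of aggregation levels is to express each level as a scope (a filter) combined with a grouping key. The following Python sketch illustrates this under stated assumptions: the record layout, the function `aggregate`, and all values are hypothetical and serve only to show that the levels listed above are instances of a single operation.

```python
from statistics import mean

# Hypothetical raw observations: one record per measured leaf.
observations = [
    {"individual": "A1", "population": "P1", "sex": "male", "position": "basal", "length_mm": 40.0},
    {"individual": "A1", "population": "P1", "sex": "male", "position": "basal", "length_mm": 44.0},
    {"individual": "A1", "population": "P1", "sex": "male", "position": "stem", "length_mm": 30.0},
    {"individual": "B1", "population": "P1", "sex": "female", "position": "basal", "length_mm": 50.0},
    {"individual": "C1", "population": "P2", "sex": "male", "position": "basal", "length_mm": 36.0},
]


def aggregate(records, scope=lambda r: True, group_by=()):
    """Summarize leaf length for records within `scope`, grouped by `group_by`."""
    groups = {}
    for record in records:
        if scope(record):
            key = tuple(record[k] for k in group_by)
            groups.setdefault(key, []).append(record["length_mm"])
    return {
        key: {"n": len(v), "min": min(v), "mean": mean(v), "max": max(v)}
        for key, v in groups.items()
    }


# Basal leaves of each individual (level 2 above).
basal_per_individual = aggregate(
    observations, scope=lambda r: r["position"] == "basal", group_by=("individual",)
)
# Male individuals within each population (level 4 above).
males_per_population = aggregate(
    observations, scope=lambda r: r["sex"] == "male", group_by=("population",)
)
# Everything sampled, without any grouping.
all_sampled = aggregate(observations)

print(basal_per_individual)
print(males_per_population)
print(all_sampled)
```

Any further combination of criteria, such as a geographic scope, only requires a different `scope` predicate or `group_by` key, not a new level in the model.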

Another distinction that is often made is that between “raw”, unprocessed data and “synthetic” or “aggregated” data. Storing raw data is desirable:

■ in general, to archive them as a reference for data generated within a study;

■ in cases where the processing can be automated, e.g., in generating descriptive statistics of repeated measurements.

The example of leaf measurements shows, however, that this distinction has a very complex relationship with the conventional taxonomic set structure of individual, lowest-ranking taxon, and hierarchical taxa. A slight simplification may be made by ignoring the relatively rare case of multiple measurements on a single object as part of the methodology (e.g., to reduce measurement error). Some instruments may even do this internally, but report only a single measurement. Nevertheless, even if both level 1 and 2 above are considered “raw data”, whether a relation between “raw data” and the individual organism exists or not depends on the multiplicity of the object part. In some cases this may be deduced from the fundamental organization of organisms in a larger taxon group, but in many cases this is species-specific (for example, some plant species have a single stem, others multiple stems).

Similarly, the aggregation levels 3 to 4 indicate that descriptions often have a scope that does not directly match the taxonomic hierarchical classification. A geographical scope is not necessarily congruent with an actual or potential infraspecific taxon. Differences between geographically scoped descriptions may be due to genetic mechanisms or may be based on, e.g., a systematic influence of climate on the phenotype.

Several requirements can already be formulated:

60. Structured descriptive information models must provide methods to describe properties of sets of objects.

61. Both repeated sample data and the results of statistical and non-statistical data aggregation methods should be supported.

62. Aggregation methods are required for descriptions of classes. In biology, sets of individuals form taxa, sets of taxa form higher taxa. No difference could be detected between aggregating from individual to lowest level class and lower level class to higher level class.

63. Aggregation methods are also required for descriptions of individuals, containing either multiple parts or changing over time (discussed in detail further down, p.93).

64. The difference between a descriptive information model for individuals (e.g., in a specimen database) and taxonomic classes (e.g., in a taxonomic database) with respect to aggregation methods is negligible.

65. Classes or sets of objects may be defined by non-taxonomic means, e.g., through a geographic scope (see also “Secondary classification resulting in description scopes”, p.215).

66. A fixed sequence of aggregation levels (such as “object part, individual, taxon”) covers only a subset of aggregation cases and should not be part of the information model.
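Requirement 62 suggests that a single aggregation operation can serve every step in the hierarchy. The following Python sketch shows one possible reading, in which a raw observation is treated as a summary of size one and the same `merge` function is applied at each step. The names `Summary`, `from_observation`, and `merge`, as well as the data, are hypothetical. The sketch also assumes that each lower-level class contributes in proportion to its sample size, which is a simplification and not a weighting prescribed by the text.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Summary:
    """Aggregated description of a set of objects (hypothetical structure)."""
    n: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")
    states: Counter = field(default_factory=Counter)

    @property
    def mean(self):
        return self.total / self.n if self.n else None


def from_observation(length_mm, state):
    """Treat a single raw observation as a summary of size one."""
    return Summary(1, length_mm, length_mm, length_mm, Counter([state]))


def merge(summaries):
    """Aggregate summaries; the same code serves every level."""
    out = Summary()
    for s in summaries:
        out.n += s.n
        out.total += s.total
        out.minimum = min(out.minimum, s.minimum)
        out.maximum = max(out.maximum, s.maximum)
        out.states += s.states
    return out


# Individuals -> species (lowest-level class).
species_a = merge([from_observation(40.0, "present"), from_observation(44.0, "absent")])
species_b = merge([from_observation(36.0, "present")])
# Species -> genus (higher-level class), using the identical operation.
genus = merge([species_a, species_b])

print(genus.n, genus.mean, genus.minimum, genus.maximum, dict(genus.states))
```

The frequency counts in `genus.states` also show how a statement such as “present or absent” arises naturally in a class description once instances with different states are combined.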

The topic of raw data and aggregation is also presented in a use case diagram (Fig.178, p.297).

To simplify the following discussion, most examples will be discussed assuming a taxon-specific object/property model (Fig.13, p.42), but they are equally applicable to a generalized model (e.g., Fig.14, p.43).
