Standard aggregation methods - Structuring Descriptive Data of Organisms

Descriptive data for sets of objects can simply be stored as a big collection of repeated values.

However, such collections are not very digestible or intelligible for humans. Other methods that reduce the amount of information that has to be processed are preferable.

The most frequently used data aggregation methods are based on univariate descriptive summary statistics. Depending on the measurement scale (p.49), the set of univariate statistical measures may be highly limited or very large. For all categorical data (nominal or ordinal), sample size, a mode, and a frequency distribution can be calculated. Furthermore, in the case of ordinal data the total range (minimum to maximum) and median are frequently used. A much wider range of summary statistics is applicable to continuously varying quantitative numeric data (e.g., mean, variance, range measures, or confidence intervals).
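The dependence of the applicable statistics on the measurement scale can be sketched as follows. This is a minimal illustration, not part of any information model discussed here: the function name `summarize`, the scale labels, and the dictionary keys are assumptions made for the example.

```python
from collections import Counter
from statistics import mean, median, median_low, multimode, variance


def summarize(values, scale, order=None):
    """Return the summary statistics applicable to a measurement scale.

    scale is "nominal", "ordinal", or "quantitative" (labels assumed here).
    order lists the ordinal states from lowest to highest.
    """
    if not values:
        raise ValueError("at least one value is required")
    summary = {"sample_size": len(values)}
    if scale in ("nominal", "ordinal"):
        # Applicable to all categorical data.
        summary["mode"] = multimode(values)
        summary["frequencies"] = dict(Counter(values))
    if scale == "ordinal":
        # Ranks make the states comparable without treating them as numbers.
        ranks = sorted(order.index(v) for v in values)
        summary["min"] = order[ranks[0]]
        summary["max"] = order[ranks[-1]]
        summary["median"] = order[median_low(ranks)]
    if scale == "quantitative":
        summary["min"], summary["max"] = min(values), max(values)
        summary["median"] = median(values)
        summary["mean"] = mean(values)
        if len(values) > 1:
            summary["variance"] = variance(values)  # sample variance, df = n - 1
    return summary
```

For ordinal data the sketch reports the lower median, so that the result is always one of the recorded states.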

For some statistical measures such as confidence intervals, the number of measures is potentially infinite, because a confidence measure may be defined for any desired confidence probability value. A similar problem occurs with percentiles. In practice, the list of commonly used measures is limited (e.g., 90%, 95%, 99% confidence intervals; 60%, 80%, 90%, 95% percentiles). One solution is therefore to consider each of these frequently used confidence measures a separate category. A more general solution is to define a basic vocabulary of confidence-measure classes that is supplemented by a quantitative parameter (taking values like 0.9, 0.95, 0.99, etc.).

The latter solution is used in SDD.
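A small vocabulary of measure classes combined with a quantitative parameter might be modelled as in the sketch below. The class names in `PARAMETERIZED`, the record type `StatValue`, and the helper `make_value` are hypothetical and do not reproduce the actual SDD vocabulary.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical measure classes that require a probability parameter.
PARAMETERIZED = {"Percentile", "ConfIntervalLower", "ConfIntervalUpper"}


@dataclass(frozen=True)
class StatValue:
    measure: str                       # class from the fixed vocabulary
    value: float                       # the reported result
    parameter: Optional[float] = None  # e.g. 0.95; only for parameterized classes


def make_value(measure, value, parameter=None):
    """Build a StatValue, checking that the parameter matches the class."""
    needs_parameter = measure in PARAMETERIZED
    if needs_parameter and parameter is None:
        raise ValueError(f"{measure} requires a parameter")
    if not needs_parameter and parameter is not None:
        raise ValueError(f"{measure} does not take a parameter")
    if parameter is not None and not 0.0 < parameter < 1.0:
        raise ValueError("parameter must lie strictly between 0 and 1")
    return StatValue(measure, value, parameter)
```

With this design, an unusual choice such as a 97.5% percentile needs no new vocabulary entry: `make_value("Percentile", 12.4, 0.975)`.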

Integer data may be summarized either using the methods intended for continuous data or using a frequency distribution. The latter method is especially appropriate if the distribution is unusual (e.g., spores that have only 3, 5, or 7 septa, or a leaf with either 15 or 17 leaflets).

In biological descriptive data one of the most customary aggregation methods for categorical or integer data is a special form of “frequency distribution” where the frequency information itself is ignored or not available and simply the set of values with frequency > 0 is reported. This method is used when saying “has linear, lanceolate, or ovate leaves”. In statistical data analysis this aggregation method is sometimes called “value list”, but this term may also be used for a total list of original values (i.e., including values that occur repeatedly). It is therefore proposed to call the aggregation method “distinct value list”. It is applicable to categorical and discrete quantitative data (as in the example “3, 5, or 7 septa” above).
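The relation between a frequency distribution and a distinct value list can be illustrated with a short sketch; the function names are chosen for this example only.

```python
from collections import Counter


def frequency_distribution(values):
    """Map each observed value to the number of times it occurs."""
    return dict(Counter(values))


def distinct_value_list(values):
    """Report the values with frequency > 0, discarding the frequencies."""
    return sorted(set(values))


# Septa counts recorded for individual spores (illustrative data).
septa = [3, 5, 5, 7, 3, 5]
print(frequency_distribution(septa))
print(distinct_value_list(septa))
```

The distinct value list can always be derived from the frequency distribution, but not the other way round.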

Where frequency information is given, it is often simplified and represented only by verbal estimates (e.g., “rarely”). It is possible to define or estimate frequency ranges corresponding to these verbal frequency statements, see “Frequency modifiers”, p.206.
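A mapping from verbal frequency statements to frequency ranges could look like the sketch below. The ranges, the modifier “usually”, and the function names are placeholders invented for this illustration; they are not the values defined under “Frequency modifiers”.

```python
# Placeholder ranges for illustration only; NOT taken from the source.
MODIFIER_RANGES = {
    "rarely": (0.0, 0.2),
    "usually": (0.8, 1.0),
}


def is_consistent(modifier, count, sample_size):
    """Check whether an observed relative frequency fits a verbal modifier."""
    low, high = MODIFIER_RANGES[modifier]
    return low <= count / sample_size <= high


def modifiers_for(count, sample_size):
    """List the modifiers whose range contains the observed frequency."""
    return [m for m in MODIFIER_RANGES if is_consistent(m, count, sample_size)]
```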

In addition to using defined statistical measures, another customary aggregation method is to record human estimates of “typical ranges”. This is not truly an aggregation method for data that have already been measured, but a replacement for it. An estimate of “typical range” may be achieved by visually comparing objects and measuring samples that appear relatively small and large while ignoring unusual or extreme cases. Only the “aggregated data” will then be recorded.

The reduced precision and potential problems when analyzing estimated data are often accepted because of the increase in work efficiency.

Some statistical measures are related and may even be substitutable for certain purposes.

Mean, mode, and median are all kinds of central value; standard deviation, variance (both with df=n–1 and df=n), and standard error are variance measures, the first four referring to values, the last to a mean (Fig.32). Such relations may be expressed in a class hierarchy or through ontological kind-of relations. This is desirable during data analysis, but may also be desirable when entering data. Thus in addition to mean, median, mode, and human mean estimate, it may be desirable to label a value directly as “central measure”.

[Figure 32: class diagram. “Statistical Measures” generalizes “Central measure” (Mean, Median, Mode), “Variance measure” (Variance of data: SampleStdDeviation, SampleVariance, PopulationStdDeviation, PopulationVariance; Variance of mean: Standard Error), and “Sample size”.]

Figure 32. UML class diagram showing a selection from a generalization hierarchy of statistical measures.
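The generalization hierarchy can be expressed directly as a class hierarchy, as in the following sketch. The arrangement of classes is inferred from the text and the labels of Fig. 32 and should be read as an assumption, not as a reproduction of the diagram.

```python
class StatisticalMeasure: ...

class CentralMeasure(StatisticalMeasure): ...
class Mean(CentralMeasure): ...
class Median(CentralMeasure): ...
class Mode(CentralMeasure): ...

class VarianceMeasure(StatisticalMeasure): ...
class VarianceOfData(VarianceMeasure): ...
class VarianceOfMean(VarianceMeasure): ...
class SampleStdDeviation(VarianceOfData): ...      # df = n - 1
class SampleVariance(VarianceOfData): ...          # df = n - 1
class PopulationStdDeviation(VarianceOfData): ...  # df = n
class PopulationVariance(VarianceOfData): ...      # df = n
class StandardError(VarianceOfMean): ...

class SampleSize(StatisticalMeasure): ...


def is_kind_of(measure, general):
    """Kind-of test: does `measure` specialize `general`?"""
    return issubclass(measure, general)
```

A value of unspecified central type can then be labelled with `CentralMeasure` itself, and analysis code can test `is_kind_of(Median, CentralMeasure)` without listing every specific measure.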

The approximate substitutability of statistical measures is especially relevant for identification and report-generation purposes. To compare a measure of an object to be identified with descriptive data in a database, most kinds of range measures (confidence interval, appropriate percentiles, mean ± std. dev., and even estimated ranges) are useful. The different measures have to be distinguished only for more exact analyses. However, as a design requirement for descriptive data systems, knowledge about the similarity of statistical measures does not necessarily have to be a part of the terminology; it may equally well be incorporated into analysis software.
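For identification, treating all range measures as interchangeable might look like the sketch below. The record type `RangeMeasure`, the function `matches`, and the optional `tolerance` parameter are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeMeasure:
    kind: str     # e.g. "confidence interval", "estimated range"
    lower: float
    upper: float


def matches(observed, ranges, tolerance=0.0):
    """True if the observed value lies within any stored range.

    The kind of range is deliberately ignored: for identification,
    the different range measures are treated as substitutable.
    """
    return any(r.lower - tolerance <= observed <= r.upper + tolerance
               for r in ranges)


# Illustrative data: leaf length in cm stored for one taxon.
leaf_length = [
    RangeMeasure("mean +/- std. dev.", 4.2, 6.8),
    RangeMeasure("estimated range", 4.0, 7.5),
]
print(matches(7.0, leaf_length))
```

A more exact analysis would have to inspect `kind` before combining or comparing the ranges.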

DiversityDescriptions was probably the first information model for descriptive data to introduce a fine-grained and extensible model for statistical measures (p.356). This included a concept for “undefined ranges”, but not for generalized central measures, or human mean and range estimates. SDD (p.20) strongly improves on this, containing an enumeration of 38 statistical methods (“UnivarStatMeasureEnum” and “UnivarStatMeasureWithParamEnum”), informing for each of these methods whether values are dimensionless or not, and categorizing them into reporting and method classes (Table 18).

The question of how data referring to statistical measures may be integrated into the data storage model is discussed later (p.110).

Table 18. Classifications of univariate statistical methods used in SDD. These provide semantic information about statistical methods intended to enable the creation of generalized software.

| Enumeration | Category | Description |
| --- | --- | --- |
| Reporting classes | Central Measure | Any kind of central measure, like mean, mode, median, etc. |
| | Lower Range | The lower value of any kind of range measure, like ‘mean minus standard dev.’, confidence interval, percentiles, etc. |
| | Upper Range | The upper value of any kind of range measure, like ‘mean+standard dev.’, confidence interval, percentiles, etc. |
| | Lower Extreme | The absolute minimum value (a category of its own). |
| | Upper Extreme | The absolute maximum value (a category of its own). |
| | Variance Measure | Any kind of variance measure, like standard deviation, variance, etc. |
| | Sample size | Sample size is a category of its own. |
| | Other | Any other kind of statistical measure. |
| Method classes | Statistical estimate | Measures estimated by statistical methods. Examples: Sample mean, minimum, confidence interval, standard deviation (n-1 degrees of freedom). |
| | Statistical parameter | Values calculated by statistical methods that are exact in relation to the population under study (statistical estimates are exact in relation to the sample, but estimates in relation to the population under study). Examples: Sample size, standard deviation (n degrees of freedom). |
| | Observer estimate | Values estimated by humans without using mathematical/statistical methods. |
| | Unknown method | Values obtained by an unknown method. This may be a statistical method or a human observer estimate. Many legacy data sets and data published in print fall into this category. |

Reporting classes are similar to the five basic measurement classes supported by DELTA. Most applications that report information for human use can rely on these reporting classes when deciding how to present the data, increasing compatibility with future SDD/UBIF versions. Implementers may, however, choose different methods of handling the statistical measures.

Method classes inform about very general quality properties of measures.
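How reporting software might rely on the two classifications of Table 18 is sketched below. The enumeration members follow the table; the measure names in `CLASSIFICATION` and the function `render` are hypothetical and are not the identifiers of the SDD enumerations.

```python
from enum import Enum


class ReportingClass(Enum):
    CENTRAL_MEASURE = "Central Measure"
    LOWER_RANGE = "Lower Range"
    UPPER_RANGE = "Upper Range"
    LOWER_EXTREME = "Lower Extreme"
    UPPER_EXTREME = "Upper Extreme"
    VARIANCE_MEASURE = "Variance Measure"
    SAMPLE_SIZE = "Sample size"
    OTHER = "Other"


class MethodClass(Enum):
    STATISTICAL_ESTIMATE = "Statistical estimate"
    STATISTICAL_PARAMETER = "Statistical parameter"
    OBSERVER_ESTIMATE = "Observer estimate"
    UNKNOWN_METHOD = "Unknown method"


R, M = ReportingClass, MethodClass
# Hypothetical measure names, classified according to Table 18.
CLASSIFICATION = {
    "Mean": (R.CENTRAL_MEASURE, M.STATISTICAL_ESTIMATE),
    "Median": (R.CENTRAL_MEASURE, M.STATISTICAL_ESTIMATE),
    "Min": (R.LOWER_EXTREME, M.STATISTICAL_ESTIMATE),
    "Max": (R.UPPER_EXTREME, M.STATISTICAL_ESTIMATE),
    "SampleStdDeviation": (R.VARIANCE_MEASURE, M.STATISTICAL_ESTIMATE),
    "PopulationStdDeviation": (R.VARIANCE_MEASURE, M.STATISTICAL_PARAMETER),
    "SampleSize": (R.SAMPLE_SIZE, M.STATISTICAL_PARAMETER),
}

RANGE_ORDER = [R.LOWER_EXTREME, R.LOWER_RANGE, R.CENTRAL_MEASURE,
               R.UPPER_RANGE, R.UPPER_EXTREME]


def render(values):
    """Format values as 'min-lower-central-upper-max' by reporting class only."""
    by_class = {}
    for name, value in values.items():
        reporting, _ = CLASSIFICATION.get(name, (R.OTHER, M.UNKNOWN_METHOD))
        by_class.setdefault(reporting, value)  # keep the first value per class
    return "-".join(str(by_class[c]) for c in RANGE_ORDER if c in by_class)
```

Because `render` looks only at the reporting class, a measure added in a later version needs just one new classification entry, not a change to the formatting code.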

67. Support for a range of univariate statistical measures is required. The set of applicable measures depends on data type and measurement scale.

68. Most univariate statistical measures report only a single value.

69. Some univariate statistical measures such as percentiles or confidence interval limits are best represented as a combination of a result value and a method parameter.

70. In addition to exact measures, support for human estimates such as “typical range” is required.

71. To support legacy data such as DELTA, support for undefined measures is desirable (e.g., in DELTA a value is known to be a central value, but not whether it is a single measurement, a mean, median, or mode).

72. Some standard descriptive statistics report a collection of data items rather than a single value for a statistical measure. Frequency distributions and distinct value lists must be supported. A distinct value list is a frequency distribution with unknown frequency.

73. It is desirable to support two forms of frequency distributions: with frequency values and with frequency categories (i.e. frequency modifiers).

Inappropriate aggregation results

Recording a temperate tree in winter as “leaves absent” and in summer as “leaves present” may well be considered appropriate. However, aggregating such data would lead to a taxon description like “leaves present or absent”, which might be considered “silly” for a temperate tree. A very similar situation in animals may be considered less silly: The Arctic hare (Lepus timidus) having “white or brown fur” seems considerably more appropriate. And is a description of a butterfly as “with or without wings” (i.e., as caterpillar) appropriate or not?

Again, considerable information about the expected background knowledge of the data consumer seems to be required. Repeating general knowledge leads to descriptions that appear “silly”.

The problem also has some serious implications for rules for data recording, analysis, and identification. When recording data for individuals, care must be taken about the expected scoping of a description. A character like “leaves (presence)” may be expected to be validly coded on the taxon level, i.e., it is negative only if the organism never produces leaves. This may be desirable for analytical purposes studying phylogenetic or genetic processes. However, when recording data on the individual level, data are constrained both by individual variability and by development over time. If the individual winter tree is identified by someone not knowing that it has leaves in summer, the answer “leaves absent” would lead to a failure of identification (p.310), if the taxon description of the tree is limited to “leaves present”.

Further studies are needed to clarify how these problems can be resolved. The current standard solution seems to be that in most characters a-priori knowledge of temporal variability or the entire life-cycle of an organism is expected (e.g., presence of leaves, flowers, or fruits) and therefore the taxon-level knowledge is applied to guide the user both during data recording and identification. Where the knowledge is lacking, one may choose to live with “silly” descriptions.

This solution is clearly not satisfactory, especially since the expertise of data recorders and users of identification tools will typically be quite different. Several other solutions may be envisaged:

■ Use separate characters for winter and summer.

■ Use temporal modifiers (p.204) for winter and summer. The aggregated description could then read as “leaves present (in summer) or absent (in winter)” and “brown in the summer, white in the winter; ears tipped with black all year round”.

■ Create scoped descriptions for different life-cycle or seasonal stages (see “Secondary classification resulting in description scopes”, p.215).

■ Attach metadata to characters, informing the aggregation process whether a character is expected to occur only temporarily (this is related to “Dependencies on circumstances of identification”, p.175).

The examples show that data synthesis to obtain taxon descriptions may be more complex than simply applying aggregation methods.
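The difference between plain aggregation and aggregation with temporal modifiers can be sketched as follows. The observation format (pairs of state and season) and both function names are assumptions made for this illustration.

```python
def aggregate_plain(character, observations):
    """Distinct value list over all observations, ignoring the season."""
    states = list(dict.fromkeys(state for state, _ in observations))
    return f"{character} " + " or ".join(states)


def aggregate_with_modifiers(character, observations):
    """Distinct value list that keeps the season as a temporal modifier."""
    seasons_by_state = {}
    for state, season in observations:
        seasons = seasons_by_state.setdefault(state, [])
        if season not in seasons:
            seasons.append(season)
    parts = [f"{state} (in {' and '.join(seasons)})"
             for state, seasons in seasons_by_state.items()]
    return f"{character} " + " or ".join(parts)


# Observations of individual trees as (state, season) pairs.
observations = [("present", "summer"), ("absent", "winter"),
                ("present", "summer")]
print(aggregate_plain("leaves", observations))
# leaves present or absent
print(aggregate_with_modifiers("leaves", observations))
# leaves present (in summer) or absent (in winter)
```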

74. Character metadata informing about the expected scope or recording level of a character may be required. For genetic traits the scope is summarized over the entire life-cycle; for diagnostic purposes individual points in time may be more appropriate.

75. In addition, or perhaps alternatively, character metadata informing about dependency of observation on circumstances or temporal development (and therefore the likelihood that data recorded on individuals represent the entire developmental cycle) may be required.
