Value order in character data - Structuring Descriptive Data of Organisms

Whenever data are recorded based on a terminology, the sequence of recorded data elements may follow either the sequence defined in the terminology or the sequence of user data entry. For example, when entering fields into a database table, the sequence of data entry is not preserved, and reordering of fields will automatically reorder all data. On the other hand, if experimental data are entered into a spreadsheet-like editor, the user would expect that data entry sequence is honored. The following cases may be distinguished:

1. Order of characters in a description. A general agreement seems to exist that – for the sake of comparability – characters should be in the same order in all descriptions to be compared.

Different sequences for different purposes are often desirable, but all of these then apply to all descriptions. In DELTA only a single character order may be defined; in SDD order is expressed through Character Trees, of which an unlimited number may exist.

2. Order of repeated measurements (original sample data). Although the order of repeated measurements should never be semantically relevant, preserving this order is highly relevant for the sake of workflow, e.g., proofreading or other comparisons with external notes. See also “Data recording levels (sample data)” (p.89) and Table 20 (p.91).

3. Order of values within a quantitative character. Similar to characters, the order of statistical measures is important for comparison purposes and normally defined exclusively in the terminology or report-generation methods. However, an important exception is that some models, including DELTA, support repeated “central measure or values” (Table 31, p.112), for which the same arguments as for original sample data apply. Thus, for repeated occurrences of the same measure (where the information model supports such) the original sequence may have to be preserved.

4. Order of values within a categorical character. This is perhaps the most contentious issue.

Firstly, in some models it may not be recognizable whether a sequence of states is intended as repeated measurements or as summary data. Distinguishing this should be a goal of the model (which may be in part achieved by preventing repeated state values, unless modifiers or notes differ). For the remaining states, in many cases the data set authors will desire them to be reordered if the sequence of states in the terminology is updated. In some cases, however, it is vital that a sequence in a given description be preserved. The remainder of this section will discuss this in detail.

In general, the order of categorical character states in the terminology may express an innate order of the concepts. This is simplest in the case of ordinal data, where the order expresses that the distance from first to third state is greater than any other distance among these states (even though it is unknown whether first-to-second or second-to-third is the greater distance). However, nominal data may also be at least partially ordered, often in complex ways so that a deliberate choice has been made to simplify the analysis by reducing these data to the nominal scale.

This and other reasons, including traditions, may lead to an ordering expectation by human consumers. Defining this order once in the character terminology (or in report definitions) for all descriptions together fulfills a minimal requirement.

In many cases, however, terminology-defined order and the desired order in a description differ. For example, in a character with the ordered states:

– 1. very short bristles

– 2. short bristles

– 3. long bristles

– 4. very long bristles

frequency or certainty modifiers (p.206 and 207, respectively) may appear in a specific object description as “3 or rarely 1” or “4 or perhaps 3”. Clearly, although the “scoring sequence” may not matter in identification or character analysis, it greatly facilitates communication with human users, especially when generating natural language descriptions. A statement “bristles very short (rarely) or long” is considerably more difficult to understand; and “bristles rarely very short or long” is even likely to be misunderstood. Similar examples may be found for spatial or temporal modifiers (p.203 and 204), e.g., “flower blue, or violet (at the base)” or “flower violet changing to red when mature”.

If the order of modifiers has been explicitly defined to be semantic (compare “Modifier sets and sequences”, p.199), the modifier order may thus have to take precedence over the state order.
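The effect of the scoring sequence on generated text can be illustrated with a minimal sketch. The state table, the `render` function, and the tuple layout are hypothetical and serve only to reproduce the bristle example; they are not taken from any of the applications discussed.

```python
# Hypothetical state labels following the bristle example in the text.
STATES = {1: "very short", 2: "short", 3: "long", 4: "very long"}


def render(character, scores):
    """Join (state_id, modifier) pairs into a phrase, in the order given."""
    parts = []
    for state_id, modifier in scores:
        label = STATES[state_id]
        parts.append(f"{modifier} {label}" if modifier else label)
    return f"{character} " + " or ".join(parts)


# Scoring sequence as entered: "3 or rarely 1".
scored = [(3, None), (1, "rarely")]
# The same scores forced into terminology order (state 1 before state 3).
by_terminology = sorted(scored, key=lambda s: s[0])

scoring_text = render("bristles", scored)
terminology_text = render("bristles", by_terminology)
```

Here `scoring_text` reads “bristles long or rarely very short”, whereas `terminology_text` yields the ambiguous “bristles rarely very short or long”, in which “rarely” may be read as applying to both states.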

Unfortunately, some expressiveness may still be beyond analysis, e.g., where special situations are annotated in free-form text. Again, further analysis of existing data sets is required to determine whether it is sufficient to place states with annotations last. The default order of states in summary data of descriptions could thus be:

■ by ascending temporal modifier rank,

■ by ascending spatial modifier rank,

■ by descending certainty rank or values (certain before uncertain states),

■ by descending frequency rank or values (frequent before infrequent states),

■ states without notes before states with notes,

■ by ascending order in which the states are defined in the character definition (for all modifier cases, character data without modifiers would be ordered first).
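The proposed default order may be sketched as a multi-key sort. This is an illustration under stated assumptions, not a definitive implementation: it assumes that the criteria are applied as successive sort keys in the order listed (which the text does not state explicitly), that modifier ranks are integers, and that higher certainty and frequency values mean “more certain” and “more frequent”. The `StateScore` class and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StateScore:
    state_rank: int                       # position of the state in the terminology
    temporal_rank: Optional[int] = None   # rank of a temporal modifier, if any
    spatial_rank: Optional[int] = None    # rank of a spatial modifier, if any
    certainty: Optional[int] = None       # assumed: higher means more certain
    frequency: Optional[int] = None       # assumed: higher means more frequent
    note: str = ""                        # free-form annotation


def _ascending(rank):
    # Scores without the modifier sort first, then ascending rank.
    return (0, 0) if rank is None else (1, rank)


def _descending(rank):
    # Scores without the modifier sort first, then descending rank.
    return (0, 0) if rank is None else (1, -rank)


def default_order_key(score):
    return (
        _ascending(score.temporal_rank),
        _ascending(score.spatial_rank),
        _descending(score.certainty),
        _descending(score.frequency),
        bool(score.note),                 # states without notes before states with notes
        score.state_rank,                 # finally, terminology order
    )


def default_order(scores):
    return sorted(scores, key=default_order_key)


# "3 or rarely 1": the unmodified state 3 precedes the infrequent state 1.
scores = [StateScore(1, frequency=1), StateScore(3)]
ordered_ranks = [s.state_rank for s in default_order(scores)]  # [3, 1]
```

Treating a missing modifier as sorting first follows the parenthetical remark in the last list item; a different convention would only require changing the two helper functions.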

The minimum requirement for order of categorical states in descriptions is that the resulting reports and natural language descriptions should not hinder communication. Whether the rules above suffice for this, and whether they can be extended to include all modifiers for which order may be defined as semantic, requires further analysis. In preliminary thought experiments, no counter-example of ranked modifiers could be found that, if used to reorder state sequences, would not also improve the readability of the resulting description.

Despite this, many biologists may desire support for manual value ordering that goes beyond the algorithmic support outlined above. Biologists normally write descriptions as unconstrained text. Many biologists will assume that "round or obovate" and "obovate or round" are different, assuming an unequal (but unspecified) frequency to be implied in the order of states. This may be viewed as a "bad habit" in biology, but it may be unwise to try to force content providers to abandon relatively harmless bad habits. Furthermore, constraints on value ordering can only be enforced for future data; at least for legacy data (digitized descriptions, DELTA-coded data), ignoring value sequence is likely to cause some problems.

Individual applications may or may not choose to support additional manual ordering methods. For data exchange standards and general information models designed to support multiple applications, some agreement on manual ordering must be sought. Unfortunately, always preserving the value order has drawbacks if:

■ the terminology is revised and the order of states in the terminology redefined,

■ data are aggregated (e.g., descriptions of multiple specimens combined into a new species description) that have different scoring sequences.

Current support for value order in some applications and data standards:

■ The DELTA and New DELTA standards define that the order of states in data files is always significant.

■ The CSIRO DELTA programs (p.19) will always store the scoring sequence and ignore the sequence of states in the terminology. Separate methods for “reordering character states” based on the terminology exist.

■ NEXUS (p.18) applications probably preserve the sequence in which polymorphic state scores (in “{}”) are written, but definite information about this is not known. Mesquite (p.18) still needs to be tested in this respect.

■ The first versions of DiversityDescriptions used only the state sequence defined in the terminology and considered the sequence of states in descriptions irrelevant. This turned out to be a major source of discontent for users so that later versions added an optional manual state sequence. By default all states in a description are ordered in the sequence defined in the terminology, but if the user chooses to rearrange this sequence in a description, this new sequence will be permanently stored and is no longer influenced by changes in the terminology. If this decision is later considered undesirable in some characters, a method is provided to reset state order to terminology order for all descriptions in a single step. See the documentation of the logical model (p.331) for further information.

■ CBIT Lucid (p.21) never outputs any descriptions where the state sequence may be relevant. It is optimized for interactive identification, in which case the state sequence is irrelevant.

■ In SDD (p.20), the “statemodel” XML attribute within the character data element may be changed from the default (“OrSet”) to other values like “OrSeq” (compare Table 22, p.97). SDD thus prefers no manual ordering (thereby simplifying evolution of the terminology, such as adding or reordering states), but allows this to be changed where explicitly desired.
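The optional manual state sequence described above for DiversityDescriptions may be sketched as follows. The class and method names are hypothetical and do not reflect the actual logical model of DiversityDescriptions or the SDD schema; the sketch only illustrates the behavior stated in the text: terminology order by default, a stored manual order that is unaffected by terminology changes, and a reset.

```python
from typing import List, Optional


class CategoricalScore:
    """States scored for one character in one description (hypothetical model)."""

    def __init__(self, states):
        self.states = set(states)
        # None means: follow whatever order the terminology currently defines.
        self.manual_order: Optional[List[str]] = None

    def set_manual_order(self, order):
        if set(order) != self.states or len(order) != len(self.states):
            raise ValueError("manual order must list each scored state exactly once")
        self.manual_order = list(order)

    def reset_order(self):
        """Return to terminology order."""
        self.manual_order = None

    def ordered(self, terminology):
        if self.manual_order is not None:
            return list(self.manual_order)
        return [state for state in terminology if state in self.states]


terminology = ["round", "obovate", "elliptic"]
revised = ["elliptic", "obovate", "round"]     # terminology after a revision

default_score = CategoricalScore(terminology)
manual_score = CategoricalScore(terminology)
manual_score.set_manual_order(["obovate", "round", "elliptic"])

follows_revision = default_score.ordered(revised)   # reordered with the terminology
keeps_manual = manual_score.ordered(revised)        # unaffected by the revision
```

The drawback noted earlier is visible here: once a manual order is stored, a revision of the terminology no longer reaches that description until `reset_order` is called.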

The solution chosen by DiversityDescriptions and SDD has the advantage that changes in the terminology are normally reflected dynamically and automatically in reports (even in federated databases where the terminology may be changed independently of the descriptions). However, it makes data entry slightly less intuitive, since it requires an explicit understanding of, and choice between, default and manual state order. This is especially problematic if, during part of the recording phase, the desired scoring sequence in a given description is identical to the sequence in the terminology, but the latter is later rearranged.

The topic has previously been discussed in an SDD proposal (Hagedorn 2003g).

113. A general order of characters and a general order of character states within a character are meaningful for communication with humans, even where it is not meaningful for machine interpretation or analysis (e.g., states on the nominal scale).

114. For characters, multiple alternative ordering definitions are desirable.

115. Negative requirement: It is not necessary to preserve, in a given description, the order in which data relating to different characters have been entered.

116. In a given description and character, the order in which multiple values or states have been entered may have to be preserved. This is unequivocal for repeated measurements in sample data, but restricted to special situations in summary data.

117. In a given description and quantitative character, multiple occurrences of a statistical measure may have to be preserved in sequence (some models use this as a replacement for sample data).

118. In a given description and categorical character, it may be desirable to provide a method to let data set authors decide whether the sequence of multiple states may be rearranged according to the sequence in the terminology, or whether it is to be preserved.

119. When reordering the states in a given description and character, modifiers for which order has been defined as semantic (ranked modifiers) may have precedence over the state order.

120. It is desirable that the information model encourages distinguishing sample data and summary data in an unambiguous way, e.g., by preventing unqualified repeated occurrences of the same value or state in summary data. States with different modifiers or annotations, however, have to be accepted.
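The check suggested in requirement 120 could look roughly like the following sketch. It assumes, purely for illustration, that each summary-data score is a `(state, modifier, note)` tuple; the function name and this representation are hypothetical.

```python
from collections import Counter


def unqualified_repeats(scores):
    """Return the states that occur more than once with identical modifier and note.

    scores: iterable of (state, modifier, note) tuples (hypothetical layout).
    """
    counts = Counter(scores)
    return sorted({state for (state, _modifier, _note), n in counts.items() if n > 1})


# Rejected in summary data: the same state twice without any qualification.
rejected = unqualified_repeats([("blue", None, ""), ("blue", None, "")])

# Accepted: the repeated state differs in its modifier.
accepted = unqualified_repeats([("blue", None, ""), ("blue", "rarely", "")])
```

An empty result means that every repeated state is distinguished by a modifier or annotation, so the scores can be read as summary data rather than as repeated measurements.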

From the document Structuring Descriptive Data of Organisms: Requirement Analysis and Information Models (pages 113-116)