Calculated characters - Structuring Descriptive Data of Organisms

Many of the mappings discussed so far may be generalized as a function projecting one value space into another. This may be a simple function (e.g., to convert °F to °C), or a categorization of data, ratios, area, etc. (Table 13). However, even where the fundamental function is relatively simple, rather complex dependencies on data aggregation (p.83), properties, and measurement methodology may exist, as the following examples show.

Figure 24. Multiple measurements may have fewer degrees of freedom than variables. (Figure labels: Petiole: 25; Lamina: 75; Leaf: 100.)

Figure 25. Depending on measuring method, length measures may not be additive. (Figure labels: Lamina: 75; Leaf: 93; Petiole: 25.2.)

Ratios: The most common ratios in descriptive data are based on measurements of the same dimension (or dimensionless, like counts). Examples are the ratio of the length of two parts, or the length/width-ratio of a single part of the object. Such ratios are especially convenient where parts are next to each other in a composition (e.g., sepals and petals in flowers, compare Fig.53, p.138); such a situation allows estimates (e.g., “smaller”, “about equal”, “larger”, or “twice as large”) with ease and precision. Calculations based on different dimensions are relatively rare.

An exception is perhaps counts per extension or area (for example, density of hairs). Under the “basic property model” (p.62), a ratio may thus be based on 1 or 2 parts and 1 or 2 properties.

As mentioned above (see Table 20, p.91), ratios are an example where a calculation based on summarized measurements gives a different result than the same calculation based on repeated measurements. When defining a calculation method, its scope (repeated measurements or statistical measures such as the mean) should thus be defined as well.
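The difference between the two scopes can be illustrated with a minimal sketch. The measurement values below are hypothetical and chosen only so that the arithmetic is easy to follow; they are not taken from Table 20.

```python
from statistics import mean

# Hypothetical repeated measurements (mm) of length and width on three leaves.
lengths = [3.0, 6.0, 18.0]
widths = [3.0, 3.0, 6.0]

# Scope 1: calculate the ratio for each pair of measurements, then summarize.
mean_of_ratios = mean(l / w for l, w in zip(lengths, widths))

# Scope 2: summarize each variable first, then calculate a single ratio.
ratio_of_means = mean(lengths) / mean(widths)

print(mean_of_ratios)  # 2.0
print(ratio_of_means)  # 2.25
```

Both numbers are legitimate "length/width ratios" of the same data, which is why a calculation method that does not state its scope is ambiguous.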

Sums: In many cases the total length of a composite object may be calculated based on the lengths of individual parts (Fig.24). If the individual lengths are additive, fewer degrees of freedom than variables exist. Note that, depending on how the measurement method has been defined, an interaction with other properties may exist. In Fig.25 the measurement of the petiole is defined in a way that the lengths of lamina and petiole are additive only if the petiole is not angled (not “geniculate”).

No current information model or implementation supports a generalized function mapping. In general, support for calculated characters encounters the following problems:

■ A sufficiently general and powerful mathematical language is required to define complex formulas. The language should, however, not be limited to a specific programming language (assuming terminological definitions should remain exchangeable). An option might be a programming-language-independent standard like MathML (Carlisle & al. 2003), but this would create a heavy burden on implementers while still lacking many concepts (current MathML is very weak on statistical functions and concepts). When the topic was discussed in the SDD, no satisfactory solution could be found (see SDD minutes: Hagedorn 2003b). A possible path could be defining a subset of MathML that covers most practical needs and at the same time reduces implementation cost to a realistic level.

■ The categorization of a quantitative measurement variable into classes (histogram) and corresponding categorical states requires knowledge of how these categories must be addressed. This mixes issues of a general mathematical language (which would be able to express the class intervals through ‘>’ and ‘<’ operators) and specific issues of the descriptive model. For categorical mappings it would be necessary to support mathematical concepts from category theory like “functors” (i.e., a generalization of functions that associate every object of one category with an object of another category).

■ The available information is incomplete and different authors or publications may record different aspects of a complex of dependent/derived properties (see Table 14). As a consequence, some “character variables” may either be a calculated result or an original value.

□ In principle, calculated properties are similar to derived class attributes in UML. It seems, however, that UML implicitly assumes an inverse function (the implementation may decide which attributes to store and which to calculate), whereas the calculations used most frequently in descriptive data have no inverse function (e.g., aggregated ratios, categorization, or statistical measures such as average or variance).

□ It might be possible to create two variables, one containing calculated, the other original values. This would put a heavy burden on queries and consumers to understand the semantics of this relation. SDD proposes to add attributes on data instead, informing about the origin of a value.

■ In some situations it is desirable to be able to record the parts of the description that are available and remain true to the data sources. This may include:

□ Recording values as they are published, even if they may be calculated from other variables (over-determined data).

□ Recording values even if some values contradict others, based on the assumptions implicit in the calculation method. Not only is it usually not immediately clear which values are correct and which are false, but, as the example in Fig.25 shows, the assumptions themselves may be erroneous. Preserving possibly contradictory data, perhaps with an annotation noting the contradiction, is preferable to ad hoc decisions about which value is correct that are forced only because the system allows the entry of “correct” data alone.
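The idea of attaching the origin of a value to the data itself, and of preserving a published value that contradicts a calculation, might be sketched as follows. This is an illustration under stated assumptions, not the SDD schema: the names `Origin`, `Value`, and `leaf_length` are hypothetical, the numbers are invented, and simple additivity of petiole and lamina length is assumed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Origin(Enum):
    ORIGINAL = "original"      # recorded as measured or published
    CALCULATED = "calculated"  # derived from other recorded values


@dataclass(frozen=True)
class Value:
    amount: float
    origin: Origin
    note: str = ""  # free-text annotation, e.g. about a contradiction


def leaf_length(petiole: Value, lamina: Value,
                published: Optional[Value] = None) -> Value:
    """Return the leaf length, preferring a published value if one exists.

    Assumes petiole and lamina lengths are additive (hypothetical rule).
    """
    total = petiole.amount + lamina.amount
    if published is None:
        # Nothing was published: supply the calculated value, tagged as such.
        return Value(total, Origin.CALCULATED)
    if published.amount != total:
        # Keep the published value and annotate it instead of "correcting" it.
        return Value(published.amount, published.origin,
                     note=f"contradicts calculated sum {total}")
    return published


petiole = Value(25.0, Origin.ORIGINAL)
lamina = Value(75.0, Origin.ORIGINAL)
derived = leaf_length(petiole, lamina)
kept = leaf_length(petiole, lamina, published=Value(93.0, Origin.ORIGINAL))
print(derived)
print(kept)
```

In this sketch a single variable holds either kind of value, and consumers inspect `origin` rather than having to understand a relation between two parallel variables.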

Table 13. Examples of potentially useful mappings of measurement characters to “calculated” characters.

| Source | Destination | Examples and Notes |
|---|---|---|
| Single quantitative measurement | Quantitative | Conversion of units like Fahrenheit → Celsius |
| Single quantitative measurement | Categorical | Direct measurement to classification: < 10 µm, 10-20 µm, > 20 µm |
| Single quantitative+categorical measurement | Quantitative | Spore width measured separately for septate/aseptate or hyaline/brown spores |
| Multiple quantitative measurements | Quantitative | Length/width ratio, area, volume, etc. |
| Multiple quantitative measurements | Categorical | Comparative statements like “Petals shorter than sepals” |
| Multiple quantitative+categorical measurement | Quantitative | Area or volume could be calculated separately for different shapes |
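Some of the mappings in Table 13 can be written as plain functions. The following is only a sketch: all function names are hypothetical, and the treatment of the class boundaries (whether exactly 10 µm or 20 µm falls into the middle class) is an assumption that the table itself does not settle.

```python
def fahrenheit_to_celsius(deg_f: float) -> float:
    """Single quantitative measurement -> quantitative (unit conversion)."""
    return (deg_f - 32.0) * 5.0 / 9.0


def size_class(width_um: float) -> str:
    """Single quantitative measurement -> categorical.

    Assumption: the boundary values 10 and 20 belong to the middle class.
    """
    if width_um < 10:
        return "< 10 µm"
    if width_um <= 20:
        return "10-20 µm"
    return "> 20 µm"


def length_width_ratio(length: float, width: float) -> float:
    """Multiple quantitative measurements -> quantitative."""
    return length / width


def compare_petals_to_sepals(petal_len: float, sepal_len: float) -> str:
    """Multiple quantitative measurements -> categorical (comparative)."""
    if petal_len < sepal_len:
        return "Petals shorter than sepals"
    if petal_len > sepal_len:
        return "Petals longer than sepals"
    return "Petals as long as sepals"


print(fahrenheit_to_celsius(212.0))         # 100.0
print(size_class(15.0))                     # 10-20 µm
print(length_width_ratio(30.0, 12.0))       # 2.5
print(compare_petals_to_sepals(4.0, 6.0))   # Petals shorter than sepals
```

Note that none of the categorical mappings can be inverted: a state such as “10-20 µm” does not allow the original measurement to be recovered.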

Table 14. Example showing potential combinations of available measurements for leaf length measurements (compare Fig.24).

| Petiole length | Lamina length | Leaf length | Note |
|---|---|---|---|
| (can be calculated) | 70 mm | 80 mm | = minimally complete |
| 10 mm | (can be calculated) | 80 mm | = minimally complete |
| 10 mm | 70 mm | (can be calculated) | = minimally complete |
| 10 mm | ? | ? | = incomplete |
| ? | ? | 80 mm | = incomplete |
| ? | 70 mm | ? | = incomplete |
| 15 mm | 70 mm | 80 mm | = over-defined & contradictory! |
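The cases in Table 14 can be distinguished mechanically if the three lengths are assumed to be additive (petiole + lamina = leaf), which, as Fig.25 shows, holds only for suitable measurement definitions. The sketch below makes that assumption explicit; the function names and the label “over-defined, consistent” are hypothetical additions, not part of the table.

```python
from typing import Optional, Tuple

Length = Optional[float]  # None stands for "?" (not recorded)


def classify(petiole: Length, lamina: Length, leaf: Length,
             tolerance: float = 0.0) -> str:
    """Classify a record, assuming petiole + lamina = leaf."""
    known = sum(v is not None for v in (petiole, lamina, leaf))
    if known < 2:
        return "incomplete"
    if known == 2:
        return "minimally complete"
    if abs(petiole + lamina - leaf) <= tolerance:
        return "over-defined, consistent"
    return "over-defined & contradictory"


def fill_missing(petiole: Length, lamina: Length,
                 leaf: Length) -> Tuple[Length, Length, Length]:
    """Calculate the one missing value of a minimally complete record."""
    if classify(petiole, lamina, leaf) != "minimally complete":
        return petiole, lamina, leaf  # nothing to do, or not enough data
    if petiole is None:
        return leaf - lamina, lamina, leaf
    if lamina is None:
        return petiole, leaf - petiole, leaf
    return petiole, lamina, petiole + lamina


print(classify(None, 70.0, 80.0))      # minimally complete
print(classify(10.0, None, None))      # incomplete
print(classify(15.0, 70.0, 80.0))      # over-defined & contradictory
print(fill_missing(None, 70.0, 80.0))  # (10.0, 70.0, 80.0)
```

A contradictory record is reported rather than repaired, in line with the preference for preserving data as recorded.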

Note that “formulas” or methods to define calculated values discussed here are different from the “Formula characters” directive proposed for “New DELTA” (p.20); these “Formula characters” are designed as natural language formatting commands.

DELIA (p.19) uses the term “derived characters” instead of “calculated characters”.

48. Support for calculated character values (based on one or multiple values from other characters) is desirable.

49. A standardized support for calculations in a descriptive information model is, however, highly problematic, both because of possibly complex dependencies and because notation systems for formulas are either specific to certain programming languages, or general but difficult to implement in a wide variety of applications. Support for calculated characters is not a priority.

4.7. Coding status

Data may be missing from a description for a variety of reasons and it is often relevant to have some classification of why this is so. The most basic distinction is between “data entry is incomplete or erroneous”, and “data cannot possibly be supplied”. Databases usually use a Null or Nothing value to indicate missing data or incompleteness. Examples of potential semantics of the Null value in databases are “absence of data”, “unknown data”, “undefined”, “not applicable”, and “to be added later” (from the documentation of Microsoft SQL Server 2000). This semantic “polymorphism” causes interoperability problems when exchanging data. A richer terminology is therefore desirable.

Existing forms of such “coding status” or “knowledge management” metadata in descriptive data are called “pseudo-values” (or “special symbols”) in DELTA (compare Table 15), “special states” in DiversityDescriptions (p.322), and coding status values in SDD (compare Table 16).

Coding status values may be interpreted in various generalization hierarchies (Fig.26). The SDD proposal “Indicators of coding status in class or object descriptions” discusses the topic in detail (Hagedorn 2004b). It includes detailed information on each proposed value as well as SDD XML instance examples, discussions of current practices, exclusiveness, information aggregation issues (in the sense discussed further below on p.83), and the differences between coding status metadata of a variable and metametadata (frequency and certainty modifiers, see p.206 and 207) of values.

Table 15. Coding status values (also known as “pseudo-values”, “special symbols”, or “special states”) in DELTA.

| Coding status | DELTA symbol | Notes |
|---|---|---|
| “no data recorded” | (implied → no symbol) | Usually interpreted as “not yet scored” or “unfinished work” |
| “unknown” | ‘U’ | The character scoring is “minimally complete”. |
| “not applicable” | ‘–’ | Manual alternative to declarative general character applicabilities. It may be used together with other states (e.g., in genus descriptions where a character may have a state in some species, but may be inapplicable in other species). |
| “variable” | ‘V’ | “Variable” may express a polymorphism, variability, or saturation (i.e., all states are true) of a character. It should not be used together with other states. Deprecated in DiversityDescriptions. |

Table 16. Coding status values defined in SDD 1.0 and 1.1. This enumeration extends the DELTA values ‘U’ and ‘–’, but drops the DELTA “V = variable” value. The semantics of the latter are difficult to define and its current application is doubtful.

| Coding status | Symbol | Notes |
|---|---|---|
| “To be checked” | “!” | Explicit indicator to revisit a character later. This may be used when data are missing (known to exist, but not at hand for entering) or together with data (check data against additional information source). |
| | “–” | For logical reasons, it is assumed that data cannot exist. |
| | “?” | Data could not be obtained although an effort was made. |
| | “#” | Data are known to exist, but are purposely not coded because not even an interpretation involving certainty modifiers was deemed possible. |
| (For the following coding status situations no explicit coding status values are defined; the status is implied:) | | |
| (“Data recorded successfully”) | (Evident from existence of data) | Coded successfully; Exists; Exists |
| (“No data recorded”) | (Implicit in not coding the character in the description at all; since neither a value nor a coding status is present this is also “status not yet evaluated”) | Not evaluated; May exist; Does not exist |

1 ‘Information in source data set’ refers to data storage (document or database) from which the current representation that includes the coding status values was directly or indirectly derived. The distinction is relevant in the case of ‘data withheld’, where no data exist in the current data set, but information exists in the source.
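How explicit and implied coding status might be combined when reading a description can be sketched as follows. This is not the SDD schema: apart from “To be checked”, the member names of `CodingStatus` are hypothetical labels chosen for this illustration (only the symbols and their explanations come from Table 16), and `effective_status` is an assumed helper.

```python
from enum import Enum
from typing import Optional, Sequence


class CodingStatus(Enum):
    """Explicit coding status values, keyed by the symbols of Table 16."""
    TO_BE_CHECKED = "!"  # revisit the character later
    CANNOT_EXIST = "–"   # hypothetical name: data cannot logically exist
    NOT_OBTAINED = "?"   # hypothetical name: effort made, no data obtained
    NOT_CODED = "#"      # hypothetical name: data exist, purposely not coded


def effective_status(values: Sequence[object],
                     explicit: Optional[CodingStatus] = None) -> str:
    """Return the explicit status if given, else the implied one."""
    if explicit is not None:
        # An explicit status may coexist with data (e.g. "to be checked").
        return explicit.name
    if values:
        return "DATA_RECORDED"     # implied by the existence of data
    return "NO_DATA_RECORDED"      # implied by not coding the character


print(effective_status([12.5]))                              # DATA_RECORDED
print(effective_status([]))                                  # NO_DATA_RECORDED
print(effective_status([12.5], CodingStatus.TO_BE_CHECKED))  # TO_BE_CHECKED
print(CodingStatus("?").name)                                # NOT_OBTAINED
```

Unlike a single database Null, each case here has one meaning, which is the interoperability benefit argued for in this section.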

Figure 26. Generalization hierarchy of character data and coding status values. This diagram is intended to clarify intent, not as an architecture plan for implementations. In SDD (p.20) values for the five classes with a thick border are explicitly present; the remaining classes are deduced from other data or complete lack of coding.

50. Support for coding status information in the information model is a central requirement to support knowledge management and collaboration scenarios for descriptive data.

51. The existence of categorical or quantitative data as well as the lack of any data in a description for a character that is defined in the terminology may be considered implicit forms of coding status.

52. A predefined list of coding status values is desirable to support interoperability. The hierarchical nature of coding status information may be implicit and does not have to be expressed in the data.

4.8. Character dependency
