Character decomposition models - Structuring Descriptive Data of Organisms

The majority of descriptive data applications in biology are interested in individual objects or classes of these. In the matrix- or list-based description models discussed so far these specimens, units, taxa, items, etc. are considered a fundamental entity that is described through a list of variables, called characters. These models have a number of advantages. They are close to:

■ the entity-type/attribute/value model in ER modeling,

■ the table/field/value model in the physical view of a database,

■ the class/property/value model in many object-oriented programming languages,

■ the class/attribute/value model in UML (compare also Table 3, p.34).

Object-oriented information models generally support complex object properties composed of the same (array, vector, matrix, collections) or different (structure, record) types, making it relatively straightforward to consider the character variables as object properties. This is slightly less intuitive in traditional relational databases, where the entity attributes are often limited to simple data types (including strings). However, as the DiversityDescriptions model (p.322) shows, an entity × character model can easily be implemented in a simple relational DBMS, without requiring an object-relational DBMS. An unstructured list of variables is the prevalent data format in phylogenetic or multivariate statistical analysis, and is used in the dominant DELTA or NEXUS exchange standards for descriptive data.
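The following is a minimal sketch of how an entity × character model might be laid out in a plain relational DBMS, here SQLite through Python's standard `sqlite3` module. The table and column names (`entity`, `character`, `description`) and the sample rows are illustrative assumptions; this is not the DiversityDescriptions schema.

```python
import sqlite3

# In-memory database; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity    (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
    CREATE TABLE character (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
    -- One row per filled cell of the entity x character matrix.
    CREATE TABLE description (
        entity_id    INTEGER NOT NULL REFERENCES entity(id),
        character_id INTEGER NOT NULL REFERENCES character(id),
        value        TEXT    NOT NULL,
        PRIMARY KEY (entity_id, character_id, value)
    );
""")
conn.executemany("INSERT INTO entity VALUES (?, ?)",
                 [(1, "Taxon A"), (2, "Taxon B")])
conn.executemany("INSERT INTO character VALUES (?, ?)",
                 [(1, "body color"), (2, "body shape")])
conn.executemany("INSERT INTO description VALUES (?, ?, ?)",
                 [(1, 1, "brown"), (1, 2, "ellipsoid"), (2, 1, "dark red")])


def describe(conn, entity_id):
    """Return (character label, value) pairs recorded for one entity."""
    rows = conn.execute(
        "SELECT c.label, d.value FROM description d "
        "JOIN character c ON c.id = d.character_id "
        "WHERE d.entity_id = ? ORDER BY c.id, d.value",
        (entity_id,),
    )
    return rows.fetchall()
```

Because each matrix cell is a row rather than a column, new characters can be added in this sketch without altering the table structure, which is one reason simple attribute types suffice.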

These advantages of the simple character model are offset by a number of problems that are experienced when defining and applying characters. Fundamentally, the definition of character variables is a complex and difficult task. Characters should be independent (or as independent as possible). Many characters depend on specific circumstances (taxonomic or other scope, object part, instrumentation, measurement methods, etc.). Defining characters in a manner that data providers and consumers communicate successfully with each other is a serious problem. It is not uncommon that even the creators of a terminology start introducing duplicate (or near duplicate) characters when a large terminology contains over 1000 characters.

In an attempt to solve these problems, a number of proposals have been made to “decompose” the characters into more fundamental data items and base the information model on these items.

To the author’s knowledge, the first explicit application that not only conceptually analyzed characters as part-plus-property, but also explicitly stored them as such, is Taylor (1995). However, Taylor is primarily interested in automatically parsing large bodies of natural language descriptions, and only briefly describes this approach (and the need for relational characters, see below).

Two other projects have subsequently analyzed and tested the character decomposition approach in greater detail; these will be discussed in the following.

The Nemisys/Genisys model

Despite problems with the identification of object parts (Fig.10, p.38), object parts are a central concept in morphological or anatomical data. The majority of morpho-anatomical characters may be interpreted as a limited number of abstract observation methods applied to either the entire organism or a large number of object-parts (including regions and functionally defined “organs”).

Building on earlier studies by Lebbe (1991), Diederich, Fortuner and Milton in a series of articles (see Nemisys/Genisys model, p.21) proposed a descriptive information model which decomposed characters into “structures” (i.e., “parts” or “physical components”) and “properties”.

In this model characters are the intersection of two more or less hierarchically organized dimensions: object parts and basic properties (a concept they introduce, combining instrumentation, selected properties and methods, with data types, see “Basic property types”, p.62). The authors explicitly recognize that the model is optimized for morpho-anatomical data, but maintain that it is also useful for all other forms of descriptive data (e.g., physiological data).

Some publications on the Nemisys/Genisys model may be interpreted as a set of rules to restructure and reorganize an existing character list. Diederich & al. (1998) mention their recognition of 272 parts and over 1000 characters, and that the potential number of characters of 272 parts × 20 basic properties (5440 combinations) could grow into more than 5000 characters. This suggests that the entity “character” remained a useful concept under their model. Some of their proposals may best be interpreted as an analytical tool to organize characters in a pattern that increases the manageability of the terminology and that does not affect data storage management.

On the other hand, Diederich (1997) outlines a new data storage model where the combination of object part and basic property is no longer under terminological control and where part and property concepts may be combined freely at data recording time (Table 32). In addition, they introduce a concept called “name extension” that allows ad-hoc modifications of both part and property concepts. This model might perhaps look like Table 33. No field is mentioned in Diederich (1997) to store the object parts of relational basic properties; a column has been added for this in Table 33. Further, the model contains extension mechanisms: a) “Name extension” for the basic property (although often object parts are involved in the extensions) and b) “Qualifier” for states. Both mechanisms are closely related to the mechanisms discussed in “Modifiers” (p.189).

Note that to directly support any kind of ratios in a fully decomposed model, further columns may have to be added. For example, in an insect a ratio value may be calculated as the distance (= “property 1”) between the attachment point of front legs (= “part 1”) at the body (= “part 2”) and the attachment point of the middle legs (= “part 3”) at the body (= “part 2”), divided by the length (= “property 2”) of the segment of the front leg nearest to the body (i.e., front femur, = “part 4”). Clearly, this is a constructed example, but similar characters are not totally unrealistic because ratio estimates of immediately neighboring part lengths are relatively conveniently done without precise measurements or calculations. An actual example is whether the hind leg of a frog is longer than the distance from the hind-leg attachment to the nose of the frog.
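To make the column count of the constructed insect example concrete, the following sketch models a ratio as two measurements, each carrying its own property and one or more part locations. The class names, field names, and numeric values are hypothetical and serve only as an illustration; none of them come from the Nemisys/Genisys publications.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Location:
    part: str                       # e.g. "front leg" (part 1)
    at_part: Optional[str] = None   # e.g. "body" (part 2), if attached to it


@dataclass(frozen=True)
class Measurement:
    basic_property: str                 # e.g. "distance" or "length"
    locations: Tuple[Location, ...]     # one location, or two for a distance
    value: float                        # hypothetical measured value


def ratio(numerator: Measurement, denominator: Measurement) -> float:
    """Divide one measurement by another; both keep their own parts."""
    if denominator.value == 0:
        raise ValueError("denominator measurement is zero")
    return numerator.value / denominator.value


# Property 1 (distance) between parts 1 and 3, both located at part 2.
leg_gap = Measurement(
    "distance",
    (Location("front leg", "body"), Location("middle leg", "body")),
    3.0,
)
# Property 2 (length) of part 4.
femur = Measurement("length", (Location("front femur"),), 2.0)
```

In this sketch a single ratio references two properties and four part slots, which is why a flat part/property/state row cannot hold it without additional columns.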

Regardless of the details of the model, an important aspect of a property/object-part decomposition is that it is relational (i.e. two-dimensional) rather than hierarchically nested. For example, if during identification a compositional part of a biological object (e.g., a flower) can already be recognized, it will often be best to study multiple properties grouped by object part. If, on the other hand, the parts are difficult to distinguish, but an intuitively recognizable property concept is found, it may be more useful to list characters grouped by property. For example, in a fungal colony in a Petri dish it may be impossible to distinguish which hyphal structures are responsible for the color effect, but the color itself is readily observed.
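The two-dimensional nature of the decomposition can be sketched as follows: the same flat list of (part, property, state) records is grouped along either axis on demand. The records reuse the values of Table 32; the function name `group_by` and the tuple layout are assumptions made for this illustration.

```python
from collections import defaultdict

# (object part, basic property, state), values as in Table 32.
records = [
    ("Body", "color", "brown"),
    ("Body", "shape", "ellipsoid"),
    ("Head", "color", "dark red"),
    ("Head", "shape", "round"),
]


def group_by(records, axis):
    """Group records by 'part' or by 'property' without restructuring them."""
    index = {"part": 0, "property": 1}[axis]
    groups = defaultdict(list)
    for record in records:
        groups[record[index]].append(record)
    return dict(groups)


by_part = group_by(records, "part")          # useful when parts are recognizable
by_property = group_by(records, "property")  # useful when only the property is
```

Neither grouping is privileged in the stored data, which is the practical difference from a hierarchically nested layout where one axis would have to be chosen as the outer level.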

In the above example, if the compositional hierarchy is a kind-of hierarchy (i.e., a generalization), properties could be generalized. However, the examples in Diederich & al. (1997) rather suggest a part-of hierarchy of “structures and substructure”. This topic is further discussed in “Composition versus generalization” (p.153).

The Nemisys/Genisys model introduces valuable new approaches to descriptive data. However, it seems to be optimized for a particular form of morphological data (compare also requirement points 31ff, p.66). It is unclear to what extent it has actually been implemented; no formal documentation of an information model could be found.

Table 33. An example based on the detailed Nemisys/Genisys model (including the “name extension” and “qualifier” proposals).

| Entity (Taxon) | Object-part (Structure) | Basic property | Related part (Structure) | “Name extension” | State | Qualifier |
|---|---|---|---|---|---|---|
| 1 | Hemizonid | position relative to | excretory pore | - | anterior | slightly |
| 1 | Body | “kind” (color) | - | at excretory pore | brown | - |
| 1 | Eye | Distance to | Antenna | - | touching | - |

Notes: After Diederich (1997), where the model is outlined and discussed, but not shown in exactly this form. Here two columns are added: an ID-reference column for the entity, and a column to express the second object-part (structure) discussed for relational basic properties. The discussed version column is not shown here. The first example is from Diederich; the others were added.
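As a rough illustration, the rows of Table 33 could be held in a record type such as the one below. The class name, field names, and the `render` helper are assumptions introduced here; Diederich (1997) does not specify such a structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecomposedStatement:
    entity_id: int
    part: str                        # object-part (structure)
    basic_property: str
    related_part: Optional[str]      # second part of a relational property
    name_extension: Optional[str]    # ad-hoc extension of the property
    state: str
    qualifier: Optional[str]         # ad-hoc extension of the state

    def is_relational(self) -> bool:
        return self.related_part is not None

    def render(self) -> str:
        """Join the filled fields into a short readable statement."""
        subject = " ".join(
            x for x in (self.part, self.basic_property,
                        self.related_part, self.name_extension) if x
        )
        value = " ".join(x for x in (self.qualifier, self.state) if x)
        return f"{subject}: {value}"


rows = [
    DecomposedStatement(1, "Hemizonid", "position relative to",
                        "excretory pore", None, "anterior", "slightly"),
    DecomposedStatement(1, "Body", "kind (color)", None,
                        "at excretory pore", "brown", None),
    DecomposedStatement(1, "Eye", "distance to", "Antenna",
                        None, "touching", None),
]
```

The optional fields make visible how much of each row depends on the two extension mechanisms and on the added related-part column.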

The Prometheus description model

As mentioned, the Nemisys/Genisys model is a conceptual model that is only partly documented and the concepts of which are evolving from publication to publication. The Prometheus description model (p.21) is the second character decomposition model described so far and develops a fully functional model. As discussed on p.31, Prometheus replaces the term “character” with “description element”, partly perhaps to stress the character decomposition model. In line with the remainder of this thesis and to facilitate the comparability with the Nemisys/Genisys model, this change is not adopted and the term “character” is maintained here.

The Prometheus model differs from Nemisys/Genisys in several respects, for example:

Table 32. Fundamental part/property model.

| Entity (Taxon) | Object-part (Structure) | Basic property | State |
|---|---|---|---|
| 1 | Body | “kind” (color) | brown |
| 1 | Body | shape | ellipsoid |
| 1 | Head | “kind” (color) | dark red |
| 1 | Head | shape | round |

■ Generally, much more focus is placed on defining the terminology. Whereas in the Nemisys/Genisys model it is occasionally unclear whether ad-hoc natural language definitions are supported or even intended, Prometheus unambiguously supports only defined terms.

■ A strict distinction is made between quantitative and qualitative (i.e. categorical) properties.

□ “Quantitative properties” are a subclass of “Defined Terms”, may not be hierarchically arranged, and may not be constrained to the context of specific object components (“structures”).

□ “Qualitative properties” (term used in Fig.1 and 4 of Pullan & al. 2005) or “State groups” (term used in text of Pullan & al. 2005) are not “Defined Terms”, may be hierarchically arranged, and may be constrained to specific object components (“structures”).

□ The data recording process follows this distinction. Whereas quantitative properties are explicitly recorded together with the value, for categorical data only the character states are recorded; the qualitative properties are implied.

■ Structural models (using part-of relations) are used both in the terminology (potential composition, general constraints), and in description instances (actual composition). This mechanism is used for several purposes:

□ It supports and constrains data recording;

□ it replaces specialized plant structure terminology by building specializations (leaves on stem, in inflorescence) in an ad-hoc mode, constructing “structural paths”;

□ it provides the containers for part-specific measurement or aggregation data (compare “Boolean operators between characters”, p.98).

■ The problem of spatial areas or regions like tip, bottom, or center is addressed by introducing a subclass of structure (“Region”) which is freely combinable with structure. A similar subclass “Generic Structure” is created for parts that occur inconveniently frequently on multiple other parts (e.g. hairs). Again, instances of Generic Structure may be used on any part, without relations being present in the terminology.

■ The explosion of codable points that results from freely combinable part and property terminology is reduced, first by constraining the object components to which each state term is applicable, and secondly by allowing project managers to create so-called “pro-forma” definitions (i.e., something not essential, but only for form's sake). The “pro-forma” mechanism seems to be related to database views, creating restricted subsets of the entire terminology, but may entail other, more complex setup information as well.

■ The concept of modifiers is significantly enhanced. This is discussed separately, see p.196.

The hierarchical arrangement of properties addresses many of the issues criticized for the “Basic property types” (p.62) of the Nemisys/Genisys model. Also, the qualitative property/state group model is freely extensible, avoiding the need for artificial catch-all properties like “kind”. However, it remains unclear why quantitative and qualitative properties may not both be hierarchical and both be defined terms (or “concepts”). It would be desirable to create a hierarchy for size measurements (e.g., length, width, length including bristle, excluding bristle, etc.) and probably other properties as well. Furthermore, given that quantitative measurements may be expressed both as continuous and categorical measurements, and that mappings (p.66) between these may be defined, it seems unfortunate not to be able to browse data using a property hierarchy irrespective of the data type used (including the use of complex data types, p.59, e.g., for color).

The direct recording of character states without going through a hierarchical level of properties is certainly a very interesting feature. However, it seems to be truly a question of the user interface, not requiring changes to the information model (see p.129). Indeed it may be noted that the assumption that the property is implied by the states holds only for data recorded using unique identifiers. In natural language categorical states may be ambiguous (“hot” in an animal is likely to be temperature, in a fungus likely to be taste). Conversely, for quantitative data the property is often implied in the measurement unit (e.g., ‘g’ or ‘°C’ have implicit properties “weight” and “temperature”).
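The point about implied properties can be sketched with three lookup tables. The identifiers (such as `state:hot-temperature`) and the dictionary names are hypothetical; they only illustrate that a unique state identifier or a unit resolves to one property, whereas a natural language label may resolve to several.

```python
# Measurement unit -> implied property (examples from the text).
UNIT_TO_PROPERTY = {"g": "weight", "°C": "temperature"}

# Unique state identifier -> implied property (identifiers are made up).
STATE_ID_TO_PROPERTY = {
    "state:hot-temperature": "temperature",
    "state:hot-taste": "taste",
}

# Natural language label -> candidate state identifiers.
LABEL_TO_STATE_IDS = {
    "hot": ["state:hot-temperature", "state:hot-taste"],
}


def implied_properties(label):
    """Return every property a natural language state label could imply."""
    state_ids = LABEL_TO_STATE_IDS.get(label, [])
    return sorted({STATE_ID_TO_PROPERTY[s] for s in state_ids})
```

Here `implied_properties("hot")` yields two candidates, so the property is implied unambiguously only when the state is recorded by its identifier.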

The introduction of “structural paths” seems to be a generalization of the two-level (structure and substructure) storage model described in some versions of the Nemisys/Genisys model.

Structural paths simplify the structural terminology by avoiding the need for terms like “ground leaves”, “stem leaves”, etc. At the same time, they remove the possibility to give these parts a name. This seems to be unfortunate, since the question whether a part has a separate name is language- and culture-dependent as the example of German and English shows. In German, it would be logical to construct both petiole (leaf stalk) and pedicel (flower stalk) as a structural path, because German botanical language has no special terms for these. It is questionable whether this is desirable to an English botanist. Moreover, not only bracts (i.e., “leaves with single flower growing in axil”), but also sepals, petals, etc. are in fact modified leaves that could be expressed using structural paths rather than specialized terms.

Structural paths come at the expense of a complication of the storage model, requiring some means to store a path of unlimited length, and to be able to search both for the exact path (e.g., only “upper-stem-leaves”) and for generalized concepts (e.g., “any kind of leaf”). It is unclear whether Prometheus stores the entire path in each description, or whether an anonymous specialized concept is created for each path, which is then referenced by a system identifier.
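The second storage option, an anonymous concept per path referenced by a system identifier, might be sketched as follows. The `PathRegistry` class and its methods are assumptions for illustration only and say nothing about how Prometheus is actually implemented. Paths are written from the containing structure down to the described part.

```python
from typing import Dict, List, Optional, Tuple

Path = Tuple[str, ...]


class PathRegistry:
    """Assigns one system identifier to each distinct structural path."""

    def __init__(self) -> None:
        self._ids: Dict[Path, int] = {}

    def intern(self, path) -> int:
        """Return the identifier for a path, creating it on first use."""
        key = tuple(path)
        if not key:
            raise ValueError("a structural path needs at least one structure")
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]

    def exact(self, path) -> Optional[int]:
        """Exact search: only this path, or None if never recorded."""
        return self._ids.get(tuple(path))

    def ending_in(self, structure: str) -> List[int]:
        """Generalized search: any path whose final structure matches."""
        return sorted(i for p, i in self._ids.items() if p[-1] == structure)


registry = PathRegistry()
stem_leaf = registry.intern(("stem", "leaf"))
inflorescence_leaf = registry.intern(("inflorescence", "leaf"))
leaf_hair = registry.intern(("stem", "leaf", "hair"))
```

Descriptions would then store only the identifier, while a query for “any kind of leaf” expands to the identifiers returned by `ending_in("leaf")`.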

In general, Prometheus seems to deduce some of its requirements from particular features of the English language, e.g., when requiring that “state terms” may not be used as part of structure terms, and that any term may at most be “coded using one or two words” (Pullan & al. 2005).

Such rules need some generalization to make them compatible with languages that prefer derived nouns over adjective-noun clauses or require more than two words to express a single concept.

Even in English it is doubtful whether these rules indeed guarantee that “data can only be coded one way, even when entered by different authors” (Pullan & al. 2005).

If for each description an actual “structural path” is created that is based on first class object parts (i.e., structural terms for which compositional constraints exist in the terminology), then this is easily extended by adding elements for which no such constraints are defined in a similar fashion. These are the “Generalized Structure” and “Region” terms introduced by Prometheus. Of these, the regions are truly general (compare “Absolute object orientation” and “Relative object orientation in compositions”, p.147). The concept of regions seems to be closely related to spatial modifiers (which also exist in Prometheus, see p.196) and it remains open why two separate mechanisms are required. The mechanism of “Generalized structures” is less a logical requirement than a convenience mechanism. Clearly, parts like “hairs” are not truly applicable to all parts of an organism, especially not if the compositional hierarchy includes anatomical parts. However, it remains at the discretion of the builder of an ontology whether the mechanism is used or not.

The pro-forma mechanism is further discussed on p.127 (compare also Figs. 39-40).

An essential feature of the Prometheus model is that it tries to reform the way taxonomy and descriptions are performed. Although other models allow recording of individuals (including DELTA, and with increasing support DiversityDescriptions and SDD), Prometheus goes to the extent of considering abstract taxon descriptions as a set of “virtual specimens”, thus encouraging the recording of actual specimen data instead. Similarly, biological terminology may only be used if it fits the assumptions of the structure+property/state group model. In some cases a decomposition of characters into parts and properties requires reformulating biological terminology and organizing knowledge differently, especially where functional concepts are used as organizing principles. This may be less convenient during identification because it corresponds less well with expectations of the identifying person, but it may lead to more consistent use of terminology (Table 34).

Clearly, such an approach has advantages, but it is yet unclear whether it provides the flexibility that biologists desire for their work. The Prometheus authors themselves refer to extensive testing that is required. Results of this have not yet been published.

Table 34. Examples of conventional characters that are difficult to decompose into property and object part (Lucid, left) and proposals how they might be handled in Prometheus.

Lucid (Character; States):

tolerance
• plants tolerating high salt levels (halophytes)
• plants not salt tolerant

General

• plants growing in soil (not epiphytic or lithophytic)
• plants growing on other plants or on bare rock surfaces (epiphytic or

• rooted in substrate with leaves mostly submerged
• rooted in substrate with leaves mostly floating on the water surface
• rooted in substrate with leaves mostly emergent

• present (plants green or grey-green)

• partially or totally parasitic on other plants

• sticky glands or glandular hairs on leaves/stems
• trap like irritable leaf blade segments

Prometheus description model:

| Structure | Property | States | (Notes by T. |
|---|---|---|---|
| Entire Plant | Ecological adaptations | halophytic | (list of alternatives, or “not”) |
| Entire Plant | Habit | tree, shrub, herb, etc. | |
| Entire Plant | Architecture | climbing, bushy, creeper, twining etc. | |
| Entire Plant | Lifespan | annual, biennial, ephemeral, perennial | |
| Entire Plant | Reproduction | vegetative | (list of alternatives, or “not”) |
| Bulb | Presence | present, absent | |
| Corm | Presence | present, absent | |
| Tuber | Presence | present, absent | |
| Rhizome | Presence | present, absent | |
| Stolon | Presence | present, absent | |
| Root-sucker | Presence | present, absent | |
| Detached | | | |
| Inflorescence | Type | proliferous | (‘types’ of structures = associated sets of states; aerial stem parts might be a type of stem) |
| Leaf-chlorophyll | Presence | present, absent | |
| Stem-chlorophyll | Presence | present, absent | (uses structural |

Notes on Table 34: Examples based on public postings to tdwg-sdd@listserv.nhm.ku.edu on 2004-03-17; available in TDWG-SDD list archive. Left-hand columns based on the Lucid key “The Families of Flowering Plants of Australia”, provided by Kevin Thiele; right-hand columns based on reply by Trevor Paterson.

121. Summary statement: The Prometheus description model has very special requirements on the information model. It elaborates and modifies the concepts of the Nemisys/Genisys model. It is implemented and tested. The extent to which this model is specific to certain kinds of data needs to be assessed as experience with the model grows.

122. Summary statement: The Prometheus description model provides for the definition of a subset of all possible object-part/property combinations for data entry. For different projects, different sets of “enabled” object-part/property combinations may be defined. The union of all enabled selections is roughly equivalent to characters in character or character state matrix models.
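The relation between the full combination space, per-project selections, and conventional characters can be sketched with plain sets. The part and property names, the project names, and the `PRO_FORMAS` mapping are invented for this illustration and do not reflect actual Prometheus terminology or storage.

```python
from itertools import product

parts = ["Body", "Head", "Leaf"]
properties = ["color", "shape", "length"]

# Every combination that a freely combinable terminology would allow.
all_combinations = set(product(parts, properties))

# Hypothetical per-project selections of enabled combinations.
PRO_FORMAS = {
    "project_a": {("Body", "color"), ("Body", "shape")},
    "project_b": {("Body", "color"), ("Leaf", "length")},
}


def validate(pro_formas, allowed):
    """Check that each project enables only combinations that exist."""
    return all(selection <= allowed for selection in pro_formas.values())


def character_equivalents(pro_formas):
    """Union of all enabled selections, roughly the character list."""
    return set().union(*pro_formas.values())
```

With these sample values, nine combinations are possible but only three are enabled in any project, which illustrates how the selection step reduces the number of codable points.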

Relational characters revisited

As mentioned in the discussion of the character decomposition models, a number of characters typically used in biological descriptions have a different structure than part+property+value.

Taylor (1995) was the first to introduce special data structures for “relational characters”, and both Nemisys/Genisys and Prometheus are addressing the problem. The following cases may be distinguished (Table 35):

Table 35. Cases that may be termed a “relational character”.

| Situation | Examples depending on two parts | Examples depending on two properties |
|---|---|---|
| 1. A single measurement is by necessity dependent on two object parts/properties | Distances or angles between two parts | (no example found) |
| 2. A measurement primarily depends on one part/property | | |

In the document Structuring Descriptive Data of Organisms: Requirement Analysis and Information Models (pages 116-125)