Current usage of modifier-related concepts

DELTA and New DELTA: Although the need for modifications of character state×entity instances was recognized (Dallwitz 1980), the provision of free-text comments (Fig.99 top) was considered sufficient at the time. Proposals for a revised “New DELTA” (p.20) did include several additional mechanisms, called “coded comments”. Some of these are related to modifiers:

■ Two forms to express probability or frequency values (“<@probability x>”, “<@x%>”).

These can be combined with free-form text like “frequently” or “usually”, but do not replace it. An exception is “rarely”, which is available as a system-defined coded comment (“<@rarely>”). No other frequency categories are defined in “New DELTA”, nor can they be defined by the content authors. It is not possible to distinguish between probability based on frequency and probability based on uncertainty estimates.

■ The coded comment “<@about>” for quantitative numeric data as a system-defined approximation modifier.

■ The coded comment “<@?>” to mark guessed values, essentially a system-defined “probably”. No other certainty modifiers (such as “perhaps”) are available.

■ The coded comment “<@reliability x>” to modify, within a specific description, the reliability that the terminology defines for the character.

Further coded comment mechanisms in “New DELTA” support information about values that are inherited or calculated along the taxonomic hierarchy (“<@up>”, “<@down>”), supply hidden notes not visible in natural language output (“<@note: text>”, replacing the DELTA ‘inner comment’ mechanism), and specify alternative character values for particular applications (“<@use n: s>”). These are all unrelated to the concept of modifiers.

The coded comment system of New DELTA is not extensible; only modifiers defined in the standard may be used.
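The closed nature of the coded comment vocabulary can be illustrated with a minimal sketch. It assumes that every coded comment has the surface form “<@body>” and that the keyword is separated from its argument by a colon or whitespace; these parsing details, and the names `classify` and `find_coded_comments`, are illustrative assumptions derived only from the examples quoted above, not from the New DELTA proposal itself.

```python
import re

# Keywords quoted in the text, grouped by whether they relate to modifiers.
MODIFIER_KEYWORDS = {"probability", "rarely", "about", "?", "reliability"}
OTHER_KEYWORDS = {"up", "down", "note", "use"}

# Assumed surface syntax: "<@" + body + ">", without nested angle brackets.
CODED_COMMENT = re.compile(r"<@([^<>]*)>")
PERCENT = re.compile(r"^(\d+(?:\.\d+)?)\s*%$")


def classify(body):
    """Return (keyword, argument, is_modifier) for one coded-comment body."""
    body = body.strip()
    percent = PERCENT.match(body)
    if percent:  # the "<@x%>" frequency form
        return ("percent", percent.group(1), True)
    # Split the keyword from its argument at the first colon or whitespace.
    parts = re.split(r"[:\s]+", body, maxsplit=1)
    keyword = parts[0]
    argument = parts[1] if len(parts) > 1 else ""
    if keyword in MODIFIER_KEYWORDS:
        return (keyword, argument, True)
    if keyword in OTHER_KEYWORDS:
        return (keyword, argument, False)
    # The vocabulary is fixed: content authors cannot add e.g. "<@perhaps>".
    raise ValueError(f"coded comment not defined in the standard: {keyword!r}")


def find_coded_comments(text):
    """Classify all coded comments found in a piece of description text."""
    return [classify(m.group(1)) for m in CODED_COMMENT.finditer(text)]
```

In this sketch a user-defined category such as “<@perhaps>” is rejected, mirroring the fact that only “<@rarely>” and “<@?>” exist as system-defined frequency and certainty categories.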

NEXUS (p.18): For categorical data, NEXUS supports counts or frequencies if the subcommands “StateFormat=Frequency” or “StateFormat=Count” are given. All entries in the “Data, Matrix” block must then be enclosed in parentheses. For example, in “taxon_1 (0:0.25 1:0.75) (0:1.0 1:0.0 2:0.0)” the first character is polymorphic with states ‘0’ and ‘1’, while the second is monomorphic with state ‘0’ (the alternative states ‘1’ and ‘2’ have frequency 0). No other forms of modifiers (or free-form text comments) are available in NEXUS.
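The example row can be read mechanically, as the following minimal sketch shows. It is not a NEXUS parser: it assumes a single matrix row in which the taxon name contains no blanks and each parenthesized entry consists of whitespace-separated “state:value” pairs. The function names are hypothetical.

```python
import re

# One parenthesized entry per character, e.g. "(0:0.25 1:0.75)".
ENTRY = re.compile(r"\(([^()]*)\)")


def parse_frequency_row(row):
    """Split one matrix row into the taxon name and per-character frequencies."""
    taxon, _, rest = row.strip().partition(" ")
    characters = []
    for match in ENTRY.finditer(rest):
        freqs = {}
        for pair in match.group(1).split():
            state, _, value = pair.partition(":")
            freqs[state] = float(value)
        characters.append(freqs)
    return taxon, characters


def observed_states(freqs):
    """States with a non-zero frequency; more than one means polymorphic."""
    return [state for state, freq in freqs.items() if freq > 0]


taxon, characters = parse_frequency_row("taxon_1 (0:0.25 1:0.75) (0:1.0 1:0.0 2:0.0)")
first_is_polymorphic = len(observed_states(characters[0])) > 1
second_is_polymorphic = len(observed_states(characters[1])) > 1
```

Note that the frequency is the only modification that can be attached to a state in this representation; there is no slot for certainty, approximation, or free text.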

DiversityDescriptions (Fig.99 top): At about the same time that New DELTA was initially proposed, a generalized modifier concept was developed in DiversityDescriptions. Already the first version (Hagedorn 1997) included a flexible concept of user-definable, reusable modifiers. Modifiers were defined in a single list for an entire project (Table 46), with a separate list defining the applicability of modifiers to characters. In descriptions, only applicable modifiers are selectable for a given character. For categorical characters, the list of states and the list of applicable modifiers can be freely combined in descriptions.

Modifiers were categorized into usage categories (which could be used when defining the applicability of modifiers to characters) and a number of properties could be defined for each modifier (influence on character value reliability, output in natural language before or after the value, and with or without a blank). A template definition of modifiers was provided as a convenience to content authors, but modifiers could be freely added or deleted.

The “usage categories” of DiversityDescriptions are unordered sets; it is not generally possible to define a ranking within such a set. In newer versions of DiversityDescriptions, a ranking can be indirectly defined for frequency modifiers through a quantitative frequency interval (attributes “LowerFreq”, “UpperFreq”, not shown in Table 46). As a result of the discussions in the SDD group, later versions of DiversityDescriptions further added the concept of misinterpretation modifiers (see below, p.209) pioneered by CBIT Lucid/Lucid3.

Both the list of modifiers (Table 46) and the usage categories are fully extensible in DiversityDescriptions, separately for each project.

Table 46. Excerpt from template modifier definitions in DeltaAccess/DiversityDescriptions 1.0 (Hagedorn 1997; later versions added additional concepts).

Usage

* Usage class = broad categorization of modifiers. Reliability = expression allowing a modifier to influence the reliability score of the base statement or character. Postfix = in natural language, the modifier is rendered after the modified statement (else before). UseBlank = the modifier is connected with the statement using a blank. Compare p.347.
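The modifier properties described above can be sketched as a small data structure. The field names follow the terms used in the text (“Postfix”, “UseBlank”, “LowerFreq”, “UpperFreq”) but are otherwise an assumption and do not reproduce the actual DiversityDescriptions database schema; the frequency intervals in the example are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Modifier:
    label: str
    usage_category: str
    postfix: bool = False      # render after the statement (else before)
    use_blank: bool = True     # connect label and statement with a blank
    lower_freq: Optional[float] = None  # only meaningful for frequency modifiers
    upper_freq: Optional[float] = None


def render(modifier, statement):
    """Combine a modifier and a statement for natural language output."""
    sep = " " if modifier.use_blank else ""
    if modifier.postfix:
        return statement + sep + modifier.label
    return modifier.label + sep + statement


def rank_by_frequency(modifiers):
    """Order modifiers indirectly through their frequency intervals.

    Modifiers without an interval cannot be ranked and are left out.
    """
    ranked = [m for m in modifiers
              if m.lower_freq is not None and m.upper_freq is not None]
    return sorted(ranked, key=lambda m: (m.lower_freq, m.upper_freq))


# Hypothetical project list; the intervals are illustrative values only.
usually = Modifier("usually", "Frequency", lower_freq=0.7, upper_freq=1.0)
rarely = Modifier("rarely", "Frequency", lower_freq=0.0, upper_freq=0.1)
guessed = Modifier("(?)", "Certainty", postfix=True)
ranked_labels = [m.label for m in rank_by_frequency([usually, guessed, rarely])]
```

The sketch makes the limitation noted in the text visible: the certainty modifier belongs to an unordered usage category and drops out of the ranking, because only the frequency interval supplies an order.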

CBIT Lucid (p.21, Fig.101 top): The absence of structured mechanisms to express frequency and misinterpretation information was a major reason for the developers of Lucid to develop a proprietary exchange format instead of using DELTA (K.Thiele, pers.comm.). Lucid contains a small set of value modifications, namely “rare”, “misapplied”, and “uncertain”. This list is not extensible because it is part of the character scoring mechanism inside the data matrix (i.e., modifications are not added to a score, but implied in the alternative score options: “absent, present, unknown, rare, commonly misinterpreted, and rarely misinterpreted”). Note that no “official” documentation of the LIF format was found; the information given here is inferred from data sets analyzed (Leary & Hagedorn 2004) and confirmed by K.Thiele (pers.comm.), one of the main designers of Lucid. In his comparison, Dallwitz (2006d) records Lucid as having “value is unknown”, but no “value probability” (i.e. “uncertain”) modifier. Lucid indeed calls the modifier in question “unknown”. If all states of a character are scored such, the result will indeed be equivalent to a coding status value “unknown” for the entire character (no data have been or could be observed, see p.74). However, if some states are scored normally and others as “unknown”, the result will be best interpreted as these states being “uncertain”. The developers of Lucid support this interpretation of “unknown” when they state that “Lucid can encode uncertainty for a state, while in DELTA uncertainty can only be encoded for a character” (http://www.lucidcentral.org/lucid3/lucid_translator.htm). Lucid supports no unconstrained text, so that all expressiveness is limited to the three modifiers supported, and to modifiers embedded in the character or state definitions. This design decision makes Lucid simple to use for identification purposes, but limiting when aiming for comprehensive natural language descriptions.
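The two readings of a Lucid “unknown” score discussed above can be made explicit in a short sketch. The enumeration lists the score options quoted in the text; the function name, the return labels, and the dictionary representation of a character's state scores are assumptions for illustration, since no official LIF documentation was available.

```python
from enum import Enum


class Score(Enum):
    # Score options as quoted in the text; modifications are implied by
    # the choice of score rather than attached to it.
    ABSENT = "absent"
    PRESENT = "present"
    UNKNOWN = "unknown"
    RARE = "rare"
    COMMONLY_MISINTERPRETED = "commonly misinterpreted"
    RARELY_MISINTERPRETED = "rarely misinterpreted"


def interpret_unknown(state_scores):
    """Interpret the "unknown" scores of one character in one taxon.

    state_scores maps state names to Score values. Returns a label and the
    list of states that are best read as "uncertain".
    """
    unknown = [state for state, score in state_scores.items()
               if score is Score.UNKNOWN]
    if state_scores and len(unknown) == len(state_scores):
        # Every state is unknown: equivalent to a coding status for the
        # entire character.
        return ("character unknown", [])
    if unknown:
        # Mixed scoring: the unknown states act as a certainty modifier.
        return ("states uncertain", unknown)
    return ("fully scored", [])


all_unknown = interpret_unknown({"white": Score.UNKNOWN, "red": Score.UNKNOWN})
mixed = interpret_unknown({"white": Score.PRESENT, "red": Score.UNKNOWN})
```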

XPER and XPER²: These programs (tested March 2007, latest version XPER²: 1.70) do not support modifier concepts. XPER² supports unconstrained text on character data (“descriptors”), but not on individual state scores. This makes it difficult to express state-specific modifier information the way the original DELTA does.

Nemisys/Genisys: The mechanism of modifiers is related to the “name-extension” mechanism (Fig.101 bottom) proposed in Diederich (1997) and Diederich & al. (1997). A name-extension modifies the semantics of either the part (i.e. “structure”) or the property name of a character. It is proposed as a mechanism to curb the explosion in the number of characters that may otherwise occur; e.g., Fig.9 in Diederich (1997) lists eleven different ways of measuring body diameter of nematodes.

Although both constituents of the decomposed character potentially need modification in a single character, according to Diederich (1997) only a single data element exists for the “name-extension”. Thus either structure or property, but not both, may be modified. The examples given in the tables in Diederich & al. (1997) seem to support this interpretation, making “name-extensions” and modifiers rather similar (compare examples given in Table 47). Some guidelines are given in Diederich & al. (1997) indicating that modifiers referencing structural terms (at mid-body, at anus) are to be constrained to existing, defined structures, and extensive guidelines are given on the use of spatial modifiers (anterior/posterior etc.). No similar guidelines are given for non-structural modifiers.

Table 47. Examples of use of name-extensions from Diederich & al. (1997).

| Structure | Structure extension | Substructure | Property | Property extension | State |
|---|---|---|---|---|---|
| Excretory pore | – | – | position relative to | {median bulb, nerve ring} | anterior |
| Stylet | – | – | length | {along the axis} | (value) |
| Body | – | – | diameter | {at mid-body, at stylet, at Vulva, at Anus} | (value) |
| Body | {anterior part, posterior part} | Lateral fields | orientation | – | symmetrical |

As defined above, “name-extensions” are meant to modify only structure or property terms, not quantitative values or categorical states. Consequently, no equivalents for frequency or probability modification (“rarely”, “perhaps”, etc.) are discussed in the Nemisys/Genisys model.

It remains unclear whether a separately defined, reusable modifier terminology is intended or not. Diederich (1997) states that “Instances of basic properties maintain a list of name extensions”, and “when a character is created, an instance of a basic property is created”. However, Diederich (1997) does not distinguish between character variable (terminology) and character data (description); the term character may refer either to “(biological structure, property, state/value)” or to “(structure name, property) tuples, ignoring the states”. Thus, the statement may refer either to constraining name-extensions in descriptions by a list of defined extensions in the terminology, or to a model where within each description, a list of multiple name extensions is allowed.

“Name extensions” are further used to represent the additional information necessary for Diederich and Fortuner's “relational properties” (e.g., “presence-at …”, requiring a second structure, see Table 9, p.63 and compare “Relational characters revisited”, p.122). One may view “relational properties” as properties that are constrained to require a structural modifier. It is unclear whether in this case the “name extension” is limited to defined structure terms, or whether it is in fact considered free-form text, allowing a combination of modifier and related-structure information. In the first case, it may be difficult to express a relation to a part of a structure that would otherwise require a modification. Furthermore, the main structures in the Nemisys/Genisys model are divided into structure and substructure, whereas name extension is a single data element, again pointing to the interpretation that the name extension is originally designed as a free-form text element.

Prometheus description model (p.21): This character decomposition model elaborates and refines the Nemisys/Genisys proposals. The name “modifier” is explicitly used for the Nemisys/Genisys “name extensions”, and the missing concept of frequency modifiers is added (certainty does not seem to be mentioned, but would be a simple extension of the model). Similar to the Nemisys/Genisys name extensions, Prometheus modifiers have a dual nature:

■ A basic type of modifier modifies an otherwise complete statement: How frequently was something observed, when was it observed in time, or where exactly was something observed within a structure. Pullan & al. (2005) further distinguish:

□ Simple categorical modifiers (they only name frequency as an example); these may be stored directly as an attribute of a description element.

□ Modifiers which modify a statement by reference to a spatial or temporal “landmark” (examples: modifier = “at”+spatial landmark = “breast height” or modifier = “during”+temporal landmark = “summer”). In this structure, the modifier part will only take very few values (e.g., “at, below, above” or “after, before, during, or while”). It seems implied that this information cannot be stored as part of description elements but requires a separate data structure.

■ An entirely distinct type, the relative modifier, combines two existing description elements (rather than referring to one) and expresses knowledge about their relative order, rank, or ratio (supporting the fixed operators “>, ≥, <, ≤, =, ≠, ratio-of”), which may be combined with a value. Using this method, information such as “leaf width > length” or “petal length > 2× sepal length” may be expressed.

The second type, the relative modifier, is related to the use of name extensions together with the “relational properties” in Nemisys/Genisys. The solution is, however, different and more general. Whereas in Nemisys/Genisys the name-extension would be used to complete an otherwise incomplete single statement in a single record, Prometheus seems to introduce a concept of non-scored or non-valued “description elements” (i.e., a part-property-value tuple without a categorical or quantitative value). Two such description elements (valued or unvalued) are then combined with a relative modifier to form a new statement. Table 48 is an attempt to illustrate the proposed solution.
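The structure described in this paragraph can be sketched as follows. This is one possible reading of Pullan & al. (2005), not their data model: the class and field names are hypothetical, and only the operator list is taken from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Fixed operator list as quoted in the text.
OPERATORS = (">", "≥", "<", "≤", "=", "≠", "ratio-of")


@dataclass(frozen=True)
class DescriptionElement:
    structure: str
    prop: str
    value: Optional[float] = None  # elements may be unvalued


@dataclass(frozen=True)
class RelativeModifier:
    left: DescriptionElement
    operator: str
    right: DescriptionElement
    value: Optional[float] = None

    def __post_init__(self):
        if self.operator not in OPERATORS:
            raise ValueError(f"unsupported operator: {self.operator!r}")

    def as_text(self):
        left = f"{self.left.structure} {self.left.prop}"
        right = f"{self.right.structure} {self.right.prop}"
        if self.operator == "ratio-of":
            # Here the value is the result of the ratio.
            text = f"ratio of {left} to {right}"
            return text if self.value is None else f"{text} = {self.value:g}"
        # Here the value is a multiplier applied to the right-hand element.
        factor = "" if self.value is None else f"{self.value:g}× "
        return f"{left} {self.operator} {factor}{right}"


petal = DescriptionElement("petal", "length")
sepal = DescriptionElement("sepal", "length")
comparison = RelativeModifier(petal, ">", sepal, 2).as_text()
ratio = RelativeModifier(DescriptionElement("leaf", "length"), "ratio-of",
                         DescriptionElement("leaf", "width"), 2.1).as_text()
```

The sketch also shows why a single “value” element is awkward: the same field holds a multiplier in the comparison case and a result in the ratio case.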

It is debatable whether this is appropriately called a “modifier”. The Prometheus “relative modifiers” are structurally completely different from the basic modifiers, and it remains unclear which information they modify; they rather create an entirely different form of statement. The concept does not meet the definition of modifiers proposed here, nor do any of the dictionary definitions of “modifiers” cited above support calling comparison operators “modifiers”. Pullan & al. (2005) themselves explicitly state the requirement that “when querying descriptions, it is possible to ignore these modifiers without detrimental effects to the results.” This requirement is clearly violated by “relative modifiers”.

Further problems with the concept of relative modifiers are:

■ It is unclear how a statement combining a ratio and a relative operator, such as “length-width-ratio > 2.0” (a frequent form in mycology), is handled. This problem seems to indicate that handling comparison operators and functions in a single data element is inappropriate.

■ Pullan & al. (2005) in their Figure 2 explicitly mention a requirement for a “Defined Unit” (i.e., measurement unit, like cm) for relative modifiers. This requirement needs further study: ratio or comparison statements should normally be dimensionless, and no example of a relative modifier requiring a measurement unit was given.

Returning to the basic modifiers, it is debatable whether the distinction between simple and landmark modifiers is justified. The separation of spatial modifiers into two parts (modifier plus landmark) is clearly often advantageous since defined structural terms can be reused (“stem hairy”+“at”+“tip”, “base”, “middle”, “inflorescence”). However, the quoted example of “at”+“breast height” already indicates that spatial modifier landmarks will not always be a natural part of the part-of vocabulary used in character decomposition (other example: “width”+“at”+“widest point”). In the case of temporal modifiers the landmark vocabulary will be even less reusable (e.g., “nightfall”, “fruiting”, “first flowering”, or “spring after cutback”). However, if a separation into two data elements is already introduced for spatial modifiers, reusing it for temporal modifiers is logical. It remains unclear in the model in which part of the terminology landmark terms (special spatial as well as temporal ones) will be defined, and whether separate data structures are envisioned for this.

In Prometheus, neither frequency nor spatial nor temporal modifiers may carry a quantitative expression (e.g., a frequency value, a distance along the stem, or a time of day); quantitative values are reserved for relative modifiers.

Table 48. Interpretation of the structure of relative modifiers, based on information in Pullan & al. (2005). Note that description elements may or may not have values. Note also that the interpretation of “value” in the modifier entity differs strongly depending on whether length/width = 2.1 or length ≥ 2× width is expressed.

Operator Value Meas. Unit

1 1 2 Ratio 2.1

"Name extension" seems to be related to modifiers; it remains unclear, however, whether it is to be constrained by a terminology or not.

Figure 101. Simplified comparison of CBIT Lucid3 and Nemisys/Genisys models in regard to modifiers and free-form text annotations. Compare Fig.89 (p.179) for additional information about the Nemisys/Genisys model.

SDD (p.20) defines modifiers as part of the descriptive terminology. These modifiers are grouped into sets of modifier concepts. In descriptions, character data may be modified by referring to these defined modifier concepts. SDD also supports unconstrained free-form text for individual character state scores, replicating the functionality present in DELTA.

Modifiers were intensively discussed in the SDD group and repeatedly changed. In SDD version 1.1, the main concept ontology, which is used to create character hierarchies, is also used to define modifiers. Thus a hierarchy of modifier concepts may be created, e.g., with several different sets for frequency modifiers. Modifier concepts may contain a special specification stating whether the sequence of modifiers defined within a concept is considered to be significant (“ordered=true”, as in ‘weakly’–‘moderately’–‘strongly’) or not. With “ordered=false”, the sequence of modifiers is intended for display purposes only and carries no additional semantics.

Both the concepts (modifier sets) and individual modifiers are fully user-definable and not constrained by SDD. However, to support application interoperability and identification processes, this is complemented by two modifier specification attributes: a modifier Class, with an enumerated list of values (see Table 49), and a quantitative range that, depending on the modifier class, is interpreted as a quantitative frequency, certainty, etc. estimate. Modifier classes have been defined where a quantitative range was considered desirable, or where modifiers are expected to influence data analysis and identification processes (especially frequency, probability, and misinterpretation). The modifier class “Other” is left undefined, to support any future uses of modifiers.

Table 49. Enumerated modifier classes in SDD 1.1 (based on annotations in the SDD schema).

| Class | Description | Interpretation of Proportion |
|---|---|---|
| Frequency | Frequency modifier. Examples: “rarely, occasionally, usually”. | Values specify a frequency range |
| Certainty | Certainty modifier. Examples: “perhaps, probably”. | Values specify a certainty range |
| Seasonal | Seasonal modifier. Example: “in spring”. | Values specify a season of the year. The value 0 is to be interpreted as day 1, the value 1 as day 365 of the year |
| Diurnal | Diurnal modifier, referring to parts of the day (24 h clock, i.e., including ‘nocturnal’ events). Examples: “in the morning, at night”. | Values specify a time of the day. The values 0 and 1 are both to be interpreted as midnight. Example: A modifier “at night” may be specified as ‘0.8-0.2’. |
| TreatAsMisinterpretation | States to which modifiers of this class are added are known to be intentionally wrongly scored to anticipate known misunderstandings of the character under study. Example: if bracts look like petals, petals may be scored as ‘white (by misinterpretation)’. | None |
| Other | All other modifiers for which specifications are not yet defined. Examples are developmental, absolute, and relative spatial modifiers, or modifiers of degree. | None |
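The interpretation of the proportion for the cyclic classes can be illustrated with a minimal sketch. The wrap-around test reflects the “at night” example (‘0.8-0.2’); the linear mapping of proportions to days of the year and all function names are assumptions, since the table only fixes the end points (0 = day 1, 1 = day 365).

```python
def in_range(value, lower, upper, cyclic=False):
    """Test whether a proportion in [0, 1] lies within a modifier range.

    For cyclic classes (e.g. Diurnal) a range with lower > upper wraps
    around the end of the cycle, as in "at night" = 0.8-0.2.
    """
    if lower <= upper:
        return lower <= value <= upper
    if not cyclic:
        raise ValueError("lower bound exceeds upper bound in a non-cyclic range")
    return value >= lower or value <= upper


def hour_to_proportion(hour):
    """Map a time on the 24 h clock to a proportion; 0 and 24 are midnight."""
    return (hour % 24) / 24


def proportion_to_day(proportion):
    """Map a Seasonal proportion to a day of the year (0 -> 1, 1 -> 365).

    Linear interpolation between the two end points is an assumption.
    """
    return 1 + round(proportion * 364)


at_night = (0.8, 0.2)
late_evening = in_range(hour_to_proportion(23), *at_night, cyclic=True)
early_morning = in_range(hour_to_proportion(3), *at_night, cyclic=True)
noon = in_range(hour_to_proportion(12), *at_night, cyclic=True)
```

With these definitions, 23:00 and 03:00 both fall within “at night”, whereas 12:00 does not; a Frequency or Certainty range would be tested with `cyclic=False`.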

A separate mechanism allows recommending modifiers for certain characters. This “enabling” of modifiers for characters has deliberately not been formulated as an identity constraint; it is perfectly legal to have a modifier in a description that is not currently recommended for this character (it must only be present in the terminology of the data set). A major reason for this design is that large institutional data sets may contain descriptions from various sources (e.g., NLP-processed natural language descriptions, imported DELTA, or CBIT “LIF” data). These older descriptions may use a richer set of modifiers than what has been agreed in a project to use in the future.

SDD conforming applications may choose whether to use the information about recommended modifier/character association for data entry or not. They are encouraged to do so.
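The difference between a recommendation and a constraint can be sketched as a three-way check. The function and the plain-Python representation of the terminology are hypothetical and do not correspond to SDD element names.

```python
def check_modifier(modifier, character, terminology, recommended):
    """Classify the use of a modifier on a character in a description.

    terminology: set of all modifiers defined in the data set.
    recommended: dict mapping a character to its recommended modifiers.
    """
    if modifier not in terminology:
        # The only hard requirement: the modifier must be defined.
        return "invalid"
    if modifier not in recommended.get(character, set()):
        # Legal, e.g. in imported legacy data; data entry tools may
        # simply not offer it for new descriptions.
        return "valid, not recommended"
    return "valid, recommended"


terminology = {"rarely", "usually", "perhaps", "at night"}
recommended = {"petal colour": {"rarely", "usually"}}

offered = check_modifier("rarely", "petal colour", terminology, recommended)
legacy = check_modifier("perhaps", "petal colour", terminology, recommended)
undefined = check_modifier("seldom", "petal colour", terminology, recommended)
```

An application following this reading would use the recommendation list to populate its data entry forms while still accepting the second case on import.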

From the document: Structuring Descriptive Data of Organisms: Requirement Analysis and Information Models (pages 192-199)