Terminology modules and class hierarchy - Structuring Descriptive Data of Organisms

It is conceivable to create a hierarchy of terminology modules (i.e., sets of terminology elements) that follows a taxonomic hierarchy (Fig.92). Although the model is attractive, it has several limitations:

■ Phylogenetic classification is an area of active research, and the taxonomic hierarchy in many biological groups is not stable. Changes in a taxonomic hierarchy that defines usable terminologies would be difficult to implement once thousands of researchers use such a central terminology system on the internet.

■ The characters that are desirable at a higher level for the purpose of identification are not necessarily phylogenetically informative. A purely phylogenetic design of the taxonomy-dependent hierarchy is therefore not possible. For example, the vegetative stage of a fern like Marsilea quadrifolia L. may easily be confused with a flowering plant. Thus, even if leaf size and shape are too variable to be used for phylogenetic purposes, it is desirable to have them at a very high taxonomic rank, to support vegetative identification without prior knowledge of taxonomy.

■ The scientific process of revising taxa is a bottom-up process. The most urgent need for terminology and digital descriptions is present at the level of genus or family. It would be unproductive to postpone using advanced computer-supported description software until the taxonomic tree is stable and the terminology modules for the higher taxonomic levels have been agreed upon.

Despite these limitations, a hierarchy of terminology modules designed for taxonomic groups is desirable and should be supported in the information model. At the moment, however, the taxonomic hierarchy should not be a required element in the organization of the terminology. Instead, it may be used to label and organize terminology modules that are then manually selected and combined in a project. Judicious use can limit the danger that may result from changes in parts of the phylogenetic classification that are poorly understood. For example, it may be desirable to skip a poorly defined order rank and duplicate a few characters in multiple family terminologies.

Similar to taxonomy-specific standard terminology modules, terminology modules specific to methods or instrumentation could be defined and standardized. The complete terminology for a descriptive project could then be a combination of terminology modules (Fig.93) plus local terminology extensions.

Figure 93. Combining multiple terminologies (“character definitions”) can also be useful to combine characters defined for different methods and add them to the current project as needed.


Figure 92. Example for a hierarchy of terminology modules that follows a taxonomic hierarchy. Additional taxonomic ranks may be present above, below, and in between those depicted here. Note that even species-specific terminology would be conceivable, e.g., to distinguish infraspecific taxa.

188. It is desirable to express the scope of terminology modules relative to the taxonomic hierarchy. Applications may use this information to manage the availability of terminology items for different taxa.
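The following minimal Python sketch illustrates how an application might use such scope information. It is not part of the information model itself: the class `TerminologyModule`, the function `available_terms`, and the example modules and terms are hypothetical names chosen for illustration. A module is offered for a taxon if its scope lies on the lineage of that taxon or if it has no taxonomic scope at all (e.g., a method-specific module).

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class TerminologyModule:
    name: str
    scope: Optional[str]      # taxon the module is designed for; None = no taxonomic scope
    terms: Tuple[str, ...]


def available_terms(modules: Sequence[TerminologyModule],
                    lineage: Sequence[str]) -> List[str]:
    """Collect the terms of all modules whose scope lies on the given lineage."""
    taxa = set(lineage)
    terms: List[str] = []
    for module in modules:
        if module.scope is None or module.scope in taxa:
            for term in module.terms:
                if term not in terms:   # characters duplicated in several modules appear once
                    terms.append(term)
    return terms


# Hypothetical modules; the scope is only a label, not a required hierarchy.
MODULES = [
    TerminologyModule("vegetative identification", "Plantae", ("leaf size", "leaf shape")),
    TerminologyModule("Marsileaceae", "Marsileaceae", ("sporocarp shape", "leaf shape")),
    TerminologyModule("Orchidaceae", "Orchidaceae", ("labellum shape",)),
    TerminologyModule("light microscopy", None, ("spore diameter",)),
]

# Lineage of the taxon being described; intermediate ranks may be skipped.
marsilea = ["Plantae", "Marsileaceae", "Marsilea"]
print(available_terms(MODULES, marsilea))
# ['leaf size', 'leaf shape', 'sporocarp shape', 'spore diameter']
```

Because the lineage is simply a list of taxon names, a poorly defined rank can be left out without affecting the selection, and a character duplicated in several family modules is offered only once.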

Models to support multiple distributed terminologies

Three basic approaches to connect local descriptive data with standardized terminologies can be distinguished:

■ The namespace model, in which the standard terminology resides entirely on the internet and is only referenced in the local terminology. A local cache may be present, but no local changes or extensions are possible (Fig.94, right side).

■ The template model, in which a standard terminology is copied to a local terminology and can then be changed. Provided some kind of identifier remains unchanged in the copy, the identity of origin may then be used for data integration. However, without human control the local changes may substantially change the semantics of the terminology up to the point where data integration is no longer sensible (Fig.94, left side).

■ The declarative model, in which the terminology is defined locally, but the developer declares that the definition of a given term (character, state, etc.) follows a published standard. This may be achieved by citing a standard identifier or reference, version, plus a specific code for each term from the standard.

[Figure 94: use case diagram with the use cases “Use external terminology for a new project”, “Copy external terminology”, “Link to external terminology”, and “Locally store an updatable cache” («extend»).]

Figure 94. External terminology may be copied or linked, the latter optionally with a local cache. Compare also the general use case diagram (Fig.91).

Namespace model: Standard terminologies could reside on multiple servers on the internet and could be used directly from there (Fig.95). This is similar to the use of multiple XML namespaces (with a schemaLocation as a resolution method) in a single XML document. Given that online internet connections may be expensive, unreliable, or even unavailable (e.g., on a notebook in the field), a mechanism to locally cache external terminologies is desirable.

Using a namespace model, a standard terminology module would always be included in its entirety. This may be acceptable if each standard is split into small modules (e.g., separate modules for methods/instrumentation) so that the amount of unnecessary terminology that may confuse users is minimal. Alternatively, the information model could provide a local mechanism that allows defining subset views on the standard terminology.
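As an illustration of the caching and subset-view mechanisms, the following Python sketch resolves a namespace through an injected fetch function and falls back to a read-only local cache when no connection is available. All names (`CachedNamespaceResolver`, `subset_view`, `fake_fetch`, the example URL, and the term codes) are assumptions made for this sketch; a real implementation would fetch the terminology from the service over the network.

```python
from typing import Callable, Dict, Iterable

Terminology = Dict[str, str]   # term code -> definition


class CachedNamespaceResolver:
    """Resolve a terminology namespace online, falling back to a local cache."""

    def __init__(self, fetch: Callable[[str], Terminology]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, Terminology] = {}

    def resolve(self, namespace: str) -> Terminology:
        try:
            module = self._fetch(namespace)
            self._cache[namespace] = dict(module)   # refresh the cache while online
        except ConnectionError:
            if namespace not in self._cache:
                raise                               # never fetched before: nothing to fall back on
            module = self._cache[namespace]
        return dict(module)                         # hand out a copy; the cache is not edited


def subset_view(module: Terminology, wanted: Iterable[str]) -> Terminology:
    """Local view that hides the parts of a standard module a project does not need."""
    return {code: module[code] for code in wanted if code in module}


# Simulated service and connection state (hypothetical data).
NS = "http://example.org/terms/spores"
SERVICE_X = {NS: {"S1": "Spore appendages", "S2": "Spore diameter"}}
connection = {"up": True}


def fake_fetch(namespace: str) -> Terminology:
    if not connection["up"]:
        raise ConnectionError("no internet connection")
    return SERVICE_X[namespace]


resolver = CachedNamespaceResolver(fake_fetch)
online_copy = resolver.resolve(NS)     # fetched from the service and cached
connection["up"] = False               # e.g., on a notebook in the field
offline_copy = resolver.resolve(NS)    # served from the local cache
view = subset_view(offline_copy, ["S2"])
print(view)   # {'S2': 'Spore diameter'}
```

The resolver returns copies and offers no way to edit the cached module, which mirrors the restriction that the namespace model allows no local changes or extensions.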

[Figure 95: diagram showing the terminology services “Service X” and “Service Y”, a “User Project” that reaches them through an internet connection, and a “Local terminology”.]

Figure 95. Network namespace model for federated terminologies. Multiple standardized terminologies are stored on the internet and used directly from there. Only terms not defined in a standard are stored in a local terminology.

The disadvantages of the namespace model are:

■ The standard terminology modules would have to be available before the work on a project begins. It is difficult to combine this model with locally defined terminologies. If local terminologies overlap with only recently developed standard terminologies, the descriptions have to be ported to make use of the new standard terminology. The model itself provides no mechanism to do this gradually.

■ A similar problem may arise if a new version of a standard is published. A new version that is not fully backwards compatible would not replace an existing standard, but would be added as a new namespace. Changing the referenced standard itself is feasible only in very limited circumstances, since any substantial changes would invalidate the descriptions that use the older definitions.

■ Standards could be published only digitally. This may be acceptable if all programs use a common exchange standard format. As long as multiple formats are used, and given the ongoing importance of printed publications as long as digital publications are too unstable to guarantee retrieval at a future date, this is, however, undesirable.

Template model: If one or several standard terminologies are used in a new project, they can be copied from templates that are available from a library of standard terminologies (Fig.96). To trace the definition of a character back to the standard template it originates from, an explicit mechanism such as a Globally Unique Identifier (GUID) is required to remain unchanged in each term.

Once a template is copied into a local terminology, it can be (and usually needs to be) changed. In these cases great care must be taken that the changes do not lead to situations where the human-readable definition of a character or state contradicts the semantics of the original definition in the standard used as a template. The developers of a terminology are ultimately responsible for ensuring that the terminological concepts perceived by users using the terminology for coding and identification remain sufficiently similar to the concepts defined in the standard terminology template.
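A minimal Python sketch of the template model follows. It assumes that each term carries a GUID that survives copying and editing; the names `Term`, `copy_template`, and `edited_terms`, as well as the identifiers and definitions, are hypothetical. The sketch can only report which copied terms differ from the template. Whether an edit still agrees with the semantics of the standard remains a human judgment.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Term:
    guid: str          # stays unchanged when the term is copied or edited
    label: str
    definition: str


def copy_template(template: Dict[str, Term], selected: Iterable[str]) -> Dict[str, Term]:
    """Copy a (possibly partial) selection of a standard template into a project."""
    return {guid: template[guid] for guid in selected}


def edited_terms(local: Dict[str, Term], template: Dict[str, Term]) -> List[str]:
    """GUIDs of copied terms that now differ from the template (candidates for review)."""
    return sorted(g for g, term in local.items() if g in template and term != template[g])


# Hypothetical standard template; the GUIDs are shortened placeholders.
STANDARD_1 = {
    "guid-0001": Term("guid-0001", "Stem hairiness", "Presence of hairs on the stem"),
    "guid-0002": Term("guid-0002", "Leaf shape", "Outline of the leaf blade"),
    "guid-0003": Term("guid-0003", "Spore diameter", "Diameter of the spore in µm"),
}

local = copy_template(STANDARD_1, ["guid-0001", "guid-0002"])
# Local change: the GUID is retained, only the definition is revised.
local["guid-0002"] = replace(local["guid-0002"], definition="Outline of the mature leaf blade")
# Local extension without counterpart in the standard.
local["local-0001"] = Term("local-0001", "Sporocarp shape", "Shape of the sporocarp")

print(edited_terms(local, STANDARD_1))   # ['guid-0002']
```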

[Figure 96: diagram showing “Service X” and “Service Y” offering standard templates (labels “Standard 1”, “Standard 3”, and a truncated label “Standard”), which are copied from the internet, perhaps selecting a subset, into a “User Project” whose terminology consists of parts “selected from standard 1”, “selected from standard 2”, “selected from standard 3”, and “local extensions”.]

Figure 96. Template model for federated terminologies. Several terminology modules are copied from templates and can then be changed similar to local definitions.

Declarative model: In this model any terminology is primarily developed locally. Wherever possible, the developer adds an explicit declaration that the concept of a local character or state conforms to the concept of a character or state in a standard terminology (Fig.97). The declaration should consist of an identifier or a reference for the standard, the version of the standard, and a reference to the individual term. These elements may be combined, so that a single Globally Unique Identifier (GUID) may include the information on the term, the standard it is contained in, and the exact version of the standard. The standard could be identified through a URL, or through text citing a printed publication. The advantages of this approach are:

■ Unlike in the namespace model, external standard terminologies are not required to have a specific format (e.g., SDD).

■ The external standard may even be a conventional, printed publication. Printed and digital standards could exist side by side.

■ A smooth transition of existing data sets towards increasingly standard-conforming data is possible, since the declarations can be made individually for single terms (character, states) rather than being restricted to entire terminologies.

■ The model explicitly supports migrating from existing terminologies to newly developed standard terminologies, or from older to newer standard versions.

Disadvantages are:

■ No automatic discovery mechanism for possible relations to standards is anticipated. The machine-readable data-integration mechanism depends entirely on human comparison of local and standard concepts.

■ Developing a local terminology for a given group involves substantial work and often many revisions to correct for initial errors in the terminology.

These points can be addressed by combining the declarative model with a template model, copying a ready-to-use terminology module, but maintaining the publicly visible declarative reference. If the designers of the terminology detect that they are changing a term in a way that the local and the standard concepts differ, they may remove the declarative reference to indicate this.

[Figure 97: diagram showing standards (labels “Standard 1”, “Standard 3”, and a truncated label “Standard”) that are referenced from the “Local Terminology” of a “User Project”.]

Figure 97. In the declarative model each term of the local terminology contains, among other data elements, an optional reference to a standard terminology. This reference is set by the designers of the terminology to declare that the local concept is identical with the concept in the standard. The standard may be available directly in digital format, or may be published in a printed publication. If the declarative model is combined with a template model, the reference will already be set for those parts copied from a standard template.

189. Terms in the terminology modules should be identified by GUIDs.

190. It is desirable that the relation between locally defined terms and external standard terminology modules can be expressed through GUIDs. The relation may be of several kinds: e.g., “copied from template”, believed to be “similar” or “essentially identical”.
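One possible way to represent such declarations is sketched below in Python. The classes `Declaration` and `LocalTerm`, the function `integrable_pairs`, and all example data are hypothetical. The sketch also assumes that only the relations “copied from template” and “essentially identical” justify automatic data integration, whereas “similar” does not; this threshold is a design choice of the sketch, not a rule stated in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Declaration:
    standard: str     # URL of a digital standard or citation of a printed one
    version: str
    term_code: str    # code of the individual term within the standard
    relation: str     # e.g. "copied from template", "similar", "essentially identical"

    def key(self) -> Tuple[str, str, str]:
        """Combined identifier: standard, version, and term together."""
        return (self.standard, self.version, self.term_code)


@dataclass(frozen=True)
class LocalTerm:
    label: str
    declaration: Optional[Declaration] = None   # the reference is optional


ACCEPTED = ("copied from template", "essentially identical")   # assumption of this sketch


def integrable_pairs(a: Sequence[LocalTerm], b: Sequence[LocalTerm]) -> List[Tuple[str, str]]:
    """Pair local terms from two projects that declare the same standard term."""
    index: Dict[Tuple[str, str, str], List[str]] = {}
    for term in b:
        if term.declaration and term.declaration.relation in ACCEPTED:
            index.setdefault(term.declaration.key(), []).append(term.label)
    pairs = []
    for term in a:
        if term.declaration and term.declaration.relation in ACCEPTED:
            for other in index.get(term.declaration.key(), []):
                pairs.append((term.label, other))
    return pairs


STD = "Example Plant Glossary (hypothetical printed standard)"
site_a = [
    LocalTerm("Stem hairy", Declaration(STD, "1.0", "C12", "essentially identical")),
    LocalTerm("Leaf outline", Declaration(STD, "1.0", "C20", "similar")),
    LocalTerm("Odd local character"),
]
site_b = [
    LocalTerm("Hairiness of stem", Declaration(STD, "1.0", "C12", "copied from template")),
    LocalTerm("Leaf shape", Declaration(STD, "1.0", "C20", "essentially identical")),
    LocalTerm("Stem hairs (old)", Declaration(STD, "0.9", "C12", "essentially identical")),
]

print(integrable_pairs(site_a, site_b))   # [('Stem hairy', 'Hairiness of stem')]
```

Because the version is part of the key, terms declared against different versions of a standard are not paired automatically; migrating a declaration to a newer version stays an explicit step.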

Conclusions

A desirable solution that is both flexible and efficient seems to be a combination of the declarative and the template model. This may allow the evolutionary term-by-term mapping of existing, locally defined terminologies towards standardized and shared terminologies, while at the same time profiting, for new terms, from the work invested in standard terminologies. New terms could be used as templates and would (in addition to including many presentational or assumptional elements) already contain the declarative information required to map to a standard, making the laborious and error-prone ontology-mapping process unnecessary.

By referring to published and standardized core terminologies it will be possible to create federated descriptive data collections, where multiple independent sites store descriptions that can be compared or integrated. The use of Globally Unique Identifiers (GUIDs) even allows one to directly join terminologies based on ID identity, requiring no online access to the standard terminologies to which they ultimately refer (Fig.98).

In the future it is to be hoped that a large library of reusable and tested terminology modules for a wide variety of biological groups and methods will become available. Not all such modules need to be declared a “standard”; becoming a standard could be an evolutionary process of demand and acceptance. Researchers starting to develop descriptive data sets for groups of organisms where no “standards” exist could still use previous work as a template and revise existing definitions rather than start from scratch.

To the author's knowledge no library of terminology definitions exists so far. Even for the DELTA standard only a handful of reusable character definitions can be found on the web, since most “DELTA” data are actually the binary, encoded Intkey data usable only for identification but not as a template for further character development.

[Figure 98: diagram showing “Service X” and “Service Y”, each with a “Local Terminology” and “Descriptions using project terminology”, combined into a “Joint Terminology” and “Consensus descriptions”.]

Figure 98. Consensus terminology created by a join of multiple terminologies from multiple sites on the internet. The descriptions can then be used and queried across database borders. The join shown is an outer join, so that no descriptors are dropped. Only the matching descriptors can be used together. Alternatively, the terminology could be reduced to matching descriptors (inner join).
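The join by GUID identity can be sketched in a few lines of Python. The function names `join_terminologies` and `matching_descriptors` and the example data are hypothetical; each site's terminology is reduced here to a mapping from GUID to local label. No access to the standard itself is needed, only equality of identifiers.

```python
from typing import Dict, List, Optional, Tuple

Joined = Dict[str, Tuple[Optional[str], Optional[str]]]


def join_terminologies(site_x: Dict[str, str], site_y: Dict[str, str],
                       how: str = "outer") -> Joined:
    """Join two terminologies on GUID identity (outer or inner join)."""
    if how == "outer":
        guids = site_x.keys() | site_y.keys()    # no descriptor is dropped
    elif how == "inner":
        guids = site_x.keys() & site_y.keys()    # reduced to matching descriptors
    else:
        raise ValueError("how must be 'outer' or 'inner'")
    return {g: (site_x.get(g), site_y.get(g)) for g in sorted(guids)}


def matching_descriptors(joined: Joined) -> List[str]:
    """GUIDs present at both sites; only these can be used together."""
    return [g for g, (x, y) in joined.items() if x is not None and y is not None]


# Hypothetical terminologies of two sites: GUID -> local label.
site_x = {"guid-0001": "Stem hairiness", "guid-0002": "Leaf shape"}
site_y = {"guid-0002": "Leaf outline", "guid-0003": "Spore diameter"}

outer = join_terminologies(site_x, site_y, how="outer")
inner = join_terminologies(site_x, site_y, how="inner")
print(len(outer), matching_descriptors(outer))   # 3 ['guid-0002']
print(inner)   # {'guid-0002': ('Leaf shape', 'Leaf outline')}
```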

4.14. Modifiers

Introduction

Composition (part-of) and generalization (kind-of) ontologies for object parts (structures), properties, and measurement methods have been discussed in the previous sections (p.131ff). When studying an actual example, however, it appears that several aspects of descriptive data are inadequately covered by these ontologies:

“Subgenus Myrmeurynota FOREL. Pronotum very broad, with a lateral, lamelliform margin, often vaulted. Thorax rapidly narrowing behind. Epinotum very narrow at its sloping face, which often has a peculiar appendage. Gaster broad, short, and small, sometimes more or less spherical. Probably arboreal.” (Ant descriptions by Wheeler, example provided by R. Morris, potential modifiers emphasized.)

These terms may be called modifiers of frequency, certainty, degree, and location. As discussed above (see, e.g., Fig.89, p.179), descriptive information often may variously be placed in terminology or descriptions. The options available for handling modifiers are:

■ Modifiers are embedded in the character or state terminology.

■ Modifiers are freely added to the descriptions as unconstrained text annotations (“comments”).

■ Modifiers are added to descriptions in a structured form, constrained by and referring to a separate mechanism in the terminology.

In classical DELTA (p.19), only the first two options are available (Fig.99 top). The second option is impractical for frequency or certainty modifiers, if the data are intended for identification or analytical purposes (is the stem of a specimen “frequently hairy”, “rarely hairy”, “probably hairy”, or “perhaps hairy”?). In contrast, modifiers of degree may be embedded in states (for a character “Stem (hairiness)” the states could be: “not hairy”, “hairy”, “slightly hairy”, “strongly hairy”, and “hairy at the tip”). However, this is often unsatisfactory, causing an inflation of character states that have complex relationships among each other. Consequently, in DELTA data sets the preferred method of expressing modifier information is the use of free-form text comments. The problems with doing so are:

■ Important information is not accessible to machine reasoning (except perhaps by sophisticated natural language processing). For example, for identification processes, frequency, certainty, and misinterpretation information is relevant but difficult to obtain from unconstrained text.

■ Interpreting the use of similar, perhaps synonymous modifier phrases is difficult. For example, “often” and “usually” may express the same or different frequency concepts.

■ When creating multilingual data sets, allowing people from different cultures to collaborate, comments must be translated in each description rather than a single time in the terminology. Treating modifier information as free-form text comments drastically increases the number of comments, causing a heavy translation burden (typically 60-90% of free-form comments are modifier-related; unpublished analyses of DELTA data sets).

[Figure 99: two diagrams. Top, “DELTA model”: the “Descriptive Terminology” contains “Character definition” and “Categorical state def.”; the “Object Descriptions” contain “Char. ref.”, “Char. state ref.”, and “Free-form text annotation”. Bottom, “DeltaAccess/SDD model”: the “Descriptive Terminology” additionally contains “Modifier definition” (modifiers grouped into modifier sets); the “Object Descriptions” contain “Char. ref.”, “Char. state ref.”, “Modifier ref.”, and “Free-form text”.]

Figure 99. Simplified comparison of DELTA-like and DiversityDescriptions/SDD models in regard to modifiers and free-form text annotations.

The option to add structured modifier information (constrained by a separate modifier terminology) is not available in classical DELTA but has since been proposed in various descriptive information models (see below). Such an additional, independent dimension of terminology can strongly simplify a scoring scheme (Fig.100). Structured modifiers (as provided in DiversityDescriptions and SDD, Fig.99 bottom) have many advantages:

■ Modifiers provide flexibility in the level of detail that is recorded. They decouple the level of detail imposed by the character definition from levels of detail provided in data or chosen for various analysis purposes.

□ During identification, a coarse view of descriptive data that initially ignores modifier information is often desirable, to concentrate on major issues.

□ In data analysis one may choose between a coarse analytical treatment (ignoring frequency, uncertainty, and minor modifications of degree or location), and a detailed analysis where values with different modifiers are considered to be different. In conventional DELTA this decision must be made when the character definition is elaborated.

■ Machine-readable frequency and certainty information can be evaluated by identification algorithms.

■ When aggregating information (e.g., from specimens to species, or species to genus descriptions), modifiers can generally be handled better than free-form text comments. Without NLP methods, comments can only be concatenated, whereas modifier identity or differences can be analyzed and appropriately aggregated.

■ By reducing the number of free-form text comments, modifiers simplify the translation of a data set. Like characters and states, modifiers must be translated only once for all descriptions, and not per description like comments in the description.

■ If the designer of the character and modifier terminology has options to impose (constraining the validity of data) or recommend (accepting or ignoring recommendations has no impact on the validity of data) associations between characters and modifiers, additional benefits arise. For example, it is possible to provide concise modifier pick lists in the data entry user interface, containing only modifiers applicable to the current character.
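The difference between imposed and recommended associations can be illustrated with a small Python sketch. The classes `ModifierSet` and `Character`, the functions `pick_list` and `is_valid`, and the example modifier sets are hypothetical and serve only as an illustration. An imposed association restricts which modifiers are valid for a character, whereas a recommended association only shapes the pick list.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ModifierSet:
    name: str
    modifiers: Tuple[str, ...]


@dataclass(frozen=True)
class Character:
    name: str
    modifier_sets: Tuple[ModifierSet, ...] = ()
    imposed: bool = False   # True: constrains validity; False: recommendation only


def pick_list(character: Character) -> List[str]:
    """Modifiers offered in the data entry interface for this character."""
    return [m for mset in character.modifier_sets for m in mset.modifiers]


def is_valid(character: Character, modifier: str) -> bool:
    """Only imposed associations affect the validity of scored data."""
    if character.imposed:
        return modifier in pick_list(character)
    return True


FREQUENCY = ModifierSet("frequency", ("rarely", "usually", "always"))
DEGREE = ModifierSet("degree", ("slightly", "strongly"))

presence = Character("Appendage presence", (FREQUENCY,), imposed=True)
tip = Character("Appendage tip", (DEGREE,), imposed=False)

print(pick_list(presence))               # ['rarely', 'usually', 'always']
print(is_valid(presence, "strongly"))    # False: outside the imposed set
print(is_valid(tip, "usually"))          # True: the association is only recommended
```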

[No separate modifiers available] (e.g., as in DELTA)

Spore appendages
  Appendage presence/frequency
      1. without appendages
      2. rarely with appendages
    x 3. usually with appendages
      4. always with appendages
  Diameter            [   ] (µm)
  Diameter at base    [1.4] (µm)
  Diameter at middle  [   ] (µm)
  Appendage tip
      1. blunt or rounded
      2. pointed
    x 3. strongly pointed

[Partly delegated to modifiers] (e.g., as in DiversityDescriptions)

Spore appendages
  Appendage presence
    x [usually ] 1. present
      [        ] 2. absent
  Diameter [at base] [1.4] (µm)
  Appendage tip
      [        ] 1. blunt or rounded
    x [strongly] 2. pointed

Figure 100. Excerpt from a scoring scheme for fungal spores. The first scheme (left side of the original figure) illustrates several cases that can be simplified with the introduction of modifiers (second scheme, right side of the original figure).
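To show how structured modifiers decouple the level of detail from the character definition, the following Python sketch stores each categorical score as a triple of character, state, and optional modifier, and compares two descriptions either coarsely (ignoring modifiers) or in detail. The type alias `Score`, the function `view`, and the two example descriptions are hypothetical; quantitative characters such as the diameter are left out for brevity.

```python
from typing import List, Optional, Set, Tuple

Score = Tuple[str, str, Optional[str]]   # (character, state, modifier)


def view(scores: List[Score], detailed: bool) -> Set[Score]:
    """Return the scores with modifiers kept (detailed) or dropped (coarse)."""
    return {(char, state, mod if detailed else None) for char, state, mod in scores}


# Scored with separate modifiers, as in the second scheme of Fig.100.
specimen_a: List[Score] = [
    ("Appendage presence", "present", "usually"),
    ("Appendage tip", "pointed", "strongly"),
]
specimen_b: List[Score] = [
    ("Appendage presence", "present", "rarely"),
    ("Appendage tip", "pointed", None),
]

coarse_equal = view(specimen_a, detailed=False) == view(specimen_b, detailed=False)
detailed_equal = view(specimen_a, detailed=True) == view(specimen_b, detailed=True)
print(coarse_equal, detailed_equal)   # True False
```

The choice between the coarse and the detailed treatment is made at analysis time, not when the character definition is elaborated.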

Definition

The term “modifier” is used in natural language and has a specific meaning in grammar (Table 45). Both senses are in agreement with the usage proposed here in descriptive data. “Qualifier” is approximately synonymous with modifier and could be used instead of “modifier”. The different use of qualifier in UML is clearly very specific and would cause no confusion (Table 45). However, no advantage of “qualifier” over “modifier” can be seen either. According to CED (1992), “modifier” is the preferred term for the grammatical concept, and it has been used in descriptive information models for some time now (Hagedorn 1997).

As shown in the definitions, a modifier may either be a noun in a composite noun or an adjectival or adverbial word or phrase. In natural language, many character states are expressed as adjectives of the objects being described (exceptions are “kind-of” states, such as: fruit = capsule, berry, nutlet, etc.). As a consequence, in English many modifiers take the form of adverbial

From the document: Structuring Descriptive Data of Organisms: Requirement Analysis and Information Models (pages 183-192)