
Boolean operators between states of categorical characters

In principle, descriptive statements may involve Boolean operators such as ‘and’, ‘or’, ‘xor’, and ‘not’ (where ‘A or B’ is true for ‘A’ alone, ‘B’ alone, or ‘A and B’, whereas ‘A xor B’ is true for ‘A’ alone or ‘B’ alone, but not for ‘A and B’). Using a syntax modeled after a combination of SDD and MathML (Carlisle & al. 2003), one may write, e.g.,

for “petal color: red or orange or yellow”:

<char id="petal color">
  <apply>
    <or/>
    <state id="red"/>
    <state id="orange"/>
    <state id="yellow"/>
  </apply>
</char>

and for “petal color: red and (orange or yellow)”:

<char id="petal color">
  <apply>
    <and/>
    <state id="red"/>
    <apply>
      <or/>
      <state id="orange"/>
      <state id="yellow"/>
    </apply>
  </apply>
</char>
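As a minimal sketch of how such nested expressions could be evaluated, the following Python fragment parses the second example and tests it against a set of observed states. The function name `evaluate` and the treatment of ‘xor’ with more than two operands (true for an odd number of true operands) are assumptions made for illustration; they are not part of SDD.

```python
import xml.etree.ElementTree as ET

def evaluate(node, observed):
    """Evaluate a <state/> or <apply> element against a set of observed state ids."""
    if node.tag == "state":
        return node.get("id") in observed
    if node.tag != "apply":
        raise ValueError(f"unexpected element: {node.tag}")
    # First child is the operator, the remaining children are its operands.
    operator, *operands = list(node)
    values = [evaluate(child, observed) for child in operands]
    if operator.tag == "and":
        return all(values)
    if operator.tag == "or":
        return any(values)
    if operator.tag == "xor":
        # Assumption: for more than two operands, true if an odd number are true.
        return sum(values) % 2 == 1
    if operator.tag == "not":
        return not values[0]
    raise ValueError(f"unknown operator: {operator.tag}")

# "petal color: red and (orange or yellow)"
EXPR = """
<char id="petal color">
  <apply>
    <and/>
    <state id="red"/>
    <apply>
      <or/>
      <state id="orange"/>
      <state id="yellow"/>
    </apply>
  </apply>
</char>
"""

expression = ET.fromstring(EXPR)[0]  # the outer <apply> element
print(evaluate(expression, {"red", "yellow"}))     # True
print(evaluate(expression, {"orange", "yellow"}))  # False
```

The sketch deliberately says nothing about which object the observed states belong to; that question is taken up in the text that follows.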

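Table 21. Example data to test ‘and’ and ‘or’ statements on different aggregation levels (compare text). The columns Species, Individual, Flower, and Single petal give the aggregation level; Color and Shape give the values.

| Species | Individual | Flower | Single petal | Color | Shape |
|---|---|---|---|---|---|
| S1 | I1 | F1 | P01 | red | round |
| S1 | I1 | F1 | P01 | orange | round |
| S1 | I1 | F1 | P02 | red | round |
| S1 | I1 | F1 | P03 | orange | elliptical |
| S1 | I1 | F2 | P04 | red | round |
| S1 | I1 | F2 | P05 | red | round |
| S1 | I1 | F2 | P06 | red | round |
| S1 | I2 | F3 | P07 | yellow | elliptical |
| S1 | I2 | F3 | P08 | orange | round |
| S1 | I2 | F3 | P09 | orange | round |
| S1 | I2 | F3 | P10 | orange | round |
| S1 | I2 | F4 | P11 | orange | round |
| S1 | I2 | F4 | P12 | orange | round |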

A major problem, however, is that the semantics of the Boolean operators ‘and’ and ‘or’ between propositions is ill-defined if multiple aggregation levels are present and the level to which a statement should apply is not defined. Since the set of states for species S1 from Table 21 contains all three color states, descriptions may be:

■ “Species S1 has red, or orange, or yellow petals”, or

■ “Species S1 has red, and orange, and yellow petals”.

In practice, however, the second expression is misleading. Since a species is a class of individuals that cannot be directly observed in nature, most biologists would assume that the implied meaning is:

■ “Individuals of species S1 have red, and orange, and yellow petals”,

which is not true. A true statement would be:

■ “Individuals of species S1 have red, and orange, or orange and yellow petals”.

The closely related statement:

■ “Petals of individuals of species S1 are red, and orange, or orange and yellow”

could be interpreted to mean that each petal must satisfy this condition, rather than that each individual has at least some petals that satisfy it. The corresponding true statement would be:

■ “Some petals in species S1 are red and orange, others only red, or only orange, or only yellow”.

Similar care is necessary when using:

■ “Flowers of species S1 have red and orange, or red, or orange, or yellow petals”.

This statement is true, but it remains ambiguous whether the variation is within flowers or between flowers. A more exact statement might be:

■ “Each flower of species S1 has either (red and orange) petals and red petals and orange petals, or only red petals, or yellow petals and orange petals or only orange petals.”

Such statements are normally not made in natural language descriptions and their meaning is very difficult to decode. As a consequence, much available data is ambiguous. Designers of information models that attempt to model the exact situation would either need to decide that their models can only be used for new data, or they would need to implement methods to deal with the ambiguity that is present in current data. The Prometheus model follows the latter path and, by requiring object parts to be explicitly built and cloned, describes them individually. This allows a much greater expressivity with regard to Boolean operators.
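The dependence on the aggregation level can be made concrete with the data of Table 21. The following Python sketch groups the observed colors by level and tests the ‘and’ reading of “red, and orange, and yellow” at each level. The helper names `colors_by` and `holds_for_every` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict

# Rows of Table 21: (species, individual, flower, petal, color, shape)
ROWS = [
    ("S1", "I1", "F1", "P01", "red", "round"),
    ("S1", "I1", "F1", "P01", "orange", "round"),
    ("S1", "I1", "F1", "P02", "red", "round"),
    ("S1", "I1", "F1", "P03", "orange", "elliptical"),
    ("S1", "I1", "F2", "P04", "red", "round"),
    ("S1", "I1", "F2", "P05", "red", "round"),
    ("S1", "I1", "F2", "P06", "red", "round"),
    ("S1", "I2", "F3", "P07", "yellow", "elliptical"),
    ("S1", "I2", "F3", "P08", "orange", "round"),
    ("S1", "I2", "F3", "P09", "orange", "round"),
    ("S1", "I2", "F3", "P10", "orange", "round"),
    ("S1", "I2", "F4", "P11", "orange", "round"),
    ("S1", "I2", "F4", "P12", "orange", "round"),
]

# Column index of each aggregation level within a row.
LEVELS = {"species": 0, "individual": 1, "flower": 2, "petal": 3}

def colors_by(level):
    """Map each object at the given level to the set of colors observed in it."""
    groups = defaultdict(set)
    for row in ROWS:
        groups[row[LEVELS[level]]].add(row[4])
    return dict(groups)

def holds_for_every(level, required):
    """'and' reading: every object at `level` shows all colors in `required`."""
    return all(required <= colors for colors in colors_by(level).values())

all_three = {"red", "orange", "yellow"}
print(holds_for_every("species", all_three))     # True
print(holds_for_every("individual", all_three))  # False
print(colors_by("individual"))
print(colors_by("flower"))
```

The same ‘and’ statement is thus true for the species as a whole but false for its individuals, which is the ambiguity discussed above.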

Existing DELTA-based software attempts to avoid the complexity of this problem. It allows a combination of states within a single character (i.e., a property of the entire object or a part of it) with either ‘or’ or ‘and’, and always assumes an ‘and’-relation between characters. However, for comparison and data-retrieval purposes during identification, the ‘and’ operator is treated as identical to ‘or’. The distinction is used only when natural language wordings for human users are to be generated. By this method, DELTA avoids the need to specify the intended aggregation level exactly, and allows both statements that are intended for individuals and statements intended for the species level. The implied level is usually supplied by the background knowledge of the human reading the natural language description. The statement:

■ “Petals yellow or orange”

may mean that individual petals are both yellow and orange, but also that each flower has yellow or orange petals, or some flowers in each individual, or that some populations have yellow, others orange flowers. In contrast, when reading:

■ “Petals yellow and orange”

probably a level of individual organism, individual inflorescence, or individual flower is implied.

However, even when ‘and’ is used, the individual level cannot always be assumed.

Whether reading:

■ “Occurring in Germany or France”, versus

■ “Occurring in Germany and France”

humans implicitly provide the knowledge that this statement indeed is meant to be expressed on the species level, since it is unlikely that the species only contains individuals growing on the exact border of the two countries. Most humans would consider the two statements to express the same semantics. Some people would consider the or-ed statement more natural, visualizing a species description containing several statements. Most biologists, however, would consider it more natural to use the ‘and’-wording, visualizing multiple individuals (or perhaps dots on a map). The importance of implied semantics becomes even clearer when contrasting the distribution example above with (a) a property that can possibly occur only a single time in an individual, such as size:

■ “size 10 cm or 11 cm”, versus

■ “size 10 cm and 11 cm”,

and (b) a property where it is ambiguous whether it refers to multiple parts of the individual or to multiple individuals in a species:

■ “roots branched or not”, versus

■ “roots branched and not branched”,

and finally (c) a property where it is intuitive that it refers to different parts of the individual:

■ “leaves opposite or alternate”, versus

■ “leaves opposite and alternate”.

Clearly the preferred Boolean operator in the case of distributions (“occurring in Germany and France”) is rejected as nonsense in the size example (“size 10 cm and 11 cm”). The choice of operator changes the perspective in the root example (“roots branched and not branched” probably implies that the plant has several root systems and that the term “root” refers to the major branches rather than to the entire root system). Finally, in the leaf example the ‘and’ wording may be accepted and considered to imply that in each individual plant both opposite and alternate leaves occur.

One result of this is that the semantics of Boolean operators in legacy data (e.g., printed natural language descriptions or printed keys) is often ambiguous and requires interpretation. The information model should therefore ideally support lack of semantics as well as interpreted or original semantics.

In SDD (p. 20) the model descriptor for states in a single categorical character data element may be changed from the default (“OrSet”) to other values; see Table 22.

Table 22. Values of the StateCollectionModelEnum in SDD (used in SummaryData/Categorical/@statemodel)

| Value | Label | Description or Examples |
|---|---|---|
| OrSet | Unordered set of states, combined with 'or' | Multiple states scored for a character in a description form a set. The order of states has no special meaning and may be changed. In natural language output the states should be combined with 'or' to express that in individual objects (that belong to the class that is being described), the states may occur together or alone. |
| OrSeq | Ordered sequence of states, combined with 'or' | Multiple states scored for a character in a description form a sequence, i.e., the state order carries some semantics and should be preserved in output. The precise semantics of the sequence is not explicitly defined, but assumed to be intelligible to human consumers; presumably relating to concepts of relevance or importance. In natural language output the states should be combined with 'or' to express that in individual objects (that belong to the class that is being described), the states may occur together or alone. |
| AndSet | Unordered set of states, combined with 'and' | Multiple states scored for a character in a description form a set. The order of states has no special meaning and may be changed. In natural language output the states should be combined with 'and' to express that (in any individual object that belongs to the class that is being described) the states will always occur together. Example: two colors that occur together in a pattern. |
| AndSeq | Ordered sequence of states, combined with 'and' | Multiple states scored for a character in a description form a sequence, i.e., the state order carries some semantics and should be preserved in output. The sequence semantics is not explicitly defined, but intelligible to human consumers and presumably relates to some concept of relevance or importance. In natural language output the states should be combined with 'and' to express that (in any individual object that belongs to the class that is being described) the states will always occur together. Example: a black part with small red markings is more appropriately described as 'black and red' than 'red and black'. |
| WithSeq | Primary together with secondary states | This is a special case of AndSeq, and in many circumstances (except natural language generation) may be treated as AndSeq. Example: "Green with brown" (often this may be two characters, e.g., base color and dot color). All states except for the first are considered secondary. |
| Between | Intermediate value between states | True value lying intermediate between (usually two) states. Example: "Between oval and elliptic" = "Oval to elliptic". |
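How the values of Table 22 might steer natural language generation can be sketched as follows. The enum mirrors the six values named in the table; the function `render` and the exact wordings it produces are assumptions for illustration only and are not prescribed by SDD.

```python
from enum import Enum

class StateCollectionModel(Enum):
    OR_SET = "OrSet"
    OR_SEQ = "OrSeq"
    AND_SET = "AndSet"
    AND_SEQ = "AndSeq"
    WITH_SEQ = "WithSeq"
    BETWEEN = "Between"

def render(states, model):
    """Combine state labels into a phrase according to the state collection model."""
    states = list(states)
    if not states:
        return ""
    if model in (StateCollectionModel.OR_SET, StateCollectionModel.AND_SET):
        # Order carries no meaning for sets; sort only to get stable output.
        states = sorted(states)
    if len(states) == 1:
        return states[0]
    if model is StateCollectionModel.WITH_SEQ:
        # The first state is primary, all others are secondary.
        return f"{states[0]} with {' and '.join(states[1:])}"
    if model is StateCollectionModel.BETWEEN:
        return "between " + " and ".join(states)
    or_models = (StateCollectionModel.OR_SET, StateCollectionModel.OR_SEQ)
    word = "or" if model in or_models else "and"
    return f" {word} ".join(states)

print(render(["black", "red"], StateCollectionModel.AND_SEQ))      # black and red
print(render(["green", "brown"], StateCollectionModel.WITH_SEQ))   # green with brown
print(render(["oval", "elliptic"], StateCollectionModel.BETWEEN))  # between oval and elliptic
print(render(["yellow", "orange"], StateCollectionModel("OrSet"))) # orange or yellow
```

Only the two sequence models and WithSeq depend on the input order; for the set models the sketch reorders the states freely, as the table permits.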

90. Boolean operators connecting descriptive statements that refer to the same property are problematic because they interact with implied semantics (knowledge whether an object part is repeated or not), and the customary data representation of a property.

91. The semantics of ‘and’ or ‘or’ in natural language descriptions or in DELTA data sets is often ambiguous. It may be desirable to be able to distinguish in the information model between an “ambiguous or” in the sense of one of ‘and’, ‘or’, and ‘xor’, and an ‘or’ defined in the sense of Boolean logic.
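One way to picture the distinction drawn in point 91 is to let an “ambiguous or” yield the set of truth values it could take under each of its possible readings. The following Python sketch is purely illustrative; the names `Operator` and `possible_values` are hypothetical and do not come from SDD or DELTA.

```python
from enum import Enum

class Operator(Enum):
    AND = "and"
    OR = "or"
    XOR = "xor"
    AMBIGUOUS_OR = "ambiguous or"  # stands for one of 'and', 'or', 'xor'

def _strict(op, a, b):
    """Evaluate one of the three well-defined Boolean operators."""
    if op is Operator.AND:
        return a and b
    if op is Operator.OR:
        return a or b
    if op is Operator.XOR:
        return a != b
    raise ValueError(f"not a strict operator: {op}")

def possible_values(op, a, b):
    """Return the set of truth values the statement may take."""
    if op is Operator.AMBIGUOUS_OR:
        candidates = (Operator.AND, Operator.OR, Operator.XOR)
    else:
        candidates = (op,)
    return {_strict(candidate, a, b) for candidate in candidates}

print(possible_values(Operator.OR, True, True))             # {True}
print(possible_values(Operator.AMBIGUOUS_OR, False, False)) # {False}
print(possible_values(Operator.AMBIGUOUS_OR, True, True) == {True, False})  # True
```

Under these assumptions an “ambiguous or” gives a definite answer only when neither state is present; in every other case both outcomes remain possible and the statement requires interpretation.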