
Deep-Level Interpretation of Text

2.4 Interpretation for Deriving Content Descriptions

2.4.1 An Example Domain Ontology

After having described the characteristics of SLI and the input requirements of DLI, we are able to introduce an example domain ontology that can be used by these two processes.

A domain in knowledge representation [RN03a] is a part of the world about which we wish to express some knowledge, e.g., medicine, engineering, law, or athletics. Accordingly, a domain ontology provides not only a vocabulary (S) to name the concepts and roles that are specific to a domain, but also a terminology (T) to specify the semantics of that vocabulary. Appendix A shows an excerpt of the Athletics Event Ontology (AEO)² [DEDT07], which is used to illustrate examples throughout this work. A detailed explanation of the design of the AEO ontology is provided in Section 3.2. For now, only some of its characteristics are described.

As stated in [DEDT07], the goal of ontological engineering is the construction of a consistent knowledge base that efficiently serves the requirements of an application context.

The design of the AEO ontology follows specific design patterns, such that it serves the requirements of the interpretation process described in Section 2.5.

² Downloadable from: http://www.boemie.org/ontologies


‘13 August 2002 - Helsinki. Russia’s newly crowned European champion Jaroslav Rybakov won the high jump with 2.29 m. Oskari Fronensis from Finland cleared 2.26 and won silver.’

Figure 2.8: Text excerpt with relevant information from the athletics domain.

Since the goal of SLI and DLI is to support the automatic annotation of media, the domain ontology should include the following elements:

• A signature that provides the required vocabulary of the domain of interest by means of atomic concept descriptions and atomic role descriptions used by SLI and DLI processes.

• A Tbox containing GCIs involving complex concept descriptions and role descriptions useful to represent the semantics of more abstract information identified by the DLI process.

Note that the Abox part of the ontology is considered empty, since it is not used by the SLI and DLI processes. Instead, these processes use Aboxes that describe the content of specific media objects. The elements above specify the extraction requirements imposed on the SLI and DLI processes. In other words, the SLI process should be able to find media segments whose information can be formally represented with the atomic concept and role descriptions of the domain ontology. The DLI process should be able to interpret the SLI results by reasoning over rules and GCIs in order to extract additional information with more abstract content semantics.

To determine the signature of a domain ontology, the ontology engineer should observe the content of a media corpus. The ontology engineer starts by identifying relevant domain-related facts found explicitly in media content, and then works out the more abstract information induced by those facts. Explicit facts help to define names for atomic concept descriptions. For example, Figure 2.8 shows an excerpt from a document about athletics news. The underlined words highlight the information that is commonly found explicitly in athletics news. Concept names such as Date, PersonName, Performance, CountryName, etc., are relevant. In this way, the signature can have the following concept names.

\[ (CN)_{\mathit{Athletics}} = \{\mathit{Date}, \mathit{PersonName}, \mathit{Performance}, \mathit{CountryName}\} \]

After identifying the relevant explicit facts, more abstract information should be identified.

Abstract information is composed of attributes represented by the explicit facts. For example, consider again the text in Figure 2.8 and the concept names above. PersonName and CountryName can be modeled as attributes of a more abstract concept such as Athlete, where an athlete is a person who has a name and a nationality. Thus, the concept names Athlete and Person and the role names hasName and hasNationality are identified and added to the signature.

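\[ (CN)_{\mathit{Athletics}} = \{\mathit{Date}, \mathit{PersonName}, \mathit{Performance}, \mathit{CountryName}, \mathit{Athlete}, \mathit{Person}\} \]

\[ (RN)_{\mathit{Athletics}} = \{\mathit{hasName}, \mathit{hasNationality}\} \]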

As will be explained in Section 2.5, the DLI process requires concept and role assertions as input. Thus, SLI should also recognize relations between media segments. For example, consider again Figure 2.8 (page 28). SLI should be able to extract the relations of the segment containing the string “Oskari Fronensis” with the segments containing the strings “Finland” and “2.26”. To express such relations, role names such as personNameToCountryName and personNameToPerformance are required. These role names are generic names that express an association between two attributes of a more complex object. Notice that they are more specific than a role name such as associatedWith. More specific role names, such as personNameToPerformance, provide a hint about the domain and range restrictions that should also be modeled in the ontology.

Domain and range restrictions contribute to the performance of the DLI process. In this way the signature is augmented as follows.

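\[ (CN)_{\mathit{Athletics}} = \{\mathit{Date}, \mathit{PersonName}, \mathit{Performance}, \mathit{CountryName}, \mathit{Athlete}, \mathit{Person}\} \]

\[ (RN)_{\mathit{Athletics}} = \{\mathit{hasName}, \mathit{hasNationality}, \mathit{personNameToCountryName}, \mathit{personNameToPerformance}\} \]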
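As an illustration only, the signature at this stage and the assertions that SLI is expected to deliver for the second sentence of Figure 2.8 can be sketched in Python. The individual names (`pname2`, `country2`, `perf2`) and the helper `unknown_names` are hypothetical and are not part of the AEO ontology or of any SLI tool; the sketch merely checks that the assertions use only names from the signature.

```python
# Signature of the example ontology, copied from the sets above.
CONCEPT_NAMES = {"Date", "PersonName", "Performance", "CountryName",
                 "Athlete", "Person"}
ROLE_NAMES = {"hasName", "hasNationality",
              "personNameToCountryName", "personNameToPerformance"}

# Hypothetical SLI output for the second sentence of Figure 2.8.
# The individual names are made up for illustration.
concept_assertions = [
    ("pname2", "PersonName"),     # segment "Oskari Fronensis"
    ("country2", "CountryName"),  # segment "Finland"
    ("perf2", "Performance"),     # segment "2.26"
]
role_assertions = [
    ("pname2", "personNameToCountryName", "country2"),
    ("pname2", "personNameToPerformance", "perf2"),
]


def unknown_names(concept_assertions, role_assertions):
    """Return all concept and role names that are not in the signature."""
    used_concepts = {concept for _, concept in concept_assertions}
    used_roles = {role for _, role, _ in role_assertions}
    return (used_concepts - CONCEPT_NAMES) | (used_roles - ROLE_NAMES)
```

An empty result from `unknown_names` means that every assertion can be expressed with the vocabulary of the signature; a name such as `Stadium`, which is not in the signature, would be reported.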

As Section 2.6 describes, the DLI process can also be used to interpret image content as long as the input is a set of concept and role assertions. Thus, provided an SLI process is able to extract information from image content about objects and the spatial relations between them, the ontology should also model concept names and role names for image content. For example, from the image in Figure 2.9 (page 30), concept names such as PersonFace, PersonBody, FinishLine, etc., can be identified. Spatial relations between objects, such as adjacent, near, etc., can also be identified. Thus, the signature of the AEO ontology is augmented with those names.

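\[ (CN)_{\mathit{Athletics}} = \{\mathit{Date}, \mathit{PersonName}, \mathit{Performance}, \mathit{CountryName}, \mathit{Athlete}, \mathit{Person}, \mathit{PersonFace}, \mathit{PersonBody}, \mathit{FinishLine}\} \]

\[ (RN)_{\mathit{Athletics}} = \{\mathit{hasName}, \mathit{hasNationality}, \mathit{personNameToCountryName}, \mathit{personNameToPerformance}, \mathit{adjacent}, \mathit{near}\} \]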

Figure 2.9: Information from visual modality in the domain of athletics events.

The concept names SLC and DLC stand for Surface-Level Concept and Deep-Level Concept, respectively. The role names SLR and DLR stand for Surface-Level Role and Deep-Level Role, respectively. They are also included in the signature and represent the most generic concepts (roles), after $\top$, in the Tbox hierarchy. Thus, the ontology contains various inclusion axioms such as the following:

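\begin{align*}
\mathit{PersonName} &\sqsubseteq \mathit{SLC} \\
\mathit{CountryName} &\sqsubseteq \mathit{SLC} \\
\mathit{Person} &\sqsubseteq \mathit{DLC} \\
\mathit{Athlete} &\sqsubseteq \mathit{DLC} \\
\mathit{hasNationality} &\sqsubseteq \mathit{DLR} \\
\mathit{personNameToPerformance} &\sqsubseteq \mathit{SLR} \\
&\;\;\vdots
\end{align*}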

These generic concepts and roles have proved to be practical for applications, such as in [PKM09a]. They are used while querying the knowledge base to distinguish between explicit and implicit media content. They are also added to the signature (see Appendix A). Moreover, this distinction simplifies the work of an ontology engineer when modeling disjointness axioms between descriptions extracted by the SLI and DLI processes. This is explained in Section 3.2 (page 84).
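To make the querying idea concrete, the following Python sketch separates explicit (surface-level) from implicit (deep-level) assertions by following told inclusion axioms up to SLC/SLR or DLC/DLR. It is a simplification under stated assumptions: only the six inclusion axioms listed above are encoded, a real system would ask a DL reasoner for the subsumers, and the names `TOLD_PARENTS`, `ancestors`, `level` and `split_assertions` are hypothetical.

```python
# Told inclusion axioms listed above, stored as child -> set of parents.
TOLD_PARENTS = {
    "PersonName": {"SLC"},
    "CountryName": {"SLC"},
    "Person": {"DLC"},
    "Athlete": {"DLC"},
    "hasNationality": {"DLR"},
    "personNameToPerformance": {"SLR"},
}


def ancestors(name, told=TOLD_PARENTS):
    """Collect all told subsumers of a name by walking the hierarchy upwards."""
    seen = set()
    stack = [name]
    while stack:
        current = stack.pop()
        for parent in told.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def level(name):
    """Classify a concept or role name as 'surface', 'deep' or 'unknown'."""
    supers = ancestors(name)
    if supers & {"SLC", "SLR"}:
        return "surface"
    if supers & {"DLC", "DLR"}:
        return "deep"
    return "unknown"


def split_assertions(concept_assertions):
    """Partition (individual, concept) pairs into explicit and implicit ones."""
    explicit = [a for a in concept_assertions if level(a[1]) == "surface"]
    implicit = [a for a in concept_assertions if level(a[1]) == "deep"]
    return explicit, implicit
```

With this encoding, an assertion about a `PersonName` individual lands in the explicit list and one about an `Athlete` individual in the implicit list. Names whose inclusion axiom is not among those listed above are reported as `unknown` rather than guessed.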

After identifying most of the elements in the signature, complex concept descriptions can be defined. As previously described, abstract concepts are composed of various attributes. Thus, the concept Athlete can be defined with the help of a complex concept description that uses the signature and some constructs as follows.

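\begin{align*}
\mathit{Person} &\sqsubseteq \mathit{DLC} \sqcap \exists_{1}\mathit{hasName}.\top \\
&\qquad \sqcap \exists_{1}\mathit{hasNationality}.\top \\
&\qquad \sqcap \exists_{1}\mathit{hasPart}.\mathit{PersonBody} \sqcap \exists_{1}\mathit{hasPart}.\mathit{PersonFace} \\
\mathit{Athlete} &\sqsubseteq \mathit{Person}
\end{align*}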

The Tbox containing various complex concept descriptions can be seen in Appendix A. The DLI process uses the Tbox as a restriction mechanism to select between consistent and inconsistent interpretation results. DLC concepts are not exhaustively defined, given that media content is usually incomplete. Thus, only some attributes are actually found within the media content. Moreover, SLI processes can fail to identify information. Notice that if the SLI and DLI processes interpret both text and image content, the attributes of an abstract object, e.g., Person, come from different content modalities. For example, the description of the concept Person (see above) is composed of SLC concepts that represent the content semantics of segments found in text, e.g., PersonName, and of segments found in an image, e.g., PersonFace.

Domain and range restrictions for SLR roles should also be modeled. For example, for the roles personNameToCountryName and personNameToPerformance, the following restrictions are added.

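\begin{align*}
\exists\,\mathit{personNameToPerformance}.\top &\sqsubseteq \mathit{PersonName} \\
\top &\sqsubseteq \forall\,\mathit{personNameToPerformance}.\mathit{Performance} \\
\mathit{PersonName} &\sqcap (\leq 1\,\mathit{personNameToPerformance}.\mathit{Performance}) \\[1ex]
\exists\,\mathit{personNameToCountryName}.\top &\sqsubseteq \mathit{PersonName} \\
\top &\sqsubseteq \forall\,\mathit{personNameToCountryName}.\mathit{CountryName} \\
\mathit{PersonName} &\sqcap (\leq 1\,\mathit{personNameToCountryName}.\mathit{CountryName})
\end{align*}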
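The effect of these restrictions on SLI output can be illustrated with a small Python check. This is a closed-world sanity check written for illustration, not description logic reasoning: a DL reasoner would infer missing types instead of reporting them, and without the unique name assumption it would equate two fillers of an at-most-one restriction instead of rejecting them. The sketch assumes that distinct individual names denote distinct objects and that the number restrictions are meant as "a PersonName has at most one filler"; the names `RESTRICTIONS` and `violations` are hypothetical.

```python
# role -> (domain concept, range concept), read off the restrictions above.
RESTRICTIONS = {
    "personNameToPerformance": ("PersonName", "Performance"),
    "personNameToCountryName": ("PersonName", "CountryName"),
}


def violations(concept_assertions, role_assertions):
    """Return (kind, subject, role) triples for assertions that break a restriction.

    Assumption: closed-world check with unique names, for illustration only.
    """
    # Collect the asserted concepts of every individual.
    types = {}
    for individual, concept in concept_assertions:
        types.setdefault(individual, set()).add(concept)

    problems = []
    fillers = {}
    for subject, role, obj in role_assertions:
        if role not in RESTRICTIONS:
            continue
        domain, range_ = RESTRICTIONS[role]
        if domain not in types.get(subject, set()):
            problems.append(("domain", subject, role))
        if range_ not in types.get(obj, set()):
            problems.append(("range", subject, role))
        fillers.setdefault((subject, role), set()).add(obj)

    # At most one filler per subject and role.
    for (subject, role), objs in fillers.items():
        if len(objs) > 1:
            problems.append(("at-most-one", subject, role))
    return problems
```

For the assertions derived from Figure 2.8, where one person name is related to one performance and one country name, the function returns an empty list. Relating the same person name to a country name through personNameToPerformance would be reported as a range problem.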

The AEO ontology (see an excerpt in Appendix A) models athletics events, competitions, rounds, trials and athletes of many sport disciplines. Only the axioms required to illustrate the running example are shown in the appendix.

Having given a short description of the AEO ontology, we describe the SLI process in the next section. The aim is to show that state-of-the-art shallow-processing techniques are good enough to provide the input required by the DLI process.
