
6.4 Representing Situation-Dependent Visual Context

6.4.1 The Contents of Visual Scene Representations

The amount of information that can potentially be extracted from a visual scene is enormous. A number of cognitive top-down processes such as visual attention and context-based expectation, however, help to reduce this large amount of information to a cognitively manageable set of salient features that are extracted from the visual scene for further processing. Strohner et al. (2000) report a number of experiments that illustrate the strong influence of attentional focus upon cross-modal reference formation in ambiguous cross-modal matching situations. As we deliberately exclude the complexity of these top-down processes from consideration in our model, we need to make appropriate assumptions about the effect that these processes have upon the representation of visual percepts. For our model, we assume that top-down cognitive processes have already effected a pre-selection of entities from visual context. Precisely these entities will be represented and shall be modelled to interact with linguistic processing. Our context models encode exactly this selected visual scene information. The selection processes that determine which part of visual context is focussed on are considered outside the scope of our model.

Visual scenes offering several situations for extraction can be studied with our model by designing a separate context model for each of those situations. Each situation then gives rise to a separate cross-modal interaction with language, and each cross-modal interaction requires a separate parse run with a distinct context model. Our model can therefore only approximate the effect of multistable visual percepts (see Section 3.1) upon linguistic analysis. As the context representations in our model are inherently static, each of the multistable states needs to be represented as a distinct context model that gives rise to a separate cross-modal interaction with linguistic processing.
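To illustrate this modelling strategy, the following sketch shows how two stable states of a multistable percept would each be encoded as a separate, static context model and submitted to a separate parse run. The function parse() and the context model contents are hypothetical placeholders and do not reflect the actual parser interface:

# A minimal sketch of the modelling strategy described above. parse() and the
# context model contents are hypothetical placeholders, not the actual parser
# interface: each stable state of a multistable percept is encoded as its own
# static context model and gives rise to a separate parse run.

def parse(sentence: str, context_model: dict) -> dict:
    """Placeholder for a single parse run under one static context model."""
    return {"sentence": sentence, "context": context_model["name"]}

# Two stable states of one multistable visual percept, modelled separately.
context_models = [
    {"name": "stable_state_A", "assertions": set()},  # assertions of state A
    {"name": "stable_state_B", "assertions": set()},  # assertions of state B
]

sentence = "..."  # the sentence material under analysis
analyses = [parse(sentence, ctx) for ctx in context_models]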

In order for linguistic processing to be influenced by visual understanding, the representation of visual understanding must contain the linguistically relevant entities and relations. Our context models are intended to represent the output of the process of visual understanding. As such, they primarily represent the entities observed in a visual scene, the situation that binds these entities together and the thematic relations that relate the entities to each other. We also include information beyond the visually perceivable when this information is likely to be known or inferable from prior knowledge or world knowledge. An example for this is the visual perception of entities that are identified by their relation to other entities which themselves are not part of the visual scene. Consider the context representation resulting from the visual perception of Dominik's son. Prior knowledge identifies the visually perceived person as Dominik's son, even if Dominik is not part of the visual scene. Our context model of this visual scene will hence include a representation of both entities, son01 and dominik01.

Some thematic roles may be easier to observe visually than others. AGENT and THEME, for example, are generally quite easy to extract from a visual scene, especially if dynamic rather than static visual scene information is available. The role OWNER, on the other hand, is an example of a role that is more difficult, if not impossible in some cases, to extract from inspection of a visual scene. For our model, we assume that additional knowledge about the entities perceived in the situation is also incorporated in the process of visual understanding. It is additional knowledge in the form of prior context and world knowledge that permits the assignment of the visually less accessible thematic role OWNER.

As an example, consider Figure 6.5, where one participant has been identified as a PhD student.1 If we assume that this participant is already known as 'the researcher's PhD student', the recognition of the entity phd.student.f01 in the visual scene permits the inclusion of the additional, visually inaccessible thematic relation in the output representation of visual understanding. An implication of this argument is that the output representation of visual understanding can, in some cases, include entities that are not even physically present in the observed visual scene. In the example in Figure 6.5, prior knowledge about phd.student.f01 can warrant the inclusion of the isOWNERfor relation with researcher.f01 in the context model, even if the latter entity is not physically present in or detectable from the scene. This argument is in line with our approach of using the visual context model as a representation of all linguistically relevant entities and relations identified in the process of visual understanding.

Refinements to the semantic representation of visual scene context such as modal aspects or negation are demanded by Requirement R20. These aspects have not been incorporated into our representation of visual context, in order to limit the modelling complexity in the interaction with linguistic analysis. We hypothesise that modals and negations differ from factual assertions in their effect on linguistic analysis and hence require different modelling with regard to their effect on the assignment of semantic dependencies in the linguistic analysis. An appropriate modelling approach for these contextual aspects may require different types of inferences and, possibly, even a different logic. We recommend that a systematic investigation into these phenomena be undertaken in future research.

1 We omit the indication of German gender marking in the English translation of German sentence material, unless it is essential for the argument.

Binary Visual Scene Context:

  VK-274 '. . . , dass die (Doktorandin der Forscherin) den Beweis lieferte.'
         '. . . that the researcher's PhD student delivered the evidence.'

Class Assertions in Cross-Modal Context:

  phd.student.f01  --is instance of-->  phd.student.f_{u singular}
  researcher.f01   --is instance of-->  researcher.f_{u singular}
  evidence01       --is instance of-->  evidence_{u singular}
  etw.liefern01    --is instance of-->  etw.liefern_{ag th}

Object Property Assertions in Cross-Modal Context:

  phd.student.f01  --isAGENTfor-->  etw.liefern01
  evidence01       --isTHEMEfor-->  etw.liefern01
  researcher.f01   --isOWNERfor-->  phd.student.f01

Figure 6.5: The inclusion of the thematic role OWNER into the representation of visual context to reflect the contribution of contextual and world knowledge.
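As a concrete illustration of how such a context model could be assembled, the following Python sketch encodes the assertions of Figure 6.5 and marks the point at which prior knowledge contributes the entity researcher.f01 and the isOWNERfor relation. The data structures and method names are illustrative assumptions and do not describe the implementation used in our system:

# A minimal sketch (not the thesis implementation) of assembling the context
# model of Figure 6.5: assertions extracted from the visual scene are merged
# with an entity and a relation contributed by prior knowledge.

from dataclasses import dataclass, field

@dataclass
class ContextModel:
    """A-Box-style context model: class and object property assertions."""
    class_assertions: dict = field(default_factory=dict)   # instance -> class
    property_assertions: set = field(default_factory=set)  # (subject, role, object)

    def assert_instance(self, instance, concept):
        self.class_assertions[instance] = concept

    def assert_relation(self, subject, role, obj):
        self.property_assertions.add((subject, role, obj))

ctx = ContextModel()

# Assertions derived directly from visual understanding of the scene.
ctx.assert_instance("phd.student.f01", "phd.student.f")
ctx.assert_instance("evidence01", "evidence")
ctx.assert_instance("etw.liefern01", "etw.liefern")
ctx.assert_relation("phd.student.f01", "isAGENTfor", "etw.liefern01")
ctx.assert_relation("evidence01", "isTHEMEfor", "etw.liefern01")

# Prior knowledge: the perceived person is known as the researcher's PhD
# student. This licenses an entity and a thematic relation that are not
# detectable in the scene itself.
ctx.assert_instance("researcher.f01", "researcher.f")
ctx.assert_relation("researcher.f01", "isOWNERfor", "phd.student.f01")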

According to Jackendoff's Conceptual Semantics, only the entities that have projected into Conceptual Structure subsequently have the potential to interact with syntactic representation. An entity needs to have been cognised, or, in terms of Conceptual Semantics, must have projected into Conceptual Structure in order to be able to affect linguistic processing (Jackendoff, 1983, p. 35). We consequently require that only those entities that have been represented in the context model may exert an influence upon linguistic processing. Effectively, this assumption provides a closure on the default open-world assumption of OWL reasoning. As our model centres around a constraint-based linguistic processor, we need to make this closed-world assumption in order to derive constraints on linguistic analyses that do not receive the support of positive evidence in visual context. A purely OWL-based formalism does not provide closed-world inference mechanisms. We hence implement these inferences in the PPC at a process stage posterior to communication with the OWL reasoner. A detailed description of the inferences resulting from this closure is provided in the description of the PPC's scoring algorithm in Section 7.4.
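The following self-contained sketch illustrates the effect of this closure under simplifying assumptions (the function name and the binary support decision are illustrative; the PPC's actual scoring algorithm is described in Section 7.4). Under OWL's open-world semantics, a missing assertion is merely unknown; for deriving constraints, the absence of positive evidence in the context model is instead treated as counter-evidence:

# A minimal, self-contained sketch of the closed-world closure described above.
# The property assertions correspond to Figure 6.5; the function name and the
# binary support decision are illustrative assumptions (see Section 7.4 for
# the actual scoring algorithm).

property_assertions = {
    ("phd.student.f01", "isAGENTfor", "etw.liefern01"),
    ("evidence01", "isTHEMEfor", "etw.liefern01"),
    ("researcher.f01", "isOWNERfor", "phd.student.f01"),
}

def supported_by_context(filler: str, role: str, governor: str) -> bool:
    """Closed-world check: a candidate thematic dependency counts as supported
    only if a matching positive assertion exists in the context model."""
    return (filler, role, governor) in property_assertions

# A candidate dependency with positive evidence is supported; one without
# positive evidence is penalised rather than left undecided, thereby closing
# the open world of the OWL A-Box.
assert supported_by_context("phd.student.f01", "isAGENTfor", "etw.liefern01")
assert not supported_by_context("evidence01", "isAGENTfor", "etw.liefern01")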

A mental representation of cross-modal context according to Conceptual Semantics is a representation of cognised entities, encoded as concept instances and thematic relations between them. The creation of such a representation presupposes the identification of perceived entities and hence can only be populated by the process of visual understanding.

Visual scene context:

  A man is giving a woman a book.

Context Model Class Assertions:

  man01            --is instance of-->  man_{u singular}
  book01           --is instance of-->  book_{u singular}
  woman01          --is instance of-->  woman_{u singular}
  jmd.etw.geben01  --is instance of-->  jmd.etw.geben_{ag re th}

Context Model Property Assertions:

  man01    --isAGENTfor-->      jmd.etw.geben01
  book01   --isTHEMEfor-->      jmd.etw.geben01
  woman01  --isRECIPIENTfor-->  jmd.etw.geben01

Figure 6.6: Typical assertions contained in a context model.

We also expect a representation of the output of visual understanding to comprise additional cognitively relevant information such as spatial, temporal or causal relations. In the current form of the model, however, this type of information is not captured in our representation of visual context. Due to the absence of spatial information in our context models, our representation of visual context in its current form also fails to meet Requirement R16 for pointers to sensory representations.

The contents of a context model comprise instances of acting entities or, more generally, situation entities, instances of actions or, more generally, situation concepts, and thematic relation assertions between those instances. As such, our knowledge representation of visual context satisfies Requirement R18 for the abstract representation of actions and acting entities. Typical context model assertions are exemplified in Figure 6.6.

To achieve a further reduction of the modelling complexity in the interaction between visual understanding and linguistic processing, we make another assumption:

The representation of cross-modal context as provided by the A-Box remains valid throughout the course of sentence processing. We hence assume that the visual scene information provided at the onset of linguistic processing is static and remains unchanged by the interim and final results of linguistic processing. This assumption therefore excludes the possibility that linguistic processing influences the process of visual understanding. In natural systems, the latter kind of interaction is observed frequently, for example, in cases where local syntactic or referential ambiguities are resolved by means of visual information in the course of incremental sentence processing. In those cases, the disambiguated linguistic information is found to have a directing effect on the course of eye fixations in a co-present visual scene (Tanenhaus et al., 1995, Ambiguous conditions of Experiments 1 and 2). A full model of the cross-modal interaction between vision and language, as opposed to a model of the cross-modal influence of vision upon language, will need to incorporate such bidirectional cross-modal interactions. Their exclusion from the scope of our model results from a technical limitation of the parser's predictor interface: the predictor interface only permits the unidirectional integration of non-linguistic information. A bidirectional interaction at parse time is currently not possible.1

Assuming that the representation of visual context is static over parse time justifies the use of a predictor for the computation of thematic relation scores prior to parse time. If the contextual representation remains unaffected by the course of linguistic processing, it makes no difference whether we query that representation prior to or during the process of parsing.
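Under this assumption, the scores can be tabulated once before parsing begins and merely looked up during the parse. The following sketch uses a simplifying binary scoring scheme over the assertions of Figure 6.6; it illustrates the precomputation argument and is not the PPC's actual scoring algorithm (see Section 7.4):

# A minimal sketch of precomputing thematic relation scores from a static
# context model before parse time. The binary 1.0/0.0 scoring is a simplifying
# assumption; the actual scoring algorithm is described in Section 7.4.

from itertools import product

# Context model assertions for the scene of Figure 6.6.
instances = ["man01", "woman01", "book01"]
situations = ["jmd.etw.geben01"]
roles = ["isAGENTfor", "isTHEMEfor", "isRECIPIENTfor"]

property_assertions = {
    ("man01", "isAGENTfor", "jmd.etw.geben01"),
    ("book01", "isTHEMEfor", "jmd.etw.geben01"),
    ("woman01", "isRECIPIENTfor", "jmd.etw.geben01"),
}

# Computed once, prior to parse time: one score per candidate thematic relation.
score_table = {
    (filler, role, governor): 1.0 if (filler, role, governor) in property_assertions else 0.0
    for filler, role, governor in product(instances, roles, situations)
}

# At parse time the predictor only performs lookups; because the context model
# cannot change during parsing, no further querying of the A-Box is required.
print(score_table[("man01", "isAGENTfor", "jmd.etw.geben01")])   # 1.0
print(score_table[("book01", "isAGENTfor", "jmd.etw.geben01")])  # 0.0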

A criticism that has been raised against our format of context representation is that the influence of the visual context upon linguistic processing is no longer cross-modal in nature.2 We argue against this view for two reasons: First, our model does indeed influence the processing of one representational modality based on information contained in another representational modality: the semantic representation of visual context modulates linguistic analysis on the syntactic level of representation. Importantly, both of these representational modalities are independent of each other and are representationally encapsulated in the sense of Jackendoff (1996). Second, the information encoded in the representation of visual context is an adequate approximation of some of the non-linguistic information that is readily available from inspection of a visual scene. As none of the entities or relations used to encode visual context information have linguistic properties, the nature of this representation as well as the information it encodes are genuinely non-linguistic.