Cross-Modal Matching - A Computational Model for the Influence of Cross-Modal Context upon Synt

Cross-modal matching in natural systems, as introduced in Section 3.7, refers to the establishment of co-reference between representational entities from different modalities. In our model, we face the challenge of matching homonyms, whose meaning is expressed in terms of concepts in the concept hierarchy, with sets of concept instances in the representation of visual context. Effectively, this process results in the creation of cross-modal referential links between entities in the linguistic modality

¹ In first approximation, semantic preferences could be modelled by the inclusion of normalised weights that reflect the contribution of each conceptualisation. Ideally, these weights should be context-sensitive rather than static.

Homonyms

H₁ schenkt:=[base:schenken,cat:VVFIN,...,person:third,number:sg,...,sem val:ag re th,valence:’a+d’,...];
H₂ schenkt:=[base:schenken,cat:VVFIN,...,person:second,number:pl,...,sem val:ag re th,valence:’a+d’,...];
H₃ schenkt:=[base:schenken,cat:VVFIN,...,person:third,number:sg,...,sem val:ag th,valence:a,...];
H₄ schenkt:=[base:schenken,cat:VVFIN,...,person:second,number:pl,...,sem val:ag th,valence:a,...];

Concepts

C₁ etw.schenken ⊓ singular:      has lexicalisation:schenken, situation valence:ag th
C₂ jmd.etw.schenken ⊓ singular:  has lexicalisation:schenken, situation valence:ag re th
C₃ etw.schenken ⊓ plural:        has lexicalisation:schenken, situation valence:ag th
C₄ jmd.etw.schenken ⊓ plural:    has lexicalisation:schenken, situation valence:ag re th

Cross-Modal Matching

                     H₁                 H₂                 H₃                 H₄
normalises to        schenken           schenken           schenken           schenken
                     {C₁, C₂, C₃, C₄}   {C₁, C₂, C₃, C₄}   {C₁, C₂, C₃, C₄}   {C₁, C₂, C₃, C₄}
compatible valence   ag re th           ag re th           ag th              ag th
                     {C₂, C₄}           {C₂, C₄}           {C₁, C₃}           {C₁, C₃}
compatible number    singular           plural             singular           plural
                     {C₂}               {C₄}               {C₁}               {C₃}

Figure 7.1: The effect of the three implemented criteria in linguistic bottom-up grounding for the present tense indicative VVFIN homonyms of ‘schenkt’ (give(s)).

to entities in the visual modality. The underlying idea of our model architecture is that the process of assigning semantic dependencies in linguistic processing shall be influenced by cross-modal context information if the homonyms to be scored in the linguistic analysis have a cross-modal match in visual context.
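The three filtering steps summarised in Figure 7.1 can be illustrated with a minimal sketch. It is not the implementation used in our model: the class names `Homonym` and `Concept`, the function `ground`, and the reading of "compatible valence" as plain string equality are assumptions made purely for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Homonym:
    name: str
    base: str        # normalised form, e.g. "schenken"
    number: str      # "sg" or "pl"
    sem_val: str     # semantic valence, e.g. "ag re th"

@dataclass(frozen=True)
class Concept:
    name: str
    lexicalisation: str
    situation_valence: str
    number: str      # "singular" or "plural"

# Assumed mapping from lexicon number values to concept number values.
NUMBER = {"sg": "singular", "pl": "plural"}

def ground(homonym, concepts):
    """Apply the three criteria in turn and return the surviving concept names."""
    by_lex = [c for c in concepts if c.lexicalisation == homonym.base]
    # Assumption: valence compatibility is modelled as equality.
    by_val = [c for c in by_lex if c.situation_valence == homonym.sem_val]
    by_num = [c for c in by_val if c.number == NUMBER[homonym.number]]
    return {c.name for c in by_num}

CONCEPTS = [
    Concept("C1", "schenken", "ag th", "singular"),
    Concept("C2", "schenken", "ag re th", "singular"),
    Concept("C3", "schenken", "ag th", "plural"),
    Concept("C4", "schenken", "ag re th", "plural"),
]
HOMONYMS = [
    Homonym("H1", "schenken", "sg", "ag re th"),
    Homonym("H2", "schenken", "pl", "ag re th"),
    Homonym("H3", "schenken", "sg", "ag th"),
    Homonym("H4", "schenken", "pl", "ag th"),
]

grounding = {h.name: ground(h, CONCEPTS) for h in HOMONYMS}
```

Under these assumptions each of the four homonyms of ‘schenkt’ is left with exactly one concept, in line with the bottom row of the figure.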

Similarly to grounding, cross-modal matching in natural systems is a bidirectional process susceptible to bottom-up and top-down influences. Top-down influences such as expectations arising from a percept in one modality will influence likely matching candidates in the other modality. For example, hearing the noise of a loud diesel engine approach when attempting to cross a road will trigger visual search for the corresponding large vehicle; the perceived auditory stimulus will not be attributed to the bicycle seen passing at the same time. Conversely, noticing a heavy truck approach without hearing the corresponding motor noise would give rise to an extremely bewildering percept.¹

In our model implementation we reduce the complexity of the cross-modal matching process to a single criterion: concept compatibility. A linguistic entity is modelled to be co-referent with an entity in visual scene context if its conceptualisation is compatible with the concept instantiated by the visually observed entity or entities. This, clearly, is a simplification in several respects. First of all, we assume that the given utterance is about the visual scene and thus makes reference to the entities or situations in the visual scene. Roy and Mukherjee (2005) refer to this approach as the assumption of immediate reference. We hence assume that the natural language utterance refers to the immediate visual scene context represented in the context model. This assumption may not hold for all cross-modal interactions between vision and language since not all situated utterances actually make reference to entities in the scene in which they are being uttered.

Secondly, we assume that cross-modal reference is established between entities that activate concepts which are semantically consistent or compatible with each other. While, in first approximation, this assumption is plausible for descriptive utterances, there are a number of linguistic devices such as irony or sarcasm which may not obey this rule.

Thirdly, conceptual compatibility is a weaker criterion than actual co-reference. Concept compatibility is a necessary but not a sufficient criterion for co-reference. An illustration of this point is given in Figure 7.2: The presupposition arising from the use of the definite article in ‘der Mann’ (the man) results in a strong preference for the interpretation that ‘der Mann’ and ‘der Schauspieler’ (the actor) do not refer to the same male individual, despite the fact that these words have compatible conceptualisations.

¹ The deliberate violation of such cross-modal top-down expectations based on world-knowledge has occasionally been used for artistic effect, e.g. in the deliberately bewildering cinematographic art of the French Nouvelle Vague director Alain Resnais (*1922).

Input Sentence:

‘Der Mann sieht den Schauspieler in einem Kinofilm.’
The man is seeing the actor in a movie.

From the T-Box:

(is satisfiable, man ⊓ actor, T-Box) = true

Preferred Interpretation:

  ‘der Mann’ --refers to--> man01
∧ ‘der Schauspieler’ --refers to--> actor01
∧ man01 ≠ actor01

Figure 7.2: In the majority of cases, concept compatibility is a necessary, but not a sufficient, criterion for co-reference.

A homonym may have several meanings, each of which can be compatible with a different concept instance in the representation of visual context. As a result, a homonym can match an entire set of entities in visual context. The mapping from homonym to concept instances hence need not be injective (one-to-one) or surjective (onto), let alone bijective (both one-to-one and onto). All of the matched entities, however, must instantiate a concept compatible with at least one of the homonym’s conceptualisations.

The cross-modal matching example in Figure 7.3 illustrates a case in which no homonym matches more than one instance in visual context, which actually is a special case. Due to the comparative looseness of the applied cross-modal matching criterion of concept compatibility, the mapping turns out to be non-injective in the majority of cases. In particular, semantically underspecified word classes such as pronouns tend to map to several entities in visual context. In analogy to the uni-modal linguistic bottom-up grounding of homonyms, our set-based approach permits the robust handling of the influence of lexical ambiguity and homophony upon cross-modal matching as well. Words that have several distinct meanings also have the potential to refer to different entities in a visual scene context.
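The set-based character of the matching can be made concrete with a small sketch. It is a simplified illustration rather than our implementation: concepts are encoded as hypothetical `(kind, number)` pairs, the single subsumption `man ⊑ male` is assumed, and the underspecified homonym in the last line is an invented example.

```python
# Hypothetical concept encoding: (kind, number).
# Assumed subsumption knowledge: "man" is a kind of "male".
SUBSUMED_BY = {"man": {"male"}}

def compatible(a, b):
    """Two concepts are compatible if their kinds are equal or one
    subsumes the other, and their number values agree."""
    kind_a, num_a = a
    kind_b, num_b = b
    kinds_ok = (
        kind_a == kind_b
        or kind_a in SUBSUMED_BY.get(kind_b, set())
        or kind_b in SUBSUMED_BY.get(kind_a, set())
    )
    return kinds_ok and num_a == num_b

def cross_modal_matches(conceptualisations, instances):
    """Return every context instance whose concept is compatible with
    at least one conceptualisation of the homonym."""
    return {
        inst
        for inst, concept in instances.items()
        if any(compatible(c, concept) for c in conceptualisations)
    }

# The two person instances from the context of Figure 7.3.
INSTANCES = {"man01": ("man", "singular"), "man02": ("man", "plural")}

er = cross_modal_matches({("male", "singular")}, INSTANCES)
die = cross_modal_matches(set(), INSTANCES)
maenner = cross_modal_matches({("man", "plural")}, INSTANCES)

# Invented homonym that is underspecified for number: it matches both
# instances, so the mapping is no longer injective.
underspecified = cross_modal_matches(
    {("male", "singular"), ("male", "plural")}, INSTANCES
)
```

A homonym without any conceptualisation, such as the determiner ‘die’, receives the empty match set, while the underspecified homonym is mapped to both instances.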

In fulfilment of Requirement R1 our model implements the process of cross-modal matching as mediated by a representation of word meaning. The experimental findings by Cooper (1974) and Huettig and Altmann (2005) presented in Section 2.2 further support this modelling decision. With our model’s focus on the influence of visual context upon linguistic processing, this realisation of cross-modal matching maps entities from the linguistic input to concept instantiations in the representation of visual context. It therefore fulfils Requirement R29 for the formation of cross-modal referential links from the linguistic to the non-linguistic modalities. As we have excluded the cross-modal interaction in the reverse direction from the modelling scope, our model does not meet Requirement R30 for establishing cross-modal referential links in the reverse direction.

Input Sentence:

‘Er hört die Männer singen.’
He hears the men sing.

Class Assertions in Cross-Modal Context:

man01          --is instance of-->  man ⊓ singular
man02          --is instance of-->  man ⊓ plural
etw.hoeren01   --is instance of-->  etw.hoeren ag th
null.singen01  --is instance of-->  null.singen ag

Object Property Assertions in Cross-Modal Context:

man01  --isAGENT for-->  etw.hoeren01
man02  --isTHEME for-->  etw.hoeren01
man02  --isAGENT for-->  null.singen01

Bottom-Up Linguistic Grounding:

‘Er’ (He)        --is conceptualised by-->  {male ⊓ singular}
‘hört’ (hears)   --is conceptualised by-->  {etw.hoeren ag th}
‘die’ (the)      --is conceptualised by-->  {}
‘Männer’ (men)   --is conceptualised by-->  {man ⊓ plural}
‘singen’ (sing)  --is conceptualised by-->  {null.singen ag}

Cross-Modal Matching:

‘Er’ (He)        --matches-->  {man01}
‘hört’ (hears)   --matches-->  {etw.hoeren ag th 01}
‘die’ (the)      --matches-->  {}
‘Männer’ (men)   --matches-->  {man02}
‘singen’ (sing)  --matches-->  {null.singen ag 01}

Figure 7.3: An example of cross-modal matching based on concept compatibility.

Whether a concept is compatible with another concept depends on the semantic properties asserted for it in the T-Box. The concept’s position in the T-Box conceptual hierarchy as well as its list of disjoint classes determine which other concepts it is compatible with. An illustration of cross-modal matching based on concept compatibility is given in Figure 7.3.
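How hierarchy position and disjointness declarations jointly determine compatibility can be sketched as follows. The miniature hierarchy in `PARENTS`, the disjointness axiom in `DISJOINT` and both function names are hypothetical. The check only approximates the satisfiability test of Figure 7.2; a full T-Box would normally be queried through a description logic reasoner.

```python
# Hypothetical toy T-Box: each concept lists its direct superconcepts.
PARENTS = {
    "man": {"male", "human"},
    "woman": {"female", "human"},
    "actor": {"human"},
    "male": {"thing"},
    "female": {"thing"},
    "human": {"thing"},
    "thing": set(),
}

# Hypothetical disjointness axiom: nothing is both male and female.
DISJOINT = {frozenset({"male", "female"})}

def ancestors(concept):
    """Return the concept together with all of its superconcepts."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, ()))
    return seen

def compatible(a, b):
    """Two concepts are treated as compatible unless some superconcept
    of one is declared disjoint with some superconcept of the other."""
    return not any(
        frozenset({x, y}) in DISJOINT
        for x in ancestors(a)
        for y in ancestors(b)
    )
```

In this toy T-Box, `compatible("man", "actor")` holds because no disjointness axiom separates the two concepts, whereas `compatible("man", "woman")` fails through the inherited disjointness of `male` and `female`.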

The implicit assumption that all members of the set of cross-modal matches are equally likely cross-modal referents of a given homonym clearly constitutes a simplification. In humans, the preference for one lexical reading over another one is expected to propagate from linguistic grounding into cross-modal matching (cf. Section 7.2): a preference for one specific conceptualisation of a homonym should also result in a preference for the homonym’s cross-modal match. Future extensions of our model should incorporate the ability to propagate semantic saliencies from homonym grounding into the process of cross-modal matching.

As outlined above, cross-modal matching on the basis of conceptual compatibility may result in cross-modal matching ambiguity in cases in which the mapping from homonym to concept instance is not injective. Since this seems to be the norm rather than the exception, natural systems apply additional criteria to reduce ambiguity in cross-modal matching. One factor to establish cross-modal matching preferences is the degree of conceptual fit: a homonym that matches several context entities will preferentially match that entity which exhibits the largest conceptual overlap with the homonym’s preferred conceptualisation. An implementation of this notion in our model would require a gradable measure of conceptual overlap in addition to a weighted representation of word meaning. At the level of implementation described in this work, neither of these has been included.
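Since neither a gradable overlap measure nor weighted word meanings are part of our model, the following is only a hypothetical illustration of what ranking by conceptual fit could look like. It substitutes Jaccard similarity over invented feature sets for the unspecified overlap measure; the feature names and the functions `jaccard` and `best_match` are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets (stand-in overlap measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match(preferred_features, candidates):
    """Pick the candidate instance whose features overlap most with the
    homonym's preferred conceptualisation."""
    return max(
        candidates,
        key=lambda inst: jaccard(preferred_features, candidates[inst]),
    )

# Invented feature sets for a homonym whose preferred reading is "actor".
preferred = {"human", "male", "adult", "performer"}
candidates = {
    "man01": {"human", "male", "adult"},
    "actor01": {"human", "male", "adult", "performer"},
}

best = best_match(preferred, candidates)
```

Both instances would pass the binary compatibility test, but under this stand-in measure `actor01` scores 1.0 against 0.75 for `man01` and would therefore be preferred.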

We finally need to explicate the minimal conditions under which our model can produce a cross-modal influence of visual context upon linguistic processing. Our model is based on the notion that only those semantic relations in visual context can give rise to dependency score predictions which have been asserted between entities that match homonyms in the input sentence. Our model thus requires at least two homonyms from different slots to have different cross-modal matches in order for the context model to be able to affect linguistic processing. Otherwise, the PPC cannot make a context-based prediction and all context-based predictions for semantic dependencies will default to unity. We now give a complete description of the PPC’s scoring algorithm in the following section.
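The minimal condition can be expressed as a simple check. The sketch below is not the PPC scoring algorithm itself: the slot indices, the relation triples, and the functions `predicting_relations` and `context_score` are illustrative assumptions, with the relations taken from the example context of Figure 7.3.

```python
DEFAULT_SCORE = 1.0  # neutral score ("unity") when no prediction is possible

def predicting_relations(matches, relations):
    """Return the asserted relations whose two ends are matched by
    homonyms from different slots.

    matches:   dict mapping slot index -> set of matched instance ids
    relations: iterable of (subject, role, object) triples
    """
    found = []
    for subj, role, obj in relations:
        subj_slots = {s for s, m in matches.items() if subj in m}
        obj_slots = {s for s, m in matches.items() if obj in m}
        if any(a != b for a in subj_slots for b in obj_slots):
            found.append((subj, role, obj))
    return found

def context_score(matches, relations, predicted_score):
    """Fall back to the neutral score unless a prediction is possible."""
    if predicting_relations(matches, relations):
        return predicted_score
    return DEFAULT_SCORE

RELATIONS = [
    ("man01", "AGENT", "etw.hoeren01"),
    ("man02", "THEME", "etw.hoeren01"),
    ("man02", "AGENT", "null.singen01"),
]

# Slots 0..4 correspond to 'Er', 'hört', 'die', 'Männer', 'singen'.
full = {0: {"man01"}, 1: {"etw.hoeren01"}, 2: set(),
        3: {"man02"}, 4: {"null.singen01"}}
single = {0: {"man01"}, 1: set(), 2: set(), 3: set(), 4: set()}
```

With the full set of matches all three asserted relations can inform scoring, whereas a single matched homonym leaves no relation to predict from, so the score defaults to unity.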
