

Arabic VerbNet: Previous work and motivation

3.1 A model for a verb lexicon

We have reached the point where we should start thinking about a verb lexicon that integrates the theoretical achievements outlined in the previous chapter and takes into account the following requirements:

a. Representing the facets of a speaker’s lexical knowledge.

b. Providing a degree of generality which allows us to capture verb meaning in terms of classes.

c. Making explicit the correlation between the syntactic surface forms of verbs as reflected in their argument structure and their semantic content.

d. Representing meaning shifts, including temporal and aspectual changes related to linguistic items such as adverbs.

e. Establishing a machine-readable, compact, large-scale knowledge base for verbs that can easily be used in natural language processing (NLP).

f. Maintaining and updating the knowledge base in an easy and intuitive way.

The current state of the art offers many lexica that more or less relate to the criteria outlined above. As we mentioned in the last chapter, most of these resources are for English or other well-resourced languages. However, in the last decade much work has been conducted on similar resources for languages like Arabic. In most cases, well-established resources for English are used as a model.

The next sections present an overview of related work and resources and their availability for Arabic as well as a discussion on the motivation behind developing VerbNet for Arabic as part of this dissertation.

3.2 WordNet and Arabic WordNet

WordNet is a lexical database for English developed by George Miller, Christiane Fellbaum and others at the Cognitive Science Laboratory of Princeton University (Fellbaum, 1998). It groups words (nouns, verbs, adjectives and adverbs) into sets of synonyms (synsets). Polysemous words have more than one synset. A synset is related to other synsets by semantic relations like hypernymy, hyponymy, holonymy and meronymy. Synsets of verbs are connected using additional relations such as troponymy, entailment and coordination.
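The organisation just described can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions: the `Synset` class, the relation labels, the identifiers such as `eat.1` and the toy entries are all hypothetical and are not taken from the WordNet database or its API. It shows synsets as sets of lemmas, polysemy as membership in more than one synset, and labelled links between synsets.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison, since synsets may link to each other cyclically
class Synset:
    name: str                      # hypothetical identifier, not a real WordNet key
    lemmas: list[str]
    gloss: str = ""
    relations: dict[str, list["Synset"]] = field(default_factory=dict)

    def link(self, relation: str, target: "Synset") -> None:
        """Add a labelled, directed link to another synset."""
        self.relations.setdefault(relation, []).append(target)


def senses(word: str, synsets: list[Synset]) -> list[str]:
    """Names of all synsets containing the word; more than one means polysemy."""
    return [s.name for s in synsets if word in s.lemmas]


def hypernym_chain(synset: Synset) -> list[str]:
    """Follow the first hypernym link upwards and collect the synset names."""
    chain = []
    current = synset
    while current.relations.get("hypernym"):
        current = current.relations["hypernym"][0]
        chain.append(current.name)
    return chain


# Toy entries for illustration only (assumed, not copied from WordNet).
consume = Synset("consume.1", ["consume", "ingest"], "take in food or drink")
eat_food = Synset("eat.1", ["eat"], "take in solid food")
eat_meal = Synset("eat.2", ["eat", "dine"], "have a meal")
nibble = Synset("nibble.1", ["nibble"], "eat in small bites")

eat_food.link("hypernym", consume)
nibble.link("hypernym", eat_food)
eat_food.link("troponym", nibble)   # verb-specific relation: a manner of eating

ALL_SYNSETS = [consume, eat_food, eat_meal, nibble]
```

In this sketch, `senses("eat", ALL_SYNSETS)` lists two synsets for the polysemous verb, and `hypernym_chain(nibble)` walks from the most specific verb to the most general one.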

For verbs, WordNet provides generic sentence frames which correspond to the argument structure in which the members of the synset are allowed to appear. In addition, each verb entry is complemented by an example sentence illustrating the concrete realization of its argument structure (1).

(1) Verb eat in WN

eat: (take in solid food) “She was eating a banana”; “What did you eat for dinner last night?”

Somebody —s something

The development process of WordNet, including the Princeton WordNet and EuroWordNet, was used to build similar lexica for other languages. Thus, Arabic WordNet was developed (and is still in development) using the Suggested Upper Merged Ontology (SUMO) as an interlingua to link Arabic words to synsets of previously developed WordNets, in particular the English WordNet (Elkateb et al., 2006; Fellbaum et al., 2006).

Although WordNet is an indispensable resource in many NLP tasks, it shows several shortcomings, especially when it comes to the treatment of verbs. WordNet is more focused on detecting paratactic semantic relations between words and provides no lexical information that represents the compositionality of its concepts, such as thematic roles, semantic predicates and temporal structures.

This information turns out to be particularly important for a number of NLP tasks such as semantic parsing, knowledge representation and reasoning. Additionally, WordNet provides no information on how semantics and syntax interact to build a meaning and how the regular syntactic behaviors of words can be exploited to build classes of words besides the synsets. In addition, it provides no possibility of representing the temporal structure of predicates and its context-dependent coercion as proposed by Dowty (1977) or Moens & Steedman (1988).

3.3 FrameNet and Arabic FrameNet

FrameNet is a lexical resource developed at the International Computer Science Institute in Berkeley under the initial leadership of Charles Fillmore, the creator of Frame Semantics, which serves as the theoretical basis for the resource (Ruppenhofer et al.). In this resource, words are defined in terms of frames describing the essential elements and relations involved in the situation or interaction described by these words. Each frame names a relation between participants, such as Locative relation or Being born. Thus, the meaning of the verb “sell” is only accessible when we understand the situation of commercial transfer, referred to as commerce sell, with its participants (the seller, the buyer, the goods and the money) and the different relations between seller and buyer, between seller and money, between buyer and money, etc. A frame is independent of the syntactic realization of a certain word. Therefore, the two sentences in (2) are described in terms of the same frame.

(2) a. The child broke the window.

b. The window broke.

FrameNet also specifies different relations between frames. These relations include inheritance, subframe, causative of, inchoative of and using. For instance, the frame commerce sell inherits properties from the general frame of giving, is itself inherited by the frame of renting and is used by the two related frames carry goods and exporting. FrameNet also provides a corpus annotated with semantic structures, which offers empirical evidence about the realization of frames at the syntactic level. FrameNets have been developed for other languages such as German, Japanese, Spanish and Danish. For Arabic, some work has been done to provide frame annotation for corpora, but to date no substantial results have been reported.
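The frame-to-frame relations mentioned for commerce sell can be sketched as a small set of labelled triples. This is a minimal illustration, not FrameNet's actual data format or API: the underscore identifiers, the relation labels `inherits_from` and `uses`, and the helper functions are assumptions introduced here, and only the relations named in the text are encoded.

```python
# (source frame, relation, target frame), restricted to the relations named in the text.
FRAME_RELATIONS = [
    ("commerce_sell", "inherits_from", "giving"),
    ("renting", "inherits_from", "commerce_sell"),
    ("carry_goods", "uses", "commerce_sell"),
    ("exporting", "uses", "commerce_sell"),
]

# Participants of the commercial transfer situation, as listed in the text.
FRAME_ELEMENTS = {"commerce_sell": ["seller", "buyer", "goods", "money"]}


def targets(frame: str, relation: str) -> list[str]:
    """Frames that the given frame points to via the relation."""
    return [t for s, r, t in FRAME_RELATIONS if s == frame and r == relation]


def sources(frame: str, relation: str) -> list[str]:
    """Frames that point to the given frame via the relation."""
    return [s for s, r, t in FRAME_RELATIONS if t == frame and r == relation]


def ancestors(frame: str) -> list[str]:
    """All frames reachable by following inheritance links upwards."""
    found = []
    queue = targets(frame, "inherits_from")
    while queue:
        parent = queue.pop(0)
        if parent not in found:
            found.append(parent)
            queue.extend(targets(parent, "inherits_from"))
    return found
```

Here `ancestors("renting")` follows inheritance through commerce sell up to giving, while `sources("commerce_sell", "uses")` returns the two frames that use it.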

FrameNet has been criticized for not being generic enough to be used in a unified manner in NLP tasks such as machine learning (Gildea & Jurafsky, 2002): frame elements depend on the concepts described by the frames and can appear or disappear according to the current lexical item (Kipper Schuler, 2005, 23).

3.4 PropBank and Arabic PropBank

PropBank (PB) (?) is a corpus annotated with verbal propositions and their arguments. It defines a set of underlying semantic roles for each verb and annotates each of its occurrences in the corpus of the original Penn Treebank (Taylor et al., 2003). Thematic roles are generic numbered arguments of the kind ARG0, ARG1, ARG2, where ARG0 corresponds to the roles assigned to the subject, ARG1 to roles assigned to the direct object, etc. Note that these tags are consistent across different argument realizations of the same verb in the corpus. They are defined as two-part descriptions: the first part consists of the argument and the second part of a verb-specific meaning description. Thus, the two main arguments of the verb attack have the specifications [attacker] for ARG0 and [attacked] for ARG1.

The total number of argument types in PB is 24, with five primary numbered arguments (ARG0, ARG1, ARG2, ARG3, ARG4) and 19 adjunctive arguments (ARG0-STR, ARG1-PRD, etc.).

An argument set or a roleset contributes to defining the frameset, which describes the broad meaning of a verb and its possible syntactic realizations. It comprises a roleset and its description, a general meaning description and an example in the form of a sentence and/or a parse tree extracted from the Penn Treebank (3).

(3) Attack: to make an attack, criticize strongly

Roles:

Arg0: attacker

Arg1: entity attacked

Arg2: attribute

Example: transitive

...

Mr. Baldwin is also attacking the greater problem: lack of ringers.

Arg0: Mr. Baldwin

Argm-dis: also

Rel: attacking

Arg1: the greater problem : lack of ringers

...
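The frameset entry in (3) can be sketched as a small data structure that pairs each numbered argument with its verb-specific description and then looks those descriptions up for an annotated sentence. The class names `Roleset` and `Annotation` and the method `describe` are hypothetical and are not PropBank's file format or tooling; the role labels and text spans are copied from example (3), with labels written in upper case.

```python
from dataclasses import dataclass


@dataclass
class Roleset:
    lemma: str
    gloss: str
    roles: dict[str, str]          # numbered argument -> verb-specific description


@dataclass
class Annotation:
    sentence: str
    roleset: Roleset
    spans: dict[str, str]          # label -> annotated text span

    def describe(self) -> list[tuple[str, str, str]]:
        """Pair each annotated span with the roleset's description of its label.

        Labels not defined by the roleset (adjuncts, the predicate itself)
        get an empty description.
        """
        rows = []
        for label, text in self.spans.items():
            description = self.roleset.roles.get(label, "")
            rows.append((label, text, description))
        return rows


attack = Roleset(
    lemma="attack",
    gloss="to make an attack, criticize strongly",
    roles={"ARG0": "attacker", "ARG1": "entity attacked", "ARG2": "attribute"},
)

example = Annotation(
    sentence="Mr. Baldwin is also attacking the greater problem: lack of ringers.",
    roleset=attack,
    spans={
        "ARG0": "Mr. Baldwin",
        "ARGM-DIS": "also",
        "REL": "attacking",
        "ARG1": "the greater problem : lack of ringers",
    },
)
```

Calling `example.describe()` yields one row per annotated span. The numbered arguments receive the verb-specific descriptions from the roleset, whereas the adjunct and the predicate carry none.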

PropBank has been developed for other languages such as Chinese (Palmer et al., 2005b), Korean (Palmer et al.) and Arabic (Palmer et al., 2008; Zaghouani et al., 2010).

Arabic PropBank is a semantically annotated corpus developed (and still under development) at the University of Colorado at Boulder using an approach similar to the one used for the English and Chinese PropBanks (Palmer et al., 2005a,b). It relies on the Penn Arabic Treebank (PATB) (Maamouri et al., 2004), which it enriches with dependency structures carrying argument labels and sense tags. It considers not only verbs but also other predicative elements such as deverbals, adjectives and, in some cases, nouns.

Although PB offers a rich resource for verb semantics with a solid empirical base, it shows some shortcomings, which can be summarised as follows:

1. PB annotates individual items in the corpus without linking them in terms of common semantic properties or syntactic behavior. This is also the case for pairs of derivationally related items.

2. PB does not define selectional restrictions for arguments.

3. PB does not define temporal structures of predicates.

4. Features used to describe verb meaning are also associated with individual verb entries and cannot be used in a general and consistent way.

3.5 VerbNet

Among the existing lexical resources that are good candidates for the requirements outlined above, we choose VerbNet (Kipper Schuler, 2005). VerbNet is a large-scale lexical resource that uses the notion of verb classes and integrates the information described in the previous sections better than other resources such as WordNet, FrameNet or PropBank. It benefits from Levin’s verb classes and enriches them with novel verb classes extracted automatically from a corpus for English (Korhonen & Briscoe, 2004). Each class is defined by a set of diathesis alternations reflecting all possible argument structures of verbs associated with a more or less stable semantic meaning (see Table 3.1). Each part of the alternation is a frame providing a description of the typical syntactic structure of the verb associated with a semantic property. Syntactic structures are defined in terms of arguments selected by the verbs and complements introduced by some of their meaning aspects (for instance, the resultative adjective in change-of-state verbs).

Arguments are constrained by selectional restrictions such as [+/− plural] for collective arguments or [+/− sentential] for sentential complements. Each class defines the set of thematic roles assigned to its arguments, in a manner similar to the roles introduced by Fillmore and used in generative semantics (Fillmore, 1968).

The semantic description of each frame has the form of a compositional semantics similar to the proposals of Jackendoff (1990a,b) and Rappaport et al. (1993).

They describe the meaning of verbs by conjoining semantic predicates such as cause and state with thematic roles and temporal functions, which are inspired by the tripartite nucleus structure of Moens & Steedman (1988) and indicate the part of the event's time span in which the predicate is true. A class is a hierarchical structure which hands down its properties to its subclasses. The important information of the class resides in the frames that reflect the alternations in which the verbs can appear. Every frame is represented as an example sentence, a syntactic structure and a semantic structure.

Every class can have subclasses for members which deviate from the prototypical verb in some non-central point. A subclass recursively reflects the same structure as the main class and can itself have subclasses. A subclass inherits all properties of the main class and is placed in such a way that the members in the top level exclude the information it adds.
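The inheritance behaviour of classes and subclasses can be sketched as a simple recursive structure. This is an illustrative sketch only: the class `VerbClass`, its methods and the toy entries (`toy-1`, `verb_a`, and so on) are hypothetical and do not reproduce VerbNet's actual file format or any real class. It shows a subclass seeing the roles and frames of all its ancestors, while the top level does not see what a subclass adds.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity comparison, since parent and subclasses reference each other
class VerbClass:
    name: str
    members: list[str] = field(default_factory=list)
    roles: list[str] = field(default_factory=list)
    frames: list[str] = field(default_factory=list)
    parent: Optional["VerbClass"] = None
    subclasses: list["VerbClass"] = field(default_factory=list)

    def add_subclass(self, sub: "VerbClass") -> "VerbClass":
        """Attach a subclass, which may itself receive further subclasses."""
        sub.parent = self
        self.subclasses.append(sub)
        return sub

    def all_roles(self) -> list[str]:
        """Roles inherited from ancestors, followed by the roles added here."""
        inherited = self.parent.all_roles() if self.parent else []
        return inherited + [r for r in self.roles if r not in inherited]

    def all_frames(self) -> list[str]:
        """Frames inherited from ancestors, followed by the frames added here."""
        inherited = self.parent.all_frames() if self.parent else []
        return inherited + [f for f in self.frames if f not in inherited]


# Hypothetical toy hierarchy, not an actual VerbNet class.
top = VerbClass("toy-1", members=["verb_a", "verb_b"],
                roles=["Agent", "Patient"], frames=["NP V NP"])
sub = top.add_subclass(VerbClass("toy-1.1", members=["verb_c"],
                                 roles=["Instrument"], frames=["NP V NP PP"]))
subsub = sub.add_subclass(VerbClass("toy-1.1.1", members=["verb_d"],
                                    frames=["NP V"]))
```

In this sketch, `subsub.all_frames()` accumulates the frames of all three levels, whereas `top.all_frames()` contains only the frame shared by every member.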

Class: Amuse

Members: abash, affect, afflict, affront, aggravate, aggrieve, impress, incense, inflame, infuriate, irk, irritate, jade, jolt, lull, intimidate, intoxicate, jar, jollify, madden, menace, mesmerize, miff, molest, . . .

Lexicalization: [EMOTIONAL-STATE], [CAUSE]

Roles and Restrictions: Experiencer [+animate], Causer

Frames | Examples | Role Assignment
NP V NP | The clown amused the children. | Causer V Experiencer
NP V ADV | Little children amuse easily. | Experiencer V ADV
NP V NP | The clown amused. | Cause V
NP V NP PP | The clown amused the children with his antics. | Causer V Experiencer {with} Oblique
NP V NP | The clown’s antics amused the children. | Cause <+genitive>(’s) Oblique V Experiencer
NP V NP ADJ | That movie bored me silly. | Causer V Experiencer

Table 3.1: The amuse class in English
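As a rough illustration of how the content of Table 3.1 could be held in machine-readable form, the following sketch stores the class as a dictionary with one record per frame. The layout, the `Frame` tuple and the helper functions are assumptions made here for illustration and are not VerbNet's actual representation. The member set is a small subset of the members listed in the table, and the frame rows are copied from it as given.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    pattern: str           # syntactic frame
    example: str           # example sentence
    role_assignment: str   # thematic roles mapped onto the pattern


AMUSE = {
    "name": "Amuse",
    # Subset of the members listed in Table 3.1.
    "members": {"abash", "affect", "irritate", "madden", "mesmerize"},
    "lexicalization": ["EMOTIONAL-STATE", "CAUSE"],
    "roles": {"Experiencer": ["+animate"], "Causer": []},
    "frames": [
        Frame("NP V NP", "The clown amused the children.", "Causer V Experiencer"),
        Frame("NP V ADV", "Little children amuse easily.", "Experiencer V ADV"),
        Frame("NP V NP", "The clown amused.", "Cause V"),
        Frame("NP V NP PP", "The clown amused the children with his antics.",
              "Causer V Experiencer {with} Oblique"),
        Frame("NP V NP", "The clown's antics amused the children.",
              "Cause <+genitive>('s) Oblique V Experiencer"),
        Frame("NP V NP ADJ", "That movie bored me silly.", "Causer V Experiencer"),
    ],
}


def frames_for(verb_class: dict, pattern: str) -> list[Frame]:
    """All frame records of the class that share a syntactic pattern."""
    return [f for f in verb_class["frames"] if f.pattern == pattern]


def patterns(verb_class: dict) -> list[str]:
    """Distinct syntactic patterns of the class, in table order."""
    seen = []
    for frame in verb_class["frames"]:
        if frame.pattern not in seen:
            seen.append(frame.pattern)
    return seen
```

A lookup such as `frames_for(AMUSE, "NP V NP")` returns the three rows that share this surface pattern but differ in their role assignment, which is the kind of syntax-to-semantics correlation the class is meant to make explicit.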