
2.3 NLP Methods Supporting Document Exploration

2.3.1 Concept Map Mining

The automatic creation of concept maps from unstructured text has been studied in several areas8 and is often referred to as concept map mining. Different techniques towards that goal have been suggested for single documents (Oliveira et al., 2001, Valerio and Leake, 2006, Villalon and Calvo, 2009, Kowata et al., 2010, Aguiar et al., 2016) and for sets of documents (Rajaraman and Tan, 2002, Zouaq and Nkambou, 2008, Zubrinic et al., 2012, Qasim et al., 2013), spanning a broad range of text genres including scientific papers (Qasim et al., 2013), legal documents (Zubrinic et al., 2012), news articles (Kowata et al., 2010), student essays (Villalon and Calvo, 2009) and general web pages (Rajaraman and Tan, 2002). Most of that work focuses on processing English texts, with notable exceptions that target Portuguese (Kowata et al., 2010) and Croatian (Zubrinic et al., 2012).

Agreement is 62% and 27% for the first annotation run and 77% and 85% after refining the guidelines and the tool, introducing the described extractiveness requirements (Villalon, 2012).

8 Being published in different communities, this work follows various scientific standards and practices. As a result, some papers do not provide the level of detail and experimental rigor that is common in the NLP community today, which makes such work hard to compare and reproduce.


Figure 2.2: Subtasks of concept map mining and their dependencies: concept mention extraction, relation mention extraction, concept mention grouping, relation mention grouping, concept labeling, relation labeling, importance estimation, and concept map construction.

Typically, computational methods approach the task with a pipeline of several steps that turn the input text(s) into a concept map. Within this thesis, we will use the comprehensive list of subtasks depicted in Figure 2.2. It subsumes most other suggested lists, e.g. by Villalon and Calvo (2008), and provides a framework to structure and compare proposed techniques. Concept and relation mention extraction refer to the tasks of identifying spans in the input documents that describe concepts and relations between them, while the subtask of mention grouping deals with determining which of the extracted mentions refer to the same concept or relation. The subsequent steps of concept and relation labeling and importance estimation assign labels to concepts and relations and determine how relevant these elements are. Finally, a concept map is constructed from (a subset of) them.
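To make this decomposition concrete, the following sketch shows how the subtasks could be chained programmatically; all function names and signatures are illustrative and do not correspond to any of the systems discussed below.

# A minimal sketch of the subtask pipeline from Figure 2.2. Every function
# passed in is a placeholder for one subtask; none of these names come from
# a published system.

def build_concept_map(documents,
                      extract_concept_mentions, extract_relation_mentions,
                      group_mentions, choose_label,
                      estimate_importance, construct_map):
    # 1+2: find spans mentioning concepts and relations
    concept_mentions = extract_concept_mentions(documents)
    relation_mentions = extract_relation_mentions(documents)
    # 3+4: group mentions that refer to the same concept/relation
    concepts = group_mentions(concept_mentions)
    relations = group_mentions(relation_mentions)
    # 5: pick one representative label per group
    concept_labels = [choose_label(c) for c in concepts]
    relation_labels = [choose_label(r) for r in relations]
    # 6: score elements; 7: assemble a labeled graph from a subset of them
    scores = estimate_importance(concepts, relations)
    return construct_map(concepts, relations,
                         concept_labels, relation_labels, scores)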

In Section 3.3, we discuss these subtasks and their challenges in detail. Some methods also establish a hierarchical organization of the concepts to satisfy Novak's hierarchy requirement (see Section 2.2.1), but since this is strongly connected to the visual layout of the concept map, it is usually considered out of scope for the concept map mining task.

Existing work on concept map mining used a variety of evaluation protocols to study the effectiveness of the proposed methods, ranging from qualitative expert judgments (Kowata et al., 2010, Zubrinic et al., 2015, Qasim et al., 2013) to automatic comparisons against manual annotations (Aguiar et al., 2016, Villalon, 2012) and extrinsic, task-based evaluations (Rajaraman and Tan, 2002, Valerio et al., 2012). The exact evaluation procedure and data vary from paper to paper, and the number of concept maps evaluated is usually small (often fewer than five). While some evaluate their proposed approach in isolation, others compare it against baselines. However, we are not aware of a single paper that makes a direct comparison to any of the other works discussed in this section. This is a serious shortcoming of the research on concept map mining so far, as it remains unclear which method performs best and how absolute and relative performance might differ depending on text genre, document type or other influencing factors.

We want to briefly mention additional work that is related to concept map mining but does not produce concept maps as defined by Novak. For instance, de la Chica et al. (2008) focus on extracting sentence-long concept descriptions, which makes their work, although aimed at concept maps, essentially multi-document summarization (see Section 2.3.2). Olney et al. (2011) propose a method that takes concepts and relations from pre-defined lists for biology education rather than from the text, which makes it laborious to apply the method to other domains. Kof et al. (2010) extract concepts from two different documents and try to align them with each other, a process they refer to as "concept mapping". There is also work on the creation of mind maps that uses the term concept map despite creating unlabeled relations, such as the papers by Chen et al. (2006), Tseng et al. (2010), Chen and Bai (2010) and Lee et al. (2015). The applicability of these techniques to concept map mining is very limited, as most of them assume given concepts and solely focus on creating unlabeled relations.

2.3.1.1 Concept Mention Extraction

Given a set of documents, the goal of concept mention extraction is to identify all mentions of concepts, i.e. sequences of words in the input documents referring to a concept. All existing work for this subtask relies on automatic syntactic annotations to identify concept mentions, in particular part-of-speech tags and constituency (Marcus et al., 1993) or dependency (de Marneffe and Manning, 2008) parse trees.

Rajaraman and Tan (2002) and Kowata et al. (2010) both use regular expressions to define relevant sequences of part-of-speech tags. They both target noun phrases, covering nouns with optional preceding adjectives and noun phrases nested via prepositions. Valerio and Leake (2006) and Valerio et al. (2008) propose to use constituency parse trees instead of just part-of-speech tags and extract “minimal noun phrases”, i.e. noun phrases that do not have smaller noun phrases within them. Aguiar et al. (2016) define regular expression patterns over constituency parse trees. Likewise, earlier work by Oliveira et al. (2001) claims to rely on parse trees, but does not provide details.
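As an illustration, such POS-pattern chunking can be sketched with NLTK's RegexpParser; the grammar below is our own simplified approximation of the described patterns, not the exact rules of Rajaraman and Tan (2002) or Kowata et al. (2010).

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Noun phrase: optional adjectives and one or more nouns, optionally
# extended by a preposition and another adjective/noun sequence.
GRAMMAR = r"NP: {<JJ.*>*<NN.*>+(<IN><JJ.*>*<NN.*>+)*}"
chunker = nltk.RegexpParser(GRAMMAR)

def concept_mentions(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = chunker.parse(tagged)
    return [" ".join(token for token, _ in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == "NP"]

print(concept_mentions("The electronic commerce of small companies "
                       "relies on secure payment systems."))
# e.g. ['electronic commerce of small companies', 'secure payment systems']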

As an alternative approach, Qasim et al. (2013) define patterns over dependency syntax representations. They target two-token concept mentions and extract sequences such as electronic commerce or semantic web. Using a short list of six patterns, covering adjectival and adverbial modifiers, noun compounds, prepositional constructions, conjunctions and simple, single-token nouns, they report good performance on their corpus. Zouaq and Nkambou (2008, 2009) pursue a similar direction and define 33 patterns. However, they only provide examples and do not reveal the full list of patterns. A slightly different approach has been suggested by Villalon (2012), who first applies a list of five transformations to a dependency tree and then selects all nodes containing a noun as concepts. The transformation operations merge nodes together based on the relation between them, yielding noun constructions similar to those of Qasim et al. (2013)'s method.
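A minimal version of such dependency-based extraction, covering only the adjectival-modifier and noun-compound patterns, could be sketched with spaCy as follows; the pattern set is our simplification, not Qasim et al. (2013)'s published list.

import spacy

nlp = spacy.load("en_core_web_sm")

def two_token_mentions(text):
    # amod: "semantic web"; compound: "concept map"
    mentions = []
    for token in nlp(text):
        if token.dep_ in ("amod", "compound") and token.head.pos_ == "NOUN":
            mentions.append(f"{token.text} {token.head.text}")
    return mentions

print(two_token_mentions("Electronic commerce builds on the semantic web."))
# e.g. ['Electronic commerce', 'semantic web']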


2.3.1.2 Relation Mention Extraction

The relation mention extraction subtask is about finding mentions of relations in the input documents and, similar to concept mention extraction, most approaches rely on syntactic structures for this. In addition to its span in the text, a relation mention also references two concept mentions whose relationship it describes. All work we are aware of requires these three mentions to co-occur in the same sentence and does not extract relations that are expressed across sentence boundaries.

In line with their concept extraction strategies, Kowata et al. (2010) and Rajaraman and Tan (2002) use regular expressions over part-of-speech tags, targeting constructions around verbs; Valerio and Leake (2006) and Oliveira et al. (2001) select verb phrases from constituency parses, but do not reveal the exact selection algorithm; and Zouaq and Nkambou (2008, 2009) define patterns over dependency structures. Both Qasim et al. (2013) and Zubrinic et al. (2015) find relations by extracting verb phrases that connect two concept mentions via subject and object dependencies. Using their transformed dependency parse tree, Villalon (2012) selects the tokens along the shortest path between two concept mentions. Olney et al. (2011) rely on semantic role labeling (SRL) in addition to a dependency parse, but, as mentioned earlier, in order to map predicates to a set of pre-defined relations rather than to find the actual words used to mention the relation.
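The core of the verb-centered strategy, a verb connecting two concept mentions via subject and object dependencies, can be sketched as follows; phrase expansion and all further details of the cited systems are omitted.

import spacy

# A minimal subject-verb-object extractor: the verb is the relation mention,
# its nsubj and dobj dependents are the two connected concept mentions.
nlp = spacy.load("en_core_web_sm")

def svo_relations(text):
    triples = []
    for token in nlp(text):
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ == "nsubj"]
        objects = [c for c in token.children if c.dep_ == "dobj"]
        for subj in subjects:
            for obj in objects:
                triples.append((subj.text, token.lemma_, obj.text))
    return triples

print(svo_relations("Concept maps support learners."))
# e.g. [('maps', 'support', 'learners')]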

Moreover, some authors also attempt to extract hyponymy relations between concepts that are not explicitly mentioned in the text. Qasim et al. (2013) rely on lexico-syntactic patterns such as "A including B and C", known as Hearst patterns (Hearst, 1992), to derive that a hyponymy relation holds between (A, B) and (A, C). Zubrinic et al. (2015) use a simple heuristic that assumes hyponymy if A is a substring of B, e.g. concept map is a hyponym of map. They also add additional hyponymy relations from a domain thesaurus. Since these relations lack a specific mention in the text, a default label such as "is type of" is typically used when they are included in the concept map.
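Both techniques are easy to illustrate: below, a single Hearst pattern is implemented as a regular expression, restricted to single-token concepts, together with the substring heuristic; real systems use a larger set of patterns and more careful matching.

import re

# One Hearst pattern ("A including B and C", Hearst, 1992) as a regex.
HEARST_INCLUDING = re.compile(r"(\w+) including (\w+) and (\w+)")

def pattern_hyponyms(sentence):
    match = HEARST_INCLUDING.search(sentence)
    if not match:
        return []
    hypernym, hypo_a, hypo_b = match.groups()
    return [(hypernym, hypo_a), (hypernym, hypo_b)]

def substring_hyponym(specific, general):
    # heuristic of Zubrinic et al. (2015): "concept map" is a hyponym of "map"
    return specific != general and specific.endswith(general)

print(pattern_hyponyms("Vehicles including cars and trucks are taxed."))
# [('Vehicles', 'cars'), ('Vehicles', 'trucks')]
print(substring_hyponym("concept map", "map"))  # True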

2.3.1.3 Concept and Relation Mention Grouping

Having identified mentions, the goal of concept mention grouping is to group mentions of the same concept together. Similarly, relation mention grouping attempts this for relations. A simple baseline approach for this task is to group together all mentions that are exactly the same, e.g. all occurrences of children. More sophisticated approaches should also recognize morphological variants, such as child, or synonyms, such as kids, as mentions of the same concept (or the same relation).

Towards that goal, Villalon (2012) uses stemming to unify morphological variations, and Valerio and Leake (2006) merge mentions if one is a substring of another, ignoring tokens that are not tagged as nouns or adjectives. In order to also capture synonyms without any lexical overlap, a popular approach is to look up such relationships in WordNet (Miller et al., 1990), as suggested by Oliveira et al. (2001). Villalon (2012) does this to check equivalence on a per-token basis, while Aguiar et al. (2016) use Lin (1998)'s algorithm to compute a similarity based on WordNet and merge mentions if the similarity is above a threshold. Zubrinic et al. (2015) additionally use a domain-specific thesaurus. As an alternative, resource-independent approach, Rajaraman and Tan (2002) propose a clustering method using term-frequency-based vector representations of concept mentions. All these works apply their methods to concept mentions, but none attempts to group relation mentions, although the techniques might be partially applicable there as well.
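As an illustration, morphological normalization and a WordNet synonym check can be combined into a simple grouping procedure; we use lemmatization where Villalon (2012) uses stemming, and the greedy first-match grouping is our own simplification.

import nltk
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

def are_synonyms(a, b):
    # single-token mentions are synonyms if they share a WordNet synset
    return bool(set(wn.synsets(a)) & set(wn.synsets(b)))

def group_mentions(mentions):
    groups = []  # each group: (representative lemma, list of mentions)
    for mention in mentions:
        lemma = lemmatizer.lemmatize(mention.lower(), pos="n")
        for group in groups:
            if lemma == group[0] or are_synonyms(lemma, group[0]):
                group[1].append(mention)
                break
        else:
            groups.append((lemma, [mention]))
    return [members for _, members in groups]

print(group_mentions(["children", "child", "kids", "school"]))
# e.g. [['children', 'child', 'kids'], ['school']]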

The grouping subtask can be seen as a special case of coreference resolution (Jurafsky and Martin, 2009, Chapter 21.3ff), in which we are mainly interested in coreference between noun phrases, as these are the phrases mostly extracted during concept extraction. Some work tries to resolve pronominal anaphora before concept extraction to increase recall (Oliveira et al., 2001, Qasim et al., 2013, Aguiar et al., 2016). However, we are not aware of any work that uses existing coreference resolution systems for the grouping task.

2.3.1.4 Concept and Relation Labeling

Having grouped the mentions together, an additional step is to choose one mention per group as the representative label that will be used in the concept map. Rajaraman and Tan (2002) use the most frequent mention, counting exactly repeated mentions within a group. For instance, if a concept has the mentions {children, child, children, kids}, then children is the most frequent, as it occurs twice. All other work, even if using a grouping strategy, does not report how labels are selected.
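This frequency-based rule is straightforward to implement:

from collections import Counter

def choose_label(mentions):
    # the most frequent surface form wins; ties are resolved arbitrarily
    return Counter(mentions).most_common(1)[0][0]

print(choose_label(["children", "child", "children", "kids"]))  # 'children'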

2.3.1.5 Importance Estimation and Concept Map Construction

In the final subtask of concept map construction, a labeled graph, the final concept map, has to be created. The graph's nodes are concepts and they are labeled with the representative mentions selected in the previous step. Similarly, edges are relations with their labels.

The majority of existing work simply takes everything that was extracted and adds it to the concept map. To be more selective and create maps of reasonable size even if the input is large, several authors proposed to score concepts, here referred to as importance estimation, and to use only a fixed number or fraction of the highest-scoring ones. Valerio and Leake (2006) compute term frequencies for all stemmed nouns and adjectives in the input document and score a concept with the maximum frequency of its terms. An adaptation of the popular TF-IDF scoring model (Spärck Jones, 1972) that works on the level of concepts, called CF-IDF, is used by Zubrinic et al. (2015). Villalon (2012) builds a term-sentence matrix, applies latent semantic analysis (LSA) to it to obtain scores for every sentence and uses only concepts and relations from the highest-scoring sentences. Instead of the input document, Aguiar et al. (2016) use the graph of all concepts and relations for scoring. They calculate concept scores with the Hub Authority Root Distance (HARD) model (Leake et al., 2004), choosing concepts that are central and highly connected. With regard to relation selection, Qasim et al. (2013) choose among multiple relations for the same pair of concepts with a VF-ICF metric, preferring verbs that occur often (verb frequency) but co-occur with only a small number of concepts ("inverse co-occurrence frequency").
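As an illustration, the frequency-based scoring of Valerio and Leake (2006) can be sketched in a few lines; we omit their part-of-speech filtering to nouns and adjectives.

from collections import Counter
from nltk.stem import PorterStemmer

# A concept scores as high as its most frequent stemmed term in the document.
stemmer = PorterStemmer()

def term_frequencies(tokens):
    return Counter(stemmer.stem(t.lower()) for t in tokens)

def concept_score(concept_label, frequencies):
    # maximum document frequency over the concept's (stemmed) terms
    return max(frequencies[stemmer.stem(t.lower())]
               for t in concept_label.split())

doc = "Concept maps show concepts and labeled relations between concepts".split()
frequencies = term_frequencies(doc)
print(concept_score("concept maps", frequencies))  # 3: "concept(s)" occurs thrice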

All the scoring strategies described above aim at determining the importance or relevance of a concept (or relation) in order to select a subset that is representative of the input. The map construction becomes more difficult if one also tries to optimize for other objectives, such as producing a well-connected map. Simply selecting the most important concepts can yield many unconnected ones, as there might not be any relations between them. Zubrinic et al. (2015) try to avoid this: they pre-select a subset of the 100 most important concepts according to their CF-IDF metric, build a graph from them and then iteratively remove the nodes with the lowest degree until reaching a target size of 25 to 30 concepts. By selecting nodes based on their degree, their approach keeps the concept map as connected as possible.
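A sketch of this degree-based pruning, using the networkx library, could look as follows; tie-breaking among equal-degree nodes is arbitrary here.

import networkx as nx

# Pruning in the spirit of Zubrinic et al. (2015): start from the graph over
# the pre-selected concepts and repeatedly drop the node with the lowest
# degree until the target size is reached.
def prune_to_size(graph, target=25):
    g = graph.copy()
    while g.number_of_nodes() > target:
        weakest = min(g.degree, key=lambda node_deg: node_deg[1])[0]
        g.remove_node(weakest)
    return g

g = nx.karate_club_graph()  # stand-in for a concept graph
print(prune_to_size(g, target=10).number_of_nodes())  # 10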