
5.4 Advanced Pattern Matching

5.6.2 Graph Pattern Matching

An early approach using pattern matching on dependency trees is RelEx (Fundel et al., 2007). RelEx uses a small set of fixed rules to extract directed PPIs from dependency trees. Some of these rules also take advantage of dependency types, for instance, to properly treat enumerations. A reimplementation of RelEx was recently evaluated on the same corpora we used (see Table 5.13) and was found to be on par with other systems, though some of its measures were considerably worse than those reported in the original publication (Pyysalo et al., 2008a). A notable difference from our approach is that RelEx rules were defined manually and are highly specific to protein-protein interactions. In contrast, we described a general method that performs pattern learning from automatically generated examples.

Liu et al. (2010a) utilize dependency tree patterns for event extraction on the BioNLP’09 corpus. Similar to our approach, the authors use the shortest path assumption to generate patterns automatically. In contrast to our approach, Liu et al. (2010a) learn patterns from manually annotated texts. The authors also experiment with different matching strategies, by aggregating different part-of-speech tags (e.g., singular and plural nouns) or trigger words. This work is continued in the context of BioNLP’11 by Liu et al. (2011), where the authors remove patterns with low quality (i.e., precision ≤ 0.25).
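The shortest path assumption mentioned above can be illustrated with a minimal sketch. It assumes a dependency parse given as (head, dependent, label) triples and treats the graph as undirected; the function name, the toy sentence, and the dependency labels are hypothetical and do not reproduce the implementation of any of the cited systems.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the tokens on a shortest undirected path, or None if unreachable."""
    # Build an undirected adjacency list from (head, dependent, label) triples.
    neighbours = {}
    for head, dependent, _label in edges:
        neighbours.setdefault(head, []).append(dependent)
        neighbours.setdefault(dependent, []).append(head)

    # Breadth-first search yields a shortest path in an unweighted graph.
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical parse of "ProtA strongly interacts with ProtB".
edges = [
    ("interacts", "ProtA", "nsubj"),
    ("interacts", "strongly", "advmod"),
    ("interacts", "ProtB", "prep_with"),
]
print(shortest_dependency_path(edges, "ProtA", "ProtB"))
# prints ['ProtA', 'interacts', 'ProtB']
```

Tokens off the path, such as the adverb in the toy sentence, do not enter the resulting pattern, which is what keeps patterns derived this way compact.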


Liu et al. (2013b) use approximate subgraph matching to detect biomedical events for the BioNLP’13 challenge. Rules are subsequently ranked by precision and low-ranking patterns are removed. In order to match tokens sharing the same meaning, the authors introduce a distributional similarity model (DSM). For instance, the words “interact” and “cooperate” can both be used to describe a protein-protein interaction. The authors implement the method from Pantel and Lin (2002) to find words sharing the same meaning in a specific domain. Additionally, the authors derive patterns not only from the shortest path, but from all possible paths. On the development set, the distributional model leads to an increase in recall, but drastically decreases precision. The all-path patterns increase F1 moderately, by 0.5 pp in comparison to the shortest path patterns. The authors removed DSM and all-path patterns from prediction on the test set, as neither provides strong positive contributions.
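Ranking patterns by precision and discarding the weakest ones can be sketched as follows. This is an illustration under stated assumptions, not the cited authors' code: the function name, the pattern strings, and the match counts are invented, and the default threshold of 0.25 is borrowed from the value reported above for Liu et al. (2011).

```python
def filter_patterns(match_counts, min_precision=0.25):
    """Rank patterns by precision and drop those at or below the threshold.

    match_counts maps a pattern to (true_positives, false_positives),
    counted on annotated data.
    """
    kept = []
    for pattern, (true_pos, false_pos) in match_counts.items():
        total = true_pos + false_pos
        if total == 0:
            # A pattern that never matched has no measurable precision.
            continue
        precision = true_pos / total
        if precision > min_precision:
            kept.append((pattern, precision))
    # Highest precision first.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept


# Hypothetical counts for three patterns.
match_counts = {
    "A <-nsubj- interact -prep_with-> B": (8, 2),
    "A -conj_and-> B": (1, 3),
    "A -dep-> B": (0, 0),
}
print(filter_patterns(match_counts))
# prints [('A <-nsubj- interact -prep_with-> B', 0.8)]
```

The second pattern has a precision of exactly 0.25 and is therefore removed, mirroring the "≤ 0.25" criterion.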

MacKinlay et al. (2013) learn subgraph patterns from the BioNLP’13 training corpus. The authors follow a self-training inspired methodology to increase the coverage (recall) of patterns. To this end, the authors incorporate the top-k results of TEES, a state-of-the-art tool for recognizing biomedical events, to infer additional patterns. For pattern matching, the authors utilize the same matching strategy as Liu et al. (2013b) (i.e., approximate subgraph matching and removal of low-quality patterns).

Ravikumar et al. (2011) apply the distant supervision paradigm to identify protein-residue associations on MEDLINE. A notable feature is that the authors perform a “physical validation” to remove spurious protein-residue associations. This validation matches the residue and position against the protein in question. It has been shown that physical validation of protein-residue pairs achieves very high precision (Thomas et al., 2011a), which leads to high-quality positive training instances. Patterns are generated using the shortest path assumption for protein-residue pairs passing the physical validation step.
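The idea behind such a physical validation can be sketched in a few lines. The sketch assumes that residues are given as one-letter amino acid codes and that positions are 1-based offsets into the protein sequence; both conventions, the function name, and the example sequence are assumptions made for illustration and are not taken from Ravikumar et al. (2011).

```python
def physically_valid(sequence, residue, position):
    """Check whether the 1-based position in the sequence holds the residue."""
    # Positions outside the sequence can never be validated.
    if position < 1 or position > len(sequence):
        return False
    return sequence[position - 1] == residue


# Hypothetical protein sequence and candidate (residue, position) mentions.
sequence = "MKTAYIAKQR"
candidates = [("K", 2), ("S", 3), ("A", 4), ("R", 50)]

# Only mentions consistent with the sequence are kept as positive instances.
validated = [c for c in candidates if physically_valid(sequence, *c)]
print(validated)
# prints [('K', 2), ('A', 4)]
```

A mention is kept only if the sequence agrees with it, so a wrongly associated protein is unlikely to pass by chance, which is consistent with the high precision reported above.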

MEDLINE Scale Text Mining

Problems such as synonyms, differences in word morphology, homonyms, and lack of adherence to nomenclature complicate the recognition of named entities (see Subsection 2.4.1). This also impedes research: Ogino and Wilson (2004) pointed out that ambiguous nomenclature led to multiple discoveries of the same mutation by different groups, which could have been avoided by using existing nomenclatures. Together with the rapid increase of biomedical literature (Hunter and Cohen, 2006), researchers face several problems when searching for relevant literature. For instance, Dogan et al. (2009) reported that over one third of all 58 million PubMed queries collected for March 2008 result in hundreds or even thousands of articles. Consequently, there is a growing body of research trying to provide improved retrieval for scientific texts to end-users (Lu, 2011). A prerequisite for such advanced search features is high-quality named entity recognition and relationship extraction.

In previous chapters we covered different aspects of relationship extraction. These studies always evaluated results using relatively small, manually annotated corpora. In order to support biomedical researchers in satisfying their individual information needs, text-mining methods need to be applied to as many research articles as possible. Besides citations contained in MEDLINE, this also encompasses full-text articles, as they exhibit a much higher information content than abstracts (Schuemie et al., 2004).

This chapter discusses the application of different information extraction components to all available citations in MEDLINE and full texts in the PubMed Central open access subset and is organized as follows: Section 6.1 describes the architecture and implementation of our large-scale semantic text mining engine GeneView. Section 6.2 covers the computational resources needed to perform the individual text-mining steps. The user interface is briefly described in Section 6.3. The shift towards large-scale text mining provides a setting to evaluate the usability and performance of text mining tools on a larger scale, without relying on relatively small gold standard corpora. Such an evaluation is covered in Section 6.4, where we evaluate the reconstruction and expansion of an existing PPI network using text mining methods. This section also covers additional evaluations and applications using data provided by GeneView.¹

¹ Joint work with J. Starlinger, A. Vowinkel, S. Arzt, and U. Leser.

6 GeneView – End-user access to MEDLINE Scale Text Mining