Information extraction deals with the extraction of previously defined facts from unstructured documents. Two popular subtasks of information extraction, named entity recognition and relationship extraction, are explained in more detail in this section.

2.4.1 Named Entity Recognition

The goal of named entity recognition (NER) is to identify entities of a previously defined type. Examples are corporations, locations, or person names. Examples of named entities in the biomedical domain are gene/protein, mutation, or disease names. The unambiguous association of a named entity with a unique canonical form or database identifier is termed named entity normalization.

Several properties of name usage, such as term ambiguity or the partial use of existing nomenclature, make NER a difficult task. For instance, several gene names are derived from the phenotype observed when the gene is absent or depleted. These gene names can overlap with common English words, as for the fruit fly gene breathless (FBgn0005592). Additional ambiguous gene names are blood, disco, red, or can (Proux et al., 1998).

Another problem is that researchers tend to neglect nomenclatures and instead prefer previously established synonyms (Tamames and Valencia, 2006). For instance, the official gene symbol XCL1 (Entrez: 6375) can be found 311 times in all of MEDLINE,

2 Biomedical Text Mining

but the synonym ATAC is mentioned 417 times and the full gene name lymphotactin occurs 255 times. According to the Entrez Gene database (Maglott et al., 2011), the human gene with the most synonyms is OR4H6P (Entrez: 26322) with 31 different entries. Overall, the gene with the most known synonyms is the Drosophila gene tws (Entrez: 47877) with 89 entries. The distribution of synonyms for all human genes has been extracted from Entrez Gene and is shown in Figure 2.9.

Gene names follow no regular structure but can appear as anything from a three-letter acronym to a complex multi-token name. Similar problems exist for other entity types, such as diseases (whose names often contain ordinary person names, like “Wilson's disease”) or medical symptoms (whose names can be used in many contexts unrelated to diseases, e.g., “shiver” or “cold”). As with gene nomenclature, the mutation nomenclature has not been fully adopted by researchers. This example will be discussed in more detail in Subsection 6.4.2.

Figure 2.9: Histogram of synonyms for all human genes according to Entrez Gene. The x-axis shows the number of synonyms per gene (0 to 9, with 10-31 as a single bin); the y-axis shows the number of genes (0 to 25,000).

A related problem is the introduction of morphological variations. For instance, the human gene BRCA1 is also referred to as BRCA-1, Brca-1, BRCA-I, and many more.
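The variants listed above differ only in case, hyphenation, and the use of a Roman numeral. As a minimal sketch, and not the normalization procedure used in this thesis, such variants could be collapsed onto one key as follows. The helper name `normalize_gene_mention` and the small Roman numeral table are assumptions made for illustration.

```python
import re

# Hypothetical lookup table for illustration: trailing Roman numerals to digits.
_ROMAN = {"i": "1", "ii": "2", "iii": "3", "iv": "4", "v": "5"}


def normalize_gene_mention(mention: str) -> str:
    """Collapse simple morphological variants of a gene symbol onto one key."""
    text = mention.strip().lower()
    # A Roman numeral that follows a hyphen or space is rewritten as a digit,
    # e.g. "brca-i" becomes "brca1".
    match = re.fullmatch(r"(.+?)[-\s]+(i{1,3}|iv|v)", text)
    if match:
        text = match.group(1) + _ROMAN[match.group(2)]
    # Remaining hyphens and whitespace are dropped.
    return re.sub(r"[-\s]+", "", text)


variants = ["BRCA1", "BRCA-1", "Brca-1", "BRCA-I"]
keys = {normalize_gene_mention(v) for v in variants}
```

All four variants map to the same key here. A rule this simple will also merge names that should stay distinct, which is one reason why normalization is usually treated as a separate task.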

Another problem often mentioned in the context of NER is the uncertainty of exact text boundaries (Wang, 2010). For instance, some annotators annotate species mentions co-located with the protein name (e.g., “human hemoglobin”), whereas others consider only “hemoglobin” as the gene/protein mention.

NER should also be able to handle spelling errors, such as “colorecal cancer” (PMID: 16422107) or “colorecatal cancer” (PMID: 22202261), and erroneous hyphenation, such as “colorec-tal cancer” (PMID: 19663088). Gene names can also be accidentally modified by the automatic conversion function of tools such as Microsoft Excel (Zeeberg et al., 2004).

In contrast to other domains, where NER is considered an essentially solved problem (Balke, 2012), biomedical NER remains far from being solved in a satisfying manner. For example, for the recognition of person names, organizations, and geographic locations, the best performing team achieved an F1 of 96 % during the Message Understanding Conference-6 (Grishman and Sundheim, 1996). In contrast, performance for gene, chemical, and disease named entities has been estimated at about 61 %, 74 %, and 51 % F1, respectively, during the BioCreative IV-CTD shared task (Wiegers et al., 2014).

2.4.2 Relationship Extraction

The goal of relationship extraction is the detection of relations between named entities.

This task has gained much attention in recent years, and a large number of publications dealing with relationship extraction have appeared. In this thesis, we focus on binary undirected relationship extraction. This annotation scheme has been identified as the greatest common denominator of protein-protein interaction corpora (Pyysalo et al., 2008a).

The scientific community often distinguishes three approaches to relationship extraction, which are not mutually exclusive. These three general approaches are described in the remainder of this section. Related work concerning protein-protein interaction will be described in more detail at the end of this chapter, and approaches for drug-drug interaction will be explained in Chapter 3.

Co-occurrence

Early approaches used the concept of co-occurrence to detect relations between named entities. The working hypothesis of this approach is that entities mentioned in the same textual context can be expected to share a semantic context. Textual context types are, for instance, document, paragraph, sentence, or phrase (Ding et al., 2002). Co-occurrence based approaches achieve very high recall and low precision, as they predict a relationship for every entity pair in a given context. Depending on the frequency of positive instances, precision for PPI extraction ranges from 17 % to 50 % (Pyysalo et al., 2008a). Co-occurrence is most often used as a baseline to evaluate relation extraction approaches. The precision of co-occurrence can be substantially improved by requiring the mention of an interaction word (Kabiljo et al., 2009) or by considering other heuristics, such as sentence length or the distance between two entities.
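The sentence-level variant of this baseline can be sketched in a few lines. The following is a minimal illustration, assuming that entities have already been recognized and are given as one list per sentence; the function names and the toy data are hypothetical and not taken from any of the cited systems.

```python
from itertools import combinations


def cooccurrence_pairs(sentences):
    """Predict a relation for every pair of distinct entities sharing a sentence.

    `sentences` holds one list of entity names per sentence.
    """
    predicted = set()
    for entities in sentences:
        # Sorting makes the pairs undirected: (A, B) and (B, A) share one key.
        for a, b in combinations(sorted(set(entities)), 2):
            predicted.add((a, b))
    return predicted


def precision_recall(predicted, gold):
    """Compare predicted undirected pairs with gold-standard pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data: two sentences and one annotated interaction.
toy_sentences = [["A", "B", "C"], ["A", "D"]]
toy_gold = {("A", "B")}
toy_predicted = cooccurrence_pairs(toy_sentences)
toy_precision, toy_recall = precision_recall(toy_predicted, toy_gold)
```

On this toy input the baseline predicts four pairs, of which one is annotated, so recall is 1.0 and precision is 0.25. This mirrors the high-recall, low-precision behavior described above.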

An advantage of co-occurrence based approaches is that they require no manually annotated training data and are therefore easy to adapt to novel domains. Another advantage is that they require no sophisticated NLP analysis and are thus often used in large-scale applications where run-time is important. Applying co-occurrence to a large text repository gives it additional strength, as frequently found interactions are more likely to be correct. Some of these frequency-based approaches use statistical measures such as χ², mutual information, or the log-likelihood ratio to find significantly overrepresented co-occurrences (Bunescu and Mooney, 2005b; Rebholz-Schuhmann et al., 2007; Hur et al., 2009; Fleuren et al., 2011). Wright et al. (2010) showed that statistically motivated approaches often outperform purely frequency-based co-occurrence.
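As an illustration of such a statistical measure, the χ² statistic for a single entity pair can be computed from a 2×2 contingency table of sentence counts. The sketch below uses the standard closed form for 2×2 tables. The function name, the argument names, and the counts in the example are assumptions for illustration and do not reproduce the setup of any cited work.

```python
def chi_square(n_both, n_a, n_b, n_total):
    """Chi-square statistic for the co-occurrence of entities A and B.

    n_both:  sentences mentioning both A and B
    n_a:     sentences mentioning A
    n_b:     sentences mentioning B
    n_total: all sentences in the collection
    """
    # Cells of the 2x2 contingency table.
    a = n_both                        # A and B
    b = n_a - n_both                  # A without B
    c = n_b - n_both                  # B without A
    d = n_total - n_a - n_b + n_both  # neither A nor B
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table, no evidence either way
    return n_total * (a * d - b * c) ** 2 / denominator


# Hypothetical counts: the pair co-occurs more often than expected by chance.
overrepresented = chi_square(n_both=10, n_a=20, n_b=20, n_total=100)
# Hypothetical counts matching statistical independence (20 * 20 / 100 = 4).
independent = chi_square(n_both=4, n_a=20, n_b=20, n_total=100)
```

The first call yields a large value, whereas the second yields zero because the observed count equals the count expected under independence. Ranking pairs by such a score rather than by raw frequency is the idea behind the statistically motivated approaches mentioned above.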

Pattern based

The second type of approach uses a previously defined set of linguistic patterns to extract relationships. Early approaches in the biomedical domain relied on simple patterns of the form “EntityA relation EntityB” (Blaschke et al., 1999). For this work, Blaschke et al. used a predefined set of 14 verbs (e.g., associated with, bind, suppress, ...) and their possible inflections. For instance, the regular expression regulat(ions?|(e[esd]?)) matches different inflections of the word regulate.
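The behavior of the regular expression quoted above can be checked directly. The following sketch uses Python's `re` module; the surrounding word boundaries (`\b`), the case-insensitive flag, and the function name are additions for this illustration and are not part of the original pattern.

```python
import re

# Pattern from the text, wrapped in word boundaries for this sketch.
REGULATE = re.compile(r"\bregulat(ions?|(e[esd]?))\b", re.IGNORECASE)


def find_regulate_forms(sentence):
    """Return all inflections of 'regulate' matched in the sentence."""
    return [m.group(0) for m in REGULATE.finditer(sentence)]


forms = find_regulate_forms("Regulation of GeneA is regulated by GeneB.")
missed = find_regulate_forms("GeneB is regulating GeneA.")
```

The first sentence yields “Regulation” and “regulated”. The second yields nothing, since the pattern as quoted does not cover the form “regulating”; hand-written patterns have such gaps unless every inflection is listed explicitly.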

Similar patterns are used by Ono et al. (2001), who additionally define rules to handle complex sentence structure and negation. Baumgartner et al. (2008) manually defined 67 rules¹ using a regular grammar based on words, POS tags, phrase types, and ontology concepts. Other approaches defined patterns on the dependency graph (Ding and Berleant, 2003; Rinaldi et al., 2006; Fundel et al., 2007).

Originally, these pattern-based approaches relied on manually defined rules, but approaches that learn patterns automatically have also been proposed. Caporaso et al. (2007b) explain a strategy to semi-automatically learn surface patterns for the recognition of mutation mentions. Mutations consist of three mandatory arguments (wildtype, location, and surrogate). Therefore, this task can be defined as a ternary relationship extraction problem. Potential patterns are automatically derived from MEDLINE by searching for sentences containing all three arguments. Recognized arguments are replaced by argument-specific placeholders (e.g., lysine becomes aminoacid) to increase generalizability. Patterns are then generated by extracting the shortest span (on the surface level) that covers all arguments and the words between them. Automatically generated patterns are ranked by frequency and manually annotated for correctness. The same strategy has been exploited to learn drug-disease relationship patterns (Xu and Wang, 2013) and histone modification patterns (Thomas and Leser, 2013). In all three domains the strategy achieves excellent precision (>90 %) on manually annotated corpora, with recall at approximately 80 %.
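The pattern generation step can be illustrated with a short sketch. It assumes that each sentence is already tokenized and that every recognized argument is a single token mapped to a placeholder label; both are simplifications, and the function names and example sentences are hypothetical rather than taken from Caporaso et al. (2007b).

```python
from collections import Counter


def surface_pattern(tokens, arguments):
    """Build one surface pattern from one sentence.

    `tokens` is a list of words; `arguments` maps token positions to
    placeholder labels such as "AMINOACID" or "LOCATION".
    """
    if not arguments:
        return None
    first, last = min(arguments), max(arguments)
    # Keep the shortest span covering all arguments and replace each
    # argument token by its placeholder.
    return " ".join(arguments.get(i, tokens[i]) for i in range(first, last + 1))


def rank_patterns(examples):
    """Count identical patterns and return them ordered by frequency."""
    counts = Counter(surface_pattern(tokens, args) for tokens, args in examples)
    counts.pop(None, None)  # discard sentences without arguments
    return counts.most_common()


labels = {0: "AMINOACID", 3: "LOCATION", 5: "AMINOACID"}
examples = [
    ("alanine at position 12 by glycine was observed".split(), labels),
    ("lysine at position 7 by serine".split(), labels),
    ("lysine 7 serine".split(), {0: "AMINOACID", 1: "LOCATION", 2: "AMINOACID"}),
]
ranked = rank_patterns(examples)
```

The first two sentences collapse to the same pattern, “AMINOACID at position LOCATION by AMINOACID”, which therefore ranks first. In the strategy described above, such a ranked list would then be inspected manually.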

Machine learning

Several systems for information extraction (NER and RE) make use of statistical classifiers learned on manually labeled corpora. Relationship extraction using machine learning is often cast as a binary classification problem. For each sentence with n entities, all (n choose 2) = n(n-1)/2 possible undirected entity pairs are constructed. The task of the learned classifier is to decide whether a specific entity pair interacts or not. The foundations of machine learning are covered in Section 2.2, and a more detailed comparison of supervised PPI extraction methods follows in Section 2.5.1.
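The construction of classification instances can be sketched as follows. This is a minimal illustration using `itertools.combinations`; the function name, the entity names, and the gold annotation are hypothetical, and the classifier itself is omitted.

```python
from itertools import combinations
from math import comb


def candidate_instances(entities, gold_pairs):
    """Create one labeled instance per undirected entity pair of a sentence.

    `gold_pairs` is a set of frozensets holding the annotated interactions.
    """
    instances = []
    for a, b in combinations(entities, 2):
        label = frozenset((a, b)) in gold_pairs  # True if the pair interacts
        instances.append(((a, b), label))
    return instances


sentence_entities = ["P1", "P2", "P3", "P4"]
gold = {frozenset(("P1", "P3"))}
instances = candidate_instances(sentence_entities, gold)
```

A sentence with four entities yields `comb(4, 2)`, that is six, candidate pairs, of which only one is positive in this toy annotation. The number of pairs grows quadratically with n, which is one source of the class imbalance a learned classifier has to cope with.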

¹ http://sourceforge.net/projects/opendmap/files/supplementalPatterns/supplementalPatterns-1.0/

Summary

Within the NLP domain there is an ongoing discussion between researchers favoring machine learning and those favoring rule-based approaches. Rule-based approaches often achieve excellent precision but suffer in recall. A frequent argument against pattern-based approaches is the time and skill required to build patterns. For instance, adapting an existing rule-based system to a Message Understanding Conference task has been estimated to take approximately 1,500 working hours (Lehnert et al., 1992). A frequently mentioned advantage of machine learning methods is that adaptation to a new domain is fairly simple: it only requires annotations for the new target domain, and after a learning phase the model can be applied to the new domain. However, this is not always the case, as the new domain may have distinctive properties that are not covered by the existing system. In these cases the machine learning system itself has to be modified to cover these properties.

Recent evaluations indicate that supervised machine learning approaches achieve superior performance compared to rule-based systems. For instance, only one fully rule-based system ranked better than the average of all 12 teams in the BioNLP'11 shared task (Kim et al., 2011a). Opposing results have been reported for the BioCreative II.5 shared task for PPI extraction (Leitner et al., 2010). In this competition, the best performing team implemented a rule-based system and achieved an F1 of 42.9 % (Hakenberg et al., 2010).

In comparison, the machine learning system developed by Sætre et al. (2010) achieved an F1 of 37.4 % on the same corpus. Interestingly, both systems also reported F1 results on an independent corpus, where the rule-based system performed approximately 7 percentage points worse than the machine learning based system. This indicates the need for robust relationship extraction methods, as well as for a commonly accepted evaluation strategy that allows quantitative comparisons between different approaches.