

6.4.6 Relationship Extraction using Co-occurrence

GeneView's repository currently contains nine different entity types, protein-protein interactions, and drug-drug interactions. So far, each relationship type is extracted using a specific statistical model learned on annotated data. An alternative approach was presented in Chapter 5, where annotated data is built automatically using distant supervision. One disadvantage of distant supervision is that it requires a database for every relationship type. This is less pronounced in other domains, where many different relationship types are contained in a single knowledge base (e.g., Wikipedia) (Hoffmann et al., 2010). Here, we show the usefulness of a simpler approach (co-occurrence) with the following example: Medical treatment of several cancer types depends on the mutational status of the patient. For instance, people suffering from colorectal cancer receive drug treatment depending on the mutations of several genes (Messersmith and Ahnen, 2008).

Figure 6.9: Comprehensive regulatory network for the mammalian circadian clock after annotation. In the center of the network we represent the main components of the basic feedback loops. In the outer circle of the network we depict clock-regulated genes and proteins which feed back to the core components and thereby influence the oscillations. Full lines, protein-protein interactions; dashed lines, protein-DNA interactions; red lines, inhibition interaction; green lines, activation interaction; black lines, other kinds of interactions.

9 Joint work with A. Relógio

10 Angela Relógio and Alexander Frick

6 GeneView – End-user access to MEDLINE-Scale Text Mining

Using GeneView, we searched for the five mutations most frequently mentioned with colorectal cancer. For each of these mutations, we extracted the five most frequently co-occurring drugs. Results are shown in Figure 6.10. The plot shows several SNPs associated with treatment or progression of colorectal cancer. For instance, Val600Ala, located on BRAF, is a frequently used biomarker to predict response to Cetuximab and Sorafenib treatment. Other drugs associated with Val600Ala, such as Rasagiline and Dacarbazine, are associated with the treatment of other malignant tumors.

Figure 6.10: Co-occurrence graph for mutations associated with “colorectal cancer” and drugs associated with the respective mutations. Edge labels indicate the frequency of sentence-wise co-occurrence between the connected entities.
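The sentence-wise co-occurrence counting behind such a graph can be sketched in a few lines of Python. This is a minimal illustration, not GeneView's implementation: the input format (one list of `(entity_type, normalized_name)` tuples per sentence), the function names, and the toy sentences are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(sentences, type_a, type_b):
    """Count sentence-wise co-occurrences between entities of two types.

    `sentences` is an iterable of annotation lists; each annotation is a
    hypothetical (entity_type, normalized_name) tuple.
    """
    counts = Counter()
    for annotations in sentences:
        # Sets ensure that repeated mentions within one sentence count once.
        a_names = {name for etype, name in annotations if etype == type_a}
        b_names = {name for etype, name in annotations if etype == type_b}
        for pair in product(a_names, b_names):
            counts[pair] += 1
    return counts


def top_partners(counts, entity, k=5):
    """Return the k most frequent partners of `entity` (first pair element)."""
    partners = Counter({b: n for (a, b), n in counts.items() if a == entity})
    return partners.most_common(k)


# Invented toy annotations, for illustration only.
sentences = [
    [("mutation", "Val600Ala"), ("drug", "Cetuximab"), ("disease", "colorectal cancer")],
    [("mutation", "Val600Ala"), ("drug", "Cetuximab")],
    [("mutation", "Val600Ala"), ("drug", "Sorafenib")],
    [("drug", "Sorafenib")],
]
counts = cooccurrence_counts(sentences, "mutation", "drug")
print(top_partners(counts, "Val600Ala"))
```

The resulting counts correspond to the edge labels of a co-occurrence graph; applying the same function to disease and mutation annotations would give the first step of the query described above.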


6.5 Conclusion

This chapter presented GeneView, an entity-centric search engine for the biomedical literature. The system encompasses several state-of-the-art NLP and information extraction tools whose output is stored in an information retrieval engine (Lucene) and in a relational database. We illustrated the usability of GeneView in various biomedical applications, including trend analysis, pathway reconstruction, and pathway augmentation.

Integrating heterogeneous, specialized NLP tools led to several problems, mostly due to changing data format requirements, multiple runtime dependencies, and differing execution environments. In particular, the lack of standards for representing annotated texts gives rise to many different ways of linking annotations to text spans. This creates the need to perform repeated format conversions and to keep multiple copies of the text, along with brute-force mapping tables. Several tools in the pipeline use different formats for the input text and for the positional annotations they return. In the future, this problem might be alleviated by recent efforts to define community standards (Hellmann et al., 2012; Comeau et al., 2013).

6.6 Related Work

Several web-based tools have been developed for the extraction and presentation of semantic knowledge from MEDLINE. Most of these tools have a rather narrow and specific purpose, such as the retrieval of protein-protein interaction (PPI) data (Kim et al., 2011b). Here, we discuss only those tools that are most similar to GeneView and refer to Lu (2011) and Rodriguez-Esteban (2009) for excellent reviews of this field.

iHop (Fernández et al., 2007) enables the interactive navigation of MEDLINE sentences that mention two proteins in conjunction with interaction-specific keywords. Entities other than proteins are not considered. AliBaba (Plake et al., 2006) aggregates extracted knowledge across all results of a PubMed query and visualizes it as a graph. In contrast to AliBaba, GeneView focuses on individual documents.

EbiMed (Rebholz-Schuhmann et al., 2007) retrieves co-occurring entities for a specific query and ranks them by frequency. Like AliBaba and unlike GeneView, it provides aggregated results. Furthermore, GeneView uses a sophisticated machine learning technique to detect relationships instead of co-occurrences. Polysearch (Cheng et al., 2008) and SciMiner (Hur et al., 2009) offer functionality similar to EbiMed, but use different extraction algorithms and significance tests. Facta+ (Tsuruoka et al., 2011) enables the retrieval of indirect associations between biomedical concepts. PLAN2L (Krallinger et al., 2009) can be used to rank sentences by relevance and to visualize relations between entities, focusing on Arabidopsis thaliana. UKPMC (McEntyre et al., 2011) extends the functionality of PubMed Central by using Whatizit (Rebholz-Schuhmann et al., 2008) to recognize and highlight entities in abstracts. The system neither highlights entities in full texts nor provides functionality to search with database identifiers instead of (possibly ambiguous) entity names. Finally, GoPubMed (Doms and Schroeder, 2005) recognizes genes, Gene Ontology terms, and MeSH terms and presents search results using the structure behind these vocabularies. In contrast, GeneView recognizes a broader set of entity types but not Gene Ontology or MeSH terms, provides search facilities using unique database identifiers, and also finds relationships between proteins in texts.

More advanced information extraction techniques are also used by Björne et al. (2010), who extracted nine biomedical event types from a sample of 1% of all PubMed citations. This analysis was later scaled to all of MEDLINE (Björne et al., 2010) and subsequently to PubMed Central full texts (Landeghem et al., 2013). While the system by Björne et al. extracts biomedical events, GeneView currently annotates only entities and binary relationships.

GeneView also differs from many of these tools in terms of its user interface. Most systems present their results in the form of ranked lists of entity pairs or single sentences. In contrast, GeneView presents its annotations in multiple ways. First, the entire article is shown with recognized entities highlighted in different colors. Next, all entities and relationships found are also presented as lists, a feature especially important for quick navigation in full texts. Finally, we provide annotations for all entity classes and relations as structured text files. Note that GeneView, in contrast to many other systems, includes the complete open PMC full-text corpus on top of all MEDLINE abstracts. It also often annotates a broader set of concepts and uses more recent text mining methods. Annotations are provided as downloads to support the development of new applications by freeing developers of data analysis algorithms from the need to deal with a multitude of text mining packages.


7.1 Summary

In this thesis, we presented and evaluated different approaches for biomedical relationship extraction from texts. All proposed methods were evaluated on a set of manually annotated corpora. Methods were assessed in various settings on corpora with particular properties, which allows us to investigate robustness in different scenarios.

Chapter 3 discusses our approach for drug-drug interaction (DDI) extraction as originally proposed for the SemEval 2013 challenge. Our strategy implements a cascaded (coarse-to-fine-grained) classification approach, which we evaluated on two different corpora (DrugBank and MEDLINE). The analysis reveals that training instances from DrugBank considerably improve DDI performance on MEDLINE articles. In contrast, the effect of MEDLINE articles on DrugBank is questionable and, for some classifiers, even misleading. Ensemble methods, which combine the output of different classifiers, were used to improve performance over a set of eight individual classifiers. An important property of ensembles is that they improve robustness by reducing the risk of accidentally selecting an under-performing classifier. Stacked generalization outperforms majority voting by 1.1 pp on the evaluation corpus. More importantly, stacked generalization appears to be unaffected by the addition of less informative classifiers, owing to its better generalization capabilities compared with majority voting. In this intrinsic setting, stacked generalization provides higher robustness than other methods.
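The difference between the two ensemble schemes can be illustrated with a small Python sketch. It is not the setup used in Chapter 3: the perceptron meta-classifier, the function names, and the toy data (one reliable base classifier plus two uninformative ones that always predict the positive class) are assumptions chosen to show why a learned combiner can ignore weak ensemble members while majority voting cannot.

```python
def majority_vote(predictions):
    """Combine 0/1 predictions of the base classifiers by simple majority."""
    return int(2 * sum(predictions) > len(predictions))


def train_stacker(base_outputs, gold, epochs=20, lr=0.1):
    """Train a perceptron meta-classifier on held-out base-classifier outputs.

    base_outputs: one row per instance, one 0/1 column per base classifier.
    gold: the true 0/1 label of each instance.
    """
    weights, bias = [0.0] * len(base_outputs[0]), 0.0
    for _ in range(epochs):
        for row, y in zip(base_outputs, gold):
            score = sum(w * x for w, x in zip(weights, row)) + bias
            error = y - int(score > 0)
            if error:
                weights = [w + lr * error * x for w, x in zip(weights, row)]
                bias += lr * error
    return weights, bias


def stacked_predict(model, row):
    """Apply the trained meta-classifier to one row of base predictions."""
    weights, bias = model
    return int(sum(w * x for w, x in zip(weights, row)) + bias > 0)


# Invented held-out data: classifier 1 is always right, 2 and 3 always say 1.
held_out = [[1, 1, 1], [0, 1, 1], [1, 1, 1], [0, 1, 1]]
gold = [1, 0, 1, 0]
model = train_stacker(held_out, gold)

row = [0, 1, 1]  # the reliable classifier says 0, the weak ones say 1
print(majority_vote(row), stacked_predict(model, row))
```

On this toy input, majority voting follows the two uninformative classifiers, whereas the stacked model learns to rely on the first classifier only.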

Chapter 4 analyzes the impact of self-training to improve robustness for PPI extraction on texts with unknown characteristics. Robustness is an essential prerequisite for large-scale relationship extraction, as training corpora only partially reflect the target domain. Ten-fold cross-validation has the weakness that source and target texts potentially exhibit different characteristics, which cross-validation does not properly reflect. Performance drops considerably when switching from an intrinsic evaluation to the more realistic extrinsic situation. We assess the robustness of a classifier by performing cross-corpus experiments and improve extrinsic performance by self-training. The chapter investigates two self-training strategies, called self-only and self-enriched. In our experiments, both self-training strategies achieve higher robustness than a well-performing baseline. In general, self-only achieves better results than self-enriched.
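The general idea of self-training can be sketched as follows. The summary above does not define the two strategies, so this example assumes that self-enriched retrains on the source corpus plus confidently self-labeled target instances, while self-only retrains on the self-labeled target instances alone. The one-dimensional threshold classifier, the confidence margin, and the data are likewise invented for illustration.

```python
def fit_threshold(points):
    """Toy 1-D classifier: threshold at the midpoint between class means."""
    pos = [x for x, y in points if y == 1]
    neg = [x for x, y in points if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def self_train(source, target, margin, enriched):
    """Retrain on pseudo-labeled target data (assumed strategy definitions)."""
    threshold = fit_threshold(source)
    # Pseudo-label only target instances far enough from the decision boundary.
    pseudo = [(x, int(x > threshold)) for x in target
              if abs(x - threshold) >= margin]
    if {y for _, y in pseudo} != {0, 1}:
        return threshold  # too few confident pseudo-labels: keep the baseline
    training = source + pseudo if enriched else pseudo
    return fit_threshold(training)


# Invented data: the unlabeled target domain is shifted towards larger values.
source = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1)]
target = [1.5, 2.0, 3.0, 6.0, 7.0]

baseline = fit_threshold(source)
self_only = self_train(source, target, margin=1.0, enriched=False)
self_enriched = self_train(source, target, margin=1.0, enriched=True)
print(baseline, self_only, self_enriched)
```

In this toy setting both variants move the decision boundary towards the target distribution, and the self-only variant moves it furthest because the source instances no longer pull it back.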

7 Summary and Outlook

Chapter 5 analyzes the use of distant supervision for PPI extraction. Distant supervision automatically labels texts without manual intervention. In comparison to manual annotation, this strategy allows the training set size to be increased by some orders of magnitude. Corpora generated by distant supervision are inherently noisy and thus benefit from robust relationship extraction approaches. In this chapter, we compare two different approaches for protein-protein interaction extraction. The first approach learns a statistical model (SVM) on subsets of positive and negative instances. The second approach learns graphical dependency patterns from all positively labeled instances.
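The labeling step of distant supervision can be sketched as below. This is a simplified illustration under the common assumption that every sentence mentioning a pair listed in a knowledge base expresses that relationship; the input format, the function name, and the example sentences are hypothetical and do not reproduce the procedure of Chapter 5.

```python
from itertools import combinations


def distant_labels(sentences, known_interactions):
    """Label every protein pair in a sentence using a knowledge base.

    sentences: list of (text, [protein identifiers mentioned in the text]).
    known_interactions: iterable of (protein_a, protein_b) pairs, e.g. taken
    from an interaction database (hypothetical input format).
    """
    known = {frozenset(pair) for pair in known_interactions}
    instances = []
    for text, proteins in sentences:
        for a, b in combinations(sorted(set(proteins)), 2):
            # Distant supervision assumption: a sentence mentioning a known
            # interacting pair is positive, every other pair is negative.
            label = frozenset((a, b)) in known
            instances.append((text, a, b, label))
    return instances


# Invented example sentences and a one-entry knowledge base.
sentences = [
    ("P1 binds P2 in the presence of P3.", ["P1", "P2", "P3"]),
    ("P1 and P2 were both measured.", ["P1", "P2"]),
]
instances = distant_labels(sentences, [("P2", "P1")])
for instance in instances:
    print(instance)
```

The second sentence is labeled positive although it does not describe an interaction, which illustrates the kind of label noise that makes such corpora inherently noisy.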

For the first model, we implement heuristics to remove likely mislabeled instances.

We also analyze the impact of the class ratio in the distantly labeled training set as well as the amount of available training data. F1 remains comparatively robust, with an average standard deviation of 2.6 pp for training class ratios between 0.1 and 10. We show that bagging, an ensemble learning technique, helps to improve classifier robustness by decreasing the risk of selecting a single under-performing classifier.
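Bagging itself can be sketched in a few lines: several classifiers are trained on bootstrap samples of the training data and their predictions are combined by majority vote. The example below uses a toy one-dimensional threshold classifier in place of the SVM, and the data, function names, and number of models are assumptions made for illustration.

```python
import random


def fit_threshold(points):
    """Toy 1-D classifier: threshold at the midpoint between class means."""
    pos = [x for x, y in points if y == 1]
    neg = [x for x, y in points if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def bagging(points, n_models=11, seed=0):
    """Train one threshold classifier per bootstrap sample of `points`."""
    rng = random.Random(seed)
    thresholds = []
    while len(thresholds) < n_models:
        # Bootstrap sample: draw with replacement, same size as the input.
        sample = [rng.choice(points) for _ in points]
        if {y for _, y in sample} == {0, 1}:  # both classes are required
            thresholds.append(fit_threshold(sample))
    return thresholds


def bagged_predict(thresholds, x):
    """Majority vote over the individual threshold classifiers."""
    votes = sum(x > t for t in thresholds)
    return int(2 * votes > len(thresholds))


# Invented, clearly separable training data.
points = [(0, 0), (1, 0), (2, 0), (8, 1), (9, 1), (10, 1)]
ensemble = bagging(points)
print(bagged_predict(ensemble, 1), bagged_predict(ensemble, 9))
```

Because the final decision depends on the vote of all models, a single poorly placed threshold has little influence on the prediction.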

For the second approach, we define a set of pattern refinement strategies using generalizations and constraints. This strategy reduces noise and therefore improves the robustness of learned patterns. We subsequently analyze different properties of patterns (e.g., pattern length, number of available patterns) on five evaluation corpora. Finally, we show that approximate graph matching allows us to shift the emphasis towards precision or recall, depending on our needs.

Chapter 6 discusses the details of building the semantic search engine GeneView. It covers the architecture of GeneView and the difficulties observed during implementation. A specific problem of large-scale text mining is that some errors are only observed on a small subset of articles, which makes detecting them very hard. We applied a cascade of state-of-the-art natural language processing tools to articles contained in MEDLINE and the PMC open access subset. We sketched several use cases utilizing data contained in GeneView. For instance, data contained in GeneView has been used to expand the circadian network (Relógio et al., 2014). We also applied a similar workflow to extract regulatory relations between human transcription factors from all MEDLINE citations (Thomas et al., 2014a). This procedure decreases curation time by approximately one order of magnitude in comparison to a baseline working on co-occurrence.