Figure 6.10: Annotation overlap of distinct entity names in % for different entity types and dictionary-based annotation.

6.5 Summary and open questions

In this chapter, we reported our experiences with a study in crawling and deeply analyzing a large domain-specific corpus from the web, exemplified for the field of biomedical research. From a domain-knowledge point of view, our results indicate that there is indeed a large body of biomedical knowledge on the web that is not present in the scientific literature. Clearly, much more research is necessary to substantiate this hypothesis and to assess the usefulness of this knowledge, which could be, for instance, reports of high quality that were not (yet) published or important textbook knowledge that is so established that one cannot find scientific publications discussing it. However, a large fraction presumably also consists of false positive matches of the taggers or of information of dubious quality and reliability.

Our study also brought up a number of open questions and technical pitfalls of focused crawling and large-scale IE on crawled web documents, which we now briefly summarize:

Reliable MIME-type detection

Large files downloaded during a crawl are often not textual but embedded presentation slides or formatted documents that were wrongly classified as plain text. Filtering by document size only, as we did, is not rewarding, since it easily misses relevant (and extensive) content such as that posted in blogs, on personal websites, or on arXiv.org. However, we are not aware of any robust tools or ongoing research for reliable MIME-type detection;

instead, MIME-type detection is usually carried out by regular expression matching on the file name extension or by analyzing the first n bytes of a document. We used the Apache Tika library (http://tika.apache.org, last accessed 2016-10-05) during crawling, which ships with a list of only a handful of common MIME types. Although this list can be extended with custom types, such manual extension is hardly feasible for web-scale crawling due to the heterogeneity of data available online.
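For illustration, the following is a minimal sketch of content-based MIME-type detection with Apache Tika, assuming the raw bytes of a downloaded document and its file name are available; the surrounding class and the whitelist of textual types are our own assumptions and not part of the crawler described above.

import org.apache.tika.Tika;

public class MimeTypeFilter {

    private static final Tika TIKA = new Tika();

    /**
     * Returns true if the downloaded content looks textual. Tika combines
     * magic-byte analysis of the content prefix with the file name extension,
     * which is more reliable than checking the extension or file size alone.
     */
    static boolean isTextual(byte[] content, String fileName) {
        String mimeType = TIKA.detect(content, fileName);
        return mimeType.startsWith("text/")
                || mimeType.equals("application/xhtml+xml");
    }
}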

Robust HTML boilerplate detection

According to Ofuonye et al. [2010], 95% of HTML documents on the web do not adhere to W3C HTML standards; most errors can be attributed to missing characters in the markup (e.g., "<" or ">") and to missing information on character set and document type. 13% of the analyzed websites had issues so severe that they could not be transcoded. However, correctly formatted HTML pages seem to be a prerequisite for most boilerplate detection algorithms. In the course of this study, we evaluated different boilerplate detection algorithms and found them to perform reasonably well on a gold-standard set of 1,906 pages (cf. Appendix 4 for details). Applying these tools to our crawled documents, however, revealed that they are highly sensitive to markup errors, often resulting in crashes or empty results. As a work-around, we integrated a markup repair operator into the analysis process before applying boilerplate detection, which ensured that 94% of the crawled documents survived the markup removal step.

Nevertheless, we believe that developing boilerplate detection algorithms that are more robust against errors in real-life web pages is essential for seamless and comprehensive text analytics from web documents.
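A minimal sketch of such a repair-then-extract step is shown below, assuming jsoup for markup repair and boilerpipe as the boilerplate detector; these are stand-ins chosen for illustration and not necessarily the tools evaluated in this study.

import org.jsoup.Jsoup;

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerplateRemoval {

    /**
     * Repairs the markup of a crawled page and then extracts its main content.
     * Returns null if the boilerplate detector still fails on the repaired page.
     */
    static String extractMainContent(String rawHtml) {
        // jsoup tolerates malformed HTML and emits well-formed markup, acting
        // as the markup repair operator in front of boilerplate detection.
        String repairedHtml = Jsoup.parse(rawHtml).outerHtml();
        try {
            return ArticleExtractor.INSTANCE.getText(repairedHtml);
        } catch (BoilerpipeProcessingException e) {
            return null;
        }
    }
}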

NLP and IE models for web documents

It is well known that ML tools work best on data sets that exhibit language characteristics similar to those of the data used for training. Most research into NER tools in biomedicine is performed on Medline abstracts, both with respect to training and evaluation data. On such data, ML-based NER is clearly superior to other approaches, as shown in many recent studies and international competitions [Segura-Bedmar et al., 2014].


Accordingly, all ML-based methods used in this project employ models trained on Medline abstracts, since no other training data is available. However, our study reveals that web documents and documents from Medline and PMC are significantly different in several aspects. This leads to an enormous amount of false positive matches by these tools, which are often short abbreviations. We believe that there is a great need for more sophisticated models for domain-specific entity recognition from web documents.
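To make this concrete, a simple post-filter on tagger output could look as follows. This is only a hypothetical sketch addressing the short-abbreviation false positives mentioned above; the class, its parameters, and the whitelist are our own assumptions, not part of the pipeline used in this study.

import java.util.Set;

public class MentionFilter {

    private final Set<String> trustedAbbreviations;
    private final int minLength;

    MentionFilter(Set<String> trustedAbbreviations, int minLength) {
        this.trustedAbbreviations = trustedAbbreviations;
        this.minLength = minLength;
    }

    /** Keeps a tagged mention unless it is a short, unknown abbreviation. */
    boolean keep(String mention) {
        if (mention.length() >= minLength) {
            return true;
        }
        // Short matches are kept only if they are known, unambiguous abbreviations.
        return trustedAbbreviations.contains(mention);
    }
}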

Note that current research in this direction typically targets rather simple entity types, such as persons, places, or products [Etzioni et al., 2005]. To our knowledge, the performance of such methods on the more difficult biomedical entity types has not yet been evaluated.

Trade-off between precision and yield in focused crawling

When setting up our system, we opted for a high-precision text classifier, as we believed that the number of true positives could be increased more easily with longer crawls than with a high-recall classifier, which might also retrieve many false positive pages. However, we observed that this strategy was not as effective as we had expected: the size of the crawl we obtained was bounded by the fact that our crawl frontier eventually emptied. As described in Section 6.1, we already had to extend our seed list significantly to obtain a crawl of at least the size we have now.

Several strategies could be followed to create a larger focused crawl. For instance, one could produce even larger seed lists, but this requires substantial preparation time given the current limits of the search engine APIs. Another approach would be to also follow links from pages classified as irrelevant, but only by a small margin. Finally, one could tune the classifier towards higher recall during crawling and classify each crawled text a second time later, with a model geared towards high precision. Which of these strategies is the most promising remains an open question.
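As a hypothetical illustration of the last strategy, the following sketch separates a lenient, recall-oriented threshold used while crawling from a stricter, precision-oriented threshold applied afterwards; the class and the example thresholds are assumptions for illustration, not part of the system described here.

public class TwoStageRelevanceFilter {

    private final double crawlThreshold;   // lenient, recall-oriented, e.g. 0.3
    private final double corpusThreshold;  // strict, precision-oriented, e.g. 0.8

    TwoStageRelevanceFilter(double crawlThreshold, double corpusThreshold) {
        this.crawlThreshold = crawlThreshold;
        this.corpusThreshold = corpusThreshold;
    }

    /** Should outgoing links of this page be added to the crawl frontier? */
    boolean followLinks(double relevanceScore) {
        return relevanceScore >= crawlThreshold;
    }

    /** Should this page be kept for the downstream IE pipeline? */
    boolean keepForAnalysis(double relevanceScore) {
        return relevanceScore >= corpusThreshold;
    }
}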

Crawling and text analytics as a consolidated process

This project pursued a two-stage approach, in which crawling and text analytics were performed in two separate phases using very different infrastructures. However, the results of the IE pipeline could actually be valuable input for the classifier during a crawl, as occurrences of gene or disease names are strong indicators of biomedical content. We believe it would be a worthwhile undertaking to research systems that allow specifying crawling strategies, classification, and domain-specific IE in a single system. Such a system would not only greatly reduce the time needed to build web-scale domain-specific text analysis systems, but would also have the potential to greatly improve crawl quality, since results obtained during entity extraction could be used for better document classification and thus further improve the focus of a topical crawler.
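For illustration, such a feedback loop could be as simple as letting entity mentions found by a fast tagger boost the crawl-time relevance score. The following sketch is purely hypothetical; the interfaces, names, and weighting scheme are our own assumptions and not part of Stratosphere or the crawler used in this study.

import java.util.List;

public class EntityAwareRelevance {

    /** Minimal tagger abstraction; any fast dictionary-based tagger would fit. */
    interface EntityTagger {
        List<String> tag(String text);
    }

    private final EntityTagger tagger;
    private final double weightPerMention;

    EntityAwareRelevance(EntityTagger tagger, double weightPerMention) {
        this.tagger = tagger;
        this.weightPerMention = weightPerMention;
    }

    /** Combines the text classifier's score with an entity-based bonus. */
    double score(double classifierScore, String documentText) {
        int mentions = tagger.tag(documentText).size();
        return Math.min(1.0, classifierScore + weightPerMention * mentions);
    }
}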

7 Summary and outlook

7.1 Summary

In this thesis, we presented and evaluated a query-based IE system, which enables scalable and declarative information extraction on the parallel data analytics system Stratosphere. Our system is configurable towards concrete application domains and scalable to large-scale text processing. It enables end-users to formulate complex IE tasks as queries in the structured and declarative language Meteor, which are compiled into logical Sopremo data flows. These data flows are logically optimized with SOFA, translated into parallel data flow programs, and executed on parallel compute infrastructures.

Chapter 2 introduced fundamental terminology, a summary of typical IE tasks, and a discussion of existing approaches and systems for large-scale IE that are fundamental for the remainder of this thesis. We also introduced the parallel data analytics system Stratosphere, its layered architecture, and the query and data flow compilation process in this system, focussing on the Meteor query language and the algebraic layer Sopremo.

Chapter 3 introduced domain-independent, algebraic operators addressing all fundamental tasks in information extraction (IE) and web analytics (WA). We showed how end-users can properly combine IE and WA operators in a declarative way to create complex data flows in Meteor for domain-specific applications using a variety of concrete operator instantiations. Furthermore, we showed how elementary operators can be composed into complex operators to ease the definition of complex analytical tasks.

Finally, we discussed differences between concrete operator instantiations regarding physical and algebraic properties and pinpointed differences in runtime and startup behaviour to highlight both the potential and importance of optimizing the execution order of data flows with UDFs.

Chapter 4 surveyed practical techniques and the state of the art in optimizing data flows with UDFs. We discussed techniques for syntactical data flow modification, approaches for inferring semantics and rewrite options for UDFs, and methods for data flow transformations both on the logical and on the physical level. The chapter concluded with an overview of declarative data flow languages for parallel data analytics systems from the perspective of their built-in optimization techniques. We found that some of the discussed techniques are available in running systems, although comprehensive optimization of UDFs and non-relational operators still is a true challenge for many systems.

Chapter 5 introduced SOFA, a novel approach for extensible and semantics-aware optimization of data flows with UDFs. SOFA builds upon a concise set of properties for describing the UDF's semantics and it combines automated analysis of UDFs with manual annotations to enable comprehensive data flow optimization. A unique characteristic of SOFA is extensibility: operators and their properties are arranged into taxonomies, which considerably eases the integration and optimization of new operators. We evaluated SOFA on a diverse set of UDF-heavy data flows and compared its performance to three other approaches for data flow optimization. Our experiments revealed that SOFA is able to reorder acyclic data flows from different application domains, leading to considerable runtime improvements. We also showed that SOFA finds plans that outperform plans found by other techniques. Furthermore, we described how SOFA is integrated into the Stratosphere system to enable the end-to-end development, optimization, and execution of data flows that contain UDFs.

Chapter 6 presented a case study, which investigated the real-life applicability of our operator design, extensions to the Meteor query language, and optimization approach in a challenging setting to compare the "web view" on health-related topics with that derived from a controlled scientific corpus. We combined a focused crawler, which applies shallow text analysis and classification to maintain focus, with our text analytics system built inside Stratosphere. All text and web analytics was carried out using a small set of declarative data flows, and we systematically evaluated scalability, quality, and robustness of the employed methods and tools. Finally, we summarized lessons learnt during this project and pinpointed the most critical challenges in building an end-to-end IE system for large-scale analytics.