
6.4 Content analysis

6.4.1 Linguistic structure

Analyzing the linguistic structure of texts is important for assessing their complexity and for judging whether existing IE tools, which were trained and developed on different corpora, will perform well on web documents. In particular, we examine

• document and sentence lengths,

• incidence of negation,

• incidence of passives,

• incidence and types of pronouns, and

• incidence and types of parentheses.

Differences in the obtained measures were statistically assessed using the Mann-Whitney-Wilcoxon rank-sum test. This test produces a P-value, which estimates the probability of observing differences at least as large as those measured if both samples were drawn from the same distribution.
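For illustration, the following sketch shows how such a pairwise comparison could be carried out with SciPy's mannwhitneyu, which implements the Mann-Whitney-Wilcoxon rank-sum test; the corpus names, example values, and the 0.01 threshold are placeholders rather than the actual measurements.

# Sketch: comparing one linguistic measure (e.g., mean sentence length per
# document) between two corpora with the Mann-Whitney-Wilcoxon rank-sum test.
# The input lists are placeholders; in practice they hold one value per document.
from scipy.stats import mannwhitneyu

relevant_doc_measure = [112.4, 98.7, 130.2, 105.9]   # measure per relevant web document
pmc_doc_measure = [141.0, 155.3, 128.8, 149.6]       # same measure per PMC document

statistic, p_value = mannwhitneyu(relevant_doc_measure, pmc_doc_measure,
                                  alternative="two-sided")
print(f"U = {statistic}, P = {p_value:.4f}")
if p_value < 0.01:
    print("difference is significant at P < 0.01")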

Document and sentence lengths

Sentence lengths impact IE and NLP in different ways. First, the execution time of IE and NLP tools usually directly depends on the lengths of the sentences to be analyzed.

For example, the runtime complexity of automaton-based algorithms performing Named Entity Recognition (NER) using a fixed dictionary of search terms is O(|search terms| + |sentence length|) [Aho and Corasick, 1975], whereas the time complexity of modern NER methods based on conditional random fields is quadratic in the length of the sentence [Viterbi, 2006]. Second, the difficulty of constituent and dependency sentence parsing, as well as that of modern relation extraction methods, rises with sentence length [Tikk et al., 2013]. Thus, if crawled web documents contain shorter sentences than Medline or PMC, we expect the former to be easier to analyze. Other important measures are the document lengths in the different corpora, as these must be considered when comparing the frequency of entity mentions. Sentence length was determined in characters for each sentence in the different data sets.
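A minimal sketch of how these two measures could be computed per document is shown below; it assumes sentences have already been split by an upstream sentence splitter (the naive split on ". " is only a stand-in), and all names are illustrative.

# Sketch: document length and per-document mean sentence length, both in characters.
# Proper sentence splitting is assumed to happen upstream; splitting on ". " is
# only a placeholder for such a splitter.
def document_length(document: str) -> int:
    return len(document)

def mean_sentence_length(document: str) -> float:
    sentences = [s for s in document.split(". ") if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s) for s in sentences) / len(sentences)

doc = "Protein A binds protein B. This interaction was not observed in vivo."
print(document_length(doc), mean_sentence_length(doc))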

Figure 6.8(a) displays the distribution of document lengths and Figure 6.8(b) displays mean sentence lengths across the different data sets. Mean document lengths in relevant documents were significantly shorter than in PMC (P<0.01), but significantly longer compared to irrelevant documents (P<0.01) and to Medline abstracts (P<0.01). Document lengths for the relevant corpus show the largest variance, which increases the need for appropriate load balancing in a distributed setting. Differences in sentence lengths between the four corpora are also significant, confirming previous findings from [Cohen et al., 2010] regarding Medline abstracts and PMC full texts.

These differences have to be kept in mind when selecting tools for IE that are based on gold standard data. Most tools we are aware of were trained and evaluated on Medline abstracts and thus on rather short sentences; accordingly, we expected a lower performance of these tools on longer sentences than reported in the literature.

Incidence of negation

Detecting negation is important in many areas of natural language processing (e.g., sentiment analysis, relation extraction) and is particularly important for analyzing biomedical texts [Agarwal and Yu, 2010]. Here, we used a rather simple method for determining negations in sentences, using a set of regular expressions to find mentions of the words not, nor, and neither. As shown in Figure 6.8(c), the incidence of negation in the four corpora is significantly different (P<0.01), both regarding the overall incidence of negation and regarding the relative frequency of negation with respect to document length. Specifically, texts in the set of relevant documents have a lower incidence of negation than PMC texts and irrelevant pages, and a higher incidence of negation than Medline abstracts. Accordingly, appropriate treatment of negation will be more important for web data than for scientific articles.
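The exact regular expressions are not reproduced here; the following sketch shows one plausible formulation that counts the three cue words with word-boundary matching and normalizes by document length (the per-1,000-characters normalization is an assumption for illustration only).

import re

# Sketch: counting negation cues per document with a single regular expression.
# The word list (not, nor, neither) follows the description above; the exact
# patterns used in the analysis may differ.
NEGATION_RE = re.compile(r"\b(not|nor|neither)\b", re.IGNORECASE)

def negation_incidence(document: str) -> float:
    """Negation mentions per 1,000 characters (relative to document length)."""
    if not document:
        return 0.0
    return 1000 * len(NEGATION_RE.findall(document)) / len(document)

print(negation_incidence("This effect was not observed, neither in vitro nor in vivo."))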

Incidence of active and passive voice

Active and passive voice are two different ways of formulating an English sentence that use different verb forms. Although sentences formulated in active or passive voice have the same semantic meaning, word order is often changed in passive sentences, and different verb forms are used that pose challenges to syntactic parsers.

Specifically, passive verb phrases are often mislabelled as active verb phrases when the auxiliary verb is missing [Igo and Riloff, 2008]. We extracted passive voice from each set of documents using regular expressions searching for the string "ed by". Note that this underestimates the incidence of passive voice (e.g., when agents are missing), but since we applied this method to each data set the comparison is still valid.
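A minimal sketch of this surface-pattern approach is given below; the pattern and function name are illustrative, and as noted above the resulting count is a deliberate underestimate.

import re

# Sketch: flagging passive constructions via the surface pattern "ed by",
# as described above. This deliberately misses agentless passives
# ("the gene was deleted") and irregular participles ("written by"),
# so counts are an underestimate, but the same bias applies to every corpus.
PASSIVE_RE = re.compile(r"\w+ed by\b", re.IGNORECASE)

def passive_count(document: str) -> int:
    return len(PASSIVE_RE.findall(document))

print(passive_count("The protein was phosphorylated by kinase X and degraded by the proteasome."))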

As shown in Figure 6.8(d), the incidence of passives differs significantly across the data sets (P<0.01). The highest mean incidence of passives was found in PMC and the lowest in Medline abstracts. The incidence of passives in both relevant and irrelevant documents was comparatively small, indicating that parser errors due to faulty recognition of passive voice might occur less often than in analyses using the PMC data set.


Figure 6.8: Distribution and incidence of linguistic properties per document in the different data sets: (a) distribution of document length, (b) distribution of mean sentence length, (c) incidence of negation, (d) incidence of passive voice relative to active voice, (e) incidence of pronouns, (f) incidence of parentheses.

Incidence of pronouns

Pronominal anaphora are important in biomedical IE for performing co-reference resolution [Gasperin and Briscoe, 2008]. To measure the amount of such co-references in our corpora, we counted six different classes of pronouns in each data set. Interestingly, the incidence of demonstrative, relative, and object pronouns, which are the most important pronoun classes for co-reference resolution, was significantly lower in both relevant and irrelevant texts compared to texts from PMC (cf. Figure 6.8(e); a distinction between pronoun classes can be found in Appendix 4). We expected this observation for irrelevant texts, since these texts are significantly shorter than texts from PMC. For relevant texts the observation is surprising, as these are significantly longer than documents from PMC. This finding might indicate that co-reference resolution on crawled texts is not as vital as in analyzing biomedical full-text literature.
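For illustration, pronoun counting per class could be sketched as follows; only three of the six classes are shown and the word lists are abridged, so this is not the inventory actually used (see Appendix 4 for the full distinction between classes).

import re

# Sketch: counting pronoun mentions per class. The class lists below are
# abridged and illustrative only.
PRONOUN_CLASSES = {
    "demonstrative": ["this", "that", "these", "those"],
    "relative": ["which", "who", "whom", "whose", "that"],
    "object": ["it", "them", "him", "her"],
}

def pronoun_counts(document: str) -> dict:
    counts = {}
    for cls, words in PRONOUN_CLASSES.items():
        pattern = re.compile(r"\b(" + "|".join(words) + r")\b", re.IGNORECASE)
        counts[cls] = len(pattern.findall(document))
    return counts

print(pronoun_counts("These proteins, which we studied earlier, bind it directly."))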

Incidence of parentheses

Parentheses can hint at abbreviations, paper references, synonyms of named entities, etc., which are very important during NLP processing. Properly treating parentheses is also highly important for syntactic parsing, as text in parentheses typically does not conform to the sentence grammar. We extracted parenthesized text using a set of regular expressions and found that their incidences differ significantly (P<0.01) between all data sets (cf. Figure 6.8(f)). We observed the highest incidence in texts from PMC, followed by relevant web documents and Medline, and the lowest in irrelevant documents.
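A minimal sketch of extracting parenthesized spans with a regular expression is shown below; the pattern is illustrative and does not handle nested parentheses.

import re

# Sketch: extracting parenthesized spans, e.g. abbreviations such as
# "tumor necrosis factor (TNF)". Nested parentheses are not handled here.
PAREN_RE = re.compile(r"\(([^()]*)\)")

def parenthesized_spans(document: str):
    return PAREN_RE.findall(document)

text = "Tumor necrosis factor (TNF) activates nuclear factor kappa B (NF-kB) [1]."
print(parenthesized_spans(text))   # ['TNF', 'NF-kB']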