Operators for web analytics - Scalable and Declarative Information Extraction in a Parallel Data Analytics System

In this section, we introduce operators for processing and analyzing HTML pages with Stratosphere⁶. The set of web analytics (WA) operators comprises operators for preprocessing HTML documents to enable downstream IE on such documents, i.e., operators for HTML markup detection, repair, and removal, as well as operators for detecting structured information in HTML pages (link URLs, tables, lists). All operators described below process individual JSON records such as the exemplary record shown in Listing 3.17. Analogous to text processing with IE operators, we require the unstructured HTML documents to be stored in the attribute "text".

3.3.1 Text preprocessing

The set of WA operators for text preprocessing consists of three elementary operators and one complex operator. The elementary operators perform HTML boilerplate detection (dtct-bp), markup repair (rpr-mrk), and markup removal (repl-mrk), while the complex operator rm-mrk performs markup repair, boilerplate detection, and removal in a single step. In total, the set of operators for preprocessing HTML pages comprises 14 operator instantiations, which are described below in more detail.

⁶ The set of web analytics operators was implemented by Jörg Meier and Anja Kunkel under close supervision based on the requirements and specifications provided by Astrid Rheinländer.

Listing 3.17: Exemplary JSON record for HTML pages.

    { "id": "http://www.test.com/test.html",
      "text": "<!DOCTYPE html><html><body>This is a simple HTML page with an absolute URL
      <a href=\"http://www.w3schools.com\">W3Schools</a> and a relative URL: <a
      href=\"tags/tag_a.asp\">The a tag</a>. The page contains an example of a
      table: <br>
      <table border=\"1\"><thead><tr><th>Month</th>
      <th colspan=\"2\">Savings</th></tr></thead><tr><td>January</td><td>100</td>
      <td rowspan=\"2\">$</td></tr><tr><td>March</td><td>3400</td></tr></table>
      <br><br>
      Examples of lists are also available:
      <ul><li>Coffee</li>

Listing 3.18 lists all Meteor statements for preprocessing HTML web pages. The third line of the script shows the rpr-mrk operator for repairing erroneous markup tags in the input. This step is important for many real-life text analytics tasks, since many websites contain markup errors, which pose severe challenges to boilerplate detection operators [Ofuonye et al., 2010]. Internally, rpr-mrk transforms the text attribute of the input record by reordering tags, deleting redundant HTML tags, or inserting missing tags to produce valid XML code using the HTMLCleaner algorithm⁷. The operator is implemented in a Map second-order function and produces a single output record for each processed input record.
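The one-record-in, one-record-out behaviour of a markup repair step can be illustrated with a minimal Python sketch. This is not the implementation described in the thesis and it does not use HTMLCleaner: the function `repair_markup`, the class `_TagBalancer`, and the simplistic strategy (drop closing tags that have no matching opening tag, insert closing tags for elements left open) are assumptions made purely for illustration. Reordering of tags is not covered.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (illustrative subset).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class _TagBalancer(HTMLParser):
    """Re-emits markup, dropping stray closing tags and closing open ones."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []    # repaired output fragments
        self.stack = []  # currently open (non-void) tags

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied and never left open.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # redundant closing tag: delete it
        # Insert missing closing tags for everything opened after `tag`.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def close(self):
        super().close()
        while self.stack:  # close whatever is still open at end of input
            self.out.append(f"</{self.stack.pop()}>")


def repair_markup(record):
    """Map-style operator: returns exactly one output record per input record."""
    balancer = _TagBalancer()
    balancer.feed(record["text"])
    balancer.close()
    return {**record, "text": "".join(balancer.out)}
```

Because the function only reads and rewrites the `text` attribute of a single record, it can be applied to every record independently, which is the property that makes such a step a natural fit for a Map-style function.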

The boilerplate detection operator dtct-bp aims at detecting the net content of web pages; the corresponding Meteor statement is shown in Line 4 of Listing 3.18. We implemented six different operator instantiations for dtct-bp, i.e., three algorithms provided by the Boilerpipe⁸ library [Kohlschütter et al., 2010], two algorithms provided by Readability⁹, and one algorithm called cuter developed by Krause [2012]¹⁰. Each algorithm can be selected in Meteor by configuring the dtct-bp operator with the use algorithm property. The algorithms 'blp', 'blp_largest', and 'blp_news' are taken from the Boilerpipe library. The algorithm blp is the default for dtct-bp due to its robustness and performance on arbitrary HTML pages. The algorithm blp_largest detects the largest content block contained in a web page, while blp_news is specifically designed for processing news pages. The algorithms 'snack' and 'rdb' both identify non-content elements reliably, but might also classify parts of the actual content as non-content elements. In contrast, 'cuter' detects all content elements reliably, but might also classify many non-content elements as actual content. After content has been detected, the dtct-bp operator creates a new attribute detected_content as shown in Listing 3.19.
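To make the shape of the resulting record concrete, the following Python sketch adds a `detected_content` entry in the form shown in Listing 3.19. The detector `longest_block`, the registry `ALGORITHMS`, and the function `detect_content` are hypothetical stand-ins introduced here. None of them corresponds to Boilerpipe, Readability, or cuter, and the heuristic (keep the longest run of text between tags) is far cruder than any of the six algorithms named above. Appending to an existing list is likewise an assumption.

```python
import re
from html.parser import HTMLParser


class _TextBlocks(HTMLParser):
    """Collects the text found between tags as separate blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []

    def handle_data(self, data):
        text = re.sub(r"\s+", " ", data).strip()
        if text:
            self.blocks.append(text)


def longest_block(html):
    """Hypothetical stand-in detector: keep only the longest text block."""
    parser = _TextBlocks()
    parser.feed(html)
    parser.close()
    return max(parser.blocks, key=len, default="")


# Hypothetical registry keyed by the value of the "use algorithm" property.
ALGORITHMS = {"longest_block": longest_block}


def detect_content(record, algorithm="longest_block"):
    """Adds a detected_content entry and leaves the text attribute untouched."""
    content = ALGORITHMS[algorithm](record["text"])
    detected = list(record.get("detected_content", []))
    detected.append({"algorithm": algorithm, "content": content})
    return {**record, "detected_content": detected}
```

Keeping the original `text` attribute next to the detected content means that a later step can still decide whether to overwrite the HTML or to work on both versions.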

⁷ HTMLCleaner, http://htmlcleaner.sourceforge.net/, last accessed: 2016-09-26.

⁸ Boilerplate Removal and Fulltext Extraction from HTML pages, http://github.com/kohlschutter/boilerpipe, last accessed: 2016-09-26.

⁹ Readability web parser API, http://www.readability.com, last accessed: 2016-09-26.

¹⁰ Before the operators were implemented for Stratosphere, Jörg Meier conducted a pre-study under close supervision by Astrid Rheinländer. This study compared the extraction quality and runtime properties of all implemented boilerplate detection algorithms and is available upon request. Exemplary results regarding extraction quality are contained in Appendix 4 of this thesis.

The third operator for text preprocessing is repl-bp, which replaces the "text" attribute of the input record with the content detected by dtct-bp. At the same time, the attribute detected_content created by dtct-bp is deleted. The corresponding Meteor statement for this operator is shown in Line 5 of Listing 3.18 and exemplary output is shown in Listing 3.20.
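The record transformation described for repl-bp can be sketched in a few lines of Python. The function name `replace_content` is hypothetical, and two behaviours are assumptions not stated in the text: if several entries are present, the first one is used, and a record without `detected_content` is passed through unchanged.

```python
def replace_content(record):
    """Overwrites text with the detected content and drops detected_content."""
    detected = record.get("detected_content", [])
    if not detected:
        # Assumption: nothing was detected, so the record is passed through.
        return dict(record)
    # Copy every attribute except detected_content, then overwrite text.
    result = {k: v for k, v in record.items() if k != "detected_content"}
    result["text"] = detected[0]["content"]  # assumption: use the first entry
    return result
```

Applied to a record shaped like Listing 3.19, the function yields a record shaped like Listing 3.20, with only the `id` and `text` attributes remaining.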

Listing 3.18: Elementary web analytics operators.

    1 using wa;
    2 ...
    3 $content = repair markup $htmlcode;
    4 $content = detect content $content use algorithm 'blp';
    5 $content = replace content $content;
    6 ...

Listing 3.19: Output of the dtct-bp operator for the HTML page shown in Listing 3.17.

    { "id": "http://www.test.com/test.html",
      "text": "<!DOCTYPE html><html><body>This is a simple HTML page with an ...",
      "detected_content": [{"algorithm":"blp","content":"This is a simple HTML page with an absolute URL..."}]
    }

Listing 3.20: Output of the repl-bp operator for the HTML page shown in Listing 3.17.

    { "id": "http://www.test.com/test.html",
      "text": "This is a simple HTML page with an absolute URL..."
    }

Listing 3.21: HTML document preprocessing with the complex rm-bp operator.

    using wa;
    ...
    $content = remove markup $htmlcode use algorithm 'snack';
    ...

One complex operator, rm-bp, is available for preprocessing HTML documents; it performs HTML markup repair, boilerplate detection, and boilerplate removal in a single step¹¹. The corresponding Meteor statement is shown in Listing 3.21. Internally, this operator consists of three operators, i.e., rpr-mrk is followed by dtct-bp and repl-bp. Analogous to dtct-bp, the complex operator rm-bp can be configured to use one of the six boilerplate detection methods available for dtct-bp.
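The way a complex operator chains three elementary ones, and forwards the algorithm choice to the middle step only, can be sketched in Python as plain function composition. All names here (`compose`, `remove_markup`, `make_detector`, and the `'strip_tags'` algorithm) are hypothetical, and the three steps are deliberately trivial placeholders rather than the operators of the thesis.

```python
import re
from functools import reduce


def compose(*operators):
    """Chains record-to-record operators from left to right."""
    return lambda record: reduce(lambda rec, op: op(rec), operators, record)


def repair_markup(record):
    """Placeholder for markup repair: passes the record through unchanged."""
    return dict(record)


def make_detector(algorithm):
    """Builds a placeholder detection step configured with an algorithm name."""
    def detect(record):
        # Stand-in logic: strip all tags and collapse whitespace.
        text = " ".join(re.sub(r"<[^>]+>", " ", record["text"]).split())
        entry = {"algorithm": algorithm, "content": text}
        return {**record, "detected_content": [entry]}
    return detect


def replace_content(record):
    """Overwrites text with the detected content and drops detected_content."""
    result = {k: v for k, v in record.items() if k != "detected_content"}
    result["text"] = record["detected_content"][0]["content"]
    return result


def remove_markup(algorithm="strip_tags"):
    """Complex operator: repair, then detect, then replace."""
    return compose(repair_markup, make_detector(algorithm), replace_content)
```

A call such as `remove_markup("strip_tags")(record)` then behaves like a single operator from the caller's point of view, while the order of the three internal steps stays fixed.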

3.3.2 Structure detection

For structure detection, one elementary operator dtct-struct with three concrete instantiations is available: dtct-struct-link for detecting URL links, dtct-struct-tbl for detecting tables, and dtct-struct-lst for detecting lists in the input records. Note that these operators operate on the original HTML input, i.e., in complex data flows, boilerplate detection needs to be performed after link, table, and list detection.

The operator dtct-struct-link retrieves and extracts all links inside the <body> scope of an HTML page based on the HTML parser provided by Jsoup¹².

¹¹ Note that markup repair is included in this algorithm for practical reasons, since most web pages available online contain errors, which may cause severe issues in the downstream boilerplate detection phase [Ofuonye et al., 2010].

¹² jsoup: Java HTML Parser, http://jsoup.org/, last accessed: 2016-09-26.
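Restricting link extraction to the <body> scope, as described for dtct-struct-link, can be illustrated with Python's standard HTML parser. This sketch does not use jsoup, and the function name `detect_links` as well as the output attribute `links` are assumptions, since the operator's actual output format is not shown in this section. Link targets are returned exactly as written in the page, without resolving relative URLs.

```python
from html.parser import HTMLParser


class _BodyLinks(HTMLParser):
    """Collects href values of <a> tags that occur inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "a" and self.in_body:
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False


def detect_links(record):
    """Adds a hypothetical 'links' attribute; the HTML text is left unchanged."""
    parser = _BodyLinks()
    parser.feed(record["text"])
    parser.close()
    return {**record, "links": parser.links}
```

For the page in Listing 3.17, such a step would report both the absolute URL and the relative one, which is why it has to see the original HTML before boilerplate removal strips the anchors.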

3.4 Functional and runtime operator properties

In document: Scalable and Declarative Information Extraction in a Parallel Data Analytics System (pages 50-53)