Meteor/Sopremo: data flow language and operator model

2.3 The Stratosphere data analytics system

2.3.2 Meteor/Sopremo: data flow language and operator model

Meteor [Heise et al., 2012] is a declarative, data-flow-oriented scripting language that resides at the top of the Stratosphere stack. Meteor builds upon a semi-structured data model that extends JSON (cf. Chapter 2.1). Its objectives are similar to those of other data flow languages (e.g., Pig [Olston et al., 2008] or Jaql [Beyer et al., 2011]), namely to provide end users with a high-level, easy-to-use interface to complex, user-defined operations in data analytics systems. In contrast to other languages, Meteor is based upon the semantically rich and extensible operator model Sopremo, which makes an operator's semantics accessible at compile time and thus available during data flow optimization.

4 http://www.stratosphere.eu, last accessed: 2016-12-14

[Figure 2.5, diagram components: CLI interface; script parser; Meteor language and Sopremo algebraic layer (illustrated with an example Meteor script); execution plan compiler; physical runtime (operators, data exchange strategies, parallelization primitives, fault tolerance, memory management, I/O management); storage layer (local storage, HDFS, Amazon S3).]

Figure 2.5: Architectural overview and query compilation in the Stratosphere system for parallel data analytics. Declarative Meteor queries are submitted through a command line interface, parsed, and translated into logical data flows in the Sopremo algebra. A schema inferencer analyzes each logical plan to generate a global record schema, which is used by the SOFA optimizer during logical optimization. An optimized logical plan is translated by the data flow compiler into a PACT program, which is physically optimized and eventually executed in parallel by the parallel execution engine Nephele.

In Meteor, each operator invocation starts with the unique name of the operator and is usually followed by a list of inputs and a set of operator properties, which are configured with a list of property name and expression pairs. The result of an operator invocation can be assigned to a variable, which either refers to a materialized data set or to a logical intermediate data set. Variables start with a dollar sign ($) to ease distinction between data sets and operator definitions.

An example of a Meteor query, which computes the frequencies of nouns in a collection of documents, is shown in the left part of Figure 2.6. The first line of the script imports Sopremo operator packages (see below), which are used in the query. Here, operator packages for IE and basic operations are used. Line 3 specifies the data source, which in this case is a JSON file stored in the distributed file system HDFS [Borthakur, 2008]. Lines 5–7 apply three different IE operators, which process unstructured text linguistically by first splitting it into distinct sentences (Line 5), by annotating token boundaries (Line 6), and by annotating part of speech tags (Line 7). Lines 9–13 describe an aggregation, which determines the frequencies of tokens inside a group by operator. In this operator, the grouping key is specified by the individual tokens (indicated through the operator property by) and a count function determines the frequencies for each group. The operator property into specifies the output schema of the processed and aggregated records, which are subsequently filtered for records identified as a noun (Line 15). Finally, the result set is written to HDFS in Line 17.
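The processing steps of the example query can be mimicked in plain Python to make the record transformations concrete. This is only an illustrative sketch, not Meteor or Sopremo code: the function names, the record layout, and the toy noun lexicon `TOY_NOUNS` (standing in for a real part-of-speech tagger) are assumptions introduced here.

```python
import re
from collections import Counter

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_NOUNS = {"gene", "protein", "cell"}

def split_sentences(documents):
    """Emit one record per sentence (mirrors the sentence splitting step)."""
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc["text"].strip()):
            if sentence:
                yield {"text": sentence}

def split_tokens(sentences):
    """Emit one record per token (mirrors the tokenization step)."""
    for sentence in sentences:
        for token in re.findall(r"[A-Za-z]+", sentence["text"]):
            yield {"text": token.lower()}

def annotate_pos(tokens):
    """Attach a part-of-speech tag to every token record."""
    for token in tokens:
        tag = "NN" if token["text"] in TOY_NOUNS else "OTHER"
        yield {**token, "pos": tag}

def noun_frequencies(documents):
    tokens = list(annotate_pos(split_tokens(split_sentences(documents))))
    # Group by token text and count the members of each group.
    counts = Counter(t["text"] for t in tokens)
    tags = {t["text"]: t["pos"] for t in tokens}
    words = [{"word": w, "pos": tags[w], "count": c} for w, c in counts.items()]
    # Filter the aggregated records for nouns, sorted for stable output.
    return sorted((r for r in words if r["pos"] == "NN"), key=lambda r: r["word"])
```

Note that the sentence splitter turns one document record into several sentence records, whereas the tagger only adds an attribute to existing records.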

Meteor queries are submitted to the system through a command line interface, parsed into abstract syntax trees composed of basic or complex Sopremo operators, and translated into logical execution plans in the Sopremo algebra (see next paragraph). The schema inference component analyzes each logical plan to generate a global record schema, which is used by the logical optimizer during optimization and by the data flow compiler to translate the logical data flow into a PACT program.

Sopremo operators and packages

Meteor queries are translated one-to-one into data flows consisting of Sopremo operators. Each invocation of an operator in Meteor corresponds to an elementary or complex Sopremo operator, and variables in Meteor are translated into data flow edges indicating the data flow between Sopremo operators.

The right part of Figure 2.6 displays the corresponding Sopremo data flow of our example query for noun frequency computation. The data flow is linear and consists of five operators, three of which are elementary operators (i.e., POS tagger, Group by, Filter), and two of which are complex operators (i.e., Sentence splitter, Tokenizer).
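The one-to-one translation can be pictured as follows: every statement becomes an operator node, and every variable use becomes an edge from the node that last produced that variable. The sketch below is a simplified illustration under stated assumptions, not the Meteor parser; the statement tuples, the `OperatorNode` class, and the variable name `$result` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class OperatorNode:
    name: str
    inputs: list = field(default_factory=list)  # upstream OperatorNode objects

def build_data_flow(statements):
    """Turn (target_variable, operator_name, input_variables) triples into nodes."""
    producer = {}  # variable name -> node that currently produces it
    nodes = []
    for target, op_name, input_vars in statements:
        # Inputs are resolved before the target is rebound, so a statement
        # such as "$tokens = annotate ... $tokens" reads the previous producer.
        node = OperatorNode(op_name, [producer[v] for v in input_vars])
        nodes.append(node)
        if target is not None:
            producer[target] = node
    return nodes

def edges(nodes):
    """List the data flow edges as (source operator, target operator) pairs."""
    return [(src.name, node.name) for node in nodes for src in node.inputs]

# Statements modelled on the example query; "$result" is a hypothetical name.
EXAMPLE = [
    ("$input", "Source", []),
    ("$sentences", "Sentence splitter", ["$input"]),
    ("$tokens", "Tokenizer", ["$sentences"]),
    ("$tokens", "POS tagger", ["$tokens"]),
    ("$words", "Group by", ["$tokens"]),
    ("$result", "Filter", ["$words"]),
    (None, "Sink", ["$result"]),
]
```

Calling `edges(build_data_flow(EXAMPLE))` yields a linear chain, which matches the description of the example data flow as linear.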

All Sopremo operators are organized in domain-specific packages, which are self-contained libraries of the operator implementations, their syntax, and semantic annotations. All operators are either elementary or complex and may have different instantiations (cf. Chapter 2.1). Stratosphere contains four packages, namely a Base package containing 16 operators, a package for IE with 21 operators, a package for data cleansing (DC) with nine operators, and a package for web analytics (WA) with five operators.

A detailed description of the Base and DC packages is available in [Heise, 2015]; details regarding the IE and WA operator packages are presented in Chapter 3.

1 using ie, base;

3 $input = read from 'hdfs://…/medline.json';

5 $sentences = split sentences $input;
6 $tokens = split tokens $sentences;
7 $tokens = annotate pos tags $tokens;

9 $words = group $tokens by {$tokens.text} into {
10 word: max($tokens.text),

Figure 2.6: Example Meteor query for term frequency computation with corresponding Sopremo data flow. The precedence graph and a logically optimized Sopremo data flow are shown in Figure 2.7.

The Base package comprises mostly relational operators, such as filter, projection, transformation, join, and group. These operators are complemented by operators for semi-structured data, such as nest or unnest.

The DC package comprises six different classes of operators for data cleansing and data integration [Heise, 2015], which address common challenges of dirty or heterogeneous data sources, such as inconsistent representation of equivalent values, fuzzy duplicates, typographic errors, or missing values.

The IE package comprises three classes of operators: one class for producing text annotations, one class for auxiliary operators to transform or merge annotations, and one class for complex operators. Operators analyze the text and add, remove, or update annotations on the record. They may also transform records. For example, the operator for sentence splitting takes as input single records formed of documents and outputs a set of records formed of sentences.

The WA package comprises operators for analyzing web documents, namely operators for HTML markup repair, for markup removal, as well as operators for extracting links, tables, and other structured information contained in web documents.

Extensibility

By extending the set of existing Sopremo operators, domain experts can easily integrate domain-specific, user-defined operators into the Stratosphere system. To support reuse of available operator implementations in Stratosphere, existing Sopremo operators can be composed, i.e., a set of elementary operators can be interconnected in a partial data flow to form a complex operator. This enables code reuse and the optimization of complex operators, since rewriting a complex operator may be enabled by transforming its building blocks (see Chapter 5 for details).
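The idea of composing elementary operators into a complex operator can be sketched as below. The classes are hypothetical and only model a linear partial data flow; they are not the Sopremo API. The point of `building_blocks` is that the parts of a complex operator remain visible, so that an optimizer could reason about them individually.

```python
class ElementaryOperator:
    """An operator with its own implementation (here: a plain function)."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def apply(self, records):
        return self.fn(records)

    def building_blocks(self):
        return [self]


class ComplexOperator:
    """An operator defined as a chain of existing operators."""

    def __init__(self, name, parts):
        self.name = name
        self.parts = list(parts)

    def apply(self, records):
        # Feed the output of each part into the next one.
        for part in self.parts:
            records = part.apply(records)
        return records

    def building_blocks(self):
        # Flatten nested complex operators into their elementary parts.
        return [b for part in self.parts for b in part.building_blocks()]


# Hypothetical example operators.
lowercase = ElementaryOperator("lowercase", lambda rs: [r.lower() for r in rs])
drop_empty = ElementaryOperator("drop_empty", lambda rs: [r for r in rs if r])
normalize = ComplexOperator("normalize", [lowercase, drop_empty])
```

Because `normalize` exposes `lowercase` and `drop_empty` as its building blocks, a rewrite that is valid for one of them could in principle be applied inside the complex operator.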

Developers of new operators or operator packages need to ensure that all operators are self-contained regarding the following aspects for a seamless integration into Sopremo [Heise et al., 2012]:

Self-contained operator implementations. Package developers provide operator semantics together with parallel operator implementations. New operators can either be defined as elementary operators, each providing its own low-level parallel implementation (PACTs, see below), or as complex operators, which are combinations of existing operators.

Self-contained operator properties. Properties of Sopremo operators (e.g., join conditions, entity classes to be annotated) are available through a reflection API, which manages and validates requested properties directly within the concrete operator. Properties are used to configure operators and to choose a concrete implementation during optimization.

Self-contained operator annotations. Optionally, package developers can provide annotations, such as cost estimates, specific rewrite rules, or algebraic properties (e.g., commutativity, associativity) as an aid to the logical optimizer inside their package. If such annotations are not provided, the optimizer relies on information inferred from the operator’s implementation (see Chapter 5 for details).
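The following sketch illustrates how an operator might declare its properties and optional annotations so that they can be discovered and validated by inspecting the operator class itself. It is a loose analogy in Python, not the Sopremo reflection API; the class names, the property names, and the validator convention are assumptions made for illustration.

```python
class Operator:
    """Base class; subclasses declare their configurable properties."""

    # property name -> validation function (hypothetical convention)
    properties = {}
    # optional optimizer hints; an empty dict means none were provided
    annotations = {}

    def __init__(self, **settings):
        self.settings = {}
        for name, value in settings.items():
            self.set_property(name, value)

    def set_property(self, name, value):
        # Look the property up on the concrete operator class and validate it.
        validator = type(self).properties.get(name)
        if validator is None:
            raise KeyError(f"{type(self).__name__} has no property {name!r}")
        if not validator(value):
            raise ValueError(f"invalid value {value!r} for property {name!r}")
        self.settings[name] = value

    @classmethod
    def describe(cls):
        """Return the names of all properties the operator accepts."""
        return sorted(cls.properties)


class EntityAnnotator(Operator):
    # Hypothetical properties of an entity annotation operator.
    properties = {
        "entity_class": lambda v: v in {"gene", "drug", "disease"},
        "case_sensitive": lambda v: isinstance(v, bool),
    }
    annotations = {"commutative": False}
```

Keeping validation inside the operator class means that a package can be added without changing any central registry, which is the sense of "self-contained" sketched here.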

Logical data flow optimization

Once a Meteor query has been compiled into an algebraic Sopremo data flow and the corresponding global schema has been determined, logical data flow optimization through the SOFA optimizer commences to determine a semantically equivalent, yet more cost-effective logical plan. In this section, we give a brief high-level overview of SOFA and its components; details of the optimization approach are contained in Chapter 5.

IE data flows are naturally UDF-heavy. A major issue in optimizing such flows is the diversity of the contained UDFs. Defining rewrite rules that respect the individual operator semantics for each possible combination of operators is virtually impossible in UDF-rich systems such as Stratosphere. A particular challenge during optimization is extensibility, as every new operator in principle needs to be analyzed with respect to all existing operators to identify rewrite options.

Figure 2.7: Precedence graph (left) and optimized data flow (right) for the example query of Figure 2.6. Precedence relationships of data source and data sink are omitted in the precedence graph to ease readability.

SOFA solves this problem by means of Presto, an extensible taxonomy of operators, properties, and rewrite templates, and by reasoning along subsumption relationships encoded in Presto. The principal ingredients of Presto are two taxonomies describing generalization-specialization relationships (isA), one between pairs of operators and one between pairs of properties. Leaves in the operator taxonomy describe concrete implementations of the abstract parent operator. For example, different concrete algorithms for entity extraction are represented as leaves, whereas the abstract operator for entity extraction is an ancestor of those leaves. Presto uses three additional relationships (hasProperty, hasPart, and hasPrerequisite) to model relations between operators and properties. Properties relevant for optimization are, for example, algebraic properties such as commutativity or associativity, the parallelization function (e.g., map, reduce), or the read/write behavior of operators at attribute level. Rewrite templates are defined using Presto relationships, operator properties, and abstract operators as building blocks. Reasoning along Presto relationships allows SOFA to automatically instantiate the templates with concrete operators and enables discovery of individual rewrite options for concrete operator combinations on the fly.
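Reasoning along isA relationships to instantiate a template can be sketched in a few lines. The taxonomy entries, property names, and the single template below are invented for illustration and do not reproduce Presto's actual contents; in particular, the sketch omits conditions such as attribute-level read/write conflicts, which the text lists among the properties relevant for optimization.

```python
# child -> parent (hypothetical fragment of an operator taxonomy)
IS_A = {
    "DictionaryEntityTagger": "EntityAnnotator",
    "CrfEntityTagger": "EntityAnnotator",
    "EntityAnnotator": "Annotator",
    "PosTagger": "Annotator",
    "Annotator": "Operator",
    "Filter": "Operator",
}

# operator -> properties attached via hasProperty (hypothetical values)
HAS_PROPERTY = {
    "Annotator": {"record-at-a-time"},
    "Filter": {"record-at-a-time", "selective"},
}

def ancestors(op):
    """Return op followed by all its ancestors along isA."""
    chain = [op]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain

def is_a(op, abstract_op):
    return abstract_op in ancestors(op)

def properties(op):
    """Collect properties of op, inherited from its ancestors."""
    props = set()
    for node in ancestors(op):
        props |= HAS_PROPERTY.get(node, set())
    return props

# Hypothetical template: an operator matching "first" may be swapped with a
# directly following operator matching "second" that has the given property.
TEMPLATE = {"first": "Annotator", "second": "Filter", "requires_second": "selective"}

def instantiate(template, concrete_ops):
    """Return all concrete operator pairs that match the abstract template."""
    return [
        (a, b)
        for a in concrete_ops
        for b in concrete_ops
        if a != b
        and is_a(a, template["first"])
        and is_a(b, template["second"])
        and template["requires_second"] in properties(b)
    ]
```

A newly added tagger only needs one isA entry to be matched by the existing template, which illustrates why no pairwise rules per operator combination are required.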

By extending the set of Sopremo operators, package developers and domain specialists can integrate application-specific functionality into Stratosphere as explained before. Likewise, by adding their new operators to Presto, they can enable cross-domain data flow optimization and extend the optimization potential of their operators in a pay-as-you-go manner.

Using Presto, SOFA enumerates alternative plans for a given data flow D in three steps. First, D is analyzed for precedence constraints between operators, which yields a precedence graph P_D. Second, P_D is used in the plan enumeration phase to enumerate valid plan alternatives. Third, these alternatives are ranked by estimated cost. Finally, the best plan is selected and compiled into a physical data flow program using the PACT programming model. As shown in the precedence graph in the middle of Figure 2.7, the data flow from Figure 2.6 contains only one degree of freedom, i.e., the group by and the filter operator are not in a precedence relationship with each other and can be reordered during optimization. A semantically equivalent and most likely more cost-effective plan is shown on the right side of the figure.
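The three steps can be illustrated for the linear example flow with a brute-force sketch: enumerate all operator orders, keep those that respect the precedence pairs, and rank them with a simple cost model. This is not SOFA's enumeration algorithm or cost model; the precedence pairs, selectivities, and per-record costs below are made-up values chosen only to show the mechanics.

```python
from itertools import permutations

def valid_plans(operators, precedence):
    """Yield every operator order that respects all (before, after) pairs."""
    for order in permutations(operators):
        position = {op: i for i, op in enumerate(order)}
        if all(position[a] < position[b] for a, b in precedence):
            yield order

def plan_cost(order, selectivity, cost_per_record, input_size):
    """Sum per-operator costs while tracking the shrinking record count."""
    total, size = 0.0, float(input_size)
    for op in order:
        total += size * cost_per_record[op]
        size *= selectivity[op]
    return total

def best_plan(operators, precedence, selectivity, cost_per_record, input_size):
    return min(
        valid_plans(operators, precedence),
        key=lambda order: plan_cost(order, selectivity, cost_per_record, input_size),
    )

OPERATORS = ["Sentence splitter", "Tokenizer", "POS tagger", "Group by", "Filter"]
# Assumed precedence graph: Group by and Filter are unordered relative to each other.
PRECEDENCE = [
    ("Sentence splitter", "Tokenizer"),
    ("Tokenizer", "POS tagger"),
    ("POS tagger", "Group by"),
    ("POS tagger", "Filter"),
]
# Hypothetical statistics, for illustration only.
SELECTIVITY = {"Sentence splitter": 1.0, "Tokenizer": 1.0, "POS tagger": 1.0,
               "Group by": 0.5, "Filter": 0.3}
COST = {"Sentence splitter": 1.0, "Tokenizer": 1.0, "POS tagger": 1.0,
        "Group by": 2.0, "Filter": 0.1}
```

With these assumed numbers, two plans are valid and the one that filters before grouping is cheaper, because the expensive grouping then processes fewer records.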
