Optimizing JSONiq Execution in Rumble using MLIR

Academic year: 2022


Research Collection

Bachelor Thesis

Optimizing JSONiq Execution in Rumble using MLIR

Author(s):

Reber, Manuel

Publication Date:

2020-06

Permanent Link:

https://doi.org/10.3929/ethz-b-000444644

Rights / License:

In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.


Bachelor’s Thesis Nr. 313b

Systems Group, Department of Computer Science, ETH Zurich

Optimizing JSONiq Execution in Rumble using MLIR

by

Manuel Ilias Reber

Supervised by

Dr. Ghislain Fourny

Dr. Ingo Mueller

Prof. Dr. Gustavo Alonso

June 2020

Acknowledgements

At this point I would like to express a special thanks to my supervisors, Dr. Ingo Müller and Dr. Ghislain Fourny, who even in times of Corona and Social Distancing accompanied and supported me in weekly online meetings with informative inputs and understandable answers to all my questions.


Abstract

Semi-structured data formats such as JSON offer the advantage of representing arbitrarily complex data in a format that can be read easily by both humans and machines. Due to their simplicity, such formats are popular for applications that produce large amounts of data, where it is not clear in advance how, or even whether, the produced data will be used in the future, such that it is not worth investing effort into schema design, data migration, etc. One major downside, however, is that queries executed on semi-structured data perform much slower when compared to fully-structured data.

In this bachelor’s thesis we provide an approach for increasing the efficiency of query execution on JSON data sets by developing an Intermediate Representation (IR) for JSONiq as an MLIR dialect. We then develop transformation mechanisms from queries in Rumble to the dialect and vice versa. With such a round trip in place, a query can be optimized on the Intermediate Representation and translated back into a query that can be executed more efficiently.


Contents

1 Introduction
2 Background and Related Work
    2.1 JSON
    2.2 JSONiq
    2.3 Spark
    2.4 Rumble
    2.5 MLIR
        2.5.1 Dialects
        2.5.2 Operations
        2.5.3 Functions
        2.5.4 Blocks
        2.5.5 Regions
        2.5.6 Operation Description
        2.5.7 Declarative rewrites
3 The JSONiq Dialects
    3.1 Approach 1
        3.1.1 Types
        3.1.2 Operations
        3.1.3 Representation of Queries
        3.1.4 Theoretical Aspects of the IR
    3.2 Approach 2
        3.2.1 Types
        3.2.2 Operations
        3.2.3 Representation of Queries
        3.2.4 Theoretical Aspects of the IR
4 Implementation
    4.1 Transformation to the IR
    4.2 Transformation to Rumble
        4.2.1 Approach 1
    4.3 Approach 2
        4.3.1 Initialization of the Graph
5 Optimizations
6 Testing
A Other Approaches
B EBNF Grammar


1 Introduction

Semi-structured data formats such as JSON offer the advantage of representing arbitrarily complex data in a format that can be read easily by both humans and machines. Due to their simplicity, such formats are popular for applications that produce large amounts of data, where it is not clear in advance how, or even whether, the produced data will be used in the future, such that it is not worth investing effort into schema design, data migration, etc. One major downside, however, is that queries executed on semi-structured data perform much slower when compared to fully-structured data.

This thesis provides an approach for increasing the efficiency of query execution on semi-structured data by optimizing Rumble queries.

Optimizations of code often take place on Intermediate Representations (IRs) in the compiler. For this reason we have designed an IR for JSONiq as an MLIR dialect. In order to obtain a more efficient execution in Rumble we now have the following two possibilities:

1. We optimize the IR and then translate it into runtime iterators.
2. We optimize the IR and then translate it back into Rumble code.

The latter option means that we convert the original query into a semantically equivalent query which can be executed more efficiently. Figure 1 shows an illustration of the two possibilities.

[Diagram with the nodes Rumble Query, IR and Runtime Iterator, connected by edges labeled Transformation and Optimization.]

Figure 1: Illustration of the two possibilities. The part marked in green shows the scope covered in this thesis.

Due to time constraints, we have limited ourselves in this bachelor’s thesis to finding transformation mechanisms from Rumble to the IR and from the IR back to Rumble. We have decided to design the JSONiq IR as an MLIR dialect so that further work, which could for example deal with optimizing the IR or with the transformation to runtime iterators, can use the large amount of compiler functionality provided by MLIR.

In chapter 2, we discuss background information and related work. In chapter 3, we present different approaches for a suitable JSONiq dialect in MLIR. Chapter 4 deals with the implementation of the transformation mechanisms, and in chapter 5 we briefly show simple examples of optimizations in MLIR. Finally, chapter 6 deals with the testing of the implementation.


2 Background and Related Work

This chapter gives an overview of background information and related work.

2.1 JSON

JSON [1] (JavaScript Object Notation) is a language-independent text format for lightweight data interchange. It is based on two structures:

object An object is an unordered set of name/value pairs. In most programming languages this is realized, for example, as a struct or a dictionary. An object always begins and ends with curly brackets "{ }", the name/value pairs are separated by commas, and each name is followed by a colon.

array An array is an ordered collection of values and can be realized in programming languages as, e.g., a vector, a list or an array. Arrays begin and end with square brackets "[ ]", and values are separated by commas.

Values can be of type string, number, true, false, null, or again object or array, which makes it possible to have nested structures. Figure 2 shows an example of a JSON object.

The use of conventions that are familiar to programmers of various programming languages and the fact that the format is easy for humans to read and write and easy for machines to parse and generate make JSON an ideal data-interchange language.

{ "game": "football",
  "match": ["Real Madrid", "Fc Barcelona"],
  "stadium_traits" : {"name" : "Santiago Bernabeu",
                      "capacity": 80000},
  "result" : [0, 0] }

Figure 2: Example of a JSON object
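To make the value types listed above concrete, the object from Figure 2 can be parsed with Python's standard json module. This is only an illustrative sketch: the helper name json_type is an assumption made for this example and is not part of JSON, JSONiq or Rumble.

```python
import json

# The object from Figure 2 as JSON text.
text = """
{ "game": "football",
  "match": ["Real Madrid", "Fc Barcelona"],
  "stadium_traits": {"name": "Santiago Bernabeu", "capacity": 80000},
  "result": [0, 0] }
"""

doc = json.loads(text)


def json_type(value):
    """Return the JSON value type of a parsed Python value (hypothetical helper)."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    if isinstance(value, bool):  # must be checked before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    raise TypeError("not a JSON value: %r" % (value,))


# Type of every top-level member of the object.
member_types = {name: json_type(value) for name, value in doc.items()}
print(member_types)
```

The nested object under "stadium_traits" and the arrays under "match" and "result" show how objects and arrays can contain further values.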

2.2 JSONiq

JSONiq [2] is a functional, declarative language specifically designed for querying JSON data.

The main instances of the JSONiq Data Model (JDM) are sequences of items and tuples.

items Items can be JSON objects, JSON arrays, strings, numbers, booleans, nulls, and many other supported atomic values such as dates, binary, etc.

sequence of items Sequences of items, as the name suggests, are sequences of zero or more items, separated by commas. Consider the following example from jsoniq.org (footnote 1).

("foo", 2, true, {"foo" : "bar"}, null, [1,2,3])

Sequences are flat and cannot be nested. Further note that a sequence of just one item is considered the same as the item itself.

tuples A tuple (sometimes also called context or dynamic context) is a set of key-value pairs, where the key is the name of a variable and the value is the corresponding sequence of items [3]. Consider the following example of a tuple.

<$i : (1, null, false), $j : ("foo"), $k : ()>
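The data model described above can be mirrored in a few lines of Python. The sketch below is an assumption-laden model, not Rumble's implementation: sequences are modelled as Python tuples, arrays as lists, objects as dicts, and a JSONiq tuple as a dict from variable names to sequences. The function name make_sequence is hypothetical.

```python
def make_sequence(*parts):
    """Build a flat sequence of items.

    Nested sequences (Python tuples) are flattened, because JSONiq
    sequences cannot be nested. Arrays (lists) and objects (dicts)
    are single items and are kept whole.
    """
    flat = []
    for part in parts:
        if isinstance(part, tuple):
            flat.extend(make_sequence(*part))
        else:
            flat.append(part)
    return tuple(flat)


# The example sequence from the text: six items, the array counts as one item.
example = make_sequence("foo", 2, True, {"foo": "bar"}, None, [1, 2, 3])

# Nesting disappears, and a one-item sequence equals the item itself.
nested = make_sequence(1, (2, (3, ())), 4)
singleton_a = make_sequence(("foo",))
singleton_b = make_sequence("foo")

# A tuple (dynamic context): variable name -> sequence of items.
context = {
    "$i": make_sequence(1, None, False),
    "$j": make_sequence("foo"),
    "$k": make_sequence(),
}
print(nested, singleton_a == singleton_b, context["$k"])
```

Modelling the one-item case this way means that an item and its singleton sequence are indistinguishable after construction, which matches the rule stated above.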

1 http://www.jsoniq.org/docs/JSONiq/webhelp/index.html#chapter-data-model.html


Expressions

The main building blocks in JSONiq for querying and processing data are expressions.

Expressions are defined recursively and can be composed into an expression tree, where each expression consumes sequences of items from its child-expressions and each child-expression returns a sequence of items to its parent. Expressions are evaluated on tuples, so formally an expression is a mapping from a JSONiq tuple to a JSONiq sequence of items [3]:

$$e \in E : T \to S$$

where $E$ is the set of JSONiq expressions, $T$ is the set of JSONiq tuples and $S$ is the set of JSONiq sequences of items. Figure 3 shows an illustration of a JSONiq expression.

The functions $f_1, \dots, f_n$ are defined by the expression $e$. Each $f_j$ can take all the previously calculated sequences $S_1, \dots, S_{j-1}$ and the main input tuple $T_{in}$ as inputs and produces a tuple $T_j$ as output, which is then used by the child-expression $e_j$.

[Diagram: expression $e$ with child-expressions $e_1, e_2, \dots, e_n$. The input tuple $T_{in}$ is passed through the functions $f_1, f_2, \dots, f_n$, which yield the tuples $T_1, T_2, \dots, T_n$ for the child-expressions; the child-expressions return the sequences $S_1, S_2, \dots, S_n$, and the function output produces $S_{out}$.]

Figure 3: Illustration of a JSONiq expression [3].

Expressions include arithmetic and comparison operators, literals, dynamic object and array constructors, object and array navigation, filters, data-flow expressions such as if-then-else, function calls, string concatenation and many more. The most important expression is the FLWOR expression.

The FLWOR Expression

"FLWOR expressions are probably the most powerful JSONiq construct and correspond to SQL’s SELECT-FROM-WHERE statements, but they are more general and more flexible" [2]. A FLWOR expression consists of multiple clauses. There are seven clauses in total.

• For clause

• Let clause

• Count clause

• Order by clause

• Group by clause

• Where clause

• Return clause


In a FLWOR expression, clauses can appear in almost any order; the only restrictions are that the expression must begin with a for or let clause and end with a return clause.

Clauses are similar to expressions in JSONiq but instead of mapping a tuple to a sequence of items, clauses map tuple-streams to tuple-streams except for the return clause, which maps a tuple-stream to a sequence of items. Tuple-streams can be considered as vectors of tuples. The following example shows a tuple-stream.

$$\begin{pmatrix}
\langle \text{\$i}: (5),\ \text{\$j}: (\text{"hello"}),\ \text{\$k}: (\text{false}) \rangle \\
\langle \text{\$max}: (5),\ \text{\$min}: (0),\ \text{\$set}: () \rangle \\
\langle \text{\$avg}: (3),\ \text{\$mean}: (2),\ \text{\$obj}: \{\text{"foo"}: 2,\ \text{"bar"}: \text{null}\} \rangle
\end{pmatrix}$$

Further, let $C$ be the set of clauses except the return clause, and let $TS$ be the set of tuple-streams in JSONiq. Then we can express clauses formally as

$$c \in C : TS \to TS$$

and the return clause as

$$\text{ReturnClause} : TS \to S.$$

Figure 4 shows an illustration of a FLWOR expression. The function $f_1$ maps the input $T_{in}$ to a tuple-stream $TS_{i,1}$, which then flows into the first clause $c_1$. The output of each clause $c_1, \dots, c_n$ is the input of the subsequent clause, and at the end the FLWOR expression outputs the result of the return clause.

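[Diagram: FLWOR expression $e$ with the clauses $c_1, c_2, \dots$ and a return clause. The function $f_1$ maps $T_{in}$ to $TS_{i,1}$; each clause $c_k$ maps its input $TS_{i,k}$ to its output $TS_{o,k}$, which is the input of the next clause; the return clause maps $TS_{i,n}$ to $S_n$, which is returned as $S_{out}$.]

Figure 4: Illustration of a FLWOR expression [3]. Note that $c_1$ must be either a For clause or a Let clause.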

Let’s consider the following concrete example.

for $i in (1 to $max) where $i > 3 return $i    (1)

Figure 5 shows an illustration of the expression in (1).

Let’s say we evaluate this expression on the tuple given by

<$max : (5)>.

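As described above, the function $f_1$ maps the input tuple to a tuple-stream.

$$\langle \text{\$max}: (5) \rangle \xrightarrow{f_1} \big( \langle \text{\$max}: (5) \rangle \big)$$

The output of $f_1$ is then used by the For clause. The For clause uses its own functions and child-expression to produce the output $TS_{o,1}$.

$$\big( \langle \text{\$max}: (5) \rangle \big) \xrightarrow{c_1}
\begin{pmatrix}
\langle \text{\$max}: (5),\ \text{\$i}: (1) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (2) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (3) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (4) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (5) \rangle
\end{pmatrix}$$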

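[Diagram: FLWOR expression $e$ with the For clause $c_1$, the Where clause $c_2$ and the return clause. The function $f_1$ maps $T_{in}$ to $TS_{i,1}$; the For clause produces $TS_{o,1}$, which is the input $TS_{i,2}$ of the Where clause; its output $TS_{o,2}$ is the input $TS_{i,3}$ of the return clause, which produces $S_3$, returned as $S_{out}$.]

Figure 5: Illustration of the FLWOR expression in (1) [3].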

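As one can see in Figure 5, the Where clause then takes the output of the For clause and maps it to $TS_{o,2}$. Note that the Where clause also uses its own functions and child-expression.

$$\begin{pmatrix}
\langle \text{\$max}: (5),\ \text{\$i}: (1) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (2) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (3) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (4) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (5) \rangle
\end{pmatrix}
\xrightarrow{c_2}
\begin{pmatrix}
\langle \text{\$max}: (5),\ \text{\$i}: (1),\ \text{\$\$}: (\text{false}) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (2),\ \text{\$\$}: (\text{false}) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (3),\ \text{\$\$}: (\text{false}) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (4),\ \text{\$\$}: (\text{true}) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (5),\ \text{\$\$}: (\text{true}) \rangle
\end{pmatrix}$$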

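Finally, the return clause takes the output of the Where clause, uses its functions and child-expression, and maps it to the sequence of items $S_3$.

$$\begin{pmatrix}
\langle \text{\$max}: (5),\ \text{\$i}: (1),\ \text{\$\$}: (\text{false}) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (2),\ \text{\$\$}: (\text{false}) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (3),\ \text{\$\$}: (\text{false}) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (4),\ \text{\$\$}: (\text{true}) \rangle \\
\langle \text{\$max}: (5),\ \text{\$i}: (5),\ \text{\$\$}: (\text{true}) \rangle
\end{pmatrix}
\xrightarrow{\text{ReturnClause}} (4, 5)$$

The FLWOR expression then takes $S_3$ and returns it.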
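The walk-through above can be replayed as a small Python sketch. It is a simplified model under stated assumptions and not Rumble's runtime: tuples are dicts from variable names to lists of items, tuple-streams are iterables of such dicts, and the Where clause records its result under the key "$$" as in the tuple-streams shown above. All function names are hypothetical.

```python
def for_clause(stream, var, expr):
    """TS -> TS: bind var to each item of expr(t), one output tuple per item."""
    for t in stream:
        for item in expr(t):
            yield {**t, var: [item]}


def where_clause(stream, predicate):
    """TS -> TS: annotate every tuple with the predicate result under "$$"."""
    for t in stream:
        yield {**t, "$$": [predicate(t)]}


def return_clause(stream, expr):
    """TS -> S: concatenate expr(t) for all tuples that are not flagged false."""
    items = []
    for t in stream:
        if t.get("$$", [True]) == [True]:
            items.extend(expr(t))
    return items


def flwor(t_in):
    """Model of: for $i in (1 to $max) where $i > 3 return $i"""
    stream = [dict(t_in)]  # initial step: input tuple -> one-element tuple-stream
    stream = for_clause(stream, "$i", lambda t: range(1, t["$max"][0] + 1))
    stream = where_clause(stream, lambda t: t["$i"][0] > 3)
    return return_clause(stream, lambda t: t["$i"])


result = flwor({"$max": [5]})
print(result)
```

Because every clause is a generator over the previous one, the tuple-stream is processed one tuple at a time, which is one possible way to read the chained clauses in Figure 5.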

2.3 Spark

Apache Spark [4, 5] is a popular and fast cluster computing system, originally developed at the University of California, Berkeley. It is similar to the MapReduce programming model, but is more flexible in the sense that it generalizes MapReduce to arbitrary DAGs (Directed Acyclic Graphs) of transformations. Spark operates on two data structures:

RDDs RDDs (Resilient Distributed Datasets) are flat collections of heterogeneous data distributed among the nodes of the cluster so that they can be processed in parallel.

DataFrames DataFrames are collections composed of homogeneous rows whose types and field names are known before the execution. Conceptually they are equivalent to tables in relational databases, hence queries on DataFrames can be optimized using the techniques from relational database systems.

Spark provides high-level APIs in the programming languages Scala, Python, Java and R.

2.4 Rumble

Rumble [6, 7] is a stable and efficient JSONiq engine implemented as a Spark application written in Java. It can process heterogeneous and nested datasets consisting of huge numbers of JSON objects by dynamically pushing down computations to Spark, without exposing this to the user [8]. Queries in Rumble are parsed into an Abstract Syntax Tree (AST), which is then translated into a physical execution plan. The physical execution plan consists of runtime iterators, where each runtime iterator corresponds to a particular JSONiq expression or to a particular JSONiq clause. The expression iterators return items, while the clause iterators return tuples. There are three different execution modes for the runtime iterators:

local execution The local execution mode usually deals with small amounts of data at a time and is used for pre- or post-processing Spark jobs. Simple queries may even run entirely in this mode.

RDD-based execution The RDD-based execution mode is used whenever there is no knowledge about the structure of the data, which is typically the case for sequences of items.

DataFrame-based execution This execution mode is used whenever parts of the internal structure are known statically. This is the case for the clause iterators, where the variables bound in the tuples can be derived statically and can therefore be represented as columns in tables. However, some expression iterators for which the output structure can be determined statically can also use the DataFrame-based execution mode.

One can submit the queries in Rumble either individually via the command line or in a shell.

2.5 MLIR

MLIR (Multi-Level Intermediate Representation) [9,10] is a flexible and extensible infrastructure for compiler construction. "It aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduce the cost of building domain specific compilers, and aid in connecting existing compilers together" [10]. In contrast to other compiler infrastructures such as LLVM [11] or the Java Virtual Machine (JVM) [12], MLIR does not implement the "one size fits all" approach, but allows modeling on different levels of abstraction.

MLIR standardizes the Static Single Assignment (SSA)-based IR data structures, provides a declarative system for defining IR dialects and offers a wide range of common infrastructure such as parsing and printing logic, location tracking, pass management, etc.

2.5.1 Dialects

Dialects in MLIR represent the mechanism by which the MLIR ecosystem can be extended. They allow for defining new operations, attributes and types. Each dialect is given a unique namespace and each operation, attribute and type of that dialect is prefixed by its namespace.

2.5.2 Operations

Operations in MLIR are similar to instructions in LLVM, except that operations are fully extensible, while instructions in LLVM are hard-coded. An operation in MLIR is identified by a unique string and can return zero or more results.

The following describes the grammar of MLIR operations using Extended Backus-Naur Form (EBNF) [13]. The appendix contains the whole EBNF grammar used in this thesis.

operation ::= op-result-list? (generic-operation | custom-operation) trailing-location?

Figure 6: EBNF description for operations in MLIR


Let’s also consider an example of an operation. This example is taken from the Toy Tutorial on the MLIR webpage (footnote 2).

%t_tensor = "toy.transpose"(%tensor) {inplace = true} :
    (tensor<2x3xf64>) -> tensor<3x2xf64> loc("example/file/path":12:1)

Figure 7: MLIR assembly for the Toy transpose operation

Note that in the further course of this thesis we refer to the curly brackets after the arguments (in this example {inplace = true}) as the dictionary.

One special instance of operations are terminator operations. Terminator operations are used to end a block, e.g. branches (see below).
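The parts of a generic operation (results, name, operands, dictionary and function type) can be made explicit with a small Python sketch that prints the generic form. The class Operation and its field names are assumptions chosen for this illustration; they are not MLIR's C++ API, and the optional trailing location is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Operation:
    """Hypothetical container for the parts of a generic MLIR operation."""
    name: str
    results: List[str] = field(default_factory=list)
    operands: List[str] = field(default_factory=list)
    dictionary: Dict[str, str] = field(default_factory=dict)
    operand_types: List[str] = field(default_factory=list)
    result_types: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Print the operation in generic form, without a trailing location."""
        text = ""
        if self.results:
            text += ", ".join(self.results) + " = "
        text += '"' + self.name + '"(' + ", ".join(self.operands) + ")"
        if self.dictionary:
            entries = ", ".join(k + " = " + v for k, v in self.dictionary.items())
            text += " {" + entries + "}"
        inputs = "(" + ", ".join(self.operand_types) + ")"
        if len(self.result_types) == 1:
            outputs = self.result_types[0]
        else:
            outputs = "(" + ", ".join(self.result_types) + ")"
        return text + " : " + inputs + " -> " + outputs


transpose = Operation(
    name="toy.transpose",
    results=["%t_tensor"],
    operands=["%tensor"],
    dictionary={"inplace": "true"},
    operand_types=["tensor<2x3xf64>"],
    result_types=["tensor<3x2xf64>"],
)
print(transpose.render())

# An operation with zero results, such as a terminator.
terminator = Operation(
    name="jsoniq.terminator",
    operands=["%val"],
    operand_types=["!jsoniq.sequence"],
)
print(terminator.render())
```

The first call reproduces the operation of Figure 7 on a single line; the second shows that an operation may return no result at all.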

2.5.3 Functions

Functions in MLIR are special operations containing one region (see below). The region of a function is not allowed to access values defined outside of the function.

The following gives an EBNF description of functions in MLIR.

function ::= `func` function-signature function-attributes? function-body?

Figure 8: EBNF description of MLIR functions

The following example of a function is again taken from the Toy Tutorial (footnote 2).

func @toy_func(%tensor: tensor<2x3xf64>) -> tensor<3x2xf64> {
  %t_tensor = "toy.transpose"(%tensor) { inplace = true } : (tensor<2x3xf64>) -> tensor<3x2xf64>
  return %t_tensor : tensor<3x2xf64>
}

Figure 9: MLIR assembly of a function

2.5.4 Blocks

Blocks in MLIR represent a sequential list of operations that are performed from top to bottom. There is no control flow and the last operation is a terminator operation which ends the block. Blocks can have a list of arguments, which are defined by the block, and they can branch to other blocks.

The following gives an EBNF description of an MLIR block.

block ::= block-label operation+

Figure 10: EBNF description of a block

Let’s consider again an example of a block in MLIR:

^bb0(%cond: i1):
  %val = "jsoniq.lit"() {value = 1 : i64} : () -> !jsoniq.sequence
  "jsoniq.terminator"(%val) : (!jsoniq.sequence) -> ()

Figure 11: MLIR assembly of a block

2 https://mlir.llvm.org/docs/Tutorials/Toy/Ch-2/


2.5.5 Regions

Regions serve to group blocks that are semantically connected. They are used when the semantics is not imposed by the IR. Regions do not have a name or an address nor a type or attributes.

Figure 12 gives an EBNF description of a region.

region ::= `{` block* `}`

Figure 12: EBNF description of a region

2.5.6 Operation Description

The Operation Definition Specification (ODS) can be used to define the structure of an operation and components of its verifier declaratively. ODS in MLIR is TableGen-based. TableGen is a data modeling tool designed to help define and maintain domain-specific information, and it is used extensively in LLVM. MLIR translates ODS directly into C++ code, which interoperates with the rest of the system.

2.5.7 Declarative rewrites

Transformations of operations can sometimes be expressed as simple rewrites on the DAG defined by the relation of the SSA values. MLIR provides a graph rewriting framework that contains the Declarative Rewrite Rule (DRR) system, which makes it simple to express patterns. DRR is converted into C++ code. To match more complex patterns, one can intermix the generated C++ code with code written directly in C++ using the generic graph rewriting framework. This allows MLIR to keep the common case simple without restricting generality.
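The idea of a rewrite on the DAG of SSA values can be illustrated outside of MLIR with a few lines of Python. The sketch below is not DRR and does not use any MLIR API; the classes Value and Op are hypothetical, and the pattern (a transpose applied to a transpose yields the original value) is chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Value:
    """A leaf of the DAG, e.g. a function argument."""
    name: str


@dataclass(frozen=True)
class Op:
    """An operation node whose operands are other nodes of the DAG."""
    name: str
    operands: Tuple["Node", ...]


Node = Union[Value, Op]


def rewrite(node):
    """Bottom-up application of the pattern transpose(transpose(x)) -> x."""
    if isinstance(node, Value):
        return node
    operands = tuple(rewrite(operand) for operand in node.operands)
    if node.name == "toy.transpose" and len(operands) == 1:
        inner = operands[0]
        if (isinstance(inner, Op) and inner.name == "toy.transpose"
                and len(inner.operands) == 1):
            return inner.operands[0]  # the pattern matched: drop both transposes
    return Op(node.name, operands)


x = Value("%x")
double = Op("toy.transpose", (Op("toy.transpose", (x,)),))
triple = Op("toy.transpose", (double,))
print(rewrite(double))
print(rewrite(triple))
```

A declarative rule would state only the source and result patterns; the traversal written by hand here is what a rewriting framework provides generically.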


3 The JSONiq Dialects

In this chapter we introduce two approaches for dialects in MLIR that can express Rumble queries as IR. For both approaches we have implemented a mechanism that transforms a Rumble query into the IR and vice versa. The appendix describes those approaches that we consider unsuitable for our problem or that did not work.

3.1 Approach 1

Rumble queries consist of Rumble Expressions and can be formulated as Expression Trees. For this reason we defined an MLIR dialect with namespace jsoniq, which contains types and operations that can represent Expression Trees and thus Rumble queries. In the remainder of this section we introduce the types and operations of this dialect and provide some conceptual aspects of the IR.

3.1.1 Types

As explained in Chapter 2, expressions in Rumble are mappings $T \to S$ and clauses are mappings $TS \to TS$, except for the ReturnClause, which is a mapping $TS \to S$. For this reason we defined three types: sequence, tuple<> and tuplestream<> (footnote 3). With these three types we can define operations that can represent expressions and clauses (footnote 4).

3.1.2 Operations

Rumble Expressions can be divided into Easy Expressions and Hard Expressions. Easy Expressions are those in which the functions $f_1, \dots, f_n$, as described in Figure 3, only depend on $TS_{in}$. Hard Expressions are those in which some of the functions $f_1, \dots, f_n$ depend on previous $f$-functions.

Let’s consider the following example of an Easy Expression:

(5 to $max)    (2)

The query in (2) shows a use case of the RangeExpression. The RangeExpression generates a sequence from its left child-expression to its right child-expression. Figure 13 shows an illustration of the query in (2).

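[Diagram: RangeExpression (to) with the child-expressions IntegerLiteralExpression (5) and VariableReferenceExpression ($max). The input tuple $T_{in}$ is passed to the child-expressions through $f_1$ and $f_2$; the function output produces $S_{out}$.]

Figure 13: Illustration of a RangeExpression [3].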

The two functions $f_1$ and $f_2$ just pass the input tuple $T_{in}$ to the child-expressions without modifying it.

Let’s say we evaluate this expression on the tuple given by

<$max: (12)>

3 The brackets after tuple and tuplestream track the bound variables.

4 In the further course of this thesis we consider Rumble Clauses as a subset of Rumble Expressions.


As explained above, $f_1$ and $f_2$ are just the identity function:

$$\langle \text{\$max}: (12) \rangle \xrightarrow{f_1} \langle \text{\$max}: (12) \rangle$$

$$\langle \text{\$max}: (12) \rangle \xrightarrow{f_2} \langle \text{\$max}: (12) \rangle$$

The IntegerLiteralExpression then takes the input provided by $f_1$ and returns the sequence (5).

$$\langle \text{\$max}: (12) \rangle \xrightarrow{\text{IntegerLiteralExpression}} (5)$$

The VariableReferenceExpression takes the input provided by $f_2$, looks up the reference $max and returns its value, in this case (12).

$$\langle \text{\$max}: (12) \rangle \xrightarrow{\text{VariableReferenceExpression}} (12)$$

Finally, the function output takes as input the sequences provided by the child-expressions and returns the following sequence:

$$(5), (12) \xrightarrow{\text{output}} (5, 6, 7, 8, 9, 10, 11, 12)$$
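The evaluation steps above can be replayed with a short Python sketch. It models expressions as functions from a tuple (a dict from variable names to lists of items) to a sequence (a list); this representation and all function names are assumptions made for illustration and do not correspond to Rumble's runtime iterators.

```python
def integer_literal(value):
    """Leaf expression: ignores the input tuple and returns the literal."""
    return lambda t: [value]


def variable_reference(name):
    """Leaf expression: looks up the sequence bound to name in the tuple."""
    return lambda t: t[name]


def range_expression(left, right):
    """Easy Expression: both children receive the unmodified input tuple."""
    def evaluate(t_in):
        s1 = left(t_in)   # the first child sees T_in unchanged
        s2 = right(t_in)  # the second child sees T_in unchanged as well
        # output step: build the sequence from the two single-item results
        return list(range(s1[0], s2[0] + 1))
    return evaluate


# Model of the query (5 to $max), evaluated on the tuple <$max: (12)>.
query = range_expression(integer_literal(5), variable_reference("$max"))
result = query({"$max": [12]})
print(result)
```

Neither child needs the result of the other, which is what makes the RangeExpression an Easy Expression in the sense defined above.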

An example of a Hard Expression is the FLWOR expression. Intuitively, the difference between Easy Expressions and Hard Expressions is that in Easy Expressions the child-expressions are independent of each other, while in Hard Expressions some of the child-expressions depend on each other. Table 1 lists the Rumble Expressions we considered in this approach.

In the following we show for each of these expressions how they can be represented in this dialect using the operations we have defined.

Easy Expressions

Let $E_{easy}$ be the set of all Easy Expressions from Table 1. Further, let $E_{spec}$ be the set of all Special Expressions and $E_{reg} := E_{easy} \setminus E_{spec}$. An expression $e \in E_{reg}$ is represented in this dialect by exactly one operation, such that we can define a mapping

$$M_{reg} : E_{reg} \to Op$$

where $Op$ is the set of all operations we defined in this dialect. Note that $M_{reg}$ is not injective, which means that some expressions are mapped onto the same operation, and that $M_{reg}$ is not surjective, which means that some of the operations are not in the image of $M_{reg}$. We can then represent an expression $e \in E_{reg}$ in our dialect directly with the operation $op = M_{reg}(e)$. Operations $op \in M_{reg}(E_{reg})$ only use sequence as type, and the exact function type of these operations is determined by the arity of the expression in the preimage. For example, if $e \in E_{reg}$ is a Binary Expression, then the function type of the operation $op = M_{reg}(e)$ is (!jsoniq.sequence, !jsoniq.sequence) -> !jsoniq.sequence.

Operations that represent Easy Expressions only consist of their name, their function type and sometimes a dictionary with some constants. In the following we give the name and the syntax for these operations and explain the mapping $M_{reg}$.

Leaf Expressions

Rumble Expressions

Easy Expressions
    Unary Expressions: ArrayConstructorExpression, NotExpression, UnaryExpression, ArrayUnboxingExpression
    Binary Expressions: AdditiveExpression, RangeExpression, ComparisonExpression, MultiplicativeExpression, AndExpression, OrExpression, StringConcatExpression, ArrayLookupExpression, ObjectLookupExpression
    Ternary Expressions: ConditionalExpression
    N-ary Expressions: FunctionCallExpression
    Leaf Expressions: IntegerLiteralExpression, StringLiteralExpression, DoubleLiteralExpression, DecimalLiteralExpression, NullLiteralExpression, BooleanLiteralExpression
    Typing Expressions: TreatExpression, CastExpression, InstanceOfExpression, CastableExpression
    Special Expressions: ContextItemExpression, VariableReferenceExpression, ObjectConstructorExpression, CommaExpression

Hard Expressions
    Clauses: ForClause, LetClause, WhereClause, CountClause, OrderByClause, GroupByClause, ReturnClause
    PredicateExpression
    FlworExpression

Table 1: Rumble Expressions divided into easy and hard. Note that these are not all Expressions in Rumble. There are a few more which we didn’t cover.

Leaf Expressions are those expressions which represent leaves in the expression trees. This means they do not contain child-expressions, and therefore the function type of the operations $op = M_{reg}(e)$, where $e \in E_{reg}$ is a Leaf Expression, is () -> !jsoniq.sequence. We have defined only one operation for all six Leaf Expressions, and only the dictionary within the operation gives information about the type of the Leaf Expression. For the numerical Leaf Expressions, such as IntegerLiteralExpression, DoubleLiteralExpression and DecimalLiteralExpression, the dictionary contains the key "value", and its value is the corresponding literal and its type. Figure 14 shows how to represent the IntegerLiteralExpression (12) and the DecimalLiteralExpression (3.141), respectively, in this dialect.

%0 = "jsoniq.lit"() {value = 12 : i64 } : () -> !jsoniq.sequence
%1 = "jsoniq.lit"() {value = 3.141 : f64 } : () -> !jsoniq.sequence

Figure 14: lit operation of approach 1

For all other Leaf Expressions, the dictionary contains the key "value" and its value is either "true", "false", "null" or the literal of the StringLiteralExpression. Figure 15 shows how to represent the NullLiteralExpression, the BooleanLiteralExpression (false) and the StringLiteralExpression (hello), respectively, in this dialect.

%0 = "jsoniq.lit"() {value = "null"} : () -> !jsoniq.sequence
%1 = "jsoniq.lit"() {value = "false"} : () -> !jsoniq.sequence
%2 = "jsoniq.lit"() {value = "hello"} : () -> !jsoniq.sequence

Figure 15: lit operation of approach 1

Figure 16 shows the mapping $M_{reg}$ for Leaf Expressions.

[Diagram: $M_{reg} : E_{leaf} \to Op$ maps IntegerLiteralExpression, StringLiteralExpression, NullLiteralExpression, BooleanLiteralExpression, DecimalLiteralExpression and DoubleLiteralExpression all to the operation lit.]

Figure 16: Mapping $M_{reg}$ from Leaf Expressions to operations. Note that $E_{leaf} \subset E_{reg}$ is the set of all Leaf Expressions.

Unary Expressions

Unary Expressions have exactly one child-expression in the expression tree, and therefore the function type of an operation $op = M_{reg}(e)$, where $e$ is a Unary Expression, is (!jsoniq.sequence) -> !jsoniq.sequence. We have defined an operation in our dialect for each of the four Unary Expressions in Table 1. Figure 17 shows the mapping $M_{reg}$ for Unary Expressions.

[Diagram: $M_{reg} : E_{unary} \to Op$ maps NotExpression to not, UnaryExpression to neg, ArrayUnboxingExpression to arrayunboxing and ArrayConstructorExpression to arrayconstructor.]

Figure 17: Mapping $M_{reg}$ from Unary Expressions to operations. Note that $E_{unary} \subset E_{reg}$ is the set of all Unary Expressions.

The following shows how to represent the NotExpression in this dialect.

%1 = "jsoniq.not"(%0) : (!jsoniq.sequence) -> !jsoniq.sequence

Figure 18: Syntax of the not operation

Note that in this example the result of the not operation is stored in the register %1, and the register %0 holds the result of the input operation.

The syntax for all the other operations $op \in M_{reg}(E_{unary})$ is the same, except for the name.


Binary Expressions

Binary Expressions have exactly two child-expressions. Therefore the function type of an operation $op = M_{reg}(e)$, where $e$ is a Binary Expression, is (!jsoniq.sequence, !jsoniq.sequence) -> !jsoniq.sequence. Table 2 shows the mapping $M_{reg}$ for Binary Expressions.

Binary Expression e                     M_reg(e)
AdditiveExpression (+)                  +
AdditiveExpression (-)                  -
RangeExpression                         to
ComparisonExpression (*compSign*)       *compSign*
MultiplicativeExpression (*mulSign*)    *mulSign*
AndExpression                           and
OrExpression                            or
StringConcatExpression                  ||
ArrayLookupExpression                   [[]]
ObjectLookupExpression                  objectlookup

Table 2: Binary Expressions and their corresponding operations. Note that *compSign* $\in$ {eq, ne, lt, le, gt, ge, =, !=, <, <=, >, >=} and *mulSign* $\in$ {*, div, idiv, mod}.

Figure 19 shows how to represent the AndExpression in this dialect.

%2 = "jsoniq.and"(%0, %1) : (!jsoniq.sequence, !jsoniq.sequence) -> !jsoniq.sequence

Figure 19: Syntax of the and operation

Note that in this example the result of the and operation is stored in the register %2. The registers %0 and %1 hold the results of the input operations.

The syntax for all other operations $op \in M_{reg}(E_{binary})$, where $E_{binary}$ is the set of all Binary Expressions, is the same except for the name of the operation.
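How an expression tree built from leaf, unary and binary expressions turns into a list of operations can be sketched in Python. This is an illustration under assumptions, not the transformation implemented in this thesis: the tree is encoded as nested Python tuples, the function emit is hypothetical, and only operation names that appear in the figures above (lit, not, and) are used.

```python
SEQ = "!jsoniq.sequence"


def emit(expr, lines):
    """Post-order walk over an expression tree.

    expr is ("lit", attribute_text) for a leaf, or (op_name, child, ...)
    otherwise. Appends one operation per node to lines and returns the
    register that holds the node's result.
    """
    op, *children = expr
    if op == "lit":
        operands = []
        dictionary = " {value = " + children[0] + "}"
    else:
        # Children are emitted first, so their registers already exist.
        operands = [emit(child, lines) for child in children]
        dictionary = ""
    register = "%" + str(len(lines))  # one line per operation, so this is unique
    arguments = ", ".join(operands)
    input_types = ", ".join([SEQ] * len(operands))
    lines.append(
        f'{register} = "jsoniq.{op}"({arguments}){dictionary} : ({input_types}) -> {SEQ}'
    )
    return register


# Model of the query: true and not(false)
tree = ("and", ("lit", '"true"'), ("not", ("lit", '"false"')))
lines = []
root = emit(tree, lines)
print("\n".join(lines))
```

The function type of every emitted operation follows from the number of children alone, which reflects the observation above that the arity of the expression determines the function type.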

Ternary Expressions

Ternary Expressions have exactly three child-expressions in the expression tree. So the function type of an operation $op = M_{reg}(e)$, where $e$ is a Ternary Expression, is (!jsoniq.sequence, !jsoniq.sequence, !jsoniq.sequence) -> !jsoniq.sequence. In this thesis we considered only one Ternary Expression, namely the ConditionalExpression.

Figure 20 shows the mapping $M_{reg}$ for the ConditionalExpression.

[Diagram: $M_{reg} : E_{ternary} \to Op$ maps ConditionalExpression to the operation conditional.]

Figure 20: Mapping $M_{reg}$ from Ternary Expressions to operations. Note that $E_{ternary} \subset E_{reg}$ defines the set of all Ternary Expressions.

The following shows how to represent the ConditionalExpression in this dialect.

%5 = "jsoniq.conditional"(%2, %3, %4) : (!jsoniq.sequence, !jsoniq.sequence,
    !jsoniq.sequence) -> !jsoniq.sequence

Figure 21: Syntax of the conditional operation.

Note that in this example the result of the conditional operation is stored in the register %5. The registers %2, %3 and %4 hold the results of the input operations.

N-ary Expressions

N-ary Expressions have a variable number of child-expressions. The only N-ary Expression we considered in this thesis is the FunctionCallExpression, and its number of child-expressions depends on the function. The operation $op = M_{reg}(e)$, where $e$ is the FunctionCallExpression, contains the name of the function in the dictionary. We called this operation func. Figure 22 shows the mapping $M_{reg}$ for the FunctionCallExpression.

[Diagram: $M_{reg} : E_{nary} \to Op$ maps FunctionCallExpression to the operation func.]

Figure 22: Mapping $M_{reg}$ from N-ary Expressions to operations. Note that $E_{nary} \subset E_{reg}$ defines the set of N-ary Expressions.

The following example shows how to represent the FunctionCallExpression in this dialect.

%1 = "jsoniq.func"(%0) {funcname = "string-length"} : (!jsoniq.sequence) -> !jsoniq.sequence

Figure 23: Syntax of the func operation.

Note that in this example the function is string-length, which means that the FunctionCallExpression is unary. The func operation is stored in register %1. The register %0 stores the input operation.
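Since the func operation has as many operands as the function has arguments, its textual form can be produced mechanically. The following Python sketch is only an illustration and not part of the thesis implementation: the helper names format_op and func_call are hypothetical, and we assume that a function with n arguments gets n operands of type !jsoniq.sequence. The three-argument call to concat is likewise just an example.

```python
def format_op(result, name, operands, attrs=None, result_type="!jsoniq.sequence"):
    """Render one operation of the jsoniq dialect in MLIR's generic textual form."""
    args = ", ".join(operands)
    # Assumption: every operand has type !jsoniq.sequence, as in Figures 21 and 23.
    in_types = ", ".join("!jsoniq.sequence" for _ in operands)
    attr_text = ""
    if attrs:
        pairs = ", ".join(f'{key} = "{value}"' for key, value in attrs.items())
        attr_text = f" {{{pairs}}}"
    return f'{result} = "jsoniq.{name}"({args}){attr_text} : ({in_types}) -> {result_type}'


def func_call(result, funcname, operands):
    """Render a func operation; the function name goes into the dictionary."""
    return format_op(result, "func", operands, {"funcname": funcname})


# Unary call, as in Figure 23.
print(func_call("%1", "string-length", ["%0"]))
# Hypothetical call with three arguments: three operands, three input types.
print(func_call("%3", "concat", ["%0", "%1", "%2"]))
```

The first call reproduces the line of Figure 23; the second only shows how the function type grows with the number of operands.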

Typing Expressions

Typing Expressions are Unary Expressions. An operation \( op = M_{reg}(e) \), where \( e \) is a Typing Expression, differs from the operations for Unary Expressions mentioned above only in that it contains a dictionary with the type considered in the expression. We have defined an operation for each of the four Typing Expressions in Table 1. Figure 24 shows the mapping \( M_{reg} \) for Typing Expressions.

The listing in figure 25 shows how to represent the CastableExpression in this dialect. Note that in this example the type we consider is string, and therefore we write it as the value of the key type in the dictionary. The castable operation is stored in register %1. The register %0 stores the input operation.

The syntax for all other operations \( op \in M_{reg}(E_{typing}) \) is the same except for the name of the operation and the dictionary, which depends on the type we consider in the expression of the preimage.

\( M_{reg} : E_{typing} \to Op \), with TreatExpression \( \mapsto \) treat, InstanceOfExpression \( \mapsto \) instanceof, CastableExpression \( \mapsto \) castable and CastExpression \( \mapsto \) cast.

Figure 24: Mapping \( M_{reg} \) from Typing Expressions to operations. Note that \( E_{typing} \subset E_{reg} \) defines the set of Typing Expressions.

%1 = "jsoniq.castable"(%0) {type = "string"} : (!jsoniq.sequence) -> !jsoniq.sequence

Figure 25: Syntax of the castable operation.

Special Expressions

For the Special Expressions \( E_{spec} \) we no longer have a mapping from expressions to operations, because some of the expressions \( e \in E_{spec} \) are not represented by an operation in this dialect and some other expressions are represented by more than one operation.

ContextItemExpression

The ContextItemExpression is only used within the PredicateExpression and is not represented by an operation in this dialect. Instead, whenever we need to represent a ContextItemExpression in the IR, we can use the block argument of the PredicateExpression (see later).

VariableReferenceExpression

The VariableReferenceExpression is a Leaf Expression in the expression tree. However, the operation in this dialect that represents the VariableReferenceExpression, which we called varref, does not have the function type () -> (!jsoniq.sequence) but (!jsoniq.tuple<>) -> !jsoniq.sequence. This means it takes a tuple as input, looks up the reference and then returns its value as a sequence. The name of the variable is stored in the dictionary of the varref operation.

The following example shows how to represent the VariableReferenceExpression ($i) in this dialect.

%3 = "jsoniq.varref"(%arg0) {var = "i"} : (!jsoniq.tuple<i>) -> !jsoniq.sequence

Figure 26: Syntax of the varref operation.

Note that in this example the varref operation is stored in register %3. The register %arg0 stores the input operation. The input operation of the varref operation always refers to a block argument (see later).

ObjectConstructorExpression

The ObjectConstructorExpression is an N-ary Expression in Rumble. Its child-expressions are the expressions for the keys and the values. The constructobject operation from our dialect, however, is dyadic. This means its function type is (!jsoniq.sequence, !jsoniq.sequence) -> !jsoniq.sequence. The constructobject operation can therefore only represent objects with one key-value pair. Figure 27 shows how to represent such an object in our dialect.

%2 = "jsoniq.constructobject"(%0, %1) : (!jsoniq.sequence, !jsoniq.sequence) -> !jsoniq.sequence

Figure 27: Syntax of the constructobject operation.

Note that in this example the constructobject operation is stored in register %2. The registers %0 and %1 store the input operations.

To represent objects with more than one key-value pair, we defined the operation mergeobjects. It is again a dyadic operation: it takes as arguments two registers that store representations of objects in the IR and represents the merged object. The following example shows the syntax of the mergeobjects operation.

%9 = "jsoniq.mergeobjects"(%2, %5) : (!jsoniq.sequence, !jsoniq.sequence) -> !jsoniq.sequence

Figure 28: Syntax of the mergeobjects operation.

Note that in this example the mergeobjects operation is stored in register %9. The registers %2 and %5 store the input operations, which in this example need to be representations of objects. This means each of them is either a constructobject or a mergeobjects operation.

We can now represent an object with \( n \in \mathbb{N}_{>0} \) key-value pairs using \( n \) constructobject operations followed by \( (n - 1) \) mergeobjects operations.
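The counting argument above can be made concrete with a small Python sketch. It is an illustration under assumptions, not the thesis implementation: the function name emit_object is hypothetical, the registers of the keys and values are taken as given, and merging from left to right is our choice (the text only fixes the number of operations).

```python
SEQ = "!jsoniq.sequence"
BINARY_TYPE = f"({SEQ}, {SEQ}) -> {SEQ}"


def emit_object(pair_registers, first_free):
    """Emit n constructobject ops followed by (n - 1) mergeobjects ops.

    pair_registers: list of (key_register, value_register), n >= 1.
    first_free: number of the first unused register.
    Returns (list of IR lines, register holding the whole object).
    """
    if not pair_registers:
        raise ValueError("use the emptyobject operation for objects without pairs")
    ops, singles, next_id = [], [], first_free
    # One constructobject per key-value pair.
    for key_reg, value_reg in pair_registers:
        reg = f"%{next_id}"
        next_id += 1
        ops.append(f'{reg} = "jsoniq.constructobject"({key_reg}, {value_reg}) : {BINARY_TYPE}')
        singles.append(reg)
    # Merge the single-pair objects one by one (left to right, by assumption).
    merged = singles[0]
    for single in singles[1:]:
        reg = f"%{next_id}"
        next_id += 1
        ops.append(f'{reg} = "jsoniq.mergeobjects"({merged}, {single}) : {BINARY_TYPE}')
        merged = reg
    return ops, merged


ops, result = emit_object([("%0", "%1"), ("%2", "%3"), ("%4", "%5")], first_free=6)
print("\n".join(ops))
```

For three key-value pairs this prints three constructobject lines and two mergeobjects lines, and result names the register of the last mergeobjects operation.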

To represent the empty object, we defined the niladic operation emptyobject. The following example shows the syntax of the emptyobject operation.

%1 = "jsoniq.emptyobject"() : () -> !jsoniq.sequence

Figure 29: Syntax of the emptyobject operation.

Note that in this example the emptyobject operation is stored in register %1.

CommaExpression

The CommaExpression is an N-ary Expression in Rumble. The comma operation we defined in this dialect, however, is dyadic. This means its function type is (!jsoniq.sequence, !jsoniq.sequence) -> !jsoniq.sequence. A single comma operation can therefore only represent a sequence of two items. To represent a sequence of \( n > 2 \) items, we can just use \( (n - 1) \) comma operations. The following example shows the syntax of the comma operation.

%2 = "jsoniq.comma"(%0, %1) : (!jsoniq.sequence, !jsoniq.sequence) -> !jsoniq.sequence

Figure 30: Syntax of the comma operation.

Note that in this example the comma operation is stored in register %2. The registers %0 and %1 store the input operations.
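As a brief illustration of the \( (n - 1) \) comma operations needed for \( n \) items, the following Python sketch chains the dyadic operation. The name emit_comma_chain is hypothetical, and nesting to the left is an assumption; the dialect only requires that each comma operation has two operands.

```python
SEQ = "!jsoniq.sequence"


def emit_comma_chain(item_registers, first_free):
    """Combine n >= 2 item registers with (n - 1) dyadic comma operations."""
    if len(item_registers) < 2:
        raise ValueError("a comma operation needs at least two items")
    ops, next_id = [], first_free
    current = item_registers[0]
    for item in item_registers[1:]:
        reg = f"%{next_id}"
        next_id += 1
        # Each step appends one more item to the sequence built so far.
        ops.append(f'{reg} = "jsoniq.comma"({current}, {item}) : ({SEQ}, {SEQ}) -> {SEQ}')
        current = reg
    return ops, current


ops, result = emit_comma_chain(["%0", "%1", "%2", "%3"], first_free=4)
print("\n".join(ops))
```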

Hard Expressions

The operations in our dialect that represent Hard Expressions consist not only of their name, their function type and sometimes a dictionary, but also of a region with a block inside. The following sections show how to represent Hard Expressions in this dialect.

Clauses

Clauses are Hard Expressions, because the child-expressions of a clause depend on the previous clause. In the following we show how to represent each Clause in this dialect.

ForClause

We represent the ForClause with the for operation in this dialect. The for operation is monadic and its function type is (!jsoniq.tuplestream<>) -> !jsoniq.tuplestream<>. Similar to the varref operation, the for operation stores the name of the variable in its dictionary. The argument of the block inside the region of the operation has type tuple<>.

The following example shows how to represent ForClause (i) in this dialect.

%1 = "jsoniq.for"(%0) ( {
  ^bb0(%arg0 : !jsoniq.tuple<>):
    // body
}) {var = "i"} : (!jsoniq.tuplestream<>) -> !jsoniq.tuplestream<i>

Figure 31: Syntax of the for operation.

We will see later with which operations we replace // body. Note that in this example the for operation is stored in register %1. Register %0 stores the input operation. In the case of a for operation the input is always the last register used to store the previous clause.

The ForClause can be the first Clause within a FlworExpression. In this case it has no previous Clause, and therefore we need an operation that represents the initial tuple-stream. The operation stream does exactly this. stream is a niladic operation and its function type is () -> !jsoniq.tuplestream<>. The following example shows the syntax of the stream operation.

%0 = "jsoniq.stream"() : () -> !jsoniq.tuplestream<>

Figure 32: Syntax of the stream operation.

Note that the stream operation is stored in register %0.

LetClause

The LetClause works in the same way as the ForClause. The operation that represents the LetClause is called let. The following example shows how to represent LetClause (i) in this dialect.

%1 = "jsoniq.let"(%0) ( {
  ^bb0(%arg0 : !jsoniq.tuple<>):
    // body
}) {var = "i"} : (!jsoniq.tuplestream<>) -> !jsoniq.tuplestream<i>

Figure 33: Syntax of the let operation.

Again we will see later with which operations we replace // body.

Note that the LetClause can also be the first clause within a FlworExpression, and therefore we sometimes also need a stream operation in front of the let operation.

WhereClause

The WhereClause works in the same way as the ForClause, except that the operation which represents the WhereClause, which we called where, does not contain a dictionary. The following example shows how to represent the WhereClause in this dialect.

%2 = "jsoniq.where" (%1) ( {
  ^bb0(%arg0: !jsoniq.tuple<i>):
    // body
}) : (!jsoniq.tuplestream<i>) -> !jsoniq.tuplestream<i>

Figure 34: Syntax of the where operation.

Note that in this example the where operation is stored in register %2. The register %1 stores the last operation used to represent the previous clause.

CountClause

We represent the CountClause with the count operation in this dialect. The count operation is monadic and its function type is (!jsoniq.tuplestream<>) -> !jsoniq.tuplestream<>. Unlike the other operations that represent Clauses in this dialect, the count operation does not contain a region (see footnote 5), but it contains a dictionary that stores the name of the variable the CountClause binds to. The following example shows how to represent the CountClause in this dialect.

%2 = "jsoniq.count" (%1) {var = "c"} : (!jsoniq.tuplestream<i>) -> !jsoniq.tuplestream<i, c>

Figure 35: Syntax of the count operation.

Note that in this example the count operation is stored in register %2. The register %1 stores the input operation, which in this case is the last operation used to represent the previous clause.

OrderByClause

We represent the OrderByClause of a single variable by the operation orderby in this dialect. The orderby operation is monadic and its function type is (!jsoniq.tuplestream<>) -> !jsoniq.tuplestream<>. The operation contains a region and a dictionary which stores the rule of the ordering, i.e. either ascending or descending. The block argument of the region has type tuple<>. The following example shows how to represent the OrderByClause of a single variable in this dialect.

%4 = "jsoniq.orderby" (%3) ( {
  ^bb0(%arg0 : !jsoniq.tuple<i, j>):
    // body
}) {rule = "descending"} : (!jsoniq.tuplestream<i, j>) -> !jsoniq.tuplestream<i, j>

Figure 36: Syntax of the orderby operation.

Note that in this example the orderby operation is stored in register %4. Register %3 stores the input operation. We will see later with which operations we replace // body.

If we want to represent an OrderByClause of \( n > 1 \) variables, then we can use \( n \) consecutive orderby operations.

GroupByClause

We represent a GroupByClause of a single variable in this dialect with two operations. First we define the variable using the let operation, and afterwards we represent the grouping by this variable with the groupby operation. The groupby operation is monadic and its function type is (!jsoniq.tuplestream<>) -> !jsoniq.tuplestream<>. It does not contain a region, but a dictionary that stores the name of the variable. Figure 37 shows how to represent GroupByClause (j) in this dialect.

%2 = "jsoniq.let" (%1) ( {
  ^bb0(%arg0 : !jsoniq.tuple<i>):
    // body
}) {var = "j"} : (!jsoniq.tuplestream<i>) -> !jsoniq.tuplestream<i, j>
%3 = "jsoniq.groupby" (%2) {var = "j"} : (!jsoniq.tuplestream<i, j>) -> !jsoniq.tuplestream<i, j>

Figure 37: Representation of GroupByClause (j) in this dialect.

We will see later with which operations we replace // body. Note that the register %2 stores the let operation and is used as argument of the groupby operation.

Footnote 5: The CountClause is the only Hard Expression that does not contain a region in this dialect.

If we want to represent a GroupByClause of \( n > 1 \) variables, then we just use \( n \) consecutive representations of a GroupByClause of one variable.
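The following Python sketch illustrates how the let/groupby pairs can be chained and how the variable list in the tuple-stream type grows with each pair. It is a sketch under assumptions: the function names are hypothetical, the region bodies are left as // body placeholders (so registers used inside the regions are not counted), and each grouping variable is assumed to be newly bound by its let operation, as in Figure 37.

```python
def stream_type(variables):
    """Type of a tuple-stream whose tuples bind the given variables."""
    return f"!jsoniq.tuplestream<{', '.join(variables)}>"


def tuple_type(variables):
    """Type of the block argument: one tuple of the incoming stream."""
    return f"!jsoniq.tuple<{', '.join(variables)}>"


def emit_group_by(input_reg, bound_vars, group_vars, first_free):
    """Emit one let + groupby pair per grouping variable."""
    ops, current, variables, next_id = [], input_reg, list(bound_vars), first_free
    for var in group_vars:
        let_reg, group_reg = f"%{next_id}", f"%{next_id + 1}"
        after_let = variables + [var]  # the let operation adds the variable
        ops.append(
            f'{let_reg} = "jsoniq.let" ({current}) ( {{\n'
            f'  ^bb0(%arg0 : {tuple_type(variables)}):\n'
            f'    // body\n'
            f'}}) {{var = "{var}"}} : ({stream_type(variables)}) -> {stream_type(after_let)}'
        )
        # groupby keeps the set of variables unchanged.
        ops.append(
            f'{group_reg} = "jsoniq.groupby" ({let_reg}) {{var = "{var}"}} : '
            f'({stream_type(after_let)}) -> {stream_type(after_let)}'
        )
        variables, current, next_id = after_let, group_reg, next_id + 2
    return ops, current


ops, result = emit_group_by("%1", ["i"], ["j"], first_free=2)
print("\n".join(ops))
```

With the arguments shown, the output corresponds to the listing of Figure 37; passing two grouping variables yields two such pairs, the second operating on the output of the first.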

ReturnClause

We represent the ReturnClause with the return operation in this dialect. The return operation is monadic and its function type is (!jsoniq.tuplestream<>) -> !jsoniq.sequence. It contains a region, but no dictionary. The block argument of the region has type tuple.

The following example shows how to represent the ReturnClause in this dialect.

%4 = "jsoniq.return" (%3) ( {
  ^bb0(%arg0 : !jsoniq.tuple<i, j>):
    // block
}) : (!jsoniq.tuplestream<i, j>) -> !jsoniq.sequence

Figure 38: Syntax of the return operation.

Note that the return operation is stored in register %4. The register %3 stores the input operation. We will see later with which operations we replace // block.

PredicateExpression

We represent the PredicateExpression in this dialect with the [] operation. The [] operation is monadic and its function type is (!jsoniq.sequence) -> !jsoniq.sequence. Since it represents the PredicateExpression, which is a Hard Expression, it contains a region with a block inside. The argument of this block has type sequence. The [] operation does not contain a dictionary. Figure 39 shows how to represent the PredicateExpression in this dialect.

%3 = "jsoniq.[]" (%2) ( {
  ^bb0(%arg0 : !jsoniq.sequence):
    // block
}) : (!jsoniq.sequence) -> !jsoniq.sequence

Figure 39: Syntax of the [] operation.

Note that the [] operation is stored in register %3. The register %2 stores the input operation.

FlworExpression

We don't have an operation that represents the FlworExpression directly in this approach. Instead we represent the FlworExpression by representing its clauses.

Other Operations

The terminator operation we defined in this dialect is used to end blocks. It is a monadic operation but returns zero results, so its function type is (!jsoniq.sequence) -> (). Figure 40 shows the syntax of the terminator operation.

"jsoniq.terminator"(%6) : (!jsoniq.sequence) -> ()

Figure 40: Syntax of the terminator operation.

Note that since the terminator operation returns zero results, we don't need to store the operation in a register. Register %6 stores the input operation.

3.1.3 Representation of Queries

So far we have seen how to represent single nodes of an expression tree in our IR. In the following we show how to represent the whole expression tree. In particular, we show how to connect the representations of the nodes of an expression tree such that we end up with an IR that represents the expression tree and hence the query. Queries in Rumble always return a sequence of items. So we can represent Rumble queries as a function with the name query and function type () -> !jsoniq.sequence in our MLIR dialect. Figure 41 shows what this looks like.

func @query() -> !jsoniq.sequence {
  // body
}

Figure 41: Query as a function in the IR.

In the body of the function we represent the expression tree. To do so we visit it and represent each node with the operations from our dialect as described above. Operations without a region use as input arguments the registers that store the last operation used to represent the child-expressions. Clauses use as input argument the register that stores the last operation used to represent the previous clause. The representation of the child-expression of a clause is inside the block of the region.

The PredicateExpression takes as input argument the register that stores the last operation used to represent its left child-expression. The representation of the right child-expression is inside the block of the region. The terminator operation takes as input argument the register that stores the last operation used to represent the body of a block.

Finally, we return the register which stores the last operation used to represent the root of the expression tree with the return operation from the standard dialect [14].

Let's consider the following query as an example.

for $i in (1 to 6) where $i ge 5 return $i + 1     (3)

Figure 42 shows the expression tree of this query. The representation of the query in (3) in our dialect is shown in Figure 43.

FlworExpression
  ReturnClause
    AdditiveExpression (+)
      VariableReferenceExpression ($i)
      IntegerLiteralExpression (1)
    WhereClause
      ComparisonExpression (ge)
        VariableReferenceExpression ($i)
        IntegerLiteralExpression (5)
      ForClause ($i)
        RangeExpression
          IntegerLiteralExpression (1)
          IntegerLiteralExpression (6)

Figure 42: Expression tree of the query in (3)

func @query() -> !jsoniq.sequence {
  %0 = "jsoniq.stream"() : () -> !jsoniq.tuplestream<>
  %1 = "jsoniq.for"(%0) ( {
    ^bb0(%arg0 : !jsoniq.tuple<>):
      %2 = "jsoniq.lit"() {value = 1 : i64 } : () -> !jsoniq.sequence
      %3 = "jsoniq.lit"() {value = 6 : i64 } : () -> !jsoniq.sequence
      %4 = "jsoniq.to"(%2, %3) : (!jsoniq.sequence, !jsoniq.sequence) -> !jsoniq.sequence
      "jsoniq.terminator"(%4) : (!jsoniq.sequence) -> ()
  }) {var = "i"} : (!jsoniq.tuplestream<>) -> !jsoniq.tuplestream<i>
  %2 = "jsoniq.where" (%1) ( {
    ^bb0(%arg0: !jsoniq.tuple<i>):
      %3 = "jsoniq.varref"(%arg0) {var = "i"} : (!jsoniq.tuple<i>) -> !jsoniq.sequence
      %4 = "jsoniq.lit"() {value = 5 : i64 } : () -> !jsoniq.sequence
      %5 = "jsoniq.ge"(%3, %4) : (!jsoniq.sequence, !jsoniq.sequence) -> !jsoniq.sequence
      "jsoniq.terminator"(%5) : (!jsoniq.sequence) -> ()
  }) : (!jsoniq.tuplestream<i>) -> !jsoniq.tuplestream<i>
  %3 = "jsoniq.return" (%2) ( {
    ^bb0(%arg0 : !jsoniq.tuple<i>):
      %4 = "jsoniq.varref"(%arg0) {var = "i"} : (!jsoniq.tuple<i>) -> !jsoniq.sequence
      %5 = "jsoniq.lit"() {value = 1 : i64 } : () -> !jsoniq.sequence
      %6 = "jsoniq.+"(%4, %5) : (!jsoniq.sequence, !jsoniq.sequence) -> !jsoniq.sequence
      "jsoniq.terminator"(%6) : (!jsoniq.sequence) -> ()
  }) : (!jsoniq.tuplestream<i>) -> !jsoniq.sequence
  return %3 : !jsoniq.sequence
}

Figure 43: Representation of the query in (3) in our IR
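The visiting scheme described in this section can be sketched in Python for the region-free part of the dialect. This is a simplified illustration, not the translator used in this thesis: the class Emitter, the function emit_query and the tuple encoding of the expression tree are hypothetical, and only integer literals and binary operations are handled (clauses and their regions are left out).

```python
SEQ = "!jsoniq.sequence"


class Emitter:
    """Collects IR lines while visiting a small expression tree in post-order."""

    def __init__(self):
        self.ops = []
        self.next_id = 0

    def fresh(self):
        reg = f"%{self.next_id}"
        self.next_id += 1
        return reg

    def visit(self, node):
        """Emit the children first, then the node; return the node's register."""
        kind = node[0]
        if kind == "lit":  # ("lit", integer value)
            reg = self.fresh()
            self.ops.append(f'{reg} = "jsoniq.lit"() {{value = {node[1]} : i64 }} : () -> {SEQ}')
            return reg
        if kind == "binary":  # ("binary", operation name, left child, right child)
            _, name, left, right = node
            left_reg = self.visit(left)
            right_reg = self.visit(right)
            reg = self.fresh()
            self.ops.append(
                f'{reg} = "jsoniq.{name}"({left_reg}, {right_reg}) : ({SEQ}, {SEQ}) -> {SEQ}'
            )
            return reg
        raise ValueError(f"unsupported node kind: {kind}")


def emit_query(tree):
    """Wrap the emitted operations in the query function and return the root register."""
    emitter = Emitter()
    root = emitter.visit(tree)
    body = [f"  {op}" for op in emitter.ops]
    body.append(f"  return {root} : {SEQ}")
    return "\n".join([f"func @query() -> {SEQ} {{", *body, "}"])


# The query (1 to 6) as an expression tree.
print(emit_query(("binary", "to", ("lit", 1), ("lit", 6))))
```

The register returned for the root of the tree is the one handed to the final return, which mirrors the last step of the description above.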


3.1.4 Theoretical Aspects of the IR

Let \( Q \) be the set of all valid Rumble queries that we can represent in the dialect of approach 1. The IR of \( q \in Q \) in this dialect can be modeled as a polytree [15] \( T_q = (V_q, E) \), where \( V_q \) is the set of the registers in the IR of \( q \) and \( E \subseteq V_q \times V_q \). A node \( v \in V_q \) has a directed edge to a node \( w \in V_q \) if the operation stored in \( v \) has the register \( w \) as input argument. If \( u \in V_q \) stores an operation that contains a region, then \( u \) has an additional edge to the register that is given as argument to the terminator operation of the block. By this construction each register of the IR of a query \( q \in Q \) induces its own tree, and we denote by \( T_q(op) \) the tree with the register \( op \in V_q \) as root.

With this definition we can write \( T_q \) as \( T_q(root) \), where \( root \in V_q \) is the register returned by the MLIR function that represents \( q \). In the further course of this discussion we call this register the root register of the IR of \( q \).

Given the IR of approach 1 of a query \( q \in Q \), we show in the following how to inductively construct \( T_q \).

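Basis: Let \( Op_{leaf} := \{varref, stream, lit, emptyobject\} \). Then \( T_q(op) \), where \( op \) stores an operation from \( Op_{leaf} \) or is a block argument, has the form given in figure 44.

\( T_q(op) := \) the tree consisting of the single node \( op \), whose only subtree is the empty tree.

Figure 44: \( T_q(op) \) where \( op \in Op_{leaf} \)

Note that the symbol drawn below \( op \) in the figure denotes the empty tree.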

Unary Nodes: Let

\( Op_{mon} := \{count, castable, cast, instanceof, treat, not, arrayconstructor, arrayunboxing, neg\} \)

Then \( T_q(op) \), where \( op \) stores an operation from \( Op_{mon} \), has the form given in figure 45.

\( T_q(op) := \) the tree with root \( op \) and the single subtree \( T_q(op_1) \).

Figure 45: \( T_q(op) \) where \( op \in Op_{mon} \)

With \( op_1 \in V_q \) we denote the register given as input to the operation stored in \( op \).

Binary Nodes: Let

\( Op_{bin} \) := {[], return, groupby, orderby, where, let, for, comma, mergeobjects, constructobject, +, -, to, eq, ne, lt, le, gt, ge, =, !=, <, <=, >, >=, *, div, mod, idiv, and, or, ||, [[]], objectlookup}

Then \( T_q(op) \), where \( op \) stores an operation from \( Op_{bin} \), has the form given in figure 46.

\( T_q(op) := \) the tree with root \( op \) and the two subtrees \( T_q(op_1) \) and \( T_q(op_2) \).

Figure 46: \( T_q(op) \) where \( op \in Op_{bin} \)

With \( op_1, op_2 \) we denote the registers given as input to the operation stored in \( op \). If \( op \) stores an operation with a region, then \( op_2 \) refers to the register that is given as argument to the terminator operation of the block of this region.

Ternary Nodes: Let \( op \) be a register that stores a conditional operation. Then \( T_q(op) \) has the form given in figure 47.

\( T_q(op) := \) the tree with root \( op \) and the three subtrees \( T_q(op_1) \), \( T_q(op_2) \) and \( T_q(op_3) \).

Figure 47: \( T_q(op) \) where \( op \) is conditional

With \( op_1, op_2, op_3 \) we denote the registers given as input to the operation stored in \( op \).

N-ary Nodes: Let \( op \) be a register that stores a func operation. Then \( T_q(op) \) has the form given in figure 48.

\( T_q(op) := \) the tree with root \( op \) and the subtrees \( T_q(op_1), ..., T_q(op_n) \).

Figure 48: \( T_q(op) \) where \( op \) is func

With \( op_1, ..., op_n \) we denote the registers given as input to the operation stored in \( op \).

At this point we have covered all possible cases and therefore know, for the IR of a query \( q \in Q \) in approach 1, how to construct \( T_q = T_q(op) \), where \( op \) is the root register of the IR of \( q \).
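The inductive construction can be sketched in a few lines of Python. The sketch is an illustration only and makes the following assumptions: the names Op, build_tree and QUERY_4 are hypothetical; registers are given unique ids (prefixed with for. or ret. inside the regions), because the IR reuses register numbers across regions; and, following the basis case, varref is treated as a leaf, so its block argument is not listed as an input. The example encodes the IR of the query for $i in (1 to 10) return $i * 2.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Op:
    name: str
    inputs: List[str] = field(default_factory=list)  # ids of the input registers
    region_result: Optional[str] = None  # register handed to the region's terminator


def build_tree(ops: Dict[str, Op], root: str):
    """Return T_q(root) as a nested (operation name, list of subtrees) pair."""
    op = ops[root]
    # One subtree per input register (covers the unary, binary, ternary and n-ary cases).
    children = [build_tree(ops, reg) for reg in op.inputs]
    # Operations with a region get the additional edge to the terminator's argument.
    if op.region_result is not None:
        children.append(build_tree(ops, op.region_result))
    return (op.name, children)


QUERY_4 = {
    "%0": Op("stream"),
    "for.%2": Op("lit"),
    "for.%3": Op("lit"),
    "for.%4": Op("to", ["for.%2", "for.%3"]),
    "%1": Op("for", ["%0"], region_result="for.%4"),
    "ret.%3": Op("varref"),
    "ret.%4": Op("lit"),
    "ret.%5": Op("*", ["ret.%3", "ret.%4"]),
    "%2": Op("return", ["%1"], region_result="ret.%5"),
}

print(build_tree(QUERY_4, "%2"))
```

Starting from the root register %2, the recursion yields the tree return(for(stream, to(lit, lit)), *(varref, lit)).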


Relation to the Expression Tree

In the tree \( T_q \), \( q \in Q \), we can group all nodes that store operations used to represent the same expression into one node. The tree resulting from this process has the same shape as the expression tree. This fact follows directly from the definition of \( T_q \).

Let's consider the following query as an example.

for $i in (1 to 10) return $i * 2     (4)

The expression tree of this query has the following form (see footnote 6).

ReturnClause
  MultiplicativeExpression
    VariableReferenceExpression (i)
    IntegerLiteralExpression (2)
  ForClause
    RangeExpression
      IntegerLiteralExpression (1)
      IntegerLiteralExpression (10)

The IR in the dialect of approach 1 of this query is the following.

func @query() -> !jsoniq.sequence {
  %0 = "jsoniq.stream"() : () -> !jsoniq.tuplestream<>
  %1 = "jsoniq.for"(%0) ( {
    ^bb0(%arg0 : !jsoniq.tuple<>):
      %2 = "jsoniq.lit"() {value = 1 : i64 } : () -> !jsoniq.sequence
      %3 = "jsoniq.lit"() {value = 10 : i64 } : () -> !jsoniq.sequence
      %4 = "jsoniq.to"(%2, %3) : (!jsoniq.sequence, !jsoniq.sequence) -> !jsoniq.sequence
      "jsoniq.terminator"(%4) : (!jsoniq.sequence) -> ()
  }) {var = "i"} : (!jsoniq.tuplestream<>) -> !jsoniq.tuplestream<i>
  %2 = "jsoniq.return" (%1) ( {
    ^bb0(%arg0 : !jsoniq.tuple<i>):
      %3 = "jsoniq.varref"(%arg0) {var = "i"} : (!jsoniq.tuple<i>) -> !jsoniq.sequence
      %4 = "jsoniq.lit"() {value = 2 : i64 } : () -> !jsoniq.sequence
      %5 = "jsoniq.*"(%3, %4) : (!jsoniq.sequence, !jsoniq.sequence) -> !jsoniq.sequence
      "jsoniq.terminator"(%5) : (!jsoniq.sequence) -> ()
  }) : (!jsoniq.tuplestream<i>) -> !jsoniq.sequence
  return %2 : !jsoniq.sequence
}

The tree induced by this IR is given in figure 49.

return
  for
    stream
    to
      lit
      lit
  *
    varref
    lit

Figure 49: \( T_q \) where \( q \) is the query in (4). Note that the nodes of the tree are renamed with the names of the operations they store.

The clustered nodes store the registers used to represent the same expression. It is straightforward to see that the tree with the clustered nodes has the same shape as the expression tree.

Footnote 6: Usually the ReturnClause would have the FlworExpression as a parent expression in the tree, but since the FlworExpression just takes the output sequence of the return clause and returns it, it is not necessarily needed.

3.2 Approach 2

This approach is motivated by some ideas of Daniel Yu's master thesis Efficient Processing of Almost-Homogeneous Semi-Structured Data [3]. In his thesis he redefined JSONiq expressions from \( e : DC \to S \) to \( e : TS \to TS \).

The key insight of this idea is that we can represent a sequence of items as a tuple with the special key $$, and we can represent a tuple as a tuple-stream of length one. The following example shows the sequence (1, null, "json", [true, 2, "name"]) represented as a tuple-stream.

(< $$ : (1,null,"json",[true,2,"name"]) >)

Figure 50 illustrates a generic JSONiq expression using the redefinition [3].

[Figure 50 diagram: an expression \( e \) with input \( TS_{in} \) and output \( TS_{out} \); child-expressions \( e_1, e_2, ..., e_n \) with inputs \( TS^{in}_1, ..., TS^{in}_n \) and outputs \( TS^{out}_1, ..., TS^{out}_n \); functions \( \delta_1, \delta_2, ..., \delta_n \) and output.]

Figure 50: Illustration of a generic expression represented as a mapping \( TS \to TS \)

In this approach we defined a dialect with namespace jsoniq2 that contains operations which represent expressions exactly as illustrated in figure 50. This means we defined operations that represent the functions \( \delta_1, ..., \delta_n \) and output. In the further course of this section we introduce the types and operations of this dialect and again provide some conceptual aspects of the IR.


3.2.1 Types

Since we consider JSONiq expressions as mappings \( TS \to TS \) in this approach, we only defined one type, called TS, that represents a tuple-stream. Using only this type we can represent every Rumble expression, and thus every Rumble query, in this dialect.
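To illustrate why a single tuple-stream type suffices, the following Python sketch wraps a sequence of items into a tuple-stream of length one, using the special key $$ introduced at the beginning of this section. The data model is an assumption made for the example only: a tuple-stream is a Python list, a tuple is a dictionary, JSON null is None and a JSON array is a list. The function name sequence_to_tuple_stream is hypothetical.

```python
CONTEXT_KEY = "$$"  # the special key under which a sequence is stored


def sequence_to_tuple_stream(items):
    """Represent a sequence as a tuple-stream that contains exactly one tuple."""
    single_tuple = {CONTEXT_KEY: tuple(items)}
    return [single_tuple]


# The sequence (1, null, "json", [true, 2, "name"]) as a tuple-stream.
stream = sequence_to_tuple_stream([1, None, "json", [True, 2, "name"]])
print(stream)
```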

3.2.2 Operations

As explained above, we defined for each function \( \delta_1, ..., \delta_n \) and output within an expression a corresponding operation in our dialect. The operations in this approach never contain a region and are therefore flat. In the following we show, for each expression we considered in this approach, the corresponding operations.

Atomic Expressions

Atomic Expressions are the ones with zero child-expressions. Figure 51 shows an illustration of an atomic expression.

[Figure 51 diagram: an Atomic Expression \( e \) with input \( TS_{input} \), output \( TS_{output} \) and the function output.]

Figure 51: Illustration of an atomic expression. For a more detailed explanation, please take a look at [3].

As one can see, Atomic Expressions contain only one function (denoted by output). This means we defined for each Atomic Expression one operation which represents output. In this dialect we gave the operations that represent the output functions of an expression the name of the expression, e.g. the output function of the StringLiteralExpression is represented by the operation called jsoniq2.StringLiteralExpression.

The Atomic Expressions we considered in this approach are: IntegerLiteralExpression, StringLiteralExpression, DoubleLiteralExpression, DecimalLiteralExpression, NullLiteralExpression, BooleanLiteralExpression, VariableReferenceExpression, ContextItemExpression.

Figure 52 shows the representation of IntegerLiteralExpression (10) in this dialect.

%1 = "jsoniq2.IntegerLiteralExpression"(%0) {value = 10} : (!jsoniq2.TS) -> !jsoniq2.TS

Figure 52: Representation of the IntegerLiteralExpression (10).

Note that the register %0 stores the operation that represents \( TS_{input} \).

All other Atomic Expressions are represented in the same way, except that the operation that represents the output function of the ContextItemExpression does not contain a dictionary and the operation that represents the VariableReferenceExpression contains a dictionary with the name of the variable.


Unary Expressions

Unary Expressions have exactly one sub-expression. Figure 53 gives an illustration of a unary expression.

[Figure 53 diagram: a Unary Expression \( e \) with input \( TS_{in} \) and output \( TS_{out} \); sub-expression \( e_1 \) with input \( TS^{in}_1 \) and output \( TS^{out}_1 \); functions \( \delta_1 \) and output.]

Figure 53: Illustration of a Unary Expression. For a more detailed explanation, please take a look at [3].

As one can see, Unary Expressions contain two functions: \( \delta_1 \) and output. This means we represent Unary Expressions in this dialect with the operations that represent these two functions. In this dialect we gave the operations that represent the functions \( \delta_1, ..., \delta_n \) of an expression the name of the expression prefixed with the function name, e.g. the \( \delta_1 \) function of the NotExpression is represented by the operation called jsoniq2.delta1NotExpression.

The Unary Expressions we considered in this approach are: UnaryExpression and NotExpression. The following example shows how to represent -10 in this dialect.

%1 = "jsoniq2.delta1UnaryExpression"(%0) : (!jsoniq2.TS) -> !jsoniq2.TS
%2 = "jsoniq2.IntegerLiteralExpression"(%1) {value = 10} : (!jsoniq2.TS) -> !jsoniq2.TS
%3 = "jsoniq2.UnaryExpression"(%2) {operator = "-"} : (!jsoniq2.TS) -> !jsoniq2.TS

Figure 54: Representation of -10.

Note that the operation stored in register %2 represents the sub-expression. The register %0 stores the operation that represents \( TS_{in} \).

The NotExpression is represented in the same way, except that the operation that represents the output function does not contain a dictionary.
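The flat chain of Figure 54 (first the \( \delta_1 \) operation, then the sub-expression, then the output operation) can be generated with a short Python sketch. It is an illustration under assumptions: the helper names are hypothetical, the sub-expression is restricted to a single atomic operation, and string attribute values are quoted while integer values are not, as in Figure 54.

```python
TS = "!jsoniq2.TS"


def render_attrs(attrs):
    """Render the dictionary of an operation; returns an empty string if there is none."""
    if not attrs:
        return ""
    parts = [
        f'{key} = "{value}"' if isinstance(value, str) else f"{key} = {value}"
        for key, value in attrs.items()
    ]
    return " {" + ", ".join(parts) + "}"


def op(result, name, operand, attrs=None):
    """Every operation of this dialect maps one tuple-stream to one tuple-stream."""
    return f'{result} = "jsoniq2.{name}"({operand}){render_attrs(attrs)} : ({TS}) -> {TS}'


def emit_unary(expr_name, output_attrs, child_name, child_attrs, input_reg="%0", first_free=1):
    """Emit delta1 operation, atomic sub-expression and output operation, in this order."""
    r1, r2, r3 = (f"%{first_free + k}" for k in range(3))
    return [
        op(r1, f"delta1{expr_name}", input_reg),
        op(r2, child_name, r1, child_attrs),
        op(r3, expr_name, r2, output_attrs),
    ]


# The expression -10, as in Figure 54.
print("\n".join(emit_unary("UnaryExpression", {"operator": "-"},
                           "IntegerLiteralExpression", {"value": 10})))
```

Calling emit_unary("NotExpression", None, "ContextItemExpression", None) gives the corresponding chain for a NotExpression, whose output operation carries no dictionary.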

Binary Expressions

Binary Expressions have exactly two sub-expressions. Figure 55 gives an illustration of a Binary Expression.

As one can see, Binary Expressions contain three functions: \( \delta_1 \), \( \delta_2 \) and output. This means we represent Binary Expressions in this dialect with the operations that represent these three functions. The Binary Expressions we considered in this approach are: RangeExpression, AdditiveExpression, MultiplicativeExpression, StringConcatExpression, ComparisonExpression, AndExpression, OrExpression, ObjectLookupExpression. Figure 56 shows the representation of "hello" || " world!" in this dialect.

Note that the registers %2 and %4 represent the sub-expressions. Register %0 stores the operation that represents \( TS_{in} \). Further note that the operations that represent the output functions of the expressions AdditiveExpression, MultiplicativeExpression and ComparisonExpression contain a dictionary with their operator in it.
