
Research Collection

Bachelor Thesis

Storage Format for Almost-Homogeneous Data Sets

Author(s): Moro, Gianluca

Publication Date: 2020-07

Permanent Link: https://doi.org/10.3929/ethz-b-000426097

Rights / License: Creative Commons Attribution 4.0 International

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.


Bachelor’s Thesis Nr. 307b

Systems Group, Department of Computer Science, ETH Zurich

Storage Format for Almost-Homogeneous Data Sets

by Gianluca Moro

Supervised by Prof. Dr. Gustavo Alonso,

Dr. Ingo Müller, Dr. Ghislain Fourny

March 2020 - July 2020


This document is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0): https://creativecommons.org/licenses/by/4.0


Abstract

JSON is a popular data format which is very flexible: no schema needs to be defined, and therefore the data can be heterogeneous. But this flexibility comes at the price of performance. In practice, however, most data sets do not use the flexibility of JSON to its full extent and are mostly homogeneous.

We call such data sets almost-homogeneous. For almost-homogeneous data sets, the trade-off between the flexibility and the performance loss of JSON is not justified.

Thus, we propose a new storage format for almost-homogeneous data sets which allows for faster processing by storing the data in the Parquet file format. Parquet is optimized for structured, homogeneous data sets and allows for significantly faster data processing compared to unstructured JSON.

In this thesis, we first cover the background on Parquet and also analyse the almost-homogeneity in real-life JSON data sets. We then show how to preprocess and convert data sets from JSON to the Parquet format. Finally, we evaluate the processing performance of our new storage format.


Contents

1 Introduction 1

2 Background 2

2.1 JSON . . . 2

2.1.1 JSON Data Types . . . 2

2.1.2 JSON Lines format. . . 2

2.2 Apache Parquet. . . 2

2.2.1 Columnar Storage . . . 3

2.2.2 Parquet File Format . . . 3

2.3 Apache Arrow. . . 5

2.4 Structural patterns in almost-homogeneous data sets . . . 5

2.4.1 New York Times . . . 5

2.4.2 Yelp . . . 5

2.4.3 GitHub . . . 5

2.4.4 Reddit . . . 6

2.4.5 Conclusion . . . 6

3 Conversion Overview 10

3.1 Complete Conversion Process . . . 10

3.2 Arrow Limitations . . . 10

3.2.1 Nested JSON objects. . . 10

3.2.2 Mixed data type columns . . . 10

4 JSON Flattening 11

4.1 Round-Trippable . . . 11

4.2 Flattening process . . . 11

4.3 Nullity . . . 12

4.4 Empty Structs . . . 12

4.5 Field binder, array and type encoding . . . 12

4.6 Implementation . . . 13

4.7 Examples . . . 15

4.7.1 Non-nested JSON Objects . . . 15

4.7.2 JSON Structs . . . 15

4.7.3 Atomic Arrays . . . 15

4.7.4 Struct Arrays . . . 15

5 Writing Arrow Table and Parquet File 16

5.1 ’null’ in JSON arrays . . . 16

5.2 Arrow Schema Inference . . . 16

5.3 Arrow Table and Parquet File. . . 16

6 Parquet File Analysis 18

6.1 File Size Comparison . . . 18

6.2 Null values . . . 19

6.3 Conclusion . . . 20

7 Benchmarks 21

7.1 Conversion Benchmarks . . . 21

7.2 Query Benchmarks . . . 22

7.2.1 Query Implementations . . . 22

7.2.2 Results . . . 23

7.2.3 Conclusion . . . 24

8 Conclusion 26


A Appendix 28

A.1 Data set sources . . . 28

A.2 Source Code . . . 28


1 Introduction

JSON is a popular data exchange format which is very flexible, allowing very complex data structures to be represented without the need to define a schema. However, this flexibility is also one of its biggest drawbacks because it can lead to heterogeneity within data sets, since different JSON objects with varying sets of attributes and data types can be added to the same collection. This heterogeneity introduces an overhead for type checking and thus makes JSON processing less efficient.

In practice, the heterogeneity is often limited. We denote collections where the major part is homogeneous as "almost-homogeneous" data sets. In these cases, the JSON objects share many attributes, i.e. many attributes are present in most objects, and values at a specific path have the same type or a dominant type, meaning that one type is present in most objects while the other types only appear in very few cases. For almost-homogeneous data sets, the trade-off between the flexibility of JSON and its performance loss is not justified.

Therefore, the goal of this thesis is to design a storage format for almost-homogeneous data sets which allows for more efficient data processing. We propose a system that converts almost-homogeneous JSON data to the Parquet file format which is optimized for fully-structured data. To perform such a conversion, we first need to preprocess the JSON data. For this, we transform the JSON file by storing each possible type at each possible path in separate columns. The resulting columns are thus fully homogeneous with respect to the type. We call this process "JSON Flattening". The preprocessed data can then be written to the new Parquet format.

Parquet is a columnar storage format which is optimized for structured data. We store the data in Parquet files which are similar in size while being significantly faster to read. We can achieve speedups of up to a few hundred times when reading Parquet data compared to unstructured JSON. When querying Parquet, we only need to load the data we are interested in and can ignore all other data, unlike in JSON queries, where we need to load and parse all JSON objects even if we are only interested in a few attributes. Furthermore, Parquet allows for very efficient compression and encoding schemes which can further reduce the amount of I/O during reads.

Chapter 2 covers the background about JSON, the Parquet file format and analysis of structural patterns in JSON data sets. In Chapters 3, 4 and 5, we will describe how we perform the JSON Flattening and the conversion from JSON to Parquet. Finally, in Chapters 6 and 7 we analyse and benchmark the new storage format. We will show that the resulting Parquet files are similar in size compared to the original JSON ones but significantly faster to read.


2 Background

2.1 JSON

JSON (JavaScript Object Notation) is a data-interchange format that is lightweight and easy for humans to read and write. It is easy for machines to parse and generate. [1]

2.1.1 JSON Data Types

Each JSON value can either have an atomic or a structured data type. Atomic data type means that the value cannot be split into smaller parts. On the other hand, values with structured data type can be made of atomic values or other structured values, which is called "nesting". The following list explains the different possible JSON values:

• object

– Structured value

– An object is an unordered set of key-value pairs. Keys are strings and an object cannot contain duplicate keys.

• array

– Structured value

– An array is an ordered collection of values.

• string

– Atomic value

– Sequence of zero or more Unicode characters wrapped in double quotes.

• number

– Atomic value

– Numbers can be represented as integers, fractions (with a dot) or in exponent format (with an e)

• null

– Atomic value

– Represents an empty value

• boolean

– Atomic value

– Either true or false

2.1.2 JSON Lines format

The JSON Lines [2] format is used to store multiple JSON values in a single file. It has three requirements:

• UTF-8 encoding

• Each line is a valid JSON value

• Line separator is ’\n’

The data sets used in this thesis will be collections of JSON Lines files. We will also use the term "column", which we define as follows: a column of a JSON Lines file is the union of all values of a key over all JSON objects in the JSON Lines file.

2.2 Apache Parquet

Apache Parquet is an on-disk columnar storage format designed for efficient storage and performance.

It supports very efficient compression and encoding schemes and allows different compressions to be specified on a per-column level [3]. Parquet supports both flat and nested data.


2.2.1 Columnar Storage

Columnar storage means that values corresponding to the same column are stored adjacently, whereas in a row-based storage format complete data rows are stored adjacently. When reading columnar data, we can read whole columns without loading unnecessary row data. The memory access pattern for reading a column in a columnar format is sequential because the values are adjacent, and it is thus faster than the fragmented access pattern of a row-based format. Therefore, columnar storage is faster for processing analytical queries, where we are mainly interested in many values from a few columns.

Figure 1: Visual comparison between row-wise and columnar storage.

Since values in the same column have the same data type and tend to have similar values, it is very easy to efficiently compress and use encoding schemes like Dictionary encoding or Run-Length encoding.

It is also possible to apply different compression and encoding schemes on a per-column basis. Therefore, columnar storage usually results in better storage performance.

However, there are also some drawbacks, especially when we are interested in reading whole rows at a time instead of columns. Also, inserting new rows is more cumbersome in column-oriented storage. Therefore, the Parquet file format uses a hybrid between columnar and row-wise storage, which can be seen in Figure 2.

Figure 2: Visualization of the hybrid storage format used in Parquet files.

2.2.2 Parquet File Format

The Parquet file format uses a hybrid between columnar and row-wise storage. Data in Parquet files is partitioned into "Row Groups" and "Column Chunks". Row groups are the horizontal partitioning of the data into rows and a row group contains exactly one column chunk for each column in the data set. Column chunks are chunks of data corresponding to a particular column in the data set.

Furthermore, column chunks are divided into "Pages" which contain the actual data itself. In addition to the compressed and encoded values, pages also contain metadata about the data they contain, such as the minimum, maximum and number of values in the page. These can be used for more efficient querying because we can skip over whole pages if, for example, we see that the value we are looking for is smaller than the minimum value in that page. Furthermore, pages store Repetition and Definition Levels which are used in the Dremel encoding [4] which Parquet uses to encode nested data. Figure 3 shows a visualization of how data is partitioned into row groups and column chunks inside a Parquet file.

Figure 3: Visualization of data partitioning into row groups and column chunks inside a Parquet file.

The footer of a Parquet file stores metadata which contains information like the start locations of all the column metadata. Column metadata contains information like compression and encoding schemes and the data type of the values in that column. When reading a Parquet file, first the file metadata should be read to find all column chunks we are interested in and then the chunks can be read sequentially.
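To make this reading procedure more concrete, the following PyArrow sketch (the file and column names are hypothetical) inspects the footer metadata and then reads selected column chunks one row group at a time; it is an illustration, not part of the system developed in this thesis:

import pyarrow.parquet as pq

# Hypothetical file; the footer metadata describes row groups and column chunks.
pf = pq.ParquetFile("events.parquet")
meta = pf.metadata
print(meta.num_rows, meta.num_row_groups)
# Per-column-chunk statistics (min, max, null count), if they were written:
print(meta.row_group(0).column(0).statistics)

# Read only the column chunks we are interested in, one row group at a time.
for rg in range(meta.num_row_groups):
    chunk_table = pf.read_row_group(rg, columns=["id", "created_at"])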

Parquet supports only a few different data types, namely:

• BOOLEAN (1-bit boolean)

• INT32 (32-bit signed int), INT64 (64-bit signed int), INT96 (96-bit signed int)

• FLOAT (32-bit floating point value)

• DOUBLE (64-bit floating point value)

• BYTE_ARRAY (arbitrarily long byte array)

The set of supported types is intentionally chosen to be as minimal as possible to reduce the complexity of implementing readers and writers for the format. However, the set of types can be extended using logical types which specify how primitive types should be interpreted. For example, strings are stored as byte arrays in the Parquet file but with a UTF-8 annotation stored in the metadata.

In this thesis, we will be using the Parquet file format for our implementation of the storage format for almost-homogeneous data sets.


2.3 Apache Arrow

Apache Arrow is a cross-language development platform for in-memory data; it specifies a standardized, language-independent columnar memory format for flat and nested data, designed for efficient analytic operations [5]. It allows for zero-copy reads and therefore fast data access without serialization overhead between different languages and systems. It can, for example, be used to read from and write to Parquet. Arrow tries to maximize cache locality, pipelining and the use of SIMD (Single Instruction Multiple Data) instructions for faster execution.

We will be using Arrow to convert our JSON Lines data set to Parquet files.
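As a minimal illustration of the in-memory columnar model (column names and values below are made up):

import pyarrow as pa

# A small Arrow Table built from Python lists; each column is typed and nullable.
table = pa.Table.from_pydict({"id": [1, 2, None], "name": ["a", None, "c"]})
name_col = table.column(1)                  # the "name" column as a ChunkedArray
print(name_col.null_count, len(name_col))   # 1 3
print(name_col.to_pylist())                 # ['a', None, 'c']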

2.4 Structural patterns in almost-homogeneous data sets

In this section, we will show the heterogeneity of real-life JSON Lines data sets. The files are part of the following four data sets: New York Times, Yelp, GitHub and Reddit (the sources are listed in the appendix). We will analyse the data type distribution of the JSON Lines columns as well as the percentages of non-absent values. Using these metrics, we can estimate how hetero-/homogeneous the data sets are. The analyses for the New York Times, Yelp and Reddit data sets have already been covered in a different thesis [6] and we will be using those results. For the GitHub data set, we did the same kind of analysis on our own.

In the analysis, we look at the type distribution of the JSON columns, which is the distribution of the data types of the values in that column. The distributions are normalized to the number of present values in that column. Furthermore, the analysis measures the percentages of present values in each column, which is indicated by the exists value.

Note that for the New York Times, Yelp and Reddit data sets, we only look at the top-most level in the JSON objects while for the GitHub data set, we will also include some nested objects to get better results.

2.4.1 New York Times

The New York Times is an American newspaper established in 1851 [7]. Each JSON object in the data set represents a published newspaper article. The type distribution can be seen in Table 1 on page 7.

We can see that there are 20 different columns with no absent values. There are eleven columns which only use one data type and nine columns that use two different types. Almost all columns which use two data types use the null type and one other type. The only exception is the "word_count" column which uses integers and strings. We can also see that no booleans or floating point numbers are used.

Therefore, we can conclude that for the New York Times data set, roughly half of the columns are fully-homogeneous.

2.4.2 Yelp

Yelp is a website that lists businesses and users can post reviews about them [10]. The data set contains files about the businesses, users and reviews. For this analysis, we only consider the business data set where each JSON object represents a Yelp business listing.

From Table 2 on page 7, we can see that there are 15 columns which are present in all JSON objects. Furthermore, all except two columns only use one type. The columns with two different types have one dominant type.

We can conclude that for the Yelp data set, the majority of the columns are fully homogeneous.

2.4.3 GitHub

GitHub is a development platform that provides version control using Git [8]. Each JSON object in this data set represents an event reported by the GitHub API. Events can be Push-, Pull-, Create-, Fork-events, etc.

Note that for this data set, we are also including JSON columns up to the second nesting level because there are only 12 top-level columns and it would be difficult to make a good analysis. Furthermore, in Table 3 we only show a small subset of these columns because there would be too many to show them all, as there are several hundred columns in total. However, by looking at the subset alone we can already make a good estimate of the heterogeneity of the whole data set.

From the type distribution shown in Table 3 on page 8, we can see that there are a lot of absent values. Roughly half of the columns are only present in less than 1% of all JSON objects. Furthermore, we can see that the majority of columns only use a single type. From the 55 columns shown in Table 3, 45 use a single type and nine columns use two different types. There is a single column "actor" that even takes three different types. For most of the columns which use two different types, we can observe that there is a dominating type, which means that one of the types is used far more often than the other one. Finally, we can see that no floating point numbers are being used.

We can conclude that the GitHub data set shows some heterogeneous patterns especially because there are a lot of columns which are only present in few JSON objects. However, most columns only use a single data type or have a dominant type.

2.4.4 Reddit

Reddit is a website consisting of different communities where users can share, comment and rate ideas [9]. Each JSON object in the data set represents a Reddit comment.

Table 4 on page 9 shows that there are 34 different JSON columns and roughly half of them are present in all objects. Furthermore, we can observe that 29 columns only use one type, four columns use two different types and only one column uses three types. One interesting observation for the columns using two types is that all of them use string as one of the types.

We can conclude that the Reddit data set is mostly homogeneous in terms of types. However, there are quite a few columns which are not present in all JSON objects.

2.4.5 Conclusion

Using four real-life data sets, we have seen that most JSON columns use either one or two types. Also, most columns with two types have a dominating type which is present in far more objects compared to the other type. Furthermore, we could observe that the New York Times and Yelp data sets do not contain any absent values. On the other hand, the GitHub and Reddit data sets contain some columns which are only present in a small number of JSON objects. However, in general, we can conclude that the data sets are almost-homogeneous.


JSON object exists bool float int null string object array

_id 100 0 0 0 0 100 0 0

abstract 100 0 0 0 47.328 52.672 0 0

blog 100 0 0 0 0 0 0 100

byline 100 0 0 0 57.252 0 42.748 0

document_type 100 0 0 0 0 100 0 0

headline 100 0 0 0 0 0 100 0

keywords 100 0 0 0 0 0 0 100

lead_paragraph 100 0 0 0 43.511 56.489 0 0

multimedia 100 0 0 0 0 0 0 100

news_desk 100 0 0 0 74.809 25.191 0 0

print_page 100 0 0 0 7.634 92.366 0 0

pub_date 100 0 0 0 0 100 0 0

section_name 100 0 0 0 70.992 29.008 0 0

slideshow_credits 100 0 0 0 100 0 0 0

snippet 100 0 0 0 22.137 77.863 0 0

source 100 0 0 0 0 100 0 0

subsection_name 100 0 0 0 96.183 3.817 0 0

type_of_material 100 0 0 0 0 100 0 0

web_url 100 0 0 0 0 100 0 0

word_count 100 0 0 90.076 0 9.924 0 0

Table 1: Type distribution of New York Times data set (numbers are percentages)

JSON object exists bool float int null string object array

address 100 0 0 0 0 100 0 0

attributes 100 0 0 0 0 0 100 0

business_id 100 0 0 0 0 100 0 0

categories 100 0 0 0 0 0 0 100

city 100 0 0 0 0 100 0 0

hours 100 0 0 0 0 0 100 0

is_open 100 0 0 100 0 0 0 0

latitude 100 0 99.999 0 0.001 0 0 0

longitude 100 0 99.999 0 0.001 0 0 0

name 100 0 0 0 0 100 0 0

neighborhood 100 0 0 0 0 100 0 0

postal_code 100 0 0 0 0 100 0 0

review_count 100 0 0 100 0 0 0 0

stars 100 0 100 0 0 0 0 0

state 100 0 0 0 0 100 0 0

Table 2: Type distribution of Yelp business data set (numbers are percentages)


JSON object exists bool float int null string object array

actor 100 0 0 0 0.045 10.706 89.249 0

created_at 100 0 0 0 0 100 0 0

id 89.249 0 0 0 0 100 0 0

org 28.414 0 0 0 0 0 100 0

payload 99.99 0 0 0 0 0 100 0

payload.action 26.679 0 0 0 0 100 0 0

payload.actor 0.175 0 0 0 0 100 0 0

payload.actor_gravatar 0.175 0 0 0 0 100 0 0

payload.after 0.001 0 0 0 0 100 0 0

payload.before 45.824 0 0 0 0 100 0 0

payload.comment 8.058 0 0 0 0 0 100 0

payload.comment_id 1.051 0 0 100 0 0 0 0

payload.commit 0.096 0 0 0 0 100 0 0

payload.commits 46.048 0 0 0 0 0 0 100

payload.desc 0.141 0 0 0 4.323 95.677 0 0

payload.description 14.028 0 0 0 38.392 61.608 0 0

payload.distinct_size 45.823 0 0 100 0 0 0 0

payload.forkee 2.474 0 0 0.223 0 0 99.777 0

payload.gist 0.013 0 0 0 0 0 100 0

payload.head 51.668 0 0 0 0 100 0 0

payload.id 0.144 0 0 100 0 0 0 0

payload.issue 10.136 0 0 6.024 0 0 93.976 0

payload.issue_id 0.955 0 0 100 0 0 0 0

payload.legacy 0.041 0 0 0 0 0 100 0

payload.master_branch 14.028 0 0 0 0 100 0 0

payload.member 0.66 0 0 0 0 0.176 99.824 0

payload.name 0.167 0 0 0 0 100 0 0

payload.number 6.566 0 0 100 0 0 0 0

payload.object 0.026 0 0 0 0 100 0 0

payload.object_name 0.026 0 0 0 37.294 62.706 0 0

payload.original 0.001 0 0 0 0 100 0 0

payload.page_name 0.008 0 0 0 0 100 0 0

payload.pages 0.671 0 0 0 0 0 0 100

payload.pull_request 7.565 0 0 0 0 0 100 0

payload.push_id 46.319 0 0 100 0 0 0 0

payload.pusher_type 15.598 0 0 0 0 100 0 0

payload.ref 68.124 0 0 0 7.145 92.855 0 0

payload.ref_type 16.457 0 0 0 0 100 0 0

payload.release 0.348 0 0 0 0 0 100 0

payload.repo 0.163 0 0 0 0 100 0 0

payload.repository 0.008 0 0 0 0 0 100 0

payload.sha 0.008 0 0 0 0 100 0 0

payload.size 51.667 0 0 100 0 0 0 0

payload.snippet 0.006 0 0 0 0 100 0 0

payload.summary 0.008 0 0 0 97.851 2.149 0 0

payload.title 0.008 0 0 0 0 100 0 0

payload.url 0.149 0 0 0 0 100 0 0

public 100 99.996 0 0.004 0 0 0 0

repo 89.249 0 0 0 0 0 100 0

repo.id 89.174 0 0 100 0 0 0 0

repo.name 89.249 0 0 0 0 100 0 0

repo.url 89.249 0 0 0 0 100 0 0

repository 10.516 0 0 0 0 0 100 0

type 100 0 0 0 0 100 0 0

Table 3: Type distribution of GitHub data set (numbers are percentages)


JSON object exists bool float int null string object array

approved_by 0.034 0 0 0 100 0 0 0

archived 61.9 100 0 0 0 0 0 0

author 100 0 0 0 0 100 0 0

author_flair_css_class 100 0 0 0 71.808 28.192 0 0

author_flair_text 100 0 0 0 71.808 28.192 0 0

banned_by 0.034 0 0 0 100 0 0 0

body 100 0 0 0 0 100 0 0

body_html 0.034 0 0 0 0 100 0 0

controversiality 100 0 0 100 0 0 0 0

created 0.034 0 0 100 0 0 0 0

created_utc 100 0 0 30.167 0 69.833 0 0

distinguished 100 0 0 0 99.387 0.613 0 0

downs 100 0 0 100 0 0 0 0

edited 100 97.378 0.204 2.417 0 0 0 0

gilded 100 0 0 100 0 0 0 0

id 100 0 0 0 0 100 0 0

likes 0.034 0 0 0 100 0 0 0

link_id 100 0 0 0 0 100 0 0

mod_reports 0.034 0 0 0 0 0 0 100

name 61.9 0 0 0 0 100 0 0

num_reports 0.034 0 0 0 100 0 0 0

parent_id 100 0 0 0 0 100 0 0

removal_reason 22.54 0 0 0 100 0 0 0

replies 0.034 0 0 0 0 100 0 0

report_reasons 0.034 0 0 0 100 0 0 0

retrieved_on 99.797 0 0 100 0 0 0 0

saved 0.034 100 0 0 0 0 0 0

score 100 0 0 100 0 0 0 0

score_hidden 61.9 100 0 0 0 0 0 0

stickied 30.167 100 0 0 0 0 0 0

subreddit 100 0 0 0 0 100 0 0

subreddit_id 100 0 0 0 0 100 0 0

ups 92.612 0 0 100 0 0 0 0

user_reports 0.034 0 0 0 0 0 0 100

Table 4: Type distribution of Reddit data set (numbers are percentages)


3 Conversion Overview

Our idea for the conversion from JSON Lines to Parquet is to first convert the JSON Lines file to an Arrow Table which can then be used to automatically write the final Parquet file.

3.1 Complete Conversion Process

Algorithm 1 shows the complete JSON Lines to Parquet conversion pipeline. The different steps of the conversion process will be discussed in detail in Section 4 (JSON Flattening) and Section 5 (Writing Arrow Table and Parquet File). The "JsonToParquet" function takes a JSON Lines and a Parquet file path as inputs and converts the contents of the JSON Lines file to a new Parquet file. Note that the FlattenJsonFile function in Line 2 writes its output to a new JSON Lines file flatFileName.

Algorithm 1 JSON Lines to Parquet Conversion

1: function JsonToParquet(jsonFileName, parquetFileName)
2:     FlattenJsonFile(jsonFileName)                                        ▷ Algorithm 3
3:     flatFileName ← "flat_" + jsonFileName
4:     customSchema ← InferSchema(flatFileName)                             ▷ Algorithm 4
5:     customParser ← pyarrow.json.ParseOptions(customSchema)
6:     arrowTable ← pyarrow.json.read_json(flatFileName, customParser)      ▷ Write Arrow Table
7:     pyarrow.parquet.write_table(arrowTable, parquetFileName)             ▷ Write Parquet File

3.2 Arrow Limitations

We will be implementing the conversion in Python, therefore we are using the Python package "PyArrow", which provides Arrow integration for Python [11]. However, PyArrow has some limitations when it comes to converting certain JSON Lines files.

3.2.1 Nested JSON objects

JSON objects can be nested, i.e. a JSON key-value pair can contain another JSON object (also called "struct") as its value. The current version of PyArrow at the time of working on the conversion process (version 0.16, March 2020) does not support automatic writing of nested data to Parquet. Therefore, we cannot directly convert nested JSON structs to Parquet using PyArrow. Note that the same problem also occurs for arrays of structs.

3.2.2 Mixed data type columns

In a JSON Lines file, where each line represents a JSON object, there may be multiple key-value pairs in different JSON objects which share the same key but whose values have different data types. This corresponds to a JSON column with more than one data type. This is a problem when trying to write the Arrow Table, because in an Arrow Table each column can only contain data of one data type or 'null'. The same problem also holds for arrays with mixed data types.
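The restriction can be illustrated with a small PyArrow sketch (the exact exception type may vary between PyArrow versions):

import pyarrow as pa

# One data type plus nulls is fine:
ok = pa.array([1, None, 2])                 # -> int64 array with one null

# Mixing types in a single column is rejected:
try:
    bad = pa.array([1, "one"])
except (pa.ArrowInvalid, pa.ArrowTypeError) as e:
    print("mixed-type column rejected:", e)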


4 JSON Flattening

To solve the problems described in 3.2 Arrow Limitations, we transform the JSON objects such that the resulting JSON objects have a structure which we can convert to an Arrow Table. We call this transformation "Flattening". In particular, the flattening process transforms a JSON Lines file such that:

• All JSON objects within the transformed JSON Lines file are non-nested

– All values either have atomic data type (numbers, strings, booleans, null) or are (potentially multidimensional) arrays with atomic data type values

• All values with the same key have the same data type

Before we look at how the flattening works, we first look at the "round-trippable" condition which must hold.

4.1 Round-Trippable

Our whole conversion process must be round-trippable, which means that if we convert a JSON Lines file to Parquet and then convert it back to JSON, we must get a JSON Lines file with the same content back again. Therefore, it is important that during the flattening, we do not lose any information about the structure of the original JSON Lines file. Otherwise, it might happen that if we try to reconstruct (unflatten) two different flat JSON objects, we end up with the same nested JSON object. Thus, we say our flattening process is round-trippable if and only if the flattening function is injective.

4.2 Flattening process

The main idea of our flattening is to add new key-value pairs for each field in a nested JSON object. For this, we encode the "path to the value in the nested JSON object" and the data type of the value in the key of the flat object. Therefore, all values with the same key have the same data type and values can only have atomic data types or arrays of atomic data types.

For the "path-encoding", we use the symbols ’.’ for struct fields and ’[]’ for arrays. For the "data type encoding", we use the symbol ’$’.

Function 1 shows a formal definition of how we encode the path and data type in the key. We apply flatten("", v) to each JSON value v in the JSON Lines file. With "", we denote the empty string. We use {p : v} to describe a JSON object with one key-value pair, with key / path p and value v. Furthermore, we use $\bigcup$ to denote the merging of multiple individual JSON objects into a single JSON object. In Function 2, we use the notation $[v_i]_{i=1,\dots,n}$ to denote the building of a new array by iterating over $i$ from 1 to $n$ and assigning the value $v_i$ to the array index $i$.

$$
\mathit{flatten}(p, v) =
\begin{cases}
\bigcup_{i=1,\dots,n} \mathit{flatten}(l_i, u_i) & \text{if } p = \text{""} \wedge v = \{l_1 : u_1, \dots, l_n : u_n\} \\
\bigcup_{i=1,\dots,n} \mathit{flatten}(p.l_i, u_i) & \text{if } p \neq \text{""} \wedge v = \{l_1 : u_1, \dots, l_n : u_n\} \\
\mathit{combine}(p, v) & \text{if } v = [u_1, \dots, u_n] \\
\{p\mathtt{\$int} : v\} & \text{if } v \in \text{int} \\
\{p\mathtt{\$float} : v\} & \text{if } v \in \text{float} \\
\{p\mathtt{\$string} : v\} & \text{if } v \in \text{string} \\
\{p\mathtt{\$bool} : v\} & \text{if } v \in \text{bool} \\
\{p\mathtt{\$array} : [\,]\} & \text{if } v = [\,] \\
\{p\mathtt{\$struct} : \mathit{true}\} & \text{if } v = \{\} \\
\{p\mathtt{\$null} : \mathit{true}\} & \text{if } v = \mathit{null}
\end{cases}
\tag{1}
$$

$$
\mathit{combine}(p, v) = \bigcup_{p' \in P} \left\{ p' : \left[
\begin{cases} v'_i[p'] & \text{if } p' \in v'_i \\ \mathit{null} & \text{otherwise} \end{cases}
\right]_{i=1,\dots,n} \right\}
\tag{2}
$$

where $v = [u_1, \dots, u_n]$, $[v'_1, \dots, v'_n] := [\mathit{flatten}(p[], u_1), \dots, \mathit{flatten}(p[], u_n)]$ and $P$ is the set of all paths in $\mathit{keys}(v'_1), \dots, \mathit{keys}(v'_n)$.

Function 1 can directly be used for the implementation of the flattening process. For arrays (case v = [u1, ..., un]), we want to "preserve the structure" of the array in the original nested JSON object.

That means, if we flatten an array of structs, we want to obtain a single array which only contains the values of the structs, which have atomic data types, and the order within the array and the length remain the same. To implement this, we first recursively flatten all elements of the array in their original order, which will give us an array of flat JSON objects. Then we need to combine all values which belong to the same array, i.e. which have the same path, into a single array. When combining, we need to make sure that the overall structure of the original array (the order within the array) is preserved. That means, we need to make sure that the values are placed at the position they were in the original JSON array. To achieve this, we also include 'null' values to encode that there is no value at this position in the array for this path and type. It is important that we preserve the structure of JSON arrays during the flattening to remain round-trippable. The implementation of the flattening process will be discussed in Section 4.6 (Implementation).

Now, by only looking at the keys in the flat object, we can see what the nested object looked like:

• If a key contains 'a[].b', it means the value of a in the nested object was an array which contains objects with the key b.

– For example: {"a": [{"b": 1}, {"b": 2}, {"b": 3}]}

– The flat object would be: {"a[].b$int": [1, 2, 3]}

• If a key contains 'a[]' (without the dot), it means the value of a in the nested object was an array which contains atomic values.

– For example: {"a": [1, 2, 3]}

– The flat object would be: {"a[]$int": [1, 2, 3]}

• If a key contains ’$type’, it means the value either has data type ’type’ or if the value is an array, then all elements have data type ’type’ or are ’null’.

4.3 Nullity

As seen in Function 1, we encode 'null' with a boolean 'true' rather than keeping the original 'null'. The reason is that in an Arrow Table, missing values (i.e. when some JSON objects contain a key which others do not) are replaced by null values. Therefore, in a column with type '$null', we could not distinguish between a 'null' generated because of a missing value and a 'null' that was the actual value in a JSON object. However, if we encode JSON 'null' values with 'true', we can make this distinction.

4.4 Empty Structs

Because an Arrow Table cannot contain a column with data type 'struct', it also cannot contain empty structs '{}'. Therefore, during the flattening we need to remove empty structs but still need a way to know that the original value was an empty struct. For this, we simply encode the type of this value with '$struct' and set the value to 'true'; the actual data type of this column therefore becomes 'bool'. We now know that if a key has type '$struct' and its value is 'true', then it was originally an empty struct.

4.5 Field binder, array and type encoding

While flattening, we use the symbols '.', '[]' and '$' to encode fields of structs, arrays and data types respectively. However, a JSON object could contain these symbols as part of a key already before the flattening. To be able to reconstruct the original key, we need to make sure that no key contains any of the symbols on its own. For this, we use a single backslash as an escape character, i.e. we replace '.' with '\.', '[]' with '\[]' and '$' with '\$' before flattening. We can now distinguish between a symbol being part of the original key (with '\') and being part of the path encoding (no '\').

Note that we also need to replace ’\’ in the original key with ’\\’ because a single backslash now acts as an escape character.
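A minimal Python sketch of this preprocessing step (the "preprocess p" in Line 10 of Algorithm 2 below) could look as follows:

def escape_key(key):
    # Escape the symbols used by the path encoding so that an original key
    # containing '\', '.', '[]' or '$' can be told apart from the encoding itself.
    key = key.replace("\\", "\\\\")   # the escape character itself must be escaped first
    key = key.replace(".", "\\.")
    key = key.replace("[]", "\\[]")
    return key.replace("$", "\\$")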

4.6 Implementation

Algorithm 2 shows the pseudocode of the flattening process. We call Flatten("", obj) for every JSON object 'obj' of our input JSON Lines file. Note that in our Python implementation, the JSON objects are first loaded into Python dictionaries.

The implementation is basically a straight-forward adaptation of Function 1, which was discussed in Section 4.2 (Flattening process). In Line 10, we perform the preprocessing of the keys as discussed in Section 4.5 (Field binder, array and type encoding). The Combine function is responsible for combining the values of flattened struct arrays which share the same path into a single array, while preserving the original structure. The Flatten function returns a Python dictionary which contains the flat JSON object and can then be dumped back to a JSON object.

When flattening a whole JSON Lines file, we iterate over all lines, load each JSON object and flatten it using Algorithm 2. The flat JSON object is then written to a new JSON Lines file which contains all the flat objects. This process is described in Algorithm 3. We can then use this new file to write an Arrow Table and the final Parquet file, which will be discussed in more detail in Section 5 (Writing Arrow Table and Parquet File).


Algorithm 2 Flattening Process

1: function Flatten(path, data)
2:     if type(data) = dict then                                ▷ data is a JSON object
3:         if data = {} then
4:             path ← path + "$struct"
5:             data ← true
6:             return {path : data}
7:         else
8:             Initialize: flat ← empty dictionary
9:             for p, v ∈ data do
10:                preprocess p                                  ▷ key escaping, see Section 4.5
11:                if path ≠ "" then
12:                    newPath ← path + "." + p
13:                else
14:                    newPath ← p
15:                flattened ← Flatten(newPath, v)
16:                for flat_p, flat_v ∈ flattened.items() do
17:                    flat[flat_p] ← flat_v
18:            return flat
19:
20:    else if type(data) = list then                            ▷ data is a JSON array
21:        if data = [] then
22:            path ← path + "[]" + "$array"
23:            data ← true
24:            return {path : data}
25:        else
26:            Initialize: flats_list ← empty list, paths ← empty set
27:            for i ∈ {0, . . . , length(data) − 1} do
28:                newPath ← path + "[]"
29:                flattened ← Flatten(newPath, data[i])
30:                flattened_paths ← flattened.keys()
31:                paths ← paths ∪ flattened_paths
32:                flats_list.append(flattened)
33:            flat ← Combine(flats_list, paths)
34:            return flat
35:
36:    else                                                      ▷ data is an atomic JSON value
37:        path ← path + "$" + type(data)
38:        if data = null then
39:            data ← true
40:        return {path : data}
41:
42: function Combine(flats_list, paths)
43:    Initialize: flat ← empty dictionary
44:    len ← length(flats_list)
45:    for p ∈ paths do
46:        Initialize: flat_arr ← empty list
47:        for i ∈ {0, . . . , len − 1} do
48:            if p ∈ flats_list[i] then
49:                flat_arr.append(flats_list[i][p])
50:            else
51:                flat_arr.append(null)
52:        flat[p] ← flat_arr
53:    return flat


Algorithm 3 JSON Lines file flattening

1: function FlattenJsonFile(fileName)
2:     origFile ← open(fileName, "r")                            ▷ input file
3:     flatFile ← open("flat_" + fileName, "w")                  ▷ output file
4:     for line ∈ origFile do
5:         jsonData ← json.loads(line)
6:         flatJsonData ← Flatten("", jsonData)                  ▷ Algorithm 2
7:         flatJson ← json.dumps(flatJsonData)
8:         flatFile.write(flatJson + "\n")
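For concreteness, here is a compact Python sketch of Algorithms 2 and 3. It is a simplified transcription: the key escaping of Section 4.5 and error handling are omitted, and atomic types other than bool, int, float, string and null are not handled. Running flatten("", obj) on the objects in Section 4.7 reproduces the flat objects shown there.

import json

def flatten(path, data):
    if isinstance(data, dict):                           # JSON object
        if data == {}:
            return {path + "$struct": True}              # empty struct, see Section 4.4
        flat = {}
        for p, v in data.items():
            new_path = path + "." + p if path != "" else p
            flat.update(flatten(new_path, v))
        return flat
    if isinstance(data, list):                           # JSON array
        if data == []:
            return {path + "[]" + "$array": True}
        flats, paths = [], set()
        for element in data:
            flattened = flatten(path + "[]", element)
            paths |= set(flattened.keys())
            flats.append(flattened)
        return combine(flats, paths)
    if data is None:                                     # JSON null, see Section 4.3
        return {path + "$null": True}
    type_name = {bool: "bool", int: "int", float: "float", str: "string"}[type(data)]
    return {path + "$" + type_name: data}

def combine(flats, paths):
    # Merge flattened array elements that share a path into one array,
    # padding with null (None) to preserve each element's original position.
    return {p: [f.get(p) for f in flats] for p in paths}

def flatten_json_file(file_name):
    with open(file_name) as orig_file, open("flat_" + file_name, "w") as flat_file:
        for line in orig_file:
            flat_file.write(json.dumps(flatten("", json.loads(line))) + "\n")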

4.7 Examples

In the following sections, we will look at some examples of the JSON Flattening.

4.7.1 Non-nested JSON Objects

{"a": "hello world", "b": 1234, "c": false}

The above JSON object gets flattened to:

{"a$string": "hello world", "b$int": 1234, "c$bool": false}

Note that for non-nested JSON objects, only the data type information gets added to the key.

4.7.2 JSON Structs

{"a": {"b": 1.25, "c": true}}

The above JSON object gets flattened to:

{"a.b$float": 1.25, "a.c$bool": true}

4.7.3 Atomic Arrays

{"a": [1, 2, 3, 4], "b": {"c": ["hello", "world"]}}

The above JSON object gets flattened to:

{"a[]$int": [1, 2, 3, 4], "b.c[]$string": ["hello", "world"]}

4.7.4 Struct Arrays

{"a": [{"b": 1}, {"b": 2}, {"b": 3}]}

{"a": {"b": [{"c": "hello"}, {"c": "world"}]}}

The above JSON objects get flattened to:

{"a[].b$int": [1, 2, 3]}

{"a.b[].c$string": ["hello", "world"]}

Note that the arrays in the flat objects only contain the values of the structs while the keys are all combined into a single new key.

Let’s look at another example, where we illustrate "how we preserve the original structure":

{"a": [[{"b": "1"}], [{"b": 1}, {"b": "2"}, {"b": 3}], [{"b": true}]]}

The above JSON object gets flattened to:

{"a[][].b$string": [["1"], [null, "2", null], null],

"a[][].b$int": [null, [1, null, 3], null],

"a[][].b$bool": [null, null, [true]]}

Note that the value of "a" in the original object is a 2-dimensional array and in the flat object, the value of all keys corresponding to "a" are also 2-dimensional arrays. Furthermore, the non-null values are at the positions they were in the original array.


5 Writing Arrow Table and Parquet File

After flattening the JSON Lines file, we need to write an Arrow Table which can later be used to write the Parquet file. However, there were again some problems which we had to solve first.

5.1 ’null’ in JSON arrays

Using the "pyarrow.json.read_json()" function to create an Arrow Table from our flat JSON Lines file, we cannot directly convert a JSON array that only contains ’null’ or whose first element is ’null’ because PyArrow cannot automatically infer the Arrow data type for such objects.

To solve this issue, we need to manually infer an "Arrow Schema" which contains the Arrow data type information for each column in the Arrow Table. An Arrow Schema is basically a mapping from column names, in our case the JSON keys, to Arrow data types.

5.2 Arrow Schema Inference

We infer the Arrow Schema using the data type information we stored inside the JSON keys during the flattening process. For each JSON key in the input JSON Lines file, we create an "Arrow Field", which is basically a mapping from a JSON key to an Arrow data type. The list of Arrow Fields can then be converted to an Arrow Schema using the 'pyarrow.schema()' function. Algorithm 4 on page 17 shows the pseudocode of the Arrow Schema inference.

In Line 8, the number arrDepth refers to the depth of a JSON array and is equal to 0 if the value is not an array, i.e. the value is an atomic JSON value. Remember that during the flattening, we added the symbol '[]' for arrays. Therefore, the array depth is equal to the number of times the symbol '[]' occurs in the key. In Line 9, we can get the data type, which is encoded in the JSON key, by splitting the key on the data type symbol '$'. The "rsplit" function splits a string starting from the right, therefore the string containing the data type information is the last element in the returned list. In Python, the index -1 refers to the last element of a list.

Note that the Arrow Type for arrays / lists is generally not just "pyarrow.list_(type)" because it depends on the depth of the array. That means each array level adds a layer of "pyarrow.list_(...)".

Thus, in Line 37 we recursively define the correct Arrow type for arrays.
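A minimal Python sketch of this type construction (it ignores escaped symbols in keys, which the full implementation has to handle):

import pyarrow as pa

def arrow_type(arr_depth, data_type):
    # Wrap the atomic Arrow type in one pa.list_() per array nesting level.
    if arr_depth == 0:
        return {"bool": pa.bool_(), "string": pa.string(), "int": pa.int32(),
                "float": pa.float64(), "array": pa.null(),
                "struct": pa.bool_(), "null": pa.bool_()}[data_type]
    return pa.list_(arrow_type(arr_depth - 1, data_type))

def field_for_key(key):
    arr_depth = key.count("[]")            # one '[]' per array level in the path
    data_type = key.rsplit("$", 1)[-1]     # type annotation after the last '$'
    return pa.field(key, arrow_type(arr_depth, data_type))

# e.g. field_for_key("a[][].b$int") yields a field of type list<list<int32>>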

5.3 Arrow Table and Parquet File

During the flattening process, we created a new JSON Lines file which contains all the flat JSON objects.

We can use this file and our manually inferred Arrow Schema to convert the JSON Lines file to an Arrow Table. We first need to create a "pyarrow.json.ParseOptions()" in order to specify that we want to use our manually inferred schema ’customSchema’:

customParser = pyarrow.json.ParseOptions(explicit_schema=customSchema)

We can now write an Arrow Table from our JSON Lines file ’flatFile.jsonl’, which contains the flat objects of our input JSON Lines file, by using the "pyarrow.json.read_json()" function:

arrowTable = pyarrow.json.read_json("flatFile.jsonl", parse_options=customParser)

Using the Arrow Table, we can then write a Parquet file with the "pyarrow.parquet.write_table()" function:

pyarrow.parquet.write_table(arrowTable, "name.parquet")

This function takes additional arguments (in PyArrow, "use_dictionary" and "compression") which specify whether to use Dictionary Encoding and which compression technique will be applied. By default, PyArrow uses dictionary encoding together with run-length encoding and Snappy compression when creating Parquet files.

Other compression variants are Gzip, LZ4, Brotli and ZSTD. The comparison of different compression variants will be discussed in Section 6 (Parquet File Analysis).
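For example, an illustrative call switching to Gzip (continuing the snippet above; the output file name is made up):

pyarrow.parquet.write_table(arrowTable, "name_gzip.parquet",
                            use_dictionary=True, compression="gzip")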


Algorithm 4 Arrow Schema Inference

1: function InferSchema(fileName)
2:     Initialize: schemaFields, keysInSchema ← empty lists
3:     file ← open(fileName, "r")                                ▷ flat JSON Lines file
4:     for line ∈ file do
5:         jsonObject ← json.loads(line)
6:         for key ∈ jsonObject.keys() do
7:             if key ∉ keysInSchema then
8:                 arrDepth ← getNumOcc(key, "[]")
9:                 dataType ← key.rsplit("$")[−1]
10:
11:                fieldType ← getArrowType(arrDepth, dataType)
12:                field ← pyarrow.field(key, fieldType)
13:                schemaFields.append(field)
14:
15:                keysInSchema.append(key)
16:
17:    schema ← pyarrow.schema(schemaFields)
18:    return schema
19:
20: function getArrowType(arrDepth, dataType)
21:    if arrDepth = 0 then
22:        if dataType = "bool" then
23:            return pyarrow.bool_()
24:        else if dataType = "string" then
25:            return pyarrow.string()
26:        else if dataType = "int" then
27:            return pyarrow.int32()
28:        else if dataType = "float" then
29:            return pyarrow.float64()
30:        else if dataType = "array" then
31:            return pyarrow.null()
32:        else if dataType = "struct" then
33:            return pyarrow.bool_()
34:        else if dataType = "null" then
35:            return pyarrow.bool_()
36:    else
37:        return pyarrow.list_(getArrowType(arrDepth − 1, dataType))


6 Parquet File Analysis

6.1 File Size Comparison

One of our goals for the conversion to Parquet is to not add a significant storage overhead. Therefore, we compared the JSON Lines and Parquet file sizes for different compressions to see if and by how much the additional null values impact the overall file size. The compressions used were:

• No compression

• Snappy compression

• Gzip compression

The plots in Figure 4 show the file sizes of different file samples for each of the four data sets we used.

For each data set (New York Times, GitHub, Reddit, Yelp), we selected four file samples at random.

The same samples will be used in all following analyses and benchmarks.

Figure 4: File size comparisons for (a) New York Times, (b) GitHub, (c) Reddit and (d) Yelp. Each plot shows, for samples 1-4, the file size relative to the uncompressed JSON Lines file, for JSON (uncompressed, Snappy, Gzip) and Parquet (uncompressed, Snappy, Gzip).


The plots in Figure 4 show that for the New York Times data set, the Parquet files are very similar in size compared to the corresponding JSON Lines files. This means that the additional null values do not increase the file size significantly, if at all. Furthermore, the compression ratios are also similar across samples. One outlier is sample 4, whose Parquet files are only about half the size of the JSON Lines version. The reason for this is probably that the JSON Lines file of sample 4 has a very different structure than the ones for samples 1 to 3: the first three samples each contain only one single JSON object, while sample 4 contains several thousand.

For the GitHub data set, we can observe that all Parquet files are slightly larger compared to the corresponding JSON Lines files. Since the size difference is small, we can conclude that the additionally introduced null values do not increase the file size by a large amount. Furthermore, the compression ratios are similar across samples.

The Reddit data set also shows similarities across all samples in terms of file size and compression ratios. We can see that the Parquet files are overall smaller than the JSON Lines files. That means that the additional null values do not introduce any storage overhead.

On the other hand, we can see that the Yelp data set has some variety when it comes to file sizes and compression ratios. The only similarity between samples is that all Parquet files are smaller than the corresponding JSON Lines files, which again means that the added null values do not negatively impact the size.

Conclusion

In most cases, the Parquet files are either smaller or very similar in size compared to the corresponding JSON Lines files. The only exception is the GitHub data set, where the Parquet sizes are overall slightly larger. Furthermore, we can see that, with the exception of the Yelp data set, Snappy and Gzip compression have a very similar effect on the resulting Parquet files. Overall, Gzip compression results in the smallest file sizes.

Therefore, we can conclude that the conversion to the flattened Parquet file does not introduce a significant storage overhead, if at all.

6.2 Null values

During the flattening, we added null values to keep the original structure within struct arrays and when writing the Arrow Table, missing values will be replaced by null values. Therefore, one might be concerned that all these additional null values would increase the file size by a lot. To see if there is any correlation between the file size and the amount of null values within the file, we compared the file size to the ratio of null values.

To compute the null value ratio, we consider all values in the Parquet file, counting elements within nested arrays distinctively, i.e. an array does not count as one value but rather all values within the array are considered separate values. This way, we count the number of values which are null and the total number of values and compute the ratio of nulls compared to the total number of values.
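A rough PyArrow sketch of this counting, under one possible interpretation (a null array slot counts as a single null value, and the elements of non-null arrays are counted individually), might look as follows:

import pyarrow as pa
import pyarrow.parquet as pq

def count_nulls_and_values(arr):
    # Returns (nulls, total) for one Arrow array, descending into list types.
    if pa.types.is_list(arr.type):
        nulls, total = count_nulls_and_values(arr.flatten())
        return arr.null_count + nulls, arr.null_count + total
    return arr.null_count, len(arr)

def null_ratio(parquet_file):
    nulls, total = 0, 0
    for column in pq.read_table(parquet_file).columns:
        for chunk in column.chunks:
            n, t = count_nulls_and_values(chunk)
            nulls, total = nulls + n, total + t
    return nulls / total if total else 0.0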

From Table 5, we can observe that the null value ratios within data sets are very similar, with the exception of the Yelp data set. The ratios between different data sets vary a lot, ranging from 0% up to 95%. However, we cannot see any correlation between the null value ratio and the file size ratio which holds for all data sets.

For example, one might think that a larger null value ratio will lead to a larger file size ratio. However, this does not hold for all data sets: samples 2 to 4 in the Yelp data set contain no null values, yet their sizes are bigger than that of sample 1, where around 60% of the values are null. Similarly, for the GitHub data set, the null value ratios increase from sample 1 to 3, yet their file sizes decrease.

Furthermore, even though the amount of null values in the file can be very large (more than 90%), the file sizes are only slightly larger, if at all. This can be observed in the GitHub data set.

Therefore, we conclude that the added null values do not directly influence the resulting file size.


Sample Null value ratio File size ratio

Sample 1 0.46 0.96

Sample 2 0.35 1.03

Sample 3 0.36 1.00

Sample 4 0.26 0.57

(a) New York Times data set

Sample Null value ratio File size ratio

Sample 1 0.91 1.27

Sample 2 0.93 1.10

Sample 3 0.94 1.05

Sample 4 0.91 0.98

(b) GitHub data set

Sample Null value ratio File size ratio

Sample 1 0.16 0.65

Sample 2 0.19 0.68

Sample 3 0.16 0.68

Sample 4 0.19 0.67

(c) Reddit data set

Sample Null value ratio File size ratio

Sample 1 0.61 0.75

Sample 2 0.00 0.92

Sample 3 0.00 0.83

Sample 4 0.00 0.97

(d) Yelp data set

Table 5: Null value ratios and file size ratios. The column "File size ratio" contains the ratio of the Parquet file size compared to the JSON Lines file size, both Snappy compressed.

6.3 Conclusion

We have shown that even though the ratio of null values to the total number of values within a Parquet file can be large, the overall size of the file compared to its corresponding JSON Lines file will remain roughly the same for most data sets, and only increase slightly for some. This means that the additionally introduced null values do not have any significant impact on the resulting file size.

Furthermore, the different compression ratios are similar within data sets and since Snappy compresses a lot faster than Gzip, we decided to use Snappy compression as our default option when creating Parquet files.


7 Benchmarks

In the following sections, we will discuss benchmarks we did on the conversion process and on data querying. The benchmarks were performed on a system with the following specifications:

• CPU: Intel Core i7 7700K, 4.20GHz

• RAM: 16GB

• OS: Windows 10, 64-bit

• Python: Version 3.6

• Java: Version 1.8

7.1 Conversion Benchmarks

To benchmark the conversion process, the conversion was performed on different JSON Lines files with varying sizes. For each file, the conversion was repeated multiple times and the mean execution time was measured. Note that the conversion process for this benchmark runs completely sequentially; no optimization using multi-threading was applied. By default, PyArrow uses multi-threading to speed up the writing of the Arrow Table, but we disabled it. The results are shown in Figure 5.

Figure 5: Conversion benchmarks for (a) New York Times, (b) GitHub, (c) Reddit and (d) Yelp. Each plot shows the mean execution time (s) against the JSON Lines file size (MB), broken down into Total, JSON Load/Dump, Flattening, Schema Inference, Arrow Table and Parquet File.


From the plots in Figure 5, we can see that the time to convert JSON Lines files to Parquet grows linearly with the size of the JSON Lines file. On average, the conversion has a throughput of around 10 MB/s.

Furthermore, we can observe that the majority of the time is being spent during the JSON flattening.

Around 30 to 60% of the time is used for the flattening and about 10 to 30% is taken up by inferring the Arrow Schema. With the exception of the GitHub data set, the percentages of time used for flattening the JSON Lines file and inferring the Arrow Schema are pretty similar. For the GitHub data set, the Arrow Schema inference requires a lot more time, which is probably caused by the far larger number of columns compared to all other data sets. While the Parquet files in the New York Times, Reddit and Yelp data sets have up to a few dozen columns, the GitHub Parquet files have a few hundred columns.

We can also see that loading and dumping all of the JSON objects accounts for up to 25% of the conversion time. Furthermore, the plots show that the time to write the Arrow Table takes only about 10 to 20% of the whole conversion time and writing the Parquet file takes the least amount of time by a large margin. This can be explained by the fact that Arrow allows zero-copy writing to Parquet and is thus very efficient.

Optimizations

The presented benchmarks show a baseline for a non-optimized conversion process. We did not spend too much time on fully optimizing the conversion, but there are some possibilities to improve the performance. By default, PyArrow will use multi-threading to speed up the writing of the Arrow Table.

Further optimizations could include multi-threading the JSON flattening and inferring the schema during the flattening. Furthermore, implementations in other programming languages, such as C++ or Java, might also result in better conversion times.
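One possible direction is sketched below, parallelizing the per-line flattening with a process pool (processes rather than threads, since the flattening is CPU-bound Python code); flatten() refers to the sketch after Algorithm 3, and the module name in the import is hypothetical:

import json
from multiprocessing import Pool

from flatten_sketch import flatten   # the flatten() sketched after Algorithm 3 (hypothetical module)

def flatten_line(line):
    return json.dumps(flatten("", json.loads(line))) + "\n"

def flatten_json_file_parallel(file_name, processes=4):
    with open(file_name) as src, open("flat_" + file_name, "w") as dst:
        with Pool(processes) as pool:
            # imap preserves the input order, so the output file keeps the line order.
            for flat in pool.imap(flatten_line, src, chunksize=1000):
                dst.write(flat)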

7.2 Query Benchmarks

To see the improvement in performance of processing data using the Parquet format over unstructured JSON, we evaluated the time to execute a query which loads all values at a given set of paths. The query is performed on the same file for different numbers of paths, and the paths are chosen at random from the set of all possible paths in that file. We compared the query on the Parquet files using a Python application and on the JSON Lines files using both a Python and a Java application.

7.2.1 Query Implementations

For the Parquet query, we use the "pyarrow.parquet.read_table()" function which takes a Parquet file path and optionally a list of column names as arguments and returns the content of the file as an Arrow Table. If a list of column names is provided, only those columns will be read from the file, otherwise the whole file will be read. Furthermore, we are also disabling the option "use_threads" which would perform multi-threaded column reads. By default PyArrow would use multi-threading to speed up the reading of Parquet files. However, since neither of the JSON query implementations uses any multi-threading and we want a fair comparison, we will not use any multi-threading for this benchmark.

The Python implementation for querying Parquet files looks as follows:

import pyarrow.parquet

def parquetQuery(fileName, paths):      # 'paths' is the list of paths in the query
    numPresentValues = []               # 'numPresentValues' contains the result of the query
    table = pyarrow.parquet.read_table(fileName, columns=paths, use_threads=False)
    for col in table.itercolumns():
        numNulls = col.null_count
        length = col.length()
        numPresentValues.append(length - numNulls)

Listing 1: Python Parquet query implementation

For the JSON query, we iterate over all lines in our JSON Lines file and for each JSON object, we try to get the value at the given paths. In order to use JSONPath syntax for querying the JSON objects, we use "JSONPath RW" [12] for the Python implementation and "Jayway JsonPath" [13] for the Java implementation:

import json
from jsonpath_rw import parse

def jsonQuery(fileName, paths):
    numPresentValues = [0 for i in range(len(paths))]
    parsedPaths = [parse(p) for p in paths]
    with open(fileName, mode="r") as jsonFile:
        for line in jsonFile:
            jsonData = json.loads(line)
            for i in range(len(parsedPaths)):
                parsedPath = parsedPaths[i]
                try:
                    data = [match.value for match in parsedPath.find(jsonData)]
                    numPresentValues[i] += 1
                except TypeError:
                    # no value present at this path
                    # the "find" method throws a TypeError if it cannot
                    # find a value at the given path
                    continue

Listing 2: Python JSON query implementation

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import com.jayway.jsonpath.Configuration;
import com.jayway.jsonpath.JsonPath;

public static void jsonQuery(String fileName, String[] paths) throws IOException {
    BufferedReader br = new BufferedReader(new FileReader(fileName));
    int[] numPresentValues = new int[paths.length];
    String currentLine;
    while ((currentLine = br.readLine()) != null) {
        // each JSON object is parsed only once per query
        Object parsedObject = Configuration.defaultConfiguration()
                .jsonProvider().parse(currentLine);
        for (int i = 0; i < paths.length; i++) {
            String path = paths[i];
            try {
                Object data = JsonPath.read(parsedObject, path);
                numPresentValues[i] += 1;
            } catch (Exception e) {
                // no value present at this path
                // the 'read' method throws an exception if it cannot
                // find a value at the given path
                continue;
            }
        }
    }
    br.close();
}

Listing 3: Java JSON query implementation

One important difference between the Python and Java implementations for querying JSON data is that in the Java version, each JSON object is only parsed once per query, unlike in the Python version, where each object is parsed once for each path in the query. Thus, the Python implementation will be slower than the Java one, which can be observed in the results. For a small number of paths in the query this will not have a huge impact, but when the number of paths increases, the additional parsing in the Python version will increase the execution time significantly compared to the Java one.

7.2.2 Results

Each query was performed multiple times and the mean execution time was measured. Finally, we normalized the times by the total number of present values in the JSON Lines file. Note that when counting present values, array elements are counted individually. The benchmarks for the different data sets are shown in Figure 6. Since the plots are very similar for samples of the same data set, we only show the plot for one sample per data set.
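As a minimal sketch of this normalization (the function and variable names below are placeholders chosen for illustration, not taken from the benchmark code):

import statistics

def timePerValue(executionTimes, totalPresentValues):
    # mean execution time of the repeated runs of one query,
    # divided by the total number of present values
    # (array elements are counted individually)
    return statistics.mean(executionTimes) / totalPresentValues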


Figure 6: Query benchmarks for (a) New York Times, (b) GitHub, (c) Reddit, and (d) Yelp. Each panel plots the execution time per value (ms, logarithmic scale) against the percentage of paths in the query for the JSON Python, JSON Java, and Parquet Python implementations; times are normalized by the number of present values in the JSON Lines file.

From the plots in Figure 6, we can see that the query times are qualitatively similar across the data sets. The times for the GitHub query are overall a bit higher, probably because the GitHub data set is the most heterogeneous among all the data sets. We can further observe that querying the Parquet file is significantly faster than querying the JSON data. The execution times increase with the number of paths in the query for all three implementations. While the Parquet and JSON Python query times increase quite significantly, at least for smaller query sizes, the time for the JSON Java query only increases slightly.

The only exception is the GitHub data set, where the JSON Java query behaves somewhat irregularly. We can also observe that the speedup of querying Parquet compared to JSON decreases with an increasing number of query paths. Furthermore, the curves flatten out the closer the query size gets to 100% path coverage. On average, the speedups of querying Parquet compared to JSON Java range from around 30 up to 250 for a small number of query paths. With increasing query size, the speedups decrease such that querying Parquet is only about 10 to 60 times faster than JSON.

7.2.3 Conclusion

One big advantage of querying Parquet files is that we only need to load the columns which correspond to the paths in the query and can ignore all other data, which avoids a lot of unnecessary data loading. For the JSON query, the complete JSON Lines file has to be loaded, even if we are only interested in a few paths. Also, when querying JSON, all objects first need to be parsed from the file into JSON objects, which is not necessary with Parquet and thus saves additional time. This explains the larger difference for a small number of paths in the query. For larger query sizes, there must be additional factors which lead to the faster Parquet query times. One factor might be the compression and dictionary encoding of columns in the Parquet file, which reduce the amount of I/O during the read. Furthermore, in this benchmark we load complete columns of data, and Parquet is optimized for exactly this kind of access.
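To illustrate this point, the Parquet file metadata records the compression codec and the encodings used for every column chunk. The following is a minimal sketch using the pyarrow metadata API; the file name "data.parquet" is a placeholder and not part of the benchmark code.

import pyarrow.parquet as pq

# inspect how each column chunk of the first row group is stored on disk
metadata = pq.ParquetFile("data.parquet").metadata
rowGroup = metadata.row_group(0)
for i in range(rowGroup.num_columns):
    col = rowGroup.column(i)
    print(col.path_in_schema,
          col.compression,              # e.g. SNAPPY
          col.encodings,                # e.g. PLAIN_DICTIONARY, RLE
          col.total_compressed_size,
          col.total_uncompressed_size)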

Using our benchmarks, we have shown that we can achieve great processing performance when reading Parquet compared to JSON thanks to the structured column-oriented storage format of Parquet.

Speedups of up to 250 times faster compared to JSON can be achieved using Parquet. Also note that we could further improve the processing performance by enabling multi-threaded column reads.
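As a minimal sketch of this last point (assuming the pyarrow API used in Listing 1; 'fileName' and 'paths' are placeholders for the Parquet file and the queried columns), only the use_threads parameter of pyarrow.parquet.read_table needs to be changed:

import pyarrow.parquet

# same read as in the benchmark, but with multi-threaded column reads enabled
table = pyarrow.parquet.read_table(fileName, columns=paths, use_threads=True)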


8 Conclusion

We defined almost-homogeneous data sets as JSON data sets with a large degree of homogeneity. For such collections, the trade-off between the flexibility of JSON and its performance drawback is not justified.

In Chapter 2, we showed that many real-life JSON data sets can be considered almost-homogeneous.

Therefore, we proposed a system that converts such data sets to the Parquet file format which is optimized for processing structured data.

In order to convert heterogeneous data to Parquet, we first had to work around some limitations of the Arrow format, which we use to write the Parquet file, and of its Python implementation. To this end, we introduced JSON Flattening, which we described in Chapter 4. This method allows us to create homogeneous JSON Lines files by encoding the path and type in the JSON keys. Finally, we could convert the data to the new Parquet format as shown in Chapter 5.

In Chapter 6, we verified that our newly introduced structure does not negatively impact the file size. We also compared different compression schemes that can be applied to the Parquet format.

Finally, in Chapter 7, we presented benchmarks of the conversion process and of data querying. These benchmarks show the positive impact of our new storage format on data processing compared to unstructured JSON. Using the Parquet format, we achieved read speedups of up to 250 times compared to JSON while maintaining similar file sizes. We also proposed some optimization techniques which could be applied to potentially speed up the conversion process.

Future work could include the optimization of the conversion and adding more features to query the Parquet data. Furthermore, one could adjust the implementations to also support JSON’s big integers and decimals.


References

[1] JSON. URL: http://www.json.org (visited on 12/06/2020)

[2] JSON Lines. URL: http://www.jsonlines.org (visited on 12/06/2020)

[3] The Apache Software Foundation. Apache Parquet. URL: https://parquet.apache.org/documentation/latest/ (visited on 22/06/2020)

[4] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton and Theo Vassilakis. "Dremel: Interactive Analysis of Web-Scale Datasets", (2010). URL: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36632.pdf

[5] The Apache Software Foundation. Apache Arrow. URL: https://arrow.apache.org/ (visited on 22/06/2020)

[6] Daniel Yu. "Efficient Processing of Almost-Homogeneous Semi-Structured Data", (2018). URL: https://doi.org/10.3929/ethz-b-000272311

[7] New York Times Company. URL: https://www.nytco.com/ (visited on 22/06/2020)

[8] GitHub Inc. URL: https://github.com/ (visited on 22/06/2020)

[9] Reddit Inc. URL: https://www.redditinc.com/ (visited on 22/06/2020)

[10] Yelp Inc. URL: https://www.yelp.com/about (visited on 22/06/2020)

[11] The Apache Software Foundation. Python bindings. URL: https://arrow.apache.org/docs/python/ (visited on 24/03/2020)

[12] GitHub. Kenneth Knowles. Python JSONPath RW. URL: https://github.com/kennknowles/python-jsonpath-rw (visited on 12/06/2020)

[13] GitHub. Jayway JsonPath. URL: https://github.com/json-path/JsonPath (visited on 12/06/2020)


A Appendix

A.1 Data set sources

Data set          Source
New York Times    https://developer.nytimes.com/
GitHub            https://www.gharchive.org/
Reddit            https://www.reddit.com/r/datasets/comments/65o7py/updated_reddit_comment_dataset_as_torrents/
Yelp              https://www.yelp.com/dataset

A.2 Source Code

The Python implementations of the JSON Flattening and the JSON to Parquet conversion can be downloaded from: https://doi.org/10.3929/ethz-b-000426097

