
In the document Low Latency for Cloud Data Management (pages 55-61)

2.2 Backend Performance: Scalable Data Management

2.2.5 Polyglot Persistence

In distributed systems, causality is often tracked using vector clocks [Fid87]. Causal consistency can be implemented through a middleware or directly in the client by tracking causal dependencies and only revealing updates when their causal dependencies are visible, too [BGHS13, BKD+13]. Causal consistency is the strongest guarantee that can be achieved with high availability in the CAP theorem's system model of unreliable channels and asynchronous messaging [MAD+11]. The reason for causal consistency being compatible with high availability is that causal consistency does not require convergence of replicas and does not imply staleness bounds [GH02]. Replicas can be in completely different states, as long as they only return writes whose causal dependencies are met. For this reason, our caching approach in Orestes combines causal consistency with recency guarantees to strengthen the causal consistency model for practical data management use cases that require bounded staleness.
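To make the dependency-tracking idea concrete, the following is a minimal sketch of a vector clock as introduced by Fidge: each replica increments its own component on local events, merges on message receipt, and compares clocks component-wise to decide whether one event causally precedes another. The class and method names are illustrative; real systems such as COPS track dependencies per object and aggressively prune this metadata.

```python
class VectorClock:
    """Minimal vector clock for tracking causal dependencies (sketch)."""

    def __init__(self, counters=None):
        # Map of replica id -> event counter.
        self.counters = dict(counters or {})

    def tick(self, replica_id):
        # Local event: increment this replica's own component.
        self.counters[replica_id] = self.counters.get(replica_id, 0) + 1

    def merge(self, other):
        # On receiving a message: take the component-wise maximum.
        for rid, c in other.counters.items():
            self.counters[rid] = max(self.counters.get(rid, 0), c)

    def happened_before(self, other):
        # self -> other iff self <= other in every component and self != other.
        leq = all(c <= other.counters.get(rid, 0)
                  for rid, c in self.counters.items())
        return leq and self.counters != other.counters

    def concurrent(self, other):
        # Neither clock causally precedes the other.
        return (not self.happened_before(other)
                and not other.happened_before(self))
```

A causally consistent replica would attach such a clock to each update and only reveal an update once every update that happened before it is visible locally.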

Bailis et al. [BGHS13] argued that potential causality leads to a high fan-out of potentially relevant data. Instead, application-defined causality can help to minimize the actual dependencies. In practice, however, potential causality can be determined automatically through dependency tracking (e.g., in COPS [LFKA11]), while explicit causality forces application developers to declare dependencies.

Causal consistency can be combined with a timing constraint demanding that the global ordering respects causal consistency with tolerance ∆ for each read, yielding a model called timed causal consistency [TM05]. This model is weaker than ∆-atomicity: timed causal consistency with ∆ = 0 yields causal consistency, while ∆-atomicity with ∆ = 0 yields linearizability.

Definition 2.11. If both PRAM and writes follow reads are guaranteed, causal consistency is satisfied.

Besides the discussed consistency models, many different deviations have been proposed and implemented in the literature. Viotti and Vukolic [VV16] give a comprehensive survey and formal definitions of consistency models. In particular, they review the overlapping definitions used in different lines of work across the distributed systems, operating systems, and database research communities.

We are convinced that consistency trade-offs should be made explicit in data management.

While strong guarantees are a sensible default for application developers, there should be an option to relax consistency in order to shift the trade-off towards non-functional availability and performance requirements. The consistency models linearizability, causal consistency, ∆-atomicity, and session guarantees will be used frequently throughout this thesis, as they allow a fine-grained trade-off between latency and consistency.

Not only is the variety of use cases increasing, but the requirements are also becoming more heterogeneous: horizontal scalability, schema flexibility, and high availability are primary concerns for modern applications. While RDBMSs cover many of the functional requirements (e.g., ACID transactions and expressive queries), they cannot cover scalability, performance, and fault tolerance in the same way that specialized data stores can. The explosive growth of available systems through the Big Data and NoSQL movements sparked the idea of employing particularly well-suited database systems for subproblems of the overall application.

The architectural style polyglot persistence describes the use of specialized data stores for different requirements. The term was popularized by Fowler in 2011 and builds on the idea of polyglot programming [SF12]. The core idea is that abandoning a "one size fits all" architecture can increase development productivity and time-to-market as well as performance. Polyglot persistence applies to single applications as well as complete organizations.

Database System     Example Data       Requirements
Wide-Column Store   Clickstream Data   Scalability, Batch Analytics
Search Engine       Full-Text Index    Text Search, Faceting
Graph Store         Social Graph       Graph Queries
RDBMS               Financial Data     Consistency, ACID Transactions
Key-Value Store     Session Data       Write Availability, Fast Reads
Document Store      Product Catalog    Scalability, Query Response Time

Figure 2.10: Example of a polyglot persistence architecture with database systems for different requirements and types of data in an e-commerce scenario.

Figure 2.10 shows an example of a polyglot persistence architecture for an e-commerce application, as often found in real-world applications [Kle17]. Data is distributed to different database systems according to its associated requirements. For example, financial transactions are processed through a relational database to guarantee correctness. As product descriptions form a semi-structured aggregate, they are well-suited for storage in a distributed document store that can guarantee scalability of data volume and reads.

The log-structured storage management in wide-column stores is optimal for maintaining high write throughput for application-generated event streams. Additionally, they provide interfaces to apply complex data analysis through Big Data platforms such as Hadoop and Spark [Whi15, ZCF+10]. The example illustrates that in polyglot persistence architectures, there is an inherent trade-off between the increased complexity of maintenance and development and the improved, problem-tailored storage of application data.

In summary, polyglot persistence adopts the idea of applying the best persistence technology for a given problem. In the following, we will present an overview of different strategies for implementing polyglot persistence and the challenges they entail.

Polyglot Data Management

The requirement for a fast time-to-market is supported by avoiding the impedance mismatch [Mai90, Amb12] between the application's data structures and the persistent data model. For example, if a web application using a JSON-based REST API can store native JSON documents in a document store, the development process is considerably simplified compared to systems where the application's data model has to be mapped to a database system's data model.

Performance can be maximized if the persistence requirements allow for an efficient partitioning and replication of data combined with suitable index structures and storage management. If the application can tolerate relaxed guarantees for consistency or transactional isolation, database systems can leverage this to optimize throughput and latency.

Typical functional persistence requirements are:

• ACID transactions with different isolation levels

• Atomic, conditional, or set-oriented updates

• Query types: point lookups, scans, aggregations, selections, projections, joins, subqueries, Map-Reduce, graph queries, batch analyses, searches, real-time queries, dataflow graphs

• Partial or commutative update operations

• Data structures: graphs, lists, sets, maps, trees, documents, etc.

• Structured, semi-structured, or implicit schemas

• Semantic integrity constraints

Among the non-functional requirements are:

• Throughput for reads, writes, and queries

• Read and write latency

• High availability

• Scalability of data volume, reads, writes, and queries

• Consistency guarantees

• Durability

• Elastic scale-out and scale-in

The central challenge in polyglot persistence is determining whether a given database system satisfies a set of application-provided requirements and access patterns. While some performance metrics can be quantified with benchmarks such as YCSB, TPC, and others [DFNR14, CST+10, PPR+11, BZS13, BKD+14, PF00, WFZ+11, FMdA+13, BT14, Ber15, Ber14], many non-functional requirements such as consistency and scalability are currently not covered by benchmarks or even diverge from the documented behavior [WFGR15].

In a polyglot persistence architecture, the boundaries of the database systems form the boundaries of transactions, queries, and update operations. Thus, if data is persisted and modified in different databases, this entails consistency challenges. The application therefore has to explicitly control the synchronization of data across systems, e.g., through ETL batch jobs, or has to maintain consistency at the application level, e.g., through commutative data structures. Alternatively, data can be distributed in disjoint partitions, which shifts the problem to cross-database queries, a well-studied topic in data integration [Len02].

In contrast to data integration problems, however, there is no autonomy of data sources. Instead, the application explicitly combines and modifies the databases for polyglot persistence [SF12].

[Figure: a product page showing product information with the requirements filter queries and scalable data volume (2), alongside a view counter (1,300,212 views) with the requirements high write throughput and top-k queries, i.e., the k highest counters (1).]

Figure 2.11: Polyglot persistence requirements for a product catalog in an e-commerce application.

Implementation of Polyglot Persistence

To manage the increased complexity introduced by polyglot persistence, different architectures can be applied. We group them into the three architectural patterns application-coordinated polyglot persistence, microservices, and polyglot database services. As an example, consider the product catalog of the introductory e-commerce example (see Figure 2.11). The product catalog should be able to answer simple filter queries (e.g., searching by keyword) as well as return the top-k products according to access statistics. The functional requirement therefore is that the access statistics have to support a high write throughput (incrementing on each view) and top-k queries (1). The product catalog has to offer filter queries and scalability of data volume (2). These requirements can, for example, be fulfilled with the key-value store Redis and the document store MongoDB. With its sorted set data structure, Redis supports a mapping from counters to primary keys of products. Incrementing and performing top-k queries are efficiently supported with logarithmic time complexity in memory. MongoDB supports storing product information in nested documents and allows queries on the attributes of these documents. Using hash partitioning, the documents can efficiently be distributed over many nodes in a cluster to achieve scalability.
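The counter side of this pattern can be sketched with a small in-memory stand-in for the Redis sorted set: incrementing corresponds to the ZINCRBY command and the top-k query to ZREVRANGE. The class, the key format, and the on-demand top-k computation via a heap are illustrative simplifications; Redis maintains the ranking incrementally with logarithmic per-update cost.

```python
import heapq

class ViewCounter:
    """In-memory sketch of the sorted-set view-counter pattern."""

    def __init__(self):
        self.views = {}  # product key -> view count

    def record_view(self, product_key):
        # Redis equivalent: ZINCRBY product:views 1 <product_key>
        self.views[product_key] = self.views.get(product_key, 0) + 1

    def top_k(self, k):
        # Redis equivalent: ZREVRANGE product:views 0 k-1 WITHSCORES
        # (here recomputed on demand via a heap, for illustration only)
        return heapq.nlargest(k, self.views.items(), key=lambda kv: kv[1])
```

Product documents themselves would live in the document store; only the hot, write-heavy counters are kept in the sorted-set structure.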

[Figure: three sketches labeled application-coordinated polyglot persistence (an application whose modules access the databases directly), microservices (several microservices, each with its own API and database), and polyglot database service (applications accessing several databases through a shared database service).]

Figure 2.12: Architectural patterns for the implementation of polyglot persistence: application-coordinated polyglot persistence, microservices, and polyglot database services.

With application-coordinated polyglot persistence (see Figure 2.12), the application server's data tier programmatically coordinates polyglot persistence. Typically, the mapping of data to databases follows the application's modularization. This pattern simplifies development, as each module is specialized for the use of one particular data store. Also, design decisions in data modeling as well as access patterns are encapsulated in a single module (loose coupling). The separation can also be relaxed: for the product catalog, it would not only be possible to model a counter separately from the product data; instead, a product could also be modeled as an entity containing a counter. The dependency between databases has to be considered both at development time and during operation. For example, if the format of the primary key changes, the new key structure has to be implemented for both systems in the code and in the database. Object-NoSQL mappers simplify the implementation of application-coordinated polyglot persistence. However, currently, the functional scope of these mappers is very limited [TGPM17, SHKS15, WBGsS13].
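The encapsulation of cross-store coupling in one module can be sketched as follows. Both backends are modeled as plain dicts, and the shared primary-key format lives in a single place, so a change to the key structure touches only this module. All names here are hypothetical, not from the thesis.

```python
class ProductCatalogModule:
    """Sketch of a module coordinating a document store and a counter store."""

    def __init__(self, document_store, counter_store):
        self.docs = document_store        # stand-in for the document store
        self.counters = counter_store     # stand-in for the key-value store

    @staticmethod
    def _key(product_id):
        # Single definition of the key format shared by both systems.
        return f"product:{product_id}"

    def save(self, product_id, document):
        # Write the product document and initialize its view counter.
        key = self._key(product_id)
        self.docs[key] = document
        self.counters.setdefault(key, 0)

    def view(self, product_id):
        # Count the access in one store, read the document from the other.
        key = self._key(product_id)
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.docs.get(key)
```

Callers never see which store holds which part of the data, which is exactly the loose coupling the pattern aims for.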

A practical example of application-coordinated polyglot persistence is Twitter's storage of user feeds [Kri13]. For fast read access, the newest tweets for each user are materialized in a Redis cluster. Upon publishing a new tweet, the social graph is queried from a graph store, and the tweet is distributed among the Redis-based feeds of each relevant user (write fanout). As a persistent fallback for Redis, MySQL servers are managed and partitioned by the application tier.
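The write-fanout step can be sketched as follows: on publish, the new item is prepended to the materialized feed of every follower, and each feed is capped to a fixed length. The social graph and the Redis feed store are modeled as plain dicts; the class name and the feed limit are illustrative assumptions, not details from [Kri13].

```python
class FeedFanout:
    """Sketch of write fanout for materialized user feeds."""

    FEED_LIMIT = 800  # illustrative cap on the materialized feed length

    def __init__(self, followers):
        self.followers = followers  # user -> set of followers (graph store stand-in)
        self.feeds = {}             # user -> tweet ids, newest first (Redis stand-in)

    def publish(self, author, tweet_id):
        # Fan out the new tweet to every follower's materialized feed.
        for user in self.followers.get(author, ()):
            feed = self.feeds.setdefault(user, [])
            feed.insert(0, tweet_id)
            del feed[self.FEED_LIMIT:]  # truncate to the feed limit
```

The trade-off is the classic one: writes become expensive (one insertion per follower), but reading a feed is a single cheap lookup.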

To increase the encapsulation of persistence decisions, microservice architectures are useful [New15] (see Section 2.1.1). Microservices allow narrowing the choice of a database system down to one particular service and thus decouple the development and operations of services [DB13]. Technologically, IaaS/PaaS, containers, and cluster management frameworks provide sophisticated tooling for scaling and operating microservice architectures.

In the example, the product catalog could be split into two microservices using MongoDB and Redis separately. The Redis-based service would provide an API for querying popular products and incrementing counters, whereas the MongoDB-based microservice would have a similar interface for retrieving product information. The user-facing business logic (e.g., the frontend in a two-tier architecture) simply has to invoke both microservices and combine the results.

In order to make polyglot persistence fully transparent for the application, polyglot database services can abstract from the implementation details of the underlying systems. The key idea is to hide the allocation of data and queries to databases behind a generic cloud service API. Some NoSQL databases and services use this approach, for example, to integrate full-text search with structured storage (e.g., in Riak [Ria17] and Cassandra [LM10]), to store metadata consistently (e.g., in HBase [Hba17] and BigTable [CDG+08]), or to cache objects (e.g., Facebook's TAO [BAC+13]). However, these approaches use a fixed scheme for the allocation and cannot adapt to varying application requirements. Polyglot database services can also apply static rules for polyglot persistence: if the type of the data is known (for example, a user object or a file), a rule-based selection of a storage system can be performed [SGR15].

The currently unsolved challenge is to adapt the choice of a database system to the actual requirements and workloads of the application. A declarative way to achieve this is to have the application express its requirements as SLAs and let a polyglot database service determine the optimal mapping. In this thesis, we will explore a first approach for automating polyglot persistence.

In the example, the application could declare the throughput requirements of the counter and the scalability requirement for the product catalog. The database service would then autonomously derive a suitable mapping for queries and data. The core challenge is to base the selection of systems on quantifiable metrics of the available databases and to apply transparent rewriting of operations. A weaker form than fully automated polyglot persistence are database services with semi-automatic polyglot persistence. In this model, the application can explicitly define which underlying system should be targeted, while reusing high-level features such as schema modeling, transactions, and business logic across systems through a unified API. Orestes supports this strategy and provides standard interfaces for different database systems to integrate them into a polyglot database middleware exposed through a unified set of APIs.
