
Evaluation of the Cache Sketch for Object Caching


4.3.2 Parameter Optimization for the CATE TTL Estimator

As discussed above, adaptive TTL estimation depends on the slope of the ratio function as well as TTL_max. Naturally, if the maximum TTL is much longer than the actual simulation, results will show a high cache hit rate and many invalidations compared to a relatively short maximum TTL. Hence, the maximum TTL is set to the duration of the simulation and all other TTLs are estimated in relation to this maximum. In order to optimize these parameters, we use a variation of maximum descent hill climbing. Initial slopes of the ratio function are drawn uniformly at random from the [0, 1] range.

The optimization algorithm tests whether increasing or decreasing the slope improves the simplified cost function (w · #cache misses + (1 − w) · #invalidations) / #ops that is to be minimized. The score is calculated as the sum of cache misses and invalidations, normalized by the total number of operations. It is thus a simplified version of the score introduced in the model. This simplification is reasonable, because stale reads show high variance (e.g., depending on nondeterministic thread scheduling) while having minimal impact on the score, thus presenting mostly noise to the optimization. Since the number of invalidations is an approximate indicator of stale reads as well as a measure of the Bloom filter population, we have found the simplified cost function to be a well-working substitute for the original cost function. Depending on the cost of cache misses compared to invalidations (and the subsumed false positive and stale read rates), the terms are weighted with w ∈ [0, 1].

Testing directions for TTL_max and the slope constitutes a super-step, which concludes by persisting the direction of maximum change (towards a lower cost) for the next super-step to start with. The algorithm terminates after a given number of super-steps or when it cannot improve the cost function further.
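To make the procedure concrete, the following sketch shows one possible implementation of the optimization loop. Here, simulate is a hypothetical stand-in for a single simulation run that returns the tuple (#cache misses, #invalidations, #operations) for given parameters; the step sizes and the iteration limit are illustrative choices, not the exact values used in our experiments.

```python
import random

def cost(misses, invalidations, ops, w):
    # Simplified cost function: (w * #cache misses + (1 - w) * #invalidations) / #ops
    return (w * misses + (1 - w) * invalidations) / ops

def hill_climb(simulate, w, ttl_max, max_super_steps=50,
               slope_delta=0.05, ttl_delta=0.1):
    """Maximum descent hill climbing over slope and TTL_max (sketch)."""
    slope = random.uniform(0.0, 1.0)  # initial slope drawn uniformly from [0, 1]
    best = cost(*simulate(slope, ttl_max), w)
    for _ in range(max_super_steps):
        # One super-step: test both directions for both parameters ...
        candidates = [
            (min(slope + slope_delta, 1.0), ttl_max),
            (max(slope - slope_delta, 0.0), ttl_max),
            (slope, ttl_max * (1.0 + ttl_delta)),
            (slope, ttl_max * (1.0 - ttl_delta)),
        ]
        scored = [(cost(*simulate(s, t), w), s, t) for s, t in candidates]
        score, s, t = min(scored)
        if score >= best:
            break  # ... and terminate once no direction lowers the cost.
        best, slope, ttl_max = score, s, t  # persist direction of maximum change
    return slope, ttl_max, best
```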

Optimizations were performed for YCSB workloads A [CST+10, Section 4] (write-heavy; read/write balance 50%/50%) and B (read-heavy; read/write balance 95%/5%) with a Zipfian popularity distribution. Each simulation step was run on 100 000 operations for 10 threads, with a targeted throughput of 200 ops/s and a time scaling factor of 50, on the default amount of 1 000 objects. We ran the hill climbing algorithm from 25 starting points. Figures 4.9a and 4.9c show the resulting costs as a function of w for the optimized parameters of CATE with a linear ratio function, compared to static TTL estimation with a high TTL_max. The results demonstrate that CATE performs significantly better than static estimation for applications that prefer high cache hit rates (workload A). Unsurprisingly, read-heavy workloads leading to many cache hits perform slightly better with a static (maximum) TTL estimation (unless cache misses are weighted very low).
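For reference, the Zipfian key popularity assumed by these workloads can be sampled as in the following sketch; the exponent s = 0.99 mirrors YCSB's default Zipfian constant, and the function name is ours.

```python
import random

def zipf_keys(n_objects=1_000, n_ops=100_000, s=0.99):
    """Sample object keys with Zipfian popularity: key 0 is the hottest."""
    weights = [1.0 / rank ** s for rank in range(1, n_objects + 1)]
    return random.choices(range(n_objects), weights=weights, k=n_ops)
```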

As page load time is arguably the most important web performance metric, we analyzed the gains of cached initialization for different browser cache and CDN cache hit rates and two Cache Sketch false positive rates (5% and 30%). The analysis assumes an average web page with 90 resources loaded over 6 connections [GBR14] and that the Cache Sketch is used for every resource. The results shown in Figure 4.9b are as drastic as expected: for instance, for a cache hit rate of 66% in the browser and 20% in the CDN, as described for the Facebook photos workload [HBvR+13], the speedup is over 320% for p = 0.05.

Figure 4.9: YMCA simulation results. (a) Costs for workload A as a function of w (static vs. CATE with a linear ratio function). (b) Page load time improvement through the cached initialization model for p = 5% and p = 30%; labels on the x-axis indicate the percentage cache hit rates in the browser (first number) and the CDN (second number). (c) Costs for workload B as a function of w. (d) False positive rate for an n = 1 000, p = 0.05 Cache Sketch under Zipf-distributed operations.

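The intuition behind these numbers can be reproduced with a simple round-trip model, sketched below. The per-tier latency constants are illustrative assumptions of ours, not the values from the original analysis; only the 90 resources and 6 connections stem from the text.

```python
def page_load_time(browser_hit, cdn_hit, p, resources=90, connections=6,
                   t_browser=1.0, t_cdn=40.0, t_origin=150.0):
    """Back-of-the-envelope model of cached initialization (latencies in ms).

    A browser cache hit only avoids a request if the Cache Sketch does not
    flag the resource as stale, which happens with false positive rate p.
    """
    from_browser = browser_hit * (1 - p)          # served locally
    from_cdn = (1 - from_browser) * cdn_hit       # answered by the CDN
    from_origin = 1 - from_browser - from_cdn     # full round-trip to origin
    per_resource = (from_browser * t_browser + from_cdn * t_cdn
                    + from_origin * t_origin)
    # Resources are fetched in waves over the limited parallel connections.
    return (resources / connections) * per_resource

# Facebook photos workload: page_load_time(0.66, 0.20, 0.05) vs. p = 0.30.
```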

The development of the Cache Sketch false positive rate is shown in Figure 4.9d for 100 000 objects, workload B, a slope optimized for 100 000 operations, and a Bloom filter configured to hold 1 000 elements with p = 0.05. As expected, CATE achieves lower false positive rates by decreasing TTLs when p grows too large. Even though the Cache Sketch is provisioned to hold only 1% of all objects, the static estimator performs surprisingly well, as long as the number of operations is smaller than the total number of objects.
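The provisioning of the Cache Sketch follows the standard Bloom filter formulas; the short sketch below derives m and k for the n = 1 000, p = 0.05 configuration and shows how the expected false positive rate degrades once the filter is filled far beyond its capacity.

```python
import math

def bloom_parameters(n, p):
    """Optimal bit count m and hash count k for capacity n and target rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = max(1, round(m / n * math.log(2)))
    return m, k

def false_positive_rate(n, m, k):
    """Expected false positive rate after inserting n elements."""
    return (1 - math.exp(-k * n / m)) ** k

m, k = bloom_parameters(1_000, 0.05)      # m = 6 236 bits, k = 4
print(false_positive_rate(1_000, m, k))   # ~0.05, as provisioned
print(false_positive_rate(10_000, m, k))  # ~1.0 once n far exceeds capacity
```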

4.3.3 YCSB Results for CDN-Cached Database Workloads

To validate the results in an experimental setup, we conducted the YCSB benchmark for the described setup on Amazon EC2, using c3.8xlarge instances for the client (Northern California region) and server (Ireland), while caching in the Fastly CDN [Spa17]. We employed the document store MongoDB as a baseline for classic database communication.

It was compared to an Orestes server running with MongoDB to add the Cache Sketch and the REST API. The benchmark was performed with the same configuration as the simulation, but using the static TTL estimator.

Figure 4.10: Performance and consistency metrics for YCSB with CDN caching for two different workloads (A and B). (a) Throughput for a single client (Orestes vs. MongoDB, workloads A and B). (b) Cache hit ratios. (c) Latencies of read operations. (d) Stale read ratios.

Figure 4.10 shows latency, throughput, cache hit ratios, and stale reads for 32 to 1 024 threads (i.e., YCSB clients).

The results reveal the expected behavior: latency and throughput are improved considerably in both workloads, although a slight non-linearity occurs between 512 and 1 024 threads because of the thread scheduling overhead of YCSB's limited single-machine design. MongoDB achieves the same latency and throughput in both workloads, since all operations are bounded by network latency. The very few stale reads show considerable variance and were largely independent of the number of threads, as seen in Figure 4.10d. This fact supports our argument that (∆c, p)-atomicity is an appropriate consistency measure and that CDNs are well-suited to answer Cache Sketch-triggered revalidations.

4.3.4 Industry Backend-as-a-Service Evaluation

For a practical performance comparison of BaaS systems, we developed a simple benchmark to test the end-to-end latency of a serverless web application. Here we provide a short summary of the performance of Baqend, the commercial service offering of Orestes.

The comparison between our approach and several popular commercial BaaS providers (Kinvey, Firebase, Azure Mobile Services, Apiomat, Parse) is open-source and can be validated in a web browser¹⁰.

Figure 4.11: Page load time comparison (in ms) for different industry Backend-as-a-Service providers (Baqend, Kinvey, Firebase, Azure, Apiomat, Parse) serving the same data-driven web application from Frankfurt, Tokyo, Sydney, and California. (a) Performance with a cold browser cache and a warm CDN (first load). (b) Performance with a warm browser cache and a warm CDN (second load).


The benchmark uses the example of a simple news website loaded from different geographical locations (Frankfurt, Tokyo, Sydney, California) with a cold browser cache and a warm CDN cache. Both the web application files (HTML, CSS, and JavaScript) and the actual database objects are fetched from the respective BaaS. The data model is structured as news stream objects that reference individual news items. Each news item contains text fields like the title and teaser, as well as a reference to an image object. The data model is adapted to each provider's abstractions, but is conceptually identical.

¹⁰ The benchmark can be run at http://benchmark.baqend.com/ and the source code of all implementations is published on GitHub [Baq18].

Data is rendered through JavaScript that uses the respective SDKs to fetch the data of 30 news articles with images.
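For illustration, the conceptual data model can be expressed as follows; the class and field names are ours, as each provider's schema abstraction differs in the details.

```python
from dataclasses import dataclass

@dataclass
class Image:
    url: str                  # image object referenced by a news item

@dataclass
class NewsItem:
    title: str                # text fields such as title and teaser
    teaser: str
    image: Image              # reference to an image object

@dataclass
class NewsStream:
    news: list[NewsItem]      # stream object referencing individual news items
```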

The results are shown in Figure 4.11a. For each location (measured from AWS data centers and a Chrome browser), the page load time (load event) is plotted as an average over ten consecutive runs with cold connections and a cold browser cache (first visit). As reads from Baqend rely on invalidation-based CDN caching, the average performance advantage is a factor of 15.4. As the Cache Sketch becomes effective with expiration-based caching, we also measured the second load. This measurement corresponds to the loading time for a returning visitor with a warm browser cache (see Figure 4.11b). Some of the other providers cache data in the browser, however without the ability to maintain consistency.

Nonetheless, the performance advantage of the Cache Sketch is even greater: on second load, the other providers are outperformed by a factor of 21.1 on average. This shows that the predicted performance improvements of our proposed caching approach translate into the setting of practical websites and web applications through file and object caching.

4.3.5 Efficient Bloom Filter Maintenance

The server Cache Sketch requires an efficient underlying Counting Bloom filter. For this purpose, we developed a Bloom filter framework available as an already widely-used open-source project¹¹. It supports normal and Counting Bloom filters as in-memory data structures as well as shared filters backed by the in-memory key-value store Redis [San17].

The library supports the table-based sharding and replication introduced in Section 4.1.6 for high-throughput workloads. The Redis Bloom filter uses the data structures of Redis to maintain an efficient bit vector for the materialized Bloom filter and relies on pipelining and batch transactions to ensure performance and consistency. We chose Redis because of its tunable persistence complemented with very low latency [Car13].
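To illustrate the idea, the following minimal sketch implements a plain (non-counting) Redis-backed Bloom filter with pipelined bit operations. It is a simplification of ours, not the Orestes-Bloomfilter implementation, which additionally maintains counters for removals and uses batch transactions.

```python
import hashlib
import redis  # assumes the redis-py client is installed

class RedisBloomFilter:
    def __init__(self, client, key="bf", m=100_000, k=5):
        self.client, self.key, self.m, self.k = client, key, m, k

    def _positions(self, element):
        # Double hashing: derive k bit positions from one SHA-1 digest.
        digest = hashlib.sha1(element.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, element):
        # Pipelining sends all k SETBITs in a single network round-trip.
        pipe = self.client.pipeline()
        for pos in self._positions(element):
            pipe.setbit(self.key, pos, 1)
        pipe.execute()

    def contains(self, element):
        pipe = self.client.pipeline()
        for pos in self._positions(element):
            pipe.getbit(self.key, pos)
        return all(pipe.execute())

# Usage: bf = RedisBloomFilter(redis.Redis()); bf.add("key1"); bf.contains("key1")
```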

Figure 4.12 shows selected performance characteristics of the Cache Sketch under different hash functions and operations. The uniformity of the implemented hash functions for random strings is evaluated in Figure 4.12a using the p-values of 100 χ²-goodness-of-fit tests. For random inputs (e.g., UUID object keys), all hash functions perform reasonably well, including simple checksums. However, for keys exhibiting structure, the best trade-off between speed of computation and uniformity is reached by Murmur 3.
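The uniformity evaluation follows the usual χ² recipe: hash many random inputs, bucket them, and test the observed counts against a uniform expectation; under the null hypothesis of a uniform hash, the resulting p-values are themselves uniform in [0, 1]. A sketch, assuming SciPy is available and some hash function such as mmh3.hash for Murmur 3:

```python
import uuid
from collections import Counter
from scipy.stats import chisquare  # assumes SciPy is installed

def uniformity_p_value(hash_fn, n_hashes=100_000, m=1_000):
    """One chi-squared goodness-of-fit test of hash uniformity over m buckets."""
    counts = Counter(hash_fn(uuid.uuid4().hex) % m for _ in range(n_hashes))
    observed = [counts.get(bucket, 0) for bucket in range(m)]
    return chisquare(observed).pvalue  # uniform expectation is the default

# Repeating this 100 times per hash function yields p-value distributions as
# in Figure 4.12a (e.g., with a Murmur 3 binding: uniformity_p_value(mmh3.hash)).
```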

Figure 4.12b plots the throughput of the unpartitioned, non-replicated Redis Bloom filters for a growing number of connections with m = 100 000 and k = 5 on a server with 16 GB RAM and a four-core 3 GHz CPU. Read operations (querying, pulling the complete filter) achieve roughly 250 000 ops/s, while write operations (adding, removing), which require some overhead for counter maintenance and concurrency, still achieve over 50 000 and 100 000 ops/s, respectively.

We also report the performance of Cache Sketch operations on the DBaaS variant of Redis (AWS ElastiCache) for two different instance sizes (see Table 4.2).

¹¹ Available at https://github.com/Baqend/Orestes-Bloomfilter along with more detailed results.

Figure 4.12: Analysis of the Redis-backed Bloom filters. (a) Quality of Bloom filter hashes for random words (box-whisker plot of p-values; 100 000 hashes, m = 1 000, k = 10) for Murmur3, Murmur3KM, Murmur2, CRC32, MD2, MD5, SHA-1, SHA-256, SHA-384, SHA-512, Adler32, FNVWithLCG, CarterWegman, and RNG. (b) Throughput of the Redis-backed Counting Bloom filter and plain Bloom filter (BF) for the contains, add, remove, and pull operations over 1 to 64 connections.

Each operation type was performed 10M times and averaged over 5 repetitions. The small instance (cache.m3.large, 2 cores, 6 GB RAM, medium I/O performance) and the large instance (cache.r3.8xlarge, 237 GB RAM, 10 GBit/s I/O) showed less than 30% performance difference. The only exception is the pull operation, which fetches the complete Bloom filter: as this operation is bounded by network bandwidth, the larger EC2 instance shows a proportional throughput advantage here. Latencies of all operations in the experiment were consistently below 1 ms.

Cache Sketch     add       remove     pull
In-Memory        4.411M    12.787M    1.311M
Redis (small)    213K      379K       22.8K
Redis (large)    174K      313K       184K

Table 4.2: Throughput of different Cache Sketch implementations in operations/s.

The results provide clear evidence that the Redis-based implementation of the Cache Sketch delivers sufficient performance to sustain a throughput of more than 100 K queries or invalidations per second with a single backing Redis instance. With the per-table partitioning model introduced in Section 4.1.6, the Cache Sketch can thus easily support much higher throughput than the Orestes servers and will not become a bottleneck.

In conclusion, the simulations and the experimental evaluation show that the Cache Sketch is able to considerably reduce latency for data management workloads. Through the combination of an efficient Cache Sketch and TTL estimation, the performance benefits are achieved with ∆-atomic reads. In the following, we will extend the scheme from key-based object caching to arbitrary query results. Then, we will discuss an integrated approach for low-latency access to files, objects, and queries.
