The Dual Streaming Model

matically, making it even harder for users to understand the semantics of the system because those transformation rules are quite complex. Transforming streams into tables for stateless transformations also has the disadvantage that records with the same timestamp might be lost in a non-deterministic manner (cf. [JMS⁺08]).

Another difference is the types of supported streams: CQL uses IStream for insert streams, DStream for delete streams, and RStream for relation streams. In our model, we support record streams (similar to IStream) and changelog streams (a combination of IStream and DStream). It is important to note that an RStream can be used to compute a corresponding changelog stream, and thus both models are equally expressive with regard to their types of data streams.
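The relationship between an RStream and a changelog stream can be illustrated with a minimal sketch. It assumes, purely for illustration, that each RStream element is a full snapshot of a relation keyed by a primary key, and that a delete is encoded as a change with value `None`; the names `Relation`, `Change`, and `changelog_from_rstream` are hypothetical and not part of CQL or of our model.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

Relation = Dict[str, int]            # assumed shape: primary key -> value
Change = Tuple[str, Optional[int]]   # (key, new value); None marks a delete


def changelog_from_rstream(snapshots: Iterable[Relation]) -> Iterator[Change]:
    """Derive a changelog by diffing consecutive relation snapshots."""
    previous: Relation = {}
    for current in snapshots:
        # Keys that disappeared correspond to the DStream part.
        for key in sorted(previous.keys() - current.keys()):
            yield (key, None)
        # New or modified keys correspond to the IStream part.
        for key in sorted(current):
            if key not in previous or previous[key] != current[key]:
                yield (key, current[key])
        previous = dict(current)


snapshots = [{"a": 1}, {"a": 1, "b": 2}, {"a": 3, "b": 2}, {"b": 2}]
changes = list(changelog_from_rstream(snapshots))
```

Replaying `changes` in order against an empty table reproduces the last snapshot, which is the sense in which the two representations carry the same information in this sketch.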

While the CQL model was the first to define strict operator semantics, the model seems to be quite attached to the relational model. In contrast to CQL, we propose to embrace data streams as first-class citizens in combination with relational tables. The goal is to simplify operator semantics and allow users to reason about the system easily.

Law et al. [LWZ04] introduce a model similar to ours and define operators that continuously update the output with the notion of “the result so far”. They define correctness of operators based on input prefixes and have a formal notion of blocking and non-blocking operators. The difference from our work is that they only model record streams, and their model is limited to monotonic queries for this reason. They also do not consider out-of-order records or windowing operators.

The SECRET model [DTM⁺13] aims to describe different window semantics with a uniform model. SECRET focuses on centralized stream processing systems, cannot express out-of-order data, and does not cover other stream processing operators [ATM⁺17]. Our model is more generic than SECRET, which focuses solely on windowing semantics.

Order and Time

Order and time present multiple challenges in data stream processing. For example, how should out-of-order data be handled, and what time semantics should be used? The notion of event-time vs. processing-time was first introduced by Srivastava and Widom [SW04]. They noted that processing-time or ingestion-time guarantees that there is no out-of-order data. Using event-time raises challenges for unsynchronized clocks of external data sources, resulting in skewed time, delays, and out-of-order data. Buffering is one suggested technique to reorder data. To avoid delays, heartbeats are introduced, which also help to detect unsynchronized clocks.
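The buffering-and-heartbeat technique can be sketched as follows. This is a minimal illustration and not the algorithm of [SW04]: it assumes integer event timestamps and a heartbeat that promises no future record will carry a timestamp smaller than the heartbeat time. The class name `ReorderBuffer` and its methods are hypothetical.

```python
import heapq
from typing import List, Tuple


class ReorderBuffer:
    """Buffers records and releases them in timestamp order on heartbeats."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = 0  # tie-breaker: keeps arrival order for equal timestamps

    def add(self, timestamp: int, value: str) -> None:
        heapq.heappush(self._heap, (timestamp, self._seq, value))
        self._seq += 1

    def heartbeat(self, time: int) -> List[Tuple[int, str]]:
        """Assumed guarantee: no later record has a timestamp < time."""
        released = []
        while self._heap and self._heap[0][0] < time:
            ts, _, value = heapq.heappop(self._heap)
            released.append((ts, value))
        return released


buf = ReorderBuffer()
buf.add(5, "e")
buf.add(2, "b")
buf.add(4, "d")
first = buf.heartbeat(5)    # records with timestamp < 5 are now safe
buf.add(6, "f")
second = buf.heartbeat(10)
```

Records are held back until a heartbeat arrives, which is exactly the source of the added latency: the less frequent the heartbeats, the longer a record waits in the buffer.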

Time semantics are discussed by Barga et al. [BGAH07]. They introduce strong time semantics similar to temporal database systems, including definitions for different levels of consistency. Out-of-order records as well as blocking operators are considered, too. A temporal-relational algebra is also used by StreamScope [LFQ⁺16] based on time intervals that are assigned to records instead of scalar timestamps.

Another approach to handling out-of-order data is punctuations [TMSF03]. Punctuations are control messages that provide certain guarantees about the data stream. For example, they can express that no record after the punctuation will have a timestamp smaller than a certain value. Thus, punctuations are similar to heartbeats but more generic, as they can express arbitrary constraints on the data, while heartbeats only express time progress.

The discussed techniques have in common that they imply a partially blocking computation until a heartbeat or punctuation arrives, or a buffer to reorder records is filled up. Thus, those techniques result in delays and increased processing latency.

A different approach is to process all data immediately to keep processing latency as small as possible and refine results later if required. We follow the update approach and use punctuations to trade off space (reduced number of downstream updates) vs. time (increased latency). Thus, updates are a more generic approach compared to punctuations and partial blocking.
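The update approach can be illustrated with a small sketch. It assumes tumbling windows over integer timestamps and a count aggregation, both chosen only for illustration; the function name `windowed_count_with_updates` is hypothetical, and the sketch emits every update without any suppression.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def windowed_count_with_updates(
    records: Iterable[Tuple[int, str]], window_size: int
) -> Iterator[Tuple[int, int]]:
    """Emit (window start, count so far) immediately for every record.

    Out-of-order records are not buffered; they simply refine the
    result of the window they belong to.
    """
    counts: Dict[int, int] = defaultdict(int)
    for timestamp, _value in records:
        window_start = (timestamp // window_size) * window_size
        counts[window_start] += 1
        yield (window_start, counts[window_start])


# Timestamps 3 and 7 arrive after records of the next window.
records = [(1, "a"), (12, "b"), (3, "c"), (14, "d"), (7, "e")]
updates = list(windowed_count_with_updates(records, window_size=10))
final_table = dict(updates)  # the latest update per window wins
```

Every input record yields a downstream update, so latency is minimal, but the number of updates grows with the input. Holding updates back until some progress signal arrives would reduce that number at the price of latency.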

Finally, Begoli et al. [BAH⁺19] propose a change to SQL that allows incorporating data streams and temporal tables and expressing unified queries over both. Their temporal table is very similar to our evolving table; however, their processing model is not based on continuous updates like ours, but on watermarks and triggers.

5.5 Summary

We introduced the Dual Streaming Model in the second part of this thesis. It puts forward the duality between data streams and temporal-relational tables, and treats state as a first-class citizen instead of an internal implementation detail. To this end, the Dual Streaming Model decouples result correctness/completeness from processing latency. This decoupling opens up the design space for data stream processing applications and allows users to trade off result correctness/completeness vs. processing latency vs. processing cost. Furthermore, our model enables users to retrieve the result of their program either push-based, by subscribing to the result stream, or pull-based, by querying the result table.
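The push-based and pull-based access paths can be sketched with a toy result object. This is an illustrative assumption only and does not reflect the Kafka Streams API: the class `DualResult` and its methods are hypothetical, and a delete is encoded as a change with value `None`.

```python
from typing import Callable, Dict, List, Optional, Tuple

Change = Tuple[str, Optional[int]]  # (key, new value); None marks a delete


class DualResult:
    """One result, exposed both as a changelog stream and as a table."""

    def __init__(self) -> None:
        self._table: Dict[str, int] = {}
        self._subscribers: List[Callable[[Change], None]] = []

    def subscribe(self, callback: Callable[[Change], None]) -> None:
        """Push-based access: the callback receives every update."""
        self._subscribers.append(callback)

    def apply(self, key: str, value: Optional[int]) -> None:
        """Apply one update to the table and forward it to subscribers."""
        if value is None:
            self._table.pop(key, None)
        else:
            self._table[key] = value
        for callback in self._subscribers:
            callback((key, value))

    def get(self, key: str) -> Optional[int]:
        """Pull-based access: query the current state of the table."""
        return self._table.get(key)


result = DualResult()
pushed: List[Change] = []
result.subscribe(pushed.append)
result.apply("a", 1)
result.apply("a", 2)
result.apply("b", 5)
result.apply("b", None)
```

The subscriber observes the full history of updates, while a query only sees the latest state; both views are derived from the same sequence of changes.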

The Dual Streaming Model is the foundation of Kafka Streams [ASFg], the stream processing library of Apache Kafka [ASFc, KNR11]. Kafka Streams is widely adopted in the industry, including large enterprises, which shows that our Dual Streaming Model is useful in practice.


Part IV

Discussion


Chapter 6

Conclusion

The requirement for low-latency data processing of high-volume data streams has increased over the last few years. Yet, state-of-the-art distributed stream processing systems are still hard to operate in practice. Furthermore, there is no agreement on a unified processing model, and different systems offer different semantics and trade-offs to the user.

In this thesis, we introduced a cost-model for data-parallel distributed stream processing systems (Chapter 3). Our cost-model is built on operator parallelism and record batching. To execute a streaming data flow program efficiently, record batching may be employed to trade off processing latency vs. throughput. Furthermore, for high-volume data streams, data-parallelism is used to allow a system to process all data without “falling behind”. Our model considers CPU and network cost to estimate the required degree of parallelism for each operator given a target input data rate. Based on our cost-model, we presented multiple algorithms to detect processing bottlenecks, predict the data flow throughput, and optimize batch sizes as well as operator parallelism (Chapter 4). We believe that our cost-model and analytical optimization approach should be combined with dynamic scaling to adapt to the changing characteristics of input data streams. Furthermore, extending our model to incorporate processing latency and extending dynamic batching approaches with such a model are interesting future work.

In the second part of this thesis (Chapter 5), we proposed the Dual Streaming Model that unifies concepts of existing models. We put forward the duality of data streams and temporal-relational tables, and treat state as a first-class citizen. The model makes the trade-off between processing cost, processing latency, and result correctness/completeness explicit to the user. Hence, it opens the design space for stream processing applications and allows users to pick a trade-off depending on their application requirements. We believe that the Dual Streaming Model is a step forward to generic processing semantics that allow expressing a wide variety of stream processing applications within a single model.


Bibliography

[AA91] A. N. Wilschut and Peter M. G. Apers. Dataflow Query Execution in a Parallel Main-Memory Environment, pages 68–77. IEEE Computer Society, United States, 12 1991.

[AAB⁺05] Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Uğur Çetintemel, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, and Stanley B. Zdonik. The design of the Borealis stream processing engine. In CIDR 2005, Second Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2005, Online Proceedings, pages 277–289, 2005.

[AAB⁺06] Lisa Amini, Henrique Andrade, Ranjita Bhagwan, Frank Eskesen, Richard King, Philippe Selo, Yoonho Park, and Chitra Venkatramani. SPC: A distributed, scalable platform for data mining. In Proceedings of the 4th International Workshop on Data Mining Standards, Services and Platforms, DMSSP ’06, pages 27–37, New York, NY, USA, 2006. ACM.

[ABB⁺03a] Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Keith Ito, Rajeev Motwani, Itaru Nishizawa, Utkarsh Srivastava, Dilys Thomas, Rohit Varma, and Jennifer Widom. STREAM: The Stanford stream data manager. IEEE Data Engineering Bulletin, 26(1):19–26, 2003.

[ABB⁺03b] Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Keith Ito, Itaru Nishizawa, Justin Rosenstein, and Jennifer Widom. STREAM: The Stanford stream data manager (demonstration description). In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD ’03, pages 665–665, New York, NY, USA, 2003. ACM.

[ABB⁺13] Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. MillWheel: Fault-tolerant stream processing at internet scale. Proceedings of the VLDB Endowment, 6(11):1033–1044, August 2013.


[ABC⁺05] Yanif Ahmad, Bradley Berg, Uğur Çetintemel, Mark Humphrey, Jeong-Hyon Hwang, Anjali Jhingran, Anurag Maskey, Olga Papaemmanouil, Alexander Rasin, Nesime Tatbul, Wenjuan Xing, Ying Xing, and Stan Zdonik. Distributed operation in the Borealis stream processing engine. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD ’05, pages 882–884, New York, NY, USA, 2005. ACM.

[ABC⁺15] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB Endowment, 8(12):1792–1803, August 2015.

[ABQ13] Leonardo Aniello, Roberto Baldoni, and Leonardo Querzoni. Adaptive online scheduling in Storm. In Proceedings of the 7th ACM International Conference on Distributed Event-based Systems, DEBS ’13, pages 207–218, New York, NY, USA, 2013. ACM.

[ABW03] Arvind Arasu, Shivnath Babu, and Jennifer Widom. CQL: A language for continuous queries over streams and relations. In Database Programming Languages, 9th International Workshop, DBPL 2003, Potsdam, Germany, September 6-8, 2003, Revised Papers, pages 1–19, 2003.

[ABW06] Arvind Arasu, Shivnath Babu, and Jennifer Widom. The CQL continuous query language: Semantic foundations and query execution. The VLDB Journal, 15(2):121–142, June 2006.

[ACc⁺03a] D. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, C. Erwin, E. Galvez, M. Hatoun, A. Maskey, A. Rasin, A. Singer, M. Stonebraker, N. Tatbul, Y. Xing, R. Yan, and S. Zdonik. Aurora: A data stream management system. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD ’03, pages 666–666, New York, NY, USA, 2003. ACM.

[ACc⁺03b] Daniel J. Abadi, Don Carney, Uğur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Aurora: A new model and architecture for data stream management. The VLDB Journal, 12(2):120–139, August 2003.

[ACG⁺04] Arvind Arasu, Mitch Cherniack, Eduardo Galvez, David Maier, Anurag S. Maskey, Esther Ryvkina, Michael Stonebraker, and Richard Tibbetts. Linear road: A stream data management benchmark. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB ’04, pages 480–491. VLDB Endowment, 2004.

[ADT⁺18] Michael Armbrust, Tathagata Das, Joseph Torres, Burak Yavuz, Shixiong Zhu, Reynold Xin, Ali Ghodsi, Ion Stoica, and Matei Zaharia. Structured streaming: A declarative API for real-time applications in Apache Spark. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD ’18, pages 601–613, New York, NY, USA, 2018. ACM.

[AH00] Ron Avnur and Joseph M. Hellerstein. Eddies: Continuously adaptive query processing. SIGMOD Records, 29(2):261–272, May 2000.

[AJS⁺06] Lisa Amini, Navendu Jain, Anshul Sehgal, Jeremy Silber, and Olivier Verscheure. Adaptive control of extreme-scale stream processing systems. In 26th IEEE International Conference on Distributed Computing Systems (ICDCS 2006), 4-7 July 2006, Lisboa, Portugal, page 71, 2006.

[AN04] Ahmed M. Ayad and Jeffrey F. Naughton. Static optimization of conjunctive queries with sliding windows over infinite streams. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD ’04, pages 419–430, New York, NY, USA, 2004. ACM.

[ASFa] The Apache Software Foundation. Apache Flink project web page. https://flink.apache.org/.

[ASFb] The Apache Software Foundation. Apache Heron project web page. https://apache.github.io/incubator-heron/.

[ASFc] The Apache Software Foundation. Apache Kafka project web page. https://kafka.apache.org/.

[ASFd] The Apache Software Foundation. Apache S4 project web page. https://incubator.apache.org/projects/s4.html.

[ASFe] The Apache Software Foundation. Apache Samza project web page. https://samza.apache.org/.

[ASFf] The Apache Software Foundation. Apache Storm project web page. https://storm.apache.org/.

[ASFg] The Apache Software Foundation. Kafka Streams documentation. https://kafka.apache.org/documentation/streams/.

[ATM⁺17] Lorenzo Affetti, Riccardo Tommasini, Alessandro Margara, Gianpaolo Cugola, and Emanuele Della Valle. Defining the execution semantics of stream processing engines. Journal of Big Data, 4(1):12, Apr 2017.

[AW04] Arvind Arasu and Jennifer Widom. A denotational semantics for continuous queries over streams and relations. SIGMOD Records, 33(3):6–11, September 2004.

[BAH⁺19] Edmon Begoli, Tyler Akidau, Fabian Hueske, Julian Hyde, Kathryn Knight, and Kenneth Knowles. One SQL to rule them all - an efficient and syntactically idiomatic approach to management of streams and tables. In Peter A. Boncz, Stefan Manegold, Anastasia Ailamaki, Amol Deshpande, and Tim Kraska, editors, Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 1757–1772. ACM, 2019.

[BBC⁺04] Hari Balakrishnan, Magdalena Balazinska, Don Carney, Uğur Çetintemel, Mitch Cherniack, Christian Convey, Eddie Galvez, Jon Salz, Michael Stonebraker, Nesime Tatbul, Richard Tibbetts, and Stan Zdonik. Retrospective on Aurora. The VLDB Journal, 13(4):370–383, December 2004.

[BBD⁺02] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’02, pages 1–16, New York, NY, USA, 2002. ACM.

[BBD⁺04] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Dilys Thomas. Operator scheduling in data stream systems. The VLDB Journal, 13(4):333–353, December 2004.

[BBMD03] Brian Babcock, Shivnath Babu, Rajeev Motwani, and Mayur Datar. Chain: Operator scheduling for memory minimization in data stream systems. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD ’03, pages 253–264, New York, NY, USA, 2003. ACM.

[BBMS05] Magdalena Balazinska, Hari Balakrishnan, Samuel Madden, and Michael Stonebraker. Fault-tolerance in the Borealis distributed stream processing system. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD ’05, pages 13–24, New York, NY, USA, 2005. ACM.

[BBMS08] Magdalena Balazinska, Hari Balakrishnan, Samuel R. Madden, and Michael Stonebraker. Fault-tolerance in the Borealis distributed stream processing system. ACM Transactions on Database Systems, 33(1):3:1–3:44, March 2008.

[BBS04] Magdalena Balazinska, Hari Balakrishnan, and Michael Stonebraker. Load management and high availability in the Medusa distributed stream processing system. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD ’04, pages 929–930, New York, NY, USA, 2004. ACM.

[BCG⁺11] Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, and Rares Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE ’11, pages 1151–1162, Washington, DC, USA, 2011. IEEE Computer Society.

[BDD⁺10] Irina Botan, Roozbeh Derakhshan, Nihal Dindar, Laura Haas, Renée J. Miller, and Nesime Tatbul. SECRET: A model for analysis of the execution semantics of stream processing systems. Proceedings of the VLDB Endowment, 3(1-2):232–243, September 2010.

[BEH⁺10] Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, and Daniel Warneke. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC ’10, pages 119–130, New York, NY, USA, 2010. ACM.

[BFc12] Nathan Backman, Rodrigo Fonseca, and Uğur Çetintemel. Managing parallelism for stream processing in the cloud. In Proceedings of the 1st International Workshop on Hot Topics in Cloud Data Processing, HotCDP ’12, pages 1:1–1:5, New York, NY, USA, 2012. ACM.

[BGAH07] Roger S. Barga, Jonathan Goldstein, Mohamed H. Ali, and Mingsheng Hong. Consistent streaming through time: A vision for event stream processing. In CIDR 2007, Third Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 7-10, 2007, Online Proceedings, pages 363–374, 2007.

[BH02] Richard J. Bolton and David J. Hand. Statistical fraud detection: A review. Statistical Science, 17(3):235–249, 2002.

[BHL⁺10] Dominic Battré, Matthias Hovestadt, Björn Lohrmann, Alexander Stanik, and Daniel Warneke. Detecting bottlenecks in parallel dag-based data flow programs. In 3rd Workshop on Many-Task Computing on Grids and Supercomputers, MTAGS@SC 2010, New Orleans, Louisiana, USA, November 15, 2010, pages 1–10. IEEE Computer Society, 2010.

[BLT86] Jose A. Blakeley, Per-Ake Larson, and Frank Wm Tompa. Efficiently updating materialized views. SIGMOD Records, 15(2):61–71, June 1986.

[BROL14] Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin. Summingbird: A framework for integrating batch and online MapReduce computations. Proceedings of the VLDB Endowment, 7(13):1441–1451, August 2014.

[BW01] Shivnath Babu and Jennifer Widom. Continuous queries over data streams. SIGMOD Records, 30(3):109–120, September 2001.

[BW04] Shivnath Babu and Jennifer Widom. StreaMon: An adaptive engine for stream query processing. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD ’04, pages 931–932, New York, NY, USA, 2004. ACM.

[CBB⁺03] Mitch Cherniack, Hari Balakrishnan, Magdalena Balazinska, Donald Carney, Uğur Çetintemel, Ying Xing, and Stanley B. Zdonik. Scalable distributed stream processing. In CIDR 2003, First Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 5-8, 2003, Online Proceedings, 2003.

[CCA⁺10] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. MapReduce online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, NSDI’10, pages 21–21, Berkeley, CA, USA, 2010. USENIX Association.

[CcC⁺02] Don Carney, Uğur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Greg Seidman, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Monitoring streams: A new class of data management applications. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB ’02, pages 215–226. VLDB Endowment, 2002.

[CCD⁺03a] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel Madden, Vijayshankar Raman, Frederick Reiss, and Mehul A. Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. In CIDR 2003, First Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 5-8, 2003, Online Proceedings, 2003.

[CCD⁺03b] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel R. Madden, Fred Reiss, and Mehul A. Shah. TelegraphCQ: Continuous dataflow processing. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD ’03, pages 668–668, New York, NY, USA, 2003. ACM.

[CcR⁺03] Don Carney, Uğur Çetintemel, Alex Rasin, Stan Zdonik, Mitch Cherniack, and Mike Stonebraker. Operator scheduling in a data stream manager. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, VLDB ’03, pages 838–849. VLDB Endowment, 2003.

[CDE⁺16] Sanket Chintapalli, Derek Dagit, Bobby Evans, Reza Farivar, Thomas Graves, Mark Holderbaugh, Zhuo Liu, Kyle Nusbaum, Kishorkumar Patil, Boyang Peng, and Paul Poulosky. Benchmarking streaming computation engines: Storm, Flink and Spark streaming. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops 2016, Chicago, IL, USA, May 23-27, 2016, pages 1789–1792. IEEE Computer Society, 2016.

[CDTW00] Jianjun Chen, David J. DeWitt, Feng Tian, and Yuan Wang. NiagaraCQ: A scalable continuous query system for internet databases. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD ’00, pages 379–390, New York, NY, USA, 2000. ACM.

[CEF⁺17] Paris Carbone, Stephan Ewen, Gyula Fóra, Seif Haridi, Stefan Richter, and Kostas Tzoumas. State management in Apache Flink®: Consistent stateful distributed stream processing. Proceedings of the VLDB Endowment, 10(12):1718–1729, August 2017.

[CER17] CERN. Future ICT challenges in scientific research. http://cds.cern.ch/record/2301895/files/Whitepaper_brochure_ONLINE.pdf, 2017.

[CF02] Sirish Chandrasekaran and Michael J. Franklin. Streaming queries over streaming data. In Proceedings of the 28th International
