Thesis Structure - Efficient and Low-Cost Fault Tolerance for Web-Scale Systems

The structure of the following chapters follows the structure of the research questions described earlier:

Chapter 1 presents the background of the problems driving this research and introduces the research problems and the contributions of this thesis.

Chapter 2 introduces the terminology used throughout the thesis and surveys the state of the art in fault-tolerant replication, with particular attention to its application to Web-scale systems.

Chapter 3 describes the Scrooge protocol.

Chapter 4 defines the fail-heterogeneous fault model and introduces the HereTrust protocol.

Chapter 5 introduces eventual linearizability, shows inherent tradeoffs in implementing it, and describes the gracefully-degrading Aurora protocol.

Chapter 6 concludes the thesis by re-evaluating the value of its conceptual and experimental contributions. It discusses the applicability of the results to different fields of distributed systems, especially at Web-application scale, and outlines the future research directions opened by the approach presented in this thesis.


Chapter 2 State of the Art and Background

Fault-tolerant replication over a message-passing distributed system is a long-established problem that spurred a large volume of research over the last decades. This chapter reviews some basic concepts of fault-tolerant replication, which are necessary to understand the contributions of this thesis.

It then gives an overview of the two specific topics treated in this thesis: Byzantine-fault tolerant replication and weakly consistent replication.


2.1 The Consensus Problem and Replication

Consensus is a fundamental problem in distributed computing. It requires a set of processes starting with possibly different initial values to eventually output a single common output value. Consensus is a paradigmatic problem in distributed coordination and has been extensively studied over the last decades.

The problem of fault-tolerant consensus over message-passing distributed systems was first introduced by Lamport, Pease and Shostak in the early eighties [PSL80; LSP82a]. Byzantine-fault tolerant consensus algorithms were initially used to implement clock synchronization in avionic systems [WLG⁺78]. In these real-time dedicated systems, it is safe to assume that the message-passing system is synchronous, that is, there exists a known upper bound on the message communication and processing delay of each process. The initial work of Lamport, Pease and Shostak also established the lower bound on the number of replicas necessary to tolerate a given number of Byzantine faults using a synchronous message-passing system. The lower bound on the time complexity, expressed in terms of the number of communication rounds, followed shortly thereafter [FL81].

Subsequent research examined the problem of consensus in different classes of systems where communication may be asynchronous or partially synchronous, but only crashes are tolerated. In the crash fault model, processes follow their specification until they stop taking any step, and messages cannot be corrupted. An early, fundamental result was the impossibility of solving consensus in asynchronous systems, where there is no upper bound on message communication and processing delays [FLP85]. A palette of different partial synchrony models representing the minimal synchrony conditions to solve consensus was proposed in [DDS87].

2.1.1 Failure Detectors

Partial synchrony can be expressed by augmenting the asynchronous system model with the abstraction of failure detectors [CT96]. Failure detectors are oracles providing information on which processes have crashed. Each process runs a failure detector module that outputs at any time a set of process indices. Failure detectors are grouped into different classes based on their completeness and accuracy. Completeness refers to the ability of a failure detector to eventually suspect all crashed processes. Accuracy requires that correct processes are not suspected. Partially synchronous systems can be modeled as systems with an eventually accurate failure detector, which can mistakenly suspect all correct processes as faulty for a finite time. Since the suspicions of these failure detectors are unreliable, consensus algorithms need to be indulgent and deal with false suspicions [Gue00].
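To make the notion of unreliable suspicions concrete, the following is a minimal sketch of a timeout-based failure detector module. It is not taken from [CT96] or from this thesis: the class name `HeartbeatFailureDetector`, the heartbeat mechanism, and the timeout-doubling rule are illustrative assumptions, and time is passed in explicitly so that the example stays deterministic.

```python
class HeartbeatFailureDetector:
    """Illustrative timeout-based failure detector module (hypothetical design).

    A peer is suspected when no heartbeat has arrived within its timeout.
    When a suspected peer is heard from again, the suspicion was false, so
    the peer is trusted again and its timeout is doubled.
    """

    def __init__(self, peers, initial_timeout=1.0):
        self.timeout = {p: initial_timeout for p in peers}
        self.last_heard = {p: 0.0 for p in peers}
        self.suspected = set()

    def on_heartbeat(self, peer, now):
        """Record a heartbeat from `peer` received at time `now`."""
        self.last_heard[peer] = now
        if peer in self.suspected:
            # False suspicion: revoke it and be more patient with this peer.
            self.suspected.discard(peer)
            self.timeout[peer] *= 2

    def check(self, now):
        """Return the set of peers suspected at time `now`."""
        for peer, heard in self.last_heard.items():
            if now - heard > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)
```

In this sketch a crashed peer stops sending heartbeats and is eventually suspected forever, which corresponds to completeness. Accuracy only holds eventually, and only if message delays are eventually bounded, because a slow peer can be suspected until its timeout has grown large enough.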

Failure detectors represent a way to express the inherent complexity of solving a distributed computing problem. A good survey on the failure detector abstraction is [FGK06]. Much work has dealt with identifying the weakest failure detectors that are necessary to solve distributed computing problems, as for example [CHT96; DGFG⁺04]. In this thesis we consider four classes of failure detectors. The class Ω is the weakest class of failure detectors to solve consensus. Failure detectors of class Ω output at most one process id at each process p_i. The process whose id is output is said to be trusted by p_i. All failure detectors of class Ω eventually let a single correct process be permanently trusted by all correct processes [CHT96]. The class of strongly complete failure detectors, which we denote C, includes all failure detectors that output a set of suspected processes and that ensure strong completeness, i.e., eventually every process that crashes is permanently suspected by every correct process [CT96]. The classes of eventually strong (resp. eventually perfect) failure detectors ♦S (resp. ♦P) include all strongly complete failure detectors having eventually weak accuracy (resp. eventually strong accuracy), i.e., eventually some correct process is (resp. all correct processes are) not suspected by any correct process [CT96].
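As a small illustration of how these classes relate, the sketch below derives an Ω-style output from the suspect set of a ♦P-style failure detector by trusting the unsuspected process with the lowest id. The function name `trusted_leader` and the lowest-id rule are assumptions made for this example, not a construction given in the text. The rule relies on ♦P: once every correct process permanently suspects exactly the crashed processes, all correct processes compute the same trusted id.

```python
def trusted_leader(process_ids, suspected):
    """Return the lowest-id process that is not suspected, or None.

    `process_ids` is the set of all process ids and `suspected` is the
    current output of a local failure detector module (assumed here to
    behave like an eventually perfect failure detector).
    """
    candidates = sorted(set(process_ids) - set(suspected))
    return candidates[0] if candidates else None
```

For instance, with processes 1, 2 and 3 and process 1 suspected, the function trusts process 2. Returning `None` when every process is suspected matches the requirement that Ω outputs at most one process id.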

2.1.2 The Paxos Protocol

Paxos is a very simple and efficient algorithm to solve consensus in the crash model using a leader oracle [Lam98; Lam01]. It identifies three roles for processes. Proposers have an initial value and propose it to become the final output value. They send their proposals only when the leader oracle indicates them as leader. Acceptors accept proposals. If enough acceptors have accepted a proposal, it is termed chosen and can be safely learned as the output value by learners. Learners establish that a proposal can be decided as output.

The communication pattern of Paxos in “good runs” where there is only one leader proposer is depicted in Figure 2.1. Before making a proposal, a leader reads from the acceptors to find out if any previously proposed value may have been learned. If such a value is found, the leader adopts it as its own initial value. In this step, acceptors promise to the leader that they will ignore all messages sent by any previous leader. In order to establish a total order between the leaders, a proposal number is associated with each message sent by the leader. Whenever a process is elected as a leader, it increases its proposal number. Proposal numbers are unique: no two different processes ever use the same proposal number.


Figure 2.1: Communication pattern of the Paxos protocol, described using the terminology of [Lam01]. For simplicity, we depict the leader process as the only learner.

In the second round, the leader sends its proposal to all acceptors. If an acceptor accepts the proposal (because it has not previously promised to ignore it), it sends an acknowledgement to the leader. If enough acceptors have accepted a proposed value, learners can decide to output it.
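The two rounds described above can be sketched as a single-process simulation. This is an illustrative sketch and not the thesis's implementation: the names `Acceptor`, `on_read`, `on_propose` and `run_leader` are hypothetical, messages are modeled as direct method calls that are never lost, and the leader adopts the previously accepted value with the highest proposal number, as in [Lam01].

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                     # highest proposal number seen so far
    accepted_n: int = -1                   # number of the last accepted proposal
    accepted_value: Optional[str] = None   # value of the last accepted proposal

    def on_read(self, n: int) -> Optional[Tuple[int, Optional[str]]]:
        """Read phase: promise to ignore leaders with lower proposal numbers."""
        if n <= self.promised:
            return None                    # stale leader, no promise
        self.promised = n
        return (self.accepted_n, self.accepted_value)

    def on_propose(self, n: int, value: str) -> bool:
        """Second round: accept unless a higher number was promised."""
        if n < self.promised:
            return False
        self.promised = n
        self.accepted_n, self.accepted_value = n, value
        return True


def run_leader(n, initial_value, acceptors, t):
    """Run both rounds for proposal number n; return the chosen value or None."""
    quorum = len(acceptors) - t            # the leader waits for n - t replies
    replies = [a.on_read(n) for a in acceptors]
    promises = [r for r in replies if r is not None]
    if len(promises) < quorum:
        return None
    prev_n, prev_value = max(promises, key=lambda r: r[0])
    # Adopt a previously accepted value if one exists (it may be chosen).
    value = prev_value if prev_n >= 0 else initial_value
    acks = sum(a.on_propose(n, value) for a in acceptors)
    return value if acks >= quorum else None
```

With three acceptors and t = 1, a first leader using proposal number 1 gets its own value chosen. A later leader using proposal number 2 finds that value in the read phase and re-proposes it, and a stale leader that reuses proposal number 1 obtains no promises.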

Paxos requires 2t + 1 processes to tolerate t crashes, which is shown to be minimal in [CT96]. The following is an informal explanation of why this number of replicas is necessary. Consensus requires that if a learner has decided a value, no other learner will decide on a different value. If at most t processes can fail by crashing, a process can wait for at most n − t messages in each round. This is a consequence of the unreliability of failure detection, which makes it impossible to determine with certainty whether the senders of the t missing messages are faulty or simply slow. A learner must thus be able to learn a value after receiving n − t acknowledgements. If a new leader is elected, it must be able to read the chosen value by contacting n − t acceptors in the read phase. This is key for safety. Having at least 2t + 1 replicas ensures that any two sets of acceptors of cardinality n − t intersect in at least one acceptor, since 2(n − t) − n = n − 2t ≥ 1; this acceptor then reports the chosen value to the new leader.
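The intersection argument can be checked by brute force for small systems. The sketch below is only an illustration, and the helper name `min_quorum_overlap` is hypothetical: it enumerates all sets of n − t acceptors and returns the size of the smallest intersection between any two of them.

```python
from itertools import combinations


def min_quorum_overlap(n, t):
    """Smallest intersection size over all pairs of (n - t)-sized acceptor sets."""
    quorums = [set(q) for q in combinations(range(n), n - t)]
    return min(len(a & b) for a in quorums for b in quorums)
```

With n = 2t + 1 the result is 1, for example for (n, t) = (3, 1) or (5, 2). With n = 2t, for example (4, 2), two disjoint sets exist and the result is 0, so a new leader could miss a chosen value.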

2.1.3 State-Machine Replication

Replicating functionalities over multiple physical devices for fault tolerance is a common technique in systems design. It is used at different layers of abstraction, from hardware design to software applications. A fundamental fault-tolerant replication technique is the state-machine approach [Sch90].

State machines model deterministic servers. They atomically execute commands issued by clients. This results in a modification of the internal state of the state machine and/or in the production of an output to a client. An execution of a state machine is completely determined by the sequence of

