Paper Note: Chain Replication
FAQ
Is chain replication used in practice over other things like Raft or Paxos?
Systems often use both.
A common way of building distributed systems is to use a configuration server (called the master in the paper) for maintaining configuration info (e.g., who is primary) and a replication system for replicating data.
Paxos/Raft are commonly used to build the configuration server, while the replication system often uses primary/backup or chain replication.
The reason to use Raft/Paxos for the configuration server is that it must handle split-brain scenarios on its own.
The reason to use primary/backup (or chain replication) for data replication is that it is simpler than Raft/Paxos, and Raft, for example, is not well suited to replicating large amounts of data. The replication system can rely on the configuration server to avoid split brain.
What are the tradeoffs of Chain Replication vs Raft or Paxos?
Both CR and Raft/Paxos are replicated state machines. They can be used to replicate any service that fits the state-machine mold (basically, one that processes a stream of requests one at a time).
CR is likely to be faster than Raft because the CR head does less work than the Raft leader: the CR head sends writes to just one replica, while the Raft leader must send all operations to all followers. CR has a performance advantage for reads as well, since it serves them from the tail (not the head), while the Raft leader must serve all client requests.
However, Raft/Paxos and CR differ significantly in their failure properties:
- Fault tolerance:
- If there are N replicas, a CR system can still recover as long as even a single replica survives.
- Raft and Paxos require a majority of the N replicas to be available in order to operate.
- Failure handling:
- A CR chain has to pause updates after even a single failure, and must wait for the configuration server to notice the failure and reconfigure the chain.
- Raft/Paxos, in contrast, can continue operating without interruption as long as a majority is available.
Wouldn't this scheme be bad if we care about latency?
Yes and No.
- Yes, the latency of update operations is proportional to the length of the chain.
- No, the latency of read operations is low: only the tail is involved. (A rough model follows below.)
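As a rough back-of-the-envelope model (the per-hop delay $d$ and chain length $n$ are illustrative assumptions, not numbers from the paper):

$$ \text{update latency} \approx (n+1)\,d \qquad (\text{client} \to \text{head} \to \cdots \to \text{tail} \to \text{client}) $$

$$ \text{query latency} \approx 2d \qquad (\text{client} \to \text{tail} \to \text{client}) $$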
Analysis
Key Features
- The approach is intended for supporting large-scale storage services that exhibit high throughput and availability without sacrificing strong consistency guarantees.
- Chain replication supports high throughput for query and update requests, high availability of data objects, and strong consistency guarantees.
Basic Structures
System View
Arguably the most important figure in the paper is this system-view diagram, which illustrates at least three properties:
- All queries go only to the tail.
- All replies are produced by the tail.
- All update requests enter at the head and travel down the chain until they reach the tail.
This structure also explains where chain replication's performance comes from:
- Compared with the usual primary/backup scheme, chain replication splits the single primary's load across two servers, the head (updates) and the tail (queries and replies), so it can be faster (see the sketch after this list).
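A minimal sketch of this flow, assuming a toy in-process chain (the `Server`, `handle_update`, and `handle_query` names are illustrative, not the paper's pseudocode):

```python
class Server:
    """One replica in the chain; `successor` is None for the tail."""
    def __init__(self, name, successor=None):
        self.name = name
        self.successor = successor
        self.store = {}      # objID -> latest value
        self.history = []    # update requests this server has processed

    def handle_update(self, obj_id, value):
        # Every server applies the update locally, then forwards it down the chain.
        self.store[obj_id] = value
        self.history.append((obj_id, value))
        if self.successor is not None:
            return self.successor.handle_update(obj_id, value)
        # Only the tail produces the reply (here it simply bubbles back up the
        # call chain; in the real protocol the tail replies to the client directly).
        return ("ok", obj_id, value)

    def handle_query(self, obj_id):
        # Queries go only to the tail, so they see exactly the updates that
        # every server in the chain has already processed.
        assert self.successor is None, "queries must be sent to the tail"
        return self.store.get(obj_id)


# Build a three-server chain H -> M -> T and exercise it.
tail = Server("T")
mid = Server("M", successor=tail)
head = Server("H", successor=mid)

print(head.handle_update("x", 1))   # updates enter at the head
print(tail.handle_query("x"))       # queries are served by the tail
```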
Protocol View
Chain replication is built on the fail-stop model:
- Each server halts in response to a failure rather than making erroneous state transitions. (Clearly a CP system.)
- A server's halted state can be detected by the environment.
The basic operation flow was already described under System View. One point worth spelling out is how the system guarantees strong consistency:
Strong consistency thus follows because query requests and update requests are all processed serially at a single server (the tail).
Fault Tolerance
The authors assume a master that never fails, with the following responsibilities:
- Detects failures of servers.
- Informs each server in the chain of its new predecessor or new successor in the new chain obtained by deleting the failed server.
- Informs clients which server is the head and which is the tail of the chain.
You can think of this master as a cluster built on Paxos/Raft.
The master handles three kinds of failure:
- failure of the head,
- failure of the tail,
- failure of an intermediate server.
The authors introduce the Update Propagation Invariant: for servers $i$ and $j$ with $i \le j$ (i.e., $i$ precedes $j$ in the chain),
$$ Hist^j_{objID} \preceq Hist^i_{objID} $$
where $Hist^i_{objID}$ denotes the sequence of update requests that $i$ has processed. The statement in the paper confused me at first, but it is actually simple: $j$'s request sequence is a prefix of $i$'s (i.e., each server's update sequence is a prefix of its predecessor's).
If this invariant holds throughout the system, the update sequence processed by the head and the one processed by the tail agree in order, which is what keeps the whole system linearizable.
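To make the prefix relation concrete, here is a tiny, hypothetical check (not from the paper):

```python
def update_propagation_holds(hist_i, hist_j):
    """True iff hist_j (a server closer to the tail) is a prefix of hist_i
    (one of its predecessors), i.e. Hist^j <= Hist^i."""
    return hist_j == hist_i[:len(hist_j)]

# j may lag behind its predecessor i, but only by a suffix.
assert update_propagation_holds(["u1", "u2", "u3", "u4"], ["u1", "u2"])

# A history that diverges (rather than merely lags) violates the invariant.
assert not update_propagation_holds(["u1", "u2", "u3", "u4"], ["u1", "u3"])
```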
Failure of the Head
This case is handled by the master removing H from the chain and making the successor to H the new head of the chain.
Removing the head changes $Pending_{objID}$ ($Pending_{objID}$ is the set of update requests that have been received by some server but not yet processed by the tail). Removing a request from $Pending_{objID}$ matches transition T2 in Figure 1, so deleting the head is consistent with the specification in Figure 1.
Failure of the Tail
This case is handled by removing tail T from the chain and making predecessor T− of T the new tail of the chain.
This corresponds to transition T3 in Figure 1.
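A hedged sketch of these two cases from the master's point of view, assuming the master keeps the chain as an ordered list and can notify servers and clients (the names are illustrative):

```python
class Master:
    """Chain configuration service; assumed never to fail (in practice a
    Paxos/Raft-replicated cluster)."""
    def __init__(self, chain):
        self.chain = list(chain)            # ordered: head ... tail

    def on_head_failure(self):
        # Drop H; its successor becomes the new head. This removes some
        # requests from Pending, matching transition T2 of Figure 1.
        self.chain.pop(0)
        self._announce()

    def on_tail_failure(self):
        # Drop T; its predecessor T- becomes the new tail, matching
        # transition T3 of Figure 1.
        self.chain.pop()
        self._announce()

    def _announce(self):
        # Tell clients (and the affected servers) about the new configuration.
        print(f"head={self.chain[0]}, tail={self.chain[-1]}")


master = Master(["H", "S1", "S2", "T"])
master.on_head_failure()   # chain becomes S1 -> S2 -> T
master.on_tail_failure()   # chain becomes S1 -> S2
```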
Failure of Other Servers
Failure of a server S internal to the chain is handled by deleting S from the chain. The master first informs S’s successor S+ of the new chain configuration and then informs S’s predecessor S−.
Note that this case can violate the Update Propagation Invariant described above and thereby break the system's linearizability.
Here is how that can happen:
Suppose the chain is H -> S1 -> S2 -> S3 -> T. The key point is that when S2 fails, S1 has no idea what S2 did or did not forward to S3. Say S1 has sent 5 requests to S2, but S2 has only forwarded 3 of them to S3. Without any record of what has been sent, S1 would simply send the new request 6 straight to S3, ending up with:
- S1: [1, 2, 3, 4, 5, 6]
- S3: [1, 2, 3, 6]
- S3's history is no longer a prefix of S1's.
To solve this, the authors have each server $i$ maintain an additional sequence $Sent_i$: the update requests that $i$ has forwarded to its successor but that have not yet been processed by the tail. The rules for maintaining it are straightforward (see the sketch after this list):
- Whenever $i$ forwards a request to its successor, the request is appended to $Sent_i$.
- Whenever the tail processes a request $r$, it sends $ack(r)$ to its predecessor; on receiving $ack(r)$, a server removes $r$ from its $Sent_i$ (and passes the ack further up the chain).
Returning to the earlier example: requests 4 and 5 have not reached the tail, so they are still in $Sent_{S1}$, which means S1 knows exactly which requests a new successor might be missing.
So we can state a general rule: for servers $i$ and $j$ with $i \le j$, $i$'s update sequence equals $j$'s update sequence extended with the requests in $Sent_i$; the paper calls this the Inprocess Requests Invariant ($Hist^i_{objID} = Hist^j_{objID} \oplus Sent_i$).
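A minimal sketch of this bookkeeping (illustrative names, not the paper's pseudocode):

```python
class ChainServer:
    """Per-server bookkeeping for one server i in the chain."""
    def __init__(self):
        self.history = []   # Hist_i: update requests i has processed
        self.sent = []      # Sent_i: forwarded to the successor, not yet acked

    def process_and_forward(self, request):
        # Rule 1: every request forwarded to the successor joins Sent_i.
        self.history.append(request)
        self.sent.append(request)

    def on_ack_from_tail(self, request):
        # Rule 2: once the tail has processed the request, ack(r) flows back
        # up the chain and r is removed from Sent_i.
        self.sent.remove(request)


# S1 forwards five requests; the tail has acknowledged the first three.
s1 = ChainServer()
for r in ["u1", "u2", "u3", "u4", "u5"]:
    s1.process_and_forward(r)
for r in ["u1", "u2", "u3"]:
    s1.on_ack_from_tail(r)
print(s1.sent)   # ['u4', 'u5']: exactly what a new successor may be missing
```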
This invariant tells us exactly how to preserve the Update Propagation Invariant during reconfiguration:
- When S− learns that S+ is its new successor, it first forwards the requests in $Sent_{S^-}$ to S+.
The whole reconfiguration exchange then proceeds as follows (other details are omitted here; a sketch follows the list):
- Message 1 informs S+ of its new role;
- Message 2 acknowledges and tells the master the sequence number $sn$ of the last update request S+ has received;
- Message 3 informs S− of its new role and of $sn$ so S− can compute the suffix of $Sent_{S^-}$ to send to S+;
- Message 4 carries that suffix.
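A hedged sketch of this four-message exchange, using a toy in-process model (the `Update`/`Replica` names and fields are assumptions for illustration, not the paper's code):

```python
from dataclasses import dataclass, field

@dataclass
class Update:
    sn: int         # sequence number assigned by the head
    payload: str

@dataclass
class Replica:
    name: str
    history: list = field(default_factory=list)   # updates received, in order
    sent: list = field(default_factory=list)      # forwarded but not yet acked
    successor: "Replica" = None

    def last_received_sn(self):
        return self.history[-1].sn if self.history else 0

    def receive(self, update):
        self.history.append(update)


def handle_internal_failure(s_minus, s_plus):
    """Master-driven reconfiguration after an internal server fails."""
    # Message 1: the master informs S+ of its new role (implicit here).
    # Message 2: S+ acks and reports sn, the last update it has received.
    sn = s_plus.last_received_sn()
    # Message 3: the master informs S- of its new successor and of sn, so S-
    #            can compute the missing suffix of Sent_{S-}.
    suffix = [u for u in s_minus.sent if u.sn > sn]
    # Message 4: S- sends that suffix to S+, restoring the Update
    #            Propagation Invariant before normal forwarding resumes.
    for u in suffix:
        s_plus.receive(u)
    s_minus.successor = s_plus


# The earlier scenario: S1 forwarded u1..u5, but S3 has only seen u1..u3.
s1, s3 = Replica("S1"), Replica("S3")
for i in range(1, 6):
    u = Update(i, f"u{i}")
    s1.history.append(u)
    s1.sent.append(u)
for u in s1.sent[:3]:
    s3.receive(u)

handle_internal_failure(s1, s3)
print([u.sn for u in s3.history])   # [1, 2, 3, 4, 5]: prefix property restored
```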
Extending a Chain
A new server $T^+$ is added at the end of the chain. It does not serve queries until it has caught up: $Hist^T_{objID}$ is copied in full to $Hist^{T^+}_{objID}$. Because this copy can take a long time, the old tail $T$ keeps performing its previous role in the meantime to preserve availability, while every update that arrives during the copy is also added to $Sent_T$, until the bulk copy completes.
After that, $T$ drops subsequent query requests or forwards them to the new tail $T^+$, and sends the requests accumulated in $Sent_T$ to $T^+$. The master then tells clients about the new tail so that they contact $T^+$ directly.
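A minimal sketch of adding a new tail, reusing the illustrative `Replica`/`Update` classes from the sketch above (again an assumption, not the paper's code):

```python
def extend_chain(old_tail, new_tail):
    """Add T+ after the current tail T."""
    # 1. Bulk-copy a snapshot of Hist_T to T+. This may take a while; T keeps
    #    serving queries meanwhile, and updates that arrive during the copy
    #    are kept in Sent_T.
    new_tail.history = list(old_tail.history)

    # 2. Once the copy finishes, forward the requests buffered in Sent_T.
    for update in old_tail.sent:
        new_tail.receive(update)

    # 3. T stops answering queries (forwarding or dropping them instead),
    #    becomes an ordinary internal server, and the master points clients at T+.
    old_tail.successor = new_tail


t, t_plus = Replica("T"), Replica("T+")
t.history = [Update(1, "u1"), Update(2, "u2")]     # snapshot at copy start
t.sent = [Update(3, "u3")]                         # arrived during the copy
extend_chain(t, t_plus)
print([u.sn for u in t_plus.history])   # [1, 2, 3]
```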
Primary/Backup Protocols
In the primary/backup approach, one server, designated the primary, does the following (a sketch follows this list):
- imposes a sequencing on client requests (and thereby ensures strong consistency holds).
- distributes (in sequence) to other servers, known as backups, the client requests or resulting updates.
- awaits acknowledgements from all non-faulty backups.
- after receiving those acknowledgements then sends a reply to the client.
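For contrast, here is a minimal sketch of that primary loop, with hypothetical `Backup`/`apply_and_ack` helpers (parallel dissemination is modeled as a simple loop):

```python
class Backup:
    """A backup replica that applies a request and acknowledges it."""
    def __init__(self):
        self.log = []

    def apply_and_ack(self, request):
        self.log.append(request)
        return True


def primary_handle(request, backups, apply_locally):
    """Primary sequences the request, distributes it to every backup,
    waits for all (non-faulty) backups to ack, then replies to the client."""
    result = apply_locally(request)                       # 1. sequence + apply
    acks = [b.apply_and_ack(request) for b in backups]    # 2. distribute (in parallel in practice)
    assert all(acks)                                      # 3. await all acknowledgements
    return result                                         # 4. reply to the client


backups = [Backup(), Backup()]
print(primary_handle("set x=1", backups, apply_locally=lambda r: "ok"))
```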
With chain replication, the primary’s role in sequencing requests is shared by two replicas.
- The head sequences update requests.
- The tail extends that sequence by interleaving query requests.
One drawback of chain replication is higher update latency, which grows linearly with the length of the chain.
Chain replication does this dissemination serially, resulting in higher latency than the primary/backup approach, where requests are distributed to backups in parallel.
For failure handling, however, chain replication needs less time to recover; the paper analyzes this in detail, so we will not repeat it here.
Summary
Chain replication is a relatively simple protocol that delivers high throughput, high availability, and strong consistency; it is a CP system, and the tradeoff it makes is higher latency.
CR achieves linearizability across replicas with a very simple design, but it cannot handle split brain or network partitions by itself, so it still needs a separate highly available configuration service (the master) in order to provide a highly available service overall.