Raft Linearizability

What Is Linearizability

https://img-blog.csdnimg.cn/09c11f2334a14188814fc7f6c5d9c320.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAQnJhbnppbm8=,size_20,color_FFFFFF,t_70,g_se,x_16#pic_center

Consider a single object x whose initial value is 1. Clients A, B, C, and D issue concurrent requests; the figure above shows the events in real-time order. Every request takes some time to complete; for A, the segment from "x R() A" to "x Ok(1) A" represents the time that request took. Of the four requests only B is a write, which changes the value of x, so we start from B and pin down the value of x at each moment. Since we cannot tell at which instant B's W (write) takes effect, we can only determine an interval, which suggests reasoning with lower and upper bounds. For x = 1, the interval runs from the start to the event "x W(2) B"; every read inside it must return 1. For x = 2, the interval runs from the event "x Ok() B" to the end; every read inside it must return 2. What is x between "x W(2) B" and "x Ok() B"? Either 1 or 2. This divides the timeline into three phases, with the "latest" value of each phase shown in the figure below:

https://img-blog.csdnimg.cn/22f671f21e094f9fa325d951d566798a.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAQnJhbnppbm8=,size_20,color_FFFFFF,t_70,g_se,x_16#pic_center

D, which returned last, read 1. That looks like a "stale read", but it is not; it still satisfies linearizability. D's request spans all three phases, and the read may take effect at any instant within the request, so either 1 or 2 is acceptable. By the same reasoning, A may also read 2. C is different: C satisfies linearizability only if it reads 2, because "x R() C" happens after "x Ok() B" (happens-before [3]); thus R occurs after W, and R must observe the result of the completed W, namely 2.
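
To make these rules concrete, here is a minimal, self-contained C++ sketch (not from any cited source; the timestamps are invented to mirror the figure) that encodes when a single read is forced to a particular value:

    #include <cassert>

    // Each operation occupies a real-time interval [invoke, respond].
    struct Op {
        int invoke;   // invocation time
        int respond;  // response time
    };

    // Which values may a read legally return when one write W(new_v)
    // replaces old_v?
    // - The read finishes before the write starts -> must see old_v.
    // - The read starts after the write finishes  -> must see new_v.
    // - Otherwise the two overlap                 -> either value is fine.
    bool read_may_return(const Op& read, const Op& write,
                         int old_v, int new_v, int candidate) {
        if (read.respond < write.invoke) return candidate == old_v;
        if (read.invoke > write.respond) return candidate == new_v;
        return candidate == old_v || candidate == new_v;
    }

    int main() {
        // Timeline loosely mirroring the figure: B writes 2 during [3, 6].
        Op writeB{3, 6};
        Op readA{1, 4};   // overlaps the write: 1 or 2 are both admissible
        Op readC{7, 9};   // starts after "x Ok() B": must read 2
        Op readD{2, 10};  // spans all three phases: 1 or 2 are both admissible
        assert(read_may_return(readA, writeB, 1, 2, 2));
        assert(!read_may_return(readC, writeB, 1, 2, 1));
        assert(read_may_return(readD, writeB, 1, 2, 1));
        return 0;
    }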

When to Reply to the Client

Raft does not require waiting until a cmd sent by the client has been applied to the state machine before replying; the reply may also be sent as soon as the Raft log entry containing the cmd is committed (provided the client only expects a true-or-false answer about whether the cmd will eventually be applied, rather than the output produced by the state machine when it applies the cmd). In practice, however, Raft framework implementations usually reply only after the cmd has been applied to the state machine. For example: https://github.com/baidu/braft/blob/master/example/counter/server.cpp

    void fetch_add(const FetchAddRequest* request,
                   CounterResponse* response,
                   google::protobuf::Closure* done) {
        brpc::ClosureGuard done_guard(done);
        ...
        
        butil::IOBuf log;
        butil::IOBufAsZeroCopyOutputStream wrapper(&log);
        if (!request->SerializeToZeroCopyStream(&wrapper)) {
            ...
            
        }
        // Apply this log as a braft::Task
        braft::Task task;
        task.data = &log;
        // This callback would be invoked when the task is actually executed
        // or fails
        task.done = new FetchAddClosure(this, request, response,
                                        done_guard.release());
        ...
        
        // Now the task is applied to the group, waiting for the result.
        return _node->apply(task);
    }

When the client's RPC request is received, the client gets its reply only when `google::protobuf::Closure* done` runs. Here this done is wrapped inside a FetchAddClosure, which serves as task.done. Finally the task is handed to the Raft node, which replicates it and eventually applies it.

    // @braft::StateMachine
    void on_apply(braft::Iterator& iter) {
        // A batch of tasks are committed, which must be processed through 
        // |iter|
        for (; iter.valid(); iter.next()) {
            int64_t delta_value = 0;
            CounterResponse* response = NULL;
            // This guard helps invoke iter.done()->Run() asynchronously to
            // avoid that callback blocks the StateMachine.
            braft::AsyncClosureGuard closure_guard(iter.done());
            if (iter.done()) {
                // This task is applied by this node, get value from this
                // closure to avoid additional parsing.
                FetchAddClosure* c = dynamic_cast<FetchAddClosure*>(iter.done());
                response = c->response();
                delta_value = c->request()->value();
            } else {
                ...
            }

            // Now the log has been parsed. Update this state machine by this
            // operation.
            const int64_t prev = _value.fetch_add(delta_value, 
                                                  butil::memory_order_relaxed);
            if (response) {
                response->set_success(true);
                response->set_value(prev);
            }

            ...
        }
    }

void FetchAddClosure::Run() {
    // Auto delete this after Run()
    std::unique_ptr<FetchAddClosure> self_guard(this);
    // Respond to this RPC.
    brpc::ClosureGuard done_guard(_done);
    if (status().ok()) {
        return;
    }
    // Try redirect if this request failed.
    _counter->redirect(_response);
}

Only after the state machine applies the cmd sent by the client does FetchAddClosure::Run execute, and ultimately the RPC done runs (the RPC response was filled in while the state machine applied the cmd). All of this plumbing must be written by the application itself.

When Is a Node the Leader

Raft specifies that as soon as a node wins an election, its state immediately becomes leader. Raft framework implementations follow this directly:

    bool Node::is_leader() {
        return _impl->is_leader();
    }

    bool NodeImpl::is_leader() {
        BAIDU_SCOPED_LOCK(_mutex);
        return _state == STATE_LEADER;
    }

    // On winning the election
    void NodeImpl::become_leader() {
        CHECK(_state == STATE_CANDIDATE);
        ...

        _state = STATE_LEADER;
        ...
    }

However, the newly elected leader Node2 and the old leader Node1 may differ. During Node1's term, Node1 might have been a fast machine whose state machine applied Raft log entries quickly and had already applied up to entry 9, while Node2 was slower and had only applied up to entry 4. If Node1 then crashes and Node2 becomes the new leader, a client that read from the leader during Node1's term would have observed the state after entry 9 was applied, yet a read against the brand-new leader would not observe that state. This clearly violates linearizability.

So the newly elected leader must apply at least up to the last Raft log entry the old leader applied. The new leader does not know how far the old leader applied, but that entry must already be committed; the new leader may not even know yet that it is committed (a heartbeat carrying that information may not have arrived in time). Therefore reads can only be served once every entry from previous terms has been committed and applied. Because a new leader cannot directly commit Raft log entries from terms other than its own (see Figure 8 in the paper), each new term starts with a no-op Raft log entry. Once that no-op entry has been applied, every entry from previous terms is guaranteed committed and applied, and only then may reads be served.
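
A hedged sketch of the resulting readiness check (all identifiers here are illustrative, not braft API):

    #include <cstdint>

    // The new leader appends a no-op entry at index noop_index when it takes
    // office. Reads are safe only once that entry has been applied, since
    // applying it implies every entry from previous terms has been committed
    // and applied on this node.
    class LeaderReadGate {
    public:
        void on_become_leader(int64_t noop_index) { noop_index_ = noop_index; }
        void on_applied(int64_t index) { applied_index_ = index; }
        bool can_serve_reads() const {
            return noop_index_ > 0 && applied_index_ >= noop_index_;
        }
    private:
        int64_t noop_index_ = 0;     // index of this term's no-op entry
        int64_t applied_index_ = 0;  // highest index applied locally
    };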

Leader Read

In a distributed setting, the Raft leader generally holds the most up-to-date data, so serving reads from the leader is necessary; but reading directly from the leader's state machine does not, by itself, guarantee linearizability for the system. Consider a network partition in which the old leader sits alone in its own partition: it receives no RequestVote with a higher term from the other nodes, so its leadership state is never revoked, and a read served directly from the old leader's state machine may well return a stale result. Suppose a new leader has already been elected in the other partition and has committed new entries. If two clients now read from the new and old leaders respectively, one observes the new record while the other only observes the old one, which, viewed from the whole system, violates linearizability.

The most conventional way to implement linearizable reads is to go through the Raft protocol itself: treat each read just like a log entry, obtain the result via log replication and state-machine execution, and then return it to the client. Since Raft exists precisely to provide linearizability in a distributed setting, implementing a linearizable read this way is straightforward: push the read through a Raft log entry, and when that entry is applied, read the value from the state machine; the value obtained is guaranteed to satisfy linearizability. The cost, however, is that every read pays the full Raft price: persisting the log entry (disk flushes), replicating it (network traffic), and storing a pile of read-only "read log" entries (disk space). Read performance is therefore very poor, the impact is severe when reads are frequent, and the approach is unacceptable in read-heavy systems, so it is normally not used.
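
In the style of the braft counter example above, a read through the log would look just like a write: append a task and answer from the state machine inside on_apply. This is only a sketch; GetRequest, GetResponse, and GetClosure are hypothetical, and only the braft::Task / apply pattern mirrors the real API:

    void get(const GetRequest* request, GetResponse* response,
             google::protobuf::Closure* done) {
        brpc::ClosureGuard done_guard(done);
        // Encode the read as an ordinary log entry.
        butil::IOBuf log;
        butil::IOBufAsZeroCopyOutputStream wrapper(&log);
        if (!request->SerializeToZeroCopyStream(&wrapper)) {
            return;  // in practice, fill an error into the response here
        }
        braft::Task task;
        task.data = &log;
        // When on_apply() reaches this entry, it reads the state machine and
        // fills the response, exactly as fetch_add does for writes.
        task.done = new GetClosure(this, request, response,
                                   done_guard.release());
        return _node->apply(task);
    }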

Section 6.4 of https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf:

Fortunately, it is possible to bypass the Raft log for read-only queries and still preserve linearizability. To do so, the following steps are taken (a code sketch follows the list):

  1. Find current leader in the raft group.
  2. If the leader has not yet marked an entry from its current term committed, it waits until it has done so. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. As soon as this no-op entry is committed, the leader’s commit index will be at least as large as any other servers’ during its term.
  3. The leader saves its current commit index in a local variable readIndex. This will be used as a lower bound for the version of the state that the query operates against.
  4. The leader needs to make sure it hasn’t been superseded by a newer leader of which it is unaware. It issues a new round of heartbeats and waits for their acknowledgments from a majority of the cluster. Once these acknowledgments are received, the leader knows that there could not have existed a leader for a greater term at the moment it sent the heartbeats. Thus, the readIndex was, at the time, the largest commit index ever seen by any server in the cluster.
  5. The leader waits for its state machine to advance at least as far as the readIndex; this is current enough to satisfy linearizability.
  6. Finally, the leader issues the query against its state machine and replies to the client with the results.
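
A condensed, hedged C++ sketch of this ReadIndex flow; every type and member below is hypothetical and exists only to mirror steps 2-6:

    #include <cstdint>

    struct Raft {
        void wait_until_noop_of_current_term_committed() { /* step 2: stub */ }
        int64_t commit_index() { return 0; /* step 3: stub */ }
        void confirm_leadership_via_heartbeats() { /* step 4: stub */ }
    };

    struct StateMachine {
        void wait_for_apply(int64_t /*index*/) { /* step 5: stub */ }
        int64_t query() { return 0; /* step 6: stub */ }
    };

    int64_t linearizable_read(Raft& raft, StateMachine& sm) {
        raft.wait_until_noop_of_current_term_committed(); // step 2
        const int64_t read_index = raft.commit_index();   // step 3: lower bound
        raft.confirm_leadership_via_heartbeats();         // step 4: still leader?
        sm.wait_for_apply(read_index);                    // step 5: catch up
        return sm.query();                                // step 6: local read
    }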

The basic idea of a lease is that the leader keeps renewing follower leases through heartbeats. A follower that has granted a lease promises: for as long as my lease is valid, I will not grant anyone a prevote or a vote. Once a majority of followers have renewed their follower leases, the leader can renew its own leader lease, which promises: for as long as my leader lease is valid, I am guaranteed to be the only leader. The lease-read variant then replaces the heartbeat round of step 4 with a lease validity check (a sketch of the lease bookkeeping follows the list):

  1. Find current leader in the raft group.
  2. If the leader has not yet marked an entry from its current term committed, it waits until it has done so. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. As soon as this no-op entry is committed, the leader’s commit index will be at least as large as any other servers’ during its term.
  3. The leader saves its current commit index in a local variable readIndex. This will be used as a lower bound for the version of the state that the query operates against.
  4. Check if leader lease is still valid.
  5. The leader waits for its state machine to advance at least as far as the readIndex; this is current enough to satisfy linearizability.
  6. Finally, the leader issues the query against its state machine and replies to the client with the results.
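
A hedged sketch of that lease bookkeeping; the clock handling and every identifier are illustrative assumptions:

    #include <chrono>

    class LeaderLease {
        using Clock = std::chrono::steady_clock;
    public:
        explicit LeaderLease(Clock::duration election_timeout)
            : election_timeout_(election_timeout) {}

        // Called when a heartbeat round sent at `sent_at` has been acked by a
        // majority: each acking follower refuses to grant votes for one
        // election timeout measured from when it saw the heartbeat, so
        // counting from `sent_at` is conservative. Production systems also
        // shrink the lease further by a clock-drift bound.
        void on_majority_ack(Clock::time_point sent_at) {
            lease_until_ = sent_at + election_timeout_;
        }

        // Step 4 of the lease-read variant: reads may skip the heartbeat
        // round only while the lease is still valid.
        bool valid() const { return Clock::now() < lease_until_; }

    private:
        Clock::duration election_timeout_;
        Clock::time_point lease_until_{};  // default: no lease held yet
    };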

Follower Read

When a client sends a read request to a follower, the follower asks the current leader for its read index (see steps 1, 2, and 3 above for how the leader obtains it). Once it has the leader's read index, the follower waits until it has locally applied up to that index, and only then performs the read.
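
A sketch of that flow on the follower side; every identifier is hypothetical, not braft API:

    #include <cstdint>

    struct LeaderStub {
        // RPC to the leader, which runs steps 1-3 above and returns the
        // resulting read index.
        int64_t fetch_read_index() { return 0; /* stub */ }
    };

    struct LocalReplica {
        void wait_for_apply(int64_t /*index*/) { /* block until caught up */ }
        int64_t query() { return 0; /* stub */ }
    };

    int64_t follower_read(LeaderStub& leader, LocalReplica& replica) {
        const int64_t read_index = leader.fetch_read_index();
        replica.wait_for_apply(read_index);  // local apply reaches read_index
        return replica.query();              // now safe to serve the read
    }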

Wait Free

With leader reads only (whether ReadIndex or lease read), I think that if we require a cmd sent by the client to be applied to the state machine before replying, rather than replying as soon as the log entry containing the cmd is committed, then steps 2 and 4 can be skipped. This, I believe, is the concrete meaning of the Wait Free optimization in PingCAP's blog post on linearizable reads.

With both leader read and follower read enabled, I believe the Wait Free optimization can no longer be used. Imagine a scenario where a follower applies faster than the leader. Say the leader's and a follower's logs both start as [1,2]. A client then executes a write: the leader broadcasts and commits it, and is still slowly applying it asynchronously, so the leader's log is [1,2,3] with commitIndex 3 and applyIndex 2, while the follower's log is [1,2,3] with commitIndex 3 and applyIndex 3. Now another, concurrent client issues a query that gets routed to the follower. The follower runs the ReadIndex steps above, obtains the leader's commitIndex 3, confirms that its own applyIndex >= 3, queries its state machine, and returns. The same client then issues another query, this time routed to the leader. If the leader answers in the Wait Free style, it can only query a state machine whose applyIndex is 2, and may therefore return stale data. This example shows that once follower read is enabled, the leader can no longer read directly in the Wait Free style; to preserve linearizability it must go through the read index. See also PingCAP's Follower Read blog post.

References

https://pingcap.com/zh/blog/linearizability-and-raft
https://pingcap.com/zh/blog/lease-read
https://pingcap.com/zh/blog/follower-read-the-new-features-of-tidb
https://www.sofastack.tech/blog/sofa-jraft-linear-consistent-read-implementation/
http://codefever.github.io/2019/09/17/raft-linearizable-read/
https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf
https://github.com/baidu/braft/blob/master/example/counter/server.cpp
https://tanxinyu.work/consistency-and-consensus/