In distributed systems, consensus is the process of getting multiple, independent nodes (servers) to agree on a single value or sequence of values, even in the presence of failures (like crashes or network partitions). It's fundamental for building reliable systems where data needs to be consistent across replicas, such as distributed databases, configuration management, and state machine replication.
Why is Consensus Hard?
Nodes can fail (crash).
Network communication can be unreliable (messages delayed, lost, or reordered).
Achieving agreement without a central coordinator requires careful protocol design.
Raft Explained
Raft is a consensus algorithm designed to be understandable and easier to implement than its predecessor, Paxos. Its primary goal is to manage a replicated log – an ordered sequence of commands applied consistently across multiple servers.
Key Concepts
Server States: Each server is in one of three states:
Follower: Passively receives requests from leaders and candidates. Responds to RPCs. Times out and starts an election if it doesn't hear from a leader.
Candidate: State used during an election. Requests votes from other servers. Becomes leader if it gets a majority. Becomes follower if it discovers a current leader or a new term starts.
Leader: Handles all client requests. Replicates log entries to followers. Sends periodic heartbeats to maintain authority.
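These three roles map naturally to a small enumeration. A Go sketch (the names are illustrative, not taken from any particular Raft library; the later sketches in this post build on it):

```go
// ServerState enumerates the three roles a Raft server can occupy.
type ServerState int

const (
	Follower  ServerState = iota // default role; responds to leader/candidate RPCs
	Candidate                    // soliciting votes after an election timeout
	Leader                       // replicates the log and sends heartbeats
)
```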
Terms: Time is divided into terms (epochs) of arbitrary length, identified by monotonically increasing integers. Each term begins with an election. If an election completes, one leader manages the cluster for the rest of the term. If an election fails (split vote), the term ends, and a new election (and new term) begins shortly after. Terms act as a logical clock.
Replicated Log: Each server maintains a log containing a sequence of commands received from clients. Raft ensures logs across servers become consistent.
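The term, the vote record, and the log together form the state every server must write to stable storage before answering an RPC. A sketch, with field names following the Raft paper's conventions (`Command` is left as an opaque byte slice):

```go
// LogEntry is one client command, tagged with the term in which
// the leader received it.
type LogEntry struct {
	Term    int    // term when the entry was created by a leader
	Command []byte // opaque state-machine command from a client
}

// PersistentState must survive crashes: it is written to stable
// storage before responding to any RPC.
type PersistentState struct {
	CurrentTerm int        // latest term this server has seen (starts at 0)
	VotedFor    int        // candidate voted for in CurrentTerm, or -1 if none
	Log         []LogEntry // the replicated log of commands
}
```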
Remote Procedure Calls (RPCs): Servers communicate using RPCs:
RequestVote RPC: Used by candidates during elections to request votes.
AppendEntries RPC: Used by leaders to replicate log entries and as a heartbeat (empty AppendEntries) to prevent follower timeouts.
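Sketches of the two message types, using the field names from the Raft paper (the consistency-check fields appear here; how they are used is shown in the log replication section below):

```go
// RequestVoteArgs is sent by candidates to gather votes.
type RequestVoteArgs struct {
	Term         int // candidate's current term
	CandidateID  int // candidate requesting the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

type RequestVoteReply struct {
	Term        int  // receiver's current term, so the candidate can update itself
	VoteGranted bool // true means the receiver voted for the candidate
}

// AppendEntriesArgs replicates log entries; with an empty Entries
// slice it doubles as the heartbeat.
type AppendEntriesArgs struct {
	Term         int        // leader's current term
	LeaderID     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of that preceding entry (consistency check)
	Entries      []LogEntry // empty for heartbeats
	LeaderCommit int        // leader's commit index
}

type AppendEntriesReply struct {
	Term    int  // receiver's current term
	Success bool // true if the follower matched PrevLogIndex/PrevLogTerm
}
```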
Majority (Quorum): Raft relies on receiving responses from a majority of servers (> N/2 where N is cluster size) to make decisions like electing a leader or committing log entries.
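The majority test itself is a one-liner; in a 5-node cluster, 3 responses suffice, and a 4-node cluster still needs 3:

```go
// hasMajority reports whether responses (votes or acknowledgements)
// from count servers constitute a quorum in a cluster of size n.
func hasMajority(count, n int) bool {
	return count > n/2 // e.g. 3 of 5, 2 of 3; integer division handles odd and even n
}
```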
Leader Election in Raft
Timeout: Followers wait for communication (AppendEntries RPCs) from a leader. If a follower's election timeout elapses without hearing from a leader (or granting a vote), it assumes the leader has failed.
Become Candidate: The follower increments its current term, transitions to the Candidate state, votes for itself, and resets its election timer.
Request Votes: The candidate sends RequestVote RPCs to all other servers in parallel, including its current term and information about its log.
Voting: Other servers respond according to the following rules (both sides of this exchange are sketched in code after this list):
If the RPC's term is less than the receiver's current term, the vote is rejected.
If the receiver has already voted for a different candidate in the *same* term, or its own log is more "up-to-date" than the candidate's (a safety check on log completeness, simplified here), it rejects the vote.
Otherwise, the receiver grants its vote, updates its `votedFor` record for the term, and resets its *own* election timeout.
Outcome:
Wins Election: If the candidate receives votes from a majority of servers, it becomes the new Leader. It immediately starts sending heartbeats (AppendEntries) to establish authority.
Another Leader Emerges: If the candidate receives an AppendEntries RPC from another server claiming to be leader *in the current or a higher term*, the candidate recognizes the new leader and reverts to the Follower state.
Election Timeout (Split Vote): If a candidate neither wins nor discovers a new leader before its *own* election timer expires (e.g., due to a split vote where no candidate gets a majority), it increments its term and starts a *new* election. Random election timeouts help prevent perpetual split votes.
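Both sides of the election can be condensed into a sketch. It builds on the types above and assumes a few hypothetical helpers (`resetElectionTimer`, `sendRequestVote`, `sendHeartbeats`) whose implementations are omitted; locking and persistence are also left out:

```go
// Server bundles the persistent state sketched earlier with the
// volatile fields an election needs. (Illustrative, not a real API.)
type Server struct {
	PersistentState
	id    int         // this server's ID
	peers []int       // IDs of the other cluster members
	state ServerState // Follower, Candidate, or Leader
}

// lastLogTerm returns the term of the last log entry (0 for an empty log).
func (s *Server) lastLogTerm() int {
	if len(s.Log) == 0 {
		return 0
	}
	return s.Log[len(s.Log)-1].Term
}

// startElection runs when the election timeout fires: increment the
// term, vote for ourselves, and ask everyone else for their vote.
// (Shown synchronously for clarity; real implementations send the
// RPCs in parallel.)
func (s *Server) startElection() {
	s.state = Candidate
	s.CurrentTerm++
	s.VotedFor = s.id
	s.resetElectionTimer() // hypothetical helper: rearm the randomized timer

	args := RequestVoteArgs{
		Term:         s.CurrentTerm,
		CandidateID:  s.id,
		LastLogIndex: len(s.Log) - 1,
		LastLogTerm:  s.lastLogTerm(),
	}
	votes := 1 // our own vote
	for _, peer := range s.peers {
		var reply RequestVoteReply
		if s.sendRequestVote(peer, args, &reply) && reply.VoteGranted {
			votes++
		} else if reply.Term > s.CurrentTerm {
			// Someone is in a higher term: step down immediately.
			s.CurrentTerm, s.state, s.VotedFor = reply.Term, Follower, -1
			return
		}
	}
	if hasMajority(votes, len(s.peers)+1) {
		s.state = Leader
		s.sendHeartbeats() // hypothetical helper: assert authority at once
	}
	// Otherwise stay candidate; the randomized timer starts a new election.
}

// HandleRequestVote applies the receiver-side voting rules.
func (s *Server) HandleRequestVote(args RequestVoteArgs, reply *RequestVoteReply) {
	if args.Term < s.CurrentTerm {
		reply.Term, reply.VoteGranted = s.CurrentTerm, false // stale candidate
		return
	}
	if args.Term > s.CurrentTerm {
		// Newer term: adopt it and revert to follower before deciding.
		s.CurrentTerm, s.state, s.VotedFor = args.Term, Follower, -1
	}
	// Grant the vote only if we haven't voted for someone else this term
	// and the candidate's log is at least as up-to-date as ours.
	upToDate := args.LastLogTerm > s.lastLogTerm() ||
		(args.LastLogTerm == s.lastLogTerm() && args.LastLogIndex >= len(s.Log)-1)
	if (s.VotedFor == -1 || s.VotedFor == args.CandidateID) && upToDate {
		s.VotedFor = args.CandidateID
		s.resetElectionTimer() // granting a vote defers our own election
		reply.VoteGranted = true
	}
	reply.Term = s.CurrentTerm
}
```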
Log Replication (Simplified)
The Leader receives client commands and appends them as new entries to its own log.
To replicate, the leader sends AppendEntries RPCs to followers, containing the new log entries (and information about preceding entries for consistency checks, simplified here).
Followers append valid entries to their logs and respond success.
Once a log entry is replicated on a majority of servers, the leader considers the entry committed. It can then apply the command to its state machine and notify followers about the commit status (via the `leaderCommit` index in future AppendEntries).
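On the leader's side, the commit decision is a counting exercise over the replication progress it tracks per follower (the Raft paper's `matchIndex`). A sketch, again on the hypothetical `Server` type, keeping one important subtlety: a leader only directly commits entries from its own current term:

```go
// advanceCommitIndex moves the leader's commitIndex forward to the
// highest log index replicated on a majority of servers.
func (s *Server) advanceCommitIndex(matchIndex map[int]int, commitIndex int) int {
	for n := len(s.Log) - 1; n > commitIndex; n-- {
		if s.Log[n].Term != s.CurrentTerm {
			// Entries from older terms are committed indirectly, once an
			// entry from the current term commits (per the Raft paper).
			continue
		}
		count := 1 // the leader itself has the entry
		for _, m := range matchIndex {
			if m >= n {
				count++
			}
		}
		if hasMajority(count, len(s.peers)+1) {
			return n // everything up to and including n is now committed
		}
	}
	return commitIndex // nothing new to commit
}
```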
Heartbeats are simply empty AppendEntries RPCs sent periodically by the leader to prevent followers from timing out and starting unnecessary elections. Receiving *any* valid AppendEntries RPC resets a follower's election timeout.
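On the follower's side, a single handler serves both real replication and heartbeats; the only difference is whether `Entries` is empty. A simplified sketch (real implementations must avoid truncating entries that already match, which this version ignores):

```go
// HandleAppendEntries processes replication RPCs and heartbeats alike.
func (s *Server) HandleAppendEntries(args AppendEntriesArgs, reply *AppendEntriesReply) {
	if args.Term < s.CurrentTerm {
		reply.Term, reply.Success = s.CurrentTerm, false // stale leader
		return
	}
	// Any valid AppendEntries means a live leader: defer our own election.
	s.resetElectionTimer()
	s.CurrentTerm, s.state = args.Term, Follower

	// Consistency check: our log must contain the entry the new ones follow.
	if args.PrevLogIndex >= 0 &&
		(args.PrevLogIndex >= len(s.Log) || s.Log[args.PrevLogIndex].Term != args.PrevLogTerm) {
		reply.Term, reply.Success = s.CurrentTerm, false // leader backs up and retries
		return
	}
	// Drop any conflicting suffix, then append the new entries
	// (a no-op for heartbeats, where Entries is empty).
	s.Log = append(s.Log[:args.PrevLogIndex+1], args.Entries...)
	// args.LeaderCommit tells us how far we may apply to our state machine.
	reply.Term, reply.Success = s.CurrentTerm, true
}
```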
Paxos
Paxos is another famous consensus algorithm, historically significant and proven correct, but generally considered harder to understand and implement than Raft. It achieves consensus through phases involving "Prepare" and "Accept" messages, proposals, and quorums, without relying on a designated stable leader in the same way Raft does (though leader-based variants exist). Visualizing its message flow accurately is significantly more complex.
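For contrast, here is a minimal single-decree Paxos acceptor in the same style: two handlers, one per phase, each gated on the highest proposal number promised so far. The names and shape are illustrative only, not any library's API:

```go
// Acceptor holds single-decree Paxos acceptor state.
type Acceptor struct {
	promised int    // highest proposal number we have promised to honor
	accepted int    // proposal number of the value we last accepted (0 = none)
	value    []byte // the accepted value, if any
}

// Prepare (phase 1): promise to ignore proposals numbered below n,
// and report any already-accepted value so the proposer adopts it.
func (a *Acceptor) Prepare(n int) (ok bool, acceptedN int, acceptedV []byte) {
	if n <= a.promised {
		return false, 0, nil // already promised an equal or higher proposal
	}
	a.promised = n
	return true, a.accepted, a.value
}

// Accept (phase 2): accept the value unless a higher-numbered
// Prepare has been promised in the meantime.
func (a *Acceptor) Accept(n int, v []byte) bool {
	if n < a.promised {
		return false
	}
	a.promised, a.accepted, a.value = n, n, v
	return true
}
```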
Visualize Raft
Configure the cluster, fail/recover nodes, and observe leader elections and heartbeats. (Log replication is shown simply by appending entries).