The Best Way to Learn: Break It On Purpose
Reading the Raft paper is one thing. Understanding how distributed consensus actually works when networks partition,
nodes lie, and Murphy's Law is in full effect? That requires getting your hands dirty.
So I spent a weekend intentionally breaking Raft to see what would happen.
"If you want to understand how something works, try to break it. If you want to master it, try to break it in creative ways."
Experiment 1: Can We Elect Two Leaders?
Raft's safety property guarantees at most one leader per term. But what if we try really hard to break it?
I modified a Raft implementation to introduce a "malicious node" that lies about its log length during elections.
The node claims to have a longer log than it actually does, trying to win elections it shouldn't.
// Simplified Raft Election Simulation
// This demonstrates leader election with a malicious node
class RaftNode {
  constructor(id, logLength) {
    this.id = id;
    this.logLength = logLength;
    this.currentTerm = 0;
    this.votedFor = null;
    this.isMalicious = false;
  }
  // Request vote from this node
  requestVote(candidateId, candidateTerm, candidateLogLength) {
    // Reject candidates from an older term
    if (candidateTerm < this.currentTerm) {
      return false;
    }
    // Update term if candidate has higher term
    if (candidateTerm > this.currentTerm) {
      this.currentTerm = candidateTerm;
      this.votedFor = null;
    }
    // Already voted for someone else in this term
    if (this.votedFor !== null && this.votedFor !== candidateId) {
      return false;
    }
    // Candidate's log must be at least as up-to-date (simplified to length)
    if (candidateLogLength >= this.logLength) {
      this.votedFor = candidateId;
      return true;
    }
    return false;
  }
  // Try to become leader (malicious nodes lie about log length)
  runForLeader(nodes) {
    this.currentTerm++;
    this.votedFor = this.id;
    const reportedLogLength = this.isMalicious
      ? 9999             // LIE: claim to have a huge log
      : this.logLength;  // Tell the truth
    let votes = 1; // Vote for self
    nodes.forEach(node => {
      if (node.id !== this.id) {
        const granted = node.requestVote(this.id, this.currentTerm, reportedLogLength);
        if (granted) votes++;
      }
    });
    const majority = Math.floor(nodes.length / 2) + 1;
    return votes >= majority;
  }
}
// Create cluster: 5 nodes with different log lengths
const nodes = [
  new RaftNode('A', 10),
  new RaftNode('B', 8),
  new RaftNode('C', 12), // Most up-to-date
  new RaftNode('D', 7),
  new RaftNode('E', 5)   // Malicious with short log
];
nodes[4].isMalicious = true; // Node E lies about log
console.log("=== Raft Election with Malicious Node ===\n");
// Honest node C tries to become leader
console.log("Node C (log=12, honest) runs for leader:");
const cWins = nodes[2].runForLeader(nodes);
console.log(` Result: ${cWins ? 'ELECTED' : 'FAILED'}`);
// Reset votes
nodes.forEach(n => { n.votedFor = null; n.currentTerm = 0; });
// Malicious node E tries to become leader
console.log("\nNode E (log=5, MALICIOUS claims log=9999) runs for leader:");
const eWins = nodes[4].runForLeader(nodes);
console.log(` Result: ${eWins ? 'ELECTED (BAD!)' : 'FAILED'}`);
console.log("\n💡 A lying candidate can win the vote, but it can't replicate a log it doesn't have!");
console.log(" But what if we had network partitions...?");
Result: Raft held up, but not at the step I expected. The malicious node did win the election, because the other nodes
believed its inflated claim. But as soon as it tried to replicate log entries, followers rejected them because the logs
didn't match, the inconsistency surfaced, and the cluster held another election.
Lesson learned: in this experiment, safety didn't come from trusting the candidate to be honest at election time. It came
from the log consistency checks that catch the lie as soon as the fake leader has to produce real entries. (Raft isn't
designed for fully Byzantine nodes, but this particular lie is one it detects.)
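To make that rejection concrete, here is a minimal sketch of the follower-side consistency check. The appendEntries
helper and its shape are my own simplification, not part of the simulation above:
// Sketch: follower-side AppendEntries consistency check (illustrative only).
// The leader claims its log ends at prevLogIndex with term prevLogTerm; the
// follower refuses the new entries unless its own log actually agrees.
function appendEntries(followerLog, prevLogIndex, prevLogTerm, newEntries) {
  const prev = followerLog[prevLogIndex];
  // Reject if the follower has no entry at prevLogIndex, or its term disagrees
  if (prevLogIndex >= 0 && (!prev || prev.term !== prevLogTerm)) {
    return { success: false }; // leader must back up and retry with an earlier index
  }
  // Otherwise drop any conflicting suffix and append the leader's entries
  return { success: true, log: followerLog.slice(0, prevLogIndex + 1).concat(newEntries) };
}

// A follower with 3 real entries vs. a "leader" that claimed a huge log:
const followerLog = [{ term: 1 }, { term: 1 }, { term: 1 }];
console.log(appendEntries(followerLog, 9998, 1, [{ term: 2 }])); // { success: false }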
Experiment 2: Network Partitions (Split Brain)
The classic distributed systems nightmare: what happens when the network splits the cluster in half?
I simulated a 5-node cluster with nodes [A, B, C, D, E]. Then introduced a network partition that split them into:
- Partition 1: [A, B, C] (3 nodes, can form quorum)
- Partition 2: [D, E] (2 nodes, cannot form quorum)
// Network Partition Simulation
class Cluster {
  constructor(nodeIds) {
    this.nodes = nodeIds.map(id => ({
      id,
      term: 0,
      leader: false,
      canCommunicate: new Set(nodeIds) // Initially all can talk
    }));
  }
  // Introduce network partition
  partition(group1, group2) {
    console.log(`\nNETWORK PARTITION:`);
    console.log(`  Group 1: [${group1.join(', ')}]`);
    console.log(`  Group 2: [${group2.join(', ')}]`);
    this.nodes.forEach(node => {
      if (group1.includes(node.id)) {
        node.canCommunicate = new Set(group1);
      } else {
        node.canCommunicate = new Set(group2);
      }
    });
  }
  // Try to elect leader in a group
  electLeader(candidateId) {
    const candidate = this.nodes.find(n => n.id === candidateId);
    const reachableNodes = this.nodes.filter(n =>
      candidate.canCommunicate.has(n.id)
    );
    candidate.term++;
    let votes = 1; // Self-vote
    reachableNodes.forEach(node => {
      if (node.id !== candidateId) {
        // Simple voting: grant if same partition
        if (node.canCommunicate.has(candidateId)) {
          votes++;
        }
      }
    });
    const totalClusterSize = this.nodes.length;
    const majority = Math.floor(totalClusterSize / 2) + 1;
    const elected = votes >= majority;
    console.log(`\n${candidateId} election attempt:`);
    console.log(`  Votes: ${votes}/${totalClusterSize}`);
    console.log(`  Majority needed: ${majority}`);
    console.log(`  Result: ${elected ? '✅ ELECTED' : '❌ FAILED'}`);
    if (elected) {
      candidate.leader = true;
    }
    return elected;
  }
}
// Create 5-node cluster
const cluster = new Cluster(['A', 'B', 'C', 'D', 'E']);
console.log("=== Distributed Consensus Under Partition ===");
console.log("\nInitial: All nodes can communicate");
// No partition: A wins
cluster.electLeader('A');
// Introduce partition
cluster.partition(['A', 'B', 'C'], ['D', 'E']);
// Partition 1 (3 nodes) can elect leader
cluster.electLeader('B');
// Partition 2 (2 nodes) CANNOT elect leader
cluster.electLeader('D');
console.log("\n💡 Key Insight: Majority quorum prevents split-brain!");
console.log(" Minority partition can't make progress = safety preserved");
Result: Partition 1 elected a new leader and continued operating. Partition 2 could not reach quorum
and remained stuck in follower state, unable to elect a leader or accept writes.
Key insight: This is by design. Raft sacrifices availability in the minority partition to preserve
consistency. Better to have no leader than two leaders.
Experiment 3: Log Conflicts
What happens when network partitions heal and nodes have conflicting logs?
I set up a scenario where:
- Cluster operates normally, leader is A
- Network partition splits [A, B] from [C, D, E]
- Partition [C, D, E] elects new leader C in term 2
- Both partitions accept writes to their logs
- Network partition heals
Now we have divergent logs. Nodes A and B have entries from term 1. Nodes C, D, E have entries from term 2.
"When partitions heal, Raft doesn't try to merge divergent histories. It picks one truth and makes everyone agree."
The node with the higher term (C, in term 2) becomes the authority. Nodes A and B discard their uncommitted entries
from term 1 and sync with C's log. Crucially, A could never have committed those term-1 entries during the partition:
with only two of five nodes reachable, it had no quorum, so no client ever received an acknowledgment for them.
This is brutal but correct, and it is how Raft preserves linearizability: if a client got an acknowledgment, that write
reached a majority and is durable. If the client never got an ack, the write may be discarded when the logs reconcile.
The contract itself has no maybes.
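Here's a minimal sketch of that reconciliation, using made-up log contents (not output from the experiments above) to
show how A's un-acked term-1 entry gets overwritten:
// Sketch: reconciling divergent logs after the partition heals (illustrative data).
const logA = [
  { term: 1, cmd: 'x=1' }, // committed before the partition
  { term: 1, cmd: 'x=2' }  // accepted during the partition, never acknowledged
];
const logC = [
  { term: 1, cmd: 'x=1' }, // same committed prefix
  { term: 2, cmd: 'y=7' }, // committed by the majority partition
  { term: 2, cmd: 'y=8' }
];

// Find the last index where the two logs agree (same term at the same index)
function lastAgreementIndex(follower, leader) {
  let i = 0;
  while (i < follower.length && i < leader.length && follower[i].term === leader[i].term) i++;
  return i - 1;
}

// A truncates everything after the agreement point and copies C's entries
const agree = lastAgreementIndex(logA, logC);
const reconciled = logA.slice(0, agree + 1).concat(logC.slice(agree + 1));
console.log(reconciled); // A's un-acked 'x=2' is gone; C's term-2 entries win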
What Textbooks Don't Tell You
After a weekend of breaking Raft, here's what I learned that wasn't obvious from the paper:
- Raft is paranoid by design: it assumes networks are unreliable, nodes can lie, and messages can be duplicated or lost
- Safety beats liveness: Raft would rather stop making progress than violate consistency
- Term numbers are everything: they're the global logical clock that orders events across the cluster
- Log matching is the key invariant: if two logs contain an entry with the same index and term, every earlier entry must match too (a tiny check of this property is sketched below)
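For concreteness, here is a toy check of that last invariant. The logMatchingHolds helper and the sample logs are my
own illustration, not part of the experiments:
// Sketch: the Log Matching Property. If two logs contain an entry with the
// same index and term, then all entries up to and including that index match.
function logMatchingHolds(log1, log2) {
  for (let i = 0; i < Math.min(log1.length, log2.length); i++) {
    if (log1[i].term === log2[i].term) {
      // Same index + same term => the entire prefix must be identical
      for (let j = 0; j <= i; j++) {
        if (log1[j].term !== log2[j].term || log1[j].cmd !== log2[j].cmd) return false;
      }
    }
  }
  return true;
}

// Any pair of logs produced by a correct Raft cluster should pass this check
const log1 = [{ term: 1, cmd: 'a' }, { term: 2, cmd: 'b' }];
const log2 = [{ term: 1, cmd: 'a' }, { term: 3, cmd: 'c' }];
console.log(logMatchingHolds(log1, log2)); // true: prefixes agree wherever terms match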
Try Breaking It Yourself
The best way to truly understand distributed consensus is to implement it and break it. Some ideas to try:
- What happens if messages are delayed by 10 seconds?
- Can you create a livelock where no leader is ever elected?
- What if a follower's disk fails and it loses its log?
- Can a minority partition ever commit a write?
Distributed systems are hard because the failure modes are creative and surprising. The only way to build intuition
is to see them fail, repeatedly, until the patterns become clear.