Sections
Section · Overview
Designing a key-value store
A key-value store is the simplest non-relational database: every record is a unique key bound to an opaque value. The API is two calls — put(key, value) and get(key) — and yet what sits behind that surface, when it has to scale, survive failures, and stay fast, is one of the densest chapters in distributed systems.
two calls.
that's the surface.
Behind the surface
- · keys are unique
- · values are opaque
- · no schema, no joins
- · must scale horizontally
- · must survive node loss
This module walks the design space the book opens up. Each section keeps a single idea in view; the two Live Labs at the end let you feel the consequences of the decisions interactively.
The shape of the chapter
- 01Single-server baseline — hash table + tiered storage
- 02CAP theorem — pick two when the network breaks
- 03Data partition — reuse the Ch5 hash ring
- 04Data replication — N copies, walked clockwise
- 05Consistency & quorum — the N · W · R math
- 06Vector clocks — versioning & conflict resolution
- 07Failure handling — gossip · sloppy quorum · anti-entropy
- 08System architecture — every node, every responsibility
- 09Storage engine — LSM write path, Bloom read path
- 10Wrap-up — Dynamo, Cassandra, BigTable mapped
Seven non-negotiable design goals
Every decision in the rest of the chapter trades one of these goals against another. Holding all seven simultaneously is impossible — CAP makes sure of it — so the design picks a point in the space and owns the consequences.
Small KV size
values fit < 10 KB
Big data capability
partition across nodes
High availability
replicate · sloppy quorum
High scalability
consistent hashing
Automatic scaling
add/remove on the fly
Tunable consistency
N · W · R per request
Low latency
in-memory + LSM
All seven
at once is
impossible.
→ CAP
Two Live Labs in this same module
After the reference walkthrough, the Quorum Playground lets you slide N, W, R and watch consistency emerge or break in real time; the Merkle Tree Sync lab shows two diverged replicas reconciling without shipping their full keyspaces.
Section · Baseline
Single-server baseline
The simplest key-value store you can build is an in-memory hash table with two methods: put(key, value) and get(key). On one machine, both are O(1) and life is good — until the dataset doesn't fit in RAM and the machine doesn't process the request volume.
Two tiers, two cost profiles
The classic single-box optimisation is the one above: hold a hot working set in memory and spill the colder tail to disk. That buys you more capacity than your RAM allows, but at the cost of latency for any key that gets evicted. Hits the hot tier? Microseconds. Misses to disk? Milliseconds — orders of magnitude slower.
Where this design breaks
- 01Capacity ceiling. Even with disk spill, a single host caps out at terabytes; modern workloads ask for petabytes.
- 02Throughput ceiling. One CPU, one NIC, one memory bus. Read amplification on disk-resident keys magnifies every miss.
- 03No fault tolerance. The host is the system. When it goes, the keyspace goes with it.
The rest of the chapter is the story of fixing all three — partitioning the keyspace, replicating each partition, and tolerating the failures that come with the territory.
Section · CAP theorem
Pick two — the third is fact of life
Brewer's CAP theorem says a distributed system can give you at most two of consistency, availability, and partition tolerance at any moment. The trap is treating it as a free menu. In the real world the network will partition — cables get cut, switches fail, routes flap — so partition tolerance is not optional. The only choice you actually get to make is between consistency and availability, and you make it every time a request lands during a split.
What CA buys you (and what it costs)
The CA region in the Venn diagram is real, just narrow: it covers systems that assume the network never partitions. A single-node SQL database fits there comfortably. Anything spread across two or more machines does not — the moment a cable goes, you're forced to choose between C and A on every in-flight request.
Make the choice concrete
The demo below puts three nodes on a triangle. Severing the link between two replicas forces the system to decide: an AP store keeps accepting writes everywhere and reconciles when the partition heals; a CPstore refuses anything it can't confirm with a quorum. Toggle the partition and the mode, and watch the outcome ribbon change.
Which one does this chapter build?
The rest of the module sketches an AP-leaning store in the Dynamo/Cassandra family: replication walks the ring, the quorum knobs (N · W · R) let callers dial consistency per request, vector clocks reconcile divergent writes, and Merkle anti-entropy heals long-term drift between replicas. The Live Labs let you feel both knobs interactively.
Section · Data partition
Spread the data across nodes
The single-server design caps out at one machine's RAM and one machine's throughput. The fix is to split the keyspace across many servers — each one responsible for a well-defined slice — and to make the assignment scheme robust to the cluster growing and shrinking.
Two goals for any partition scheme
- 01Balanced load. Each node should hold roughly the same number of keys — no hotspots that bottleneck on one machine.
- 02Minimal churn on change. Adding or removing a node should move only the keys that strictly need to move, not the whole keyspace.
The scheme you already know — Chapter 5
Consistent hashing nails both goals. Place servers and keys on a ring, each key belongs to the next server clockwise, and adding a server moves only the keys in one arc. With virtual nodes, distribution smooths out further. Rather than re-derive it here, jump back to the ring walkthrough →
Fig 6-4 · the same ring · same key placement rule
With the ring in place, the next section asks a sharper question: how do we survive losing the server a key is assigned to? The answer is to put the key on more than one server, walked clockwise — replication.
Section · Data replication
N copies, walked clockwise
Partitioning solves capacity; replication solves durability and availability. Every key lives on N servers instead of one — typically three. The recipe is simple: once a key has been placed on its primary server (the next one clockwise on the ring), walk to the next N−1 distinct physical hosts clockwise and copy the key onto each of them.
Why "distinct physical hosts"
Virtual nodes are great for distribution, but they mean a single machine can occupy several positions on the ring. If two of those positions happen to be adjacent, the naïve "next clockwise" walk would put two replicas on the same physical host — defeating the purpose. The walk therefore skips vnodes that map back to a host already in the replica set.
Placement, in practice
- →A single replica per rack helps if a power supply or a top-of-rack switch fails.
- →A single replica per availability zone or data centre helps when an entire failure domain is lost.
- →The replica set is the unit that quorum, vector clocks, and Merkle anti-entropy all operate on — every following section is a deeper look at one of those mechanisms.
Section · Consistency & quorum
The N · W · R math
Once a key has N copies, the system has a choice on every request: how many copies does it consult before answering? The answer is the same three letters Dynamo introduced: N is the replica count, W is how many replicas must acknowledge a write, and R is how many must respond to a read.
The one inequality to remember
When W + R > N, any read intersects with any write on at least one replica. The freshest version is guaranteed to be in the read set — that's strong consistency. When W + R ≤ N, a read can miss the latest write entirely; that's the eventual- consistency regime, faster but stale-prone.
Common operating points
- 01R = 1, W = N. Fast reads, slow writes — every write must reach every replica before it can be acknowledged.
- 02W = 1, R = N. Fast writes, slow reads — the mirror image.
- 03W + R > N. Balanced, strongly consistent. The Dynamo default of N=3, W=2, R=2 lives here — it tolerates one replica being down while still meeting both quorums.
- 04W + R ≤ N. Low latency, weak consistency — used per-request when staleness is acceptable.
Make it concrete — Live Lab
The Quorum Playground at the end of this module runs the same math against a real Celery-backed simulation. Slide N, W, R, optionally inject a partition, and watch the consistency rate and latency curve emerge over hundreds of operations.
Section · Vector clocks
Versioning & conflict resolution
An eventually-consistent store accepts writes during partitions, which means two replicas can end up with versions that neither descends from the other. The system needs a way to detect this — to distinguish “Y happened after X” from “X and Y happened in parallel, neither aware of the other.” That mechanism is the vector clock.
The clock, in one paragraph
A vector clock attaches a per-replica counter to each version: writing on replica Sa bumps Sa's counter, writing on Sb bumps Sb's. Comparison is componentwise: if every counter in clock A is ≤ the same counter in clock B, then A happens-before B. If neither dominates, the two clocks are concurrent — no causal relationship — and the application must decide which version wins (or merge them).
How the chapter's scenario plays out
- Sa accepts a write. Its clock advances by one in Sa's slot.
- Sa replicates the value to Sb. Sb merges the clock and now matches.
- Sb writes a new value. The clock now has both Sa and Sb counters, strictly descending the previous version — linear history.
- A partition isolates Sc. Sa writes again; Sc independently writes a different value. Both clocks legitimately bump their own counter but stay zero in the other.
- When the partition heals, the merge step finds the two clocks incomparable — neither happens-before the other — and flags the pair as a conflict for the caller to reconcile.
A practical note
Vector clocks have one painful property: the vector lengthens with every replica that ever touched a key. Production systems cap them — either by truncating old entries or by pinning the clock dimensionality — and recover from rare edge cases via the anti-entropy mechanism in the next section.
Section · Failure handling
Detect, tolerate, repair
In a cluster of any reasonable size, something is always broken. The system handles that on three timescales: detection spreads the news that a node is gone, temporary tolerance keeps writes succeeding while it's down, and repair reconciles whatever drift accumulated. Each gets its own mechanism and its own animation below.
01 · Detect — gossip
A centralised “is alive?” ping would itself be a single point of failure, so production systems instead use a decentralised gossip protocol. Each node periodically picks a few random peers and exchanges its view of cluster membership — including how long it's been since it last heard from every other node. A high silence count above a threshold means the node is considered offline. Verdicts converge across the cluster within a handful of rounds.
02 · Tolerate — sloppy quorum & hinted handoff
Strict quorum says “refuse if you can't reach W replicas.” That's correct but it sacrifices availability. A sloppy quorumrelaxes the rule: still require W acks, but accept them from any W live replicas — not necessarily the “true” ones for that key. To keep durability, the substitute writes get tagged with a hint identifying their rightful destination. When that destination returns, the hint replays.
03 · Repair — anti-entropy via Merkle trees
Gossip and handoff handle short outages. For longer divergence — a replica that was down for hours, or a missed write that slipped past — the system needs a way to compare entire keyspaces and exchange only what differs. The trick is a Merkle tree: a binary tree of hashes built over the keys. Compare roots; if they match, the replicas are in sync and we're done. If they don't, recurse into the children, isolating divergence to a small subtree and a small wire payload.
Make it concrete — Merkle Sync Lab
The Merkle Sync Live Lab builds full trees on two simulated replicas, walks the comparison level by level on a real backend run, and reports the bytes-on-the-wire savings vs a brute-force full keyspace exchange.
Section · System architecture
Every node, every responsibility
Stitching the previous sections together gives you a peer- to-peer cluster: a ring of identical nodes, each holding a slice of the keyspace and N copies of every key. There's no leader, no metadata service, no special node — any peer can serve any client request, becoming the coordinator for that op's lifetime.
Why peer-to-peer
A leader-based design (one master, many followers) is simpler to reason about, but the leader is both a scaling ceiling and a fault domain. Dynamo's choice — and Cassandra's inheritance — is to push every responsibility down to every node. The cluster scales by adding peers; it tolerates failures because no one peer is irreplaceable.
The six responsibilities, in order of a request's life
- Client API — the receiving peer becomes coordinator and routes the request.
- Failure detection — gossip determines which replicas are reachable right now.
- Replication — the coordinator fans out the operation to the live replicas for that key.
- Conflict resolution — vector clocks on each version, application-level merge when needed.
- Anti-entropy repair — Merkle-tree exchanges run in the background to heal long-term drift.
- Storage engine — the local LSM-tree turns acknowledged writes into durable SSTables.
Section · Storage engine
Write path · read path
The distributed scaffold is in place; what runs inside each node? Bigtable, Cassandra, and ScyllaDB all converge on the same shape: a log-structured merge tree. The engine optimises for fast writes — every put becomes an append, never a random update — and uses careful background work to keep reads cheap.
Write path
- Commit log. The write is appended to a write-ahead log on disk and fsynced. That makes it durable — if the process dies before anything else happens, recovery replays the log.
- Memtable. The write is then placed into an in-memory sorted map keyed by its key. The memtable is fast (no disk seeks) and keeps the keyspace sorted so future flushes are cheap.
- SSTable flush. When the memtable hits its size threshold, the engine freezes it and writes a sorted-string table (SSTable) to disk. SSTables are immutable; updates land in fresh ones.
- Compaction (background) — older SSTables are merged into newer levels, dropping superseded versions and deleted keys.
Read path
- Memtable check. The most recent writes are still in RAM, so the engine checks the memtable first. A hit returns immediately.
- Bloom filter. Each SSTable has a probabilistic Bloom filter. If the filter says “definitely not present” the engine skips that file entirely — no disk read at all. If it says “maybe present” (false-positive possible), the engine continues.
- SSTable scan. Within each candidate SSTable, a small in-memory index points to the exact disk offset for the key's sorted block; binary search finishes the lookup.
Why this shape wins
Disks (and even SSDs) are dramatically faster at sequential writes than random ones. The LSM tree turns every write into a sequential append at the commit log, batches them in the sorted memtable, and writes the eventual SSTable as one sequential pour. The complexity moves to the read path, where Bloom filters keep most reads from ever touching disk for a key that doesn't exist.
Section · Wrap-up
Production systems & summary
The Dynamo paper landed in 2007 and gave the industry both a working AP design and the vocabulary to describe it. The systems below are the obvious descendants — different in the details, identical in the bones.
Amazon Dynamo
2007The paper that started the lineage. AP-leaning, peer-to-peer, vector clocks, sloppy quorum, anti-entropy via Merkle trees — every mechanism this chapter walks through traces back here.
↗ referenceApache Cassandra
2008Took Dynamo's distribution model and bolted on BigTable's column-family data model + storage engine. CL=ONE/QUORUM/ALL exposes the W·R math directly to applications.
↗ referenceGoogle Bigtable
2006Different lineage (single-master, strong consistency), but the LSM-tree storage engine in this section is largely Bigtable's invention. Influenced every subsequent KV store's local design.
↗ referenceThe chapter in one table
Every section solved a goal. Below: the ten goals this chapter held itself accountable to, paired with the technique that earns each one and a jump back to the section that derives it.
Make all of this concrete
The two Live Labs at the end of this module run real backend simulations against the math we just walked through — the Quorum Playground for N·W·R, and the Merkle Tree Sync for anti-entropy efficiency.
Live Lab · Quorum Playground
Watch consistency emerge from N · W · R
Pick a replica count and your quorum thresholds, optionally isolate part of the cluster behind a partition, and watch the simulator stream one operation at a time over SSE. The acceptance rate chart and per-op feed are driven by a real Celery task; nothing about this is faked client-side.
What you're looking at
- →Each operation draws a lognormal-ish latency per replica and reports the W-th (or R-th) ack as the completion time. The model is intentionally simple — it's the shape, not the precise numbers, that matters.
- →If you isolate enough replicas via the partition slider, the cluster can't form a quorum and writes get refused outright. The acceptance-rate curve drops to zero.
- →The capacity badge in the corner shows live concurrent slots — capped at three across all users so the worker pool stays calm.
Live Lab · Merkle Tree Sync
Reconcile two replicas without shipping the world
Two replicas hold the same keyspace, but a handful of keys have diverged. Each side builds a Merkle tree on its local data; the lab walks both trees in lockstep, short-circuiting on every matched subtree, and reports the bytes-on-the-wire it took to find the differences. For a few diffs in a thousand keys, the savings are dramatic.
What you're looking at
- →The backend builds two real SHA-256 trees, then compares them top-down. Each level's visited/matched/diverged counts stream over SSE.
- →The final byte tally weighs hash exchanges + leaf-level diffs against a baseline that ships every key and value. With M ≪ K the savings ratio gets close to 100%.
- →Crank divergence up to K to see the worst case: the whole tree disagrees, and the algorithm pays for the hash exchanges on top of shipping all the data anyway.
Sections
Section · Overview
Designing a key-value store
A key-value store is the simplest non-relational database: every record is a unique key bound to an opaque value. The API is two calls — put(key, value) and get(key) — and yet what sits behind that surface, when it has to scale, survive failures, and stay fast, is one of the densest chapters in distributed systems.
two calls.
that's the surface.
Behind the surface
- · keys are unique
- · values are opaque
- · no schema, no joins
- · must scale horizontally
- · must survive node loss
This module walks the design space the book opens up. Each section keeps a single idea in view; the two Live Labs at the end let you feel the consequences of the decisions interactively.
The shape of the chapter
- 01Single-server baseline — hash table + tiered storage
- 02CAP theorem — pick two when the network breaks
- 03Data partition — reuse the Ch5 hash ring
- 04Data replication — N copies, walked clockwise
- 05Consistency & quorum — the N · W · R math
- 06Vector clocks — versioning & conflict resolution
- 07Failure handling — gossip · sloppy quorum · anti-entropy
- 08System architecture — every node, every responsibility
- 09Storage engine — LSM write path, Bloom read path
- 10Wrap-up — Dynamo, Cassandra, BigTable mapped
Seven non-negotiable design goals
Every decision in the rest of the chapter trades one of these goals against another. Holding all seven simultaneously is impossible — CAP makes sure of it — so the design picks a point in the space and owns the consequences.
Small KV size
values fit < 10 KB
Big data capability
partition across nodes
High availability
replicate · sloppy quorum
High scalability
consistent hashing
Automatic scaling
add/remove on the fly
Tunable consistency
N · W · R per request
Low latency
in-memory + LSM
All seven
at once is
impossible.
→ CAP
Two Live Labs in this same module
After the reference walkthrough, the Quorum Playground lets you slide N, W, R and watch consistency emerge or break in real time; the Merkle Tree Sync lab shows two diverged replicas reconciling without shipping their full keyspaces.
Section · Baseline
Single-server baseline
The simplest key-value store you can build is an in-memory hash table with two methods: put(key, value) and get(key). On one machine, both are O(1) and life is good — until the dataset doesn't fit in RAM and the machine doesn't process the request volume.
Two tiers, two cost profiles
The classic single-box optimisation is the one above: hold a hot working set in memory and spill the colder tail to disk. That buys you more capacity than your RAM allows, but at the cost of latency for any key that gets evicted. Hits the hot tier? Microseconds. Misses to disk? Milliseconds — orders of magnitude slower.
Where this design breaks
- 01Capacity ceiling. Even with disk spill, a single host caps out at terabytes; modern workloads ask for petabytes.
- 02Throughput ceiling. One CPU, one NIC, one memory bus. Read amplification on disk-resident keys magnifies every miss.
- 03No fault tolerance. The host is the system. When it goes, the keyspace goes with it.
The rest of the chapter is the story of fixing all three — partitioning the keyspace, replicating each partition, and tolerating the failures that come with the territory.
Section · CAP theorem
Pick two — the third is fact of life
Brewer's CAP theorem says a distributed system can give you at most two of consistency, availability, and partition tolerance at any moment. The trap is treating it as a free menu. In the real world the network will partition — cables get cut, switches fail, routes flap — so partition tolerance is not optional. The only choice you actually get to make is between consistency and availability, and you make it every time a request lands during a split.
What CA buys you (and what it costs)
The CA region in the Venn diagram is real, just narrow: it covers systems that assume the network never partitions. A single-node SQL database fits there comfortably. Anything spread across two or more machines does not — the moment a cable goes, you're forced to choose between C and A on every in-flight request.
Make the choice concrete
The demo below puts three nodes on a triangle. Severing the link between two replicas forces the system to decide: an AP store keeps accepting writes everywhere and reconciles when the partition heals; a CPstore refuses anything it can't confirm with a quorum. Toggle the partition and the mode, and watch the outcome ribbon change.
Which one does this chapter build?
The rest of the module sketches an AP-leaning store in the Dynamo/Cassandra family: replication walks the ring, the quorum knobs (N · W · R) let callers dial consistency per request, vector clocks reconcile divergent writes, and Merkle anti-entropy heals long-term drift between replicas. The Live Labs let you feel both knobs interactively.
Section · Data partition
Spread the data across nodes
The single-server design caps out at one machine's RAM and one machine's throughput. The fix is to split the keyspace across many servers — each one responsible for a well-defined slice — and to make the assignment scheme robust to the cluster growing and shrinking.
Two goals for any partition scheme
- 01Balanced load. Each node should hold roughly the same number of keys — no hotspots that bottleneck on one machine.
- 02Minimal churn on change. Adding or removing a node should move only the keys that strictly need to move, not the whole keyspace.
The scheme you already know — Chapter 5
Consistent hashing nails both goals. Place servers and keys on a ring, each key belongs to the next server clockwise, and adding a server moves only the keys in one arc. With virtual nodes, distribution smooths out further. Rather than re-derive it here, jump back to the ring walkthrough →
Fig 6-4 · the same ring · same key placement rule
With the ring in place, the next section asks a sharper question: how do we survive losing the server a key is assigned to? The answer is to put the key on more than one server, walked clockwise — replication.
Section · Data replication
N copies, walked clockwise
Partitioning solves capacity; replication solves durability and availability. Every key lives on N servers instead of one — typically three. The recipe is simple: once a key has been placed on its primary server (the next one clockwise on the ring), walk to the next N−1 distinct physical hosts clockwise and copy the key onto each of them.
Why "distinct physical hosts"
Virtual nodes are great for distribution, but they mean a single machine can occupy several positions on the ring. If two of those positions happen to be adjacent, the naïve "next clockwise" walk would put two replicas on the same physical host — defeating the purpose. The walk therefore skips vnodes that map back to a host already in the replica set.
Placement, in practice
- →A single replica per rack helps if a power supply or a top-of-rack switch fails.
- →A single replica per availability zone or data centre helps when an entire failure domain is lost.
- →The replica set is the unit that quorum, vector clocks, and Merkle anti-entropy all operate on — every following section is a deeper look at one of those mechanisms.
Section · Consistency & quorum
The N · W · R math
Once a key has N copies, the system has a choice on every request: how many copies does it consult before answering? The answer is the same three letters Dynamo introduced: N is the replica count, W is how many replicas must acknowledge a write, and R is how many must respond to a read.
The one inequality to remember
When W + R > N, any read intersects with any write on at least one replica. The freshest version is guaranteed to be in the read set — that's strong consistency. When W + R ≤ N, a read can miss the latest write entirely; that's the eventual- consistency regime, faster but stale-prone.
Common operating points
- 01R = 1, W = N. Fast reads, slow writes — every write must reach every replica before it can be acknowledged.
- 02W = 1, R = N. Fast writes, slow reads — the mirror image.
- 03W + R > N. Balanced, strongly consistent. The Dynamo default of N=3, W=2, R=2 lives here — it tolerates one replica being down while still meeting both quorums.
- 04W + R ≤ N. Low latency, weak consistency — used per-request when staleness is acceptable.
Make it concrete — Live Lab
The Quorum Playground at the end of this module runs the same math against a real Celery-backed simulation. Slide N, W, R, optionally inject a partition, and watch the consistency rate and latency curve emerge over hundreds of operations.
Section · Vector clocks
Versioning & conflict resolution
An eventually-consistent store accepts writes during partitions, which means two replicas can end up with versions that neither descends from the other. The system needs a way to detect this — to distinguish “Y happened after X” from “X and Y happened in parallel, neither aware of the other.” That mechanism is the vector clock.
The clock, in one paragraph
A vector clock attaches a per-replica counter to each version: writing on replica Sa bumps Sa's counter, writing on Sb bumps Sb's. Comparison is componentwise: if every counter in clock A is ≤ the same counter in clock B, then A happens-before B. If neither dominates, the two clocks are concurrent — no causal relationship — and the application must decide which version wins (or merge them).
How the chapter's scenario plays out
- Sa accepts a write. Its clock advances by one in Sa's slot.
- Sa replicates the value to Sb. Sb merges the clock and now matches.
- Sb writes a new value. The clock now has both Sa and Sb counters, strictly descending the previous version — linear history.
- A partition isolates Sc. Sa writes again; Sc independently writes a different value. Both clocks legitimately bump their own counter but stay zero in the other.
- When the partition heals, the merge step finds the two clocks incomparable — neither happens-before the other — and flags the pair as a conflict for the caller to reconcile.
A practical note
Vector clocks have one painful property: the vector lengthens with every replica that ever touched a key. Production systems cap them — either by truncating old entries or by pinning the clock dimensionality — and recover from rare edge cases via the anti-entropy mechanism in the next section.
Section · Failure handling
Detect, tolerate, repair
In a cluster of any reasonable size, something is always broken. The system handles that on three timescales: detection spreads the news that a node is gone, temporary tolerance keeps writes succeeding while it's down, and repair reconciles whatever drift accumulated. Each gets its own mechanism and its own animation below.
01 · Detect — gossip
A centralised “is alive?” ping would itself be a single point of failure, so production systems instead use a decentralised gossip protocol. Each node periodically picks a few random peers and exchanges its view of cluster membership — including how long it's been since it last heard from every other node. A high silence count above a threshold means the node is considered offline. Verdicts converge across the cluster within a handful of rounds.
02 · Tolerate — sloppy quorum & hinted handoff
Strict quorum says “refuse if you can't reach W replicas.” That's correct but it sacrifices availability. A sloppy quorumrelaxes the rule: still require W acks, but accept them from any W live replicas — not necessarily the “true” ones for that key. To keep durability, the substitute writes get tagged with a hint identifying their rightful destination. When that destination returns, the hint replays.
03 · Repair — anti-entropy via Merkle trees
Gossip and handoff handle short outages. For longer divergence — a replica that was down for hours, or a missed write that slipped past — the system needs a way to compare entire keyspaces and exchange only what differs. The trick is a Merkle tree: a binary tree of hashes built over the keys. Compare roots; if they match, the replicas are in sync and we're done. If they don't, recurse into the children, isolating divergence to a small subtree and a small wire payload.
Make it concrete — Merkle Sync Lab
The Merkle Sync Live Lab builds full trees on two simulated replicas, walks the comparison level by level on a real backend run, and reports the bytes-on-the-wire savings vs a brute-force full keyspace exchange.
Section · System architecture
Every node, every responsibility
Stitching the previous sections together gives you a peer- to-peer cluster: a ring of identical nodes, each holding a slice of the keyspace and N copies of every key. There's no leader, no metadata service, no special node — any peer can serve any client request, becoming the coordinator for that op's lifetime.
Why peer-to-peer
A leader-based design (one master, many followers) is simpler to reason about, but the leader is both a scaling ceiling and a fault domain. Dynamo's choice — and Cassandra's inheritance — is to push every responsibility down to every node. The cluster scales by adding peers; it tolerates failures because no one peer is irreplaceable.
The six responsibilities, in order of a request's life
- Client API — the receiving peer becomes coordinator and routes the request.
- Failure detection — gossip determines which replicas are reachable right now.
- Replication — the coordinator fans out the operation to the live replicas for that key.
- Conflict resolution — vector clocks on each version, application-level merge when needed.
- Anti-entropy repair — Merkle-tree exchanges run in the background to heal long-term drift.
- Storage engine — the local LSM-tree turns acknowledged writes into durable SSTables.
Section · Storage engine
Write path · read path
The distributed scaffold is in place; what runs inside each node? Bigtable, Cassandra, and ScyllaDB all converge on the same shape: a log-structured merge tree. The engine optimises for fast writes — every put becomes an append, never a random update — and uses careful background work to keep reads cheap.
Write path
- Commit log. The write is appended to a write-ahead log on disk and fsynced. That makes it durable — if the process dies before anything else happens, recovery replays the log.
- Memtable. The write is then placed into an in-memory sorted map keyed by its key. The memtable is fast (no disk seeks) and keeps the keyspace sorted so future flushes are cheap.
- SSTable flush. When the memtable hits its size threshold, the engine freezes it and writes a sorted-string table (SSTable) to disk. SSTables are immutable; updates land in fresh ones.
- Compaction (background) — older SSTables are merged into newer levels, dropping superseded versions and deleted keys.
Read path
- Memtable check. The most recent writes are still in RAM, so the engine checks the memtable first. A hit returns immediately.
- Bloom filter. Each SSTable has a probabilistic Bloom filter. If the filter says “definitely not present” the engine skips that file entirely — no disk read at all. If it says “maybe present” (false-positive possible), the engine continues.
- SSTable scan. Within each candidate SSTable, a small in-memory index points to the exact disk offset for the key's sorted block; binary search finishes the lookup.
Why this shape wins
Disks (and even SSDs) are dramatically faster at sequential writes than random ones. The LSM tree turns every write into a sequential append at the commit log, batches them in the sorted memtable, and writes the eventual SSTable as one sequential pour. The complexity moves to the read path, where Bloom filters keep most reads from ever touching disk for a key that doesn't exist.
Section · Wrap-up
Production systems & summary
The Dynamo paper landed in 2007 and gave the industry both a working AP design and the vocabulary to describe it. The systems below are the obvious descendants — different in the details, identical in the bones.
Amazon Dynamo
2007The paper that started the lineage. AP-leaning, peer-to-peer, vector clocks, sloppy quorum, anti-entropy via Merkle trees — every mechanism this chapter walks through traces back here.
↗ referenceApache Cassandra
2008Took Dynamo's distribution model and bolted on BigTable's column-family data model + storage engine. CL=ONE/QUORUM/ALL exposes the W·R math directly to applications.
↗ referenceGoogle Bigtable
2006Different lineage (single-master, strong consistency), but the LSM-tree storage engine in this section is largely Bigtable's invention. Influenced every subsequent KV store's local design.
↗ referenceThe chapter in one table
Every section solved a goal. Below: the ten goals this chapter held itself accountable to, paired with the technique that earns each one and a jump back to the section that derives it.
Make all of this concrete
The two Live Labs at the end of this module run real backend simulations against the math we just walked through — the Quorum Playground for N·W·R, and the Merkle Tree Sync for anti-entropy efficiency.
Live Lab · Quorum Playground
Watch consistency emerge from N · W · R
Pick a replica count and your quorum thresholds, optionally isolate part of the cluster behind a partition, and watch the simulator stream one operation at a time over SSE. The acceptance rate chart and per-op feed are driven by a real Celery task; nothing about this is faked client-side.
What you're looking at
- →Each operation draws a lognormal-ish latency per replica and reports the W-th (or R-th) ack as the completion time. The model is intentionally simple — it's the shape, not the precise numbers, that matters.
- →If you isolate enough replicas via the partition slider, the cluster can't form a quorum and writes get refused outright. The acceptance-rate curve drops to zero.
- →The capacity badge in the corner shows live concurrent slots — capped at three across all users so the worker pool stays calm.
Live Lab · Merkle Tree Sync
Reconcile two replicas without shipping the world
Two replicas hold the same keyspace, but a handful of keys have diverged. Each side builds a Merkle tree on its local data; the lab walks both trees in lockstep, short-circuiting on every matched subtree, and reports the bytes-on-the-wire it took to find the differences. For a few diffs in a thousand keys, the savings are dramatic.
What you're looking at
- →The backend builds two real SHA-256 trees, then compares them top-down. Each level's visited/matched/diverged counts stream over SSE.
- →The final byte tally weighs hash exchanges + leaf-level diffs against a baseline that ships every key and value. With M ≪ K the savings ratio gets close to 100%.
- →Crank divergence up to K to see the worst case: the whole tree disagrees, and the algorithm pays for the hash exchanges on top of shipping all the data anyway.