Section · Overview

Designing a key-value store

A key-value store is the simplest non-relational database: every record is a unique key bound to an opaque value. The API is two calls — put(key, value) and get(key) — and yet what sits behind that surface, when it has to scale, survive failures, and stay fast, is one of the densest chapters in distributed systems.
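To make the two-call surface concrete, here is a minimal single-process sketch. The class name `KVStore` and the choice to return `None` for a missing key are assumptions made for illustration; nothing here scales or survives failures, which is exactly what the rest of the module adds.

```python
from typing import Dict, Optional


class KVStore:
    """Single-process stand-in for the put/get contract (hypothetical class)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # The store never inspects the value; it only binds it to the key.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Returns None for a missing key; a real store might raise instead.
        return self._data.get(key)


store = KVStore()
store.put("user:42:last_login", b"1747327200")
store.put("user:42:last_login", b"1747330800")  # same key: the value is replaced
latest = store.get("user:42:last_login")
missing = store.get("cart:shopper-17")
```

Because keys are unique, the second `put` overwrites the first rather than adding a record, and the store treats the bytes as opaque throughout.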

The surface is two calls: put(key, value) and get(key).

Key → Value
user:42:last_login → 1747327200
session:9c3a… → {uid:42,role:'rw'}
cart:shopper-17 → [sku-901, sku-742]
feature:dark_mode
rate:ip:10.0.4.7

Behind the surface

· keys are unique
· values are opaque
· no schema, no joins
· must scale horizontally
· must survive node loss

Fig 6-1 · the key-value contract

This module walks the design space the book opens up. Each section keeps a single idea in view; the two Live Labs at the end let you feel the consequences of the decisions interactively.

The shape of the chapter

01 · Single-server baseline: hash table + tiered storage
02 · CAP theorem: pick two when the network breaks
03 · Data partition: reuse the Ch5 hash ring
04 · Data replication: N copies, walked clockwise
05 · Consistency & quorum: the N · W · R math
06 · Vector clocks: versioning & conflict resolution
07 · Failure handling: gossip · sloppy quorum · anti-entropy
08 · System architecture: every node, every responsibility
09 · Storage engine: LSM write path, Bloom read path
10 · Wrap-up: Dynamo, Cassandra, Bigtable mapped

Seven non-negotiable design goals

Every decision in the rest of the chapter trades one of these goals against another. Holding all seven simultaneously is impossible — CAP makes sure of it — so the design picks a point in the space and owns the consequences.

· Small KV size: values fit < 10 KB
· Big data capability: partition across nodes
· High availability: replicate · sloppy quorum
· High scalability: consistent hashing
· Automatic scaling: add/remove on the fly
· Tunable consistency: N · W · R per request
· Low latency: in-memory + LSM

All seven at once is impossible. → CAP

Fig 6-2 · the seven goals and how each is earned

Two Live Labs in this same module

After the reference walkthrough, the Quorum Playground lets you slide N, W, R and watch consistency emerge or break in real time; the Merkle Tree Sync lab shows two diverged replicas reconciling without shipping their full keyspaces.
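The N · W · R sliders come down to one inequality, assuming the usual quorum rule: when W + R > N, every read set shares at least one replica with every write set, so a read always touches a replica that saw the latest acknowledged write. The sketch below checks that by brute force; the function name is hypothetical and the replicas are just integers.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every set of w replicas shares a member with every set of r replicas."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)  # an empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


strong = quorums_always_overlap(n=3, w=2, r=2)  # W + R = 4 > N
weak = quorums_always_overlap(n=3, w=1, r=1)    # W + R = 2 <= N
```

With N = 3, the setting W = 2, R = 2 always overlaps, while W = 1, R = 1 can read from a replica the write never reached. Lower W and R buy latency at the cost of that guarantee.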

Section · Failure handling

Detect, tolerate, repair

In a cluster of any reasonable size, something is always broken. The system handles that on three timescales: detection spreads the news that a node is gone, temporary tolerance keeps writes succeeding while it's down, and repair reconciles whatever drift accumulated. Each gets its own mechanism below.

01 · Detect — gossip

A centralised “is alive?” ping service would itself be a single point of failure, so production systems use a decentralised gossip protocol instead. Each node periodically picks a few random peers and exchanges its view of cluster membership, including how long it has been since it last heard from every other node. Once the silence for a node exceeds a threshold, that node is considered offline. Verdicts converge across the cluster within a handful of rounds.
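A small simulation can show how silence turns into a verdict. This is a sketch under stated assumptions, not any particular system's protocol: six hypothetical nodes, a push-pull exchange with two random peers per round, and an arbitrary threshold of five silent rounds. Each node tracks the highest heartbeat it has seen for every member and the round in which that heartbeat last advanced.

```python
import random
from typing import Dict, List

SILENCE_THRESHOLD = 5  # silent rounds before a peer is marked offline (assumed)
FANOUT = 2             # peers contacted per round (assumed)


class Node:
    def __init__(self, name: str, members: List[str]) -> None:
        self.name = name
        self.alive = True
        # Per member: highest heartbeat seen, and the round it last advanced.
        self.heartbeat: Dict[str, int] = {m: 0 for m in members}
        self.last_seen: Dict[str, int] = {m: 0 for m in members}

    def tick(self, rnd: int) -> None:
        # A live node bumps its own heartbeat once per round.
        self.heartbeat[self.name] += 1
        self.last_seen[self.name] = rnd

    def merge(self, other: "Node", rnd: int) -> None:
        # Adopt any heartbeat that is newer than what we already know.
        for member, beat in other.heartbeat.items():
            if beat > self.heartbeat[member]:
                self.heartbeat[member] = beat
                self.last_seen[member] = rnd

    def offline(self, rnd: int) -> List[str]:
        return sorted(m for m, seen in self.last_seen.items()
                      if rnd - seen > SILENCE_THRESHOLD)


def run(rounds: int, crash_round: int, victim: str, seed: int = 7) -> Dict[str, List[str]]:
    rng = random.Random(seed)
    names = [f"n{i}" for i in range(6)]
    nodes = {n: Node(n, names) for n in names}
    for rnd in range(1, rounds + 1):
        if rnd == crash_round:
            nodes[victim].alive = False  # the victim stops ticking and gossiping
        live = [n for n in nodes.values() if n.alive]
        for node in live:
            node.tick(rnd)
        for node in live:
            for peer in rng.sample([p for p in live if p is not node], FANOUT):
                peer.merge(node, rnd)  # push our view
                node.merge(peer, rnd)  # pull theirs
    # Each surviving node's list of members it considers offline.
    return {n.name: n.offline(rounds) for n in nodes.values() if n.alive}


verdicts = run(rounds=30, crash_round=10, victim="n3")
```

No node is told that `n3` crashed. Its heartbeat simply stops advancing, and each survivor reaches the same conclusion from its own table. A threshold this low can also flag a healthy but unlucky peer; real deployments tune it against gossip frequency.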

02 · Tolerate — sloppy quorum & hinted handoff

A strict quorum says “refuse the write if you can't reach W of the key's designated replicas.” That is correct, but it sacrifices availability. A sloppy quorum relaxes the rule: it still requires W acks, but accepts them from any W live nodes, not necessarily the “true” replicas for that key. To keep durability, each substitute write is tagged with a hint identifying its rightful destination. When that destination returns, the substitute replays the hinted write to it.
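The write path can be sketched in a few lines. Everything named here is an assumption for illustration: a five-node ring `A` to `E`, N = 3 and W = 2, and a `start` index standing in for wherever the key hashes onto the ring. Hinted writes are kept apart from the substitute's own data, and replay overwrites blindly; a real system would compare versions first.

```python
from typing import Dict, List, Tuple

RING = ["A", "B", "C", "D", "E"]  # hypothetical node names in ring order
N, W = 3, 2                       # replicas per key, acks required (assumed)


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True
        self.data: Dict[str, str] = {}
        # Writes held on behalf of another node: (intended owner, key, value).
        self.hints: List[Tuple[str, str, str]] = []


def sloppy_put(nodes: Dict[str, Node], start: int, key: str, value: str) -> bool:
    """Walk clockwise from `start`; live non-owners stand in for down owners."""
    preferred = [RING[(start + i) % len(RING)] for i in range(N)]
    missing = [name for name in preferred if not nodes[name].up]
    acks = 0
    for step in range(len(RING)):
        node = nodes[RING[(start + step) % len(RING)]]
        if not node.up:
            continue
        if node.name in preferred:
            node.data[key] = value  # normal replica write
            acks += 1
        elif missing:
            owner = missing.pop(0)  # stand in for one unreachable owner
            node.hints.append((owner, key, value))
            acks += 1
    return acks >= W


def replay_hints(nodes: Dict[str, Node]) -> None:
    """Hand each hinted write to its owner once that owner is reachable again."""
    for node in nodes.values():
        if not node.up:
            continue
        remaining = []
        for owner, key, value in node.hints:
            if nodes[owner].up:
                nodes[owner].data[key] = value
            else:
                remaining.append((owner, key, value))
        node.hints = remaining


nodes = {name: Node(name) for name in RING}
nodes["B"].up = False
nodes["C"].up = False
ok = sloppy_put(nodes, start=0, key="cart:shopper-17", value="[sku-901, sku-742]")
hinted_on_d = list(nodes["D"].hints)

nodes["B"].up = True
nodes["C"].up = True
replay_hints(nodes)
```

With `B` and `C` down, a strict quorum would collect one ack from `A` and refuse the write. Here `D` and `E` accept hinted copies, the write succeeds, and the copies reach `B` and `C` once they return.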

03 · Repair — anti-entropy via Merkle trees

Gossip and handoff handle short outages. For longer divergence (a replica that was down for hours, or a missed write that slipped through) the system needs a way to compare entire keyspaces and exchange only what differs. The trick is a Merkle tree: a binary tree of hashes in which each leaf hashes a bucket of keys and their values, and each parent hashes its two children. Compare roots; if they match, the replicas are in sync and we're done. If they don't, recurse into the children whose hashes differ, isolating the divergence to a small subtree and a small wire payload.
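The root-then-children comparison can be sketched as follows. The bucket count, the SHA-256 choice, and the way keys are assigned to buckets are all assumptions for illustration; the tree is stored as a list of levels, leaves first.

```python
import hashlib
from typing import Dict, List, Optional

BUCKETS = 8  # leaf count, a power of two (assumed value)


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def bucket_of(key: str) -> int:
    # Stable across processes, unlike the built-in hash() on strings.
    return int.from_bytes(_h(key.encode())[:4], "big") % BUCKETS


def build_tree(store: Dict[str, str]) -> List[List[bytes]]:
    """Return the tree as levels: leaf hashes first, the single root hash last."""
    buckets: List[List[str]] = [[] for _ in range(BUCKETS)]
    for key in sorted(store):
        buckets[bucket_of(key)].append(f"{key}={store[key]}")
    level = [_h("\n".join(items).encode()) for items in buckets]
    levels = [level]
    while len(level) > 1:
        # Each parent is the hash of its two children, concatenated.
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff_buckets(a: List[List[bytes]], b: List[List[bytes]],
                 depth: Optional[int] = None, index: int = 0) -> List[int]:
    """Walk both trees from the root and return the leaf buckets that differ."""
    if depth is None:
        depth = len(a) - 1  # start at the root level
    if a[depth][index] == b[depth][index]:
        return []           # identical subtree: nothing below needs comparing
    if depth == 0:
        return [index]
    left = diff_buckets(a, b, depth - 1, 2 * index)
    right = diff_buckets(a, b, depth - 1, 2 * index + 1)
    return left + right


replica_a = {f"user:{i}:last_login": "1747327200" for i in range(100)}
replica_b = dict(replica_a)
replica_b["user:42:last_login"] = "1747330800"  # one divergent write

tree_a, tree_b = build_tree(replica_a), build_tree(replica_b)
stale = diff_buckets(tree_a, tree_b)
```

One changed value alters a single leaf hash and the hashes on its path to the root, so the walk ends at exactly one bucket. Only that bucket's keys need to cross the wire, not all one hundred.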

Make it concrete — Merkle Sync Lab

The Merkle Sync Live Lab builds full trees on two simulated replicas, walks the comparison level by level on a real backend run, and reports the bytes-on-the-wire savings vs a brute-force full keyspace exchange.
