Section · Overview
Designing a rate limiter
A rate limiter caps how many requests a client or service can issue in a given window. It protects you from accidental overuse, hostile traffic, and the cost spiral that happens when neither is checked. This chapter walks through five algorithms, the architecture behind them, and the distributed-correctness traps you hit at scale.
Why rate limit
- Prevent resource starvation from DoS attacks. Most public APIs publish a default ceiling for exactly this reason: Twitter caps writes at 300 / 3h, Google Docs at 300 reads/min/user.
- Reduce cost. If third-party services bill you per call, a rate limiter is the only thing standing between a buggy retry loop and a six-figure surprise.
- Prevent server overload. Bots and misbehaving clients can fill your queues; throttling at the edge keeps the rest of the system within capacity.
Where it goes
Client-side
Insecure. A client-side limiter is trivially bypassed by a hostile or modified client. Useful for UX hints only, never as a safety control.
Server-side
In your code. Embed the limiter in the API server. Full control, but every service implements its own, and rules don't compose across teams.
Middleware / Gateway
API gateway. Cloud microservices favor a dedicated gateway tier (Envoy, Kong, AWS API Gateway). Cross-cutting policy, one configuration to maintain.
Five algorithms
The chapter compares five strategies. They split along two axes: how they store state (per-request log vs. counter vs. bucket) and what they do at burst (allow / smooth / strict).
Worked scenarios
Twitter: 300 posts per 3 hours
A user can publish at most 300 posts in any rolling three-hour window. The cap protects fanout infrastructure and discourages spam.
Without a limit, a single compromised account could flood every follower's feed and the message queue. The cap is per-user, sustained over hours.
Recommended · Sliding Window Counter
Three hours is large enough that fixed-window edge effects matter; the per-user counter doesn't need timestamp-level precision, just smooth enforcement.
Google Docs APIs — read requests per minute per user
Default read quota is 300 per 60 seconds per user, configurable up. The limit is short-window and high-frequency.
Read fan-out from a single client (e.g. a custom integration) can saturate downstream services. A short rolling window catches misbehaving clients quickly.
Recommended · Token Bucket
Token bucket tolerates legitimate short bursts (a UI loading several documents at once) while still enforcing the steady-state rate.
Marketing SMS — 5 per day per user
A messaging service caps marketing SMS to 5 per user per day, hard limit. Compliance, not capacity.
User-experience and regulatory reasons. Going over is a real failure, not a soft hint.
Recommended · Sliding Window Log
Five per day is small enough that the per-request memory cost is trivial, and absolute precision matters — no edge-of-window 2× allowed.
Auth — 5 login attempts per minute per IP
Login endpoint capped at 5 attempts per minute per IP to slow credential stuffing without locking out users on retry.
Brute-force defence. The limit must be strict — letting through a 6th attempt on the boundary defeats the point.
Recommended · Sliding Window Log
Strict rolling-window precision; per-IP scope keeps memory bounded.
Algorithm · Bucket
Token Bucket
A container has a pre-defined capacity. Tokens are added by a refiller at fixed intervals up to that capacity. Every request consumes one token; if the bucket is empty, the request is dropped. The bucket size sets the burst tolerance, the refill rate sets the steady-state throughput. Used by Amazon and Stripe to throttle their public APIs.
How it decides
- 01. A refiller adds tokens at a fixed rate, capped at the bucket capacity.
- 02. Each incoming request consumes one token.
- 03. If the bucket is empty, the request is dropped.
- 04. Capacity sets burst tolerance, refill rate sets steady-state throughput.
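The steps above can be sketched in a few lines of Python. This is a minimal single-process sketch, not a production implementation: the class name `TokenBucket`, the `allow` method, and the injectable `clock` parameter are assumptions made for illustration. It also swaps the background refiller for lazy refill (tokens are credited when a request arrives), which gives the same accept/drop decisions without a timer.

```python
import time


class TokenBucket:
    """Token bucket: capacity bounds bursts, refill_rate bounds sustained throughput."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate      # tokens added per second
        self.clock = clock                  # injectable so the sketch is testable
        self.tokens = float(capacity)       # the bucket starts full
        self.last_refill = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Lazy refill: credit tokens earned since the last call, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1                # each request consumes one token
            return True
        return False                        # empty bucket: drop the request
```

With `capacity=3` and `refill_rate=1`, a burst of four requests accepts three and drops the fourth; two seconds later two more tokens are available.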
Pros
- +Easy to implement and reason about
- +Memory efficient (two numbers per bucket)
- +Allows short bursts when tokens are saved up
Cons
- −Two parameters (capacity, refill rate) can be hard to tune correctly
Recommended for
Public APIs that benefit from absorbing short traffic spikes (Amazon, Stripe).
Tunable parameters
Maximum tokens the bucket can hold — determines burst tolerance.
range · 1–500 tokens
How many tokens are added per second — determines sustained throughput.
range · 1–500 tokens / sec
Algorithm · Queue
Leaking Bucket
Requests arrive and are appended to a fixed-size FIFO queue. The queue is drained at a constant leak rate. If the queue is full when a request arrives, the request is dropped. Smooths bursts by shifting them in time rather than absorbing them. Shopify uses leaky buckets to rate-limit API requests.
How it decides
- 01. Requests arrive and are pushed onto a fixed-size FIFO queue.
- 02. If the queue is at capacity, the new request is dropped.
- 03. A background process drains the queue at a constant leak rate.
- 04. Net effect: bursts are smoothed in time rather than absorbed, so downstream sees a steady rate.
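A rough sketch of the queue-and-drain behaviour described above, assuming a caller that supplies timestamps explicitly. The names `LeakyBucket`, `offer`, and `drain` are hypothetical, and the background drainer is modelled as a `drain(now)` call that releases whatever is due.

```python
from collections import deque


class LeakyBucket:
    """Fixed-size FIFO queue drained at a constant rate."""

    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate          # requests released per second
        self.queue = deque()
        self.last_leak = 0.0

    def offer(self, request, now: float) -> bool:
        """Enqueue a request; return False if the queue is full and it is dropped."""
        if len(self.queue) >= self.capacity:
            return False
        if not self.queue:
            self.last_leak = now            # idle time earns no drain credit
        self.queue.append(request)
        return True

    def drain(self, now: float) -> list:
        """Release, in FIFO order, the requests that are due by `now`."""
        due = int((now - self.last_leak) * self.leak_rate)
        released = [self.queue.popleft() for _ in range(min(due, len(self.queue)))]
        self.last_leak += due / self.leak_rate
        return released
```

Resetting `last_leak` when the queue is empty is a design choice in this sketch: it keeps a long idle period from letting a later burst leave all at once.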
Pros
- +Memory efficient (queue size is bounded)
- +Processes requests at a fixed rate — predictable downstream load
- +Suitable when a stable outflow rate is required
Cons
- −Bursts fill the queue with old requests; if not processed in time, recent requests are dropped
- −Two parameters (capacity, leak rate) — same tuning challenge as token bucket
Recommended for
Stable, smoothed outflow scenarios (Shopify, payment processing).
Tunable parameters
Maximum queued requests waiting to be processed.
range · 1–500 requests
How many requests drain from the queue per second.
range · 1–500 requests / sec
Algorithm · Window counter
Fixed Window Counter
Time is divided into fixed-size windows (e.g., 1-minute slots). Each window has a counter; new requests increment the counter for the current window. When the counter reaches a threshold, further requests are dropped until the next window starts. Easy to implement and memory-efficient — but a burst at the boundary between two windows can deliver up to 2× the intended rate.
⚠ The edge-spike problem
The chapter walks through a worst case: with a 1-minute window and a 5 requests/min cap, a client can issue 5 requests in the last second of one window and 5 in the first second of the next — 10 requests in two seconds, double the intended rate. Fixed window is simple but this edge effect is real, and at API-server scale it becomes a capacity-planning problem.
How it decides
- 01. Time is sliced into fixed-size windows (e.g., 1 minute each).
- 02. Each window has its own counter starting at 0.
- 03. Each request increments the current window's counter; once the threshold is reached, further requests are dropped.
- 04. At the next window boundary, the counter resets to 0.
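The following sketch implements the four steps and then replays the edge-spike worst case with the same numbers (5 requests per 1-minute window). The class name `FixedWindowCounter` and the millisecond timestamps passed to `allow` are assumptions for illustration.

```python
class FixedWindowCounter:
    """One counter per fixed window; the counter resets at each boundary."""

    def __init__(self, limit: int, window_ms: int):
        self.limit = limit
        self.window_ms = window_ms
        self.window_id = None
        self.count = 0

    def allow(self, now_ms: int) -> bool:
        window_id = now_ms // self.window_ms
        if window_id != self.window_id:     # crossed a boundary: start a fresh counter
            self.window_id = window_id
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


# Edge spike: 5 requests late in one window, 5 more early in the next.
limiter = FixedWindowCounter(limit=5, window_ms=60_000)
burst = [59_000 + i * 100 for i in range(5)] + [60_000 + i * 100 for i in range(5)]
accepted = sum(limiter.allow(t) for t in burst)   # all 10 pass
```

All ten requests are accepted within about 1.4 seconds, even though the nominal cap is 5 per minute.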
Pros
- +Memory efficient (one counter per window)
- +Trivially easy to understand and implement
- +Resetting the counter on a window boundary is fast
Cons
- −Spikes at window edges can let through 2× the threshold for a brief moment
Recommended for
Simple rate limits where the 2× edge effect is tolerable.
Tunable parameters
Length of each fixed time window.
range · 1000–60000 ms
Threshold per window before requests are dropped.
range · 1–500 requests
Algorithm · Window log
Sliding Window Log
Maintains a sorted set of timestamps for recent requests. When a new request arrives, expired timestamps (outside the rolling window) are removed; if the remaining count is below the threshold, the request is accepted and its timestamp added. Precise, but it stores one timestamp per request, and implementations that log before checking also hold timestamps for rejected requests.
How it decides
- 01. Maintain a sorted set of timestamps for every accepted request (Redis sorted set is the canonical store).
- 02. When a new request arrives, remove all timestamps older than (now − window).
- 03. If the remaining count is below the threshold, accept and add the new timestamp; else drop.
- 04. The chapter notes: rejected requests still consume memory briefly because their timestamp is added before the count check on some implementations.
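A minimal in-memory sketch of the log variant, assuming the accept-then-log ordering from step 03 (so rejected requests are not stored). The name `SlidingWindowLog` is hypothetical, and a `deque` stands in for the Redis sorted set.

```python
from collections import deque


class SlidingWindowLog:
    """Keeps one timestamp per accepted request inside the rolling window."""

    def __init__(self, limit: int, window_ms: int):
        self.limit = limit
        self.window_ms = window_ms
        self.log = deque()                  # timestamps in arrival order

    def allow(self, now_ms: int) -> bool:
        cutoff = now_ms - self.window_ms
        # Evict timestamps that have fallen out of the rolling window.
        while self.log and self.log[0] <= cutoff:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now_ms)
            return True
        return False


# Same boundary burst that doubles the rate under a fixed window.
limiter = SlidingWindowLog(limit=5, window_ms=60_000)
burst = [59_000 + i * 100 for i in range(5)] + [60_000 + i * 100 for i in range(5)]
accepted = sum(limiter.allow(t) for t in burst)   # only the first 5 pass
```

Under the same burst that lets ten requests through a fixed window, only five are accepted here, because the window is measured back from each arrival.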
Pros
- +Very accurate — strictly enforces the rolling-window threshold
- +No window-edge spike (unlike fixed window)
Cons
- −High memory cost — even rejected requests still consume space briefly
Recommended for
Precise rate limiting where memory is not a constraint (sensitive APIs, login endpoints).
Tunable parameters
Length of the rolling window.
range · 1000–60000 ms
Threshold of accepted requests within the rolling window.
range · 1–500 requests
Algorithm · Window counter
Sliding Window Counter
Combines fixed window's memory efficiency with sliding log's accuracy. Tracks two counters (current window and previous window) and computes a weighted estimate based on how much of the rolling window overlaps each. Smooths spikes at window edges, with only a minor approximation versus the precise log-based variant. Cloudflare reports that about 0.003% of requests were wrongly allowed or rate limited across four hundred million requests.
How it decides
- 01. Track two counters: current window and previous window.
- 02. When a request arrives at time t, compute the rolling-window estimate as a weighted blend of the two counters: estimate = current + previous × (fraction of the rolling window that still overlaps the previous window).
- 03. If the estimate is below threshold, accept and increment current.
- 04. At the window boundary, current rolls into previous and current resets.
The chapter cites Cloudflare seeing 0.003% over-allowance versus exact tracking across 400M requests.
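The weighted blend can be written as `current + previous * (1 - elapsed_fraction)`, where `elapsed_fraction` is how far the clock is into the current window. The sketch below assumes monotonically increasing millisecond timestamps; `SlidingWindowCounter`, `estimate`, and `allow` are illustrative names, and the numbers in the usage lines (limit 7, 5 requests in the previous minute, 3 in the current one) are made up for the example.

```python
class SlidingWindowCounter:
    """Two counters plus a weighted estimate of the rolling-window count."""

    def __init__(self, limit: int, window_ms: int):
        self.limit = limit
        self.window_ms = window_ms
        self.window_id = 0
        self.current = 0
        self.previous = 0

    def _roll(self, now_ms: int) -> None:
        window_id = now_ms // self.window_ms
        if window_id == self.window_id:
            return
        # Adjacent window: current becomes previous. Longer gap: both are stale.
        self.previous = self.current if window_id == self.window_id + 1 else 0
        self.current = 0
        self.window_id = window_id

    def estimate(self, now_ms: int) -> float:
        self._roll(now_ms)
        elapsed_fraction = (now_ms % self.window_ms) / self.window_ms
        return self.current + self.previous * (1 - elapsed_fraction)

    def allow(self, now_ms: int) -> bool:
        if self.estimate(now_ms) < self.limit:
            self.current += 1
            return True
        return False


sw = SlidingWindowCounter(limit=7, window_ms=60_000)
first_minute = [sw.allow(t) for t in range(5)]             # 5 requests in window 0
second_minute = [sw.allow(60_000 + t) for t in range(3)]   # 3 requests in window 1
at_30_percent = sw.estimate(78_000)                        # 3 + 5 * 0.7
```

At 30% into the second minute the estimate is about 6.5, so one more request fits under the limit of 7 and the next one does not.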
Pros
- +Smooths spikes from a burst — no window-edge problem
- +Memory efficient (two counters, not a per-request log)
Cons
- −Approximate — assumes the previous window's traffic was uniformly distributed
Recommended for
Production-grade rate limiting at scale (Cloudflare, public APIs).
Tunable parameters
Length of each window — the rolling window slides at this granularity.
range · 1000–60000 ms
Threshold per rolling window.
range · 1–500 requests
Section · Architecture
Architecture & distributed concerns
A single-server rate limiter is easy. The trouble starts when you scale to many web servers, each handling a fraction of traffic for the same user. Five stages: where the limiter lives, the race condition that bites at scale, how to share state across instances, the rules schema in production, and what to monitor after deploy.
Where it lives
The rate limiter is middleware: a small layer between client and API servers that consults a shared counter store (Redis) and a rules cache before forwarding the request. If the limit is reached, it short-circuits with 429 Too Many Requests and a Retry-After header, and the request never reaches the API tier. Below the limit, it atomically increments the counter in Redis (INCR + EXPIRE inside a Lua script) and forwards the request.
Rules are pulled from a config store into a local cache on a schedule; the limiter consults the cache, not the store, on the hot path.
The race condition
⚠ Read → check → increment is three steps
Naïve implementation: read counter, check threshold, increment counter. Two concurrent threads can both read the same value, both pass the check, both increment — letting through one more request than the threshold allows. At production traffic this happens constantly.
Two solutions the chapter recommends
- A. Lua scripts. Run the read-check-increment sequence on the Redis server inside a single EVAL call, which is atomic by construction. This is what production rate limiters at Stripe, Cloudflare, and Shopify use.
- B. Sorted sets. For the sliding-window log variant, Redis sorted sets give O(log N) atomic add + remove-by-score. The chapter calls this out as a special case.
Distributed locks (Redlock etc.) work but are slow under contention. The chapter explicitly recommends Lua scripts over locks for this reason.
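To make the race concrete without a Redis server, the sketch below first replays the bad interleaving step by step, then shows check-and-increment as one indivisible operation. It is an in-process analogy only: a `threading.Lock` stands in for the atomicity that a Lua script gives you on the Redis server, and `AtomicLimiter` and `try_acquire` are hypothetical names.

```python
import threading

LIMIT = 5

# Naive interleaving, replayed step by step: one slot is left (count == 4).
count = 4
seen_a = count                  # caller A reads 4
seen_b = count                  # caller B reads 4 before A has written
naive_accepted = 0
for seen in (seen_a, seen_b):
    if seen < LIMIT:            # both checks pass on the stale value
        count += 1
        naive_accepted += 1
# count is now 6: one request more than the limit allows.


class AtomicLimiter:
    """Check and increment as a single step (the lock plays the role of the script)."""

    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            if self.count >= self.limit:
                return False
            self.count += 1
            return True


limiter = AtomicLimiter(LIMIT)
results = []
threads = [threading.Thread(target=lambda: results.append(limiter.try_acquire()))
           for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Exactly LIMIT of the 50 concurrent callers are accepted.
```

The same shape carries over to Redis: the body of `try_acquire` is what would live inside the script, so no other client can observe the counter between the check and the increment.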
The same atomicity story shows up underneath any production key-value store — Chapter 6's storage engine relies on commit-log appends + in-memory ordering to give the equivalent guarantee one node down.
Synchronization across instances
Web servers are stateless. Two requests from the same user can land on different instances. Each instance needs to see the same counter. Two patterns:
- ✗ Sticky sessions. Pin a user to a specific instance. Works, but breaks horizontal scaling and load distribution.
- ✓ Centralized data store. Use Redis as the shared counter store. All instances read and write the same key, with atomicity provided by Lua scripts as in stage 02.
Most production deployments accept eventual consistency: the counter may briefly disagree across regions, but the rate-limit decision converges within the network round-trip.
Rules format
The chapter references Lyft's open-source rate limiter, which uses a YAML schema with a domain and a list of descriptors. Three real-world examples:
Lyft / Envoy style
5 marketing messages per day
Throttle outbound marketing per user, exactly the chapter's first example. Domain groups related rules; descriptors carry the dimension being limited.
domain: messaging
descriptors:
  - key: message_type
    value: marketing
    rate_limit:
      unit: day
      requests_per_unit: 5
5 login attempts per minute
Per-account auth throttling: guards against credential-stuffing without locking out legitimate retries.
domain: auth
descriptors:
  - key: auth_type
    value: login
    rate_limit:
      unit: minute
      requests_per_unit: 5
100 API requests per second per IP
Coarse IP-keyed throttle for a public read API. Hot first line of defence behind the load balancer.
domain: api
descriptors:
  - key: remote_address
    rate_limit:
      unit: second
      requests_per_unit: 100
Monitoring
After deployment, two questions matter: is the algorithm effective, and are the rules tuned right? If a sudden traffic surge keeps getting blocked, the rule may be too strict; if 429s never appear in the metrics, it may be too loose. Watch the drop rate per rule per scope.
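One way to track the drop rate per rule per scope is a pair of counters keyed by rule and scope. The sketch below is illustrative only: `record`, `drop_rate`, and the rule and scope strings are hypothetical, and a real deployment would emit these to its metrics system rather than hold them in process memory.

```python
from collections import Counter

decisions = Counter()


def record(rule: str, scope: str, allowed: bool) -> None:
    """Count one limiter decision for a (rule, scope) pair."""
    decisions[(rule, scope, "allowed" if allowed else "dropped")] += 1


def drop_rate(rule: str, scope: str) -> float:
    """Fraction of decisions for this rule and scope that were drops."""
    dropped = decisions[(rule, scope, "dropped")]
    total = dropped + decisions[(rule, scope, "allowed")]
    return dropped / total if total else 0.0


for allowed in (True, True, True, False):
    record("auth/login", "per-ip", allowed)
rate = drop_rate("auth/login", "per-ip")
```

A rate that sits near zero for weeks suggests the rule may be too loose; a rate that spikes during ordinary surges suggests it may be too strict.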
Section · Wire format
HTTP responses for throttled requests
When a request is dropped, the server tells the client what happened, why, and when it's safe to retry. Three headers carry the budget; one status code carries the verdict. Getting these right is the difference between a client that gracefully backs off and one that hammers your error rate to 100%.
Status · 429 Too Many Requests
Standard HTTP status. Distinct from 503 Service Unavailable: 503 says “the system is broken,” 429 says “the system is fine and you're going too fast.” Clients should treat them differently: 503 means circuit-break and back off; 429 means slow down to within the published budget.
Rate-limit headers
X-Ratelimit-Limit
The total quota for the current window.
Example · 100
X-Ratelimit-Remaining
Allowed requests left in the current window.
Example · 42
X-Ratelimit-Retry-After
Seconds until the limit resets — clients use this to schedule the next attempt.
Example · 13
Standard Retry-After works too, but the X-Ratelimit-* trio is the convention adopted by Twitter, GitHub, and most public APIs.
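As a sketch of how a server might assemble these pieces, the helper below returns a status code, the three headers described above, and a JSON body. The function name `build_throttled_response` and the body field names are assumptions for illustration; it is framework-agnostic and does not send anything over the network.

```python
import json


def build_throttled_response(limit: int, retry_after_s: int):
    """Return (status, headers, body) for a request rejected by the limiter."""
    headers = {
        "Content-Type": "application/json",
        "X-Ratelimit-Limit": str(limit),
        "X-Ratelimit-Remaining": "0",            # a throttled caller has no budget left
        "X-Ratelimit-Retry-After": str(retry_after_s),
    }
    body = json.dumps({
        "error": "rate_limited",
        "message": f"Quota exceeded for this user. Retry after {retry_after_s} seconds.",
        "retry_after_seconds": retry_after_s,
    })
    return 429, headers, body


status, headers, body = build_throttled_response(limit=100, retry_after_s=13)
```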
Sample response
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-Ratelimit-Limit: 100
X-Ratelimit-Remaining: 0
X-Ratelimit-Retry-After: 13

{
  "error": "rate_limited",
  "message": "Quota exceeded for this user. Retry after 13 seconds.",
  "retry_after_seconds": 13
}
Where rate limiting can live in the stack
The chapter wraps up by noting rate limiting can sit at multiple OSI layers — the Live Lab in this module focuses on layer 7, but the same logic appears at lower layers in production networks.
L3 · Network
iptables-style packet rate limits, BGP flow caps. Coarse, fast, no application context.
L4 · Transport
TCP / UDP connection rate limits at load balancers. Per-IP, per-port — typical front-line DDoS shield.
L7 · Application
Per-user, per-API-key, per-endpoint quotas. Where business logic lives — and where this chapter focuses.
Client-side best practices
- Read X-Ratelimit-Remaining on every response and slow down before you hit zero.
- On a 429, honour Retry-After exactly. Never retry sooner than the server told you to.
- Layer exponential backoff with jitter on top: multiple clients getting 429 at the same instant should not all retry at the same instant.
- Catch 429s explicitly in error handling so the user sees a useful message, not a generic “something went wrong.”
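The second and third practices combine naturally: treat the server's retry hint as a floor and add a jittered exponential backoff on top. The sketch below is one possible shape, not a prescribed policy; `retry_delay`, its default `base_s` and `cap_s` values, and the injectable `rng` are all assumptions for illustration.

```python
import random
from typing import Optional


def retry_delay(attempt: int, retry_after_s: Optional[float],
                base_s: float = 0.5, cap_s: float = 60.0,
                rng=random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    backoff = min(cap_s, base_s * 2 ** attempt)   # exponential growth, capped
    jitter = rng() * backoff                      # spread clients across the interval
    floor = retry_after_s or 0.0                  # never earlier than the server asked
    return floor + jitter


delay = retry_delay(attempt=2, retry_after_s=13, rng=lambda: 0.5)
```

Because the jitter is added to the server's value rather than replacing it, the computed delay is never shorter than the hint, while clients that were throttled together still spread out their retries.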
Section · Live Lab
Real Redis-backed simulation
Pick an algorithm, dial its parameters, choose a traffic profile, and run. The backend builds a real Redis-backed instance of the algorithm (Lua scripts for atomic check-and-increment), simulates traffic via Poisson sampling, and streams every decision back to this page over Server-Sent Events. The visualization below updates from the algorithm's actual state, not a mock.
Run configuration
Algorithm: Token Bucket
Traffic profile: Bursty. A square wave alternating between peak and base rate every period. Tests how the algorithm handles repeated burst pressure.
What to expect (steady-state math)
Arrivals: ~638
Accepted (ceiling): ≤ 85
Acceptance rate: ~13%
Watch the first ~0.3s — the bucket starts full so accepts run high, then settles at the refill rate (5/s) once it drains.
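A ceiling of this kind follows from the token bucket itself: over a run, the bucket cannot accept more than its initial contents plus everything refilled. The sketch below shows that arithmetic. The refill rate (5/s) and the arrival count (~638) come from the text above; the capacity of 10 and the 15-second duration are hypothetical values chosen only to illustrate the formula, not read from the lab's configuration.

```python
def token_bucket_ceiling(capacity: int, refill_rate: float, duration_s: float) -> float:
    """Upper bound on accepted requests: a full bucket plus all tokens refilled."""
    return capacity + refill_rate * duration_s


# Hypothetical capacity and duration; refill rate and arrivals as quoted above.
ceiling = token_bucket_ceiling(capacity=10, refill_rate=5, duration_s=15)
acceptance = ceiling / 638
```

With these assumed inputs the ceiling is 85 accepted requests, or roughly 13% of 638 arrivals.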