Section · Overview

Designing a rate limiter

A rate limiter caps how many requests a client or service can issue in a given window. It protects you from accidental overuse, hostile traffic, and the cost spiral that happens when neither is checked. This chapter walks through five algorithms, the architecture behind them, and the distributed-correctness traps you hit at scale.

Why rate limit

▸ Prevent resource starvation from DoS attacks. Most public APIs publish a default ceiling for exactly this reason: Twitter caps writes at 300 per 3 hours, Google Docs at 300 reads per minute per user.
▸ Reduce cost. If third-party services bill you per call, a rate limiter is the only thing standing between a buggy retry loop and a six-figure surprise.
▸ Prevent server overload. Bots and misbehaving clients can fill your queues; throttling at the edge keeps the rest of the system within capacity.

Where it goes

Client-side

Insecure

A client-side limiter is trivially bypassed by a hostile or modified client. Useful for UX hints only — never as a safety control.

Server-side

In your code

Embed the limiter in the API server. Full control, but every service implements its own, and rules don't compose across teams.

Middleware / Gateway

API gateway

Cloud microservices favor a dedicated gateway tier (Envoy, Kong, AWS API Gateway). Cross-cutting policy, one configuration to maintain.

Five algorithms

The chapter compares five strategies: Token Bucket, Leaking Bucket, Fixed Window Counter, Sliding Window Log, and Sliding Window Counter. They split along two axes: how they store state (per-request log vs. counter vs. bucket) and what they do at burst (allow / smooth / strict).

Worked scenarios

algorithm choice

Twitter — 300 posts per 3 hours

A user can publish at most 300 posts in any rolling three-hour window. The cap protects fanout infrastructure and discourages spam.

Without a limit, a single compromised account could flood every follower's feed and the message queue. The cap is per-user, sustained over hours.

Recommended · Sliding Window Counter

Three hours is large enough that fixed-window edge effects matter; the per-user counter doesn't need timestamp-level precision, just smooth enforcement.
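
A minimal sketch of the sliding window counter idea, under stated assumptions: it keeps only two counters (previous and current fixed window) and weights the previous one by how much of it still overlaps the rolling window. The class name, the injectable `clock` parameter, and the single-process in-memory state are illustrative choices, not something the chapter prescribes.

```python
import time


class SlidingWindowCounter:
    """Approximate rolling-window limiter built from two fixed-window counters."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock            # injectable so tests can control time
        self.current_start = None     # start time of the current fixed window
        self.current_count = 0
        self.previous_count = 0

    def _roll(self, now):
        """Advance to the fixed window that contains `now`."""
        if self.current_start is None:
            self.current_start = now
            return
        elapsed_windows = int((now - self.current_start) // self.window)
        if elapsed_windows == 1:
            self.previous_count = self.current_count
        elif elapsed_windows > 1:
            self.previous_count = 0   # the previous window saw no traffic
        if elapsed_windows >= 1:
            self.current_count = 0
            self.current_start += elapsed_windows * self.window

    def allow(self):
        now = self.clock()
        self._roll(now)
        # Fraction of the previous window still inside the rolling window.
        overlap = 1.0 - (now - self.current_start) / self.window
        estimated = self.previous_count * overlap + self.current_count
        if estimated >= self.limit:
            return False
        self.current_count += 1
        return True
```

For the Twitter-style rule this would be constructed as `SlidingWindowCounter(limit=300, window_seconds=3 * 3600)`, one instance per user. The estimate assumes requests in the previous window were evenly spread, which is the trade-off that buys constant memory per user.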

Google Docs APIs — read requests per minute per user

Default read quota is 300 per 60 seconds per user, configurable up. The limit is short-window and high-frequency.

Read fan-out from a single client (e.g. a custom integration) can saturate downstream services. A short rolling window catches misbehaving clients quickly.

Recommended · Token Bucket

Token bucket tolerates legitimate short bursts (a UI loading several documents at once) while still enforcing the steady-state rate.
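
To make the burst-versus-steady-state behaviour concrete, here is a small token bucket sketch. The names (`TokenBucket`, `capacity`, `refill_per_second`) and the lazy refill on each call are assumptions made for illustration; the bucket size you would pick for a real 300-per-minute quota is a tuning decision the source does not specify.

```python
import time


class TokenBucket:
    """Bucket holds up to `capacity` tokens and refills at a steady rate."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock                 # injectable so tests can control time
        self.tokens = float(capacity)      # start full: an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost=1):
        now = self.clock()
        elapsed = now - self.last_refill
        # Lazily add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A UI that loads several documents at once spends tokens from the full bucket immediately; once the bucket is empty, requests are admitted only as fast as tokens refill.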

Marketing SMS — 5 per day per user

A messaging service caps marketing SMS to 5 per user per day, hard limit. Compliance, not capacity.

User-experience and regulatory reasons. Going over is a real failure, not a soft hint.

Recommended · Sliding Window Log

Five per day is small enough that the per-request memory cost is trivial, and absolute precision matters — no edge-of-window 2× allowed.
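
A sliding window log can be sketched in a few lines, assuming one in-memory log per user. This version records only accepted requests, which is one possible design choice; some descriptions of the algorithm also log rejected requests. The class name and the `clock` parameter are illustrative.

```python
import time
from collections import deque


class SlidingWindowLog:
    """Exact rolling-window limiter: one timestamp stored per accepted request."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock        # injectable so tests can control time
        self.log = deque()        # timestamps of accepted requests, oldest first

    def allow(self):
        now = self.clock()
        cutoff = now - self.window
        # Drop timestamps that have aged out of the rolling window.
        while self.log and self.log[0] <= cutoff:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True
```

With `limit=5` and `window_seconds=86400`, the log never holds more than five timestamps per user, which is why the memory cost is negligible for this rule.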

Auth — 5 login attempts per minute per IP

Brute-force defence. The limit must be strict — letting through a 6th attempt on the boundary defeats the point.

Recommended · Sliding Window Log

Strict rolling-window precision; per-IP scope keeps memory bounded.

Section · Architecture

Architecture & distributed concerns

A single-server rate limiter is easy. The trouble starts when you scale to many web servers, each handling a fraction of traffic for the same user. Five stages: where the limiter lives, the race condition that bites at scale, how to share state across instances, the rules schema in production, and what to monitor after deploy.

Where it lives

Architecture flow

The rate limiter is middleware: a small layer between client and API servers that consults a shared counter store (Redis) and a rules cache before forwarding the request. If the limit is reached, it short-circuits with 429 Too Many Requests and a Retry-After header, and the request never reaches the API tier. Below the limit, it atomically increments the counter in Redis (INCR + EXPIRE inside a Lua script) and forwards the request.

Rules are pulled from a config store into a local cache on a schedule; the limiter consults the cache, not the store, on the hot path.
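
The flow can be sketched roughly as follows. Everything here is a stand-in: `InMemoryCounterStore` plays the role of Redis, `rule` is a plain dict as it might come out of the local rules cache, `forward` is a hypothetical callable that reaches the API tier, and fixed windows are used only to keep the example short.

```python
import time


class InMemoryCounterStore:
    """Stand-in for the shared counter store (Redis in the text)."""

    def __init__(self):
        self.counts = {}

    def incr_if_below(self, key, limit):
        """Check and increment in one call; returns False when the limit is reached."""
        count = self.counts.get(key, 0)
        if count >= limit:
            return False
        self.counts[key] = count + 1
        return True


def rate_limit_middleware(client_id, rule, store, forward, now=time.time):
    """Return (status, headers, body); `forward` is only called when under the limit."""
    t = now()
    window_index = int(t // rule["window_seconds"])
    key = f"{client_id}:{window_index}"
    if not store.incr_if_below(key, rule["limit"]):
        # Short-circuit: tell the client when the current window ends.
        window_end = (window_index + 1) * rule["window_seconds"]
        retry_after = max(1, int(window_end - t))
        return 429, {"Retry-After": str(retry_after)}, "Too Many Requests"
    return forward()
```

Note that the rules lookup happens before this function is called, so the hot path touches only the local cache and the counter store.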

The race condition

⚠ Read → check → increment is three steps

Naïve implementation: read counter, check threshold, increment counter. Two concurrent threads can both read the same value, both pass the check, both increment — letting through one more request than the threshold allows. At production traffic this happens constantly.
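
The interleaving can be reproduced deterministically without real threads. In this illustrative sketch, a generator pauses between the read and the check so that two callers can be stepped by hand; the function names and the dict-backed store are assumptions made for the demonstration.

```python
def naive_allow_steps(store, key, limit):
    """Read-check-increment split into steps so callers can be interleaved."""
    current = store.get(key, 0)       # step 1: read
    yield None                        # another caller may run here
    if current < limit:               # step 2: check against the stale value
        store[key] = current + 1      # step 3: write the incremented value
        yield True
    else:
        yield False


def interleaved_demo():
    """Two callers race for the single remaining slot under a limit of 5."""
    store = {"user:1": 4}
    a = naive_allow_steps(store, "user:1", limit=5)
    b = naive_allow_steps(store, "user:1", limit=5)
    next(a)                           # caller A reads 4
    next(b)                           # caller B also reads 4
    decisions = [next(a), next(b)]    # both pass the check
    return decisions, store["user:1"]
```

`interleaved_demo()` returns `([True, True], 5)`: two requests were admitted although only one slot remained, and the stored counter no longer reflects the number of admitted requests.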

Two solutions the chapter recommends

A. Lua scripts. Run the read-check-increment sequence on the Redis server inside a single EVAL call, which is atomic by construction. This is what production rate limiters at Stripe, Cloudflare, and Shopify use.
B. Sorted sets. For the sliding-window log variant, Redis sorted sets give O(log N) add and remove-by-score. Each command is atomic on its own; wrapping them in MULTI/EXEC or a Lua script makes the whole sequence atomic. The chapter calls this out as a special case.
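
Option A might look roughly like this. The Lua body is a sketch of a fixed-window check-and-increment, not the script any particular company runs. The Python wrapper assumes a client object exposing `eval(script, numkeys, *keys_and_args)`, which is the call shape of redis-py's `Redis.eval`; the function and constant names are hypothetical.

```python
# Lua runs on the Redis server; no other command can interleave with it.
CHECK_AND_INCREMENT = """
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
if current >= tonumber(ARGV[1]) then
  return 0
end
current = redis.call('INCR', KEYS[1])
if current == 1 then
  redis.call('EXPIRE', KEYS[1], tonumber(ARGV[2]))
end
return 1
"""


def allow(client, key, limit, window_seconds):
    """Return True if admitted. `client` is assumed to be a redis-py style client."""
    # One key (KEYS[1]) followed by two arguments (ARGV[1], ARGV[2]).
    result = client.eval(CHECK_AND_INCREMENT, 1, key, limit, window_seconds)
    return result == 1
```

The EXPIRE is set only when the key is first created, so the window starts at the first request and the key cleans itself up afterwards.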

Distributed locks (Redlock etc.) work but are slow under contention. The chapter explicitly recommends Lua scripts over locks for this reason.

The same atomicity story shows up underneath any production key-value store — Chapter 6's storage engine relies on commit-log appends + in-memory ordering to give the equivalent guarantee one node down.

Synchronization across instances

Web servers are stateless. Two requests from the same user can land on different instances. Each instance needs to see the same counter. Two patterns:

✗ Sticky sessions. Pin a user to a specific instance. Works, but breaks horizontal scaling and load distribution.
✓ Centralized data store. Use Redis as the shared counter store. All instances read and write the same key, with atomicity provided by Lua scripts as described under the race condition.

Most production deployments accept eventual consistency: the counter may briefly disagree across regions, but the rate-limit decision converges within the network round-trip.

Rules format

The chapter references Lyft's open-source rate limiter, which uses a YAML schema with a domain and a list of descriptors. Three real-world examples:

Lyft / Envoy style

5 marketing messages per day

Throttle outbound marketing per user — exactly the chapter's first example. Domain groups related rules; descriptors carry the dimension being limited.

domain: messaging
descriptors:
  - key: message_type
    value: marketing
    rate_limit:
      unit: day
      requests_per_unit: 5

5 login attempts per minute

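Auth throttling that guards against credential-stuffing without locking out legitimate retries. As written, the descriptor keys only on auth_type; scoping the limit per account or per IP would need an additional descriptor key.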

domain: auth
descriptors:
  - key: auth_type
    value: login
    rate_limit:
      unit: minute
      requests_per_unit: 5

100 API requests per second per IP

Coarse IP-keyed throttle for a public read API. Hot first line of defence behind the load balancer.

domain: api
descriptors:
  - key: remote_address
    rate_limit:
      unit: second
      requests_per_unit: 100
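
As a rough illustration of how a limiter might resolve a request to one of these rules, the sketch below mirrors the three examples as Python dicts (as if the YAML had already been parsed) and does a simple first-match lookup. The matching logic is a simplification invented for this example and is not Lyft's actual implementation; `find_limit` and `UNIT_SECONDS` are hypothetical names.

```python
RULES = [
    {"domain": "messaging", "descriptors": [
        {"key": "message_type", "value": "marketing",
         "rate_limit": {"unit": "day", "requests_per_unit": 5}}]},
    {"domain": "auth", "descriptors": [
        {"key": "auth_type", "value": "login",
         "rate_limit": {"unit": "minute", "requests_per_unit": 5}}]},
    {"domain": "api", "descriptors": [
        {"key": "remote_address",
         "rate_limit": {"unit": "second", "requests_per_unit": 100}}]},
]

UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def find_limit(rules, domain, key, value):
    """Return (requests, window_seconds) for the first matching descriptor, else None."""
    for rule in rules:
        if rule["domain"] != domain:
            continue
        for descriptor in rule["descriptors"]:
            # A descriptor with no "value" is treated as matching any value for its key.
            if descriptor["key"] == key and descriptor.get("value", value) == value:
                limit = descriptor["rate_limit"]
                return limit["requests_per_unit"], UNIT_SECONDS[limit["unit"]]
    return None
```

A request with no matching descriptor returns `None`, which a caller could treat as "no limit configured".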

Monitoring

After deployment, two questions matter: is the algorithm effective, and are the rules tuned right? If a sudden traffic surge keeps getting blocked, the rule may be too strict; if 429s never appear in the metrics, it may be too loose. Watch the drop rate per rule per scope.
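
One way to track the drop rate per rule per scope is a pair of counters keyed by (rule, scope). This is a minimal in-process sketch with hypothetical names; a real deployment would more likely export these counts to its metrics system.

```python
from collections import Counter


class LimiterMetrics:
    """Counts allowed and dropped requests per (rule, scope) pair."""

    def __init__(self):
        self.allowed = Counter()
        self.dropped = Counter()

    def record(self, rule, scope, allowed):
        bucket = self.allowed if allowed else self.dropped
        bucket[(rule, scope)] += 1

    def drop_rate(self, rule, scope):
        """Fraction of requests dropped; 0.0 when nothing has been recorded."""
        key = (rule, scope)
        total = self.allowed[key] + self.dropped[key]
        return self.dropped[key] / total if total else 0.0
```

A drop rate that stays near zero suggests the rule may be too loose; one that spikes during legitimate surges suggests it may be too strict.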

Section · Wire format

HTTP responses for throttled requests

When a request is dropped, the server tells the client what happened, why, and when it's safe to retry. Three headers carry the budget; one status code carries the verdict. Getting these right is the difference between a client that gracefully backs off and one that hammers your error rate to 100%.

Status · 429 Too Many Requests

Standard HTTP status. Distinct from 503 Service Unavailable: 503 says “the system is broken,” 429 says “the system is fine and you're going too fast.” Clients should treat them differently: 503 means circuit-break and back off; 429 means slow down to within the published budget.

Rate-limit headers

X-Ratelimit-Limit

The total quota for the current window.

Example · 100

X-Ratelimit-Remaining

Allowed requests left in the current window.

Example · 42

X-Ratelimit-Retry-After

Seconds until the limit resets — clients use this to schedule the next attempt.

Example · 13

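Standard Retry-After works too. The X-Ratelimit-* family follows a convention used by Twitter, GitHub, and many other public APIs, though exact header names vary (GitHub, for example, sends X-RateLimit-Reset rather than a retry-after value).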

Sample response

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-Ratelimit-Limit: 100
X-Ratelimit-Remaining: 0
X-Ratelimit-Retry-After: 13

{
  "error": "rate_limited",
  "message": "Quota exceeded for this user. Retry after 13 seconds.",
  "retry_after_seconds": 13
}
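
A server-side helper that assembles this response could look like the sketch below. The function name and the tuple return shape are assumptions for illustration; the header names and body fields are taken from the sample above.

```python
import json


def throttled_response(limit, retry_after_seconds):
    """Build (status, headers, body) for a request that exceeded its quota."""
    headers = {
        "Content-Type": "application/json",
        "X-Ratelimit-Limit": str(limit),
        "X-Ratelimit-Remaining": "0",   # a throttled request has no budget left
        "X-Ratelimit-Retry-After": str(retry_after_seconds),
    }
    body = json.dumps({
        "error": "rate_limited",
        "message": f"Quota exceeded for this user. Retry after {retry_after_seconds} seconds.",
        "retry_after_seconds": retry_after_seconds,
    })
    return 429, headers, body
```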

Where rate limiting can live in the stack

The chapter wraps up by noting rate limiting can sit at multiple OSI layers — the Live Lab in this module focuses on layer 7, but the same logic appears at lower layers in production networks.

L3 · Network

iptables-style packet rate limits, BGP flow caps. Coarse, fast, no application context.

L4 · Transport

TCP / UDP connection rate limits at load balancers. Per-IP, per-port — typical front-line DDoS shield.

L7 · Application

Per-user, per-API-key, per-endpoint quotas. Where business logic lives — and where this chapter focuses.

Client-side best practices

▸ Read X-Ratelimit-Remaining on every response and slow down before you hit zero.
▸ On a 429, honour Retry-After exactly. Never retry sooner than the server told you to.
▸ Layer exponential backoff with jitter on top: multiple clients getting 429 at the same instant should not all retry at the same instant.
▸ Catch 429s explicitly in error handling so the user sees a useful message, not a generic “something went wrong.”
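
These practices can be combined in a small retry helper. This is a sketch under assumptions: `send` is a hypothetical callable returning `(status, headers, body)`, the server's wait is read from a `Retry-After` header expressed in seconds, and the default delays are arbitrary. Jitter is added on top of the server's value so the client never retries sooner than it was told to.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call send() until it stops returning 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt == max_attempts - 1:
            break  # out of attempts: surface the 429 to the caller
        server_wait = float(headers.get("Retry-After", 0))
        backoff = min(max_delay, base_delay * (2 ** attempt))
        # Wait at least as long as the server asked, plus random jitter
        # so that many throttled clients do not retry in lockstep.
        sleep(server_wait + rng() * backoff)
    return status, headers, body
```

The caller still receives the final 429 if every attempt is throttled, so it can show the user a specific message instead of a generic failure.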
