Reliability · high

Rate Limiting

Rate limiting restricts the number of requests a client can make in a time window to prevent abuse, protect backend resources, and ensure fair usage across clients.

Memory anchor

Rate limiting = a bouncer with a clicker counter at the club door. Token bucket = you get 10 drink tickets per hour; spend them whenever, but once they're gone, wait for the next refill. Leaky bucket = drinks pour out at a steady drip no matter how fast you order.

Expected depth

Token bucket: each client has a bucket of N tokens. Each request consumes one token, and tokens refill at a fixed rate. Allows bursting up to bucket capacity, then limits to the refill rate.

Leaky bucket: requests enter a queue (the bucket) and are processed at a constant rate. Smooths traffic but adds latency.

Fixed window counter: count requests in each fixed window (e.g. per minute). Simple, but vulnerable to boundary spikes: a client can send up to 2x the allowed rate by straddling two adjacent windows.

Sliding window log: track the timestamp of every request and count those in the last N seconds. Accurate but memory-intensive.

Sliding window counter: approximate a sliding window using a weighted average of the current and previous window counts. A good balance of accuracy and efficiency.
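The token bucket above can be sketched in a few lines. This is a minimal single-process illustration (class name and parameters are my own), not a production limiter:

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, then limits to `rate` tokens/sec."""

    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens accrued since the last check, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=5, rate=1.0)
results = [bucket.allow() for _ in range(7)]  # burst of 7 immediate requests
# the first 5 pass (burst capacity), the next 2 are rejected until refill
```

Note that bursting is the defining property: a fresh client can spend all 5 tokens instantly, then is held to the 1 req/s refill rate.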

Deep — senior internals

Distributed rate limiting: per-instance counters are inaccurate (a user can bypass a 100 req/s limit by hitting 100 instances at 1 req/s each). Centralized rate limiting requires a distributed counter — Redis INCR with EXPIRE is the standard approach. For high-throughput rate limiting, use Redis with Lua scripts (atomic read-increment-compare) or Redis 7's built-in function feature. Token bucket in Redis: store (last_refill_time, token_count) per key. On each request, compute tokens to add since last refill, check if sufficient, decrement atomically. Cloudflare and Stripe use sliding window counters at the edge; they accept 1-2% over-counting at boundaries as an acceptable trade-off for efficiency. For API monetization, rate limits are per-plan: free tier at 10 req/min, paid at 1000 req/min — stored in a configuration service and looked up at request time.
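The (last_refill_time, token_count) scheme described above can be sketched as follows. Here a plain dict stands in for Redis; in production this state lives in Redis and the whole read-refill-decrement sequence runs inside a Lua script so it executes atomically. Key names and parameters are illustrative:

```python
# Stand-in for Redis: key -> (last_refill_time, token_count).
# In Redis, the body of allow() would be a Lua script invoked via EVAL.
store: dict = {}

def allow(key: str, capacity: int, refill_rate: float, now: float) -> bool:
    last, tokens = store.get(key, (now, float(capacity)))
    # Compute tokens to add since the last refill, capped at capacity.
    tokens = min(capacity, tokens + (now - last) * refill_rate)
    if tokens >= 1:
        store[key] = (now, tokens - 1)   # sufficient: decrement and allow
        return True
    store[key] = (now, tokens)           # insufficient: persist refill, reject
    return False

# 3-token bucket refilling at 1 token/sec, driven with explicit timestamps
hits = [allow("user:42", 3, 1.0, now=t) for t in (0.0, 0.1, 0.2, 0.3, 2.3)]
# → [True, True, True, False, True]: burst of 3, one rejection, then refill
```

The atomicity matters: with plain GET followed by SET, two concurrent requests could both read the same token count and both pass, which is exactly what the Lua-script (or Redis Functions) approach prevents.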

🎤 Interview-ready answer

I'd implement rate limiting in the API gateway using a sliding window counter in Redis. The key is (client_id, time_window_bucket) with atomic INCR. For each request, I'd compute the weighted average of the previous window and current window to get a smooth rate estimate. I'd expose rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, Retry-After) to clients so they can self-throttle. For different tiers, I'd store the limit per API key in a configuration store and cache it locally with a short TTL.
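The weighted-average estimate mentioned above is a one-liner. A sketch of the decision function (names and the example numbers are mine):

```python
def sliding_window_allow(prev_count: int, curr_count: int,
                         elapsed_fraction: float, limit: int) -> bool:
    """Estimate requests in the trailing window as a weighted blend of the
    previous window's count and the current window's count so far."""
    estimated = prev_count * (1 - elapsed_fraction) + curr_count
    return estimated < limit

# Limit 100/min. Previous minute saw 80 requests; we are 30s (fraction 0.5)
# into the current minute with 50 counted so far: 80*0.5 + 50 = 90 < 100.
assert sliding_window_allow(80, 50, 0.5, 100) is True
assert sliding_window_allow(80, 70, 0.5, 100) is False  # 80*0.5 + 70 = 110
```

This is the approximation that tolerates the 1-2% boundary error mentioned earlier: it assumes requests in the previous window were evenly distributed, in exchange for storing only two counters per client instead of a full timestamp log.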

Common trap

Implementing rate limiting purely per-instance without a distributed counter. A user hitting 10 app servers can exceed the limit by 10x. Always use a shared counter for rate limiting in horizontally scaled services.

Related concepts