Rate Limiting Algorithms
January 30, 2026
Overview
Rate limiting is a mechanism for controlling how many requests (or operations) a client can perform within a period of time. It's used to:
- Protect systems from overload and cascading failures
- Prevent abuse (scraping, brute-force login attempts)
- Enforce fairness across tenants/users
- Smooth traffic spikes and reduce tail latency
In practice, rate limiting is as much about product policy (who gets limited, how, and why) as it is about algorithms.
Table of Contents
- What Rate Limiting Actually Solves
- Core Concepts & Nuances
- Fixed Window Counter
- Sliding Window Log
- Sliding Window Counter
- Leaky Bucket
- Token Bucket
- Concurrency & Race Conditions
- Distributed Rate Limiting (Multi-Node)
- Choosing the Right Algorithm
- Common Pitfalls (and Fixes)
- Putting It Together: Recommended Defaults
- Conclusion
- Key Takeaways
What Rate Limiting Actually Solves
Rate limiting helps with:
- Overload protection: reject early instead of letting the system collapse
- Fairness: one noisy client shouldn’t starve others
- Cost control: keep expensive workloads within budget
- Security: slow down brute force and endpoint probing
- Quality of service: enforce tiers (free vs paid)
But rate limiting does not replace:
- WAF / bot detection
- Backpressure inside internal services
- Capacity planning and autoscaling
Core Concepts & Nuances
What’s the “key” you limit on?
Common keys:
- per IP
- per user
- per API key
- per tenant/org
- per route + user (e.g., `/search` stricter than `/profile`)
- per cost unit (e.g., GraphQL query cost)
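For illustration, several of these dimensions are often combined into one composite key. The helper below is a hypothetical sketch (the function name and key layout are not a standard API):

```python
# Hypothetical sketch: compose a rate-limit key from route, identity, and plan tier.
def build_limit_key(route: str, user_id: str, tier: str) -> str:
    # e.g. build_limit_key("POST /search", "42", "free") -> "rl:POST /search:user:42:tier:free"
    return f"rl:{route}:user:{user_id}:tier:{tier}"
```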
Nuances that matter
- Bursts: is it okay to allow short spikes?
- Fairness: do you need per-tenant fairness, or global?
- Clock skew: distributed counters can drift
- Idempotency: should retries count?
- Tiering: limits differ for paid plans
- Latency impact: enforcement must be O(1) and fast
Fixed Window Counter
Idea: count requests in discrete windows (e.g., per minute). Allow if count ≤ limit.
Pros
- Very simple
- Cheap in memory and compute
Cons (big gotcha)
- Boundary burst: a client can send `limit` requests at `12:00:59` and another `limit` at `12:01:00`, effectively 2× the limit in about one second.
```python
# pseudo-code
# window_start_ts must be floored to the window boundary, e.g. ts - (ts % 60)
def allow_fixed_window(key, window_start_ts, limit, store):
    counter_key = f"{key}:{window_start_ts}"
    n = store.incr(counter_key)
    # Important: only set TTL when the key is first created.
    # Otherwise, frequent traffic can keep extending the window.
    if n == 1:
        store.expire(counter_key, 60)
    return n <= limit
```

Sliding Window Log
Idea: store timestamps of recent requests; allow if #timestamps in the last T seconds ≤ limit.
Pros
- Accurate (no boundary burst)
Cons
- Can be memory heavy (stores per-request events)
- Concurrency gotcha: the check + insert must be atomic (e.g., Redis Lua), otherwise concurrent requests can overshoot the limit.
```python
# pseudo-code for Redis sorted set (non-atomic sketch)
# In production: implement as a single atomic operation (Lua script / transaction).
def allow_sliding_log(key, now_ms, window_ms, limit, redis):
    # One sorted set per rate-limit key (user / IP / API key / route, etc.)
    zkey = f"rl:{key}"

    # 1) Remove requests that are outside the sliding window
    #    Keeps only events in (now - window_ms, now]
    redis.zremrangebyscore(zkey, 0, now_ms - window_ms)

    # 2) Count how many requests remain in the window
    count = redis.zcard(zkey)

    # 3) Enforce the limit
    #    If we've already seen 'limit' events in the window, reject
    if count >= limit:
        return False

    # 4) Generate a unique member ID for this request
    #    (ZSET members must be unique even if timestamps collide)
    seq_key = f"{zkey}:seq"
    seq = redis.incr(seq_key)
    # Keep the sequence counter bounded to the window lifetime
    redis.pexpire(seq_key, window_ms)

    # 5) Add this request to the window
    #    Score = timestamp (used for range queries)
    #    Member = unique identifier
    member = f"{now_ms}:{seq}"
    redis.zadd(zkey, {member: now_ms})

    # 6) Expire the window key when traffic stops
    #    (prevents stale keys from living forever)
    redis.pexpire(zkey, window_ms)

    return True
```

Sliding Window Counter
Idea: approximate sliding window using two adjacent fixed windows and weighted interpolation.
Pros
- Much cheaper than log
- Smooths boundary bursts
Cons
- Approximation error (usually acceptable)
Conceptually:
estimated_count = current_window_count + previous_window_count * overlap_ratio
where `overlap_ratio` is the fraction of the previous window that still falls inside the sliding window.
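A minimal sketch of this approach, assuming a Redis-like store with the same incr/expire/get interface as the fixed-window example; key names and TTLs are illustrative, and the check-then-increment is non-atomic (same caveat as above):

```python
import time

def allow_sliding_counter(key, limit, window_s, store, now=None):
    """Approximate sliding window using two adjacent fixed-window counters."""
    now = time.time() if now is None else now
    curr_window = int(now // window_s) * window_s
    prev_window = curr_window - window_s

    curr_key = f"rl:{key}:{curr_window}"
    prev_key = f"rl:{key}:{prev_window}"

    # Fraction of the previous window that still overlaps the sliding window
    elapsed_in_curr = now - curr_window
    overlap_ratio = (window_s - elapsed_in_curr) / window_s

    prev_count = int(store.get(prev_key) or 0)
    curr_count = int(store.get(curr_key) or 0)

    estimated = curr_count + prev_count * overlap_ratio
    if estimated >= limit:
        return False

    # Count this request in the current window (non-atomic; use a Lua script for strictness)
    n = store.incr(curr_key)
    if n == 1:
        # Keep the counter long enough to serve as "previous" for the next window
        store.expire(curr_key, 2 * window_s)
    return True
```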
Leaky Bucket
Idea: requests enter a bucket (queue). The bucket leaks at a constant rate.
Pros
- Produces stable output rate (smooths traffic)
- Great for downstream systems that hate bursts
Cons
- Adds queueing latency
- If bucket is full, requests are dropped
Use cases:
- Email sending
- Background job dispatch
- Payment provider calls
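A minimal sketch of the queue-based variant, assuming a single process and a caller-supplied `process` callable (class and parameter names are illustrative):

```python
import time
from collections import deque
from threading import Lock, Thread

class LeakyBucketQueue:
    """Bounded queue drained at a constant rate by a background worker."""

    def __init__(self, capacity, leak_rate_per_sec, process):
        self.capacity = capacity
        self.interval = 1.0 / leak_rate_per_sec
        self.process = process
        self.queue = deque()
        self.lock = Lock()
        Thread(target=self._drain, daemon=True).start()

    def submit(self, item):
        with self.lock:
            if len(self.queue) >= self.capacity:
                return False          # bucket full: drop the request
            self.queue.append(item)
            return True

    def _drain(self):
        while True:
            with self.lock:
                item = self.queue.popleft() if self.queue else None
            if item is not None:
                self.process(item)    # downstream sees at most leak_rate_per_sec calls/sec
            time.sleep(self.interval)
```

The drain loop handles at most one item per interval, so the downstream callable never sees more than `leak_rate_per_sec` calls per second, at the cost of queueing delay for bursts.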
Token Bucket
Idea: tokens accumulate at a steady rate up to a max capacity. Each request consumes tokens.
This is the most common choice for APIs because it allows controlled bursts.
```python
import time

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()  # monotonic avoids issues from wall-clock jumps

    def allow(self, cost=1):
        now = time.monotonic()
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Concurrency & Race Conditions
In real systems, multiple servers may check the same key at once.
You need one of:
- atomic operations (Redis Lua script)
- strongly consistent store
- per-node local buckets + global cap (hybrid)
Big gotcha:
If your algorithm does read → compute → write, it’s likely race-prone unless atomic.
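One way to get atomicity with Redis is to run the whole check server-side in a single Lua script. The sketch below wraps the fixed-window logic from earlier; the script text and function name are illustrative, but `EVAL` and redis-py's `eval(script, numkeys, *keys_and_args)` are real APIs:

```python
# The INCR, conditional EXPIRE, and limit check run as one atomic unit on the server,
# so no other client can interleave between the steps.
FIXED_WINDOW_LUA = """
local n = redis.call('INCR', KEYS[1])
if n == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
if n > tonumber(ARGV[2]) then
  return 0
end
return 1
"""

def allow_atomic_fixed_window(redis, key, window_start_ts, window_s, limit):
    counter_key = f"{key}:{window_start_ts}"
    return redis.eval(FIXED_WINDOW_LUA, 1, counter_key, window_s, limit) == 1
```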
Distributed Rate Limiting (Multi-Node)
Approach 1: Central store (Redis)
- Pros: strong global enforcement
- Cons: adds latency; central dependency; can become a bottleneck
Approach 2: Per-node local + periodic sync
- Pros: very low latency
- Cons: approximate; can overshoot global limit during spikes
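A crude starting point for this approach is to give each node an even share of the global limit, reusing the `TokenBucket` class above. This is a sketch under the assumption of roughly balanced load; skewed traffic makes enforcement approximate, which is exactly the trade-off listed above.

```python
# Each node enforces global_rate / node_count locally (one bucket per rate-limit key).
def local_bucket_for_node(global_rate_per_sec, global_burst, node_count):
    share = 1.0 / node_count
    return TokenBucket(rate_per_sec=global_rate_per_sec * share,
                       burst=max(1, int(global_burst * share)))
```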
Approach 3: Edge-based limiting (CDN/WAF)
- Pros: blocks abuse early; reduces origin load
- Cons: limited flexibility; might not see user identity
Choosing the Right Algorithm
| Algorithm | Best For | Pros | Cons |
|---|---|---|---|
| Fixed Window | Quick “good enough” controls | Simple, cheap | Boundary burst |
| Sliding Log | Strict correctness | Most accurate | Memory heavy |
| Sliding Counter | Practical correctness at scale | Low memory, smooth | Approximate |
| Leaky Bucket | Smoothing bursts to protect downstream | Stable rate output | Queue latency |
| Token Bucket | APIs that allow bursts | Most flexible | More complex state |
Common Pitfalls (and Fixes)
| Pitfall | Cause | Fix |
|---|---|---|
| Limits too global | One client starves others | Limit per tenant/user, not just per IP |
| Thundering herd after 429 | Clients retry immediately | Return Retry-After, use exponential backoff + jitter |
| Race conditions | Non-atomic updates | Use atomic ops (Lua/transactions) |
| Clock skew | Distributed time disagreement | Use server time; prefer monotonic time; avoid client timestamps |
| Rate limiting only at origin | Abuse reaches your servers first | Limit at edge (CDN/WAF) + origin as backup |
Putting It Together: Recommended Defaults
A good baseline for many APIs:
- Edge: coarse per-IP limits (stop obvious abuse)
- Origin: token bucket per API key / user
- Critical endpoints: stricter limits on login, password reset, search
- Retries: return `429` + `Retry-After` and enforce backoff (see the client sketch below)
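A client-side sketch of that retry discipline: honor `Retry-After` when present (assuming the seconds form), otherwise use exponential backoff with full jitter. The `send_request` callable and the response shape (`status_code`, `headers`) are assumptions for illustration:

```python
import random
import time

def request_with_backoff(send_request, max_attempts=5, base_delay_s=0.5, cap_s=30):
    for attempt in range(max_attempts):
        resp = send_request()
        if resp.status_code != 429:
            return resp
        retry_after = resp.headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)            # server-provided wait, in seconds
        else:
            # Full jitter: random delay in [0, min(cap, base * 2^attempt)]
            delay = random.uniform(0, min(cap_s, base_delay_s * (2 ** attempt)))
        time.sleep(delay)
    return resp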
If you need fairness across tenants:
- enforce per-tenant limits + global cap
- optionally weighted by plan tier
Conclusion
Rate limiting looks simple, but real systems require careful thought about bursts, fairness, distributed correctness, and client retry behavior. Token bucket is often the default winner for APIs, but fixed and sliding windows can be great depending on your requirements. The best implementations are layered: edge + origin, cheap controls + stricter per-identity enforcement.
Key Takeaways
- Rate limiting is both an algorithm choice and a policy choice
- Token bucket is usually the most practical API default
- Sliding window log is strict but expensive; sliding counter is a great compromise
- Distributed implementations need atomicity, skew awareness, and retry discipline

