Rate Limiting Algorithms

January 30, 2026

Overview

Rate limiting is a mechanism for controlling how many requests (or operations) a client can perform within a period of time. It's used to:

  • Protect systems from overload and cascading failures
  • Prevent abuse (scraping, brute-force login attempts)
  • Enforce fairness across tenants/users
  • Smooth traffic spikes and reduce tail latency

In practice, rate limiting is as much about product policy (who gets limited, how, and why) as it is about algorithms.

Table of Contents

What Rate Limiting Actually Solves
Core Concepts & Nuances
Fixed Window Counter
Sliding Window Log
Sliding Window Counter
Leaky Bucket
Token Bucket
Concurrency & Race Conditions
Distributed Rate Limiting (Multi-Node)
Choosing the Right Algorithm
Common Pitfalls (and Fixes)
Putting It Together: Recommended Defaults
Conclusion
Key Takeaways

What Rate Limiting Actually Solves

Rate limiting helps with:

  • Overload protection: reject early instead of letting the system collapse
  • Fairness: one noisy client shouldn’t starve others
  • Cost control: keep expensive workloads within budget
  • Security: slow down brute force and endpoint probing
  • Quality of service: enforce tiers (free vs paid)

But rate limiting does not replace:

  • WAF / bot detection
  • Backpressure inside internal services
  • Capacity planning and autoscaling

Core Concepts & Nuances

What’s the “key” you limit on?

Common keys:

  • per IP
  • per user
  • per API key
  • per tenant/org
  • per route + user (e.g., /search stricter than /profile)
  • per cost unit (e.g., GraphQL query cost)
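
For example, a composite key for the route + user case above might be built with a tiny helper like this (the rl: prefix and helper name are illustrative, not a standard):

def rate_limit_key(*parts):
    # e.g. rate_limit_key("route", "/search", "user", 42) -> "rl:route:/search:user:42"
    return "rl:" + ":".join(str(p) for p in parts)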

Nuances that matter

  • Bursts: is it okay to allow short spikes?
  • Fairness: do you need per-tenant fairness, or global?
  • Clock skew: distributed counters can drift
  • Idempotency: should retries count?
  • Tiering: limits differ for paid plans
  • Latency impact: enforcement must be O(1) and fast

Fixed Window Counter

Idea: count requests in discrete windows (e.g., per minute). Allow if count ≤ limit.

Pros

  • Very simple
  • Cheap in memory and compute

Cons (big gotcha)

  • Boundary burst: client can send limit requests at 12:00:59 and another limit at 12:01:00 → effectively 2× in 1 second.

# pseudo-code
# window_start_ts must be floored to the window boundary, e.g. ts - (ts % 60)
def allow_fixed_window(key, window_start_ts, limit, store):
    counter_key = f"{key}:{window_start_ts}"
    n = store.incr(counter_key)

    # Important: only set TTL when the key is first created.
    # Otherwise, frequent traffic can keep extending the window.
    if n == 1:
        store.expire(counter_key, 60)

    return n <= limit
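
An illustrative call, flooring the timestamp to the start of the current minute (store is assumed to be a Redis-like client providing incr/expire, as above):

import time

now = int(time.time())
window_start = now - (now % 60)
allowed = allow_fixed_window("user:42", window_start, limit=100, store=store)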

Sliding Window Log

Idea: store timestamps of recent requests; allow if #timestamps in the last T seconds ≤ limit.

Pros

  • Accurate (no boundary burst)

Cons

  • Can be memory heavy (stores per-request events)
  • Concurrency gotcha: the check + insert must be atomic (e.g., Redis Lua), otherwise concurrent requests can overshoot the limit.

# pseudo-code for Redis sorted set (non-atomic sketch)
# In production: implement as a single atomic operation (Lua script / transaction).
def allow_sliding_log(key, now_ms, window_ms, limit, redis):
    # One sorted set per rate-limit key (user / IP / API key / route, etc.)
    zkey = f"rl:{key}"

    # 1) Remove requests that are outside the sliding window
    #    Keeps only events in (now - window_ms, now]
    redis.zremrangebyscore(zkey, 0, now_ms - window_ms)

    # 2) Count how many requests remain in the window
    count = redis.zcard(zkey)

    # 3) Enforce the limit
    #    If we've already seen 'limit' events in the window, reject
    if count >= limit:
        return False

    # 4) Generate a unique member ID for this request
    #    (ZSET members must be unique even if timestamps collide)
    seq_key = f"{zkey}:seq"
    seq = redis.incr(seq_key)

    # Keep the sequence counter bounded to the window lifetime
    redis.pexpire(seq_key, window_ms)

    # 5) Add this request to the window
    #    Score = timestamp (used for range queries)
    #    Member = unique identifier
    member = f"{now_ms}:{seq}"
    redis.zadd(zkey, {member: now_ms})

    # 6) Expire the window key when traffic stops
    #    (prevents stale keys from living forever)
    redis.pexpire(zkey, window_ms)

    return True

Sliding Window Counter

Idea: approximate sliding window using two adjacent fixed windows and weighted interpolation.

Pros

  • Much cheaper than log
  • Smooths boundary bursts

Cons

  • Approximation error (usually acceptable)

Conceptually:

  • estimated_count = current_window_count + prev_window_count * overlap_ratio
  • overlap_ratio is the fraction of the sliding window that still overlaps the previous fixed window (e.g., 15 s into a 60 s window → 0.75)
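
A minimal non-atomic sketch, reusing the Redis-like store interface from the fixed-window example (plus an assumed get); as with the log variant, production code should make the read-and-increment atomic:

def allow_sliding_counter(key, now, window_s, limit, store):
    # Identify the current and previous fixed windows
    curr_window = int(now // window_s)
    curr_key = f"{key}:{curr_window}"
    prev_key = f"{key}:{curr_window - 1}"

    # Fraction of the sliding window that still overlaps the previous fixed window
    elapsed_ratio = (now % window_s) / window_s
    overlap_ratio = 1.0 - elapsed_ratio

    prev_count = int(store.get(prev_key) or 0)
    curr_count = int(store.get(curr_key) or 0)

    estimated = curr_count + prev_count * overlap_ratio
    if estimated >= limit:
        return False

    # Record this request in the current window
    n = store.incr(curr_key)
    if n == 1:
        # Keep the counter long enough to serve as "previous" in the next window
        store.expire(curr_key, window_s * 2)
    return True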

Leaky Bucket

Idea: requests enter a bucket (queue). The bucket leaks at a constant rate.

Pros

  • Produces stable output rate (smooths traffic)
  • Great for downstream systems that hate bursts

Cons

  • Adds queueing latency
  • If bucket is full, requests are dropped

Use cases:

  • Email sending
  • Background job dispatch
  • Payment provider calls
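
A minimal in-process sketch of the queue-based variant (class and parameter names are illustrative): a bounded queue is drained by a worker at a fixed rate, and requests arriving to a full bucket are dropped.

import queue
import threading
import time

class LeakyBucket:
    def __init__(self, rate_per_sec, capacity):
        self.interval = 1.0 / rate_per_sec       # time between leaks
        self.q = queue.Queue(maxsize=capacity)   # the "bucket"
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, job):
        try:
            self.q.put_nowait(job)               # enqueue if there is room
            return True
        except queue.Full:
            return False                         # bucket full -> request dropped

    def _drain(self):
        while True:
            job = self.q.get()                   # wait for the next job
            job()                                # process it...
            time.sleep(self.interval)            # ...at a constant rate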

Token Bucket

Idea: tokens accumulate at a steady rate up to a max capacity. Each request consumes tokens.

This is the most common choice for APIs because it allows controlled bursts.

import time

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()  # monotonic avoids issues from wall-clock jumps

    def allow(self, cost=1):
        now = time.monotonic()
        elapsed = now - self.last
        self.last = now

        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
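
Illustrative usage (note this in-process bucket is not thread-safe as written; guard allow() with a lock, or use an atomic store, if multiple workers share one bucket):

bucket = TokenBucket(rate_per_sec=5, burst=10)

for i in range(12):
    print(i, bucket.allow())   # roughly the first 10 pass; the rest are rejected until tokens refill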

Concurrency & Race Conditions

In real systems, multiple servers may check the same key at once.

You need one of:

  • atomic operations (Redis Lua script)
  • strongly consistent store
  • per-node local buckets + global cap (hybrid)

Big gotcha: If your algorithm does read → compute → write, it’s likely race-prone unless atomic.
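
As a sketch, the fixed-window check from earlier can be made atomic by pushing the read-compute-write into a Redis Lua script (assumes redis-py; key and variable names are illustrative):

import redis

r = redis.Redis()

FIXED_WINDOW_LUA = """
local n = redis.call('INCR', KEYS[1])
if n == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
if n <= tonumber(ARGV[2]) then
  return 1
else
  return 0
end
"""

allow_script = r.register_script(FIXED_WINDOW_LUA)

def allow_fixed_window_atomic(key, window_start_ts, window_s, limit):
    counter_key = f"{key}:{window_start_ts}"
    # INCR + EXPIRE + compare run as a single atomic unit on the Redis server
    return allow_script(keys=[counter_key], args=[window_s, limit]) == 1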

Distributed Rate Limiting (Multi-Node)

Approach 1: Central store (Redis)

  • Pros: strong global enforcement
  • Cons: adds latency; central dependency; can become a bottleneck

Approach 2: Per-node local + periodic sync

  • Pros: very low latency
  • Cons: approximate; can overshoot global limit during spikes

Approach 3: Edge-based limiting (CDN/WAF)

  • Pros: blocks abuse early; reduces origin load
  • Cons: limited flexibility; might not see user identity
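
The crudest form of Approach 2 is a static split with no sync at all: each node locally enforces global_rate / node_count (reusing the TokenBucket class above; the names and the static split are assumptions, and a real implementation would periodically reconcile each node's share against a shared counter):

class LocalShareLimiter:
    def __init__(self, global_rate_per_sec, global_burst, node_count):
        # Each node gets an equal share of the global budget; this undershoots
        # when traffic is uneven across nodes and drifts as nodes join/leave.
        self.bucket = TokenBucket(
            rate_per_sec=global_rate_per_sec / node_count,
            burst=max(1, global_burst // node_count),
        )

    def allow(self, cost=1):
        return self.bucket.allow(cost)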

Choosing the Right Algorithm

Algorithm       | Best For                               | Pros               | Cons
Fixed Window    | Quick “good enough” controls           | Simple, cheap      | Boundary burst
Sliding Log     | Strict correctness                     | Most accurate      | Memory heavy
Sliding Counter | Practical correctness at scale         | Low memory, smooth | Approximate
Leaky Bucket    | Smoothing bursts to protect downstream | Stable rate output | Queue latency
Token Bucket    | APIs that allow bursts                 | Most flexible      | More complex state

Common Pitfalls (and Fixes)

Pitfall                      | Cause                            | Fix
Limits too global            | One client starves others        | Limit per tenant/user, not just per IP
Thundering herd after 429    | Clients retry immediately        | Return Retry-After, use exponential backoff + jitter
Race conditions              | Non-atomic updates               | Use atomic ops (Lua/transactions)
Clock skew                   | Distributed time disagreement    | Use server time; prefer monotonic time; avoid client timestamps
Rate limiting only at origin | Abuse reaches your servers first | Limit at edge (CDN/WAF) + origin as backup
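
For the “Thundering herd after 429” row, a client-side sketch of exponential backoff with jitter that honors Retry-After when present (assumes a requests-style response object; function and parameter names are illustrative):

import random
import time

def call_with_backoff(do_request, max_attempts=5, base=0.5, cap=30.0):
    for attempt in range(max_attempts):
        resp = do_request()
        if resp.status_code != 429:
            return resp

        retry_after = resp.headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)
        else:
            # Full jitter: sleep a random amount in [0, min(cap, base * 2^attempt)]
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))
        time.sleep(delay)
    return resp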

Putting It Together: Recommended Defaults

A good baseline for many APIs:

  • Edge: coarse per-IP limits (stop obvious abuse)
  • Origin: token bucket per API key / user
  • Critical endpoints: stricter limits on login, password reset, search
  • Retries: return 429 + Retry-After and enforce backoff
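
A framework-agnostic sketch of the retry piece: derive Retry-After from token-bucket state and return it with the 429 (the response shape and helper name are assumptions, not a specific framework's API):

import math

def reject_with_retry_after(bucket, cost=1):
    # Seconds until enough tokens have accumulated for this request
    missing = max(0.0, cost - bucket.tokens)
    retry_after = max(1, math.ceil(missing / bucket.rate))
    return {
        "status": 429,
        "headers": {"Retry-After": str(retry_after)},
        "body": {"error": "rate_limited", "retry_after_seconds": retry_after},
    }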

If you need fairness across tenants:

  • enforce per-tenant limits + global cap
  • optionally weighted by plan tier

Conclusion

Rate limiting looks simple, but real systems require careful thought about bursts, fairness, distributed correctness, and client retry behavior. Token bucket is often the default winner for APIs, but fixed and sliding windows can be great depending on your requirements. The best implementations are layered: edge + origin, cheap controls + stricter per-identity enforcement.

Key Takeaways

  • Rate limiting is both an algorithm choice and a policy choice
  • Token bucket is usually the most practical API default
  • Sliding window log is strict but expensive; sliding counter is a great compromise
  • Distributed implementations need atomicity, skew awareness, and retry discipline