Hot Shards

    November 30, 2025

    Overview

    Hot shards are those unlucky partitions that see a disproportionate amount of traffic or work. Even if you use a “good” random hash function, hot shards still appear in production systems for reasons that have nothing to do with the hash function itself.

    This post focuses on solutions: how real-world systems (DynamoDB, Spanner, Cassandra, ScyllaDB, Redis Cluster, Elasticsearch, etc.) actually mitigate hot shards.

    TL;DR

    • Random / uniform hashing only fixes key distribution problems.
    • Hot shards still appear due to workload skew, time-series traffic, global keys, query patterns, and physical node imbalance.
    • Production systems lean on microshards, adaptive capacity, key-splitting, write-sharding, replica-aware routing, and CRDT-style counters to keep things stable.

    Table of Contents

    Recap: Why Hot Shards Exist Even With Random Hashing
    Design Principles: The “Anti–Hot-Shard” Mindset
    Hot Keys → Key-Level Load Splitting
    Temporal / Time-Series Hotspots → Write Sharding
    Query Skew & Range Scans → Smarter Query Routing and Index Sharding
    Physical Node Imbalance → Microshards & Automatic Rebalancing
    Global Logical Keys → Distributed State, Not Centralized State
    Storage-Engine Hotspots → Engine-Aware Design
    Unified View: Root Causes vs. Solution Patterns
    Putting It All Together

    1. Recap: Why Hot Shards Exist Even With Random Hashing

    Before talking about solutions, we need to be precise about what we’re solving.

    A shard can be “hot” for several root causes:

    1. Hot keys

      • A few keys receive 1000× more traffic than others (celebrity users, popular config keys, global counters).
    2. Temporal / time-series hotspots

      • Recent data or monotonic IDs funnel all new writes into a small region.
    3. Query skew & range scans

      • Queries that repeatedly target the same narrow slice of the key space, or scan wide ranges on a subset of shards.
    4. Physical resource imbalance

      • One node has worse hardware, heavier compaction, worse cache locality, or is just “unlucky” with shard placement.
    5. Global logical keys / coordination hotspots

      • Global locks, global version numbers, global rate-limiter buckets, etc., centralize traffic on a single shard.
    6. Storage engine hotspots

      • LSM compaction, GC, or disk-level behavior causes a shard to be much more expensive to maintain than others.

    Random hashing only addresses:

    “Are keys evenly distributed across shards?”

    It does not guarantee:

    “Is work evenly distributed across shards?”

    The rest of this post is about how to fix each root cause.

    2. Design Principles: The “Anti–Hot-Shard” Mindset

    The techniques differ, but the guiding principles are fairly consistent:

    • Over-partition (many small shards, not a few large ones).
    • Monitor traffic + CPU + storage per shard.
    • Move or split work when it becomes skewed.
    • Avoid single global choke points (keys, locks, counters).
    • Push complexity to the edges (routers, read aggregators, materialized views) rather than the hottest shard.

    You can think of the overall architecture as:

    “Lots of tiny pieces that can be moved around, and enough observability to know what to move.”

    We’ll now go cause-by-cause.

    3. Hot Keys → Key-Level Load Splitting

    3.1. Problem

    A single key (or small set of keys) is much hotter than the rest:

    • A celebrity’s user profile
    • A single “global-progress” row
    • A config document everyone reads
    • A global counter or rate limiter bucket

    Hashing doesn’t help: that one key maps to one shard.

    [Figure: Hot Key Problem]

    3.2. Solution Pattern: “Key Salting” / “Key Splitting”

    Instead of:

    Key: user:12345
    

    You create logical shards of the same key:

    user:12345#0
    user:12345#1
    user:12345#2
    ...
    

    Writes are spread across these “sub-keys”, and reads aggregate them.

    [Figure: Hot Key Solution]
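
    Here is a minimal sketch of the pattern in Python, with a plain dict standing in for the K/V store; NUM_SPLITS, incr, and read are illustrative names rather than any particular database’s API:

    import random

    NUM_SPLITS = 8  # how many sub-keys one hot logical key is split into (tunable)
    store: dict[str, int] = {}  # stand-in for the real K/V store

    def salted_key(logical_key: str) -> str:
        """Pick one sub-key at random, so writes spread across NUM_SPLITS shards."""
        return f"{logical_key}#{random.randrange(NUM_SPLITS)}"

    def incr(logical_key: str, delta: int = 1) -> None:
        k = salted_key(logical_key)
        store[k] = store.get(k, 0) + delta

    def read(logical_key: str) -> int:
        """Readers aggregate every sub-key to reconstruct the logical value."""
        return sum(store.get(f"{logical_key}#{i}", 0) for i in range(NUM_SPLITS))

    for _ in range(1000):
        incr("user:12345")
    print(read("user:12345"))  # 1000

    The tradeoff is visible right in the code: one write touches one sub-key, but every read fans out across all of them.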

    3.3. Alternate Solution Pattern: Rebalance Work Between Writes and Reads

    Twitter-style systems often avoid a hot shard for the content poster by not relying on a single “global feed.” Instead, they either:

      • Fan-out-on-write for normal keys: when a user tweets, push the tweet into each follower’s own timeline (sharded by follower ID), or
      • Fan-out-on-read / hybrid for hot keys: for very high-follower accounts, don’t push to all followers at write time; instead, pull those tweets into a user’s home timeline dynamically when they fetch it. The heavy lifting comes from sharding tweet storage by tweet_id (high entropy), chunking follower lists, and caching tweet reads; fan-out-on-read assembles timelines from these sharded components rather than relying on key salting of the tweet itself. A toy sketch of this hybrid follows below.
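
    A toy Python sketch of that hybrid policy, with in-memory dicts standing in for the timeline and tweet stores; FANOUT_THRESHOLD and the function names are invented for illustration:

    from collections import defaultdict

    FANOUT_THRESHOLD = 10_000            # follower count above which we stop pushing at write time

    followers = defaultdict(set)         # author_id   -> follower_ids
    timelines = defaultdict(list)        # follower_id -> precomputed timeline (sharded by follower)
    tweets_by_author = defaultdict(list) # author_id   -> tweet_ids

    def post_tweet(author_id: str, tweet_id: str) -> None:
        tweets_by_author[author_id].append(tweet_id)
        if len(followers[author_id]) <= FANOUT_THRESHOLD:
            # Fan-out-on-write: push into each follower's own timeline shard.
            for f in followers[author_id]:
                timelines[f].append(tweet_id)
        # Hot accounts skip the push entirely; their tweets are merged in at read time.

    def home_timeline(user_id: str, following: set[str]) -> list[str]:
        merged = list(timelines[user_id])
        # Fan-out-on-read: pull recent tweets from followed "celebrity" accounts.
        for author in following:
            if len(followers[author]) > FANOUT_THRESHOLD:
                merged.extend(tweets_by_author[author][-50:])
        return merged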

    3.4. Variants

    Pattern | How it works | Typical use cases | Tradeoffs & gotchas
    Random key suffixes (“salting”) | Append a random or round-robin suffix to a hot key, distribute subkeys across shards. | High-QPS writes to a single logical key (e.g., clickstream events for one page, chat messages for a busy room). | Readers must merge multiple subkeys; adds fan-out and aggregation latency. Harder to enforce strict ordering.
    Sharded counters | Store multiple partial counters for the same logical counter; sum them on read or periodically. | Like counts, view counts, rate limits, metrics, quotas. | Approximate at read time unless you read all shards; more complicated to enforce exact 'less than or equal to N' invariants.
    Adaptive key splitting | System detects a hot key and auto-creates more internal partitions for it. | Cloud databases and managed K/V stores (e.g., “adaptive capacity” features). | Black-box behavior; good for ops, but you lose precise control of layout; needs strong telemetry.

    3.5. When to use

    • You see one customer, one tenant, or one resource dominating load.
    • Per-key QPS is often a more useful trigger than per-shard QPS; a minimal detector sketch follows below.
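
    A minimal per-key hot-spot detector might look like this; the window length and QPS threshold are illustrative knobs, not values from any specific system:

    import time
    from collections import Counter

    class HotKeyDetector:
        """Counts requests per key in a fixed window and flags keys above a QPS threshold."""

        def __init__(self, qps_threshold: float = 500.0, window_seconds: float = 10.0):
            self.qps_threshold = qps_threshold
            self.window_seconds = window_seconds
            self.counts: Counter = Counter()
            self.window_start = time.monotonic()

        def record(self, key: str) -> bool:
            """Record one request; return True once `key` is hot in the current window."""
            now = time.monotonic()
            if now - self.window_start >= self.window_seconds:
                self.counts.clear()          # roll over to a fresh window
                self.window_start = now
            self.counts[key] += 1
            return self.counts[key] >= self.qps_threshold * self.window_seconds

    detector = HotKeyDetector()
    if detector.record("user:12345"):
        pass  # e.g., switch this key over to salted sub-keys (Section 3.2)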

    4. Temporal / Time-Series Hotspots → Write Sharding

    4.1. Problem

    Time-series workloads tend to pile writes onto the “latest” region:

    • Monotonic IDs (Snowflake, UUIDv7)
    • Time-bucketed keys (events:2025-11-29-12:00)
    • Logs / metrics / traces

    Even if the hash function is “good”, the newest buckets or IDs very often land on a small cluster of shards.

    [Figure: Hot Shard Time Series Problem]

    4.2. Solution Pattern: “Write Sharding” for Time Buckets

    You keep the time locality at a coarse level (day/hour) but add randomness inside the bucket:

    events:2025-11-29T12:00#0
    events:2025-11-29T12:00#1
    ...
    events:2025-11-29T12:00#31
    

    New writes for “now” are scattered across multiple sub-buckets. Reads that look at “12:00” will fan out across these 32 sub-buckets.

    [Figure: Hot Shard Time Series Solution]

    At 1:00 PM, the hot set simply rolls over: new writes go to the next hour’s sub-buckets (events:2025-11-29T13:00#0 through #31), which hash to a different set of shards, as in the sketch below.
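
    A small sketch of the key scheme in Python; SUB_BUCKETS and the helper names are illustrative:

    import random
    from datetime import datetime, timezone

    SUB_BUCKETS = 32  # sub-buckets per hourly time bucket

    def write_key(ts: datetime) -> str:
        """New events land in a random sub-bucket of the current hour."""
        bucket = ts.strftime("%Y-%m-%dT%H:00")
        return f"events:{bucket}#{random.randrange(SUB_BUCKETS)}"

    def read_keys(ts: datetime) -> list[str]:
        """Readers fan out across every sub-bucket of the hour they care about."""
        bucket = ts.strftime("%Y-%m-%dT%H:00")
        return [f"events:{bucket}#{i}" for i in range(SUB_BUCKETS)]

    now = datetime(2025, 11, 29, 12, 34, tzinfo=timezone.utc)
    print(write_key(now))       # e.g. events:2025-11-29T12:00#17
    print(len(read_keys(now)))  # 32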

    4.3. Variants

    Strategy | How it helps | Works best when... | Tradeoffs
    Randomized time-bucket keys | Scatter writes for one bucket across multiple partitions. | Readers naturally use fairly wide time ranges and can tolerate fan-out. | Readers must query multiple partitions per time slice; some extra network + CPU overhead.
    Multiple “hot” partitions per topic | Instead of 1 partition per topic/tenant, you maintain N partitions in parallel. | Message queues (Kafka-like), streaming ingestion, log pipelines. | Reordering across partitions; consumer logic becomes more complex if strict order is needed.
    Tiered storage & compaction optimization | Reduce compaction pressure on the newest data and offload older data to cheaper media. | Large LSM-based systems (RocksDB, Cassandra, Scylla, ClickHouse-like engines). | More moving parts; requires careful tuning and metrics to avoid surprises.

    5. Query Skew & Range Scans → Smarter Query Routing and Index Sharding

    5.1. Problem

    Even when keys are evenly distributed, queries might not be:

    • “Give me the last 100 events for user X” → 1 shard, very often
    • “Scan orders between ID A and ID B” → specific small range
    • “Search posts for tag = hot” → only a few shards have relevant data

    Hashing doesn’t help here because the skew comes from the query pattern, not key distribution.

    5.2. Solution Pattern: Distributed Indexes + Smart Routing

    You want the router/query layer to know:

    • Which shards have relevant data
    • How to spread load across replicas
    • When to offload heavy aggregations to background jobs/materialized views

    5.3. Techniques

    Technique | Idea | Helps with | Tradeoffs
    Sharded secondary indexes | Index (e.g. by tag, status, tenant) is itself partitioned across many nodes. | Hot filters like WHERE status='OPEN' or WHERE tag='hot'. | More write amplification and index-maintenance complexity; heavier background work.
    Scatter-gather with pruning | Route queries only to shards that can satisfy the predicate, not to all shards. | Range queries, partial scans, filtered searches. | Router must maintain metadata about shard key ranges; stale metadata can cause inefficiency.
    Replica-aware read routing | Direct reads for hot ranges to less-loaded replicas. | Read-heavy hotspots, hot tail latency. | Increased replica counts; potential for read-after-write inconsistency depending on consistency guarantees.
    Precomputed materialized views | Move expensive aggregation work off the hot path and into background jobs. | Heavy analytics queries over the same dimensions. | Freshness vs cost tradeoff; more streaming/ETL complexity.
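
    To make the “scatter-gather with pruning” row concrete, here is a sketch of a router that keeps per-shard key-range metadata and queries only overlapping shards; the shard names, ranges, and query_shard callback are all hypothetical:

    from dataclasses import dataclass

    @dataclass
    class ShardMeta:
        shard_id: str
        min_key: int  # lowest order ID held by this shard
        max_key: int  # highest order ID held by this shard

    SHARDS = [
        ShardMeta("shard-a", 0, 9_999),
        ShardMeta("shard-b", 10_000, 19_999),
        ShardMeta("shard-c", 20_000, 29_999),
    ]

    def shards_for_range(lo: int, hi: int) -> list[str]:
        """Prune: only shards whose key range overlaps [lo, hi] get the query."""
        return [s.shard_id for s in SHARDS if s.max_key >= lo and s.min_key <= hi]

    def scatter_gather(lo: int, hi: int, query_shard) -> list:
        """Send the query to the relevant shards only, then merge the partial results."""
        results = []
        for shard_id in shards_for_range(lo, hi):
            results.extend(query_shard(shard_id, lo, hi))
        return results

    print(shards_for_range(8_000, 12_000))  # ['shard-a', 'shard-b']

    As the table notes, this only works as well as the routing metadata is fresh; a stale range map quietly degrades back to querying everything.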

    6. Physical Node Imbalance → Microshards & Automatic Rebalancing

    6.1. Problem

    Even with a perfect logical distribution, one machine can be overloaded:

    • Noisy neighbor on shared hardware
    • Hotter set of partitions “by accident”
    • More compaction / GC on a subset of shards
    • One bad disk or NIC

    Now you have a hot shard by virtue of placement, not data distribution.

    6.2. Solution Pattern: Many Small Virtual Shards (Microshards)

    Instead of having, say, 64 partitions each pinned to a single node, you use thousands of tiny logical partitions (virtual nodes, vnodes, microshards) spread over the cluster.

    When a few microshards become hot, the cluster can:

    • Move just those microshards to underutilized nodes
    • Replicate them more aggressively
    • Throttle or reshard them

    6.3. Techniques

    Technique | What it does | Strengths | Weaknesses
    Consistent hashing with many virtual nodes | Each physical node owns many small slices of the keyspace. | Smooths out randomness; makes rebalance granular. | More metadata; slightly more complex routing and state management.
    Load-based shard movement | Move microshards away from overloaded nodes based on CPU/QPS/latency. | Targets actual hotspots instead of just “big shards”. | Shard moves can cause cache coldness; need careful throttling.
    Autoscaling based on hot shard metrics | Scale out when certain shards show sustained hotness, then redistribute. | Lets you pay only when load spikes; good for cloud-native setups. | Requires robust per-shard metrics; scaling lag can still cause transient pain.
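
    A compact sketch of consistent hashing with virtual nodes; VNODES_PER_NODE and the hash choice (MD5 here) are illustrative, not tied to any particular system:

    import bisect
    import hashlib

    class ConsistentHashRing:
        """Each physical node owns many points (virtual nodes) on the hash ring."""

        VNODES_PER_NODE = 256

        def __init__(self, nodes: list[str]):
            self._ring: list[tuple[int, str]] = []
            for node in nodes:
                for v in range(self.VNODES_PER_NODE):
                    self._ring.append((self._hash(f"{node}#vnode{v}"), node))
            self._ring.sort()
            self._points = [p for p, _ in self._ring]

        @staticmethod
        def _hash(s: str) -> int:
            return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

        def node_for(self, key: str) -> str:
            """Walk clockwise from the key's hash to the next virtual node."""
            idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
            return self._ring[idx][1]

    ring = ConsistentHashRing(["node-1", "node-2", "node-3"])
    print(ring.node_for("user:12345"))

    Because each node contributes hundreds of small slices, moving a hot slice or adding a node shifts only a thin sliver of the keyspace instead of half of it.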

    7. Global Logical Keys → Distributed State, Not Centralized State

    7.1. Problem

    Some things are inherently global:

    • Global “current version” of a config
    • Global counters (total users, total events)
    • Global locks or leases
    • Global rate limiter bucket

    If you implement these as a single key or row, that key becomes a guaranteed hot spot.

    [Figure: Hot Shard Global Key Problem]

    7.2. Solution Pattern: Decompose Global State

    Break global state into many smaller, less synchronized pieces:

    • Sharded counters / CRDTs instead of a single counter (sketched below).
    • Per-scope limits instead of one global rate bucket.
    • Distributed lock services instead of one database row.

    [Figure: Hot Shard Global Key Solution]
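
    A sharded counter in CRDT form (a grow-only G-counter) is small enough to sketch directly; the class below is a simplified illustration, not a production CRDT library:

    class GCounter:
        """Each shard/replica increments only its own slot; merge takes the element-wise max."""

        def __init__(self, replica_id: str):
            self.replica_id = replica_id
            self.slots: dict[str, int] = {}

        def incr(self, delta: int = 1) -> None:
            self.slots[self.replica_id] = self.slots.get(self.replica_id, 0) + delta

        def value(self) -> int:
            return sum(self.slots.values())

        def merge(self, other: "GCounter") -> None:
            for rid, count in other.slots.items():
                self.slots[rid] = max(self.slots.get(rid, 0), count)

    a, b = GCounter("shard-a"), GCounter("shard-b")
    a.incr(3); b.incr(5)
    a.merge(b)
    print(a.value())  # 8 -- no single row ever absorbed all the increments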

    7.3. Patterns & Tradeoffs

    Pattern | How it works | Where it shines | Tradeoffs
    CRDT / sharded counters | Maintain multiple partial counts and merge them (using CRDT join) to get a global value. | Metrics, analytics, dashboards, non-critical quotas. | Exact consistency can be weaker; you may see temporarily stale or approximate values.
    Hierarchical rate limiting | Limits per region / cluster / tenant, then aggregate up. | APIs with high fan-out, multi-region systems. | Designing fair & safe hierarchies is tricky; more state to manage.
    Dedicated lock/lease services | Use a Raft/Zookeeper/etcd-like system rather than ad-hoc DB locks. | Leader election, config rollouts, coordination. | Another distributed system to run; limited QPS; still can be a logical bottleneck if abused.
    Scoped “global” keys | Turn one global key into many “per-scope” keys (global#us-east-1, etc.). | Multi-region configs, region-scoped variables. | Must reason about cross-scope divergence; extra complexity if you truly require global invariants.
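
    As an illustration of the per-scope idea from the table (hierarchical limits plus scoped “global” keys), here is a sketch with made-up rates; a real system would also need an aggregation path if a true global cap is required:

    import time

    class TokenBucket:
        """Simple token-bucket limiter; rate and burst are illustrative."""

        def __init__(self, rate_per_s: float, burst: float):
            self.rate, self.capacity = rate_per_s, burst
            self.tokens, self.last = burst, time.monotonic()

        def allow(self, cost: float = 1.0) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

    # One bucket per scope (region here) instead of a single global bucket/key.
    buckets = {
        "global#us-east-1": TokenBucket(rate_per_s=500, burst=1000),
        "global#eu-west-1": TokenBucket(rate_per_s=500, burst=1000),
    }

    def allow_request(region: str) -> bool:
        return buckets[f"global#{region}"].allow()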

    8. Storage-Engine Hotspots → Engine-Aware Design

    8.1. Problem

    Even if traffic is balanced, the underlying storage engine can make some shards more expensive to maintain:

    • Compaction storms for specific SSTables/segments
    • Write amplification skewed to a few partitions
    • GC or memory pressure isolated to particular key ranges

    8.2. Solution Pattern: Align Logical Sharding With Engine Behavior

    Key ideas:

    • Prefer append-friendly patterns over in-place updates.
    • Use tiered storage (hot vs cold) so old data doesn’t penalize hot keys.
    • Keep shard sizes bounded, so compaction jobs don’t balloon.

    8.3. Techniques

    Technique | What it addresses | Strengths | Tradeoffs
    Shard-level compaction control | Heavy compactions on a small set of shards. | Lets you smooth IO usage and avoid compaction storms. | Engine-specific; requires deep tuning and good metrics.
    Tiered or time-partitioned tables | Mixing hot and cold data in the same structures. | Keep hot data small and in fast storage; archive cold data. | More complex retention/migration logic; additional tables/indexes.
    Bounded shard size & split policies | Very large shards that are expensive to compact. | Prevents any shard from becoming too “heavy” to move or maintain. | More shards to track; more frequent splits & metadata updates.
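
    As a rough illustration of the “bounded shard size & split policies” row, a split policy can be as simple as the sketch below; the size and QPS bounds are made-up numbers:

    MAX_SHARD_BYTES = 10 * 1024**3  # 10 GiB size cap per shard
    MAX_SHARD_WRITE_QPS = 5_000     # also split on sustained write load

    def should_split(shard_size_bytes: int, shard_write_qps: float) -> bool:
        """Split once a shard exceeds either bound, so it never gets too heavy to compact or move."""
        return shard_size_bytes > MAX_SHARD_BYTES or shard_write_qps > MAX_SHARD_WRITE_QPS

    def split_point(keys_with_sizes: list[tuple[str, int]]) -> str:
        """Choose the key that divides the shard into two roughly equal halves by size."""
        total = sum(size for _, size in keys_with_sizes)
        running = 0
        for key, size in sorted(keys_with_sizes):
            running += size
            if running >= total / 2:
                return key
        return keys_with_sizes[-1][0]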

    9. Unified View: Root Causes vs. Solution Patterns

    Here’s the “one big table” you can reference later.

    Root Cause | Typical Symptoms | Main Techniques | Key Tradeoffs
    Hot keys | One tenant/user/key dominates QPS and CPU. | Key salting, sharded counters, adaptive key splitting. | Fan-out on reads, aggregation complexity, weaker ordering guarantees.
    Temporal / time-series hotspots | Newest time window or newest IDs are much hotter. | Write sharding within time buckets, multiple “hot” partitions, compaction tuning, tiered storage. | Readers must hit multiple buckets; more complex data layout and routing.
    Query pattern skew | Same shard(s) hit by popular queries or range scans. | Sharded secondary indexes, scatter-gather with pruning, replica-aware routing, materialized views. | More metadata and background work; potential for higher read amplification.
    Physical node imbalance | One machine’s CPU/IO much higher than peers. | Many virtual nodes, load-based shard moves, autoscaling, smarter placement. | More metadata; careful handling of cache coldness and rebalancing costs.
    Global logical keys | Centralized counters, locks, configs become bottlenecks. | CRDTs, sharded/global counters, hierarchical limits, dedicated lock services, scoped “globals”. | Weaker global invariants, more complex semantics, extra infrastructure.
    Storage engine hotspots | Some shards spend outsized time compacting or GC’ing. | Shard-level compaction controls, bounded shard sizes, hot/cold separation. | More operational tuning; engine-specific concerns.

    10. Putting It All Together

    If you were designing a new system today and wanted to be reasonably robust to hot shards, a good baseline would be:

    1. Random hashing with many microshards

      • Start with 10×–100× more logical shards than physical nodes.
      • Route via consistent hashing.
    2. Shard-aware router with rich telemetry

      • Track QPS, latency, CPU, and storage per shard.
      • Use this to drive resharding and autoscaling.
    3. Hot-key detection + key-splitting

      • If per-key QPS crosses a threshold, automatically salt/split.
      • Expose a simple aggregation layer for reads.
    4. Write sharding for time-series workloads

      • Use time buckets + randomness rather than purely monotonic IDs.
      • Keep bucket sizes bounded and split when needed.
    5. Distributed index and query layer

      • Secondary indexes are themselves distributed.
      • Support scatter-gather with pruning and replica-aware routing.
    6. Storage-engine-aware policies

      • Avoid huge shards; cap shard size.
      • Tune compaction to avoid IO storms.
    7. Avoid single global keys

      • Use CRDT-style counters and hierarchical limits.
      • Reserve a dedicated, low-QPS system for locks and coordination.

    That combination is more or less what the big players converge on, with different flavors and tradeoffs.