Hot Shards

November 30, 2025

Overview

Hot shards are those unlucky partitions that see a disproportionate amount of traffic or work. Even if you use a “good” random hash function, hot shards still appear in production systems for reasons that have nothing to do with the hash function itself.

This post focuses on solutions: how real-world systems (DynamoDB, Spanner, Cassandra, ScyllaDB, Redis Cluster, Elasticsearch, etc.) actually mitigate hot shards.

TL;DR

  • Random / uniform hashing only fixes key distribution problems.
  • Hot shards still appear due to workload skew, time-series traffic, global keys, query patterns, and physical node imbalance.
  • Production systems lean on microshards, adaptive capacity, key-splitting, write-sharding, replica-aware routing, and CRDT-style counters to keep things stable.

Table of Contents

Recap: Why Hot Shards Exist Even With Random Hashing
Design Principles: The “Anti–Hot-Shard” Mindset
Hot Keys → Key-Level Load Splitting
Temporal / Time-Series Hotspots → Write Sharding
Query Skew & Range Scans → Smarter Query Routing and Index Sharding
Physical Node Imbalance → Microshards & Automatic Rebalancing
Global Logical Keys → Distributed State, Not Centralized State
Storage-Engine Hotspots → Engine-Aware Design
Unified View: Root Causes vs. Solution Patterns
Putting It All Together

1. Recap: Why Hot Shards Exist Even With Random Hashing

Before talking about solutions, we need to be precise about what we’re solving.

A shard can be “hot” for several root causes:

  1. Hot keys

    • A few keys receive 1000× more traffic than others (celebrity users, popular config keys, global counters).
  2. Temporal / time-series hotspots

    • Recent data or monotonic IDs funnel all new writes into a small region.
  3. Query skew & range scans

    • Queries that repeatedly target the same narrow slice of the key space, or scan wide ranges on a subset of shards.
  4. Physical resource imbalance

    • One node has worse hardware, heavier compaction, worse cache locality, or is just “unlucky” with shard placement.
  5. Global logical keys / coordination hotspots

    • Global locks, global version numbers, global rate-limiter buckets, etc., centralize traffic on a single shard.
  6. Storage engine hotspots

    • LSM compaction, GC, or disk-level behavior cause a shard to be much more expensive to maintain than others.

Random hashing only addresses:

“Are keys evenly distributed across shards?”

It does not guarantee:

“Is work evenly distributed across shards?”

The rest of this post is about how to fix each root cause.

2. Design Principles: The “Anti–Hot-Shard” Mindset

The techniques differ, but the guiding principles are fairly consistent:

  • Over-partition (many small shards, not a few large ones).
  • Monitor traffic + CPU + storage per shard.
  • Move or split work when it becomes skewed.
  • Avoid single global choke points (keys, locks, counters).
  • Push complexity to the edges (routers, read aggregators, materialized views) rather than the hottest shard.

You can think of the overall architecture as:

“Lots of tiny pieces that can be moved around, and enough observability to know what to move.”

We’ll now go cause-by-cause.

3. Hot Keys → Key-Level Load Splitting

3.1. Problem

A single key (or small set of keys) is much hotter than the rest:

  • A celebrity’s user profile
  • A single “global-progress” row
  • A config document everyone reads
  • A global counter or rate limiter bucket

Hashing doesn’t help: that one key maps to one shard.

[Figure: Hot Key Problem]

3.2. Solution Pattern: “Key Salting” / “Key Splitting”

Instead of:

Key: user:12345

You create logical shards of the same key:

user:12345#0
user:12345#1
user:12345#2
...

Writes are spread across these “sub-keys”, and reads aggregate them.

[Figure: Hot Key Solution]
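
Here is a minimal Python sketch of those write/read paths, assuming a hypothetical `kv` client with `append`/`read` operations and an arbitrary choice of 8 sub-keys:

```python
import random

N_SUBKEYS = 8  # how many sub-keys one hot logical key is split into (assumed)

def salted_key(logical_key: str) -> str:
    """Pick one of the N sub-keys at random so writes spread across shards."""
    return f"{logical_key}#{random.randrange(N_SUBKEYS)}"

def write_event(kv, logical_key: str, event: dict) -> None:
    # Each write lands on a different sub-key, and therefore (likely) a different shard.
    kv.append(salted_key(logical_key), event)

def read_events(kv, logical_key: str) -> list:
    # Reads must fan out over every sub-key and merge the results.
    events = []
    for i in range(N_SUBKEYS):
        events.extend(kv.read(f"{logical_key}#{i}"))
    return events
```

The write path gets cheap at the cost of an N-way fan-out on reads, so pick N just large enough to dilute the hot key.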

3.3. Alternate Solution Pattern: Reverse Architecture of Writes and Reads

Twitter-style systems often avoid a hot shard for popular content posters by not using a single “global feed.” Instead, they either:

  • Fan-out-on-write for normal keys: on tweet, push the tweet into each follower’s own timeline (sharded by follower ID), or
  • Fan-out-on-read / hybrid for hot keys: for very high-follower accounts, don’t push to all followers at write time; instead, pull those accounts’ tweets into a user’s home timeline dynamically when they fetch it. Tweet storage is sharded by tweet_id (high entropy), follower lists are chunked, and tweet reads are heavily cached. Fan-out-on-read assembles timelines from these sharded components rather than relying on key salting of the tweet itself, as sketched below.
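
A rough sketch of the hybrid decision, assuming an illustrative follower threshold and a hypothetical `store` interface (`put_tweet`, `followers`, `push_timeline`, and so on); this is not Twitter’s actual code:

```python
HOT_AUTHOR_FOLLOWER_THRESHOLD = 100_000  # assumed cutoff for "hot" authors

def publish_tweet(store, author_id: int, tweet_id: int, body: str) -> None:
    # Tweets themselves are sharded by tweet_id, which has high entropy.
    store.put_tweet(tweet_id, {"author": author_id, "body": body})

    if store.follower_count(author_id) < HOT_AUTHOR_FOLLOWER_THRESHOLD:
        # Fan-out-on-write: push the tweet id into each follower's timeline,
        # which is sharded by follower id.
        for follower_id in store.followers(author_id):
            store.push_timeline(follower_id, tweet_id)
    # Hot authors: skip the push; their tweets are pulled at read time instead.

def home_timeline(store, user_id: int, limit: int = 50) -> list:
    # Precomputed part: tweet ids pushed at write time.
    tweet_ids = store.read_timeline(user_id, limit)
    # Pulled part: recent tweets from followed hot authors, merged at read time.
    for author_id in store.followed_hot_authors(user_id):
        tweet_ids.extend(store.recent_tweets(author_id, limit))
    # Newest-first by id works here because ids are time-ordered (assumed).
    return sorted(tweet_ids, reverse=True)[:limit]
```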

3.4. Variants

| Pattern | How it works | Typical use cases | Tradeoffs & gotchas |
| --- | --- | --- | --- |
| Random key suffixes (“salting”) | Append a random or round-robin suffix to a hot key; distribute the subkeys across shards. | High-QPS writes to a single logical key (e.g., clickstream events for one page, chat messages for a busy room). | Readers must merge multiple subkeys; adds fan-out and aggregation latency. Harder to enforce strict ordering. |
| Sharded counters | Store multiple partial counters for the same logical counter; sum them on read or periodically. | Like counts, view counts, rate limits, metrics, quotas. | Approximate at read time unless you read all shards; more complicated to enforce exact “less than or equal to N” invariants. |
| Adaptive key splitting | System detects a hot key and auto-creates more internal partitions for it. | Cloud databases and managed K/V stores (e.g., “adaptive capacity” features). | Black-box behavior; good for ops, but you lose precise control of layout; needs strong telemetry. |

3.5. When to use

  • You see one customer, one tenant, or one resource dominating load.
  • Per-key QPS is often more useful than per-shard QPS for triggering this.

4. Temporal / Time-Series Hotspots → Write Sharding

4.1. Problem

Time-series workloads tend to pile writes onto the “latest” region:

  • Monotonic IDs (Snowflake, UUIDv7)
  • Time-bucketed keys (events:2025-11-29-12:00)
  • Logs / metrics / traces

Even if the hash function is “good”, only a handful of buckets or ID ranges are current at any moment, so nearly all new writes land on a small cluster of shards.

[Figure: Hot Shard Time Series Problem]

4.2. Solution Pattern: “Write Sharding” for Time Buckets

You keep the time locality at a coarse level (day/hour) but add randomness inside the bucket:

events:2025-11-29T12:00#0
events:2025-11-29T12:00#1
...
events:2025-11-29T12:00#31

New writes for “now” are scattered across multiple sub-buckets. Reads that look at “12:00” will fan out across these 32 sub-buckets.

[Figure: Hot Shard Time Series Solution]

When the clock rolls over to 1:00 PM, a new set of sub-buckets becomes active, and those sub-buckets hash to a (likely different) set of shards.
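
A minimal sketch of the write/read paths, assuming hourly buckets, 32 sub-buckets, and a hypothetical `store` client:

```python
import random
from datetime import datetime, timezone

SUB_BUCKETS = 32  # sub-buckets per hourly time bucket (assumed)

def bucket_for(ts: datetime) -> str:
    # Coarse time locality: one logical bucket per hour.
    return ts.strftime("events:%Y-%m-%dT%H:00")

def write_event(store, event: dict) -> None:
    now = datetime.now(timezone.utc)
    # Randomize within the bucket so "now" doesn't hammer a single shard.
    key = f"{bucket_for(now)}#{random.randrange(SUB_BUCKETS)}"
    store.append(key, event)

def read_hour(store, ts: datetime) -> list:
    # Readers fan out over all sub-buckets of the requested hour and merge.
    base = bucket_for(ts)
    events = []
    for i in range(SUB_BUCKETS):
        events.extend(store.read(f"{base}#{i}"))
    # Assumes each event carries its own timestamp field for re-ordering.
    return sorted(events, key=lambda e: e["ts"])
```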

4.3. Variants

| Strategy | How it helps | Works best when... | Tradeoffs |
| --- | --- | --- | --- |
| Randomized time-bucket keys | Scatter writes for one bucket across multiple partitions. | Readers naturally use fairly wide time ranges and can tolerate fan-out. | Readers must query multiple partitions per time slice; some extra network + CPU overhead. |
| Multiple “hot” partitions per topic | Instead of 1 partition per topic/tenant, you maintain N partitions in parallel. | Message queues (Kafka-like), streaming ingestion, log pipelines. | Reordering across partitions; consumer logic becomes more complex if strict order is needed. |
| Tiered storage & compaction optimization | Reduce compaction pressure on the newest data and offload older data to cheaper media. | Large LSM-based systems (RocksDB, Cassandra, Scylla, ClickHouse-like engines). | More moving parts; requires careful tuning and metrics to avoid surprises. |

5. Query Skew & Range Scans → Smarter Query Routing and Index Sharding

5.1. Problem

Even when keys are evenly distributed, queries might not be:

  • “Give me the last 100 events for user X” → very often lands on 1 shard
  • “Scan orders between ID A and ID B” → specific small range
  • “Search posts for tag = hot” → only a few shards have relevant data

Hashing doesn’t help here because the skew comes from the query pattern, not key distribution.

5.2. Solution Pattern: Distributed Indexes + Smart Routing

You want the router/query layer to know:

  • Which shards have relevant data
  • How to spread load across replicas
  • When to offload heavy aggregations to background jobs/materialized views

5.3. Techniques

| Technique | Idea | Helps with | Tradeoffs |
| --- | --- | --- | --- |
| Sharded secondary indexes | The index (e.g., by tag, status, tenant) is itself partitioned across many nodes. | Hot filters like WHERE status='OPEN' or WHERE tag='hot'. | More write amplification and index-maintenance complexity; heavier background work. |
| Scatter-gather with pruning | Route queries only to shards that can satisfy the predicate, not to all shards. | Range queries, partial scans, filtered searches. | Router must maintain metadata about shard key ranges; stale metadata can cause inefficiency. |
| Replica-aware read routing | Direct reads for hot ranges to less-loaded replicas. | Read-heavy hotspots, tail latency on hot ranges. | Increased replica counts; potential for read-after-write inconsistency depending on consistency guarantees. |
| Precomputed materialized views | Move expensive aggregation work off the hot path and into background jobs. | Heavy analytics queries over the same dimensions. | Freshness vs. cost tradeoff; more streaming/ETL complexity. |
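
As a concrete illustration of scatter-gather with pruning, here is a sketch of a router that only queries shards whose key ranges overlap the requested range; the shard-metadata layout and the `client.query` call are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class ShardInfo:
    shard_id: str
    lo: int  # inclusive lower bound of the shard's key range
    hi: int  # exclusive upper bound

def prune_shards(shards: list[ShardInfo], lo: int, hi: int) -> list[ShardInfo]:
    """Keep only shards whose range overlaps the query range [lo, hi)."""
    return [s for s in shards if s.lo < hi and lo < s.hi]

def range_query(client, shards: list[ShardInfo], lo: int, hi: int) -> list:
    results = []
    # Scatter only to shards that can actually hold matching keys...
    for shard in prune_shards(shards, lo, hi):
        results.extend(client.query(shard.shard_id, lo, hi))
    # ...then gather and merge on the router (rows assumed sortable by key).
    return sorted(results)
```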

6. Physical Node Imbalance → Microshards & Automatic Rebalancing

6.1. Problem

Even with a perfect logical distribution, one machine can be overloaded:

  • Noisy neighbor on shared hardware
  • Hotter set of partitions “by accident”
  • More compaction / GC on a subset of shards
  • One bad disk or NIC

Now you have a hot shard by virtue of placement, not data distribution.

6.2. Solution Pattern: Many Small Virtual Shards (Microshards)

Instead of having, say, 64 partitions each pinned to a single node, you use thousands of tiny logical partitions (virtual nodes, vnodes, microshards) spread over the cluster.

When a few microshards become hot, the cluster can:

  • Move just those microshards to underutilized nodes
  • Replicate them more aggressively
  • Throttle or reshard them
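
A toy sketch of the idea: many small logical shards hashed from keys, plus a placement map that lets the cluster move just the hot ones (the shard count and the round-robin initial placement are arbitrary choices for illustration):

```python
import hashlib

N_MICROSHARDS = 4096  # far more logical shards than physical nodes (assumed)

def microshard_for(key: str) -> int:
    # Stable hash of the key into a microshard id.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_MICROSHARDS

class Placement:
    """Maps microshards to physical nodes; hot microshards can be moved alone."""

    def __init__(self, nodes: list[str]):
        # Initial round-robin placement of microshards onto nodes.
        self.assignment = {m: nodes[m % len(nodes)] for m in range(N_MICROSHARDS)}

    def node_for(self, key: str) -> str:
        return self.assignment[microshard_for(key)]

    def move(self, microshard: int, target_node: str) -> None:
        # Rebalancing moves one tiny slice of the keyspace, not a whole node's data.
        self.assignment[microshard] = target_node
```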

6.3. Techniques

| Technique | What it does | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Consistent hashing with many virtual nodes | Each physical node owns many small slices of the keyspace. | Smooths out randomness; makes rebalancing granular. | More metadata; slightly more complex routing and state management. |
| Load-based shard movement | Move microshards away from overloaded nodes based on CPU/QPS/latency. | Targets actual hotspots instead of just “big shards”. | Shard moves can cause cache coldness; need careful throttling. |
| Autoscaling based on hot-shard metrics | Scale out when certain shards show sustained hotness, then redistribute. | Lets you pay only when load spikes; good for cloud-native setups. | Requires robust per-shard metrics; scaling lag can still cause transient pain. |

7. Global Logical Keys → Distributed State, Not Centralized State

7.1. Problem

Some things are inherently global:

  • Global “current version” of a config
  • Global counters (total users, total events)
  • Global locks or leases
  • Global rate limiter bucket

If you implement these as a single key or row, that key becomes a guaranteed hot spot.

[Figure: Hot Shard Global Key Problem]

7.2. Solution Pattern: Decompose Global State

Break global state into many smaller, less synchronized pieces:

  • Sharded counters / CRDTs instead of a single counter.
  • Per-scope limits instead of one global rate bucket.
  • Distributed lock services instead of one database row.

[Figure: Hot Shard Global Key Solution]
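
A minimal sharded-counter sketch, assuming a hypothetical `kv` client with `incr`/`get`; a CRDT counter has the same shape, with one slot per replica and a merge that keeps the per-slot maximum:

```python
import random

COUNTER_SHARDS = 16  # partial counters backing one logical counter (assumed)

def increment(kv, counter: str, delta: int = 1) -> None:
    # Writes hit a random partial counter, so no single key becomes hot.
    kv.incr(f"{counter}#{random.randrange(COUNTER_SHARDS)}", delta)

def read_total(kv, counter: str) -> int:
    # The global value is the sum of all partial counters; it may be slightly
    # stale if increments race with the read.
    return sum(kv.get(f"{counter}#{i}", default=0) for i in range(COUNTER_SHARDS))
```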

7.3. Patterns & Tradeoffs

| Pattern | How it works | Where it shines | Tradeoffs |
| --- | --- | --- | --- |
| CRDT / sharded counters | Maintain multiple partial counts and merge them (using a CRDT join) to get a global value. | Metrics, analytics, dashboards, non-critical quotas. | Exact consistency can be weaker; you may see temporarily stale or approximate values. |
| Hierarchical rate limiting | Limits per region / cluster / tenant, then aggregate up. | APIs with high fan-out, multi-region systems. | Designing fair & safe hierarchies is tricky; more state to manage. |
| Dedicated lock/lease services | Use a Raft/ZooKeeper/etcd-like system rather than ad-hoc DB locks. | Leader election, config rollouts, coordination. | Another distributed system to run; limited QPS; can still be a logical bottleneck if abused. |
| Scoped “global” keys | Turn one global key into many “per-scope” keys (global#us-east-1, etc.). | Multi-region configs, region-scoped variables. | Must reason about cross-scope divergence; extra complexity if you truly require global invariants. |

8. Storage-Engine Hotspots → Engine-Aware Design

8.1. Problem

Even if traffic is balanced, the underlying storage engine can make some shards more expensive to maintain:

  • Compaction storms for specific SSTables/segments
  • Write amplification skewed to a few partitions
  • GC or memory pressure isolated to particular key ranges

8.2. Solution Pattern: Align Logical Sharding With Engine Behavior

Key ideas:

  • Prefer append-friendly patterns over in-place updates.
  • Use tiered storage (hot vs cold) so old data doesn’t penalize hot keys.
  • Keep shard sizes bounded, so compaction jobs don’t balloon.
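
As a sketch of the last point, a bounded-size split policy might look like this; the thresholds, the shard-stats shape, and the `admin.split` operation are placeholders for whatever your engine exposes:

```python
MAX_SHARD_BYTES = 8 * 2**30   # split shards larger than ~8 GiB (assumed)
MAX_SHARD_WRITE_QPS = 5_000   # or hotter than this sustained write rate (assumed)

def maybe_split(admin, shard_stats: dict) -> None:
    """Split any shard that is too big or too hot to compact cheaply."""
    for shard_id, stats in shard_stats.items():
        too_big = stats["bytes"] > MAX_SHARD_BYTES
        too_hot = stats["write_qps"] > MAX_SHARD_WRITE_QPS
        if too_big or too_hot:
            # Splitting at the median key keeps each half's compaction work bounded.
            admin.split(shard_id, at_key=stats["median_key"])
```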

8.3. Techniques

| Technique | What it addresses | Strengths | Tradeoffs |
| --- | --- | --- | --- |
| Shard-level compaction control | Heavy compactions on a small set of shards. | Lets you smooth IO usage and avoid compaction storms. | Engine-specific; requires deep tuning and good metrics. |
| Tiered or time-partitioned tables | Mixing hot and cold data in the same structures. | Keep hot data small and in fast storage; archive cold data. | More complex retention/migration logic; additional tables/indexes. |
| Bounded shard size & split policies | Very large shards that are expensive to compact. | Prevents any shard from becoming too “heavy” to move or maintain. | More shards to track; more frequent splits & metadata updates. |

9. Unified View: Root Causes vs. Solution Patterns

Here’s the “one big table” you can reference later.

| Root Cause | Typical Symptoms | Main Techniques | Key Tradeoffs |
| --- | --- | --- | --- |
| Hot keys | One tenant/user/key dominates QPS and CPU. | Key salting, sharded counters, adaptive key splitting. | Fan-out on reads, aggregation complexity, weaker ordering guarantees. |
| Temporal / time-series hotspots | Newest time window or newest IDs are much hotter. | Write sharding within time buckets, multiple “hot” partitions, compaction tuning, tiered storage. | Readers must hit multiple buckets; more complex data layout and routing. |
| Query pattern skew | Same shard(s) hit by popular queries or range scans. | Sharded secondary indexes, scatter-gather with pruning, replica-aware routing, materialized views. | More metadata and background work; potential for higher read amplification. |
| Physical node imbalance | One machine’s CPU/IO much higher than peers. | Many virtual nodes, load-based shard moves, autoscaling, smarter placement. | More metadata; careful handling of cache coldness and rebalancing costs. |
| Global logical keys | Centralized counters, locks, configs become bottlenecks. | CRDTs / sharded counters, hierarchical limits, dedicated lock services, scoped “globals”. | Weaker global invariants, more complex semantics, extra infrastructure. |
| Storage engine hotspots | Some shards spend outsized time compacting or GC’ing. | Shard-level compaction controls, bounded shard sizes, hot/cold separation. | More operational tuning; engine-specific concerns. |

10. Putting It All Together

If you were designing a new system today and wanted to be reasonably robust to hot shards, a good baseline would be:

  1. Random hashing with many microshards

    • Start with 10×–100× more logical shards than physical nodes.
    • Route via consistent hashing.
  2. Shard-aware router with rich telemetry

    • Track QPS, latency, CPU, and storage per shard.
    • Use this to drive resharding and autoscaling.
  3. Hot-key detection + key-splitting

    • If per-key QPS crosses a threshold, automatically salt/split.
    • Expose a simple aggregation layer for reads.
  4. Write sharding for time-series workloads

    • Use time buckets + randomness rather than purely monotonic IDs.
    • Keep bucket sizes bounded and split when needed.
  5. Distributed index and query layer

    • Secondary indexes are themselves distributed.
    • Support scatter-gather with pruning and replica-aware routing.
  6. Storage-engine-aware policies

    • Avoid huge shards; cap shard size.
    • Tune compaction to avoid IO storms.
  7. Avoid single global keys

    • Use CRDT-style counters and hierarchical limits.
    • Reserve a dedicated, low-QPS system for locks and coordination.
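
As a closing sketch, the control loop driven by that per-shard and per-key telemetry might look roughly like this; all thresholds and the `metrics`/`actions` interfaces are hypothetical:

```python
HOT_SHARD_QPS = 20_000  # assumed per-shard threshold
HOT_KEY_QPS = 2_000     # assumed per-key threshold

def rebalance_tick(metrics, actions) -> None:
    """One pass of an anti-hot-shard control loop."""
    for shard_id, shard in metrics.per_shard().items():
        if shard["qps"] < HOT_SHARD_QPS:
            continue
        # If a few keys explain the heat, split (salt) those keys.
        hot_keys = [k for k, qps in shard["top_keys"].items() if qps > HOT_KEY_QPS]
        for key in hot_keys:
            actions.enable_key_salting(key)
        # Otherwise the shard itself is hot: move it to the least-loaded node.
        if not hot_keys:
            actions.move_shard(shard_id, metrics.least_loaded_node())
```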

That combination is more or less what the big players converge on, with different flavors and tradeoffs.