Distributed Systems Lessons: Hot Keys & Client Coordination
When you build APIs that fan out to hundreds of thousands of users polling in near‑real time, your data model becomes a reliability decision, not just a storage one.
Recently, I worked on a real‑time game system where users continuously polled to retrieve live state. During a peak event, we saw elevated 5xx error rates caused by an overheated DynamoDB partition. The problem wasn't overall traffic volume; it was traffic concentrated on a single key.
The Root Cause
We stored shared "game state" under a single partition key, so every user hitting that endpoint read the same item. DynamoDB doesn’t “autoscale” a single key: throughput is distributed across partitions, not pooled globally, and any one partition can serve only a few thousand reads per second no matter how much capacity the table has overall.
One hot key = one hot partition = throttling.
We had planned for scale, but not for load concentration patterns.
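For illustration, the read path looked roughly like the sketch below (table and key names are hypothetical): every request resolves to the same item, so DynamoDB has no way to spread the load.

```python
import boto3

# Hypothetical table and key names, for illustration only.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("game-state")

def get_live_state():
    # Every caller reads the exact same item, so every read lands on the
    # same partition key, and therefore on the same physical partition.
    response = table.get_item(Key={"pk": "LIVE_GAME_STATE"})
    return response.get("Item")
```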
What Actually Happened
During the incident, we found that clients were polling far more often than our estimates assumed. Retry logic compounded this, synchronizing failed requests into bursts that amplified the load. And because two live events were running concurrently, roughly twice the expected user base was hitting the endpoint.
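To make the amplification concrete, here is a back‑of‑the‑envelope calculation with purely illustrative numbers (not our production figures): aggressive polling plus retry‑on‑error multiplies the request rate just as the partition starts failing.

```python
# Illustrative numbers only, not production figures.
users = 100_000            # concurrent users across both live events
poll_interval_s = 5        # how often each client polls for state
retries_on_error = 3       # immediate retries when a poll returns a 5xx
error_rate = 0.5           # fraction of polls failing while throttled

base_rps = users / poll_interval_s                     # 20,000 req/s
retry_rps = base_rps * error_rate * retries_on_error   # +30,000 req/s
print(f"{base_rps:,.0f} req/s baseline -> {base_rps + retry_rps:,.0f} req/s under errors")
```

The multiplier is the point: once errors start, the same user base generates several times the load, exactly when the hot partition can least absorb it.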
CloudFront's default error caching added another layer: once the DynamoDB partition started throttling, CloudFront cached those 5xx responses for 10 seconds, so clients kept receiving errors even after we addressed the underlying issue, which slowed recovery.
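For reference, that 10‑second window is CloudFront's default error caching minimum TTL, and it can be tuned per error code through the distribution's custom error responses. A sketch with boto3, with the distribution ID and chosen values as placeholders:

```python
import boto3

cloudfront = boto3.client("cloudfront")
DISTRIBUTION_ID = "E1234EXAMPLE"  # placeholder

# update_distribution needs the full current config plus its ETag.
resp = cloudfront.get_distribution_config(Id=DISTRIBUTION_ID)
config, etag = resp["DistributionConfig"], resp["ETag"]

# Cache origin 503s for 1 second instead of the 10-second default, so
# clients stop receiving stale errors shortly after the origin recovers.
config["CustomErrorResponses"] = {
    "Quantity": 1,
    "Items": [{
        "ErrorCode": 503,
        "ResponsePagePath": "",
        "ResponseCode": "",
        "ErrorCachingMinTTL": 1,
    }],
}

cloudfront.update_distribution(
    Id=DISTRIBUTION_ID, IfMatch=etag, DistributionConfig=config
)
```

Lowering the error caching TTL trades some origin protection for faster recovery, so it is a judgment call rather than a default.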
At the core, one DynamoDB partition was absorbing nearly all reads for the shared game state. Roughly 75-80% of requests against that partition key were throttled, degrading service for the entire user base.
The Fix
To stop this from recurring, we approached the fix in layers. The first step was to cache the state directly inside the Lambda execution context. The state being fetched rarely changed, so keeping it in memory between invocations dramatically reduced the number of DynamoDB reads, cutting thousands of requests per minute and relieving immediate pressure on the table.
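A minimal sketch of that caching layer (table name, key, and TTL are hypothetical): the cache lives at module level, so it survives across invocations served by the same warm execution environment.

```python
import time
import boto3

# Hypothetical table name, key, and TTL; the real values differed.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("game-state")

CACHE_TTL_SECONDS = 5
_cached_state = None
_cached_at = 0.0

def get_game_state():
    """Return the shared state, reading DynamoDB at most once per TTL window."""
    global _cached_state, _cached_at
    now = time.time()
    if _cached_state is None or now - _cached_at > CACHE_TTL_SECONDS:
        response = table.get_item(Key={"pk": "LIVE_GAME_STATE"})
        _cached_state = response.get("Item")
        _cached_at = now
    return _cached_state

def handler(event, context):
    # Warm invocations reuse the module-level cache instead of hitting DynamoDB.
    return {"statusCode": 200, "body": str(get_game_state())}
```

With N warm execution environments, reads drop from one per request to roughly N per TTL window, which is what took the pressure off the partition.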
Next, we shifted our load-testing approach. The original load profile was based on expected behaviour, but real users, especially on certain device platforms, behaved very differently in production. Updating the tests to reflect real polling patterns and retry logic let us validate the fix under conditions that actually matched live traffic.
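In spirit, the updated load profile looked something like the Locust sketch below; the endpoint, cadence, and retry behaviour shown here are stand‑ins for what we actually measured.

```python
import time
import random
from locust import HttpUser, task, between

class PollingPlayer(HttpUser):
    # Hypothetical cadence; real clients polled faster than the original
    # load model assumed.
    wait_time = between(1, 3)

    @task
    def poll_state(self):
        # Mimic observed client behaviour: retry quickly on server errors,
        # which is what produced the burst patterns seen in production.
        for attempt in range(3):
            with self.client.get("/game/state", catch_response=True) as resp:
                if resp.status_code < 500:
                    resp.success()
                    return
                resp.failure(f"5xx on attempt {attempt + 1}")
            time.sleep(random.uniform(0.05, 0.2))  # near-immediate retry
```

The point is to encode polling cadence and retry behaviour, not just an aggregate requests-per-second number.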
Finally, we outlined a longer-term architectural improvement: spreading read traffic across multiple keys instead of hammering a single “hot” partition. This could mean sharding state by region, by game, or by consumer tier, effectively creating controlled replicas that let DynamoDB distribute the load more evenly across partitions.
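A sketch of that key‑sharding idea (shard count, table, and key names are hypothetical): the writer fans the state out to N copies, each reader picks one at random, and reads spread across N partitions instead of one.

```python
import random
import boto3

# Hypothetical table name, key prefix, and shard count.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("game-state")
READ_SHARDS = 10

def write_game_state(state: dict):
    # Fan the shared state out to N shard keys (controlled replicas).
    # A batch or transactional write would be the natural refinement here.
    for shard in range(READ_SHARDS):
        table.put_item(Item={"pk": f"LIVE_GAME_STATE#{shard}", **state})

def read_game_state():
    # Each reader picks a random shard, so reads spread across N partitions.
    shard = random.randrange(READ_SHARDS)
    response = table.get_item(Key={"pk": f"LIVE_GAME_STATE#{shard}"})
    return response.get("Item")
```

The trade‑off is write amplification and a brief window of cross‑shard inconsistency, which is usually acceptable for state that changes rarely.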
These adjustments align closely with AWS's own guidance on hot key prevention and key design best practices, and give the system a more resilient foundation for future scale.
Key Takeaway
Scaling isn’t just about handling more traffic; it’s about distributing it. Architectures that pass load tests can still fail under real‑world access patterns if one key becomes a hotspot. Plan for traffic shape, not just traffic volume.