[i]Users playing across our three regions.
The trouble started at 16:45 UTC, and we then made it much worse ourselves at 17:20 UTC.[/i]
----
With our authentication system back on its feet, we decided to re-allow all requests to the system. Since all our players were now attempting to log in simultaneously, the previously noticed problem with the entitlement system became glaringly obvious. The system was overwhelmed and went completely offline, meaning that even those users who managed to connect appeared as if they no longer owned the game. At this point, we knew that we needed to fight on two fronts at the same time. The issue with the entitlement system couldn’t wait for the root cause behind the disconnections to be resolved. Thankfully, we were on the verge of finding that root cause.
---
So what triggered this bombardment, you may ask? Unfortunately, it was a self-inflicted wound. During our diagnosis, we found that some DNS hostnames were not resolving correctly. One of the engineers on the call identified the problem: a missing DNS record, specifically a critical NS delegation record. Another engineer immediately realized what had happened.
Earlier in the day, a cleanup task had involved removing unnecessary infrastructure, and this crucial record had been mistakenly deleted along with it. The last commands in the cleanup had been issued around 16:40, and it so happens that the default TTL (time-to-live) for many DNS records is exactly 5 minutes. Resolvers kept serving their cached copies until that TTL ran out, which is why the trouble only started at 16:45 UTC. A few manual steps later, the record was recreated, and we had to wait patiently for DNS caches around the world to update.
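The timing can be checked with a little arithmetic. The following is only a minimal sketch, assuming the record was deleted at 16:40 UTC with a 300-second TTL. The function name `latest_cache_expiry` is hypothetical and the calendar date is an arbitrary placeholder; only the time of day comes from the timeline above.

```python
from datetime import datetime, timedelta, timezone


def latest_cache_expiry(deleted_at: datetime, ttl_seconds: int) -> datetime:
    """Latest moment a resolver that cached the record just before
    its deletion could still answer from its cache."""
    return deleted_at + timedelta(seconds=ttl_seconds)


# Placeholder date; only the time of day is taken from the incident timeline.
deleted_at = datetime(2000, 1, 1, 16, 40, tzinfo=timezone.utc)
expiry = latest_cache_expiry(deleted_at, ttl_seconds=300)
print(expiry.strftime("%H:%M UTC"))  # 16:45 UTC
```

Resolvers that cached the record earlier would have dropped it sooner, so failures ramp up gradually and peak around this moment rather than appearing all at once.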
--
The root cause, however, is only the triggering moment. With the on-call staff in our War Room assessing the situation and a couple of senior backend engineers ready for action, we still needed to tackle the entitlement service issue. We decided to attack it on two fronts at the same time:
- The escalation and support route. We escalated the issue up the support chain of the provider responsible for our entitlement storage. They confirmed they were overwhelmed and promised to deploy a solution soon. The clock started ticking at 17:40 UTC.
- The hotfix path. We modified our entitlement code to include a fallback method favoring our players. If we had issues resolving entitlements, but the player had previously owned a certain license type, we would assume they still did and let them pass. We already had a similar code path, but it only handled the case of being rate-limited. Extending it to produce the same result when the entitlement service was unresponsive was an easy task, and the hotfix was reviewed, compiled, built, and rolled out within 10 minutes.
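
The fallback described in the hotfix path can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: the exception classes `RateLimited` and `ServiceUnavailable`, the function `has_license`, and the `previously_owned` lookup are all hypothetical names invented for this example.

```python
from typing import Callable, Dict, Set


class RateLimited(Exception):
    """Hypothetical: the entitlement provider throttled the request."""


class ServiceUnavailable(Exception):
    """Hypothetical: the entitlement provider timed out or was offline."""


def has_license(
    player_id: str,
    license_type: str,
    fetch_entitlements: Callable[[str], Set[str]],
    previously_owned: Dict[str, Set[str]],
) -> bool:
    """Check ownership, falling back to the last known state on failure."""
    try:
        return license_type in fetch_entitlements(player_id)
    except (RateLimited, ServiceUnavailable):
        # Favor the player: if they owned this license type before,
        # assume they still do and let them pass.
        return license_type in previously_owned.get(player_id, set())


def offline_provider(player_id: str) -> Set[str]:
    """Stand-in for an entitlement service that is not responding."""
    raise ServiceUnavailable("entitlement storage is not responding")


previously_owned = {"player-1": {"base-game"}}
print(has_license("player-1", "base-game", offline_provider, previously_owned))  # True
print(has_license("player-2", "base-game", offline_provider, previously_owned))  # False
```

Handling both failure modes in one `except` clause mirrors the idea of extending the existing rate-limit path instead of writing a new one. A player with no prior ownership record is still refused, so the fallback only preserves access that was already granted.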
