Steve Koontz, Lead Developer of Roll20, put together a post-mortem to share with you on the service outage that some Roll20 users experienced on November 21-22. Without further ado, here’s Steve to talk about what happened and how we approached it:
On November 21st, Thanksgiving Eve here in the US, a little after 7 PM PT, Roll20 suffered an extreme slowdown. For most users, pages were taking a very long time to load, and for some users they did not load at all. Our engineers responded immediately, but the issue turned out to be difficult to diagnose. We hadn’t changed anything on our end going into the holiday, but over the course of just a few minutes our database had been overwhelmed by requests, and since everything was then running slowly, it wasn’t easy to determine the offending process. We did some maintenance on our database, and 90 minutes after the issues started we rolled out an emergency patch that brought things back to normal. We continued to monitor the performance of the site.
On Thanksgiving morning the issue recurred, which let us know that our fix from the previous night had minimized the problem but not resolved it. Since we were dialed in on the issue, it only took the team a few minutes to come up with a longer-lasting solution that brought back almost 100% of Roll20’s functionality. We isolated the issue, which turned out to be a foible of the database software Roll20 uses when dealing with very, very large tables: a query that normally ran very quickly suddenly became very slow. We’re still working on a permanent solution. Until then, notifications for completed queue processes, things like installing an addon into a game or rolling a game back, aren’t reaching the end user. The addons will still install correctly; you just aren’t getting the notification when they’ve finished. Refreshing the game detail page after a few seconds will show the addon installed, and all content will be correctly added to your game. We expect a special patch for this issue in the coming week.
We apologize for the inconvenience this caused to your holiday games.
-Steve Koontz, Lead Developer