In the last few weeks we’ve seen some of our worst stability and service quality in at least a year. It’s hard enough to get a game scheduled without these interruptions, so we owe you an apology and an explanation.
WHAT WENT WRONG
Our team is constantly working to upgrade our infrastructure, fix old code and speed up your games. What happened here was a perfect storm: a newly upgraded service – the Character Sheet Service – had an infrastructure issue that put strain on our main servers. That strain caused a cascade of failures in the handful of infrastructure components that had not been fully upgraded (for more on all of these see the Background section). We spent a lot of time chasing down and fixing issues in these components, one by one, before we discovered the root cause. There were numerous times that we thought we had the right fix, only to have new issues emerge a few days later. This led to both an unacceptably long period of instability and a lack of clear communication about it.
HOW WE’RE FIXING IT
We’ve upgraded every service-impacting infrastructure component and rolled back the code that caused issues, and we’ve now gotten safely through a full week without any recurrences. As of Thursday morning we have re-launched the Character Sheet Service with all known issues repaired. Keeping the service stable is our top priority, and we’ll continue to slowly release minor configuration changes early next week to further improve things. We won’t be returning to our normal release cadence until we’re 100% confident that all issues have been resolved.
BACKGROUND
Roll20 began as a little startup with a single server that hosted just about every line of code and every infrastructure component. Eventually we moved into the modern era and began to deploy tiny servers that could autoscale into the hundreds as traffic rises and falls on the site. We moved most infrastructure components that were under heavy strain off of the old production server, but a few things that were stable at the time stayed there.
Recently, the need to have character sheets on the mobile application meant we were going to have to change how we delivered the character sheet, so the time seemed opportune to break character sheets off into their own service. This would allow us to scale character sheets separately from the main service, deploy new character sheet features efficiently and open up more possibilities for character sheet authors to innovate. We launched the service integrated with the Virtual Tabletop back in December.
We now know that strain introduced by this service led to a systemic failure on that old production server, which in turn caused each of the infrastructure failures we’ve witnessed.
MISTAKES AND CORRECTIONS
So, that’s the story. But the detail of the story can obscure the places where we, quite honestly, screwed up, and we can’t make a good apology without being honest about that. Having mission-critical infrastructure co-located on a single, old, fragile machine was not a good call.
Furthermore, we underestimated just how difficult it can be to test features that touch our Virtual Tabletop. It has nearly a decade of features built on top of it, and the ways in which those features rely on each other aren’t always obvious. That led us to miss issues our work caused. We’re working hard on expanding our automated testing to cover those gaps, but in the meantime there are steps we could have taken to catch more of these issues earlier. While triaging these issues we developed a robust manual test plan for character sheets that has allowed us to make bolder moves with less risk. Expanded stress testing helped us to identify infrastructure configuration issues and solve them. Those sorts of tests, as brute force as they may be, should exist for any substantial new feature. We should have had them in place for this release long before it was able to introduce issues for you.
If you’re still reading this, thanks for bearing with me! Improving Roll20’s code and infrastructure, so that we can keep improving the service and launch new features, games and content faster and more reliably, remains a top priority for me and my team. We’ll do our best not to roll any more nat 1s while we do it.