Early yesterday morning, Roll20 experienced a site-wide outage which affected your ability to play games. We strive to have the site up and running 24/7, because we know that with a global audience, every hour of the day is someone’s game night, and we never want to be the reason that your group has to cancel.
Unfortunately, this time we fell short of that goal, and we want to sincerely apologize for that. The short version is that we encountered a database issue that we had never experienced previously. The good news is that our backup and recovery processes worked correctly and prevented any data from being lost. The bad news is that it took way longer than we expected for those processes to complete, which led to a significant amount of downtime, which is not acceptable. We’re taking steps to make sure that a) this type of issue doesn’t happen again, and b) if there are other issues in the future, we are able to recover from them much more promptly.
For those interested, let’s get into the technical information:
Beginning at approximately 6:30 AM US Central Time on April 4th, our Postgres database system encountered an error and entered a read-only mode in order to prevent data loss. This issue was caused by a misconfiguration of Postgres’ built-in autovacuum system, which was not completing its tasks in a timely manner, leading to a situation where the database’s transaction ID tracking was in danger of encountering a 32-bit integer overflow (known as transaction ID wraparound). Postgres has a built-in failsafe to keep this from happening, which is what caused the read-only mode to engage.
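For readers curious about the underlying mechanics, here is a minimal sketch of why a 32-bit transaction counter needs periodic vacuuming. The helper name is hypothetical and the real comparison happens inside Postgres in C, but the modulo arithmetic is the same: each transaction ID (XID) is only meaningful relative to the current one, and a row whose XID falls more than about two billion transactions behind would suddenly appear to be "in the future."

```python
# Illustration only: how Postgres-style 32-bit transaction IDs (XIDs) wrap.
# `xid_precedes` is a hypothetical helper name, not a real Postgres API.

XID_SPACE = 2**32   # XIDs are 32-bit counters
HALF_SPACE = 2**31  # comparisons are modulo 2^32, interpreted as signed

def xid_precedes(a: int, b: int) -> bool:
    """True if XID `a` is 'in the past' relative to XID `b` (mod 2^32)."""
    diff = (a - b) % XID_SPACE
    return diff >= HALF_SPACE  # wrapped-around difference => `a` is older

# A row stamped long ago stays correctly "in the past" only while the
# current XID is within ~2 billion transactions of it:
old_row_xid = 100
print(xid_precedes(old_row_xid, 100 + HALF_SPACE - 1))              # True
print(xid_precedes(old_row_xid, (100 + HALF_SPACE + 1000) % XID_SPACE))  # False
```

Vacuuming "freezes" old rows so their age can never approach this boundary; the failsafe read-only mode engages when freezing has fallen too far behind.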
Our automatic monitoring systems alerted our Development Team that there was an issue with the site, and within 30 minutes we had our first developers responding and working to identify the issue. The solution recommended by the official Postgres documentation was to boot the database in a single-user mode and do a database-wide vacuum, which would in theory resolve the problem.
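For reference, the documented recovery procedure looks roughly like the following (the data directory path and database name here are placeholders, not our actual setup):

```
$ postgres --single -D /usr/local/pgsql/data yourdatabase

backend> VACUUM;
```

Single-user mode bypasses the read-only failsafe so the vacuum can run, but it also means the database serves no other traffic while the command executes.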
However, when running vacuum in single-user mode, the verbose option doesn’t seem to work, and as such we were left with no real information about what the process was doing or when it would complete. As this process was running, we monitored the disk_read_bytes IO information of the process to attempt to get some idea of how quickly it was progressing. After a few hours, we became concerned that the process was going to take much longer to complete than we originally hoped. (It turned out we were right, as it’s been running for 24+ hours on our original database server and is still not finished.)
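With no output from the vacuum itself, our visibility amounted to watching the kernel’s per-process IO counters, along these lines (the PID is a placeholder for the single-user backend’s process ID):

```
# Cumulative bytes the backend has read from disk (Linux)
$ grep read_bytes /proc/<postgres-pid>/io
```

Watching that number climb gave us a rough sense of progress relative to the total size of the database on disk, but no reliable estimate of completion time.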
Around noon Central Time we began putting into place two alternate plans, one of which was to bring the site back online in a read-only mode, and the other of which was to fall back to our standard database restore, based on a database backup that we captured when the database’s read-only mode was engaged.
We were able to get the site running in read-only mode at approximately 3 PM Central Time, which allowed users to access their existing games, so that at least some groups would be able to play games on the site. During this time, we were continuing to run the database restore.
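Our site-level read-only mode lives in application code, but for those curious, Postgres itself exposes a server-level equivalent that can deliberately hold all new transactions read-only (shown here only as an illustration of the concept, not as what we ran):

```sql
-- Database-level read-only switch (illustrative; our mode is app-level)
ALTER SYSTEM SET default_transaction_read_only = on;
SELECT pg_reload_conf();
```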
At approximately 4 AM Central Time today (April 5th), the database restore completed, and we were able to bring the site back online fully and out of read-only mode. No data was lost during the restore process.
How we’re preventing this from happening again
The original root cause was a misconfiguration of the autovacuum process, which, although it was running, was not tuned properly and was not fully completing all the tasks it should have been. We are still investigating, but the most likely cause is simply that our database has grown to such a size that we need to custom-tune autovacuum to work better for our specific needs. We do not anticipate this issue happening again, since it was caused by a long-term (most likely months, if not years) build-up of uncleared transaction IDs, which we will resolve well in advance of it building up to that level again.
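To give a sense of the kind of tuning involved, these are the main postgresql.conf knobs that govern how aggressively autovacuum works. The values below are examples only, not our production settings:

```
# postgresql.conf -- example values, not our production configuration
autovacuum = on
autovacuum_max_workers = 6             # more parallel workers for many/large tables
autovacuum_vacuum_cost_limit = 2000    # allow each worker more IO per cycle
autovacuum_vacuum_cost_delay = 10ms    # shorter sleep between IO bursts
autovacuum_freeze_max_age = 200000000  # force anti-wraparound vacuums well
                                       # before the ~2 billion XID limit
```

The defaults are sized for modest databases; on a database as large as ours, throttled autovacuum workers can fall permanently behind, which is exactly the failure mode described above.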
However, the larger issue which this highlighted to us is that if there is any sort of major database issue, our current restoration process is far too slow to work for the site at this point. This is, again, due to the size at which the site has grown over the past year. Taking 16 hours to recover from a database failure via a complete restore is simply unacceptable and does not fit with our goals of providing a service with high uptime and availability.
Our roadmap for this year already included planned upgrades to our database infrastructure which would help with this, in the form of a secondary hot-swappable database server which we could fail over to in the case of a primary database issue. We will be accelerating this work and making it a top priority to have it done within the next few weeks.
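In current Postgres, a standby like this is built on streaming replication. A minimal sketch of the standby server’s recovery.conf might look like the following (hostname, user, and file paths are placeholders; the exact file layout varies by Postgres version):

```
# recovery.conf on the standby server -- placeholder values
standby_mode = 'on'
primary_conninfo = 'host=primary-db.example.com port=5432 user=replicator'
trigger_file = '/tmp/postgresql.trigger'  # touching this file promotes the standby
```

Because the standby continuously replays the primary’s write-ahead log, failing over takes minutes rather than the hours a full restore requires.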
If there’s any positive outcome from this situation, it’s that the database restoration process did work properly and did not lose any data. In addition, we now have the newly-created ability to put the site into read-only mode immediately should a similar situation ever occur.
Again, we offer our sincerest apologies for the downtime and for the interruption to your gaming. We know that getting everyone together around our virtual table can require making schedules work, and we never want to be the reason why your planned activity doesn’t happen. We hope that this post-mortem sheds some light on what caused the issue, and assures you that everything is being done to prevent it from happening again.
I’d also like to offer a very sincere thank you to the many users who cheered us on throughout yesterday’s chaos. At Roll20 we’ve managed to maintain incredibly strong uptime in our nearly six years of providing the service, and the number of people who took the time to say, “We know this is abnormal, and we know you’ll get it fixed,” was humbling.
-Nolan T. Jones, Managing Partner