Performance and Support Lessons from foursquare Downtime
A few days ago foursquare experienced their first major downtime. Over October 5th and 6th they were down for a total of almost 17 hours. They also had some partial downtime when some services were up and others were down. They’ve been very open about the problems they experienced and the steps they’re going to take to prevent it from happening again. The cause of the downtime was one of their MongoDB servers running out of RAM. While foursquare’s blog post linked above covers the issue at a higher level, Eliot Horowitz of 10gen (MongoDB’s company) took the time to write a great, in-depth post about the technical issues at hand. Some of these problems were preventable and highlight some best practices everyone should implement.
Server Monitoring is a Must
The core problem was MongoDB running out of RAM, forcing it to hit the hard drives for data. This created an instant bottleneck, overwhelming the server and bringing it down. My question is, why was this allowed to happen? If proper server monitoring had been in place, they would have been alerted to the impending collapse before it happened. I am very conservative when it comes to monitoring and server resource allocation. I get nervous when my servers are in the 25-50% load range. A penny-pinching startup may push its servers closer to the limit, but there still needs to be an automated monitoring and alert system in place.
For a MongoDB server I would be very aggressive with RAM monitoring. The simplest way would be to pick arbitrary points for alerts to go out (e.g. 75% used, 85% used). If your data growth rate is linear and/or reasonably slow this could work. A better way would be to log RAM usage over time and use those data points to fit a function that models growth. This would allow you to predict when a server will need an upgrade, schedule upgrades, and pick more accurate alert points. This can be significantly complicated by servers and applications that preallocate resources, because the memory the operating system reports as used no longer reflects what the application actually needs. In those cases you may need to use custom utilities to monitor effectively.
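As a rough illustration of both approaches, here is a minimal Python sketch. It assumes you already collect RAM usage as (day, percent used) samples; how you collect them is up to you. The function names, the sample data, and the choice of a straight-line fit are my own assumptions for the example, not anything foursquare or 10gen described. A linear model only makes sense if growth really is close to linear.

```python
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (day number, RAM used in percent)


def alert_level(used_pct: float,
                thresholds: Sequence[float] = (75.0, 85.0)) -> Optional[float]:
    """Return the highest fixed threshold that has been crossed, or None."""
    crossed = [t for t in thresholds if used_pct >= t]
    return max(crossed) if crossed else None


def fit_line(samples: List[Sample]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("need samples taken at two or more distinct times")
    cov = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    slope = cov / var_t
    return slope, mean_u - slope * mean_t


def days_until(samples: List[Sample], threshold_pct: float) -> Optional[float]:
    """Days from the last sample until the fitted line reaches the threshold.

    Returns None when usage is flat or shrinking, so no crossing is predicted.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    crossing_day = (threshold_pct - intercept) / slope
    return max(0.0, crossing_day - samples[-1][0])


if __name__ == "__main__":
    # Hypothetical readings: usage grows two points per day from 50%.
    history = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
    print(alert_level(history[-1][1]))
    print(days_until(history, 85.0))
```

The fixed thresholds tell you that you are already in trouble; the projection tells you how long you have left, which is what lets you schedule an upgrade instead of reacting to one.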
Sharding Keys Must Be Considered Carefully
According to Eliot, the server reached capacity because data wasn’t being distributed evenly across the two MongoDB servers. In simpler terms, data needs to be nearly evenly distributed between servers for predictable and proper performance. If the data isn’t distributed correctly one server can become overloaded, as in this case. Foursquare is using the user ID as the sharding key. Seems pretty logical, right? Keeping all of a user’s data on one server will result in better read performance. User IDs are unique and already exist. Set it and forget it, right? What could possibly go wrong?
This immediately seemed like a bad idea to me, especially with only two large servers. As in most systems, a small portion of users will be very active, most will be moderately active, and some won’t use the service at all. The distribution of very active, active and inactive users is likely to be skewed, so splitting users evenly between servers does not split their data evenly. You can easily end up with imbalanced data. This is not something to take lightly and requires some serious engineering, planning and analysis. Each application is different. It may be possible for foursquare to use user IDs with appropriate planning. I’m sure their engineers are reviewing this decision carefully.
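To make the imbalance concrete, here is a small Python sketch. Everything in it is hypothetical: the per-user document counts are invented, and I am assuming user IDs are assigned to shards by contiguous ID ranges, which may not match how foursquare's cluster was actually configured. The point is only that an even split of users can produce a very uneven split of data.

```python
import bisect
from typing import Dict, List, Sequence


def shard_totals(docs_per_user: Dict[int, int],
                 split_points: Sequence[int]) -> List[int]:
    """Sum documents per shard under range-based assignment.

    split_points must be sorted; each value is the lowest user ID that
    belongs to the next shard, so [3] means IDs below 3 go to shard 0
    and IDs 3 and above go to shard 1.
    """
    totals = [0] * (len(split_points) + 1)
    for user_id, doc_count in docs_per_user.items():
        shard = bisect.bisect_right(split_points, user_id)
        totals[shard] += doc_count
    return totals


def imbalance_ratio(totals: Sequence[int]) -> float:
    """Largest shard divided by the average shard; 1.0 is perfectly even."""
    return max(totals) / (sum(totals) / len(totals))


if __name__ == "__main__":
    # Invented data: a few early, very active users and a long quiet tail.
    docs_per_user = {1: 500, 2: 300, 3: 100, 4: 60, 5: 40}
    totals = shard_totals(docs_per_user, split_points=[3])
    print(totals, imbalance_ratio(totals))
```

Running a check like this against real per-user document counts before choosing a shard key, and periodically afterwards, is one cheap way to see whether the key is spreading load the way you expect.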
Beyond the technical problems there were additional lessons to be learned from this incident. Foursquare hadn’t dealt with downtime like this before so they didn’t have a support system in place to notify and assist users. One of my mottos has always been “Plan for the worst, hope for the best.” Even if your service isn’t essential to your users you should have a contingency plan for downtime. How will you inform your users? How will you keep them updated on progress? How can they contact you in the meantime? Luckily foursquare was already using other communication mediums (Twitter and a Tumblr blog) that allowed them to get the word out. In the wake of the downtime they’ve added a status blog and a support Twitter account.
I believe transparency is key and foursquare nailed it. No matter what the cause or how important your services are, you need to be transparent with users about downtime, data loss, bugs and problems.