One Week After The Outage

We launched a major new version of our site a week and a half ago. The headliner was the addition of near-real-time “solds” data through our MLS-based virtual office website (VOW) data feeds. On launch day, we had a three-hour outage, followed by intermittent “brownouts” for another two days. We wanted to give people an idea of what happened and what we’re doing to make sure this kind of outage doesn’t happen again.

Better and… Bigger
For the 14 days prior to launch, we ran data imports day and night. We added 1.4 million records and 9 million photos, and revamped our internal database schema. As a result, the disk space used by our Postgres database grew by 30%, and far more disk was needed to store the photos themselves.

By Thursday morning, we were not able to go live with all of our slave databases as planned. We use Slony replication for our slaves, and a scripting error can force a Slony slave into a complete re-sync; that, unfortunately, is what happened to us. We launched believing that our single master database could handle the load. We were wrong.

By 9am PST on Thursday, our site was maxing out. First it was slow, then it was non-responsive. The problem wasn’t a rush of traffic from the press coverage; it was our single master database, which the larger data set and new schema had overloaded. We ended up throttling the database to let most people through, but this just caused intermittent issues and “brownouts,” where the site would be overwhelmed with requests and become non-responsive for a minute or two at a time.
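To give a sense of what throttling like this looks like, here is a minimal sketch in Python: cap the number of requests allowed to touch the database at once, so a flood of traffic turns into queueing instead of a crushed server. This is purely an illustration of the idea, not Redfin’s actual throttling mechanism.

```python
import threading

# Cap on simultaneous database work. The number here is arbitrary and
# just for illustration; a real system would size it to the database.
MAX_CONCURRENT_QUERIES = 4
_db_slots = threading.BoundedSemaphore(MAX_CONCURRENT_QUERIES)

def run_query(execute):
    """Run a database call while holding one of the limited slots.

    Callers beyond the cap block here until a slot frees up, which is
    exactly the "slow but alive" behavior throttling trades for outright
    collapse.
    """
    with _db_slots:
        return execute()
```

The trade-off is visible in the symptoms we saw: requests queue up behind the semaphore, so the site stays reachable but stalls for stretches when the queue gets long.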

Many engineers spent all of Thursday and Friday looking at code, the database, and the traffic, everyone hunting for some magical bug that was causing the problem. In the end, the solution was very simple: once the slave databases were synced up and put into production on Friday at 8pm PST, the problem mostly went away. We’re still investigating the root cause, but all indicators point strongly to one thing: we simply didn’t have enough RAM to avoid disk I/O slowness and thrashing.
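One concrete signal for “not enough RAM” is the Postgres buffer cache hit ratio. The `pg_stat_database` view exposes `blks_hit` (blocks served from shared buffers) and `blks_read` (blocks that had to come from disk); on a read-heavy site, a ratio falling well below ~99% suggests the working set no longer fits in memory. A small sketch of the arithmetic, with the counters supplied by hand rather than fetched from a live database:

```python
def buffer_hit_ratio(blks_hit, blks_read):
    """Fraction of block requests served from Postgres shared buffers.

    The inputs correspond to the blks_hit and blks_read counters in
    pg_stat_database (e.g. from
      SELECT blks_hit, blks_read FROM pg_stat_database
      WHERE datname = current_database();).
    A ratio well below ~0.99 on a read-heavy workload hints that the
    database is going to disk far too often.
    """
    total = blks_hit + blks_read
    if total == 0:
        return 1.0  # no traffic recorded yet; nothing to report
    return blks_hit / total
```

Watching this ratio sag after a 30% jump in database size would have pointed at the RAM shortfall much faster than hunting for a bug in the code.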

Lessons Learned
Redfin learned that the scalability and performance testing we do before every release isn’t good enough. This outage hurt our professional pride, and we are newly dedicated to fixing that. We need to know that every new release will run well against expected load on existing hardware.

For our next major release in December, we had been planning to upgrade our master database from Postgres 8.3 on 32GB of RAM to Postgres 8.4 on 72GB of RAM. The database servers are over two years old now. Too bad we didn’t do it sooner, but we’ve accelerated the hardware upgrade to have it ready this week. We’re also intrigued by the idea of using Fusion-IO SSDs at some point.
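More RAM only helps if Postgres is configured to use it. As an illustration, here are the kinds of `postgresql.conf` settings the upgrade would put in play, using common rules of thumb from the Postgres 8.x era (shared buffers around a quarter of RAM, planner cache hint around three quarters). These values are hypothetical, not our actual configuration:

```ini
# Illustrative settings for a 72GB machine -- tune against your own workload.
shared_buffers = 16GB          # roughly a quarter of RAM for the buffer cache
effective_cache_size = 54GB    # planner hint: RAM available for caching overall
work_mem = 32MB                # per-sort / per-hash memory for each query
checkpoint_segments = 32       # spread out checkpoint I/O spikes
```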

We also plan to spend more time looking at ways we can streamline the code to run the site more efficiently on the hardware. Hardware is relatively cheap these days, but smart engineers can often find places in the code that can be made 10x faster!

And as the site grows, we’ll also look at more scalable database solutions like partitioning or switching at least some parts to Hadoop HBase. We use Hadoop for log analysis, but it’s very promising as a high-scale query engine.
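The core idea behind partitioning is simple: split one huge table into many small ones so each stays cheap to index and cache. A minimal sketch of the routing step, taking a row’s key modulo the partition count; the table names here (`listings_p0` through `listings_p15`) are hypothetical, not our schema:

```python
NUM_PARTITIONS = 16

def partition_for(listing_id):
    """Route a row to one of NUM_PARTITIONS child tables by its key.

    Each child table holds roughly 1/NUM_PARTITIONS of the data, so its
    indexes are smaller and more of the hot set fits in RAM.
    """
    return "listings_p%d" % (listing_id % NUM_PARTITIONS)
```

In Postgres this routing is typically pushed into the database itself (via inheritance and constraint exclusion in the 8.x line), so queries that filter on the key only touch the relevant partition.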

I know there are a lot of folks in technology who use Redfin. What do you think? Did we learn the right lessons?