For many years, Redfin has used Postgres for storing all of the data that backs the main redfin.com site. Until early this year, we were running on Postgres 8.4, but as the Postgres folks continued to add exciting features to the 9.x line, we became more and more eager to make the jump.
First, a bit about our architecture. We run with one master database, and several slaves that handle read-only queries. If the master fails, we fail over to one of these slaves.
One of our pain points with Postgres 8.4 was finding a flexible and reliable replication solution. We tried Slony for a while, but we found that we often had to rebuild slaves after a schema-changing deploy, and then we’d end up having to run on one master while the slaves caught up. (Disclaimer: this was a while ago, and Slony has added a lot of features since then…)
So then we switched to Skytools. Skytools had better scripting support than Slony, but it did not support making DDL changes (alter table, etc.) while replication was enabled. So, to make a DDL change, we would essentially “take down” replication (turning all of our slaves into solitary masters), apply schema changes separately to each database, and then restart the cluster. It was manageable, sure, but it still took time to take down and rebuild the replication setup, which led to more downtime during deploys.
Then, in late 2010, Postgres 9.0 introduced hot standby and streaming replication. Since DDL changes could be applied to the master and replicated to slaves while replication was enabled, this solution gave us everything we were looking for. We were nervous about being early adopters on such a critical piece of infrastructure, though, so we waited for Postgres 9.1, which shipped in late 2011 and added K-nearest-neighbor indexing (yay!). We tested Postgres 9.1 for a few months, and everything seemed to work swimmingly, so we deployed it to our production environment in January.
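For readers who haven't set it up, the 9.1-era configuration is refreshingly small. A minimal sketch (hostnames and the replication role name here are placeholders, not our actual settings):

```
# postgresql.conf on the master
wal_level = hot_standby
max_wal_senders = 5

# postgresql.conf on each slave
hot_standby = on

# recovery.conf on each slave
standby_mode = 'on'
primary_conninfo = 'host=master.example.com port=5432 user=replicator'
```

With that in place, slaves stream WAL from the master and serve read-only queries, and schema changes made on the master simply replicate along with everything else.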
And that’s when our troubles began. The master performed fine, but we started seeing roughly two dozen failed queries each day on the slaves, which resulted in an error in the logs: “ERROR: invalid memory alloc request size 18446744073709551613”. Yikes. And then roughly every three days, one of the slaves would segfault. Double yikes. We’d waited more than a year since streaming replication and hot standby first shipped – and our in-house testing was flawless – but our production load was triggering some serious issues. Was it “just one bug” we were hitting, or was the whole streaming replication feature suspect? Were we doing something unusual in our workload that might trigger these problems?
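That enormous allocation size is itself a clue: it equals 2^64 − 3, which suggests a small negative number (−3) from an underflowed length computation being reinterpreted as an unsigned 64-bit size on its way to the allocator. A quick sanity check:

```python
# The "invalid memory alloc request size" from the log.
size = 18446744073709551613

# It is exactly 2**64 - 3, i.e. the unsigned 64-bit
# representation of -3 -- the classic signature of an
# underflowed size calculation.
print(size == 2**64 - 3)   # True
print(size - 2**64)        # -3
```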
A quick search of Postgres bugs revealed that Heroku had also seen the invalid memory alloc error (reported as bug 6200), but it was much less frequent for them, and they hadn’t zeroed in on a repro. Fortunately, Heroku and some very kind Postgres experts (mainly Tom Lane and Robert Haas, of EnterpriseDB) were willing to help us gather more information and try to zero in on a cause.
We built our own version of Postgres 9.1.2 with extra code that made it panic when it hit an invalid memory alloc error, and turned on debug settings that would let us capture a stack trace on segfault. The stack traces we captured helped the Postgres experts form a theory about what might be happening, and they connected the issues we were seeing to another reported issue: bug 6425. That thread contains some really gory details about what was going on… but essentially, our issues all stemmed from a single bug that allowed a buffer to be overwritten while it was still in use. Once we applied the patch (which also shipped in Postgres 9.1.3), the issues went away, and streaming replication has been running perfectly for us ever since. It’s worth noting, too, that we weren’t doing anything “unusual” to trigger the error – it was just some simple SELECT statements running against a table with a btree primary key index.
We learned several lessons here. We are looking into how to better reproduce our production load in test so that we can uncover problems like these before we go live. On the plus side, we’re now more knowledgeable about how to enable debugging in Postgres, and about how to get help from the Postgres open source community. Will we be more gun-shy about adopting future Postgres features as a result of this experience? Maybe… but I’m sure it’s only a matter of time before a great new feature lands in Postgres that captures our imaginations and makes us eager to pursue the next version…
We are curious, though, whether anyone else saw issues like this with streaming replication prior to version 9.1.3. Anyone out there care to comment on how streaming replication has been working for them?