Redfin recently switched some of our backend DB infrastructure from MySQL to Postgres, and we plan to wholly switch to Postgres in the near future. This wasn’t an easy decision; MySQL has a lot going for it, and switching has been a lot of work. However, we’ve already seen major benefits from choosing Postgres, and we expect to see more as we complete our transition. In particular, performance on certain geographic queries has improved dramatically.
A simple Google search shows that a lot of people have already opined about MySQL versus Postgres (e.g. here, here, here, here, here, and here) but we weren’t able to find much information that applied directly to the problem we were having. Specifically, we were having some major performance problems with queries that were constrained by both spatial and numeric columns, and all of our attempts to squeeze more performance out of MySQL (including hiring expensive outside consultants) had come to naught.
Redfin is an online real estate company, and our map based UI is the most-used part of our web site (as well as being the biggest performance hog.) When a user views the map, we use SQL to find the relevant listings or past sales. Users typically constrain a search by numerous criteria, such as maximum price or minimum square footage. Since the UI is map based, users are ALWAYS constraining by geography, though that constraint might be weak.
How We Did It In MySQL
In MySQL, the queries might look something like:
price <= 400000 AND
num_bedrooms >= 2 AND
num_bathrooms >= 1.5 AND
type = 'condo' AND
MBRContains(GeomFromText('POLYGON((X1 Y1,X1 Y2,X2 Y2,X2 Y1,X1 Y1))'), centroid_col)
where X1/Y1 and X2/Y2 are lat/long pairs that describe the region to be searched. To improve performance, we create indexes on the relevant columns. In MySQL, a normal index cannot include spatial columns, and spatial indexes cannot include normal columns. In this example, we might have one multi-column index on price, num_bedrooms, and num_bathrooms, and another single-column index on centroid_col. In many cases, this performs great. Examples include:
- When the table is small (we have hundreds of thousands of listings, but tens of millions of past sales records)
- When the geographic constraint is very selective (i.e. when the map is zoomed very far in)
- When the geographic constraint is the only constraint (i.e. the user doesn’t care about price, bedrooms, etc.)
- When the constraints are poor, but the LIMIT amount is hit quickly (e.g. search for all listings in the the world; MySQL can quickly find the first 101 rows in the table, and once it’s found 101, it can give up)
However, there were also cases where it performed terribly, particularly when the table was big, the geographic constraints were relatively weak, and other constraints were relatively strong. For example, a search for all past sales in the San Francisco Bay Area that had 1 bedroom, but sold for over $10,000,000 resulted in a “killer” query. This is a little counterintuitive, but was definitely a problem for our customers (though my example is a very extreme case.) The problem with this query is that:
- The table is large (tens of millions of rows)
- The geographic index is the best index to use, but still isn’t great (might return 500,000 rows, or ~1% of the table)
- MySQL would “short circuit” the query when 101 records were found, but the query returns less than 101 records (there are few 1 bedroom condos that sold for more than $10M), so MySQL examines all 500,000 rows that match the geographic constraint
This does happen in real life.
For example, a user might be looking at homes in a small neighborhood. She’s looking for a 2 bedroom condo between $350k and $375k with a view (a fairly heavily constrainted query.) Then she zooms the map out a few levels (maybe she wants to see a lot of the map to pick out other neighborhoods of interest.) She has just unwittingly made a killer query- she’s searching a large geography with tight constraints on other attributes.
Another example is an investor- someone who wants to search large geographic areas for “fixer” properties that have a low asking price and large living area. Again, this results in a query that’s tightly constrained by some criteria, but relatively loosely constrained by geography.
Postgres and PostGIS
Jeff Yee, our intrepid head of QA, pointed out that geographic indexes in Postgres are supported through the feature-rich PostGIS plug-in. PostGIS supports all sorts of goodies (such as polygon containment, distance calculations, projection conversion, etc.), but the biggest gain is support for indexes on multiple, mixed-type columns. Using PostGIS, we could create an index on centroid_col, price, and num_bedrooms. These indexes turned many of our “killer” queries into pussycats. It was immediately obvious that for Redfin, PostGIS is a Very Good Thing. PostGIS offers us more than just a huge performance improvement and robust, sophisticated geographic functionality. It also offers an active community- there are lots of users available to answer silly newbie questions, and the software is being actively developed. On top of that, there’s a great Windows installer.
MyISAM and Data Corruption
In MySQL, our tables were MyISAM, since the geometric indexes we used were only supported on MyISAM tables. MyISAM generally offers very good performance, but unfortunately we’ve experienced data corruption on our production systems a number of times. It’s VERY painful, but we can live with occasional corruption if that’s the only way to deliver the performance we seek. PostGIS has given us another option, and we expect the advanced locking and data protection in Postgres to make data corruption a thing of the past.
We use a “single master-multiple slave” configuration in production, which requires data replication. The MySQL replication options are not super flexible, but they did exactly what we needed them to do, and they did it really well. Replication was easy to set up, easy to monitor, and proved to be very reliable. In Postgres, we had more options, and more confusion. It took us a while to work out exactly how we would do replication; validating and implementing that plan took considerable effort. It’s in production now, and it is working fine, but it was certainly a lot more effort than in MySQL. There’s also an ongoing cost- replicating DDL changes is more complicated under Postgres than it was under MySQL.
Advanced PostGIS features such as polygon matching and distance calculation have already helped us move much more quickly on Redfin features. Most of these things CAN be done in MySQL (e.g. by post-processing query results in Java using the excellent JTS Topology Suite library from Vivid Solutions), but it’s significantly more work, and in some cases would degrade performance. Hopefully, you’ll see new Redfin features in the near future, and think to yourself “Aah, they’re making PostGIS do the heavy lifting- the lazy bastards.” Postgres also contains advanced features that we were able to immediately benefit from. In particular, we use the CLUSTER command to optimize our table for access via the multicolumn geographic index.
Switching to Postgres was a lot of work. This was compounded by the fact that we chose to “toe-dip” into Postgres- most of our tables are still in MySQL- so our Java code is cluttered with logic to choose the correct DB connection for each query, to construct the “correct” SQL for each DB (most Redfin developers were not required to use Postgres during the development cycle, and we wanted to be able to fall back to MySQL if Postgres turned out to be a disaster), etc. We use Hibernate for persistence, which added another layer of complexity. However, when I see the performance gains we’ve made, I know it’s all worth while. The best cases probably aren’t much better, but the worst cases are startlingly better. Postgres and PostGIS let me feel good about telling my friends to use “past sales” searches on Redfin- I’m confident they won’t be waiting long for their results!
Dolphins may be smarter than elephants, but in the end, elephants are domesticable and can carry a heavy load.