Elephant versus Dolphin: Which is Faster? Which is Smarter?


Redfin recently switched some of our backend DB infrastructure from MySQL to Postgres, and we plan to wholly switch to Postgres in the near future. This wasn’t an easy decision; MySQL has a lot going for it, and switching has been a lot of work. However, we’ve already seen major benefits from choosing Postgres, and we expect to see more as we complete our transition. In particular, performance on certain geographic queries has improved dramatically.
A simple Google search shows that a lot of people have already opined about MySQL versus Postgres (e.g. here, here, here, here, here, and here) but we weren’t able to find much information that applied directly to the problem we were having. Specifically, we were having some major performance problems with queries that were constrained by both spatial and numeric columns, and all of our attempts to squeeze more performance out of MySQL (including hiring expensive outside consultants) had come to naught.

GIS Indexes

Redfin is an online real estate company, and our map-based UI is the most-used part of our web site (as well as being the biggest performance hog). When a user views the map, we use SQL to find the relevant listings or past sales. Users typically constrain a search by numerous criteria, such as maximum price or minimum square footage. Since the UI is map-based, users are ALWAYS constraining by geography, though that constraint might be weak.

How We Did It In MySQL

In MySQL, the queries might look something like:

SELECT
    *
FROM
    listings
WHERE
    price <= 400000 AND
    num_bedrooms >= 2 AND
    num_bathrooms >= 1.5 AND
    type = 'condo' AND
    MBRContains(GeomFromText('POLYGON((X1 Y1,X1 Y2,X2 Y2,X2 Y1,X1 Y1))'), centroid_col)
LIMIT
    101

where X1/Y1 and X2/Y2 are lat/long pairs that describe the region to be searched. To improve performance, we create indexes on the relevant columns. In MySQL, a normal index cannot include spatial columns, and spatial indexes cannot include normal columns. In this example, we might have one multi-column index on price, num_bedrooms, and num_bathrooms, and another single-column index on centroid_col. In many cases, this performs great. Examples include:

  • When the table is small (we have hundreds of thousands of listings, but tens of millions of past sales records)
  • When the geographic constraint is very selective (i.e. when the map is zoomed very far in)
  • When the geographic constraint is the only constraint (i.e. the user doesn’t care about price, bedrooms, etc.)
  • When the constraints are poor, but the LIMIT amount is hit quickly (e.g. search for all listings in the world; MySQL can quickly find the first 101 rows in the table, and once it’s found 101, it can give up)
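Concretely, the two indexes described above might be created like this (a sketch; the index names are my own, and the column names follow the example query):

```sql
-- Spatial indexes in MySQL require a MyISAM table and a NOT NULL geometry column
ALTER TABLE listings ENGINE = MyISAM;

-- One multi-column B-tree index covering the numeric/equality constraints...
CREATE INDEX idx_listings_price_beds_baths
    ON listings (price, num_bedrooms, num_bathrooms);

-- ...and a separate R-tree index on the geometry; MySQL cannot combine the two
CREATE SPATIAL INDEX idx_listings_centroid
    ON listings (centroid_col);
```

For any one query, the optimizer must pick one index or the other- it cannot use both the spatial and the numeric index at once, which is exactly the limitation that bit us.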

However, there were also cases where it performed terribly, particularly when the table was big, the geographic constraints were relatively weak, and other constraints were relatively strong. For example, a search for all past sales in the San Francisco Bay Area that had 1 bedroom but sold for over $10,000,000 resulted in a “killer” query. This is a little counterintuitive, but was definitely a problem for our customers (though my example is a very extreme case). The problem with this query is that:

  • The table is large (tens of millions of rows)
  • The geographic index is the best index to use, but still isn’t great (might return 500,000 rows, or ~1% of the table)
  • MySQL would “short circuit” the query when 101 records were found, but the query returns fewer than 101 records (there are few 1 bedroom condos that sold for more than $10M), so MySQL examines all 500,000 rows that match the geographic constraint

This does happen in real life.
For example, a user might be looking at homes in a small neighborhood. She’s looking for a 2 bedroom condo between $350k and $375k with a view (a fairly heavily constrained query). Then she zooms the map out a few levels (maybe she wants to see a lot of the map to pick out other neighborhoods of interest). She has just unwittingly made a killer query- she’s searching a large geography with tight constraints on other attributes.
Another example is an investor- someone who wants to search large geographic areas for “fixer” properties that have a low asking price and large living area. Again, this results in a query that’s tightly constrained by some criteria, but relatively loosely constrained by geography.

Postgres and PostGIS

Jeff Yee, our intrepid head of QA, pointed out that geographic indexes in Postgres are supported through the feature-rich PostGIS plug-in. PostGIS supports all sorts of goodies (such as polygon containment, distance calculations, projection conversion, etc.), but the biggest gain is support for indexes on multiple, mixed-type columns. Using PostGIS, we could create an index on centroid_col, price, and num_bedrooms. These indexes turned many of our “killer” queries into pussycats. It was immediately obvious that for Redfin, PostGIS is a Very Good Thing. PostGIS offers us more than just a huge performance improvement and robust, sophisticated geographic functionality. It also offers an active community- there are lots of users available to answer silly newbie questions, and the software is being actively developed. On top of that, there’s a great Windows installer.
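A mixed-type index like the one described relies on the btree_gist contrib module, which teaches Postgres’s GiST index type to handle ordinary scalar columns alongside geometries. A sketch of what this looks like (the index name and exact operator choices here are illustrative, not our production DDL):

```sql
-- btree_gist (a contrib module) lets scalar columns share a GiST index
-- with a PostGIS geometry column; on modern Postgres it loads like this
-- (on the 8.x releases of the era, you ran the contrib SQL script instead)
CREATE EXTENSION btree_gist;

-- One index that covers BOTH the spatial and the numeric constraints
CREATE INDEX idx_listings_geo_price_beds
    ON listings USING GIST (centroid_col, price, num_bedrooms);

-- The planner can now satisfy the whole WHERE clause from a single index;
-- && is the bounding-box overlap operator, analogous to MBRContains
SELECT *
FROM listings
WHERE centroid_col && ST_GeomFromText(
          'POLYGON((X1 Y1, X1 Y2, X2 Y2, X2 Y1, X1 Y1))', 4326)
  AND price <= 400000
  AND num_bedrooms >= 2
LIMIT 101;
```

Because a single index walk applies every constraint, the “weak geography, strong other constraints” queries no longer force a scan of 500,000 candidate rows.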

Other Considerations

MyISAM and Data Corruption

In MySQL, our tables were MyISAM, since the geometric indexes we used were only supported on MyISAM tables. MyISAM generally offers very good performance, but unfortunately we’ve experienced data corruption on our production systems a number of times. It’s VERY painful, but we can live with occasional corruption if that’s the only way to deliver the performance we seek. PostGIS has given us another option, and we expect the advanced locking and data protection in Postgres to make data corruption a thing of the past.

Replication

We use a “single master-multiple slave” configuration in production, which requires data replication. The MySQL replication options are not super flexible, but they did exactly what we needed them to do, and they did it really well. Replication was easy to set up, easy to monitor, and proved to be very reliable. In Postgres, we had more options, and more confusion. It took us a while to work out exactly how we would do replication; validating and implementing that plan took considerable effort. It’s in production now, and it is working fine, but it was certainly a lot more effort than in MySQL. There’s also an ongoing cost- replicating DDL changes is more complicated under Postgres than it was under MySQL.

Advanced Features

Advanced PostGIS features such as polygon matching and distance calculation have already helped us move much more quickly on Redfin features. Most of these things CAN be done in MySQL (e.g. by post-processing query results in Java using the excellent JTS Topology Suite library from Vivid Solutions), but it’s significantly more work, and in some cases would degrade performance. Hopefully, you’ll see new Redfin features in the near future, and think to yourself “Aah, they’re making PostGIS do the heavy lifting- the lazy bastards.” Postgres also contains advanced features that we were able to immediately benefit from. In particular, we use the CLUSTER command to optimize our table for access via the multicolumn geographic index.
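CLUSTER physically rewrites a table in the order of a chosen index, so listings that are near each other on the map end up near each other on disk, cutting I/O for map-window queries. A sketch (the index name is an assumption):

```sql
-- Rewrite the table so its physical row order follows the geographic index;
-- nearby listings then share disk pages, so a map query touches fewer pages.
-- (On the older 8.x releases the syntax was: CLUSTER indexname ON tablename)
CLUSTER listings USING idx_listings_geo_price_beds;

-- Refresh planner statistics after the rewrite
ANALYZE listings;
```

Note that CLUSTER is a one-time rewrite, not an ongoing property- the table drifts out of order as rows change, so it needs to be re-run periodically.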

Conclusion

Switching to Postgres was a lot of work. This was compounded by the fact that we chose to “toe-dip” into Postgres- most of our tables are still in MySQL- so our Java code is cluttered with logic to choose the correct DB connection for each query, to construct the “correct” SQL for each DB (most Redfin developers were not required to use Postgres during the development cycle, and we wanted to be able to fall back to MySQL if Postgres turned out to be a disaster), and so on. We use Hibernate for persistence, which added another layer of complexity. However, when I see the performance gains we’ve made, I know it’s all worthwhile. The best cases probably aren’t much better, but the worst cases are startlingly better. Postgres and PostGIS let me feel good about telling my friends to use “past sales” searches on Redfin- I’m confident they won’t be waiting long for their results!
Dolphins may be smarter than elephants, but in the end, elephants are domesticable and can carry a heavy load.

Discussion

  • Glenn Kelman

    Michael, do you think performance will be improved much by having Postgres run LISTINGS search?

  • Michael Smedberg

    Well, listing perf is already pretty good for most users (almost always under 200ms), so there isn’t as much room for improvement. But yes, I do expect to see performance improvements on the listing end.
    As the number of listings grows (i.e. as we move into more markets), this will be a bigger effect, and the effort of indexing listings well will be more apparent.

  • Dutch Marlowe

    First of all, good points on GIS support in Postgres. I always liked a database that let me compute spheroid distances based on the Geodetic survey of my choice (big fan of the 1980 :) ). Yeah, Oracle can do it, but who wants to blow $20K on a server license.

    One interesting thing I never got to doing while playing with Postgres 8.1+ was the ability to mix Postgres query partitioning (http://www.postgresql.org/docs/8.1/interactive/ddl-partitioning.html)
    with spatial filters.

    In theory, your listings are much more dense in some areas than in others. If you could partition on arbitrary geometry and then have the query split among resources as it executes, you could create this giant query computer that really focuses on geography as a way of keeping query times sane. Your geographic partitions can be very large (“The Midwest”) or very small (“The Upper West Side”). Now imagine data for the Midwest sitting on one machine and data for the UWS sitting on another machine. The machines have the same resources (memory, CPU), but cover vastly different areas because there are more listings in one than the other. They individually perform well and can be easily re-partitioned (split the Midwest into 3 “query machines”, for example). Performance is linear on the machines as long as they are similar, and performance on an area that does not cross partition boundaries is linear. Performance when a query spans boundaries is NOT #machines X individual machine performance- it looks more logarithmic. Now you have the “Google” of geographic searches.

    You should be able to arbitrarily scale this system ad nauseam.

    Now, if you only had on-demand access to an arbitrarily large number of machines to perform the processing…

  • Michael Smedberg

    Yup, we’ve thought a little about these issues, and had some internal debates on how to do partitioning.

    As you point out, our business naturally partitions geographically. There are a few possible approaches to partitioning that I’m aware of:

    1. Let the DB do partitioning, as you mentioned
    2. Partition by market. We currently break the world into markets (the San Francisco Bay Area market, the Seattle market, etc.) Real Estate agents are assigned to markets, most users are searching for a home within a single market, and we have “home” pages for each market (e.g. http://sfbay.redfin.com/ or http://seattle.redfin.com/.) However, as you point out, markets are NOT of equal size- it’s a convenient way to partition, but not necessarily a very smart way.
    3. Handle partitioning in custom logic. For example, we could write our own partitioner, and direct queries to the server that corresponds to the region being searched (or even to all partitions in parallel.)
    4. Use a service like Elastra (http://www.elastra.com/), which provides something like the “on-demand access to an arbitrarily large number of machines to perform the processing” that you’re proposing.
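    For the curious, option 1 could be sketched with the table inheritance and constraint exclusion mechanism described in the 8.1 partitioning docs linked above, partitioning by market as in option 2 (the market names, columns, and check constraints here are purely illustrative):

```sql
-- Parent table holds no rows itself; each market gets a child partition
CREATE TABLE listings_parent (
    id           bigint,
    market       text,
    price        numeric,
    centroid_col geometry
);

CREATE TABLE listings_sfbay (
    CHECK (market = 'sfbay')
) INHERITS (listings_parent);

CREATE TABLE listings_seattle (
    CHECK (market = 'seattle')
) INHERITS (listings_parent);

-- With constraint exclusion enabled, a market-constrained query
-- scans only the matching child table, not every partition
SET constraint_exclusion = on;

SELECT *
FROM listings_parent
WHERE market = 'sfbay' AND price <= 400000;
```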

    For now, Redfin probably doesn’t need to do any of those things. The combination of good GIS indexing, 64 bit machines, and cheap RAM lets us handle pretty big data volumes (tens of gigs) without sweating. We can probably double the number of markets we serve without worrying too much about partitioning. Once we approach hundreds of gigs of data, though, we’ll definitely have to think about this more carefully.

  • Dutch Marlowe

    #1 is easy. Oracle is a great database for this. Unfortunately you have to deal with sales guys.
    #2: isn’t that every Web 2.0 company’s knee-jerk reaction? Don’t they have a name for that: “sharding”? Seems like a band-aid.
    #3 is cool. If you can formulate your query in something other than SQL (e.g. XML), you can split the query quite simply. The problem is assembling the result set. That is a major flaw of something like this. Databases are good at merging records. No one, to my knowledge, has written a thing that efficiently merges records asynchronously arriving from processing nodes (unless Google does that).
    #4. Elastra — interesting company, but still thinking small and database focused. There are parallel problems like yours that need to be addressed and they are beyond the DB technologies that exist today. Hopefully someone gets to it one of these days, especially now, with virtual computing and web-based CPU farms being all the rage.

  • Logan Bowers

    On the partitioning methods, you can also partition randomly, e.g. of N servers, each has 1/N of the properties. Your overall load is higher since the query has to be prepared separately on each server, but your total throughput is N times greater, so (theoretically) your query finishes in nearly 1/N the time. For a small reduction in efficiency, you can significantly reduce visible latency to the user.