MySQL to Postgres


A couple of weeks ago we released an upgrade to the Redfin site. Some of the changes were obvious: we now support Safari, we show polygons for different region types on the map (cities, neighborhoods, zip codes, etc.), we show price history on the listing details page, we allow users to download search results as an Excel spreadsheet, and so on. And then there are a bunch of changes that are not so obvious, things that happened behind the scenes. Most notably — we switched our database from MySQL to Postgres.

For a website, the database is basically the lowest layer in the stack, so changing it is not a simple thing to do. My colleague Michael Smedberg wrote about the “toe dip” we took in this direction in a prior release: we still used MySQL as our “gold” database, but we replicated some of the data to a Postgres DB in order to support certain past-sale queries that were underperforming on MySQL. When that effort proved successful, we decided to go full throttle and make the switch to Postgres 100%, which is what we just did.

I can say that so far we’re pretty happy. Perhaps the two most important benefits we’ve gained are: 1. with PostGIS we can support true polygon searches, and 2. Postgres’ row-level multi-version concurrency control (MVCC) means that we can run “heavy” import operations on a DB that is serving live queries without worrying about locking or performance loss. And of course there are other benefits as well.

But in this blog post I thought I’d talk about the gotchas we faced along the way as we transitioned from MySQL to Postgres. If you’re contemplating your own switch, hopefully this will save you some time. Here goes…


MIGRATING THE DATA


Some of our first steps in migrating to Postgres were to define the new Postgres schema, to create an empty database with this schema, and to migrate data over from the MySQL database.

In doing this, we discovered that the data types supported by MySQL are not all supported by Postgres. For example: tinyint, enum, and datetime don’t exist at all, and double goes by double precision. There are often reasonable alternatives (e.g. timestamp instead of datetime), but in some cases you have to settle for a data type that is not quite the same (e.g. smallint instead of tinyint). Note: we are using Postgres version 8.2; some of these types (e.g. enum) are available in the recently released version 8.3.
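For illustration only (hypothetical columns, not our actual schema), a translated table might look something like this:

 -- MySQL (original):
 -- create table listings (price double, status tinyint,
 --                        kind enum('sale','rental'), listed datetime);

 -- Postgres 8.2 (translated):
 create table listings (
   price   double precision,
   status  smallint,
   kind    varchar(10) check (kind in ('sale', 'rental')),
   listed  timestamp
 );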

So, in migrating our data we had to do some translation/mapping of our data into the new schema. We figured that this was also a golden opportunity to do some additional cleanup: to remove unused tables, to remove unused columns, to name columns more consistently, to add some constraints, to remove duplicate data, etc.

We needed a migration tool that was flexible enough to handle all these needs, and we ended up writing our own — a Java program we call MySqlToPostgres. We considered alternatives, such as using the Postgres COPY command to bulk load from a file, but our own migration tool ended up being a good choice given all the transformations we wanted to do in transit.

If you’re going to write your own tool, here are a few tips to make the migration faster:

  1. Make sure to start with a target database that has tables but no indexes or unique constraints; add those only after you migrate your data. It makes a huge difference.
  2. Migrate the data using a few big transactions, instead of many small transactions. In our case, we issued a “begin” before inserting any rows into a given table and a “commit” after the last row was inserted. In between, we issued giant “insert” statements, each of which contained many rows. Here’s an example of a single insert statement with two rows (a rough JDBC sketch of the surrounding loop follows this list):
     INSERT INTO films (code, title, did, date_prod, kind) VALUES
         ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
         ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');
  3. There’s a setting in the postgresql.conf file you can use to log slow queries. Normally we set it to 200, so that we’ll know about queries that take more than 200 milliseconds: log_min_duration_statement = 200. While doing the migration, though, it was important to disable this logging by setting it to -1 (a value of 0 would log every statement); otherwise the slow query log grows very large with all those multi-row insert statements.
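If you go the roll-your-own route, here’s a rough JDBC sketch of the batching pattern from tip #2. It is not our actual MySqlToPostgres tool; the table, columns, and connection details are made up, and the MySQL-reading and value-escaping parts are stubbed out:

 import java.sql.Connection;
 import java.sql.DriverManager;
 import java.sql.Statement;
 import java.util.Collections;
 import java.util.List;

 public class BulkCopySketch {
     private static final int ROWS_PER_INSERT = 1000;

     public static void main(String[] args) throws Exception {
         // Hypothetical connection details for the target Postgres DB.
         Connection pg = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/redfin", "user", "password");
         pg.setAutoCommit(false); // "begin": one big transaction per table
         Statement stmt = pg.createStatement();

         StringBuilder insert = newInsertHead();
         int rowsInBatch = 0;
         for (String[] row : readRowsFromMySql()) {
             if (rowsInBatch > 0) {
                 insert.append(", ");
             }
             insert.append("(").append(formatValues(row)).append(")");
             if (++rowsInBatch == ROWS_PER_INSERT) {
                 stmt.executeUpdate(insert.toString()); // one giant multi-row insert
                 insert = newInsertHead();
                 rowsInBatch = 0;
             }
         }
         if (rowsInBatch > 0) {
             stmt.executeUpdate(insert.toString());
         }
         pg.commit(); // "commit" after the last row of the table
         stmt.close();
         pg.close();
     }

     private static StringBuilder newInsertHead() {
         return new StringBuilder(
                 "INSERT INTO films (code, title, did, date_prod, kind) VALUES ");
     }

     // Stand-ins for the real work: reading rows from the MySQL source and
     // escaping/formatting each value for Postgres (quoting strings, mapping
     // types, handling nulls, and so on).
     private static List<String[]> readRowsFromMySql() {
         return Collections.emptyList();
     }

     private static String formatValues(String[] row) {
         StringBuilder sb = new StringBuilder();
         for (int i = 0; i < row.length; i++) {
             if (i > 0) {
                 sb.append(", ");
             }
             sb.append(row[i]);
         }
         return sb.toString();
     }
 }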

Most of our data is stored in “regular” columns: numeric, string, etc. However, we have one table with some binary data, and this data proved particularly challenging. The Postgres documentation tells you how to format binary data as a string in an insert statement. The catch is that while MySQL and Postgres both treat a byte as an unsigned value between 0 and 255, Java treats bytes as signed values between -128 and 127. That means that if you write your own Java migration tool, you have to do a little arithmetic on the “bytes” Java gives you to convert them into the kind of bytes Postgres expects.
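Here’s a hedged sketch of that conversion (a hypothetical helper, not our production code). It masks each signed Java byte back into the 0–255 range and emits the octal escape form that bytea accepts inside a quoted value; if your tool can use a PreparedStatement with setBytes(), you can skip the string formatting entirely:

 public class ByteaFormatter {
     // Format a Java byte[] as the escaped-string form Postgres' bytea type
     // accepts inside a quoted insert value.
     public static String toByteaLiteral(byte[] data) {
         StringBuilder sb = new StringBuilder();
         for (byte b : data) {
             int unsigned = b & 0xFF; // Java bytes are signed (-128..127); mask back to 0..255
             if (unsigned >= 32 && unsigned <= 126 && unsigned != 39 && unsigned != 92) {
                 sb.append((char) unsigned); // printable characters, except ' and \
             } else {
                 // Everything else becomes a double-backslashed octal escape, e.g. \\012
                 sb.append(String.format("\\\\%03o", unsigned));
             }
         }
         return sb.toString();
     }
 }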


CONNECTIONS


With most databases, you’re allowed to connect to one database and then issue the use command to start dealing with a different database. Not so with Postgres. When you establish a connection to Postgres it’s to a specific database, and the “use” command doesn’t exist. This also means that you can’t issue queries that span multiple databases.

During the development cycle, especially in tests, it’s helpful to create a fresh database, do some stuff with it, and then drop it. Most databases will prevent you from dropping a database while users are connected to it, but Postgres is particularly draconian about it. There’s no easy way to say “just drop it anyway and kill all connections”. So, you have to be careful to close all your DB connections at the end of the test, before trying to clean up the DB. Of course, this is good practice anyway, but it’s sometimes tricky to discover who’s holding onto that one lingering connection.
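One way to find the culprit is to ask Postgres who is connected, along these lines ('my_test_db' is a placeholder; the column names below are the 8.x ones and were renamed in later releases):

 select procpid, usename, client_addr, current_query
 from pg_stat_activity
 where datname = 'my_test_db';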

Postgres has a setting that defines the maximum number of connections allowed to the DB: max_connections. By default, this is set to 100. You need to make sure that this value is big enough to support your worst-case situation. In particular, your website probably has some connection pooling mechanism on each web server, and this connection pool probably has a cap. You need to make sure that the DB can handle the case of each web server asking for the max number of connections. You should also check whether your connection pool is configured to reap unused connections, although you shouldn’t count on reaping to keep you below the worst-case scenario. If you find that there’s a mismatch, there are two things you can do: reduce the connection pool sizes all around, or increase the DB connections cap. If you want to do the latter, be aware that this will at least require restarting the DB. In some cases it may even require restarting the server completely. So, it’s best to decide ahead of time how many connections you’ll need in the worst case and to configure all settings accordingly.
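As a purely hypothetical sizing example (the numbers are made up, not ours): four web servers with connection pools capped at 25 each gives a worst case of 100 application connections, so you’d want something like this in postgresql.conf, with a little headroom for superuser and maintenance sessions:

 # postgresql.conf (hypothetical sizing): 4 web servers x 25 pooled
 # connections = 100, plus headroom for maintenance/superuser sessions
 max_connections = 110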


SEQUENCES


There are many situations where it’s convenient to have the database generate IDs for new records. In MySQL, this is a pretty simple matter: you set the AUTO_INCREMENT attribute on your column and it automatically picks up the next value.

With Postgres it’s a little trickier. The AUTO_INCREMENT keyword doesn’t exist. Instead you have sequences. You can create a sequence and call nextval('my_sequence_name') to retrieve the next value. You can also associate this sequence with a column by specifying that the default value for the column is nextval('my_sequence_name'). There’s also a shortcut for defining all this: serial.
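For example (hypothetical table names), these two definitions are roughly equivalent:

 create sequence listings_id_seq;
 create table listings (
   id integer primary key default nextval('listings_id_seq'),
   address varchar(200)
 );

 -- or let the serial shorthand create and attach the sequence for you:
 create table listings2 (
   id serial primary key,
   address varchar(200)
 );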

Having these independent sequences in Postgres gives you the flexibility to do interesting things, like associating the same sequence with multiple tables. But in practice we haven’t found this to be particularly useful. Also, it was a little more challenging to configure our O/R mapping layer, Hibernate, to recognize and use these sequences. It involved adorning our classes with the @SequenceGenerator annotation, and our ID columns with the @GeneratedValue annotation.
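For what it’s worth, the mapping looks roughly like this (a hypothetical entity, not our actual code; depending on your Hibernate version you may or may not need the explicit allocationSize to match the sequence’s increment of 1):

 import javax.persistence.Entity;
 import javax.persistence.GeneratedValue;
 import javax.persistence.GenerationType;
 import javax.persistence.Id;
 import javax.persistence.SequenceGenerator;
 import javax.persistence.Table;

 @Entity
 @Table(name = "listings")
 @SequenceGenerator(name = "listings_seq_gen",
                    sequenceName = "listings_id_seq",
                    allocationSize = 1) // keep in step with the sequence's increment of 1
 public class Listing {

     @Id
     @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "listings_seq_gen")
     private Long id;

     // ... the rest of the mapped columns ...

     public Long getId() {
         return id;
     }
 }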

Finally, it was important to explicitly initialize these sequences to the next available value after migrating data from MySQL to Postgres.
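For example (again with hypothetical names), after the load you can point each sequence just past the largest migrated ID:

 select setval('listings_id_seq', (select max(id) from listings));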


CASING


In MySQL (with its default collations), and in many other databases, string comparisons are case-insensitive. For example, on MySQL you would get one result for each of the following queries:

 create table names(firstname varchar(10));
 insert into names values ('Shahaf');
 select * from names where firstname='Shahaf';
 select * from names where firstname='SHAHAF';
 select * from names where firstname='sHaHaF';

Postgres, on the other hand, is case-sensitive. That means that only the first query would return a result.

After some research, we found three ways to deal with this problem:

  1. Use the upper() or lower() Postgres functions to normalize casing before comparison
  2. Use the case-insensitive regular expression operator: ~*
  3. Use the case-insensitive string-matching operator: ilike

The problem with #2 and #3 is that they don’t support index usage in some or all of the cases. The only way to get index usage (and therefore better performance) is to go with the upper() or lower() approach. By the way, we didn’t find any difference between upper() and lower(), so we arbitrarily went with upper().

Of course, it’s a pain to have to actually store your data in all upper-case. You might want to display it with normal casing, and you don’t want to have to store two copies. Luckily, you can store your data using normal casing, but create a functional index on the upper()-cased values. For example:

 create index employees_name on employees(upper(name));

After that, you just need to remember to upper-case any arguments you get, and to always construct queries that compare against upper(name). If you accidentally generate a query that compares against name (without the upper()), then the index will go unused and you’ll see bad performance.
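In other words, queries need to look like this so the planner can use the functional index:

 select * from employees where upper(name) = upper('Shahaf');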


TEXT SEARCH


Text search is one of those areas where MySQL has a basic search solution that works and requires little fuss, whereas Postgres has a myriad of solutions that require more TLC. One of these solutions, TSearch2, is now included in the 8.3 version of Postgres, so perhaps it’s becoming a kind of standard.

TSearch2 is built completely on top of Postgres. To use it, you need to add a set of objects to your database: functions, tables, etc. If you want to be able to search some text columns in a given table, you need to add a new column to that table, a column of type tsvector. You need to bulk-populate it using a tsearch2 function with the raw data from your string columns. And you need to define some triggers to keep this tsvector column updated whenever the source columns change. You can then define an index on this tsvector column, and start issuing text search queries, again using some special tsearch2 functions.
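Here’s a rough sketch of that flow with hypothetical table and column names, using the tsearch2 contrib spellings from 8.2 (the built-in 8.3 functions are named slightly differently):

 alter table regions add column search_vector tsvector;
 update regions set search_vector = to_tsvector(name);

 create trigger regions_tsvector_update
   before insert or update on regions
   for each row execute procedure tsearch2(search_vector, name);

 create index regions_search_idx on regions using gist(search_vector);

 select * from regions where search_vector @@ to_tsquery('ballard');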

We’ve had reasonable success with TSearch2 so far, but we haven’t exploited all of its functionality yet, and we’re still considering other approaches (e.g. Lucene).


REPLICATION


Replication is much like text search: MySQL has a basic replication solution that works, and Postgres has a bunch of competing solutions that are far more complex to implement. However, as with TSearch2, one solution is emerging as a kind of standard: Slony (unlike TSearch2, though, it remains a separate add-on rather than part of the 8.3 release).

Slony is a single-master, multi-slave, asynchronous replication system. In other words, you update your data in one master DB only, but you can query the data from the master or from the slaves. Just beware that the data on the slaves may be a little stale. In that regard, Slony is very similar to MySQL replication. However, the similarities pretty much end there.

MySQL replication is a built-in part of the database, and it uses its own binary logs to record what changes need to be replicated to which slave. Slony is built completely on top of the Postgres database. It creates a set of tables and triggers in your master database to keep track of which changes need to be replicated to which slave. So, in a way, Slony “pollutes” your schema. If you take a database snapshot, you now have all your tables plus a bunch of Slony stuff. On the other hand, Slony also creates triggers on the slaves to prevent you from accidentally making changes there, something that MySQL will happily let you do.

MySQL replication assumes that you want all databases on the master server to be replicated to each of the slave servers. Slony takes a more fine-grained approach. You need to explicitly specify which set of tables from which database on the master you want replicated to which target slave database. You don’t have to include all tables, and the slave database can live on the same machine as the master if you want.

MySQL replication automatically includes all kinds of changes: inserts, updates, deletes, truncates, table creation, index creation, etc. Slony is far more limited: it can only replicate changes to tables and sequences. If you have schema changes to make, you need to orchestrate them carefully such that they happen on all databases at the same point in the transaction order.

Slony has its own SQL-like language that is consumed by an interpreter called slonik. So, you have to invest a little bit of time learning this language and how to use it to define all the “first-class citizens” of Slony replication: clusters, nodes, sets, subscriptions, etc.
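For example, a schema change gets wrapped in slonik’s execute script command so that it is applied at a consistent point on every node. Very roughly (the cluster name, conninfo strings, and file path are made up; check the exact syntax against your Slony version):

 cluster name = redfin_cluster;
 node 1 admin conninfo = 'dbname=redfin host=master user=slony';
 node 2 admin conninfo = 'dbname=redfin host=slave1 user=slony';

 execute script (
     set id = 1,
     filename = '/tmp/add_new_column.sql',
     event node = 1
 );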

By the way, we found it very convenient to add a table used strictly to track replication latency:

 create table replication_heartbeat(
   id int primary key,
   last_updated timestamp
 );
 insert into replication_heartbeat values(1, now());

We set up a cron job that updates the timestamp on the master every minute, and we set up Nagios alerts both on the master and on the slaves to tell us if the timestamp is older than some threshold. There are various other tools out there to monitor Slony, but we found this scheme to be simple and effective.
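The moving parts are small; roughly (the cron and Nagios plumbing is omitted):

 -- run from cron on the master every minute:
 update replication_heartbeat set last_updated = now() where id = 1;

 -- run by the monitoring check against the master and each slave;
 -- alert if the lag exceeds your threshold:
 select now() - last_updated as replication_lag from replication_heartbeat where id = 1;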

Once you have replication set up, watch out for big transactions on the master database. For example, let’s say you issue this query:

 update my_big_table set my_column=null where my_column=0

This statement may take a while to run. Until it completes, no changes will be replicated to the slaves. Once it does complete, though, the changes will be replicated to the slaves one row at a time:

 update my_big_table set my_column=null where id=0;
 update my_big_table set my_column=null where id=1;
 etc.

This takes much longer, and it can cause your replication to fall behind.

Also, be careful to match the horsepower of your master and slave databases. If you have a beefy master database server and a dud slave server, the slave may fall behind as soon as you apply any “significant” changes on the master.


TOOLS AND DOCUMENTATION


With MySQL we used the free command-line tool (mysql) as well as a commercial GUI tool (SQLyog). Both worked well for our use cases. In particular, it was nice to be able to change data directly in SQLyog’s result view, much like you would in Excel.

Postgres comes with a similar command-line tool: psql. It works much like the mysql client, but it has its own esoteric syntax for various things that simply takes time to learn. For example, instead of “show tables” you use “\d”.

We haven’t found a good replacement for SQLyog yet. We’ve used one free tool, pgAdmin III, but we found that it’s not quite as good: it doesn’t let you edit data in the result view, it tends to open a lot of windows, and it tends to keep database connections open longer than it should.

We’ve also found that the Postgres documentation is not quite as good as MySQL’s. We’ve gotten by so far, but we definitely have to do more Googling for various details that are not fully explained in the core Postgres docs.


So there you have it. I hope these tips help you out.

If you also made the switch to Postgres, we’d love to hear about your experience. Did you run into the same issues? Are there others I forgot to mention? Have you found any useful Postgres tools that we should know about? If you considered Postgres but decided against it, we’d love to know your reasons.


Discussion

  • http://www.alexloddengaard.com Alex Loddengaard

    Awesome post, Shahaf. It sounds like Postgres requires much more work to set up and maintain. Do you think its performance improvements are worth its setup costs? Generally speaking, when would you choose to use MySQL and when would you choose to use Postgres?

  • shahaf.abileah

    Thanks Alex.

    I think in our case the decision was relatively straightforward — we wanted the kind of spatial search that we *couldn’t* get with MySQL, but we *could* get with PostGIS. Michael Smedberg showed that Postgres performs significantly better on past-sales queries, and our QA lead, Jeff Yee, showed that overall performance is likely to improve.

    So, it’s true that Postgres requires a little more attention to detail, but in our case I think it’s worth the extra cost. This is definitely one of those decisions you need to make very carefully on a case-by-case basis.

  • kusum

    Hi Shahaf,
    A great post.
    I am new to PostgreSQL. We have built our polygon database in PostgreSQL and are trying to perform polygon searches with it.
    We are trying to find which polygon a particular point lies in, and once we find that polygon we have to highlight it on the map.
    The trouble is that even when we find the desired polygon in the database, we can’t figure out how to get its vertices so we can highlight it on the map.
    Basically, a polygon search.
    Please help if you have any information about this, or if you could give me an example that would be great.

    Thanks in advance
    Kusum

  • Matt

    Hi Shahaf, thanks a lot for that super-informative post.

    Are you able to share any experiences & stats about PostgreSQL’s full-text search performance? I have used PostgreSQL in a few production environments & can vouch for its predictability & stability, especially when coupled with OCaml. But I’ve not used the FTS feature; I’d prefer to use PostgreSQL’s FTS rather than an external solution, e.g. Lucene.

    Thanks again!
    Matt

  • shahaf.abileah

    Thanks guys.

    Kusum, to answer your question, the first step as you said would be to pull the polygon from the database, something like this:

    select my_geometry_column from my_table where ST_Intersects(my_geometry_column, ST_GeomFromText('POINT(-122.299 47.609)', 4326));

    Now, this will give you the raw geometry representation, which is OK if you’re using the PostGIS libraries — you can use them to parse that binary blob and convert it into another representation that is suitable for the front-end. Alternatively, you can use the astext() function to have the DB convert the geometry data into text form.

    Either way, you’ll then need to transfer that data to the client-side in order to display the polygon on the map. We use Microsoft’s Virtual Earth map API, which includes support for drawing polygons. You can learn more about that API here: http://dev.live.com/virtualearth/sdk/

    Incidentally, the query you’re describing (find a polygon given a point) is different than our situation, currently. Our queries are more like “given a polygon, find all points in it”. But PostGIS supports both scenarios.

  • shahaf.abileah

    Matt, to answer your question…

    So far we’ve had pretty good luck with text search using TSearch2. We haven’t seen any records in our slow query log that involve TSearch2 queries, which means that they are all definitely under 200 ms. In our testing they were well under that (more like 10 ms). However, I should point out that our text search use cases are not particularly demanding. We use TSearch2 to look for regions (e.g. cities, neighborhoods, zip codes, etc.) whose name matches what you type into the search box. So, the number of queries is not huge and the volume of data to search is also not that big. For example, we get far more queries after the region has been identified (by name) when it’s time to look for listings that fit in that region’s polygon, and the data to search through (all listings/properties) is also much bigger.

    I haven’t used Lucene myself, but a few of the other engineers here previously worked at Plumtree, where I hear they did use Lucene. What I gather is that it gives you more functionality than TSearch2, but it comes at the cost of having to keep your DB and Lucene in sync. Apparently there are some libraries that do some of this grunt work for you (e.g. in the Hibernate O/R layer), but I personally didn’t investigate those too far.

    At some point we will look to support a wider range of text search features on our side, and when we do I’m sure we’ll spend time thinking about whether to stick with TSearch2 or to go with something else. For now it definitely does the job.

  • Anton Fedorov

    I can suggest you try EMS PgSQL Manager.
    It’s cool for GUI access to Postgres.
    I use the free version, but I’ve tried the trial pro version too; it does everything I want :)

  • shahaf.abileah

    Thanks for the suggestion Anton!

    I just installed EMS SQL Manager Lite for Postgres and I agree that it’s pretty good. The only thing it lacks is the ability to edit fields in place (in the grid view), but otherwise it looks better than the others I’ve tried. I’ll probably start using this one instead of my previous choice: DbVisualizer.

  • Walter Minden

    I’m curious about how you take advantage of the Slony replication via Hibernate, or do you just use it for backup purposes or similar?

    I wonder how one would ensure that data read from the slave never gets written back to the master? Two session factories and more DB logic in the business layers?

  • shahaf.abileah

    We use Slony replication for both reasons: backup purposes and load balancing for better performance.

    Our hibernate setup includes two separate connection definitions: one to the master DB and one to a slave DB. All write operations go to the master (obviously). *Most* read operations go to the slave (in some cases we read from the master DB, e.g. for queries where we can’t tolerate any replication latency).

    We use Guice dependency injection to initialize our DAOs with a SessionFactory, and we defined our own attributes to indicate whether a DAO should be initialized with a master DB SessionFactory (@ReadWrite) or a slave DB SessionFactory (@ReadOnly).

    We’ve been talking about setting up a 3rd attribute (@Reporting) that goes to a different slave, one that isn’t guaranteed to be as up-to-date as the others but that is appropriate for “heavy” queries (for reporting purposes and such).

  • Walter Minden

    First of all thanks for your quick reply!

    So, you then just put your annotated DAOs in your service classes and use them as you did before, right?

    So, you can’t actually ever write data read from the slave back to the master because the DAO prohibits it? What happens when a service which is injected with a DAO that has a read only session factory tries to modify data, e.g. update()?

  • shahaf.abileah

    Yes, we essentially use DAOs just like we did before.

    If you have a @ReadOnly annotated DAO and you try to use it to do an update, the update will be sent to the slave and it will fail with a SQLException. The error is generated by a trigger that Slony creates on tables in the slave DB to prevent you from accidentally making changes to them.
