Archive for the ‘Uncategorized’ Category

May 6, 2008

Fun with generate_series

A few months ago I attended the PostgreSQL conference in Portland, OR. There were a lot of talks ranging from hard-core stuff like Neil Conway’s talk about the internals of query execution, to random fun stuff like David Fetter’s discussion of procedural languages, including LOLCODE.

gulakuno.jpg

During their talks, a few people mentioned a handy function called generate_series. It took me a while to discover how useful this function really is. I thought I’d post an example. Here goes…

Let’s say that you have a table with sales information:

postgres=# select * from sales order by date;
    date    | sales_person | part_number
------------+--------------+-------------
 2008-05-05 | Glenn        |           1
 2008-05-05 | Shahaf       |           1
 2008-05-06 | Mike         |           1
 2008-05-06 | Mike         |           2
 2008-05-08 | Glenn        |           1
 2008-05-08 | Shahaf       |           1
 2008-05-08 | Mike         |           2
 2008-05-09 | Mike         |           1
 2008-05-09 | Glenn        |           1
(9 rows)

You might want to get an idea of how many sales happened on each day. You could try to do it with a query like this:

postgres=# select date, count(*) from sales group by date order by date;
    date    | count
------------+-------
 2008-05-05 |     2
 2008-05-06 |     2
 2008-05-08 |     3
 2008-05-09 |     2
(4 rows)

This basically works, but it hides one important fact — no sales happened on May 7.

To fix this, we can use generate_series. When you run this function normally, it just generates a series of numbers:

postgres=# select generate_series(0,4);
generate_series
-----------------
0
1
2
3
4
(5 rows)

However, you can easily change it to generate a series of dates:

postgres=# select generate_series(0,4) + date '2008-05-05' as date;
date
------------
2008-05-05
2008-05-06
2008-05-07
2008-05-08
2008-05-09
(5 rows)

Once you have this, you can now join against the sales table to generate the report:

postgres=# select series.date, count(sales.date)
from (select generate_series(0,4) + date '2008-05-05' as date) as series
left outer join sales on series.date=sales.date group by series.date
order by series.date;
    date    | count
------------+-------
 2008-05-05 |     2
 2008-05-06 |     2
 2008-05-07 |     0
 2008-05-08 |     3
 2008-05-09 |     2
(5 rows)

The trick here is to do a left-outer join betwee the date sequence and the sales, and to count the rows that have a non-null sales date.

This is by no means the only use of generate_series, it’s just the most recent use I found. If you know of other ways to use this function, or if you know of other handy functions, drop a note below.

Image credits: Electric Vehicle Guide


February 13, 2008

MySQL to Postgres

ramona1.JPG

A couple of weeks ago we released an upgrade to the Redfin site. Some of the changes were obvious: we now support Safari, we show polygons for different region types on the map (cities, neighborhoods, zip codes, etc.), we show price history on the listing details page, we allow users to download search results as an EXCEL sheet, and so on. And then there are a bunch of changes that are not so obvious, things that happened behind the scenes. Most notably — we switched our database from MySQL to Postgres.

For a website, the database is basically the lowest layer in the stack, so changing it is not a simple thing to do. My colleague Michael Smedberg wrote about the “toe dip” we took in this direction in a prior release: we still used MySQL as our “gold” database, but we replicated some of the data to a Postgres DB in order to support certain past-sale queries that were underperforming on MySQL. When that effort proved successful, we decided to go full throttle and make the switch to Postgres 100%, which is what we just did.

I can say that so far we’re pretty happy. Perhaps the two most important benefits we’ve gained are: 1. with PostGIS we can support true polygon searches, and 2. Postgres’ row-level multi-value concurrency control (MVCC) means that we can run “heavy” import operations on a DB that is serving live queries without worrying about locking or performance loss. And of course there are other benefits as well.

But in this blog post I thought I’d talk about the gotchas we faced along the way as we transitioned from MySQL to Postgres. If you’re contemplating your own switch, hopefully this will save you some time. Here goes…


MIGRATING THE DATA

dance_with_me_by_ramona.JPG

Some of our first steps in migrating to Postgres were to define the new Postgres schema, to create an empty database with this schema, and to migrate data over from a the MySQL database.

In doing this, we discovered that the data types supported by MySQL are not all supported by Postgres. For example: double, tinyint, enum, and datetime don’t exist at all. There are often reasonable alternatives (e.g. timestamp instead of datetime), but in some cases you have to resort to a data type that is not quite the same (e.g. smallint instead of tinyint). Note: we are using Postgres version 8.2; some of these types (e.g. enum) are available in the recently released version 8.3.

So, in migrating our data we had to do some translation/mapping of our data into the new schema. We figured that this was also a golden opportunity to do some additional cleanup: to remove unused tables, to remove unused columns, to name columns more consistently, to add some constraints, to remove duplicate data, etc.

We needed a migration tool that was flexible enough to handle all these needs, and we ended up writing our own — a java program we call MySqlToPostgres. We considered other alternatives, such as using the Postgres COPY command to bulk load from file. But our own migration tool ended up being a good choice given all the transformations we wanted to do in-transit.

If you’re going to write your own tool, here are a few tips to make the migration faster:

  1. Make sure to start with a target database that has tables but has no indexes or unique constraints; add those only after you migrate your data, it makes a huge difference.
  2. Migrate the data using a few big transactions, instead of many small transactions. In our case, we issued a “begin” before inserting any rows into a given table and a “commit” after the last row was inserted. In between, we issued giant “insert” statements, each of which contained many rows. Here’s an example of a single insert statement with two rows:
     INSERT INTO films (code, title, did, date_prod, kind) VALUES
         ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
         ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');
  3. There’s a setting in the postgresql.conf file you can use to log slow queries. Normally we set it to 200, so that we’ll know about queries that take more than 200 milliseconds: log_min_duration_statement=200. However, while doing our migration, it was important to disable this option by setting it to 0. Otherwise our slow query log grew very large with all those multi-row insert statements.

Most of our data is stored in “regular” columns: numeric, string, etc. However, we have one table with some binary data, and this data proved particularly challenging. The Postgres documentation tells you how to format binary data as a string in an insert statement. The trick is that while MySQL and Postgres both treat a byte as an unsigned value between 0 and 255, Java treats bytes as signed values between -127 and 128. That means that if you write your own Java migration tool, you have to do some tricky arithmetic with the “bytes” that Java gives you to convert them into the kind of bytes that Postgres expects.


CONNECTIONS

ardila1.JPG

With most databases, you’re allowed to connect to one database and then issue the use command to start dealing with a different database. Not so with Postgres. When you establish a connection to Postgres it’s to a specific database, and the “use” command doesn’t exist. This also means that you can’t issue queries that span multiple databases.

During the development cycle, especially in tests, it’s helpful to create a fresh database, do some stuff with it, and then drop it. Most databases will prevent you from dropping the database if there are users connected to it, but Postgres is particularly draconian about it. There’s no easy way to say “just drop it anyways and kill all connections”. So, you have to be careful to close all your DB connections at the end of the test, before trying to clean up the DB. Of course, this is good practice anyways, but it’s sometimes tricky to discover who’s holding onto that one connection.

Postgres has a setting that defines the maximum number of connections allowed to the DB: max_connections. By default, this is set to 100. You need to make sure that this value is big enough to support your worst-case situation. In particular, your website probably has some connection pooling mechanism on each web server, and this connection pool probably has a cap. You need to make sure that the DB can handle the case of each web server asking for the max number of connections. You should also check whether your connection pool is configured to reap unused connections, although you shouldn’t count on reaping to keep you below the worst-case scenario. If you find that there’s a mismatch, there are two things you can do: reduce the connection pool sizes all around, or increase the DB connections cap. If you want to do the latter, be aware that this will at least require restarting the DB. In some cases it may even require restarting the server completely. So, it’s best to decide ahead of time how many connections you’ll need in the worst case and to configure all settings accordingly.


SEQUENCES

halloween_by_arum.JPG

There are many situations where it’s convenient to have the database generate ID’s for new records. In MySQL, this is a pretty simple matter: you set the AUTO_INCREMENT attribute on your column and it automatically picks up the next value.

With Postgres it’s a little trickier. The AUTO_INCREMENT keyword doesn’t exist. Instead you have sequences. You can create a sequence and call nextval(’my_sequence_name’) to retrieve the next value. You can also associate this sequence with a column by specifying that the default value for the column is nextval(’my_sequence_name’). There’s also a shortcut for defining all this: serial.

Having these independent sequences in Postgres gives you the flexibility to do interesting things, like to associate the same sequence with multiple tables. But in practice we haven’t found this to be particularly useful. Also, it was a little more challenging to configure our O-R mapping layer, hibernate, to recognize and use these sequences. It involved adorning our classes with the @SequenceGenerator tag, and our ID columns with the @GeneratedValue tag.

Finally, it was important to explicitly initialize these sequences to the next available value after migrating data from MySQL to Postgres.


CASING

desi3.JPG

In MySQL, and in may other databases, string comparisons are case insensitive. For example, on MySQL you would get one result for each of the following queries:

 create table names(firstname varchar(10));
 insert into names values ('Shahaf');
 select * from names where firstname='Shahaf';
 select * from names where firstname='SHAHAF';
 select * from names where firstname='sHaHaF';

Postgres, on the other hand, is case-sensitive. That means that only the first query would return a result.

After some research, we found three ways to deal with this problem:

  1. Use the upper() or lower() Postgres functions to normalize casing before comparison
  2. Use the case-insensitive regular expression operator: ~*
  3. Use the case-insensitive string-matching operator: ilike

The problem with #2 and #3 is that they don’t support index usage in some or all of the cases. The only way to get index usage (and therefore better performance) is to go with the upper() or lower() approach. By the way, we didn’t find any difference between upper() and lower(), so we arbitrarily went with upper().

Of course, it’s a pain to have to actually store your data in all upper-case. You might want to display it with normal casing, and you don’t want to have to store two copies. Luckily, you can store your data using normal casing, but create a functional index on the upper()-cased values. For example:

 create index employees_name on employees(upper(name));

After that, you just need to remember to upper-case any arguments you get, and to always construct queries that compare against upper(name). If you accidentally generate a query that compares against name (without the upper()), then the index will go unused and you’ll see bad performance.


TEXT SEARCH

rice_field_by_desi.JPG

Text search is one of those areas where MySQL has a basic search solution that works and requires little fuss, whereas Postgres has a myriad of solutions that require more TLC. One of these solutions, TSearch2, is now included in the 8.3 version of Postgres, so perhaps it’s becoming a kind of standard.

TSearch2 is built completely on top of Postgres. To use it, you need to add a set of objects to your database: functions, tables, etc. If you want to be able to search some text columns in a given table, you need to add a new column to that table, a column of type tsvector. You need to bulk-populate it using a tsearch2 function with the raw data from your string columns. And you need to define some triggers to keep this tsvector column updated whenever the source columns change. You can then define an index on this tsvector column, and start issuing text search queries, again using some special tsearch2 functions.

We’ve had reasonable success with TSearch2 so far, but we haven’t exploited all its functionality so far, and we’re still considering other approaches (e.g. Lucene).


REPLICATION

elephantart2.JPG

Replication is much like text search: MySQL has a basic replication solution that works, and Postgres has a bunch of competing solutions that are far more complex to implement. However, as with TSearch2, there’s one solution that’s becoming a kind of standard and is now included in the 8.3 version of Postgres: Slony.

Slony is a single-master, multi-slave, asynchronous replication system. In other words, you update your data in one master DB only, but you can query the data from the master or from the slaves. Just beware that the data on the slaves may be a little stale. In that regard, Slony is very similar to MySQL replication. However, the similarities pretty much end there.

MySQL replication is a built-in part of the database, and it uses its own binary logs to record what changes need to be replicated to which slave. Slony is built completely on top of the Postgres database. It creates a set of tables and triggers in your master database to keep track of which changes need to be replicated to which slave. So, in a way, Slony “pollutes” your schema. If you take a database snapshot, you now have all your tables plus a bunch of Slony stuff. On the other hand, Slony also creates triggers on the slaves to prevent you from accidentally making changes there, something that MySQL will happily let you do.

MySQL replication assumes that you want all databases on the master server to be replicated to each of the slave servers. Slony takes a more fine-grained approach. You need to explicitly specify which set of tables from which database on the master you want replicated to which target slave database. You don’t have to include all tables, and the slave database can live on the same machine as the master if you want.

MySQL replication automatically includes all kinds of changes: inserts, updated, deletes, truncates, table creation, index creation, etc. Slony is far more limited. It can only
replicate changes to tables and sequences. If you have schema changes to make, you need to orchestrate them carefully such that they happen on all databases at the same point in the transaction order.

Slony has it’s own SQL-like language that is consumed by an interpreter called slonik. So, you have to invest a little bit of time learning this language and how to use it to define all the “first-class citizens” of Slony replication: clusters, nodes, sets, subscriptions, etc.

By the way, we found it very convenient to add a table used strictly to track replication latency:

 create table replication_heartbeat(
   id int primary key,
   last_updated timestamp
 );
 insert into replication_heartbeat values(1, now());

We set up a cron job that updates the timestamp on the master every minute, and we set up Nagios alerts both on the master and on the slaves to tell us if the timestamp is older than some threshold. There are various other tools out there to monitor Slony, but we found this scheme to be simple and effective.

Once you have replication set up, watch out for big transactions on the master database. For example, let’s say you issue this query:

 update my_big_table set my_column=null where my_column=0

This statement may take a while to run. Until it completes, no changes will be replicated to the slaves. Once it does complete, though, the changes will be replicated to the slaves one row at a time:

 update my_big_table set my_column=null where id=0;
 update my_big_table set my_column=null where id=1;
 etc.

This takes much longer, and it can cause your replication to fall behind.

Also, be careful to match the horsepower of your master and slave databases. If you have a beefy master database server and a dud slave server, the slave may fall behind as soon as you apply any “significant” changes on the master.


TOOLS AND DOCUMENTATION

make_me_fly_seng_wong.JPG

With MySQL we used the free command-line tool (mysql) as well as a commercial GUI tool (SQL Yog). Both worked well for our use-cases. In particular, it was nice to be able to change data directly in SQL Yog’s result view, much like you would in EXCEL.

Postgres comes with a similar command-line tool: psql. It’s very similar to the mysql command-line tool, but it has it’s own esoteric syntax for various things that simply takes time to learn. For example, instead of “show tables” you use “\d”.

We haven’t found a good replacement SQL Yog yet. We’ve used one free tool, pgAdmin III, but we found that it’s not quite as good. It doesn’t let you edit data in the result view, it tends to open a lot of windows, and it tends to keep database connections open more than it should.

We’ve also found that Postgres documentation is not quite as good as MySQL’s. We’ve gotten by so far, but we definitely have to do more Google’ing for various details that are not fully explained in the core Postgres docs.


So there you have it. I hope these tips help you out.

If you also made the switch to Postgres, we’d love to hear about your experience. Did you run into the same issues? Are there others I forgot to mention? Have you found any useful Postgres tools that we should know about? If you considered Postgres but decided against it, we’d love to know your reasons.

Image credits: Bali Adventure Tours


January 31, 2008

Greasemonkey Hack To Show Walk Score

A number of users have asked us to add a Walk Score widget to our details page and unfortunately we’ve been busy working on a few other things.

Fortunately a crafty Redfin Forums user created a Greasemonkey script for Firefox to add a link to Walk Score on our listing detail’s pages.

Walk Score

To install the widget:

  1. Install the Firefox plugin Greasemonkey
  2. Restart Firefox
  3. Install the Redfin Walkscore Greasemonkey script
  4. Find some Redfin listings and click the link that appears below the address

What other Greasemonkey scripts would be interesting?


October 22, 2007

West by West North

North by Northwest

Listing alerts is a Redfin feature that lets you keep track of new listings in your area. You start by doing a search on our map page, and then you click on the “Save this Search” link:

Save this Search

This lets you assign a name to your search and save it. You can also sign up to get a daily email with new listings that match your criteria.

We recently released an upgrade to the Redfin website. There were a few minor tweaks to the listing alerts code, but nothing significant. Still, a few days later we got an email from a customer saying (to paraphrase): I used to get an email alert almost every day, but recently they stopped, what’s up?

Being the owner of the Listing Alerts feature, my heart skipped a couple of beats. Then I jumped in to analyze the problem. What I noticed was that indeed this customer (Norm) had a steady stream of emails that mysteriously ended. However, the emails didn’t stop on the day of the release; they stopped several days later. I also noticed that Norm made changes to his saved searches right around the time the emails stopped. So, my hypothesis was that Norm’s new searches were somehow more restrictive and we simply didn’t have results for them. To verify this, I did a test run using two of Norm’s searches, an old search and a new search, for the last 24 hours of data. I found that the old search returned results and the new one didn’t, which seemed to confirm my suspicion. So I went ahead and replied to Norm saying: please relax your criteria and try again. Luckily for me, Norm is a stubborn guy. He replied a few minutes later saying: actually, I think my new search is more broad than the previous one.

Now I was really worried. I took another look and I found the smoking gun: for all saved searches created since our recent release, the search area parameter was not saved correctly. To explain what was wrong about it, I need to give some background on how we store your saved search…

Our saved_searches table was changed a little in the recent release. Prior to the release, it looked like this:

[saved_search_id] [name] [min_latitude] [max_latitude] [min_longitude] [max_longitude] [other criteria…]

After the release, it looked like this:

[saved_search_id] [name] [polygon] [other criteria…]

Why did we make this change? Well, this release was also Redfin’s “toe dip” towards changing our database platform from MySQL to Postgres. We found that Postgres gives us better performance on certain kinds of searches (e.g. past-sales searches). So, in this release we replicated our past sales data from MySQL to Postgres and changed the website to issue all past-sales queries to Postgres. Now, to get the performance benefits from Postgres, you have to store your regions in a column of type GEOMETRY, as opposed to simple columns of type DOUBLE or whatnot. So, we wrote the code to store and fetch GEOMETRY types. And once we had it, we figured we may as well use it for saved searches as well — the search area is just a POLYGON after all.

So where’s the problem? The problem is that while in MySQL coordinates are typically stored as (latitude, longitude), Postgres prefers to store them as (longitude, latitude). Or, to be more specific, Postgres doesn’t particularly care which order you use, but the Java library we used (Vivid Solutions) opts for the (longitude, latitude) style. Why does it do that? Probably because it thinks about these coordinates as points in X-Y space, where X is like the longitude (east/west = right/left), and Y is the latitude (north/south = up/down).

We actually noticed this discrepancy when we were close to our code-complete milestone, and we had to make a decision about it. One option was to standardize — to make all coordinates in both databases follow the same (longitude, latitude) style, the style we will eventually adopt when we move to Postgres 100%. Another option was to keep keep MySQL coordinates (latitude, longitude) and to only use (longitude, latitude) in Postgres. We chose the latter approach because it was less disruptive and we were running out of time. Only we didn’t get it quite right — we missed the code used for the saved searches table. The effect was that all existing saved searches kept their (latitude, longitude) format and continued to work, but all new searches were saved as (longitude, latitude) and were busted.

How did we miss this in our testing? I will be the first to admit that we don’t do as much testing at Redfin as we could or should. However, we did have about 40 unit tests and 70 integration tests, and they definitely included scenarios that checked listings that are in/out of the search area. The trouble was that our test cases were too “symmetrical”. This is best explained by an illustration:

Symmetric and Asymmetric Test Cases

Well, when we discovered this problem, we spent the first few minutes being horribly embarrassed for letting it slip through. Then we buckled down, found a solution, and released it later that day. We also added a bunch more test cases to catch the problem in the future. And we sent an email to all the affected users with our apologies for the missed emails. At this point we believe listing alerts work again.

So, what are the lessons?

Strive for consistency in your data, even if it’s more painful and/or takes more time.

Avoid dual-DB situations. Both MySQL and Postgres support the SQL standard, but there are enough differences between them to make life difficult, and this lat/long issue is just one example. Now, we chose to take a toe-dip, and if we had to do it over again I think I would have made the same decision. In other words, do as I say, not as I do. :)

Pick asymmetric test cases. Discussed above.

Add constraints. Our database didn’t complain when the coordinate (47.5, -122), which is in Seattle, was rewritten as (-122, 47). Take a moment to think about that. A longitude of 47 is probably somewhere around India. A latitude of -122 doesn’t exist: it’s beyond the south pole! Why didn’t the database complain? Because as far as the database is concerned, this is just a point, and it’s a perfectly valid point. If you can afford the performance hit, it helps to add constraints to tell the database: in *my* application, a Y value of -122 is no good.

Don’t repeat yourself. The search area for a saved search was actually stored in two formats for two purposes. The first purpose was for evaluating the saved searches for the purpose of sending emails, as discussed above. The second purpose for interactive use – where you can simply pull up your saved search on the website and click “run search”. In the second case, we chose not to use the bounding box because your view may be a little different each time you pull it up, depending on the size of your screen or whether the window is maximized. For interactive use, it’s better to use the center point of your search and a zoom level – this combination can adjust more gracefully to different situations. The problem in our case was that we separately stored the latitude and longitude for interactive use. Had we simply derived them from the polygon’s center, we would have caught this error earlier.

Image credits: http://www.slashfilm.com/2007/02/20/movies-that-should-have-won-an-academy-award-but-didnt/