Archive for the ‘Uncategorized’ Category

September 29, 2009

Installing Beta Builds on iPhone

Probably the hardest part of learning to code an iPhone app is figuring out how to get your app installed on a phone. Even after you’ve installed your app on your personal development phone, installing it on other people’s phones for beta testing is especially tricky.

Here’s the rough outline:

  1. Create an “Ad Hoc” Provisioning Profile (a .mobileprovision file) for beta testing.
  2. Find yourself a friendly beta tester.
  3. Each beta tester needs to give you the Universal Device Identifier (UDID) of his/her device. (This number is not easily visible to users.)
  4. Submit the UDID as a named “Device” to Apple.
  5. Attach the device to your Provisioning Profile. Apple will generate (or regenerate) a .mobileprovision file for you to download. Always use the latest .mobileprovision file.
  6. Build (or rebuild) your software in Xcode using the latest Provisioning Profile.
  7. Zip up your application.
  8. Deliver your finished .app together with the .mobileprovision to your beta tester.
  9. The beta tester installs your provisioning profile, either by dragging it into iTunes or by using the iPhone Configuration Utility (iPCU).
  10. Finally, the beta tester installs your app, using either iTunes or iPCU.

That’s a lot of steps; there’s a lot of room for things to go wrong. Here’s a few gotchas we encountered on our way to releasing the Redfin iPhone app.

Some Zip Tools Don’t Work; Use ditto

Zipping your .app file is surprisingly tricky business. For example, if you use Apache Ant’s <zip> task to create your zipfile, it will ignore the UNIX permissions of the files it zips. When unsuspecting beta testers try to extract the app, they’ll find that it installs successfully under Windows (which doesn’t honor UNIX permissions) but if they install from OS X, the app will immediately crash on startup, with no crash log.

FYI, if you observe this permissions crash via the device console logs, it looks a little like this:

Tue Sep 29 17:47:04 unknown com.apple.launchd[1] : (UIKitApplication:com.redfin.redfin[0x43d3]) posix_spawn("/var/mobile/Applications/2B5E0CE3-362F-4FF0-80A2-C45DE3923C86/Redfin.app/Redfin", ...): Permission denied
Tue Sep 29 17:47:04 unknown com.apple.launchd[1] : (UIKitApplication:com.redfin.redfin[0x43d3]) Exited with exit code: 1
Tue Sep 29 17:47:04 unknown SpringBoard[24] : Failed to spawn Redfin. Unable to obtain a task name port right for pid 13432: (os/kern) failure
Tue Sep 29 17:47:04 unknown com.apple.launchd[1] : (UIKitApplication:com.redfin.redfin[0x43d3]) Throttling respawn: Will start in 2147483647 seconds
Tue Sep 29 17:47:04 unknown SpringBoard[24] : Application 'Redfin' exited abnormally with exit status 1

(Please excuse the scrolling; WordPress automatically screws up this log file entry.)

Even the standard command-line zip tool from Info-ZIP that comes with OS X doesn’t quite do the job. Insidiously, zip archives created with zip work fine for beta testing, but the App Store will reject them, claiming that the app signature is invalid.

You can check the validity of your app signature with codesign -vvvv YourAppName.app. (Yes, all four “v”s are necessary.) Valid apps should look like this:

$ codesign -vvvv Redfin.app
Redfin.app: valid on disk
Redfin.app: satisfies its Designated Requirement

It is very easy to disturb this signature. For example, if you copy the app with cp -r Redfin.app /tmp and then check the signature, you’ll get a message like this:

$ codesign -vvvv Redfin.app
Redfin.app: a sealed resource is missing or invalid
/private/tmp/Redfin.app/CodeResources: resource added

Unfortunately, cp isn’t the only command-line tool that can invalidate the signature. If you create your zip archive using zip -r yourappname YourAppName.app and then unzip it (either with Finder or with unzip), you’ll invalidate the signature.

You don’t get this problem if you use Finder to create the zip archive, by putting YourAppName.app in a folder called YourAppName, and right-clicking on the YourAppName folder and selecting the “Compress” option. To duplicate this functionality automatically, you’ll have to use ditto, which comes with OS X. ditto -c -k YourAppName YourAppName.zip should do the trick.

Don’t Use iTunes; Use the iPhone Configuration Utility

Some people are able to successfully install apps and provisioning profiles using iTunes; if iTunes works for you, great. But a lot of our beta testers weren’t so lucky. iTunes would often fail mysteriously, sometimes with no useful error messages.

The iPhone Configuration Utility (iPCU) allows users to generate log files with error messages; these error messages are critical for diagnosing problems with application installation. More generally, iPCU includes better error messages than iTunes all around; in many cases, switching to iPCU gave our beta testers enough information to solve the problem without the aid of developers.

There’s another good reason to use iPCU instead of iTunes for app installation: you can sync an iPhone with only one iTunes machine. If you want to install beta apps using any other machine, you have to use the iPhone Configuration Utility. In most cases, you can workaround even the trickiest app installation problems just by using iPCU to install from another machine.

To install an app using iPCU:

  1. Download and install the iPhone Configuration Utility from Apple.
  2. Connect your iPhone/iPod to your computer via USB cable.
  3. Close iTunes, if it is open, before launching iPCU
  4. Launch iPCU. You should see your device appear in the “Devices” section on the left side of the window.
  5. Add the provisioning profile (mobile provision) to the iPCU Library: Click on “Provisioning Profiles”, then drag the .mobileprovision file into the Provisioning Profiles pane.
  6. Add the application to the iPCU Library: Click on “Applications”, then drag the app into the Applications Pane. On Windows, the app will be a folder, named something like YourAppName.app. On Mac OS X, the app will be a single application file.
  7. Access your device: Click on your device in the list of devices on the left. You should see five tabs appear in the main window: “Summary”, “Configuration Profiles”, “Provisioning Profiles”, “Applications”, and “Console”.
  8. Install the Provisioning Profile: Click on the “Provisioning Profiles” tab of your device. You should see your Ad Hoc Provisioning Profile in the list of profiles. In the “Install” column you should see either a button that says “Install” or a button that says “Remove” (if you’ve installed this profile previously).
  9. Install the app: Click on your device’s “Applications” tab. You should see your app in the list of applications. In the “Install” column you should see either a button that says “Install” or a button that says “Uninstall” (if you’ve already installed the app).

That’s all there is to it, assuming that nothing goes wrong.

Error Messages: What Could Go Wrong?

  • iPhone Config Utility: “Could not start session with device. Error: kAMDSessionActiveError

    Your device has turned off / fallen asleep. Push the power button, slide to unlock, and try it again.

  • iPhone Config Utility: I tried to install, and I got an error like “Could not install application on device. Error: -402620395.”.

    This is almost certainly because the .mobileprovision didn’t install correctly. Clear out all provisioning profiles from your phone and from the iPCU Library (following directions above), drag your .mobileprovision file into the Library / Provisioning Profiles section, then click on Devices / Your Device / Provisioning Profiles. Make sure you see only one profile on the list, and make sure it’s installed.

  • iPhone Config Utility: I tried to install, and I got an error like “Could not install application on device. Error: -402620393.”.

    We saw this error several times but never figured out the root cause. Installing from another machine worked around the problem.

  • iPhone Config Utility: I tried to install, and I got an error like “Could not transfer application to device. Error: kAMDUndefinedError.”.

    A number of our beta users had this error, but we never deduced the root cause. In at least one case, re-installing the latest version of iPCU resolved the problem. In all known cases, installing from another machine worked around the problem.

  • iTunes: I tried to sync, and I got an error like “The application ‘YourAppName’ was not installed on the iPhone because an unknown error occurred (0xE8008017).”

    This is the same as error -402620393 above. (Note that E8008017 is hexadecimal for the signed integer -402620393.) We saw this error several times but never figured out the root cause. Installing from another machine without iTunes (using iPCU) worked around the problem.

  • iTunes: I tried to sync, and I got an error like “The application ‘YourAppName’ was not installed on the iPhone because an unknown error occurred (0xE8008015).”

    This is the same as error -402620395 above. (Note that E8008015 is hexadecimal for the signed integer -402620395.)This is almost certainly because the .mobileprovision didn’t install correctly. Try dragging the .mobileprovision again to Applications pane in iTunes and syncing. It might also help to delete all of your existing mobile provisions (see Clear Out the Phone Using the Phone below).

Troubleshooting

Still not working? Try this.

  1. Using iTunes? Try the iPhone Configuration Utility.
  2. Try installing from another machine. It’s often especially helpful to try installing from a Mac if you’re failing on Windows, or to try installing from a Windows box if you’re failing on a Mac. (If you’re on Windows and you don’t have access to a Mac, try switching to a Windows machine that does not have iTunes installed.)
  3. Close iTunes. The iPhone Configuration Utility doesn’t work so well if iTunes is open. Note that iTunes auto-launches by default when you plug in an iPhone/iPod.
  4. Try restarting your phone. Press and hold the power button at the top of the phone, slide to power off. Wait until the screen turns off, then press and hold the power button again to start it up again. No, seriously. This actually works, more often than you’d think.
  5. Clear out the app and provisioning profiles correctly, as described below.
  6. Delete your entire iPCU Library, as described below.
  7. Using iPhone Configuration Utility? Try iTunes.
  8. Still not working? Examine the iPCU Log. Directions below.

Clear Out the Phone Using the Phone

Both iTunes and iPCU provide the ability to remove apps and provisioning profiles from the phone, but they don’t work very reliably, especially when you have multiple versions of the same app/profile bouncing around.

The most reliable way to remove an app from the iPhone is to do it the normal way: on the Home screen, press and hold your finger on the app; an X will appear. Tap on the X to delete the app.

You can remove provisioning profiles using the iPhone “Settings” app. In the “General” section, scroll down to the “Profiles” section, select it, tap on the old provisioning profile, and tap “Remove”. (Ordinarily you shouldn’t need to delete provisioning profiles, but the best way to be sure that you’ve installed a .mobileprovision file is to remove all profiles and then install just the provision you need.)

profiles 200x300 Installing Beta Builds on iPhone

Deleting the iPhone Configuration Utility Library

Removing the app and provisioning profile from your iPCU library should just be a simple matter of deleting them in iPCU. But if something goes really wrong, you may have to purge the iPCU library. (In at least one recorded case, purging the iPCU library not only fixed iPCU, but also fixed installing the app via iTunes.)

On OSX, the iPCU library is stored in your Home directory, in ./Library/MobileDevice. You can delete the entire folder.

On Windows, the library is stored in your local application data folder. To find your local app data folder:

  • On Vista, press the Windows key, type %localappdata% in the search box, and press Enter.

  • On XP, press the Windows key, select “Run” and type %userprofile%\Local Settings\Application Data

Go to the “Apple Computer” folder. (There may also be an “Apple” folder and an “Apple_Inc” folder; ignore those.) You should see an “iPhone Configuration Utility” folder here and a “MobileDevice” folder here; delete both of them.

Extracting the iPCU Log

To access your device console, click on your device in the “Devices” section on the left pane of iPCU. There is a “Console” tab on the far right side. Click on that and you should start seeing console messages. (It sometimes takes a few seconds for the messages to appear.) You can click “Save Log As…” to generate a text file that you can send out as an attachment.

September 28, 2009

Maven Reactor Tricks

Not many people know this, but you can use Maven to resume a failed build from the middle, or to build just a subset of projects in a multi-module (a.k.a. “reactor”) build.

Example Reactor Project

Consider this complex multi-module reactor build:

my-root-project
|-- pom.xml
|-- barBusinessLogic
|   `-- pom.xml
|-- bazDataAccess
|   `-- pom.xml
|-- quz
|   |-- pom.xml
|   |-- quzAdditionalLogic
|   |   `-- pom.xml
|   `-- quzUI
|       `-- pom.xml
`-- fooUI
    `-- pom.xml

Suppose project fooUI depends on project barBusinessLogic, which depends on project bazDataAccess.

fooUI --> barBusinessLogic --> bazDataAccess

Furthermore, quzUI depends on quzAdditionalLogic, which depends on barBusinessLogic.

quzUI --> quzAdditionalLogic --> barBusinessLogic --> bazDataAccess

Ordinarily, when you run mvn install from my-root-project, you’ll build the projects in this order:

  1. my-root-project (parent project)
  2. bazDataAccess
  3. barBusinessLogic
  4. fooUI
  5. quz (parent project)
  6. quzAdditionalLogic
  7. quzUI

Resuming the Build with --resume-from

Suppose you’re working on your code and you attempt to run mvn install from my-root-project, but you encounter a test failure in fooUI. You make additional changes to barBusinessLogic without changing bazDataAccess; you know that bazDataAccess is fine, so there’s no need to rebuild/test it. You can then use the --resume-from argument, like this:

mvn install --resume-from=fooUI

That will skip over bazDataAccess and barBusinessLogic and pick up the build where you left off in fooUI. If fooUI succeeds, it will go on to build quzAdditionalLogic and quzUI.

Specify a Subset of Projects with –projects

Suppose you’ve made some changes to fooUI and bazDataAccess and would like to rebuild just those two projects. You can use the --projects argument, like this:

mvn install --projects fooUI,bazDataAccess

That will automatically build just those two projects, saving you the trouble of running Maven in each directory separately.

Making fooUI Without Building quz Using --also-make

Suppose you’re a developer working on fooUI; you don’t want to work on quz right now, but just want to get a working build of fooUI. You can use --also-make, like this:

mvn install --projects fooUI --also-make

--also-make will examine fooUI and walk down its dependency tree, finding all of the projects that it needs to build. In this case, it will automatically build bazDataAccess, barBusinessLogic and fooUI without building quz.

Changing barBusinessLogic and Verifying You Didn’t Break Anything Using --also-make-dependents

Suppose you’ve made a change to barBusinessLogic; you want to make sure you didn’t break any of the projects that depend on you. You also want to avoid rebuilding/testing projects that you know you haven’t changed. In this case, you want to avoid building bazDataAccess. You can use --also-make-dependents, like this:

mvn install --projects barBusinessLogic --also-make-dependents

--also-make-dependents will examine all of the projects in your reactor to find projects that depend on barBusinessLogic; it will automatically build those and anything that depends on them.

Resuming a make Build

When you use --also-make or --also-make-dependents, you run a subset of projects, but that doesn’t mean stuff won’t fail halfway through the build. You can resume an --also-make build from the project that stopped the build by using --resume-from together with --also-make, like this:

mvn install --projects quz/quzUI --also-make --resume-from barBusinessLogic

At Redfin, our Maven reactor has 86 projects; avoiding unnecessary rebuilds is essential to our productivity!


March 6, 2009

Cleaning Whitespace in Postgres

School TeacherAs part of our recent release, we added every survey submitted about our agents to our agent profiles. To prep for this, we needed to do a mail merge with each survey response, which we exported from a Postgres database. The problem was, when our users submitted their answers, they used the skills their grade 3 English teacher taught them and wrote in paragraphs. But all the common export formats indicate new records by using line-breaks, so we needed a way to clean the whitespace from these surveys.

As it turns out, Postgres has some very useful regex functions that make string operations a breeze. But since no one wants to have to reconstruct the appropriate syntax every time they need to clean whitespace, you can make a Postgres function that wraps the functionality:

CREATE OR REPLACE FUNCTION clean_whitespace(to_clean text) RETURNS text AS $$
    BEGIN
        RETURN regexp_replace(to_clean, E'[ \t\n\r]+', ' ', 'g');
    END;
$$ LANGUAGE plpgsql IMMUTABLE;

This replaces each group of whitespace in the argument with a single space. The immutable flag indicates that the function will have no side-effects, and thus allows it to be used in indices. Also notice that we only want to match occurrences of length at least 1 (by using “+” rather than “*”), because otherwise you end up with a space between every character!

Thanks to Thomas Kellerer on the postgres message board for pointing us in the right direction with regards to the arguments we needed.

(Photo credits: jamesdale10 on Flickr)


February 3, 2009

Announcing SitemapGen4j 1.0

Redfin is happy to announce SitemapGen4j 1.0. SitemapGen4j is a library to generate XML sitemaps in Java.

Download SitemapGen4j 1.0

What’s an XML sitemap?

Quoting from sitemaps.org:

Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.

Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.

Sitemap 0.90 is offered under the terms of the Attribution-ShareAlike Creative Commons License and has wide adoption, including support from Google, Yahoo!, and Microsoft.

Getting started

The easiest way to get started is to just use the WebSitemapGenerator class, like this:

WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
wsg.addUrl("http://www.example.com/index.html"); // repeat multiple times
wsg.write();

Configuring options

But there are a lot of nifty options available for URLs and for the generator as a whole. To configure the generator, use a builder:

WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .gzip(true).build(); // enable gzipped output
wsg.addUrl("http://www.example.com/index.html");
wsg.write();

To configure the URLs, construct a WebSitemapUrl with WebSitemapUrl.Options.

WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
WebSitemapUrl url = new WebSitemapUrl.Options("http://www.example.com/index.html")
    .lastMod(new Date()).priority(1.0).changeFreq(ChangeFreq.HOURLY).build();
// this will configure the URL with lastmod=now, priority=1.0, changefreq=hourly
wsg.addUrl(url);
wsg.write();

Configuring the date format

One important configuration option for the sitemap generator is the date format. The W3C datetime standard allows you to choose the precision of your datetime (anything from just specifying the year like “1997″ to specifying the fraction of the second like “1997-07-16T19:20:30.45+01:00″); if you don’t specify one, we’ll try to guess which one you want, and we’ll use the default timezone of the local machine, which might not be what you prefer.

// Use DAY pattern (2009-02-07), Greenwich Mean Time timezone
W3CDateFormat dateFormat = new W3CDateFormat(Pattern.DAY);
dateFormat.setTimeZone(TimeZone.getTimeZone("GMT"));
WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .dateFormat(dateFormat).build(); // actually use the configured dateFormat
wsg.addUrl("http://www.example.com/index.html");
wsg.write();

Lots of URLs: a sitemap index file

One sitemap can contain a maximum of 50,000 URLs. (Some sitemaps, like Google News sitemaps, can contain only 1,000 URLs.) If you need to put more URLs than that in a sitemap, you’ll have to use a sitemap index file. Fortunately, WebSitemapGenerator can manage the whole thing for you.

WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
for (int i = 0; i < 60000; i++) wsg.addUrl("http://www.example.com/doc"+i+".html");
wsg.write();
wsg.writeSitemapsWithIndex(); // generate the sitemap_index.xml

That will generate two sitemaps for 60K URLs: sitemap1.xml (with 50K urls) and sitemap2.xml (with the remaining 10K), and then generate a sitemap_index.xml file describing the two.

It’s also possible to carefully organize your sub-sitemaps. For example, it’s recommended to group URLs with the same changeFreq together (have one sitemap for changeFreq “daily” and another for changeFreq “yearly”), so you can modify the lastMod of the daily sitemap without modifying the lastMod of the yearly sitemap. To do that, just construct your sitemaps one at a time using the WebSitemapGenerator, then use the SitemapIndexGenerator to create a single index for all of them.

WebSitemapGenerator wsg;
// generate foo sitemap
wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .fileNamePrefix("foo").build();
for (int i = 0; i < 5; i++) wsg.addUrl("http://www.example.com/foo"+i+".html");
wsg.write();
// generate bar sitemap
wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .fileNamePrefix("bar").build();
for (int i = 0; i < 5; i++) wsg.addUrl("http://www.example.com/bar"+i+".html");
wsg.write();
// generate sitemap index for foo + bar
SitemapIndexGenerator sig = new SitemapIndexGenerator("http://www.example.com", myFile);
sig.addUrl("http://www.example.com/foo.xml");
sig.addUrl("http://www.example.com/bar.xml");
sig.write();

You could also use the SitemapIndexGenerator to incorporate sitemaps generated by other tools. For example, you might use Google’s official Python sitemap generator to generate some sitemaps, and use WebSitemapGenerator to generate some sitemaps, and use SitemapIndexGenerator to make an index of all of them.

Validate your sitemaps

SitemapGen4j can also validate your sitemaps using the official XML Schema Definition (XSD). If you used SitemapGen4j to make the sitemaps, you shouldn’t need to do this unless there’s a bug in our code. But you can use it to validate sitemaps generated by other tools, and it provides an extra level of safety.

It’s easy to configure the WebSitemapGenerator to automatically validate your sitemaps right after you write them (but this does slow things down, naturally).

WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .autoValidate(true).build(); // validate the sitemap after writing
wsg.addUrl("http://www.example.com/index.html");
wsg.write();

You can also use the SitemapValidator directly to manage sitemaps. It has two methods: validateWebSitemap(File f) and validateSitemapIndex(File f).

Google-specific sitemaps

Google can understand a wide variety of custom sitemap formats that they made up, including a Mobile sitemaps, Geo sitemaps, Code sitemaps (for Google Code search), Google News sitemaps, and Video sitemaps. SitemapGen4j can generate any/all of these different types of sitemaps.

To generate a special type of sitemap, just use GoogleMobileSitemapGenerator, GoogleGeoSitemapGenerator, GoogleCodeSitemapGenerator, GoogleCodeSitemapGenerator, GoogleNewsSitemapGenerator, or GoogleVideoSitemapGenerator instead of WebSitemapGenerator.

You can’t mix-and-match regular URLs with Google-specific sitemaps, so you’ll also have to use a GoogleMobileSitemapUrl, GoogleGeoSitemapUrl, GoogleCodeSitemapUrl, GoogleNewsSitemapUrl, or GoogleVideoSitemapUrl instead of a WebSitemapUrl. Each of them has unique configurable options not available to regular web URLs.


January 16, 2009

Improved Geocoding, Or: How I Learned to Stop Worrying and Love the Map

Those of us on the Redfin data team see every data problem that gets reported by our users. Of course, if we were to respond directly to each of them, we’d never get anything else done (okay, our data’s not that bad), but luckily the product management team takes care of that for us. One of the most bothersome ones we see, partly due to getting at least a dozen each week, is where:

  • The user is reporting we mapped something wrong;
  • We know it’s because our mapping software isn’t perfect;
  • No-of-course-we-don’t-check-each-one-of-our-450,000-listings-by-hand; and
  • Hopefully the user provided enough information that we can correct that one listing

Well, we decided it was time to do something about it. In particular, we upgraded the geocoding algorithm we use to place listings on the map so that, for the listings in our system:

  • Approximately 1.1% have been hand-mapped
  • Our automated “point-level” (that is, mapped to the rooftop of the given address by a geocoder) mapping percentage went from about 53.3% to 69.1%
  • Our percentage of listings that are mapped, but not necessarily to the exact rooftop, went from 35.7% to 23.5%
  • Our unmapped percentage went from 9.9% to 6.3%
  • Our percentage of listings that will never be mappable (due to the agent choosing to not disclose the address) is 5.7%, so this is a pretty big improvement.

How’d we do it? As I’m sure you’ve already noticed we made the switch to Google Maps in Mid-December, much to the dismay of our Birds-Eye loving users. One of the benefits of this was access to Google’s web-based geocoder. We investigated a wholesale replacement of our existing (super-secret) geocoder with the Google geocoder, but decided instead to enhance our geocoding rate and accuracy by integrating Google’s geocoder into our current system and using a feedback algorithm when we knew we weren’t getting the best result possible.Dangerous cliffs

This seems fairly straightforward, but unfortunately there were a few classes of gotchas that we ran into. Many of these stem from us relying on our geocoder not only for geocoding, but for address parsing and normalization as well. Here are the biggest problems we found with the current version of the Google HTTP geocoder.

  • If possible, don’t pass unit information (the “Unit 33″ part of the address) to Google’s geocoder

    Currently it discards the unit information; it’s not returned in the parsed, corrected address, so you get no benefit from inputting it in the first place. The real problem though is that if it finds a better match to the address you input – say, 3000 Federal Avenue, Unit 33 – by replacing the street number with the unit number – for example, saying you live on 33 Federal Avenue when really you live on 3000 North Federal Avenue, Unit 33 – then that’s what it will return you.

  • Google’s geocoder doesn’t warn you when it drastically changes an address – for example Google could do something that seems totally oddball like changing “822 Country Avenue, Quincy, Washington” to “822 North Quincy Street, Arlington, Virginia”, without telling you. Sometimes you’ll see these suggestions as a consumer when Google Maps asks, “Did you mean: 822 North Quincy Street, Arlington, Virginia?” But when you’re dealing with them programmatically, they don’t even ask.

    We solved this in a couple of ways:

    • If the state code changes, we disregard the results. If your input is clean you could actually be stricter about when to disregard the results, but unfortunately when dealing with real estate data it’s common to see a zip code fat-fingered, different city names for the same actual city, a mistyped street name, a missing directional, or the wrong street type.
    • We use the string distance between the input address and our possible results to determine which result is best. Sometimes Google’s geocoder will provide multiple results, and we always have the results from our other geocoder, so this tends to filter out the most erroneous outliers. For example, let’s say our input way “Quincey Road” and we ended up with two results, “Quincey Street” and “Quincy Road”. We would take the second result, because there’s only a one-character difference rather than differing on an entire word.
  • Sometimes Google’s geocoder over-simplifies complex street numbers – reducing “1421-1423 Hayes Street” to just the first address “1421 Hayes Street” (compound addresses like this are somewhat common for tenancy in common listings in San Francisco)

    We got around this simply by checking that the street number of the input corresponds to the street number of the result, and disregarding the result if they don’t match. But it’s important to bear in mind the cases where a simplified version of your input address might be a valid address, albeit not the result you wanted – for the Hayes Street example, it’s very important for us to maintain that we’re talking about both 1421 and 1423 Hayes Street, not just one of them, even though each are valid addresses taken separately. Another great example of this in real estate is when we’re dealing with the historical form of Chicago addresses (which I only learned about because of this problem, and I must say, are completely sweet in their functionality).

There were also a few lessons that we learned and proved to be important:

  •  Google’s address level geocodes (indicated in their system as having accuracy code “8″) can be either point-level or street-interpolated. Their street level geocodes are only on the best-matching street, and don’t appear to take the street number into account when placing the coordinates on the given block. This also means that the street number is not returned as part of the normalized address at this accuracy.
  • Their geocoder can return multiple results, and it’s not always the case that the first result returned is the one you want. Having a good filtering algorithm for choosing the best result is incredibly important.
  • java.util.concurrent has some incredibly powerful and easy-to-use utility classes for multi-threading applications that can be broken up into independent units of work.Streets shut down in Seattle
  • Just because snow shuts down your city doesn’t mean you don’t have to work. Yes, that’s right, we cranked this out over the holidays!

Overall, we’ve been happy with the results of integrating Google’s geocoder (clearly, otherwise we wouldn’t be talking so much about it). Many of our initial concerns ended up being non-issues, partly because Google was so helpful when we brought them up. That being said, there’s still a short list of things we’d like to see added (who knows, maybe they’ll read this!):

  •  The ability to distinguish between point-level and street-interpolated geocodes. This is one of our largest remaining issues, since we take our data quality so seriously and we like to be able to measure it.
  • More point-level data. All of the listings that are mapped directly onto a street rather than over a specific house are placed there because Google didn’t know exactly which house to put them over, only approximately how far down the street they are. We would love to see these directly over houses in the future.
  • Componentized address parsing. Instead of just telling us the result is “710 2nd Avenue” we’d like to know that “710″ is the street number, “2nd” is the street name, and “Avenue” is the street type, without having to do any post-processing on our end.
  • Less latency. Since it’s a web-based service, the network latency can add considerably to the time it takes us to get results back. A batch geocoder would be one possible solution.
  • More throughput. Currently there are caps of 10 requests per second per IP address (so Google can protect against denial-of-service attacks). It would be nice if they could raise or eliminate these caps for customers.
  • A pony. A big, shiny one.

There are still a few areas left where we know our geocoding could still be improved. One of the biggest remaining problems is with vacant land, which might not have a complete address yet, and neither of our geocoders supports partial addresses (such as “123XX Main St”). To take care of these cases, we’ll be looking to integrate a geocoder that can geocode by APN (essentially, a locally unique id that every property has).

Have you found any other bugs in Google’s geocoder that we might not have caught? Or know of any other cities with quirks that make geocoding difficult? Perhaps you know of a good APN geocoder? Let us know!

(Photo credits: tympsy and cheukiecfu on Flickr, respectively)


December 22, 2008

Deciphering Apache error messages and other pleasant pastimes

There comes a dreaded time in every developer’s life when the inevitable happens – you are forced to switch to a new machine, and re-configure the entire dev environment you spent months (or years, in my case) tweaking and perfecting.  For me, that time unexpectedly came last week when the hard disk on my laptop decided it was time for a winter holiday.

Armed with a reimaged laptop and Redfin’s internal developer machine setup guide, I was making decent progress until I hit a stumbling block: Apache was crashing. No matter what I did – on the first browser hit, I was greeted with a friendly error:

Apache error

The Event Log had a slightly more descriptive message:

Faulting application Apache.exe, version 2.0.63.200, faulting module ntdll.dll, version 5.1.2600.2180, fault address 0x00011e58.

Apache’s error log was mysterious in its own way:

[notice]  Parent: child process exited with status 3221225477 -- Restarting.

To save you hours of debugging, hair pulling, reinstalling, and commenting out of httpd.conf that I went through, I’ll just point you directly to the root cause of the problem:  https://issues.apache.org/bugzilla/show_bug.cgi?id=44338: mod_deflate crashes and does not return response.  If you have Apache 2.0.63 on Windows, are loading the mod_deflate.so module, and are using it by including AddOutputFilterByType/SetOutputFilter directives in your httpd.conf – Apache will crash.  Currently there is no fix for this (although the bug above may be updated with a fix in the future) – you have a choice of taking out mod_deflate out of your Apache config, or upgrading to Apache 2.2.

Since mod_deflate is not essential in our developer setup, I happily chose the former option and got my Apache up and running.


December 11, 2008

A Virtual Earth to Google Maps Transition: From Idea to Deployment In a Few Weeks

VE and GMaps, side-by-sideWhenever I out myself as a member of the Redfin search team to someone who has used Redfin, one of the first questions I get is, “so why do you guys use the Microsoft Map?  Why didn’t you choose Google?”.  The full answer is a bit long, but the short answer is easy: speed.

Every few months, we’ve tested Google Maps against Virtual Earth, and the result was always the same: Google’s script and tiles loaded considerably faster, but Virtual Earth was as much as four times faster at adding a ton of items to the map.  Since our user interface can add up to 500 houses at a time to a map, we just felt like Google wasn’t able to give us the performance we needed.  To be fair, part of VE’s speed was due to a bulk add feature that we had lobbied for them to put in, but it worked well, and so we put Google aside, wistfully looking at those speedy script and tile load times.

A few months ago, though, we started contract renegotiations with Microsoft, and we decided to give Google Maps a closer look.  One of my colleagues, the brilliant Dan Fabulich of Selenium fame, figured out that we could code our own custom GOverlay to make Google Maps display items much faster than it had previously.  The question then became: how hard would it be to port our site from one platform to the other?  And would it be worth it to do so? Read the rest of this entry »


May 6, 2008

Fun with generate_series

A few months ago I attended the PostgreSQL conference in Portland, OR. There were a lot of talks ranging from hard-core stuff like Neil Conway‘s talk about the internals of query execution, to random fun stuff like David Fetter‘s discussion of procedural languages, including LOLCODE.

gulakuno Fun with generate series

During their talks, a few people mentioned a handy function called generate_series. It took me a while to discover how useful this function really is. I thought I’d post an example. Here goes…

Let’s say that you have a table with sales information:

postgres=# select * from sales order by date;
    date    | sales_person | part_number
------------+--------------+-------------
 2008-05-05 | Glenn        |           1
 2008-05-05 | Shahaf       |           1
 2008-05-06 | Mike         |           1
 2008-05-06 | Mike         |           2
 2008-05-08 | Glenn        |           1
 2008-05-08 | Shahaf       |           1
 2008-05-08 | Mike         |           2
 2008-05-09 | Mike         |           1
 2008-05-09 | Glenn        |           1
(9 rows)

You might want to get an idea of how many sales happened on each day. You could try to do it with a query like this:

postgres=# select date, count(*) from sales group by date order by date;
    date    | count
------------+-------
 2008-05-05 |     2
 2008-05-06 |     2
 2008-05-08 |     3
 2008-05-09 |     2
(4 rows)

This basically works, but it hides one important fact — no sales happened on May 7.

To fix this, we can use generate_series. When you run this function normally, it just generates a series of numbers:

postgres=# select generate_series(0,4);
generate_series
-----------------
0
1
2
3
4
(5 rows)

However, you can easily change it to generate a series of dates:

postgres=# select generate_series(0,4) + date '2008-05-05' as date;
date
------------
2008-05-05
2008-05-06
2008-05-07
2008-05-08
2008-05-09
(5 rows)

Once you have this, you can now join against the sales table to generate the report:

postgres=# select series.date, count(sales.date)
from (select generate_series(0,4) + date '2008-05-05' as date) as series
left outer join sales on series.date=sales.date group by series.date
order by series.date;
    date    | count
------------+-------
 2008-05-05 |     2
 2008-05-06 |     2
 2008-05-07 |     0
 2008-05-08 |     3
 2008-05-09 |     2
(5 rows)

The trick here is to do a left-outer join betwee the date sequence and the sales, and to count the rows that have a non-null sales date.

This is by no means the only use of generate_series, it’s just the most recent use I found. If you know of other ways to use this function, or if you know of other handy functions, drop a note below.

Image credits: Electric Vehicle Guide


February 13, 2008

MySQL to Postgres

 MySQL to Postgres

A couple of weeks ago we released an upgrade to the Redfin site. Some of the changes were obvious: we now support Safari, we show polygons for different region types on the map (cities, neighborhoods, zip codes, etc.), we show price history on the listing details page, we allow users to download search results as an EXCEL sheet, and so on. And then there are a bunch of changes that are not so obvious, things that happened behind the scenes. Most notably — we switched our database from MySQL to Postgres.

For a website, the database is basically the lowest layer in the stack, so changing it is not a simple thing to do. My colleague Michael Smedberg wrote about the “toe dip” we took in this direction in a prior release: we still used MySQL as our “gold” database, but we replicated some of the data to a Postgres DB in order to support certain past-sale queries that were underperforming on MySQL. When that effort proved successful, we decided to go full throttle and make the switch to Postgres 100%, which is what we just did.

I can say that so far we’re pretty happy. Perhaps the two most important benefits we’ve gained are: 1. with PostGIS we can support true polygon searches, and 2. Postgres’ row-level multi-value concurrency control (MVCC) means that we can run “heavy” import operations on a DB that is serving live queries without worrying about locking or performance loss. And of course there are other benefits as well.

But in this blog post I thought I’d talk about the gotchas we faced along the way as we transitioned from MySQL to Postgres. If you’re contemplating your own switch, hopefully this will save you some time. Here goes…


MIGRATING THE DATA

 MySQL to Postgres

Some of our first steps in migrating to Postgres were to define the new Postgres schema, to create an empty database with this schema, and to migrate data over from a the MySQL database.

In doing this, we discovered that the data types supported by MySQL are not all supported by Postgres. For example: double, tinyint, enum, and datetime don’t exist at all. There are often reasonable alternatives (e.g. timestamp instead of datetime), but in some cases you have to resort to a data type that is not quite the same (e.g. smallint instead of tinyint). Note: we are using Postgres version 8.2; some of these types (e.g. enum) are available in the recently released version 8.3.

So, in migrating our data we had to do some translation/mapping of our data into the new schema. We figured that this was also a golden opportunity to do some additional cleanup: to remove unused tables, to remove unused columns, to name columns more consistently, to add some constraints, to remove duplicate data, etc.

We needed a migration tool that was flexible enough to handle all these needs, and we ended up writing our own — a java program we call MySqlToPostgres. We considered other alternatives, such as using the Postgres COPY command to bulk load from file. But our own migration tool ended up being a good choice given all the transformations we wanted to do in-transit.

If you’re going to write your own tool, here are a few tips to make the migration faster:

  1. Make sure to start with a target database that has tables but has no indexes or unique constraints; add those only after you migrate your data, it makes a huge difference.
  2. Migrate the data using a few big transactions, instead of many small transactions. In our case, we issued a “begin” before inserting any rows into a given table and a “commit” after the last row was inserted. In between, we issued giant “insert” statements, each of which contained many rows. Here’s an example of a single insert statement with two rows:
     INSERT INTO films (code, title, did, date_prod, kind) VALUES
         ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
         ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');
  3. There’s a setting in the postgresql.conf file you can use to log slow queries. Normally we set it to 200, so that we’ll know about queries that take more than 200 milliseconds: log_min_duration_statement=200. However, while doing our migration, it was important to disable this option by setting it to 0. Otherwise our slow query log grew very large with all those multi-row insert statements.

Most of our data is stored in “regular” columns: numeric, string, etc. However, we have one table with some binary data, and this data proved particularly challenging. The Postgres documentation tells you how to format binary data as a string in an insert statement. The trick is that while MySQL and Postgres both treat a byte as an unsigned value between 0 and 255, Java treats bytes as signed values between -127 and 128. That means that if you write your own Java migration tool, you have to do some tricky arithmetic with the “bytes” that Java gives you to convert them into the kind of bytes that Postgres expects.


CONNECTIONS

 MySQL to Postgres

With most databases, you’re allowed to connect to one database and then issue the use command to start dealing with a different database. Not so with Postgres. When you establish a connection to Postgres it’s to a specific database, and the “use” command doesn’t exist. This also means that you can’t issue queries that span multiple databases.

During the development cycle, especially in tests, it’s helpful to create a fresh database, do some stuff with it, and then drop it. Most databases will prevent you from dropping the database if there are users connected to it, but Postgres is particularly draconian about it. There’s no easy way to say “just drop it anyways and kill all connections”. So, you have to be careful to close all your DB connections at the end of the test, before trying to clean up the DB. Of course, this is good practice anyways, but it’s sometimes tricky to discover who’s holding onto that one connection.

Postgres has a setting that defines the maximum number of connections allowed to the DB: max_connections. By default, this is set to 100. You need to make sure that this value is big enough to support your worst-case situation. In particular, your website probably has some connection pooling mechanism on each web server, and this connection pool probably has a cap. You need to make sure that the DB can handle the case of each web server asking for the max number of connections. You should also check whether your connection pool is configured to reap unused connections, although you shouldn’t count on reaping to keep you below the worst-case scenario. If you find that there’s a mismatch, there are two things you can do: reduce the connection pool sizes all around, or increase the DB connections cap. If you want to do the latter, be aware that this will at least require restarting the DB. In some cases it may even require restarting the server completely. So, it’s best to decide ahead of time how many connections you’ll need in the worst case and to configure all settings accordingly.


SEQUENCES

 MySQL to Postgres

There are many situations where it’s convenient to have the database generate ID’s for new records. In MySQL, this is a pretty simple matter: you set the AUTO_INCREMENT attribute on your column and it automatically picks up the next value.

With Postgres it’s a little trickier. The AUTO_INCREMENT keyword doesn’t exist. Instead you have sequences. You can create a sequence and call nextval(‘my_sequence_name’) to retrieve the next value. You can also associate this sequence with a column by specifying that the default value for the column is nextval(‘my_sequence_name’). There’s also a shortcut for defining all this: serial.

Having these independent sequences in Postgres gives you the flexibility to do interesting things, like to associate the same sequence with multiple tables. But in practice we haven’t found this to be particularly useful. Also, it was a little more challenging to configure our O-R mapping layer, hibernate, to recognize and use these sequences. It involved adorning our classes with the @SequenceGenerator tag, and our ID columns with the @GeneratedValue tag.

Finally, it was important to explicitly initialize these sequences to the next available value after migrating data from MySQL to Postgres.


CASING

 MySQL to Postgres

In MySQL, and in may other databases, string comparisons are case insensitive. For example, on MySQL you would get one result for each of the following queries:

 create table names(firstname varchar(10));
 insert into names values ('Shahaf');
 select * from names where firstname='Shahaf';
 select * from names where firstname='SHAHAF';
 select * from names where firstname='sHaHaF';

Postgres, on the other hand, is case-sensitive. That means that only the first query would return a result.

After some research, we found three ways to deal with this problem:

  1. Use the upper() or lower() Postgres functions to normalize casing before comparison
  2. Use the case-insensitive regular expression operator: ~*
  3. Use the case-insensitive string-matching operator: ilike

The problem with #2 and #3 is that they don’t support index usage in some or all of the cases. The only way to get index usage (and therefore better performance) is to go with the upper() or lower() approach. By the way, we didn’t find any difference between upper() and lower(), so we arbitrarily went with upper().

Of course, it’s a pain to have to actually store your data in all upper-case. You might want to display it with normal casing, and you don’t want to have to store two copies. Luckily, you can store your data using normal casing, but create a functional index on the upper()-cased values. For example:

 create index employees_name on employees(upper(name));

After that, you just need to remember to upper-case any arguments you get, and to always construct queries that compare against upper(name). If you accidentally generate a query that compares against name (without the upper()), then the index will go unused and you’ll see bad performance.


TEXT SEARCH

 MySQL to Postgres

Text search is one of those areas where MySQL has a basic search solution that works and requires little fuss, whereas Postgres has a myriad of solutions that require more TLC. One of these solutions, TSearch2, is now included in the 8.3 version of Postgres, so perhaps it’s becoming a kind of standard.

TSearch2 is built completely on top of Postgres. To use it, you need to add a set of objects to your database: functions, tables, etc. If you want to be able to search some text columns in a given table, you need to add a new column to that table, a column of type tsvector. You need to bulk-populate it using a tsearch2 function with the raw data from your string columns. And you need to define some triggers to keep this tsvector column updated whenever the source columns change. You can then define an index on this tsvector column, and start issuing text search queries, again using some special tsearch2 functions.

We’ve had reasonable success with TSearch2 so far, but we haven’t exploited all its functionality so far, and we’re still considering other approaches (e.g. Lucene).


REPLICATION

 MySQL to Postgres

Replication is much like text search: MySQL has a basic replication solution that works, and Postgres has a bunch of competing solutions that are far more complex to implement. However, as with TSearch2, there’s one solution that’s becoming a kind of standard and is now included in the 8.3 version of Postgres: Slony.

Slony is a single-master, multi-slave, asynchronous replication system. In other words, you update your data in one master DB only, but you can query the data from the master or from the slaves. Just beware that the data on the slaves may be a little stale. In that regard, Slony is very similar to MySQL replication. However, the similarities pretty much end there.

MySQL replication is a built-in part of the database, and it uses its own binary logs to record what changes need to be replicated to which slave. Slony is built completely on top of the Postgres database. It creates a set of tables and triggers in your master database to keep track of which changes need to be replicated to which slave. So, in a way, Slony “pollutes” your schema. If you take a database snapshot, you now have all your tables plus a bunch of Slony stuff. On the other hand, Slony also creates triggers on the slaves to prevent you from accidentally making changes there, something that MySQL will happily let you do.

MySQL replication assumes that you want all databases on the master server to be replicated to each of the slave servers. Slony takes a more fine-grained approach. You need to explicitly specify which set of tables from which database on the master you want replicated to which target slave database. You don’t have to include all tables, and the slave database can live on the same machine as the master if you want.

MySQL replication automatically includes all kinds of changes: inserts, updated, deletes, truncates, table creation, index creation, etc. Slony is far more limited. It can only
replicate changes to tables and sequences. If you have schema changes to make, you need to orchestrate them carefully such that they happen on all databases at the same point in the transaction order.

Slony has it’s own SQL-like language that is consumed by an interpreter called slonik. So, you have to invest a little bit of time learning this language and how to use it to define all the “first-class citizens” of Slony replication: clusters, nodes, sets, subscriptions, etc.

By the way, we found it very convenient to add a table used strictly to track replication latency:

 create table replication_heartbeat(
   id int primary key,
   last_updated timestamp
 );
 insert into replication_heartbeat values(1, now());

We set up a cron job that updates the timestamp on the master every minute, and we set up Nagios alerts both on the master and on the slaves to tell us if the timestamp is older than some threshold. There are various other tools out there to monitor Slony, but we found this scheme to be simple and effective.

Once you have replication set up, watch out for big transactions on the master database. For example, let’s say you issue this query:

 update my_big_table set my_column=null where my_column=0

This statement may take a while to run. Until it completes, no changes will be replicated to the slaves. Once it does complete, though, the changes will be replicated to the slaves one row at a time:

 update my_big_table set my_column=null where id=0;
 update my_big_table set my_column=null where id=1;
 etc.

This takes much longer, and it can cause your replication to fall behind.

Also, be careful to match the horsepower of your master and slave databases. If you have a beefy master database server and a dud slave server, the slave may fall behind as soon as you apply any “significant” changes on the master.


TOOLS AND DOCUMENTATION

 MySQL to Postgres

With MySQL we used the free command-line tool (mysql) as well as a commercial GUI tool (SQL Yog). Both worked well for our use-cases. In particular, it was nice to be able to change data directly in SQL Yog’s result view, much like you would in EXCEL.

Postgres comes with a similar command-line tool: psql. It’s very similar to the mysql command-line tool, but it has it’s own esoteric syntax for various things that simply takes time to learn. For example, instead of “show tables” you use “\d”.

We haven’t found a good replacement SQL Yog yet. We’ve used one free tool, pgAdmin III, but we found that it’s not quite as good. It doesn’t let you edit data in the result view, it tends to open a lot of windows, and it tends to keep database connections open more than it should.

We’ve also found that Postgres documentation is not quite as good as MySQL’s. We’ve gotten by so far, but we definitely have to do more Google’ing for various details that are not fully explained in the core Postgres docs.


So there you have it. I hope these tips help you out.

If you also made the switch to Postgres, we’d love to hear about your experience. Did you run into the same issues? Are there others I forgot to mention? Have you found any useful Postgres tools that we should know about? If you considered Postgres but decided against it, we’d love to know your reasons.

Image credits: Bali Adventure Tours


January 31, 2008

Greasemonkey Hack To Show Walk Score

A number of users have asked us to add a Walk Score widget to our details page and unfortunately we’ve been busy working on a few other things.

Fortunately a crafty Redfin Forums user created a Greasemonkey script for Firefox to add a link to Walk Score on our listing detail’s pages.

Walk Score

To install the widget:

  1. Install the Firefox plugin Greasemonkey
  2. Restart Firefox
  3. Install the Redfin Walkscore Greasemonkey script
  4. Find some Redfin listings and click the link that appears below the address

What other Greasemonkey scripts would be interesting?


close