Those of us on the Redfin data team see every data problem that gets reported by our users. Of course, if we were to respond directly to each of them, we’d never get anything else done (okay, our data’s not that bad), but luckily the product management team takes care of that for us. One of the most bothersome kinds of report, partly because we get at least a dozen of them each week, is the one where:
- The user is reporting we mapped something wrong;
- We know it’s because our mapping software isn’t perfect;
- No-of-course-we-don’t-check-each-one-of-our-450,000-listings-by-hand; and
- Hopefully the user provided enough information that we can correct that one listing
Well, we decided it was time to do something about it. In particular, we upgraded the geocoding algorithm we use to place listings on the map so that, for the listings in our system:
- Approximately 1.1% have been hand-mapped
- Our automated “point-level” (that is, mapped to the rooftop of the given address by a geocoder) mapping percentage went from about 53.3% to 69.1%
- Our percentage of listings that are mapped, but not necessarily to the exact rooftop, went from 35.7% to 23.5%
- Our unmapped percentage went from 9.9% to 6.3%
- Our percentage of listings that will never be mappable (because the agent chose not to disclose the address) is 5.7%, so only about 0.6% of our listings are now unmapped for reasons we could still fix. That’s a pretty big improvement.
How’d we do it? As I’m sure you’ve already noticed, we made the switch to Google Maps in mid-December, much to the dismay of our Bird’s-Eye-loving users. One of the benefits of this was access to Google’s web-based geocoder. We investigated a wholesale replacement of our existing (super-secret) geocoder with the Google geocoder, but decided instead to improve our geocoding rate and accuracy by integrating Google’s geocoder into our current system and using a feedback algorithm whenever we knew we weren’t getting the best possible result.
This seems fairly straightforward, but unfortunately there were a few classes of gotchas that we ran into. Many of these stem from us relying on our geocoder not only for geocoding, but for address parsing and normalization as well. Here are the biggest problems we found with the current version of the Google HTTP geocoder.
- If possible, don’t pass unit information (the “Unit 33” part of the address) to Google’s geocoder.
Currently it discards the unit information; it’s not returned in the parsed, corrected address, so you get no benefit from inputting it in the first place. The real problem, though, is that the geocoder may decide it has found a better match by swapping the unit number in for the street number. If you input 3000 Federal Avenue, Unit 33, it might say you live at 33 Federal Avenue when really you live at 3000 North Federal Avenue, Unit 33, and that’s the result it will return.
- Google’s geocoder doesn’t warn you when it drastically changes an address. For example, it could do something that seems totally oddball, like changing “822 Country Avenue, Quincy, Washington” to “822 North Quincy Street, Arlington, Virginia”, without telling you. Sometimes you’ll see these suggestions as a consumer when Google Maps asks, “Did you mean: 822 North Quincy Street, Arlington, Virginia?” But when you’re dealing with the geocoder programmatically, it doesn’t even ask.
We solved this in a couple of ways:
- If the state code changes, we disregard the results. If your input is clean you could actually be stricter about when to disregard the results, but unfortunately when dealing with real estate data it’s common to see a zip code fat-fingered, different city names for the same actual city, a mistyped street name, a missing directional, or the wrong street type.
- We use the string distance between the input address and our possible results to determine which result is best. Sometimes Google’s geocoder will provide multiple results, and we always have the results from our other geocoder, so this tends to filter out the most erroneous outliers. For example, let’s say our input was “Quincey Road” and we ended up with two results, “Quincey Street” and “Quincy Road”. We would take the second result, because it differs by only one character rather than by an entire word.
- Sometimes Google’s geocoder over-simplifies complex street numbers, reducing “1421-1423 Hayes Street” to just the first address, “1421 Hayes Street”. Compound addresses like this are somewhat common for tenancy-in-common listings in San Francisco.
We got around this simply by checking that the street number of the input corresponds to the street number of the result, and disregarding the result if they don’t match. But it’s important to bear in mind the cases where a simplified version of your input address might be a valid address, albeit not the result you wanted. For the Hayes Street example, it’s very important for us to preserve the fact that we’re talking about both 1421 and 1423 Hayes Street, not just one of them, even though each is a valid address taken separately. Another great example of this in real estate is when we’re dealing with the historical form of Chicago addresses (which I only learned about because of this problem, and I must say, are completely sweet in their functionality).
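As a rough illustration of the workarounds described above, here is a minimal Python sketch. It is not Redfin’s actual implementation: the `Candidate` record, the helper names, the unit-stripping pattern, and the choice of Levenshtein edit distance as the “string distance” are all assumptions made for the example, and real geocoder responses carry more fields than this.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical candidate record; real geocoder responses look different.
@dataclass
class Candidate:
    address: str  # normalized street address returned by a geocoder
    state: str    # two-letter state code of the result

# Assumed pattern for trailing unit designators such as "Unit 33" or "#4B".
UNIT_RE = re.compile(
    r",?\s*(?:\b(?:unit|apt|suite|ste)\b\.?|#)\s*[\w-]+\s*$", re.IGNORECASE
)

def strip_unit(street: str) -> str:
    """Drop a trailing unit designator before sending the address out."""
    return UNIT_RE.sub("", street).strip()

def street_number(street: str) -> str:
    """Return the leading street number, keeping ranges like '1421-1423'."""
    match = re.match(r"\s*(\d+(?:-\d+)?)", street)
    return match.group(1) if match else ""

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]

def pick_best(input_street: str, input_state: str,
              candidates: List[Candidate]) -> Optional[Candidate]:
    """Discard suspicious candidates, then keep the closest remaining one."""
    wanted_number = street_number(input_street)
    plausible = [
        c for c in candidates
        if c.state == input_state                      # state code must not change
        and street_number(c.address) == wanted_number  # street number must survive
    ]
    if not plausible:
        return None
    target = input_street.lower()
    return min(plausible, key=lambda c: edit_distance(target, c.address.lower()))
```

In this sketch, `strip_unit` would run on the input before it goes to the geocoder, and `pick_best` would run on the pooled candidates from every geocoder. Comparing the whole street-number string, range included, is what rejects “1421 Hayes Street” as a result for “1421-1423 Hayes Street”.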
We also learned a few lessons that proved to be important:
- Google’s address-level geocodes (indicated in their system as having accuracy code “8”) can be either point-level or street-interpolated. Their street-level geocodes are only on the best-matching street, and don’t appear to take the street number into account when placing the coordinates on the given block. This also means that the street number is not returned as part of the normalized address at this accuracy.
- Their geocoder can return multiple results, and it’s not always the case that the first result returned is the one you want. Having a good filtering algorithm for choosing the best result is incredibly important.
- java.util.concurrent has some incredibly powerful and easy-to-use utility classes for multi-threading applications that can be broken up into independent units of work.
- Just because snow shuts down your city doesn’t mean you don’t have to work. Yes, that’s right, we cranked this out over the holidays!
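The point about java.util.concurrent is specific to Java, but the underlying pattern of splitting a job into independent units of work is easy to show. The sketch below uses Python’s standard `concurrent.futures` module as a loose analogue, not the Java classes mentioned above. The `geocode_one` function is a hypothetical stand-in that makes no network calls, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable

def geocode_one(address: str) -> str:
    """Hypothetical stand-in for a network geocoding call (no I/O here)."""
    return address.strip().upper()

def geocode_all(addresses: Iterable[str],
                worker: Callable[[str], str] = geocode_one,
                max_workers: int = 8) -> Dict[str, str]:
    """Run one independent unit of work per address on a thread pool."""
    address_list = list(addresses)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with address_list.
        results = list(pool.map(worker, address_list))
    return dict(zip(address_list, results))
```

Because each address is geocoded independently, the threads never need to coordinate with one another, which is what makes this kind of workload such a good fit for a simple thread pool.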
Overall, we’ve been happy with the results of integrating Google’s geocoder (clearly, otherwise we wouldn’t be talking so much about it). Many of our initial concerns ended up being non-issues, partly because Google was so helpful when we brought them up. That being said, there’s still a short list of things we’d like to see added (who knows, maybe they’ll read this!):
- The ability to distinguish between point-level and street-interpolated geocodes. This is one of our largest remaining issues, since we take our data quality so seriously and we like to be able to measure it.
- More point-level data. All of the listings that are mapped directly onto a street rather than over a specific house are placed there because Google didn’t know exactly which house to put them over, only approximately how far down the street they are. We would love to see these directly over houses in the future.
- Componentized address parsing. Instead of just telling us the result is “710 2nd Avenue” we’d like to know that “710” is the street number, “2nd” is the street name, and “Avenue” is the street type, without having to do any post-processing on our end.
- Less latency. Since it’s a web-based service, the network latency can add considerably to the time it takes us to get results back. A batch geocoder would be one possible solution.
- More throughput. Currently there are caps of 10 requests per second per IP address (so Google can protect against denial-of-service attacks). It would be nice if they could raise or eliminate these caps for customers.
- A pony. A big, shiny one.
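To make the componentized-parsing wish concrete, here is a minimal Python sketch of the kind of post-processing it would spare us. The function name, the tiny `STREET_TYPES` table, and the regular expression are all assumptions for illustration; real addresses with directionals, units, or unusual street types need far more care than this.

```python
import re
from typing import Dict, Optional

# Small, assumed sample of street types; a real table would be far longer.
STREET_TYPES = {"avenue", "ave", "street", "st", "road", "rd",
                "boulevard", "blvd", "drive", "dr"}

def split_street_address(street: str) -> Optional[Dict[str, str]]:
    """Split '710 2nd Avenue' into street number, name, and type."""
    match = re.match(r"^\s*(\d+(?:-\d+)?)\s+(.+?)\s*$", street)
    if not match:
        return None  # no leading street number, so nothing to split
    number, rest = match.groups()
    words = rest.split()
    if len(words) >= 2 and words[-1].lower().rstrip(".") in STREET_TYPES:
        return {"number": number, "name": " ".join(words[:-1]), "type": words[-1]}
    # No recognizable street type: treat everything after the number as the name.
    return {"number": number, "name": rest, "type": ""}
```

For the example in the list above, `split_street_address("710 2nd Avenue")` yields the number “710”, the name “2nd”, and the type “Avenue”.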
There are a few areas left where we know our geocoding could still be improved. One of the biggest remaining problems is vacant land, which might not have a complete address yet; neither of our geocoders supports partial addresses (such as “123XX Main St”). To take care of these cases, we’ll be looking to integrate a geocoder that can geocode by APN (Assessor’s Parcel Number; essentially, a locally unique ID that every property has).
Have you found any other bugs in Google’s geocoder that we might not have caught? Or know of any other cities with quirks that make geocoding difficult? Perhaps you know of a good APN geocoder? Let us know!