March 6, 2009
As part of our recent release, we added every survey submitted about our agents to our agent profiles. To prep for this, we needed to do a mail merge with each survey response, which we exported from a Postgres database. The problem was, when our users submitted their answers, they used the skills their grade 3 English teacher taught them and wrote in paragraphs. But all the common export formats indicate new records by using line-breaks, so we needed a way to clean the whitespace from these surveys.
As it turns out, Postgres has some very useful regex functions that make string operations a breeze. But since no one wants to have to reconstruct the appropriate syntax every time they need to clean whitespace, you can make a Postgres function that wraps the functionality:
CREATE OR REPLACE FUNCTION clean_whitespace(to_clean text) RETURNS text AS $$
BEGIN
RETURN regexp_replace(to_clean, E'[ \t\n\r]+', ' ', 'g');
END;
$$ LANGUAGE plpgsql IMMUTABLE;
This replaces each group of whitespace in the argument with a single space. The immutable flag indicates that the function will have no side-effects, and thus allows it to be used in indices. Also notice that we only want to match occurrences of length at least 1 (by using “+” rather than “*”), because otherwise you end up with a space between every character!
Thanks to Thomas Kellerer on the postgres message board for pointing us in the right direction with regards to the arguments we needed.
(Photo credits: jamesdale10 on Flickr)
February 3, 2009
Redfin is happy to announce SitemapGen4j 1.0. SitemapGen4j is a library to generate XML sitemaps in Java.
Download SitemapGen4j 1.0
What’s an XML sitemap?
Quoting from sitemaps.org:
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.
Sitemap 0.90 is offered under the terms of the Attribution-ShareAlike Creative Commons License and has wide adoption, including support from Google, Yahoo!, and Microsoft.
Getting started
The easiest way to get started is to just use the WebSitemapGenerator class, like this:
WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
wsg.addUrl("http://www.example.com/index.html"); // repeat multiple times
wsg.write();
Configuring options
But there are a lot of nifty options available for URLs and for the generator as a whole. To configure the generator, use a builder:
WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
.gzip(true).build(); // enable gzipped output
wsg.addUrl("http://www.example.com/index.html");
wsg.write();
To configure the URLs, construct a WebSitemapUrl with WebSitemapUrl.Options.
WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
WebSitemapUrl url = new WebSitemapUrl.Options("http://www.example.com/index.html")
.lastMod(new Date()).priority(1.0).changeFreq(ChangeFreq.HOURLY).build();
// this will configure the URL with lastmod=now, priority=1.0, changefreq=hourly
wsg.addUrl(url);
wsg.write();
Configuring the date format
One important configuration option for the sitemap generator is the date format. The W3C datetime standard allows you to choose the precision of your datetime (anything from just specifying the year like “1997″ to specifying the fraction of the second like “1997-07-16T19:20:30.45+01:00″); if you don’t specify one, we’ll try to guess which one you want, and we’ll use the default timezone of the local machine, which might not be what you prefer.
// Use DAY pattern (2009-02-07), Greenwich Mean Time timezone
W3CDateFormat dateFormat = new W3CDateFormat(Pattern.DAY);
dateFormat.setTimeZone(TimeZone.getTimeZone("GMT"));
WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
.dateFormat(dateFormat).build(); // actually use the configured dateFormat
wsg.addUrl("http://www.example.com/index.html");
wsg.write();
Lots of URLs: a sitemap index file
One sitemap can contain a maximum of 50,000 URLs. (Some sitemaps, like Google News sitemaps, can contain only 1,000 URLs.) If you need to put more URLs than that in a sitemap, you’ll have to use a sitemap index file. Fortunately, WebSitemapGenerator can manage the whole thing for you.
WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
for (int i = 0; i < 60000; i++) wsg.addUrl("http://www.example.com/doc"+i+".html");
wsg.write();
wsg.writeSitemapsWithIndex(); // generate the sitemap_index.xml
That will generate two sitemaps for 60K URLs: sitemap1.xml (with 50K urls) and sitemap2.xml (with the remaining 10K), and then generate a sitemap_index.xml file describing the two.
It’s also possible to carefully organize your sub-sitemaps. For example, it’s recommended to group URLs with the same changeFreq together (have one sitemap for changeFreq “daily” and another for changeFreq “yearly”), so you can modify the lastMod of the daily sitemap without modifying the lastMod of the yearly sitemap. To do that, just construct your sitemaps one at a time using the WebSitemapGenerator, then use the SitemapIndexGenerator to create a single index for all of them.
WebSitemapGenerator wsg;
// generate foo sitemap
wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
.fileNamePrefix("foo").build();
for (int i = 0; i < 5; i++) wsg.addUrl("http://www.example.com/foo"+i+".html");
wsg.write();
// generate bar sitemap
wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
.fileNamePrefix("bar").build();
for (int i = 0; i < 5; i++) wsg.addUrl("http://www.example.com/bar"+i+".html");
wsg.write();
// generate sitemap index for foo + bar
SitemapIndexGenerator sig = new SitemapIndexGenerator("http://www.example.com", myFile);
sig.addUrl("http://www.example.com/foo.xml");
sig.addUrl("http://www.example.com/bar.xml");
sig.write();
You could also use the SitemapIndexGenerator to incorporate sitemaps generated by other tools. For example, you might use Google’s official Python sitemap generator to generate some sitemaps, and use WebSitemapGenerator to generate some sitemaps, and use SitemapIndexGenerator to make an index of all of them.
Validate your sitemaps
SitemapGen4j can also validate your sitemaps using the official XML Schema Definition (XSD). If you used SitemapGen4j to make the sitemaps, you shouldn’t need to do this unless there’s a bug in our code. But you can use it to validate sitemaps generated by other tools, and it provides an extra level of safety.
It’s easy to configure the WebSitemapGenerator to automatically validate your sitemaps right after you write them (but this does slow things down, naturally).
WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
.autoValidate(true).build(); // validate the sitemap after writing
wsg.addUrl("http://www.example.com/index.html");
wsg.write();
You can also use the SitemapValidator directly to manage sitemaps. It has two methods: validateWebSitemap(File f) and validateSitemapIndex(File f).
Google-specific sitemaps
Google can understand a wide variety of custom sitemap formats that they made up, including a Mobile sitemaps, Geo sitemaps, Code sitemaps (for Google Code search), Google News sitemaps, and Video sitemaps. SitemapGen4j can generate any/all of these different types of sitemaps.
To generate a special type of sitemap, just use GoogleMobileSitemapGenerator, GoogleGeoSitemapGenerator, GoogleCodeSitemapGenerator, GoogleCodeSitemapGenerator, GoogleNewsSitemapGenerator, or GoogleVideoSitemapGenerator instead of WebSitemapGenerator.
You can’t mix-and-match regular URLs with Google-specific sitemaps, so you’ll also have to use a GoogleMobileSitemapUrl, GoogleGeoSitemapUrl, GoogleCodeSitemapUrl, GoogleNewsSitemapUrl, or GoogleVideoSitemapUrl instead of a WebSitemapUrl. Each of them has unique configurable options not available to regular web URLs.
January 16, 2009
Those of us on the Redfin data team see every data problem that gets reported by our users. Of course, if we were to respond directly to each of them, we’d never get anything else done (okay, our data’s not that bad), but luckily the product management team takes care of that for us. One of the most bothersome ones we see, partly due to getting at least a dozen each week, is where:
- The user is reporting we mapped something wrong;
- We know it’s because our mapping software isn’t perfect;
- No-of-course-we-don’t-check-each-one-of-our-450,000-listings-by-hand; and
- Hopefully the user provided enough information that we can correct that one listing
Well, we decided it was time to do something about it. In particular, we upgraded the geocoding algorithm we use to place listings on the map so that, for the listings in our system:
- Approximately 1.1% have been hand-mapped
- Our automated “point-level” (that is, mapped to the rooftop of the given address by a geocoder) mapping percentage went from about 53.3% to 69.1%
- Our percentage of listings that are mapped, but not necessarily to the exact rooftop, went from 35.7% to 23.5%
- Our unmapped percentage went from 9.9% to 6.3%
- Our percentage of listings that will never be mappable (due to the agent choosing to not disclose the address) is 5.7%, so this is a pretty big improvement.
How’d we do it? As I’m sure you’ve already noticed we made the switch to Google Maps in Mid-December, much to the dismay of our Birds-Eye loving users. One of the benefits of this was access to Google’s web-based geocoder. We investigated a wholesale replacement of our existing (super-secret) geocoder with the Google geocoder, but decided instead to enhance our geocoding rate and accuracy by integrating Google’s geocoder into our current system and using a feedback algorithm when we knew we weren’t getting the best result possible.
This seems fairly straightforward, but unfortunately there were a few classes of gotchas that we ran into. Many of these stem from us relying on our geocoder not only for geocoding, but for address parsing and normalization as well. Here are the biggest problems we found with the current version of the Google HTTP geocoder.
- If possible, don’t pass unit information (the “Unit 33″ part of the address) to Google’s geocoder
Currently it discards the unit information; it’s not returned in the parsed, corrected address, so you get no benefit from inputting it in the first place. The real problem though is that if it finds a better match to the address you input – say, 3000 Federal Avenue, Unit 33 – by replacing the street number with the unit number – for example, saying you live on 33 Federal Avenue when really you live on 3000 North Federal Avenue, Unit 33 – then that’s what it will return you.
- Google’s geocoder doesn’t warn you when it drastically changes an address – for example Google could do something that seems totally oddball like changing “822 Country Avenue, Quincy, Washington” to “822 North Quincy Street, Arlington, Virginia”, without telling you. Sometimes you’ll see these suggestions as a consumer when Google Maps asks, “Did you mean: 822 North Quincy Street, Arlington, Virginia?” But when you’re dealing with them programmatically, they don’t even ask.
We solved this in a couple of ways:
- If the state code changes, we disregard the results. If your input is clean you could actually be stricter about when to disregard the results, but unfortunately when dealing with real estate data it’s common to see a zip code fat-fingered, different city names for the same actual city, a mistyped street name, a missing directional, or the wrong street type.
- We use the string distance between the input address and our possible results to determine which result is best. Sometimes Google’s geocoder will provide multiple results, and we always have the results from our other geocoder, so this tends to filter out the most erroneous outliers. For example, let’s say our input way “Quincey Road” and we ended up with two results, “Quincey Street” and “Quincy Road”. We would take the second result, because there’s only a one-character difference rather than differing on an entire word.
- Sometimes Google’s geocoder over-simplifies complex street numbers – reducing “1421-1423 Hayes Street” to just the first address “1421 Hayes Street” (compound addresses like this are somewhat common for tenancy in common listings in San Francisco)
We got around this simply by checking that the street number of the input corresponds to the street number of the result, and disregarding the result if they don’t match. But it’s important to bear in mind the cases where a simplified version of your input address might be a valid address, albeit not the result you wanted – for the Hayes Street example, it’s very important for us to maintain that we’re talking about both 1421 and 1423 Hayes Street, not just one of them, even though each are valid addresses taken separately. Another great example of this in real estate is when we’re dealing with the historical form of Chicago addresses (which I only learned about because of this problem, and I must say, are completely sweet in their functionality).
There were also a few lessons that we learned and proved to be important:
- Google’s address level geocodes (indicated in their system as having accuracy code “8″) can be either point-level or street-interpolated. Their street level geocodes are only on the best-matching street, and don’t appear to take the street number into account when placing the coordinates on the given block. This also means that the street number is not returned as part of the normalized address at this accuracy.
- Their geocoder can return multiple results, and it’s not always the case that the first result returned is the one you want. Having a good filtering algorithm for choosing the best result is incredibly important.
- java.util.concurrent has some incredibly powerful and easy-to-use utility classes for multi-threading applications that can be broken up into independent units of work.

- Just because snow shuts down your city doesn’t mean you don’t have to work. Yes, that’s right, we cranked this out over the holidays!
Overall, we’ve been happy with the results of integrating Google’s geocoder (clearly, otherwise we wouldn’t be talking so much about it). Many of our initial concerns ended up being non-issues, partly because Google was so helpful when we brought them up. That being said, there’s still a short list of things we’d like to see added (who knows, maybe they’ll read this!):
- The ability to distinguish between point-level and street-interpolated geocodes. This is one of our largest remaining issues, since we take our data quality so seriously and we like to be able to measure it.
- More point-level data. All of the listings that are mapped directly onto a street rather than over a specific house are placed there because Google didn’t know exactly which house to put them over, only approximately how far down the street they are. We would love to see these directly over houses in the future.
- Componentized address parsing. Instead of just telling us the result is “710 2nd Avenue” we’d like to know that “710″ is the street number, “2nd” is the street name, and “Avenue” is the street type, without having to do any post-processing on our end.
- Less latency. Since it’s a web-based service, the network latency can add considerably to the time it takes us to get results back. A batch geocoder would be one possible solution.
- More throughput. Currently there are caps of 10 requests per second per IP address (so Google can protect against denial-of-service attacks). It would be nice if they could raise or eliminate these caps for customers.
- A pony. A big, shiny one.
There are still a few areas left where we know our geocoding could still be improved. One of the biggest remaining problems is with vacant land, which might not have a complete address yet, and neither of our geocoders supports partial addresses (such as “123XX Main St”). To take care of these cases, we’ll be looking to integrate a geocoder that can geocode by APN (essentially, a locally unique id that every property has).
Have you found any other bugs in Google’s geocoder that we might not have caught? Or know of any other cities with quirks that make geocoding difficult? Perhaps you know of a good APN geocoder? Let us know!
(Photo credits: tympsy and cheukiecfu on Flickr, respectively)
December 22, 2008
There comes a dreaded time in every developer’s life when the inevitable happens – you are forced to switch to a new machine, and re-configure the entire dev environment you spent months (or years, in my case) tweaking and perfecting. For me, that time unexpectedly came last week when the hard disk on my laptop decided it was time for a winter holiday.
Armed with a reimaged laptop and Redfin’s internal developer machine setup guide, I was making decent progress until I hit a stumbling block: Apache was crashing. No matter what I did – on the first browser hit, I was greeted with a friendly error:

The Event Log had a slightly more descriptive message:
Faulting application Apache.exe, version 2.0.63.200, faulting module ntdll.dll, version 5.1.2600.2180, fault address 0x00011e58.
Apache’s error log was mysterious in its own way:
[notice] Parent: child process exited with status 3221225477 — Restarting.
To save you hours of debugging, hair pulling, reinstalling, and commenting out of httpd.conf that I went through, I’ll just point you directly to the root cause of the problem: https://issues.apache.org/bugzilla/show_bug.cgi?id=44338: mod_deflate crashes and does not return response. If you have Apache 2.0.63 on Windows, are loading the mod_deflate.so module, and are using it by including AddOutputFilterByType/SetOutputFilter directives in your httpd.conf – Apache will crash. Currently there is no fix for this (although the bug above may be updated with a fix in the future) – you have a choice of taking out mod_deflate out of your Apache config, or upgrading to Apache 2.2.
Since mod_deflate is not essential in our developer setup, I happily chose the former option and got my Apache up and running.
December 11, 2008
Whenever I out myself as a member of the Redfin search team to someone who has used Redfin, one of the first questions I get is, “so why do you guys use the Microsoft Map? Why didn’t you choose Google?”. The full answer is a bit long, but the short answer is easy: speed.
Every few months, we’ve tested Google Maps against Virtual Earth, and the result was always the same: Google’s script and tiles loaded considerably faster, but Virtual Earth was as much as four times faster at adding a ton of items to the map. Since our user interface can add up to 500 houses at a time to a map, we just felt like Google wasn’t able to give us the performance we needed. To be fair, part of VE’s speed was due to a bulk add feature that we had lobbied for them to put in, but it worked well, and so we put Google aside, wistfully looking at those speedy script and tile load times.
A few months ago, though, we started contract renegotiations with Microsoft, and we decided to give Google Maps a closer look. One of my colleagues, the brilliant Dan Fabulich of Selenium fame, figured out that we could code our own custom GOverlay to make Google Maps display items much faster than it had previously. The question then became: how hard would it be to port our site from one platform to the other? And would it be worth it to do so? Read the rest of this entry »
October 14, 2008
What’s the difference between this HTML snippet:
<a href="http://www.google.com/search?q=html&foo=0">foo=0</a>
and this?
<a href="http://www.google.com/search?q=html©=0">copy=0</a>
Both of them look like simple Google searches (though they could have been anything; Google is just an example). One of them appends an extra “&foo=0″ to the end of the URL; the other appends “©=0″ instead.
Only the second snippet is valid in HTML 4.01 Strict, but that snippet doesn’t work the way you might expect. Neither snippet is valid in XHTML.
Give up? Click on these:
The first URL searches for “html,” but the other URL searches for “html©=0.”
Two weird things are happening here.
- Note that “©” is an HTML entity for the copyright symbol “©.” It would have been more obvious if the URL had used a semicolon, like this:
<a href="http://www.google.com/search?q=html©=0">copy;=0</a>
or if we’d used a more traditional HTML entity like this:
<a href="http://www.google.com/search?q=html"=0">quot;=0</a>
- The second weird thing is a quirk in the HTML specification on character references:
Note. In SGML, it is possible to eliminate the final “;” after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the “;” in all cases to avoid problems with user agents that require this character to be present.
As a result, all modern browsers (FF3, IE7, Opera 9, Safari 3.1) will helpfully notice possible entities like “©” and “<” and replace them with “©” and “<” … they assume you forgot the semicolon. This applies to all of the HTML entities, even the obscure ones like &empty “∅”, ¬ “¬”, ® “®”, &sub “⊂”, and &lang “〈”. (Bizarrely, &Copy is left alone as “&Copy” but © is replaced with “©”.)
We think there are two valuable lessons to learn from this story. The first lesson you may already know:
- The correct way to write an URL with a query parameter is to HTML escape the URL, replacing all &s with & like this:
<a href="http://www.google.com/search?q=html&copy=0">copy=0</a>
That’s also the only way to make the snippet XHTML compliant.
- Don’t use URL query parameters whose names are HTML entities. Never create a web service that accepts a query parameter like “&lang=en”. After all, there’s no way to know when your users might want to copy & paste your URLs into a blog, forum, or HTML email. Even if developers are clever enough to HTML escape href links, not everyone will be, and you can save everybody some trouble by avoiding the dangerous entities altogether.
September 16, 2008
Yesterday one of my co-workers spent hours reproducing and tracking down a bug that turned out to be a stray comma.
Unfortunately, Microsoft Internet Explorer doesn’t support trailing commas in JavaScript arrays and object literals. Code like this won’t parse:
var chord = ["do", "mi", "so",];
var json = { truth:"beauty", beauty:true, };
Those final commas will cause syntax errors in IE.
“How many thousands of developer hours have been lost to random IE bugs like this?” I asked myself. I decided that there had to be a good way to detect this problem in an automated way, without firing up a copy of IE and running a full test suite.
It turns out that these and other syntax errors can be detected automatically from the Windows command line, using the Windows Scripting Host (WSH). On Windows XP and higher, the command-line tool “cscript.exe” can be used to run JavaScript (ahem, JScript) headlessly (outside of any browser).
Just create a file called “wsh-parser.js” like this:
var fso = new ActiveXObject( "Scripting.FileSystemObject" );
function parse(fname) {
var file = fso.OpenTextFile( fname, 1 );
ret = file.ReadAll();
file.Close();
try {
eval("(function(){\\n"+ret+"\\n});");
} catch (e) {
WScript.Echo("Syntax error parsing " + fname);
WScript.Echo(" " + e.message);
}
}
function findJavaScript(folder) {
for (var fc = new Enumerator(folder.files); !fc.atEnd(); fc.moveNext()) {
var file=fc.item();
if (/.js$/.test(file.Name)) {
parse(file);
}
}
for (var fc = new Enumerator(folder.subfolders); !fc.atEnd(); fc.moveNext()) {
var subfolder = fc.item();
if (subfolder.Name == ".svn") continue; // ignore .svn folders
findJavaScript(subfolder);
}
}
var rootPath = "src/main/javascript";
var rootFolder = fso.GetFolder(rootPath);
findJavaScript(rootFolder);
Then run it like this:
cscript //E:javascript //nologo wsh-parser.js
The script will automatically examine every JavaScript file in the specified rootPath, reporting errors in any file that won’t parse correctly.
There’s a bit of magic going on here in the “parse” function to guarantee that we’re only verifying the syntax of our JavaScript, without actually running it. The magic line is:
eval("(function(){\\n"+ret+"\\n});");
The “eval” command would normally execute the provided string, but here we wrap the code in an anonymous function declaration. We improve the performance and the maintainability of our test by declaring the function without actually using it.
Note that you can also use JsLint to detect trailing commas, but it will probably report a bunch of other problems with your code that you may or may not care about.
Automated validators like these should be used together with headless unit tests as well as live browser tests. (I would recommend Selenium for automated browser tests. There are a ton of excellent headless JavaScript unit test tools, including Schaible’s JsUnit, RhinoUnit, and DOH. Note that Hieatt’s JsUnit is not headless.)
But I recommend running automated validators first, because they’re fast and the errors they report are easy to debug/fix. Once the automated validators are happy, run your unit tests. When your unit tests pass, run your live browser tests.
September 9, 2008
This morning I was scanning TechCrunch while still in bed, and I saw a news item about a hot new Silicon Valley startup, Adgregate Markets, that has either solved a really vexing problem in online security or has made the situation a lot worse. Given what we’ve been working on recently at Redfin, I’m pretty sure it’s the latter.
Backing up: for the last few days, we’ve been revamping our registration and login process at Redfin to make it easier for users, and I must admit that what we’ve found has depressed me to no end. We want to put an inline form that lets users log in without ever leaving the page, but it turns out that it’s very difficult to do this reasonably securely. Much more depressing has been the realization that many sites that I use daily, sites I respect, have abysmal security around login and registration. Our awesome summer intern Aditya said after a long discussion of the myriad ways to hack the login process: “Man, I’m never banking online again.” He then confessed that he’d never banked offline (and didn’t know how), which made me feel impossibly old.

The topic that most scared Aditya was “Man In The Middle” attacks and the impossibility of providing a reasonably secure experience from an HTTP page. The problem is this: suppose someone evil has set up a wifi router in your favorite coffee house and that he has control over that router. Because all of your traffic is going through that wifi router, Mr. Evil can not only listen to every packet that is transmitted between you and the Internet, he can also modify any packet. Now, on an HTTPS page, that’s not a problem because (a) the packets that the router can see are encrypted and (b) the browser will alert the user with a big scary warning if the any of the packets have been modified along the way. HTTP pages, on the other hand, will act completely normally if Mr. Evil mucks around with them.
What this means is that any form on an HTTP page is inherently unsafe, even if it posts to an HTTPS page.
Let me repeat that: ANY form on an HTTP page is inherently unsafe.
Why? Because as a user, you have no idea if the form at http://awesomesite.com came from awesomesite.com or from Mr. Evil’s nefarious lair. And if the form did come from Mr. Evil, chances are that it will send your credit card number back to Mr. Evil (and your username and your password and your SSN and your mother’s maiden name…). Worse, it could send the credit card info to Mr. Evil before continuing and sending it along to awesomesite.com. Neither you nor awesomesite.com would be any the wiser.
Many web developers mistakenly believe that an HTTP form (on http://awesomesite.com) that posts to an SSL URL (https://awesomesite.com) is safe, because the communication channel for the post is secure between client and server. But Mr. Evil can change http://awesomesite.com freely, so he can make the form post to anywhere he wants. This is why we tell users again and again rule #1 of web security: Never enter confidential information, especially credit card numbers, unless you see the SSL lock in your browser.
Which brings us back to Adgregate Markets. Adgregate is a company that launched at the TechCrunch 50 yesterday, and they have a product that creates interactive banner ads that allow the user to complete a shopping checkout process completely inside an ad, without leaving the host page at all. They say that this both increases conversion rates, which is a Good Thing for advertisers, and increases page stickiness, which is a Good Thing for publishers. The panel at TC50 reacted positively to Adgregate, and the fact that their business model takes a cut of every sale that happens through their ads is certainly attractive.
There’s a problem here, though, and it’s packaged neatly in Adgregate’s marketing copy: “provide(s) secure transactions on both secure and non-secure pages.” As I said above, I think this simply isn’t possible. If the host page is HTTP, then Mr. Evil can swap out Adgregate’s Flash ad for an ad that looks exactly like it but sends all of the info to Mr. Evil’s server, not to Adgregate. Users will be none the wiser until a laser death ray shows up on their MasterCard statement.
On the other hand, maybe Adgregate has figured out how to securely conduct a conversation on an HTTP page. Many, many smart people have tried for over a decade to solve this problem and failed, but it’s possible Adgregate cracked that nut. If they have, though, it would arguably be a better business model to patent and sell the intellectual property. There are a lot of people who would pay good money for it.
As I write this post, Adgregate’s running third in a prediction market for winning best in show. My amateur prediction is that they won’t win, but that they will probably come out of TC50 with a good boost in publicity. My fear, though, is that they will start to find success with their “secure” forms on HTTP pages, and that they will succeed at convincing users that those forms are safe. (To be fair, Adgregate is far from the only company that is attempting to convince users of the safety of HTTP forms, a fact we discovered in the course of our research on inline registration. To my horror, I found out that even my bank encourages users to enter their password into a non-SSL page.)
If users become inured to entering confidential info on to HTTP pages, then everyone in the consumer web space loses. Because when those Adgregate ads get hacked (and if they become at all successful, they will be hacked), people will lose some degree of faith in the consumer web. And after Mr. Evil has gotten you once, it’s hard to believe that he won’t get you again.
(Dr. Evil pose photo used under Creative Commons from flickr user caitlinator.)
May 6, 2008
A few months ago I attended the PostgreSQL conference in Portland, OR. There were a lot of talks ranging from hard-core stuff like Neil Conway’s talk about the internals of query execution, to random fun stuff like David Fetter’s discussion of procedural languages, including LOLCODE.

During their talks, a few people mentioned a handy function called generate_series. It took me a while to discover how useful this function really is. I thought I’d post an example. Here goes…
Let’s say that you have a table with sales information:
postgres=# select * from sales order by date;
date | sales_person | part_number
------------+--------------+-------------
2008-05-05 | Glenn | 1
2008-05-05 | Shahaf | 1
2008-05-06 | Mike | 1
2008-05-06 | Mike | 2
2008-05-08 | Glenn | 1
2008-05-08 | Shahaf | 1
2008-05-08 | Mike | 2
2008-05-09 | Mike | 1
2008-05-09 | Glenn | 1
(9 rows)
You might want to get an idea of how many sales happened on each day. You could try to do it with a query like this:
postgres=# select date, count(*) from sales group by date order by date;
date | count
------------+-------
2008-05-05 | 2
2008-05-06 | 2
2008-05-08 | 3
2008-05-09 | 2
(4 rows)
This basically works, but it hides one important fact — no sales happened on May 7.
To fix this, we can use generate_series. When you run this function normally, it just generates a series of numbers:
postgres=# select generate_series(0,4);
generate_series
-----------------
0
1
2
3
4
(5 rows)
However, you can easily change it to generate a series of dates:
postgres=# select generate_series(0,4) + date '2008-05-05' as date;
date
------------
2008-05-05
2008-05-06
2008-05-07
2008-05-08
2008-05-09
(5 rows)
Once you have this, you can now join against the sales table to generate the report:
postgres=# select series.date, count(sales.date)
from (select generate_series(0,4) + date '2008-05-05' as date) as series
left outer join sales on series.date=sales.date group by series.date
order by series.date;
date | count
------------+-------
2008-05-05 | 2
2008-05-06 | 2
2008-05-07 | 0
2008-05-08 | 3
2008-05-09 | 2
(5 rows)
The trick here is to do a left-outer join betwee the date sequence and the sales, and to count the rows that have a non-null sales date.
This is by no means the only use of generate_series, it’s just the most recent use I found. If you know of other ways to use this function, or if you know of other handy functions, drop a note below.
Image credits: Electric Vehicle Guide
March 19, 2008
Our very own PM, Matt Goyer, recently setup his Seattle condos WordPress blog to include a syndication of new Redfin listings. We thought it would be a great idea to share how we did this.
Required Plugins
Steps to Follow
- Install the plugins listed above
- Create a writable cache folder if you haven’t already (wp-content/cache)
- This problem is solved if you use WP-Cache
- Perform a search on Redfin
- To syndicate new listings, make sure you select the “New Listings” option (pictured below)
- Once you are satisfied with your search, click the “Save this Search” link above the map
- In the pop-up that will appear, click the “RSS 2.0 Feed” link (also pictured below)
- Follow the SimplePie Plugin usage instructions to include the Redfin RSS feed


Here’s an example of how syndication looks on Matt’s blog:

You’ll have to fiddle with the SimplePie templates and settings to get syndication to look like Matt’s. Again, refer to their usage instructions on how to do this.