<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Elephant versus Dolphin: Which is Faster?  Which is Smarter?</title>
	<atom:link href="http://blog.redfin.com/devblog/2007/11/elephant_versus_dolphin_which_is_faster_which_is_smarter.html/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.redfin.com/devblog/2007/11/elephant_versus_dolphin_which_is_faster_which_is_smarter.html</link>
	<description>Redfin Developers' Blog</description>
	<lastBuildDate>Sat, 21 Nov 2009 14:23:24 -0800</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Logan Bowers</title>
		<link>http://blog.redfin.com/devblog/2007/11/elephant_versus_dolphin_which_is_faster_which_is_smarter.html/comment-page-1#comment-512</link>
		<dc:creator>Logan Bowers</dc:creator>
		<pubDate>Fri, 09 Nov 2007 05:23:45 +0000</pubDate>
		<guid isPermaLink="false">http://blog.redfin.com/devblog/2007/11/elephant_versus_dolphin_which_is_faster_which_is_smarter.html#comment-512</guid>
		<description>On the partitioning methods, you can also partition randomly, e.g. with N servers, each holds 1/N of the properties.  Your overall load is higher since the query has to be prepared separately on each server, but your total throughput is N times greater, so (theoretically) your query finishes in nearly 1/N the time.  For a small reduction in efficiency, you can significantly reduce visible latency to the user.</description>
		<content:encoded><![CDATA[<p>On the partitioning methods, you can also partition randomly, e.g. with N servers, each holds 1/N of the properties.  Your overall load is higher since the query has to be prepared separately on each server, but your total throughput is N times greater, so (theoretically) your query finishes in nearly 1/N the time.  For a small reduction in efficiency, you can significantly reduce visible latency to the user.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Durch Marlowe</title>
		<link>http://blog.redfin.com/devblog/2007/11/elephant_versus_dolphin_which_is_faster_which_is_smarter.html/comment-page-1#comment-504</link>
		<dc:creator>Durch Marlowe</dc:creator>
		<pubDate>Thu, 08 Nov 2007 20:15:32 +0000</pubDate>
		<guid isPermaLink="false">http://blog.redfin.com/devblog/2007/11/elephant_versus_dolphin_which_is_faster_which_is_smarter.html#comment-504</guid>
		<description>#1 is easy. Oracle is a great database for this.  Unfortunately you have to deal with sales guys.
#2: isn&#039;t that every Web 2.0 company&#039;s knee-jerk reaction?  Don&#039;t they have a name for that: &quot;sharding&quot;?  Seems like a band-aid.
#3 is cool.  If you can formulate your query in something other than SQL (e.g. XML), you can split the query quite simply.  The problem is assembling the result set.  That is a major flaw of something like this.  Databases are good at merging records.  No one, to my knowledge, has written anything that efficiently merges records arriving asynchronously from processing nodes (unless Google does that).
#4. Elastra is an interesting company, but still thinking small and database-focused.  There are parallel problems like yours that need to be addressed and they are beyond the DB technologies that exist today.  Hopefully someone gets to it one of these days, especially now, with virtual computing and web-based CPU farms being all the rage.</description>
		<content:encoded><![CDATA[<p>#1 is easy. Oracle is a great database for this.  Unfortunately you have to deal with sales guys.<br />
#2: isn&#8217;t that every Web 2.0 company&#8217;s knee-jerk reaction?  Don&#8217;t they have a name for that: &#8220;sharding&#8221;?  Seems like a band-aid.<br />
#3 is cool.  If you can formulate your query in something other than SQL (e.g. XML), you can split the query quite simply.  The problem is assembling the result set.  That is a major flaw of something like this.  Databases are good at merging records.  No one, to my knowledge, has written anything that efficiently merges records arriving asynchronously from processing nodes (unless Google does that).<br />
#4. Elastra is an interesting company, but still thinking small and database-focused.  There are parallel problems like yours that need to be addressed and they are beyond the DB technologies that exist today.  Hopefully someone gets to it one of these days, especially now, with virtual computing and web-based CPU farms being all the rage.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Michael Smedberg</title>
		<link>http://blog.redfin.com/devblog/2007/11/elephant_versus_dolphin_which_is_faster_which_is_smarter.html/comment-page-1#comment-501</link>
		<dc:creator>Michael Smedberg</dc:creator>
		<pubDate>Thu, 08 Nov 2007 17:16:46 +0000</pubDate>
		<guid isPermaLink="false">http://blog.redfin.com/devblog/2007/11/elephant_versus_dolphin_which_is_faster_which_is_smarter.html#comment-501</guid>
		<description>Yup, we&#039;ve thought a little about these issues, and had some internal debates on how to do partitioning.

As you point out, our business naturally partitions geographically.  There are a few possible approaches to partitioning that I&#039;m aware of:

1. Let the DB do partitioning, as you mentioned
2. Partition by market.  We currently break the world into markets (the San Francisco Bay Area market, the Seattle market, etc.).  Real estate agents are assigned to markets, most users are searching for a home within a single market, and we have &quot;home&quot; pages for each market (e.g. http://sfbay.redfin.com/ or http://seattle.redfin.com/).  However, as you point out, markets are NOT of equal size; it&#039;s a convenient way to partition, but not necessarily a very smart way.
3. Handle partitioning in custom logic.  For example, we could write our own partitioner, and direct queries to the server that corresponds to the region being searched (or even to all partitions in parallel.)
4. Use a service like Elastra (http://www.elastra.com/), which provides something like the &quot;on-demand access to an arbitrarily large number of machines to perform the processing&quot; that you&#039;re proposing.

For now, Redfin probably doesn&#039;t need to do any of those things.  The combination of good GIS indexing, 64-bit machines, and cheap RAM lets us handle pretty big data volumes (tens of gigs) without sweating.  We can probably double the number of markets we serve without worrying too much about partitioning.  Once we approach hundreds of gigs of data, though, we&#039;ll definitely have to think about this more carefully.</description>
		<content:encoded><![CDATA[<p>Yup, we&#8217;ve thought a little about these issues, and had some internal debates on how to do partitioning.</p>
<p>As you point out, our business naturally partitions geographically.  There are a few possible approaches to partitioning that I&#8217;m aware of:</p>
<p>1. Let the DB do partitioning, as you mentioned<br />
2. Partition by market.  We currently break the world into markets (the San Francisco Bay Area market, the Seattle market, etc.).  Real estate agents are assigned to markets, most users are searching for a home within a single market, and we have &#8220;home&#8221; pages for each market (e.g. <a href="http://sfbay.redfin.com/" rel="nofollow">http://sfbay.redfin.com/</a> or <a href="http://seattle.redfin.com/" rel="nofollow">http://seattle.redfin.com/</a>).  However, as you point out, markets are NOT of equal size; it&#8217;s a convenient way to partition, but not necessarily a very smart way.<br />
3. Handle partitioning in custom logic.  For example, we could write our own partitioner, and direct queries to the server that corresponds to the region being searched (or even to all partitions in parallel.)<br />
4. Use a service like Elastra (<a href="http://www.elastra.com/" rel="nofollow">http://www.elastra.com/</a>), which provides something like the &#8220;on-demand access to an arbitrarily large number of machines to perform the processing&#8221; that you&#8217;re proposing.</p>
<p>For now, Redfin probably doesn&#8217;t need to do any of those things.  The combination of good GIS indexing, 64-bit machines, and cheap RAM lets us handle pretty big data volumes (tens of gigs) without sweating.  We can probably double the number of markets we serve without worrying too much about partitioning.  Once we approach hundreds of gigs of data, though, we&#8217;ll definitely have to think about this more carefully.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dutch Marlowe</title>
		<link>http://blog.redfin.com/devblog/2007/11/elephant_versus_dolphin_which_is_faster_which_is_smarter.html/comment-page-1#comment-497</link>
		<dc:creator>Dutch Marlowe</dc:creator>
		<pubDate>Thu, 08 Nov 2007 07:00:21 +0000</pubDate>
		<guid isPermaLink="false">http://blog.redfin.com/devblog/2007/11/elephant_versus_dolphin_which_is_faster_which_is_smarter.html#comment-497</guid>
		<description>First of all, good points on GIS support in Postgres.  I always liked a database that let me compute spheroid distances based on the Geodetic survey of my choice (big fan of the 1980 :) ).  Yeah, Oracle can do it, but who wants to blow $20K on a server license?

One interesting thing I never got around to trying while playing with Postgres 8.1+ was mixing Postgres query partitioning (http://www.postgresql.org/docs/8.1/interactive/ddl-partitioning.html)
with spatial filters. 
 
In theory, your listings are much more dense in some areas than in others.  If you could partition on arbitrary geometry and then have the query split among resources as it executes, you could create this giant query computer that really focuses on geography as a way of keeping query times sane.  Your geographic partitions can be very large (&quot;The Midwest&quot;) or very small (&quot;The Upper West Side&quot;).  Now imagine data for the Midwest sitting on one machine and data for the UWS sitting on another machine.  The machines have the same resources (memory, CPU), but cover vastly different areas because there are more listings in one than the other.  They individually perform well and can be easily re-partitioned (split Midwest into 3 &quot;query machines&quot; for example).  Performance is linear on the machines as long as they are similar.  Performance on an area that does not cross geographic boundaries split between machines is linear.  Performance when someone spans boundaries is NOT #machines X individual machine performance.  It looks more logarithmic.  Now you have the &quot;Google&quot; of geographic searches.


You should be able to arbitrarily scale this system ad nauseam.

Now, if you only had on-demand access to an arbitrarily large number of machines to perform the processing...</description>
		<content:encoded><![CDATA[<p>First of all, good points on GIS support in Postgres.  I always liked a database that let me compute spheroid distances based on the Geodetic survey of my choice (big fan of the 1980 <img src='http://blog.redfin.com/devblog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  ).  Yeah, Oracle can do it, but who wants to blow $20K on a server license?</p>
<p>One interesting thing I never got around to trying while playing with Postgres 8.1+ was mixing Postgres query partitioning (<a href="http://www.postgresql.org/docs/8.1/interactive/ddl-partitioning.html" rel="nofollow">http://www.postgresql.org/docs/8.1/interactive/ddl-partitioning.html</a>)<br />
with spatial filters. </p>
<p>In theory, your listings are much more dense in some areas than in others.  If you could partition on arbitrary geometry and then have the query split among resources as it executes, you could create this giant query computer that really focuses on geography as a way of keeping query times sane.  Your geographic partitions can be very large (&#8220;The Midwest&#8221;) or very small (&#8220;The Upper West Side&#8221;).  Now imagine data for the Midwest sitting on one machine and data for the UWS sitting on another machine.  The machines have the same resources (memory, CPU), but cover vastly different areas because there are more listings in one than the other.  They individually perform well and can be easily re-partitioned (split Midwest into 3 &#8220;query machines&#8221; for example).  Performance is linear on the machines as long as they are similar.  Performance on an area that does not cross geographic boundaries split between machines is linear.  Performance when someone spans boundaries is NOT #machines X individual machine performance.  It looks more logarithmic.  Now you have the &#8220;Google&#8221; of geographic searches.</p>
<p>You should be able to arbitrarily scale this system ad nauseam.</p>
<p>Now, if you only had on-demand access to an arbitrarily large number of machines to perform the processing&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Michael Smedberg</title>
		<link>http://blog.redfin.com/devblog/2007/11/elephant_versus_dolphin_which_is_faster_which_is_smarter.html/comment-page-1#comment-482</link>
		<dc:creator>Michael Smedberg</dc:creator>
		<pubDate>Tue, 06 Nov 2007 18:05:52 +0000</pubDate>
		<guid isPermaLink="false">http://blog.redfin.com/devblog/2007/11/elephant_versus_dolphin_which_is_faster_which_is_smarter.html#comment-482</guid>
		<description>Well, listing perf is already pretty good for most users (almost always under 200ms), so there isn&#039;t as much room for improvement.  But yes, I do expect to see performance improvements on the listing end.
As the number of listings grows (i.e. as we move into more markets), this will be a bigger effect, and the effort of indexing listings well will be more apparent.</description>
		<content:encoded><![CDATA[<p>Well, listing perf is already pretty good for most users (almost always under 200ms), so there isn&#8217;t as much room for improvement.  But yes, I do expect to see performance improvements on the listing end.<br />
As the number of listings grows (i.e. as we move into more markets), this will be a bigger effect, and the effort of indexing listings well will be more apparent.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Glenn Kelman</title>
		<link>http://blog.redfin.com/devblog/2007/11/elephant_versus_dolphin_which_is_faster_which_is_smarter.html/comment-page-1#comment-481</link>
		<dc:creator>Glenn Kelman</dc:creator>
		<pubDate>Tue, 06 Nov 2007 17:55:59 +0000</pubDate>
		<guid isPermaLink="false">http://blog.redfin.com/devblog/2007/11/elephant_versus_dolphin_which_is_faster_which_is_smarter.html#comment-481</guid>
		<description>Michael, do you think performance will be improved much by having Postgres run LISTINGS search?</description>
		<content:encoded><![CDATA[<p>Michael, do you think performance will be improved much by having Postgres run LISTINGS search?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
