Archive for the ‘Performance’ Category
March 11, 2011
We use JDBC to connect to our database, but most of our code doesn’t connect directly to JDBC. Instead, we go through Hibernate, which is great for most purposes, but can make it difficult to do low level tweaks. We might want to do things like:
- Generate performance metrics per-thread, to get a SQL oriented performance profile for individual controllers
- Provide a single, central location to tweak SQL before running it
- Write unit tests that make assertions about the number or type of SQL statements that higher level code runs
- Trace Queries and ResultSet sizes
- Debug the SQL generated by third party libraries, and how those libraries use JDBC
I wrote wrappers for the most relevant interfaces (Driver, Connection, Statement, CallableStatement, PreparedStatement, ResultSet.) It’s 100% boilerplate (I didn’t implement any USEFUL functionality- that’s for you to do!) It took me a few hours, so I thought I’d share- no point in us all writing the same boilerplate over and over! Obviously, you’ll have to tweak this code for your own purposes.
To use it, you would use ‘redfin’ in your JDBC URL scheme, like this: ‘jdbc:redfin://blahblah’. You’d also set your JDBC driver class to ‘redfin.util.jdbc.DriverWrapper’. The exact mechanism you use to do this is obviously dependent on your environment.
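As a rough illustration of the URL handling described above (this is not the code from the full post), a driver wrapper might look something like the sketch below. The class name `DriverWrapperSketch`, the connection counter, and the assumption that the real database speaks `jdbc:postgresql:` are all hypothetical; substitute whatever your environment uses.

```java
import java.sql.Connection;
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

// Hypothetical sketch of a delegating JDBC driver; not the real redfin.util.jdbc.DriverWrapper.
class DriverWrapperSketch implements Driver {
    static final String PREFIX = "jdbc:redfin:";
    // Assumption: the underlying database is PostgreSQL. Change to match your real driver.
    static final String REAL_PREFIX = "jdbc:postgresql:";
    // Example of a metric a wrapper could collect.
    static final AtomicLong CONNECTIONS_OPENED = new AtomicLong();

    // Turns "jdbc:redfin://host/db" into "jdbc:postgresql://host/db".
    static String unwrapUrl(String url) {
        return REAL_PREFIX + url.substring(PREFIX.length());
    }

    @Override
    public boolean acceptsURL(String url) {
        return url != null && url.startsWith(PREFIX);
    }

    @Override
    public Connection connect(String url, Properties info) throws SQLException {
        if (!acceptsURL(url)) {
            return null; // JDBC contract: return null for URLs this driver does not handle
        }
        CONNECTIONS_OPENED.incrementAndGet();
        // A full wrapper would wrap the result in its own Connection implementation.
        return DriverManager.getConnection(unwrapUrl(url), info);
    }

    @Override
    public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) {
        return new DriverPropertyInfo[0];
    }

    @Override public int getMajorVersion() { return 1; }
    @Override public int getMinorVersion() { return 0; }
    @Override public boolean jdbcCompliant() { return false; }

    @Override
    public Logger getParentLogger() throws SQLFeatureNotSupportedException {
        throw new SQLFeatureNotSupportedException("no java.util.logging support in this sketch");
    }
}
```

Because the unwrapped URL no longer starts with `jdbc:redfin:`, the wrapper declines it when `DriverManager` asks, so the delegation does not loop back into the wrapper.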
[Click through to the full post for the code.]
June 14, 2010
As we talked about before, Redfin uses Varnish to implement Edge Side Includes (ESI.) This involved breaking a single big (and expensive) page into individual chunks; each chunk would be generated by separate code, and would be cached on a different schedule.
Once we broke our expensive page into chunks that could be individually cached, it seemed pretty easy to have those chunks served up by different backend servers. Voilà, a monolithic app became “service oriented”! This would let us run the different software components on different machines (with different performance characteristics, different SLAs, even implementations in different languages/environments!)
Of course, nothing is actually that easy, and we made a number of mis-steps before we figured out how to do it.

How To
Varnish allows you to define multiple backends in your VCL. And in your vcl_recv function, you can decide which backend should handle a particular request. At Redfin, we added a new Varnish backend for each of our ESI endpoints, and we added logic to choose the relevant backend by URI. In practice, we actually only have one pool of machines handling our ESI requests, so all of our Varnish backends actually point to the same place.
So the first piece of the puzzle is on our main web servers. On the main web servers, requests go through Varnish. Requests for “normal” pages are sent through to Tomcat, but requests for ESIs are sent to one of the SOA backends. Here’s an example of what the VCL file might look like:
backend default {
    .host = "localhost";
    .port = "8080";
}
backend similars {
    .host = "similars.redfin.com";
    .port = "6081";
}
backend relevantlinks {
    .host = "relevantlinks.redfin.com";
    .port = "6081";
}
...
sub vcl_recv {
    if (req.url ~ "^/esi-listing-similars" || req.url ~ "^/esi-property-similars") {
        set req.backend = similars;
    }
    else if (req.url ~ "^/esi-listing-trackbacks") {
        set req.backend = relevantlinks;
    }
    ...
}
You might have noticed that the “localhost” backend is associated with port 8080 (where Tomcat is running), but the ESI backends are associated with port 6081 (where Varnish is running on those remote machines.)
We want the instance of Varnish on the main web server to cache content from the main web server, and the instances of Varnish on the ESI backends to cache the content from those backends. This has a few benefits:
- Our effective cache is bigger, since we have caches on multiple machines, each of which has fixed memory
- Having independent caches prevents one set of items from pushing another set out of the cache. If all the data were in a single cache, then cache entries holding similars information (which is small, but expensive to recreate) could be pushed out of the cache by cache entries of “main page” content (which is big and relatively cheap to recreate, but we’d still like to cache.)
- It’s easy to flush individual caches without having to worry about performance problems with other parts of the site
We have another design goal: we’d like to have a single distribution of our software. We’d like to have a single WAR that we can put on any machine; we do NOT want to have to deal with multiple builds, with figuring out which build has been installed on which machine, etc. We’d like to be able to switch a single machine from being a standard web server to being an ESI endpoint without having to redeploy or reconfigure.
This creates a conundrum. We want our main web servers and our ESI servers to be identical, but we also want them to act differently. In particular, when an instance of Varnish on a web server gets a request for an ESI fragment, it should forward that request to an ESI server (more precisely: to the Varnish instance running on an ESI server.) But when an instance of Varnish on an ESI server gets a request for an ESI fragment, it should forward the request to the local Tomcat instance. It should NOT forward the request to ITSELF. Forwarding port 6081 to port 6081 creates an infinite loop- not good.
We want to break the symmetry between the standard web servers and the ESI servers, and we do that by messing with the URIs.
We prepend our ESI URIs with a known prefix, which means “forward this to the ESI server.” But when we process the URI (while forwarding it), we strip off that prefix, so that the ESI server does not also forward it to itself. That’s harder to say than it is to code. The VCL code looks like this:
sub vcl_recv {
    if (req.url ~ "^/backend/") {
        # Strip the prefix so the ESI server does not forward the request again
        set req.url = regsub(req.url, "^/backend/", "/");
        if (req.url ~ "^/esi-listing-similars" || req.url ~ "^/esi-property-similars") {
            set req.backend = similars;
        }
        else if (req.url ~ "^/esi-listing-trackbacks") {
            set req.backend = relevantlinks;
        }
    }
    ...
}
This breaks the circularity. The path of requests looks like:
- A request comes into Varnish on the standard web server for /path/to/a/page
- Varnish forwards the request to the local Tomcat instance
- Tomcat responds with HTML that includes <esi:include src="/backend/esi-listing-similars" />
- Varnish processes the ESI, and must make a request for /backend/esi-listing-similars
- The Varnish instance on the standard web server strips off “/backend”, and sends a request for “/esi-listing-similars” to the ESI server
- The Varnish instance on the ESI server gets the request for “/esi-listing-similars”
- Since there’s no “/backend” prefix, the Varnish instance on the ESI server forwards the request to its local Tomcat instance
- The Tomcat instance on the ESI server processes the request, and responds with the relevant HTML fragment
- The Varnish instance on the ESI server caches the HTML fragment and returns it
- The Varnish instance on the standard web server splices the HTML fragment into the main page content and returns it to the browser
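The routing decision in the steps above can be modeled in a few lines of Java. This is only a sketch of the logic, not how Varnish works internally; `EsiRouter` and `Route` are hypothetical names, and the backend names simply mirror the VCL.

```java
// Hypothetical model of the vcl_recv routing logic with the "/backend" prefix.
class EsiRouter {
    // Result of a routing decision: which backend to use, and the URL to send it.
    static final class Route {
        final String backend;
        final String url;

        Route(String backend, String url) {
            this.backend = backend;
            this.url = url;
        }
    }

    static Route route(String url) {
        if (url.startsWith("/backend/")) {
            // Strip "/backend" but keep the slash that follows it.
            String stripped = url.substring("/backend".length());
            if (stripped.startsWith("/esi-listing-similars")
                    || stripped.startsWith("/esi-property-similars")) {
                return new Route("similars", stripped);
            }
            if (stripped.startsWith("/esi-listing-trackbacks")) {
                return new Route("relevantlinks", stripped);
            }
            return new Route("default", stripped);
        }
        // No prefix: hand the request to the local Tomcat instance.
        return new Route("default", url);
    }
}
```

Running the same function twice shows the symmetry breaking: the first hop sends `/backend/esi-listing-similars` to the `similars` backend as `/esi-listing-similars`, and the second hop, seeing no prefix, routes that URL to `default`.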
The request path above points out another tricky bit: how do we ensure that the HTML fragment is cached by the Varnish service on the ESI server, but not by the Varnish service on the standard web server? To handle this correctly, we add a header to the response which indicates if it’s already been cached:
sub vcl_fetch {
    if (req.url ~ "^/esi-") {
        if (obj.http.X-RF-Cached ~ "true") {
            # Already cached by an upstream Varnish- do not cache again
            pass;
        }
        set obj.http.X-RF-Cached = "true";
    }
    ...
}
This code says “If there’s an X-RF-Cached header present, then don’t attempt to cache. If there is NOT an X-RF-Cached header present, then add one, and attempt to cache.” With this addition, the HTML fragments will only be cached on the first Varnish instance they pass through, which is on the ESI server in our case.
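To make the "cache only on the first hop" behavior concrete, here is a small Java model of that header check. It is a sketch of the decision only; `CacheOnceSketch` and `cacheFragment` are hypothetical names, and a plain map stands in for the response headers.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the X-RF-Cached check applied to one ESI fragment response.
class CacheOnceSketch {
    static final String HEADER = "X-RF-Cached";

    // Returns true if this Varnish hop should cache the fragment.
    // Like the VCL, it marks the response so later hops will not cache it again.
    static boolean cacheFragment(Map<String, String> responseHeaders) {
        if ("true".equals(responseHeaders.get(HEADER))) {
            return false; // an earlier hop already cached this response
        }
        responseHeaders.put(HEADER, "true");
        return true;
    }

    // Convenience for experiments: a response with no headers yet.
    static Map<String, String> freshResponse() {
        return new HashMap<>();
    }
}
```

Passing the same header map through `cacheFragment` twice mimics the response travelling from the ESI server's Varnish to the web server's Varnish: the first call returns true, the second returns false.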
How NOT To
The solution described above works, and meets our requirements. But we also tried some solutions that did NOT work. Perhaps you can learn from our failures…
Putting Absolute URIs into ESI Includes
Our first thought was that we’d put absolute URIs into our ESI includes in the HTML. For instance, we tried to put <esi:include src="http://similars.redfin.com:6081/esi-listing-similars" /> into the main HTML of our page. Varnish simply (and correctly, I think) ignores the host name and port. Including http://similars.redfin.com:6081/esi-listing-similars will cause Varnish to act as if you included /esi-listing-similars, and Varnish will use whichever backend it thinks is relevant, regardless of the host name or port in the URI.
Using a Single Server as both a Standard Web Server and an ESI Server
When doing testing, or when some of our servers were unavailable, we were tempted to use a single server as both the standard web server and the ESI server. It seemed like this should work- the trick with the “/backend” prefix should prevent infinite circularity. However, it didn’t work. It seems that Varnish is doing its own checks for circularity, and noticing that a single request passed through the same Varnish instance multiple times (which NORMALLY would be a problematic example of circularity, but we’ve got our clever symmetry breaker in there!) Anyway, Varnish doesn’t allow it, and causes those semi-circular requests to fail.
P.S.
Thanks to D’Arcy Norman for the photo!
May 7, 2010
When your webapp is serving up content that’s expensive to generate, you may want to serve it up asynchronously- via AJAX calls. This is particularly appealing when content is “below the fold.”
However, when that content is cached, you want to serve it up as quickly as possible. If you’ve already calculated the content, you’d like to include it inline in the page, without requiring an AJAX roundtrip. That way, you avoid the latency of an unnecessary round-trip. You also allow the page to be fully rendered (so content doesn’t jump around), etc.
You can optimize for the empty cache, or you can optimize for the full cache, but it seems hard to optimize both experiences.
Redfin faces exactly this conundrum with our listing pages (e.g. http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622.) Calculating the Similar Listings and Similar Sales is expensive and performed in real time. We cut this Gordian Knot through the use of the Varnish caching reverse proxy, along with clever use of ESI (Edge Side Includes.) For an overview of how we use Varnish at Redfin, see our previous post.

We want to say “if there’s a cache miss, then do AJAX, but if there’s a cache hit, then just include the content.” We have to make sure that the AJAX calls will fill the cache, such that subsequent requests will see cache hits, of course!
I’ll outline what the requests/responses look like for us, then I’ll include some pseudocode that supports this.
At the beginning of time, the cache is empty, and the browser requests information on a Listing.
| Step | Browser | Varnish | Backend Server |
|---|---|---|---|
| 1 | Requests http://www.redfin.com/…/home/604622 | | |
| 2 | | Passes request to server | |
| 3 | | | Returns HTML including an ESI like <esi:include src="/similars?property_id=604622" /> |
| 4 | | Lookup /similars?property_id=604622 in cache | |
| 5 | | Cache lookup fails | |
| 6 | | Makes request to /similars?property_id=604622 | |
| 7 | | | Returns HTML for AJAX for Similars (e.g. a <script> block with a reference to http://www.redfin.com/extranet-similars?property_id=604622). Response includes “no cache” headers |
| 8 | | Injects the <script> block into the HTML to be returned. Does NOT cache the server response | |
| 9 | | Returns HTML to Browser | |
| 10 | Displays HTML | | |
| 11 | Executes <script> block | | |
| 12 | Requests http://www.redfin.com/extranet-similars?property_id=604622, including a special header saying “gimme the real content” | | |
| 13 | | Passes /extranet-similars?property_id=604622 request to server | |
| 14 | | | Returns HTML including an ESI like <esi:include src="/similars?property_id=604622" /> |
| 15 | | Lookup /similars?property_id=604622 in cache | |
| 16 | | Cache lookup fails | |
| 17 | | Makes request to /similars?property_id=604622, passing along special “gimme the real content” header | |
| 18 | | | Examines request, sees special “gimme the real content” header |
| 19 | | | Calculates correct HTML to display Similar Listings and Similar Sales |
| 20 | | | Returns HTML including “please cache this” headers |
| 21 | | Injects the Similars block into the HTML to be returned. DOES cache the server response | |
| 22 | | Returns HTML to Browser | |
| 23 | Client side Javascript injects Similars HTML into page | | |
That’s all great, but we still haven’t used the cache! The cache entry will get used for subsequent requests for the same page, like this:
| Step | Browser | Varnish | Backend Server |
|---|---|---|---|
| 1 | Requests http://www.redfin.com/…/home/604622 | | |
| 2 | | Passes request to server | |
| 3 | | | Returns HTML including an ESI like <esi:include src="/similars?property_id=604622" /> |
| 4 | | Lookup /similars?property_id=604622 in cache | |
| 5 | | Cache lookup SUCCEEDS | |
| 6 | | Injects the Similars block into the HTML to be returned | |
| 7 | | Returns HTML to Browser | |
| 8 | Displays HTML including Similars (no AJAX calls) | | |
There are two things worth noting about this exchange.
First, when the backend server gets a request for /similars?property_id=604622, it has to decide if it should be returning the real HTML, or should be returning Javascript that will retrieve the HTML via AJAX. It makes this decision based on the value of a header passed in by the client. When the client is making an AJAX request, it knows it better NOT get back a response that generates AJAX requests (that’d be a death spiral.) Therefore, when it makes the AJAX request, it includes the special header. In all other cases, the special header is NOT included. When the header is included in a request, the server will generate the real HTML. When the header is not included, Varnish may answer the request from cache, or it may pass through to the backend server. If the request is fulfilled by the Varnish cache, then it’s the real HTML, but if it’s fulfilled by the backend server, it’ll be the AJAXy HTML.
Second, there are two URLs that have to do with similars.
/similars?property_id=604622 is an internal-use-only URL that returns the content (either the proper HTML or the AJAX code.)
/extranet-similars?property_id=604622 is an externally facing URL that only returns an ESI fragment (which will subsequently be filled in by Varnish.) This way, the ESI endpoints are never available to the extranet; Varnish can get to them, but extranet clients have no need for them. This lets us be lazy with the ESI URLs. For example, URLs that are exposed to the extranet do extra validation to check if the user is logged in, etc. URLs for internal use only, such as the ESI URLs, can skip that work. This also lets us change the URLs when the property changes, to facilitate cache busting (see the “Cache busting” section in ESI and Caching Trickery in Varnish for more information.)
Pseudocode
OK, so we know what we want the interaction to look like. What code will make this happen? Here’s some Javaish pseudocode that illustrates how it might work:
/*
Invoked for requests like http://www.redfin.com/[address]/home/[property id]
*/
public void handlePropertyRequest(Request request, Response response, long propId) {
    Property property = getProperty(propId);
    //The page embeds the internal /similars URL; Varnish fills it in
    response.write("<html><head></head><body>" +
        ...
        "<esi:include src='/similars?property_id=" +
        propId +
        "&last_mod=" +
        property.getLastModified() +
        "'/>" +
        ...
        "</body></html>");
}
/*
Invoked for (extranet) requests like /extranet-similars?property_id=[property id]&last_mod=[date]
*/
public void handleExtranetSimilarsRequest(Request request, Response response, long propId) {
    Property property = getProperty(propId);
    //Returns only an ESI include pointing at the internal /similars URL
    response.write("<esi:include src='/similars?property_id=" +
        propId +
        "&last_mod=" +
        property.getLastModified() +
        "'/>");
}
/*
Invoked for (intranet) requests like /similars?property_id=[property id]&last_mod=[date]
*/
public void handleSimilarsRequest(Request request, Response response, long propId) {
    if (null == request.getHeader("full_html")) {
        //This request does NOT demand that we return the actual HTML.
        // We will return a script block that will fetch the HTML via AJAX.
        response.write("<script>" +
            "dojo.addOnLoad(" +
                "function() {" +
                    "dojo.xhrGet({" +
                        "url: 'http://www.redfin.com/extranet-similars?property_id=" + propId + "'," +
                        "load: function(response, ioArgs){" +
                            "dojo.byId('similar_homes').innerHTML = response;" +
                            "return response;" +
                        "}," +
                        "headers: {'full_html': 'true'}," +
                        "handleAs: 'text'" +
                    "});" +
                "}" +
            ");" +
            "</script>");
        //Do NOT cache the script
        response.setCacheable(false);
    }
    else {
        //This request wants the actual HTML for similars
        response.write(getSimilarsHTML(propId));
        //The similars HTML is cacheable- that's the whole point!
        response.setCacheable(true);
    }
}
May 4, 2010
Varnish is a high performance, flexible, open source HTTP accelerator.
We started using Varnish at Redfin in our last major release, a few weeks ago. It’s pretty much invisible to our end users, but we’re so happy with it that we wanted to give the folks who made Varnish their props in public. It has really been great!
Varnish combines three technologies that are really useful at Redfin:
- A caching reverse proxy to reduce load on our backend servers
- ESI (Edge Side Includes) to break a page into snippets of HTML which can each have their own caching strategy
- VCL (Varnish Configuration Language) which enables fine grained control of Varnish
We use Varnish to accelerate the delivery of home details pages. When you visit the page for a home (e.g. http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622), parts of that page are cacheable but other parts can’t be easily cached. For example, the description of the home may be available to all users, but MLSs require us to hide some historical information from users who aren’t logged in. Further, while most of the page might be highly cacheable, the “Sites Linking to 830 El Camino Del Mar” section isn’t as easy to cache- a blog post that refers to our page (via a trackback) may come in at any time.
ESI nesting makes it easy to accommodate these vagaries.

Conceptually, here’s what the HTML for our main page looks like:
<html>
<body>
Some notes about this home
Sites Linking to 830 El Camino Del Mar:
<esi:include src="/esi-listing-trackbacks?listing-id=123" />
Median House Values:
<esi:include src="/esi-listing-regions?listing-id=123" />
</body>
</html>
Varnish will fill in the details of each of the esi:include sections with results from the “src” URL. In this example, a single HTTP request from the browser to Varnish will cause Varnish to make three HTTP requests to the backend server (one for the main page, one for the trackbacks, and one for the regions.)
Turning a single request into three requests doesn’t really help per-se, but it does enable caching. Previous to ESI, we were unable to cache the page as a whole since the “Sites Linking to” section was uncacheable. By breaking the page into three sections, we can support caching for some of the sections, while disallowing caching of the other sections.
The workflow of a request that’s partially answered from cache might look something like this:
1. The browser requests http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622
2. Varnish receives that request, and looks up the URL in its cache
3. Varnish finds a match in the cache, so it doesn’t send the request for /CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622 through to the backend. Instead it retrieves the content from the cache, and searches it for ESI tags.
4. Varnish finds the ESI include for /esi-listing-trackbacks?listing-id=123
5. Varnish looks up /esi-listing-trackbacks?listing-id=123 in the cache. There’s no entry, so Varnish requests /esi-listing-trackbacks?listing-id=123 from the backend.
6. The backend calculates the content for /esi-listing-trackbacks?listing-id=123 and returns it (along with cache control headers specifying that the results should not be cached)
7. Varnish likewise retrieves the results for /esi-listing-regions?listing-id=123
8. Varnish knits the three HTML snippets together and returns the results to the browser
The big win here is that ESI allows us to cache the main body of the page, even though the trackbacks cannot be cached. This is a tricky bit, so I’ll repeat it. The “outer” HTML, which is the main body of the page, is cached. But the “inner” HTML, the HTML for trackbacks, is NOT cached. The cache of the outer content doesn’t include the inner content- it just includes a token saying “fill in this inner content before you use this cache entry.”
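The workflow above can be sketched as a toy ESI processor in Java. This is a simplified model, not Varnish's implementation: `EsiAssembler` is a hypothetical class, the backend is just a function from URL to body, and cacheability is decided by a predicate on the URL rather than by response headers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy model of a caching proxy that resolves <esi:include> tags.
class EsiAssembler {
    private static final Pattern INCLUDE =
            Pattern.compile("<esi:include src=\"([^\"]+)\"\\s*/>");

    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> backend;
    private final Predicate<String> cacheable;
    int backendCalls = 0;

    EsiAssembler(Function<String, String> backend, Predicate<String> cacheable) {
        this.backend = backend;
        this.cacheable = cacheable;
    }

    // Serve from cache when possible; otherwise ask the backend and cache if allowed.
    private String fetch(String url) {
        String body = cache.get(url);
        if (body == null) {
            backendCalls++;
            body = backend.apply(url);
            if (cacheable.test(url)) {
                cache.put(url, body);
            }
        }
        return body;
    }

    // The cached outer page still contains the include tags;
    // each tag is resolved on every request.
    String render(String pageUrl) {
        Matcher m = INCLUDE.matcher(fetch(pageUrl));
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(fetch(m.group(1))));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The point to notice is that the cache stores the outer template with its include tokens intact, so an uncacheable fragment costs one backend call per request while the outer page costs none after the first.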
Of course, that’s just the simplest case. In practice, we faced a number of minor challenges while implementing this.
1. Recording every hit
We have two conflicting goals. On the one hand, we’d like to serve content up from cache as often as reasonable- users get the content faster, and our backend systems scale better. On the other hand, we’d like to record every page hit. Whenever a user views a page describing a listing, we record various information. We would like every request to get through Varnish and into our backend, so that we can record this information.
As with nearly every problem in Computer Science, this is solved by adding a layer of code. In this case, the “outer” request is NEVER cached, but all it does is record the hit and generate an ESI include. The “inner” request does the heavy lifting, but responses are cached. For example, the user might request http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622 which would result in this “outer” response:
Cache-Control: max-age=0
<esi:include src="/esi-display-listing?cache-for-logged-out&listing-id=604622" />
which would in turn generate a cache lookup for /esi-display-listing?cache-for-logged-out&listing-id=604622. If that’s cached, it’s fast. If it’s not cached, we gotta do all the work.
2. Caching public content without caching user-specific content
The main page content for a home (e.g. http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622) is the same for all anonymous users. However, users that are logged in will see additional details, such as whether or not that home is a “favorite.” Thus, it’s easy to cache for anonymous users, but harder to cache for logged in users (we don’t cache the main page content for logged in users.) It’s easy enough to set the cache-control response headers such that Varnish won’t cache content for logged in users. But we wanted to optimize a bit more- we wanted to avoid even attempting cache lookups when the user is logged in. We did this by adding VCL which examines the incoming request. If the request includes cookies that indicate the user is logged in, we skip the cache lookup. We also put a special token into the URL to make it easy for the VCL logic to know that it should do this magic for the request (since the URLs are ESI URLs, they’re not visible to the extranet.) Here’s what the VCL looks like:
sub vcl_recv {
    ...
    if (req.http.Cookie ~ "RF_AUTH") {
        set req.http._rf_login = regsub( req.http.Cookie, "^.*?RF_PARTY_ID=([^;]*?);*.*$", "\1" );
    }
    # cookies by default make requests in Varnish uncacheable
    unset req.http.Cookie;
    ...
    if (req.url ~ "cache-for-logged-out") {
        #Directive says to use cache for logged out users, but not for logged in users
        if (req.http._rf_login) {
            #Since there's an RF_AUTH, the user is logged in- do not use cache
            pass;
        }
        else {
            #The user is NOT logged in- use cache (but do not look up based on cookies)
            lookup;
        }
    }
    ...
}
3. Cache busting
We’d like to cache HTML describing a listing for a long time (24 hours), but when we get new listing data, we want to show that to users immediately.
One approach is to explicitly invalidate any cache entries that refer to the listing. We could identify all Varnish instances that might cache the data and individually invalidate the content in each one. However, that’s a little difficult to do from Java, it may be unreliable (it requires that we keep good records about all Varnish instances), and it’s generally a PITA.
Instead, we include the last modified time of the listing in the URL. Again, the ESI URLs are internal, so this doesn’t dirty our extranet URLs. My earlier example was incomplete. A request for http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622 might generate a response that looks like this:
<esi:include src="/esi-display-listing?cache-for-logged-out&listing-id=604622&last-mod=1272651333452" />
(note the “last-mod” argument, which represents the last modification date of the Listing.) That way, whenever the listing changes, the URL to the main ESI fragment will change- stale cache entries will be orphaned.
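As a sketch of how such a URL might be built on the Java side (the helper name `displayListingUrl` is hypothetical, and treating last-mod as epoch milliseconds is an assumption based on the example value):

```java
// Hypothetical helper that builds the internal ESI URL with a cache-busting argument.
class EsiUrls {
    // lastModifiedMillis is assumed to be the listing's last modification time
    // in epoch milliseconds; any value that changes on every update would work.
    static String displayListingUrl(long listingId, long lastModifiedMillis) {
        return "/esi-display-listing?cache-for-logged-out"
                + "&listing-id=" + listingId
                + "&last-mod=" + lastModifiedMillis;
    }
}
```

Since Varnish looks up cache entries by URL, a new last-mod value produces a new cache entry, and the old entry simply stops being requested until it expires.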
4. Tuning Varnish
When we initially deployed Varnish, it was returning 503 Service Unavailable errors. Michael Young (our intrepid CTO) changed many of the Varnish settings, including connect_timeout, sess_workspace, thread_pool_min, and thread_pool_max. The most important thing he did was match the number of Varnish threads to our expected traffic, and the 503 errors went away (pretty much.)
P.S. Thanks to Odalaigh for the gorgeous image
December 29, 2009
We use Hibernate for object-relational mapping (ORM) and organize our code into domain objects and data access objects.
Historically many of our domain objects have been marked with the “@Proxy(lazy = false)” Hibernate annotation. This annotation tells Hibernate that it should NOT create lazy proxies for the annotated class.
At Redfin, these were almost all bugs. We should never use “@Proxy(lazy = false)” without a big comment explaining why it’s necessary. Our default should be “@Proxy(lazy = true)”. Laziness is good!

Here’s my quick understanding of the effects of the @Proxy annotation. As with everything in Hibernate, each individual piece seems simple, but when you consider all the features that Hibernate exposes, and how they interact, it can become pretty complicated.
Hibernate Load Options
When Hibernate loads objects that refer to other objects (i.e. have member objects), it needs to do something about the associated objects. For example, suppose that Cat objects contain (optional) references to Owner objects. When Hibernate is loading a Cat object into memory, it has to decide what to do about the Owner member variable. There are a number of things it COULD do:
- When it constructs SQL to load the Cat, it could include the Owner table and columns in the SELECT clause, so that all the data is loaded at once
- It could load the Cat object, and subsequently load the Owner object (via a second SQL statement)
- It could load the Cat object, and set the Owner member to a placeholder (a proxy), which can be filled in later when the Owner information is needed
Note that it CANNOT simply do nothing about the Owner- if it instantiates a Cat and leaves the Owner member null when the DB says that the Cat DOES have an Owner, then consumers of the Cat will be misinformed- they’ll think that the Cat has no Owner, which is false.
Option 1 (get all the info in 1 SQL statement) is efficient when loading multiple Cats for which the Owner information is needed. For example, if some code needed to iterate over 1000 Cats, and get Owner information for each one, this approach would be efficient.
However, option 1 is inefficient in cases where the secondary information is not needed. E.g. if some code needed to iterate over 1000 Cats but did NOT need to get Owner information, then loading the Owner information is an obvious inefficiency.
Worse, taking option 1 to the extreme can cause an explosion in the data load. For example, a Cat might have an Owner, the Owner might have a Home, the Home might have an Address, which might have a City, which might have a State, etc. Loading the whole object graph into memory via SQL could be very inefficient. Further, every change to domain objects could cause many SQL statements to get hairier (e.g. adding a Country member to the State object would effectively add to the SQL needed to load Cat objects.)
Option 2 (load the Cat, then load the Owner) is simple, and often not bad, but never optimal (if you know you’ll need the Owner info, it’s more efficient to load it in a single SQL statement; if you know you won’t use it you should never load it; if you won’t know until later, delaying the load is better.)
However, option 2 is particularly bad when bulk operations are being performed. For instance, if some code were to load up every Cat object in the database to do some processing, this could be accomplished via a single SQL statement (though it’d probably be better to break it into chunks of, say, 10,000 Cat objects.) However, Hibernate would run a “SELECT * FROM owners” type statement for every Cat object that has an Owner- potentially millions of SQL statements.
Option 3 (load the Cat and set its Owner member variable to a proxy- load the Owner info on demand) is a compromise. It allows code to do bulk operations without loading the ancillary information (e.g. load all Cat objects without ever loading any Owner objects.) However, it requires additional SQL statements to load the secondary information IF that info is needed (e.g. if code loaded all Cat objects, then accessed the Owner for each Cat, option 3 would result in potentially millions of SQL statements.) Note that if the Owner information is never needed, then option 3 is most efficient- the information is never loaded.
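The trade-off in option 3 is easy to see in a plain Java simulation that counts statements instead of running them. This does not use Hibernate at all; `LazyLoadingSketch`, `LazyOwner`, and the counter are hypothetical stand-ins for a lazy proxy and the SQL it would trigger.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of lazy loading; no Hibernate or database involved.
class LazyLoadingSketch {
    // Counts the SQL statements a real ORM would have issued.
    static int sqlStatements = 0;

    static class Owner {
        final String name;

        Owner(String name) {
            this.name = name;
        }
    }

    // Stand-in for a lazy proxy: "runs SQL" on first access only.
    static class LazyOwner {
        private final long ownerId;
        private Owner loaded;

        LazyOwner(long ownerId) {
            this.ownerId = ownerId;
        }

        Owner get() {
            if (loaded == null) {
                sqlStatements++; // simulates one SELECT against the owners table
                loaded = new Owner("owner-" + ownerId);
            }
            return loaded;
        }
    }

    static class Cat {
        final LazyOwner owner;

        Cat(LazyOwner owner) {
            this.owner = owner;
        }
    }

    // Loads all cats with a single simulated SELECT; owners stay unloaded.
    static List<Cat> loadCats(int count) {
        sqlStatements++;
        List<Cat> cats = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            cats.add(new Cat(new LazyOwner(i)));
        }
        return cats;
    }
}
```

Loading 1000 cats costs one statement; touching every owner afterwards adds 1000 more, which is the "potentially millions of SQL statements" case at larger scale. If the owners are never touched, the count stays at one.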
Hibernate allows programmers to influence which strategy it will take. It offers (at least) two types of control: direct control over the SQL it generates, and control over the proxies.
See https://www.hibernate.org/315.html and https://www.hibernate.org/162.html for information on Hibernate fetching strategies and lazy loading.
Controlling SQL
When you’re implementing a DAO method, you can tell Hibernate whether it should proactively fetch information about member objects.
Under the Criteria API, Hibernate lets you call criteria.setFetchMode to tell Hibernate that it should load the additional info immediately, or should defer it. Hibernate uses the term “eager” to mean “load immediately”, and “lazy” to mean “defer loading.”
When using HQL, you can use the FETCH keyword to specify the fetch mode, which is equivalent.
When using SQL, you can use the query.addJoin method to tell Hibernate that you’ve written SQL which retrieves information for member objects. In this case, you’ll be responsible for writing the joins, etc., yourself.
Controlling Proxies
Hibernate also lets you control the existence and behavior of proxies via the annotation mentioned above. Annotating a class with “@Proxy(lazy = false)” tells Hibernate to NOT support lazy proxies for that type of object (of course “@Proxy(lazy = true)” tells Hibernate to support lazy proxies.) This allows the writer of the domain object to essentially override the wishes of the writer of the DAO. If the DAO writer would like to load members in a lazy manner, but the domain object in question doesn’t support lazy loading, then Hibernate will NOT lazy load the object (since it cannot.)
If you’re writing a class for which lazy loading would be dangerous, then you SHOULD disallow lazy proxies, since DAO writers probably won’t understand the detailed load requirements of your class. However, this is unusual. In most cases, lazy proxies are safe.
Since the writer of the domain object can control what choices are available to the writer of the DAO object, they need to use that power judiciously. You CAN code all of your domain objects to disallow lazy loading, which will force all writers of DAOs to use load options 1 or 2 (load all members via fancy SQL, or load all members via secondary SQL statements.) But you generally should not. DAO writers often rely on option 3 (lazy loading), particularly when they know that the member objects will never be accessed (or when they’re not sure.) If you specify “@Proxy(lazy = false)”, you’ve made it impossible for DAO writers to use option 3, which means it may be difficult for them to get their code to perform well. Worse, the writer of the DAO may not realize that you did that, or may not understand the implications. Hibernate queries are actually kinda hard to view, so the writer of the DAO may have created a huge performance problem and not even known it (until you go into production.)
Only the client really knows
Even the writer of the DAO doesn’t know how the client will use the objects it returns. If you’re implementing the CatDAO, you might add a method like getBasementCatsAndOwners, which would return all black cats and pre-fetch the corresponding owners. You think you’re clever because you’ve avoided a major performance problem, but a caller might try to get the Home for each Owner, defeating your pre-fetching strategy. The DAO writer should do their best to anticipate the needs of their callers, and to name and document their methods such that callers can understand what they do, but ultimately the caller is in control, and can (unintentionally) defeat the optimizations of the DAO writer. If your database were large and you knew that you had clients that sometimes needed Owners, sometimes needed Owners and Homes, etc., you might make three methods: CatDAO.getBasementCats, CatDAO.getBasementCatsAndOwners, and CatDAO.getBasementCatsAndOwnersAndHomes.
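To make the "caller can defeat your pre-fetching" point concrete, here is a minimal sketch in plain Java. It does not use Hibernate at all: the Lazy helper, the query counter, and the PrefetchDemo class are all invented for illustration, and the "database" is just a counter that goes up every time something would have triggered a round trip.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class PrefetchDemo {
    // Counts simulated round trips to the database.
    static int queries = 0;

    // Minimal lazy reference: runs its loader (one "query") on first access only.
    static class Lazy<T> {
        private Supplier<T> loader;
        private T value;

        Lazy(Supplier<T> loader) { this.loader = loader; }

        // An already-loaded reference: accessing it never costs a query.
        static <T> Lazy<T> loaded(T value) {
            Lazy<T> ref = new Lazy<>(null);
            ref.value = value;
            return ref;
        }

        T get() {
            if (loader != null) {
                queries++;
                value = loader.get();
                loader = null;
            }
            return value;
        }
    }

    static class Home { }

    static class Owner {
        // The DAO method below does NOT pre-fetch homes.
        final Lazy<Home> home = new Lazy<>(Home::new);
    }

    static class Cat {
        final Lazy<Owner> owner;
        Cat(Lazy<Owner> owner) { this.owner = owner; }
    }

    // Hypothetical stand-in for CatDAO.getBasementCatsAndOwners:
    // one query returns the cats with their owners already attached.
    static List<Cat> getBasementCatsAndOwners(int count) {
        queries++;
        List<Cat> cats = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            cats.add(new Cat(Lazy.loaded(new Owner())));
        }
        return cats;
    }

    public static void main(String[] args) {
        List<Cat> cats = getBasementCatsAndOwners(100);
        for (Cat cat : cats) cat.owner.get();            // free: owners were pre-fetched
        System.out.println("queries after owners: " + queries);
        for (Cat cat : cats) cat.owner.get().home.get(); // one extra load per owner
        System.out.println("queries after homes: " + queries);
    }
}
```

Touching the owners costs nothing beyond the original query, but the moment the caller asks each Owner for its Home, the count jumps by one per owner. That is the situation a getBasementCatsAndOwnersAndHomes method would exist to avoid.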
Conclusion: @Proxy(lazy = false) is generally evil
As mentioned above, when you indicate that a domain object should not support lazy proxies, you make it hard for DAO writers to get their code to perform well. Worse, you disable a capability that they may be counting on, and they may not notice until there are major performance problems. Unless you have a good reason not to, use “@Proxy(lazy = true)” on your domain objects.
P.S.
Lazy proxies do have some known problems.
First, the lazy proxy is NOT the same as the actual object. If you depend on the datatype of the object, you may have problems, since the type of the proxy isn’t the same as the type of the actual object (e.g. a proxy for an Owner is not actually an Owner- it’s a subclass.)
Second, you may have to think carefully about methods like equals() or hashCode(), since the proxies may not do what you expect.
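Both problems can be reproduced without Hibernate. In the sketch below, OwnerProxy is a hand-written stand-in for the subclass Hibernate would generate at runtime (a real proxy delegates to a lazily loaded target rather than carrying the id itself), and the strictEquals/tolerantEquals names are made up so the two styles can sit side by side.

```java
class ProxyPitfalls {
    static class Owner {
        private final long id;

        Owner(long id) { this.id = id; }

        long getId() { return id; }

        // Fragile: an exact-class comparison rejects any subclass, including a proxy.
        boolean strictEquals(Object other) {
            return other != null
                    && getClass() == other.getClass()
                    && id == ((Owner) other).getId();
        }

        // More tolerant: instanceof accepts subclasses, and the other side's
        // id is read through its accessor rather than its field.
        boolean tolerantEquals(Object other) {
            return other instanceof Owner && id == ((Owner) other).getId();
        }
    }

    // Simplified stand-in for a runtime-generated lazy proxy class.
    static class OwnerProxy extends Owner {
        OwnerProxy(long id) { super(id); }
    }

    public static void main(String[] args) {
        Owner real = new Owner(7);
        Owner proxy = new OwnerProxy(7);

        System.out.println(proxy instanceof Owner);          // the proxy IS-A Owner...
        System.out.println(proxy.getClass() == Owner.class); // ...but its class is not Owner
        System.out.println(real.strictEquals(proxy));
        System.out.println(real.tolerantEquals(proxy));
    }
}
```

Code that checks getClass(), or an equals() written with an exact-class comparison, treats the real object and its proxy as different things even though they represent the same row. An instanceof check plus accessor calls is one common way to sidestep that, though whether it fits depends on your class hierarchy.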
P.P.S.
Thanks to carloneworld for the great lazy kitty photo!
March 6, 2009
As part of our recent release, we added every survey submitted about our agents to our agent profiles. To prep for this, we needed to do a mail merge with each survey response, which we exported from a Postgres database. The problem was, when our users submitted their answers, they used the skills their grade 3 English teacher taught them and wrote in paragraphs. But all the common export formats indicate new records by using line-breaks, so we needed a way to clean the whitespace from these surveys.
As it turns out, Postgres has some very useful regex functions that make string operations a breeze. But since no one wants to have to reconstruct the appropriate syntax every time they need to clean whitespace, you can make a Postgres function that wraps the functionality:
CREATE OR REPLACE FUNCTION clean_whitespace(to_clean text) RETURNS text AS $$
BEGIN
RETURN regexp_replace(to_clean, E'[ \t\n\r]+', ' ', 'g');
END;
$$ LANGUAGE plpgsql IMMUTABLE;
This replaces each group of whitespace in the argument with a single space. The IMMUTABLE flag tells Postgres that the function always returns the same result for the same argument and has no side-effects, which allows it to be used in indexes. Also notice that we only want to match occurrences of length at least 1 (by using “+” rather than “*”), because otherwise you end up with a space between every character!
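If you want to see the “+” versus “*” difference without a database handy, the same character class behaves the same way in Java's regex engine. This is only an analogy (Java's java.util.regex, not Postgres's regexp_replace), and the WhitespaceDemo class and its method names are invented for the example.

```java
class WhitespaceDemo {
    // Same character class as the Postgres function: space, tab, newline, carriage return.
    // "+" requires at least one whitespace character per match.
    static String cleanWithPlus(String text) {
        return text.replaceAll("[ \t\n\r]+", " ");
    }

    // "*" also matches the empty string, so it "matches" at every position,
    // including between characters that have no whitespace between them.
    static String cleanWithStar(String text) {
        return text.replaceAll("[ \t\n\r]*", " ");
    }

    public static void main(String[] args) {
        System.out.println("[" + cleanWithPlus("Great agent!\n\nVery  helpful.") + "]");
        System.out.println("[" + cleanWithStar("ab") + "]");
    }
}
```

The first call collapses the blank line and the double space into single spaces. The second call, given a string with no whitespace at all, still inserts a space at every position, which is exactly the failure described above.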
Thanks to Thomas Kellerer on the postgres message board for pointing us in the right direction with regards to the arguments we needed.
(Photo credits: jamesdale10 on Flickr)
November 5, 2007
Redfin recently switched some of our backend DB infrastructure from MySQL to Postgres, and we plan to wholly switch to Postgres in the near future. This wasn’t an easy decision; MySQL has a lot going for it, and switching has been a lot of work. However, we’ve already seen major benefits from choosing Postgres, and we expect to see more as we complete our transition. In particular, performance on certain geographic queries has improved dramatically.
A simple Google search shows that a lot of people have already opined about MySQL versus Postgres (e.g. here, here, here, here, here, and here) but we weren’t able to find much information that applied directly to the problem we were having. Specifically, we were having some major performance problems with queries that were constrained by both spatial and numeric columns, and all of our attempts to squeeze more performance out of MySQL (including hiring expensive outside consultants) had come to naught.
GIS Indexes
Redfin is an online real estate company, and our map based UI is the most-used part of our web site (as well as being the biggest performance hog.) When a user views the map, we use SQL to find the relevant listings or past sales. Users typically constrain a search by numerous criteria, such as maximum price or minimum square footage. Since the UI is map based, users are ALWAYS constraining by geography, though that constraint might be weak.
How We Did It In MySQL
In MySQL, the queries might look something like:
SELECT
    *
FROM
    listings
WHERE
    price <= 400000 AND
    num_bedrooms >= 2 AND
    num_bathrooms >= 1.5 AND
    type = 'condo' AND
    MBRContains(GeomFromText('POLYGON((X1 Y1,X1 Y2,X2 Y2,X2 Y1,X1 Y1))'), centroid_col)
LIMIT
    101
where X1/Y1 and X2/Y2 are lat/long pairs that describe the region to be searched. To improve performance, we create indexes on the relevant columns. In MySQL, a normal index cannot include spatial columns, and spatial indexes cannot include normal columns. In this example, we might have one multi-column index on price, num_bedrooms, and num_bathrooms, and another single-column index on centroid_col. In many cases, this performs great. Examples include:
- When the table is small (we have hundreds of thousands of listings, but tens of millions of past sales records)
- When the geographic constraint is very selective (i.e. when the map is zoomed very far in)
- When the geographic constraint is the only constraint (i.e. the user doesn’t care about price, bedrooms, etc.)
- When the constraints are poor, but the LIMIT amount is hit quickly (e.g. search for all listings in the world; MySQL can quickly find the first 101 rows in the table, and once it's found 101, it can give up)
However, there were also cases where it performed terribly, particularly when the table was big, the geographic constraints were relatively weak, and other constraints were relatively strong. For example, a search for all past sales in the San Francisco Bay Area that had 1 bedroom, but sold for over $10,000,000 resulted in a “killer” query. This is a little counterintuitive, but was definitely a problem for our customers (though my example is a very extreme case.) The problem with this query is that:
- The table is large (tens of millions of rows)
- The geographic index is the best index to use, but still isn’t great (might return 500,000 rows, or ~1% of the table)
- MySQL would “short circuit” the query when 101 records were found, but the query returns fewer than 101 records (there are few 1 bedroom condos that sold for more than $10M), so MySQL examines all 500,000 rows that match the geographic constraint
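The interaction between the LIMIT and the selectivity of the non-geographic filters can be modeled in a few lines of Java. This is a toy model of the reasoning above, not a description of how MySQL actually executes the query; the LimitScanDemo class, the rowsExamined method, and the made-up filters are all assumptions for illustration.

```java
import java.util.function.IntPredicate;

class LimitScanDemo {
    // Walks the rows produced by the geographic index, applies the remaining
    // filters to each one, and stops as soon as `limit` matches are collected.
    // Returns how many rows had to be examined along the way.
    static int rowsExamined(int candidateRows, IntPredicate otherFilters, int limit) {
        int matched = 0;
        int examined = 0;
        for (int row = 0; row < candidateRows && matched < limit; row++) {
            examined++;
            if (otherFilters.test(row)) {
                matched++;
            }
        }
        return examined;
    }

    public static void main(String[] args) {
        int candidates = 500_000; // rows that pass the geographic constraint

        // Weak filters: every second row qualifies, so the limit is hit early.
        System.out.println(rowsExamined(candidates, row -> row % 2 == 0, 101));

        // Strong filters: one row in 100,000 qualifies, so the limit is never
        // reached and every candidate row gets examined.
        System.out.println(rowsExamined(candidates, row -> row % 100_000 == 0, 101));
    }
}
```

With weak filters the scan stops after a couple of hundred rows; with strong filters it has to look at all 500,000, which is the "killer" case.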
This does happen in real life.
For example, a user might be looking at homes in a small neighborhood. She's looking for a 2 bedroom condo between $350k and $375k with a view (a fairly heavily constrained query.) Then she zooms the map out a few levels (maybe she wants to see a lot of the map to pick out other neighborhoods of interest.) She has just unwittingly made a killer query- she's searching a large geography with tight constraints on other attributes.
Another example is an investor- someone who wants to search large geographic areas for "fixer" properties that have a low asking price and large living area. Again, this results in a query that's tightly constrained by some criteria, but relatively loosely constrained by geography.
Postgres and PostGIS
Jeff Yee, our intrepid head of QA, pointed out that geographic indexes in Postgres are supported through the feature-rich PostGIS plug-in. PostGIS supports all sorts of goodies (such as polygon containment, distance calculations, projection conversion, etc.), but the biggest gain is support for indexes on multiple, mixed-type columns. Using PostGIS, we could create an index on centroid_col, price, and num_bedrooms. These indexes turned many of our “killer” queries into pussycats. It was immediately obvious that for Redfin, PostGIS is a Very Good Thing. PostGIS offers us more than just a huge performance improvement and robust, sophisticated geographic functionality. It also offers an active community- there are lots of users available to answer silly newbie questions, and the software is being actively developed. On top of that, there’s a great Windows installer.
Other Considerations
MyISAM and Data Corruption
In MySQL, our tables were MyISAM, since the geometric indexes we used were only supported on MyISAM tables. MyISAM generally offers very good performance, but unfortunately we’ve experienced data corruption on our production systems a number of times. It’s VERY painful, but we can live with occasional corruption if that’s the only way to deliver the performance we seek. PostGIS has given us another option, and we expect the advanced locking and data protection in Postgres to make data corruption a thing of the past.
Replication
We use a “single master-multiple slave” configuration in production, which requires data replication. The MySQL replication options are not super flexible, but they did exactly what we needed them to do, and they did it really well. Replication was easy to set up, easy to monitor, and proved to be very reliable. In Postgres, we had more options, and more confusion. It took us a while to work out exactly how we would do replication; validating and implementing that plan took considerable effort. It’s in production now, and it is working fine, but it was certainly a lot more effort than in MySQL. There’s also an ongoing cost- replicating DDL changes is more complicated under Postgres than it was under MySQL.
Advanced Features
Advanced PostGIS features such as polygon matching and distance calculation have already helped us move much more quickly on Redfin features. Most of these things CAN be done in MySQL (e.g. by post-processing query results in Java using the excellent JTS Topology Suite library from Vivid Solutions), but it’s significantly more work, and in some cases would degrade performance. Hopefully, you’ll see new Redfin features in the near future, and think to yourself “Aah, they’re making PostGIS do the heavy lifting- the lazy bastards.” Postgres also contains advanced features that we were able to immediately benefit from. In particular, we use the CLUSTER command to optimize our table for access via the multicolumn geographic index.
Conclusion
Switching to Postgres was a lot of work. This was compounded by the fact that we chose to “toe-dip” into Postgres- most of our tables are still in MySQL- so our Java code is cluttered with logic to choose the correct DB connection for each query, to construct the “correct” SQL for each DB (most Redfin developers were not required to use Postgres during the development cycle, and we wanted to be able to fall back to MySQL if Postgres turned out to be a disaster), etc. We use Hibernate for persistence, which added another layer of complexity. However, when I see the performance gains we’ve made, I know it’s all worthwhile. The best cases probably aren’t much better, but the worst cases are startlingly better. Postgres and PostGIS let me feel good about telling my friends to use “past sales” searches on Redfin- I’m confident they won’t be waiting long for their results!
Dolphins may be smarter than elephants, but in the end, elephants are domesticable and can carry a heavy load.

October 9, 2007
One of the main goals of our latest release was to improve the overall performance of the user interface, so we’re starting an intermittent series today on the dev blog talking about what we learned along the way.
Firstly, we had to recognize (channeling Steve Souders of YSlow and High Performance Web Sites fame) that the largest part of our performance problems was on the client side, and the problems were most severe in IE6.
We set about trying to optimize client performance independent of network latencies and server query times. For the most part, this meant reducing the time the browser spent running our JavaScript. One reason this is particularly important is that the browser does not do anything else while JavaScript is running; the UI is completely locked up. No events, no back button, no browser menus.
If you do a lot of heavy processing on the client side in JavaScript, this can be a real problem since it causes visible delays and makes a web application generally clunky and unresponsive. Accepting that there were times when our website was sluggish, we set out to improve the overall performance of our user interface. More about how we figured out where to start after the jump. Read the rest of this entry »