Stoyan is totally right and I’m totally wrong (see his comment below, which reads “The thing about google maps you load is that it’s an html page. When you load html page in object tag it’s as if you put it in an iframe. It includes all markup and extra css/js/img resources.”) My test was incorrect. I was testing with a Google Maps URL, but I should have been testing with a Google Maps API URL. I can’t explain how I used the wrong URL- I THOUGHT I copied that URL directly from our web site, but apparently not.
I’m sorry for the mistake and any confusion it may have created.
We use JDBC to connect to our database, but most of our code doesn’t connect directly to JDBC. Instead, we go through Hibernate, which is great for most purposes, but can make it difficult to do low level tweaks. We might want to do things like:
Generate performance metrics per-thread, to get a SQL oriented performance profile for individual controllers
Provide a single, central location to tweak SQL before running it
Write unit tests that make assertions about the number or type of SQL statements that higher level code runs
Trace Queries and ResultSet sizes
Debug the SQL generated by third party libraries, and how those libraries use JDBC
I wrote wrappers for the most relevant interfaces (Driver, Connection, Statement, CallableStatement, PreparedStatement, ResultSet.) It’s 100% boilerplate (I didn’t implement any USEFUL functionality- that’s for you to do!) It took me a few hours, so I thought I’d share- no point in us all writing the same boilerplate over and over! Obviously, you’ll have to tweak this code for your own purposes.
To use it, you would use ‘redfin’ in your JDBC URL scheme, like this: ‘jdbc:redfin://blahblah’. You’d also set your JDBC driver class to ‘redfin.util.jdbc.DriverWrapper’. The exact mechanism you use to do this is obviously dependent on your environment.
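To make the URL handling a bit more concrete, here's a minimal sketch of what the entry point of such a wrapper could look like. It is not the actual redfin.util.jdbc.DriverWrapper: the class name DriverWrapperSketch, the REAL_PREFIX constant, and the choice of PostgreSQL as the underlying database are all assumptions made for the example.

```java
import java.sql.Connection;
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.Properties;
import java.util.logging.Logger;

// Hypothetical sketch of a wrapping JDBC driver (not the original DriverWrapper).
final class DriverWrapperSketch implements Driver {
    // The scheme from the post: wrapped URLs look like jdbc:redfin://...
    static final String WRAPPED_PREFIX = "jdbc:redfin:";
    // Assumption: the real database behind the wrapper is PostgreSQL.
    static final String REAL_PREFIX = "jdbc:postgresql:";

    // Translate a wrapped URL into the URL the underlying driver understands.
    static String toRealUrl(String url) {
        return REAL_PREFIX + url.substring(WRAPPED_PREFIX.length());
    }

    @Override
    public boolean acceptsURL(String url) {
        return url != null && url.startsWith(WRAPPED_PREFIX);
    }

    @Override
    public Connection connect(String url, Properties info) throws SQLException {
        if (!acceptsURL(url)) {
            return null; // JDBC contract: return null for URLs meant for another driver
        }
        // A real wrapper would hand back its own Connection wrapper around this object.
        return DriverManager.getConnection(toRealUrl(url), info);
    }

    @Override
    public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) {
        return new DriverPropertyInfo[0];
    }

    @Override
    public int getMajorVersion() { return 1; }

    @Override
    public int getMinorVersion() { return 0; }

    @Override
    public boolean jdbcCompliant() { return false; }

    @Override
    public Logger getParentLogger() throws SQLFeatureNotSupportedException {
        throw new SQLFeatureNotSupportedException("logging is not used in this sketch");
    }
}
```

The interesting work (metrics, tracing, SQL tweaks) would live in the Connection and Statement wrappers that connect() returns; the driver itself only has to recognize the scheme and delegate.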
As we talked about before, Redfin uses Varnish to implement Edge Side Includes (ESI.) This involved breaking a single big (and expensive) page into individual chunks; each chunk would be generated by separate code, and would be cached on a different schedule.
Once we broke our expensive page into chunks that could be individually cached, it seemed pretty easy to have those chunks served up by different backend servers. Voilà, a monolithic app became “service oriented”! This would let us run the different software components on different machines (with different performance characteristics, different SLAs, even implementations in different languages/environments!)
Of course, nothing is actually that easy, and we made a number of mis-steps before we figured out how to do it.
How To
Varnish allows you to define multiple backends in your VCL. And in your vcl_recv function, you can decide which backend should handle a particular request. At Redfin, we added a new Varnish backend for each of our ESI endpoints, and we added logic to choose the relevant backend by URI. In practice, we only have one pool of machines handling our ESI requests, so all of our Varnish backends point to the same place.
So the first piece of the puzzle is on our main web servers. On the main web servers, requests go through Varnish. Requests for “normal” pages are sent through to Tomcat, but requests for ESIs are sent to one of the SOA backends. Here’s an example of what the VCL file might look like:
sub vcl_recv {
    if (req.url ~ "^/esi-listing-similars" || req.url ~ "^/esi-property-similars") {
        set req.backend = similars;
    }
    else if (req.url ~ "^/esi-listing-trackbacks") {
        set req.backend = relevantlinks;
    }
}
You might have noticed that the “localhost” backend is associated with port 8080 (where Tomcat is running), but the ESI backends are associated with port 6081 (where Varnish is running on those remote machines.)
We want the instance of Varnish on the main web server to cache content from the main web server, and the instances of Varnish on the ESI backends to cache the content from those backends. This has a few benefits:
Our effective cache is bigger, since we have caches on multiple machines, each of which has fixed memory
Having independent caches prevents one set of items from pushing another set out of the cache. If all the data were in a single cache, then cache entries holding similars information (which is small, but expensive to recreate) could be pushed out of the cache by cache entries of “main page” content (which is big and relatively cheap to recreate, but we’d still like to cache.)
It’s easy to flush individual caches without having to worry about performance problems with other parts of the site
We have another design goal: we’d like to have a single distribution of our software. We’d like to have a single WAR that we can put on any machine; we do NOT want to have to deal with multiple builds, with figuring out which build has been installed on which machine, etc. We’d like to be able to switch a single machine from being a standard web server to being an ESI endpoint without having to redeploy or reconfigure.
This creates a conundrum. We want our main web servers and our ESI servers to be identical, but we also want them to act differently. In particular, when an instance of Varnish on a web server gets a request for an ESI fragment, it should redirect that request to an ESI server (more precisely: to the Varnish instance running on an ESI server.) But when an instance of Varnish on an ESI server gets a request for an ESI fragment, it should forward the request to the local Tomcat instance. It should NOT forward the request to ITSELF. Forwarding port 6081 to port 6081 creates an infinite loop- not good.
We want to break the symmetry between the standard web servers and the ESI servers, and we do that by messing with the URIs.
We prepend our ESI URIs with a known prefix, which means “forward this to the ESI server.” But when we process the URI (while forwarding it), we strip off that prefix, so that the ESI server does not also forward it to itself. That’s harder to say than it is to code. The VCL code looks like this:
sub vcl_recv {
    if (req.url ~ "^/backend/") {
        set req.url = regsub(req.url, "^/backend/", "/");
        if (req.url ~ "^/esi-listing-similars" || req.url ~ "^/esi-property-similars") {
            set req.backend = similars;
        }
        else if (req.url ~ "^/esi-listing-trackbacks") {
            set req.backend = relevantlinks;
        }
    }
}
This breaks the circularity. The path of requests looks like:
A request comes into Varnish on the standard web server for /path/to/a/page
Varnish forwards the request to the local Tomcat instance
Tomcat responds with HTML that includes <esi:include src="/backend/esi-listing-similars" />
Varnish processes the ESI, and must make a request for /backend/esi-listing-similars
The Varnish instance on the standard web server strips off “/backend”, and sends a request for “/esi-listing-similars” to the ESI server
The Varnish instance on the ESI server gets the request for “/esi-listing-similars”
Since there’s no “/backend” prefix, the Varnish instance on the ESI server forwards the request to its local Tomcat instance
The Tomcat instance on the ESI server processes the request, and responds with the relevant HTML fragment
The Varnish instance on the ESI server caches the HTML fragment and returns it
The Varnish instance on the standard web server parses the HTML fragment into the main page content and returns it to the browser
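The two hops in that request path can be simulated in a few lines of Java. This is only a model of the routing decision, not something Varnish runs: the class and method names are made up for the example, and falling back to a backend called "localhost" when nothing matches is an assumption.

```java
// Simulation of the vcl_recv routing decision, for illustration only.
final class EsiRoutingSketch {
    static final String PREFIX = "/backend/";

    // Result of routing: which backend gets the request, and with which URL.
    static final class Route {
        final String backend;
        final String url;

        Route(String backend, String url) {
            this.backend = backend;
            this.url = url;
        }
    }

    static Route route(String url) {
        if (url.startsWith(PREFIX)) {
            // Strip the prefix so the next Varnish hop will not forward again.
            String stripped = "/" + url.substring(PREFIX.length());
            if (stripped.startsWith("/esi-listing-similars")
                    || stripped.startsWith("/esi-property-similars")) {
                return new Route("similars", stripped);
            } else if (stripped.startsWith("/esi-listing-trackbacks")) {
                return new Route("relevantlinks", stripped);
            }
            // Assumption: unknown prefixed URLs fall through to the local Tomcat.
            return new Route("localhost", stripped);
        }
        // No prefix: hand the request to the local Tomcat instance.
        return new Route("localhost", url);
    }
}
```

Routing "/backend/esi-listing-similars" picks the similars backend with the URL "/esi-listing-similars"; routing that rewritten URL a second time (as the ESI server would) picks the local Tomcat, which is exactly the asymmetry we were after.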
This example points out another tricky bit- how do we assure that the HTML fragment is cached by the Varnish service on the ESI server, but not by the Varnish service on the standard web server? To handle this correctly, we add a header to the response which indicates if it’s already been cached:
sub vcl_fetch {
    if (req.url ~ "^/esi-") {
        if (obj.http.X-RF-Cached ~ "true") {
            pass;
        }
        set obj.http.X-RF-Cached = "true";
    }
}
This code says “If there’s an X-RF-Cached header present, then don’t attempt to cache. If there is NOT an X-RF-Cached header present, then add one, and attempt to cache.” With this addition, the HTML fragments will only be cached on the first Varnish instance they pass through, which is on the ESI server in our case.
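If VCL isn't your thing, the same rule can be written as a tiny Java model. The class and method names are hypothetical, and the sketch uses an exact string comparison where the VCL uses a regex match.

```java
import java.util.HashMap;
import java.util.Map;

// Model of the X-RF-Cached rule for ESI fragments; names are made up for the example.
final class CachedHeaderSketch {
    static final String HEADER = "X-RF-Cached";

    // Returns true if this Varnish hop should cache the fragment.
    // Marks the response so that later hops will not cache it again.
    static boolean cacheOnThisHop(Map<String, String> responseHeaders) {
        if ("true".equals(responseHeaders.get(HEADER))) {
            return false; // an earlier hop already cached this fragment
        }
        responseHeaders.put(HEADER, "true");
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        System.out.println("ESI server caches: " + cacheOnThisHop(headers));
        System.out.println("Web server caches: " + cacheOnThisHop(headers));
    }
}
```

The first call stands in for the Varnish instance on the ESI server and the second for the one on the standard web server, which sees the header and passes.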
How NOT To
The solution described above works, and meets our requirements. But we also tried some solutions that did NOT work. Perhaps you can learn from our failures…
Putting Absolute URIs into ESI Includes
Our first thought was that we’d put absolute URIs into our ESI includes in the HTML. For instance, we tried to put <esi:include src="http://similars.redfin.com:6081/esi-listing-similars" /> into the main HTML of our page. Varnish simply (and correctly, I think) ignores the host name and port. Including http://similars.redfin.com:6081/esi-listing-similars will cause Varnish to act as if you included /esi-listing-similars, and Varnish will use whichever backend it thinks is relevant, regardless of the host name or port in the URI.
Using a Single Server as both a Standard Web Server and an ESI Server
When doing testing, or when some of our servers were unavailable, we were tempted to use a single server as both the standard web server and the ESI server. It seemed like this should work- the trick with the “/backend” prefix should prevent infinite circularity. However, it didn’t work. It seems that Varnish is doing its own checks for circularity, and noticing that a single request passed through the same Varnish instance multiple times (which NORMALLY would be a problematic example of circularity, but we’ve got our clever symmetry breaker in there!) Anyway, Varnish doesn’t allow it, and causes those semi-circular requests to fail.
When your webapp is serving up content that’s expensive to generate, you may want to serve it up asynchronously- via AJAX calls. This is particularly appealing when content is “below the fold.”
However when that content is cached, you want to serve it up as quickly as possible. If you’ve already calculated the content, you’d like to include it inline in the page, without requiring an AJAX roundtrip. That way, you avoid the latency of an unnecessary round-trip. You also allow the page to be fully rendered (so content doesn’t jump around), etc.
You can optimize for the empty cache, or you can optimize for the full cache, but it seems hard to optimize both experiences.
We want to say “if there’s a cache miss, then do AJAX, but if there’s a cache hit, then just include the content.” We have to make sure that the AJAX calls will fill the cache, such that subsequent requests will see cache hits, of course!
I’ll outline what the requests/responses look like for us, then I’ll include some pseudocode that supports this.
At the beginning of time, the cache is empty, and the browser requests information on a Listing.
3. The server returns HTML including an ESI like <esi:include src="/similars?property_id=604622" />
4. Varnish looks up /similars?property_id=604622 in its cache
5. The cache lookup fails
6. Varnish makes a request to /similars?property_id=604622
7. The server returns HTML for AJAX for Similars (e.g. a <script> block with a reference to http://www.redfin.com/extranet-similars?property_id=604622). The response includes “no cache” headers
8. Varnish injects the <script> block into the HTML to be returned. It does NOT cache the server response
9. Varnish returns HTML to the browser
10. The browser displays the HTML
11. The browser executes the <script> block
12. The browser requests http://www.redfin.com/extranet-similars?property_id=604622, including a special header saying “gimme the real content”
13. Varnish passes the /extranet-similars?property_id=604622 request to the server
14. The server returns HTML including an ESI like <esi:include src="/similars?property_id=604622" />
15. Varnish looks up /similars?property_id=604622 in its cache
16. The cache lookup fails
17. Varnish makes a request to /similars?property_id=604622, passing along the special “gimme the real content” header
18. The server examines the request, and sees the special “gimme the real content” header
19. The server calculates the correct HTML to display Similar Listings and Similar Sales
20. The server returns HTML including “please cache this” headers
21. Varnish injects the Similars block into the HTML to be returned. It DOES cache the server response
22. Varnish returns HTML to the browser
23. Client side Javascript injects the Similars HTML into the page
That’s all great, but we still haven’t used the cache! The cache entry will get used for subsequent requests for the same page, like this:
3. The server returns HTML including an ESI like <esi:include src="/similars?property_id=604622" />
4. Varnish looks up /similars?property_id=604622 in its cache
5. The cache lookup SUCCEEDS
6. Varnish injects the Similars block into the HTML to be returned
7. Varnish returns HTML to the browser
8. The browser displays the HTML including Similars (no AJAX calls)
There are two things worth noting about this exchange.
First, when the backend server gets a request for /similars?property_id=604622, it has to decide if it should be returning the real HTML, or should be returning Javascript that will retrieve the HTML via AJAX. It makes this decision based on the value of a header passed in by the client. When the client is making an AJAX request, it knows it better NOT get back a response that generates AJAX requests (that’d be a death spiral.) Therefore, when it makes the AJAX request, it includes the special header. In all other cases, the special header is NOT included. When the header is included in a request, the server will generate the real HTML. When the header is not included, Varnish may answer the request from cache, or it may pass through to the backend server. If the request is fulfilled by the Varnish cache, then it’s the real HTML, but if it’s fulfilled by the backend server, it’ll be the AJAXy HTML.
Second, there are two URLs that have to do with similars.
/similars?property_id=604622 is an internal-use-only URL that returns the content (either the proper HTML or the AJAX code.)
/extranet-similars?property_id=604622 is an externally facing URL that only returns an ESI fragment (which will subsequently be filled in by Varnish.) This way, the ESI endpoints are never available to the extranet; Varnish can get to them, but extranet clients have no need for them. This lets us be lazy with the ESI URLs. For example, URLs that are exposed to the extranet do extra validation to check if the user is logged in, etc. URLs for internal use only, such as the ESI URLs, can skip that work. This also lets us change the URLs when the property changes, to facilitate cache busting (see the “Cache busting” section in ESI and Caching Trickery in Varnish for more information.)
Pseudocode
OK, so we know what we want the interaction to look like. What code will make this happen? Here’s some Javaish pseudocode that illustrates how it might work:
/*
 Invoked for requests like http://www.redfin.com/[address]/home/[property id]
*/
public void handlePropertyRequest(Request request, Response response, long propId) {
    Property property = getProperty(propId);
    response.write("<html><head></head><body>" +
        ...
        // The main page includes the internal /similars URL directly
        "<esi:include src='/similars?property_id=" +
        propId +
        "&last_mod=" +
        property.getLastModified() +
        "'/>" +
        ...
        "</body></html>");
}
/*
 Invoked for (extranet) requests like /extranet-similars?property_id=[property id]&last_mod=[date]
*/
public void handleExtranetSimilarsRequest(Request request, Response response, long propId) {
    Property property = getProperty(propId);
    // Only emit an ESI include for the internal URL; Varnish fills in the content
    response.write("<esi:include src='/similars?property_id=" +
        propId +
        "&last_mod=" +
        property.getLastModified() +
        "'/>");
}
/*
 Invoked for (intranet) requests like /similars?property_id=[property id]&last_mod=[date]
*/
public void handleSimilarsRequest(Request request, Response response, long propId) {
    if (null == request.getHeader("full_html")) {
        //This request does NOT demand that we return the actual HTML.
        // We will return a script block that will fetch the HTML via AJAX.
        response.write("<script>" +
            "dojo.addOnLoad(" +
                "function() {" +
                    "dojo.xhrGet({" +
                        "url: 'http://www.redfin.com/extranet-similars?property_id=" + propId + "'," +
                        "load: function(response, ioArgs){" +
                            "dojo.byId('similar_homes').innerHTML = response;" +
                            "return response;" +
                        "}," +
                        "headers: {'full_html': 'true'}," +
                        "handleAs: 'text'" +
                    "});" +
                "}" +
            ");" +
            "</script>");
        //Do NOT cache the script
        response.setCacheable(false);
    }
    else {
        //This request wants the actual HTML for similars
        response.write(getSimilarsHTML(propId));
        //The similars HTML is cacheable- that's the whole point!
        response.setCacheable(true);
    }
}
Varnish is a high performance, flexible, open source HTTP accelerator.
We started using Varnish at Redfin in our last major release, a few weeks ago. It’s pretty much invisible to our end users, but we’re so happy with it that we wanted to give the folks who made Varnish their props in public. It has really been great!
Varnish combines three technologies that are really useful at Redfin:
We use Varnish to accelerate the delivery of home details pages. When you visit the page for a home (e.g. http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622), parts of that page are cacheable but other parts can’t be easily cached. For example, the description of the home may be available to all users, but MLSs require us to hide some historical information from users who aren’t logged in. Further, while most of the page might be highly cacheable, the “Sites Linking to 830 El Camino Del Mar” section isn’t as easy to cache- a blog post that refers to our page (via a trackback) may come in at any time.
ESI nesting makes it easy to accommodate these vagaries.
Conceptually, here’s what the HTML for our main page looks like:
<html>
<body>
Some notes about this home
Sites Linking to 830 El Camino Del Mar:
<esi:include src="/esi-listing-trackbacks?listing-id=123" />
Median House Values:
<esi:include src="/esi-listing-regions?listing-id=123" />
</body>
</html>
Varnish will fill in the details of each of the esi:include sections with results from the “src” URL. In this example, a single HTTP request from the browser to Varnish will cause Varnish to make three HTTP requests to the backend server (one for the main page, one for the trackbacks, and one for the regions.)
Turning a single request into three requests doesn’t really help per se, but it does enable caching. Before ESI, we were unable to cache the page as a whole since the “Sites Linking to” section was uncacheable. By breaking the page into three sections, we can support caching for some of the sections, while disallowing caching of the other sections.
The workflow of a request that’s partially answered from cache might look something like this:
1. The browser requests http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622
2. Varnish receives that request, and looks up the URL in its cache
3. Varnish finds a match in the cache, so it doesn’t send the request for /CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622 through to the backend. Instead it retrieves the content from the cache, and searches it for ESI tags.
4. Varnish finds the ESI include for /esi-listing-trackbacks?listing-id=123
5. Varnish looks up /esi-listing-trackbacks?listing-id=123 in the cache. There’s no entry, so Varnish requests /esi-listing-trackbacks?listing-id=123 from the backend.
6. The backend calculates the content for /esi-listing-trackbacks?listing-id=123 and returns it (along with cache control headers specifying that the results should not be cached)
7. Varnish likewise retrieves the results for /esi-listing-regions?listing-id=123
8. Varnish knits the three HTML snippets together and returns the results to the browser
The big win here is that ESI allows us to cache the main body of the page, even though the trackbacks cannot be cached. This is a tricky bit, so I’ll repeat it. The “outer” HTML, which is the main body of the page, is cached. But the “inner” HTML, the HTML for trackbacks, is NOT cached. The cache of the outer content doesn’t include the inner content- it just includes a token saying “fill in this inner content before you use this cache entry.”
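Here's a toy model of that idea in Java, in case it helps. It is nothing like how Varnish is actually implemented: the class name, the regex for the include tag, and the backend function are all assumptions for the sketch, and fragments are never cached here (as with our trackbacks).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of ESI assembly: the outer page is cached with its include tags intact.
final class EsiAssemblySketch {
    private static final Pattern INCLUDE =
            Pattern.compile("<esi:include src=\"([^\"]+)\"\\s*/>");

    private final Map<String, String> cache = new HashMap<>();
    int backendRequests = 0;

    // Store the outer HTML as-is; the cache entry holds the token, not the inner content.
    void cacheOuter(String url, String outerHtml) {
        cache.put(url, outerHtml);
    }

    // Serve a page: outer HTML from cache if present, every fragment fetched fresh.
    String serve(String url, Function<String, String> backend) {
        String outer = cache.get(url);
        if (outer == null) {
            backendRequests++;
            outer = backend.apply(url);
        }
        Matcher m = INCLUDE.matcher(outer);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            backendRequests++;
            String fragment = backend.apply(m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(fragment));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With the outer page already in the cache, serving it costs one backend request per include and none for the page itself.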
Of course, that’s just the simplest case. In practice, we faced a number of minor challenges while implementing this.
1. Recording every hit
We have two conflicting goals. On the one hand, we’d like to serve content up from cache as often as reasonable- users get the content faster, and our backend systems scale better. On the other hand, we’d like to record every page hit. Whenever a user views a page describing a listing, we record various information. We would like every request to get through Varnish and into our backend, so that we can record this information.
As with nearly every problem in Computer Science, this is solved by adding a layer of code. In this case, the “outer” request is NEVER cached, but all it does is record the hit and generate an ESI include. The “inner” request does the heavy lifting, but responses are cached. For example, the user might request http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622 which would result in this “outer” response:
Cache-Control: max-age=0

<esi:include src="/esi-display-listing?cache-for-logged-out&listing-id=123" />
which would in turn generate a cache lookup for /esi-display-listing?cache-for-logged-out&listing-id=123. If that’s cached, it’s fast. If it’s not cached, we gotta do all the work.
2. Caching public content without caching user-specific content
The main page content for a home (e.g. http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622) is the same for all anonymous users. However, users that are logged in will see additional details, such as whether or not that home is a “favorite.” Thus, it’s easy to cache for anonymous users, but harder to cache for logged in users (we don’t cache the main page content for logged in users.) It’s easy enough to set the cache-control response headers such that Varnish won’t cache content for logged in users. But we wanted to optimize a bit more- we wanted to avoid even attempting cache lookups when the user is logged in. We did this by adding VCL which examines the incoming request. If the request includes cookies that indicate the user is logged in, we skip the cache lookup. We also put a special token into the URL to make it easy for the VCL logic to know that it should do this magic for the request (since the URLs are ESI URLs, they’re not visible to the extranet.) Here’s what the VCL looks like:
sub vcl_recv {
    ...
    if (req.http.Cookie ~ "RF_AUTH") {
        set req.http._rf_login = regsub( req.http.Cookie, "^.*?RF_PARTY_ID=([^;]*?);*.*$", "\1" );
    }
    # cookies by default make requests in Varnish uncacheable
    unset req.http.Cookie;
    ...
    if (req.url ~ "cache-for-logged-out") {
        #Directive says to use cache for logged out users, but not for logged in users
        if (req.http._rf_login) {
            #Since there's an RF_AUTH, the user is logged in- do not use cache
            pass;
        }
        else {
            #The user is NOT logged in- use cache (but do not look up based on cookies)
            lookup;
        }
    }
    ...
}
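The decision that VCL makes can be restated as a small Java function. This is a simplified model with made-up names: it only checks whether the Cookie header mentions RF_AUTH and ignores the RF_PARTY_ID extraction.

```java
// Simplified model of the pass/lookup decision; names are hypothetical.
final class LoginCacheSketch {
    enum Action { PASS, LOOKUP, DEFAULT }

    static Action decide(String url, String cookieHeader) {
        if (!url.contains("cache-for-logged-out")) {
            return Action.DEFAULT; // no directive in the URL; other VCL rules apply
        }
        boolean loggedIn = cookieHeader != null && cookieHeader.contains("RF_AUTH");
        // Logged-in users skip the cache entirely; anonymous users get a cache lookup.
        return loggedIn ? Action.PASS : Action.LOOKUP;
    }
}
```

So a request for /esi-display-listing?cache-for-logged-out&listing-id=123 with an RF_AUTH cookie is passed straight to the backend, and the same request without cookies goes to the cache.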
3. Cache busting
We’d like to cache HTML describing a listing for a long time (24 hours), but when we get new listing data, we want to show that to users immediately.
One approach is to explicitly invalidate any cache entries that refer to the listing. We could identify all Varnish instances that might cache the data and individually invalidate the content in each one. However, that’s a little difficult to do from Java, it may be unreliable (it requires that we keep good records about all Varnish instances), and it’s generally a PITA.
Instead, we include the last modified time of the listing in the URL. Again, the ESI URLs are internal, so this doesn’t dirty our extranet URLs. My earlier example was incomplete. A request for http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622 might generate a response that looks like this:
<esi:include src="/esi-display-listing?cache-for-logged-out&listing-id=604622&last-mod=1272651333452" />
(note the “last-mod” argument, which represents the last modification date of the Listing.) That way, whenever the listing changes, the URL to the main ESI fragment will change- stale cache entries will be orphaned.
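On the Java side, generating such an include is just string building. Here's a minimal sketch; the class and method names are made up, and it assumes the last modified time is available as epoch milliseconds.

```java
// Builds the ESI include with the last-modified time baked into the URL.
final class CacheBustingSketch {
    static String esiInclude(long listingId, long lastModifiedMillis) {
        return "<esi:include src=\"/esi-display-listing?cache-for-logged-out"
                + "&listing-id=" + listingId
                + "&last-mod=" + lastModifiedMillis
                + "\" />";
    }

    public static void main(String[] args) {
        // Same listing, two modification times: two different cache keys.
        System.out.println(esiInclude(604622L, 1272651333452L));
        System.out.println(esiInclude(604622L, 1272651399999L));
    }
}
```

Since Varnish caches by URL, the second include misses the cache and the old entry simply ages out.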
4. Tuning Varnish
When we initially deployed Varnish, it was returning 503 Service Unavailable errors. Michael Young (our intrepid CTO) changed many of the Varnish settings, including connect_timeout, sess_workspace, thread_pool_min, and thread_pool_max. The most important thing he did was match the Varnish threads to our expected traffic, and the 503 errors went away (pretty much.)
Historically many of our domain objects have been marked with the “@Proxy(lazy = false)” Hibernate annotation. This annotation tells Hibernate that it should NOT create lazy proxies for the annotated class.
At Redfin, these were almost all bugs. We should never use “@Proxy(lazy = false)” without a big comment explaining why it’s necessary. Our default should be “@Proxy(lazy = true)”. Laziness is good!
Here’s my quick understanding of the effects of the @Proxy annotation. As with everything in Hibernate, each individual piece seems simple, but when you consider all the features that Hibernate exposes, and how they interact, it can become pretty complicated.
Hibernate Load Options
When Hibernate loads objects that refer to other objects (i.e. have member objects), it needs to do something about the associated objects. For example, suppose that Cat objects contain (optional) references to Owner objects. When Hibernate is loading a Cat object into memory, it has to decide what to do about the Owner member variable. There are a number of things it COULD do:
1. When it constructs SQL to load the Cat, it could include the Owner table and columns in the SELECT clause, so that all the data is loaded at once
2. It could load the Cat object, and subsequently load the Owner object (via a second SQL statement)
3. It could load the Cat object, and set the Owner member to a placeholder (a proxy), which can be filled in later when the Owner information is needed
Note that it CANNOT simply do nothing about the Owner- if it instantiates a Cat and leaves the Owner member null when the DB says that the Cat DOES have an Owner, then consumers of the Cat will be misinformed- they’ll think that the Cat has no Owner, which is false.
Option 1 (get all the info in 1 SQL statement) is efficient when loading multiple Cats for which the Owner information is needed. For example, if some code needed to iterate over 1000 Cats, and get Owner information for each one, this approach would be efficient.
However, option 1 is inefficient in cases where the secondary information is not needed. E.g. if some code needed to iterate over 1000 Cats but did NOT need to get Owner information, then loading the Owner information is an obvious inefficiency.
Worse, taking option 1 to the extreme can cause an explosion in the data load. For example, a Cat might have an Owner, the Owner might have a Home, the Home might have an Address, which might have a City, which might have a State, etc. Loading the whole object graph into memory via SQL could be very inefficient. Further, every change to domain objects could cause many SQL statements to get hairier (e.g. adding a Country member to the State object would effectively add to the SQL needed to load Cat objects.)
Option 2 (load the Cat, then load the Owner) is simple, and often not bad, but never optimal (if you know you’ll need the Owner info, it’s more efficient to load it in a single SQL statement; if you know you won’t use it you should never load it; if you won’t know until later, delaying the load is better.)
However, option 2 is particularly bad when bulk operations are being performed. For instance, if some code were to load up every Cat object in the database to do some processing, this could be accomplished via a single SQL statement (though it’d probably be better to break it into chunks of, say, 10,000 Cat objects.) However, Hibernate would run a “SELECT * FROM owners” type statement for every Cat object that has an Owner- potentially millions of SQL statements.
Option 3 (load the Cat and set its Owner member variable to a proxy- load the Owner info on demand) is a compromise. It allows code to do bulk operations without loading the ancillary information (e.g. load all Cat objects without ever loading any Owner objects.) However, it requires additional SQL statements to load the secondary information IF that info is needed (e.g. if code loaded all Cat objects, then accessed the Owner for each Cat, option 3 would result in potentially millions of SQL statements.) Note that if the Owner information is never needed, then option 3 is most efficient- the information is never loaded.
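To put rough numbers on the three options, here's a back-of-the-envelope cost model in Java. It is not Hibernate code: it just counts SQL statements, and it assumes every Cat has its own Owner, a single bulk query for the Cats, and no batching or second-level cache.

```java
// Counts SQL statements for each load option under the simplifying assumptions above.
final class FetchStrategySketch {
    // Option 1: one joined SELECT loads the cats and their owners together.
    static long joinFetch(long cats, boolean ownersAccessed) {
        return 1;
    }

    // Option 2: one SELECT for the cats, then one SELECT per owner, needed or not.
    static long secondarySelect(long cats, boolean ownersAccessed) {
        return 1 + cats;
    }

    // Option 3: one SELECT for the cats; owner SELECTs happen only if a proxy is touched.
    static long lazyProxy(long cats, boolean ownersAccessed) {
        return ownersAccessed ? 1 + cats : 1;
    }

    public static void main(String[] args) {
        long cats = 1000;
        System.out.println("join fetch:       " + joinFetch(cats, true));
        System.out.println("secondary select: " + secondarySelect(cats, false));
        System.out.println("lazy, unused:     " + lazyProxy(cats, false));
        System.out.println("lazy, all used:   " + lazyProxy(cats, true));
    }
}
```

Statement count isn't the whole story, of course: the single joined statement in option 1 pulls back more data, which is exactly the cost you pay when the Owners are never used.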
Hibernate allows programmers to influence which strategy it will take. It offers (at least) two types of control: direct control over the SQL it generates, and control over the proxies.
When you’re implementing a DAO method, you can tell Hibernate whether it should proactively fetch information about member objects.
Under the Criteria API, Hibernate lets you call criteria.setFetchMode to tell Hibernate that it should load the additional info immediately, or should defer it. Hibernate uses the term “eager” to mean “load immediately”, and “lazy” to mean “defer loading.”
When using HQL, you can use the FETCH keyword to specify the fetch mode, which is equivalent.
When using SQL, you can use the query.addJoin method to tell Hibernate that you’ve written SQL which retrieves information for member objects. In this case, you’ll be responsible for writing the joins, etc., yourself.
Controlling Proxies
Hibernate also lets you control the existence and behavior of proxies via the annotations mentioned above. Annotating a class with “@Proxy(lazy = false)” tells Hibernate to NOT support lazy proxies for that type of object (of course “@Proxy(lazy = true)” tells Hibernate to support lazy proxies.) This allows the writer of the domain object to essentially override the wishes of the writer of the DAO. If the DAO writer would like to load members in a lazy manner, but the domain object in question doesn’t support lazy loading, then Hibernate will NOT lazy load the object (since it cannot.)
If you’re writing a class for which lazy loading would be dangerous, then you SHOULD disallow lazy proxies, since DAO writers probably won’t understand the detailed load requirements of your class. However, this is unusual. In most cases, lazy proxies are safe.
Since the writer of the domain object can control what choices are available to the writer of the DAO object, they need to use that power judiciously. You CAN code all of your domain objects to disallow lazy loading, which will force all writers of DAOs to use load options 1 or 2 (load all members via fancy SQL, or load all members via secondary SQL statements.) But you generally should not. DAO writers often rely on option 3 (lazy loading), particularly when they know that the member objects will never be accessed (or when they’re not sure.) If you specify “@Proxy(lazy = false)”, you’ve made it impossible for DAO writers to use option 3, which means it may be difficult for them to get their code to perform well. Worse, the writer of the DAO may not realize that you did that, or may not understand the implications. Hibernate queries are actually kinda hard to view, so the writer of the DAO may have created a huge performance problem and not even known it (until you go into production.)
Only the client really knows
Even the writer of the DAO doesn’t know how the client will use the objects it returns. If you’re implementing the CatDAO, you might add a method like getBasementCatsAndOwners, which would return all black cats and pre-fetch the corresponding owners. You think you’re clever because you’ve avoided a major performance problem, but a caller might try to get the Home for each Owner, defeating your pre-fetching strategy. The DAO writer should do their best to anticipate the needs of their callers, and to name and document their methods such that callers can understand what they do, but ultimately the caller is in control, and can (unintentionally) defeat the optimizations of the DAO writer. If your database were large and you knew that you had clients that sometimes needed Owners, sometimes needed Owners and Homes, etc., you might make three methods: CatDAO.getBasementCats, CatDAO.getBasementCatsAndOwners, and CatDAO.getBasementCatsAndOwnersAndHomes.
As mentioned above, when you indicate that a domain object should not support lazy proxies, you make it hard for DAO writers to get their code to perform well. Worse, you disable a capability that they may be counting on, and they may not notice until there are major performance problems. Unless you have a good reason to, use “@Proxy(lazy = true)” on your domain objects.
P.S.
Lazy proxies do have some known problems.
First, the lazy proxy is NOT the same as the actual object. If you depend on the datatype of the object, you may have problems, since the type of the proxy isn’t the same as the type of the actual object (e.g. a proxy for an Owner is not actually an Owner- it’s a subclass.)
Second, you may have to think carefully about methods like equals() or hashCode(), since the proxies may not do what you expect.
P.P.S.
Thanks to carloneworld for the great lazy kitty photo!
I recently heard that some loyal Redfin customers were using Google to do address searches. That’s a shame, since Redfin does a pretty decent job of searching for addresses, MLS IDs, cities, etc. I wanted to see if I could help those power users get to Redfin search results with fewer clicks.
Naturally, I wanted to see if I could make a Redfin search available. It was surprisingly easy- I didn’t have to change any Redfin code. I just wrote an XML file, hosted it on a Web server, and put a link into an HTML page (this page, in fact!)
If you’re using Firefox or Internet Explorer 7, you should be able to enable Redfin search by choosing the relevant option in the search dropdown while viewing this page. Here’s a screenshot of Firefox:
Here’s how it looks in IE7:
Once you’ve added Redfin, you can do address searches on Redfin using the search control:
Another way to do searches in Redfin is to use a keyword bookmark. In Firefox, make a new bookmark, and edit it to look like this:
The important parts are the location and the keyword. The location got cut off in my screenshot, but it should be http://www.redfin.com/stingray/do/listings-search#search_location=%s
(the only tricky part is the “%s” part, which will get replaced with whatever you search for.)
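As a rough illustration of what the browser does with that bookmark, here’s a minimal sketch of the “%s” substitution. The function name expandKeywordBookmark and the sample address are made up for this example, and real browsers may escape the search terms a little differently (for instance, how they encode spaces), so treat it as an approximation and not as Firefox’s actual logic.

```javascript
// Hypothetical helper (not part of Firefox or Redfin): approximates how a
// keyword bookmark turns typed search terms into a URL.
function expandKeywordBookmark(template, searchTerms) {
  // Escape the terms so spaces, commas, etc. are safe inside a URL.
  var escaped = encodeURIComponent(searchTerms);
  // Use a replacer function so "$" sequences are never treated specially.
  return template.replace('%s', function () { return escaped; });
}

var template = 'http://www.redfin.com/stingray/do/listings-search#search_location=%s';
var url = expandKeywordBookmark(template, '123 Main St, Seattle');
console.log(url);
// http://www.redfin.com/stingray/do/listings-search#search_location=123%20Main%20St%2C%20Seattle
```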
To use it, you type the keyword and the search terms into your location bar, like this:
Redfin recently switched some of our backend DB infrastructure from MySQL to Postgres, and we plan to wholly switch to Postgres in the near future. This wasn’t an easy decision; MySQL has a lot going for it, and switching has been a lot of work. However, we’ve already seen major benefits from choosing Postgres, and we expect to see more as we complete our transition. In particular, performance on certain geographic queries has improved dramatically.
A simple Google search shows that a lot of people have already opined about MySQL versus Postgres, but we weren’t able to find much information that applied directly to the problem we were having. Specifically, we were having some major performance problems with queries that were constrained by both spatial and numeric columns, and all of our attempts to squeeze more performance out of MySQL (including hiring expensive outside consultants) had come to naught.
GIS Indexes
Redfin is an online real estate company, and our map based UI is the most-used part of our web site (as well as being the biggest performance hog.) When a user views the map, we use SQL to find the relevant listings or past sales. Users typically constrain a search by numerous criteria, such as maximum price or minimum square footage. Since the UI is map based, users are ALWAYS constraining by geography, though that constraint might be weak.
How We Did It In MySQL
In MySQL, the queries might look something like:
SELECT *
FROM listings
WHERE price <= 400000
  AND num_bedrooms >= 2
  AND num_bathrooms >= 1.5
  AND type = 'condo'
  AND MBRContains(GeomFromText('POLYGON((X1 Y1,X1 Y2,X2 Y2,X2 Y1,X1 Y1))'), centroid_col)
LIMIT 101
where X1/Y1 and X2/Y2 are lat/long pairs that describe the region to be searched. To improve performance, we create indexes on the relevant columns. In MySQL, a normal index cannot include spatial columns, and spatial indexes cannot include normal columns. In this example, we might have one multi-column index on price, num_bedrooms, and num_bathrooms, and another single-column index on centroid_col. In many cases, this performs great. Examples include:
When the table is small (we have hundreds of thousands of listings, but tens of millions of past sales records)
When the geographic constraint is very selective (i.e. when the map is zoomed very far in)
When the geographic constraint is the only constraint (i.e. the user doesn’t care about price, bedrooms, etc.)
When the constraints are poor, but the LIMIT amount is hit quickly (e.g. search for all listings in the world; MySQL can quickly find the first 101 rows in the table, and once it's found 101, it can give up)
However, there were also cases where it performed terribly, particularly when the table was big, the geographic constraints were relatively weak, and other constraints were relatively strong. For example, a search for all past sales in the San Francisco Bay Area that had 1 bedroom, but sold for over $10,000,000 resulted in a “killer” query. This is a little counterintuitive, but was definitely a problem for our customers (though my example is a very extreme case.) The problem with this query is that:
The table is large (tens of millions of rows)
The geographic index is the best index to use, but still isn’t great (might return 500,000 rows, or ~1% of the table)
MySQL would “short circuit” the query when 101 records were found, but the query returns fewer than 101 records (there are few 1 bedroom homes that sold for more than $10M), so MySQL examines all 500,000 rows that match the geographic constraint
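To make the “short circuit” point concrete, here’s a toy simulation. It is not how MySQL is implemented; it just models “walk the rows that the geographic index returned, apply the remaining filters, and stop once the LIMIT is reached.” The function name, the row fields, and the synthetic data are all invented for this sketch.

```javascript
// Toy model of applying non-indexed filters to index candidates with a LIMIT.
function scanWithLimit(candidates, matches, limit) {
  var results = [];
  var examined = 0;
  for (var i = 0; i < candidates.length; i++) {
    examined++;
    if (matches(candidates[i])) {
      results.push(candidates[i]);
      if (results.length >= limit) break; // LIMIT reached: stop early
    }
  }
  return { results: results, examined: examined };
}

// Synthetic stand-in for the 500,000 rows that matched the geographic index.
// Prices top out around $1.2M, so nothing here sold for over $10M.
var candidates = [];
for (var i = 0; i < 500000; i++) {
  candidates.push({ bedrooms: (i % 4) + 1, price: 200000 + (i % 1000) * 1000 });
}

// Weak extra constraint: the limit is hit almost immediately.
var weak = scanWithLimit(candidates, function (r) { return r.bedrooms >= 1; }, 101);

// Strong extra constraint: fewer than 101 matches, so every row is examined.
var strong = scanWithLimit(candidates, function (r) {
  return r.bedrooms === 1 && r.price > 10000000;
}, 101);

console.log(weak.examined);   // 101
console.log(strong.examined); // 500000
```

The counterintuitive part is visible in the two counts: the query that returns the fewest rows does the most work.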
This does happen in real life.
For example, a user might be looking at homes in a small neighborhood. She's looking for a 2 bedroom condo between $350k and $375k with a view (a fairly heavily constrained query.) Then she zooms the map out a few levels (maybe she wants to see a lot of the map to pick out other neighborhoods of interest.) She has just unwittingly made a killer query- she's searching a large geography with tight constraints on other attributes.
Another example is an investor- someone who wants to search large geographic areas for "fixer" properties that have a low asking price and large living area. Again, this results in a query that's tightly constrained by some criteria, but relatively loosely constrained by geography.
Postgres and PostGIS
Jeff Yee, our intrepid head of QA, pointed out that geographic indexes in Postgres are supported through the feature-rich PostGIS plug-in. PostGIS supports all sorts of goodies (such as polygon containment, distance calculations, projection conversion, etc.), but the biggest gain is support for indexes on multiple, mixed-type columns. Using PostGIS, we could create an index on centroid_col, price, and num_bedrooms. These indexes turned many of our “killer” queries into pussycats. It was immediately obvious that for Redfin, PostGIS is a Very Good Thing. PostGIS offers us more than just a huge performance improvement and robust, sophisticated geographic functionality. It also offers an active community- there are lots of users available to answer silly newbie questions, and the software is being actively developed. On top of that, there’s a great Windows installer.
Other Considerations
MyISAM and Data Corruption
In MySQL, our tables were MyISAM, since the geometric indexes we used were only supported on MyISAM tables. MyISAM generally offers very good performance, but unfortunately we’ve experienced data corruption on our production systems a number of times. It’s VERY painful, but we can live with occasional corruption if that’s the only way to deliver the performance we seek. PostGIS has given us another option, and we expect the advanced locking and data protection in Postgres to make data corruption a thing of the past.
Replication
We use a “single master-multiple slave” configuration in production, which requires data replication. The MySQL replication options are not super flexible, but they did exactly what we needed them to do, and they did it really well. Replication was easy to set up, easy to monitor, and proved to be very reliable. In Postgres, we had more options, and more confusion. It took us a while to work out exactly how we would do replication; validating and implementing that plan took considerable effort. It’s in production now, and it is working fine, but it was certainly a lot more effort than in MySQL. There’s also an ongoing cost- replicating DDL changes is more complicated under Postgres than it was under MySQL.
Advanced Features
Advanced PostGIS features such as polygon matching and distance calculation have already helped us move much more quickly on Redfin features. Most of these things CAN be done in MySQL (e.g. by post-processing query results in Java using the excellent JTS Topology Suite library from Vivid Solutions), but it’s significantly more work, and in some cases would degrade performance. Hopefully, you’ll see new Redfin features in the near future, and think to yourself “Aah, they’re making PostGIS do the heavy lifting- the lazy bastards.” Postgres also contains advanced features that we were able to immediately benefit from. In particular, we use the CLUSTER command to optimize our table for access via the multicolumn geographic index.
Conclusion
Switching to Postgres was a lot of work. This was compounded by the fact that we chose to “toe-dip” into Postgres- most of our tables are still in MySQL- so our Java code is cluttered with logic to choose the correct DB connection for each query, to construct the “correct” SQL for each DB (most Redfin developers were not required to use Postgres during the development cycle, and we wanted to be able to fall back to MySQL if Postgres turned out to be a disaster), etc. We use Hibernate for persistence, which added another layer of complexity. However, when I see the performance gains we’ve made, I know it’s all worthwhile. The best cases probably aren’t much better, but the worst cases are startlingly better. Postgres and PostGIS let me feel good about telling my friends to use “past sales” searches on Redfin- I’m confident they won’t be waiting long for their results!
Dolphins may be smarter than elephants, but in the end, elephants are domesticable and can carry a heavy load.
Writing rich date/time features in a web app can be a pain. Apps (such as schedulers) that do math on times (e.g. ordering times) should pay attention to time zones for those times, but it’s difficult to know which time zone should be used to display the times to the user. Asking the user to explicitly choose a time zone is natural and often necessary, but a long list of time zone choices can be intimidating to the user. I’ll discuss one method you can use to detect the probable time zone of a browser. It’s not perfect, but it offers a good default (and it’s easy to code.)
For apps that don’t have rich date/time functionality, times can be represented as simple numbers or strings. For example, if I wanted to meet you in San Francisco to go to the late showing of The Bourne Ultimatum at the Metreon theater, it’s probably fine to say “let’s meet at the theater at 11:15 PM on Friday; the show starts at 11:30 PM.” Since we’re in the same time zone as each other, and as the theater, we don’t really care about time zones. An application that’s facilitating this interaction could store and display the times “11:15 PM” and “11:30 PM” without regard to time zones.
If we want to do date/time processing on the back end (e.g. ordering events in time), it’s more general to store absolute times in the database. For example, instead of storing something like “hour => 11, minute => 15”, which might mean different absolute times to different users, it is convenient to store a canonical time such as the number of seconds since January 1, 1970 in GMT. That way, we can compare all of the times in our database without having to worry about the time zone for each.
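Here’s a small sketch of that idea. The helper name toEpochSeconds, the sample date, and the two offsets are assumptions chosen for illustration; the point is only that the same wall-clock reading in two places becomes two different, directly comparable numbers.

```javascript
// Hypothetical helper: convert a wall-clock time plus its GMT offset (in
// minutes, e.g. -420 for GMT-7) into seconds since January 1, 1970 GMT.
function toEpochSeconds(year, month, day, hour, minute, gmtOffsetMinutes) {
  // Date.UTC interprets its arguments as GMT and uses 0-based months.
  var asIfGmt = Date.UTC(year, month - 1, day, hour, minute) / 1000;
  // A clock that is ahead of GMT reaches a given reading earlier, so subtract.
  return asIfGmt - gmtOffsetMinutes * 60;
}

// The same wall-clock reading, "11:15 PM", in two different places.
var sanFrancisco = toEpochSeconds(2007, 8, 10, 23, 15, -420); // GMT-7
var newYork = toEpochSeconds(2007, 8, 10, 23, 15, -240);      // GMT-4

// Absolute times compare directly, with no time zone logic needed.
console.log(newYork < sanFrancisco);          // true
console.log((sanFrancisco - newYork) / 3600); // 3
```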
If we store “absolute” times in the database, then displaying times to users becomes a localization issue. It’s pretty similar to localizing web site content into the language of the user. You start with a notion of what you want to present to the user, then you identify how the user wants to receive it (i.e. their language/locale), and then you localize the content at runtime (i.e. you show them the content in their language.)
As with language, you need to know about the user. The HTTP protocol specifies that the “Accept-Language” header can be used by servers to find out which language(s) the user prefers. The “Accept-Language” header is nice because it lets websites show content in the “correct” language without having to explicitly ask the user. A user who only speaks French doesn’t have to puzzle through an English language page that says “click here for French” somewhere in a footer- they just see the content in French. Even better, it’s one less setting that the server has to manage, and that the user has to set and keep up-to-date.
Unfortunately, there is no corresponding “Accept-Timezone” header- the HTTP standard does not contain any facility to allow the browser to automatically tell the server what time zone the user cares about.
There are two standard ways for developers to deal with this.
First, they can ignore it. For many apps, this is a decent approach- just store “11:30 PM”, and don’t worry about the time zone. As long as all the users who care about that time know what time zone it’s in, then the app doesn’t have to keep track of it.
Second, they can ask the user to make an explicit choice. For example, when setting up Google Calendar, you are asked to choose a time zone. That’s fine for the developer, but finding the “right” time zone in a long list can be a pain for the user.
I wanted to let users choose a time zone on my site, but I also wanted to have an intelligent default: most users shouldn’t have to take any action, because the choice I make for them should be correct.
This calls for Javascript on the client. I wanted to write some Javascript that would choose the right option in a time zone dropdown.
This is slightly harder than it seems because Javascript ALSO does not contain a way to get the time zone of the user. Javascript DOES, however, provide a way to get the offset from GMT for any particular time. A time zone can be thought of as a rule that says what the GMT offset is for different times. We can therefore do a reverse mapping- if we know the GMT offset for a few times, we can figure out the time zone for the user. Time zones can be quite complicated: some include Daylight Saving Time, some start or end DST on different dates than others, sometimes the DST offset isn’t a full hour, etc. There are even time zones that are identical for all FUTURE times, but had differences in the PAST.
In theory, we could deal with all of these cases by doing many probes- we could check the GMT offset for many times, and get an exact time zone match. In practice, this really isn’t necessary- most users are in the more populous time zones, and the cost of failure (defaulting to a time zone that’s similar but not quite right) is not terribly high. Instead, we can probe two times (one in the summer and one in the winter) to find out the normal GMT offset, whether the time zone has Daylight Saving Time, and the DST offset.
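As a sketch of what those two probes can tell us: the helper summarizeOffsets below is hypothetical, and it assumes DST always moves clocks forward, so the smaller of the two offsets is treated as the “normal” one. That assumption also covers the southern hemisphere, where the January probe is the one that lands in DST.

```javascript
// Probe the browser's GMT offset (in minutes east of GMT) for one summer date
// and one winter date (northern hemisphere seasons).
function probeOffsets() {
  var summer = new Date(Date.UTC(2005, 6, 30)); // July 30, 2005
  var winter = new Date(Date.UTC(2006, 0, 30)); // January 30, 2006
  return {
    summerOffset: -1 * summer.getTimezoneOffset(),
    winterOffset: -1 * winter.getTimezoneOffset()
  };
}

// Hypothetical helper: derive the "normal" offset and the DST shift.
// Assumes DST moves clocks forward, so the smaller offset is standard time.
function summarizeOffsets(summerOffset, winterOffset) {
  return {
    standardOffset: Math.min(summerOffset, winterOffset),
    hasDst: summerOffset !== winterOffset,
    dstShift: Math.abs(summerOffset - winterOffset)
  };
}

// Whatever this machine reports:
var probed = probeOffsets();
console.log(summarizeOffsets(probed.summerOffset, probed.winterOffset));

// Fixed inputs: -420 in summer and -480 in winter.
console.log(summarizeOffsets(-420, -480));
// { standardOffset: -480, hasDst: true, dstShift: 60 }
```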
In terms of implementation, I wanted to basically make a list of recognized offsets. That is, a list that says “if the summer offset is -7 hours, and the winter offset is -8 hours, then the time zone is probably US/Pacific.”
I like hacking in Ruby, so I grabbed the TZInfo Ruby library, and wrote some code to run through the known time zones, figuring out the winter and summer offsets for each. After grouping by offsets, I had to choose a winner in the case of duplicates. When multiple time zones had the same summer and winter offsets, I searched for each of them on Google. I figured that the time zone with the most hits was probably the most popular one, so I chose that one. Here’s the Javascript code that I came up with:
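function getTimezoneName() {
// Date.UTC months are 0-based: 6 is July, and 12 rolls over to January 2006 (still a winter probe).
var tmSummer = new Date(Date.UTC(2005, 6, 30, 0, 0, 0, 0));
// getTimezoneOffset() returns minutes behind GMT, so negate it to get minutes ahead of GMT.
var so = -1 * tmSummer.getTimezoneOffset();
var tmWinter = new Date(Date.UTC(2005, 12, 30, 0, 0, 0, 0));
var wo = -1 * tmWinter.getTimezoneOffset();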
if (-660 == so && -660 == wo) return 'Pacific/Midway';
if (-600 == so && -600 == wo) return 'Pacific/Tahiti';
if (-570 == so && -570 == wo) return 'Pacific/Marquesas';
if (-540 == so && -600 == wo) return 'America/Adak';
if (-540 == so && -540 == wo) return 'Pacific/Gambier';
if (-480 == so && -540 == wo) return 'US/Alaska';
if (-480 == so && -480 == wo) return 'Pacific/Pitcairn';
if (-420 == so && -480 == wo) return 'US/Pacific';
if (-420 == so && -420 == wo) return 'US/Arizona';
if (-360 == so && -420 == wo) return 'US/Mountain';
if (-360 == so && -360 == wo) return 'America/Guatemala';
if (-360 == so && -300 == wo) return 'Pacific/Easter';
if (-300 == so && -360 == wo) return 'US/Central';
if (-300 == so && -300 == wo) return 'America/Bogota';
if (-240 == so && -300 == wo) return 'US/Eastern';
if (-240 == so && -240 == wo) return 'America/Caracas';
if (-240 == so && -180 == wo) return 'America/Santiago';
if (-180 == so && -240 == wo) return 'Canada/Atlantic';
if (-180 == so && -180 == wo) return 'America/Montevideo';
if (-180 == so && -120 == wo) return 'America/Sao_Paulo';
if (-150 == so && -210 == wo) return 'America/St_Johns';
if (-120 == so && -180 == wo) return 'America/Godthab';
if (-120 == so && -120 == wo) return 'America/Noronha';
if (-60 == so && -60 == wo) return 'Atlantic/Cape_Verde';
if (0 == so && -60 == wo) return 'Atlantic/Azores';
if (0 == so && 0 == wo) return 'Africa/Casablanca';
if (60 == so && 0 == wo) return 'Europe/London';
if (60 == so && 60 == wo) return 'Africa/Algiers';
if (60 == so && 120 == wo) return 'Africa/Windhoek';
if (120 == so && 60 == wo) return 'Europe/Amsterdam';
if (120 == so && 120 == wo) return 'Africa/Harare';
if (180 == so && 120 == wo) return 'Europe/Athens';
if (180 == so && 180 == wo) return 'Africa/Nairobi';
if (240 == so && 180 == wo) return 'Europe/Moscow';
if (240 == so && 240 == wo) return 'Asia/Dubai';
if (270 == so && 210 == wo) return 'Asia/Tehran';
if (270 == so && 270 == wo) return 'Asia/Kabul';
if (300 == so && 240 == wo) return 'Asia/Baku';
if (300 == so && 300 == wo) return 'Asia/Karachi';
if (330 == so && 330 == wo) return 'Asia/Calcutta';
if (345 == so && 345 == wo) return 'Asia/Katmandu';
if (360 == so && 300 == wo) return 'Asia/Yekaterinburg';
if (360 == so && 360 == wo) return 'Asia/Colombo';
if (390 == so && 390 == wo) return 'Asia/Rangoon';
if (420 == so && 360 == wo) return 'Asia/Almaty';
if (420 == so && 420 == wo) return 'Asia/Bangkok';
if (480 == so && 420 == wo) return 'Asia/Krasnoyarsk';
if (480 == so && 480 == wo) return 'Australia/Perth';
if (540 == so && 480 == wo) return 'Asia/Irkutsk';
if (540 == so && 540 == wo) return 'Asia/Tokyo';
if (570 == so && 570 == wo) return 'Australia/Darwin';
if (570 == so && 630 == wo) return 'Australia/Adelaide';
if (600 == so && 540 == wo) return 'Asia/Yakutsk';
if (600 == so && 600 == wo) return 'Australia/Brisbane';
if (600 == so && 660 == wo) return 'Australia/Sydney';
if (630 == so && 660 == wo) return 'Australia/Lord_Howe';
if (660 == so && 600 == wo) return 'Asia/Vladivostok';
if (660 == so && 660 == wo) return 'Pacific/Guadalcanal';
if (690 == so && 690 == wo) return 'Pacific/Norfolk';
if (720 == so && 660 == wo) return 'Asia/Magadan';
if (720 == so && 720 == wo) return 'Pacific/Fiji';
if (720 == so && 780 == wo) return 'Pacific/Auckland';
if (765 == so && 825 == wo) return 'Pacific/Chatham';
if (780 == so && 780 == wo) return 'Pacific/Enderbury';
if (840 == so && 840 == wo) return 'Pacific/Kiritimati';
return 'US/Pacific';
}
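To finish the job described earlier (pre-selecting an entry in the time zone dropdown), something like the following could be used. This is only a sketch: selectOptionByValue is a made-up helper, the element id 'timezone' is an assumption, and it presumes each option’s value is one of the zone names returned above. A plain object stands in for the real select element so the example runs outside a browser.

```javascript
// Hypothetical helper: select the option whose value matches the detected zone.
// Works with a real <select> element or any object shaped like one.
function selectOptionByValue(select, value) {
  for (var i = 0; i < select.options.length; i++) {
    if (select.options[i].value === value) {
      select.selectedIndex = i;
      return true;
    }
  }
  return false; // no match: leave the current selection alone
}

// In a page (the element id 'timezone' is an assumption):
//   selectOptionByValue(document.getElementById('timezone'), getTimezoneName());

// Stand-in for a <select> so the sketch runs outside a browser.
var fakeSelect = {
  selectedIndex: 0,
  options: [{ value: 'US/Eastern' }, { value: 'US/Central' }, { value: 'US/Pacific' }]
};
var found = selectOptionByValue(fakeSelect, 'US/Pacific');
console.log(found, fakeSelect.selectedIndex); // true 2
```

If the detected name isn’t in the list, the function returns false and the dropdown keeps whatever default the server rendered, so the user can still pick a zone by hand.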