ESI and Caching Trickery in Varnish

Varnish is a high-performance, flexible, open source HTTP accelerator.

We started using Varnish at Redfin in our last major release, a few weeks ago. It’s pretty much invisible to our end users, but we’re so happy with it that we wanted to give the folks who made Varnish their props in public. It has really been great!

Varnish combines three technologies that are really useful at Redfin:

  1. A caching reverse proxy to reduce load on our backend servers
  2. ESI (Edge Side Includes) to break a page into snippets of HTML which can each have their own caching strategy
  3. VCL (Varnish Configuration Language) which enables fine-grained control of Varnish

We use Varnish to accelerate the delivery of home details pages. When you visit the page for a home (e.g. http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622), parts of that page are cacheable but other parts can’t be easily cached. For example, the description of the home may be available to all users, but MLSs require us to hide some historical information from users who aren’t logged in. Further, while most of the page might be highly cacheable, the “Sites Linking to 830 El Camino Del Mar” section isn’t as easy to cache: a blog post that refers to our page (via a trackback) may come in at any time.

ESI nesting makes it easy to accommodate these vagaries.

Conceptually, here’s what the HTML for our main page looks like:

<html>
<body>
Some notes about this home

Sites Linking to 830 El Camino Del Mar:
<esi:include src="/esi-listing-trackbacks?listing-id=123" />

Median House Values:
<esi:include src="/esi-listing-regions?listing-id=123" />
</body>
</html>

Varnish will fill in the details of each of the esi:include sections with results from the “src” URL. In this example, a single HTTP request from the browser to Varnish will cause Varnish to make three HTTP requests to the backend server (one for the main page, one for the trackbacks, and one for the median house values by region).

Turning a single request into three requests doesn’t really help per se, but it does enable caching. Before ESI, we were unable to cache the page as a whole since the “Sites Linking to” section was uncacheable. By breaking the page into three sections, we can support caching for some of the sections, while disallowing caching of the other sections.
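To make the request fan-out concrete, here is a minimal Python sketch of the include expansion. It is a toy model, not how Varnish is implemented: the BACKEND table, the /home/123 path, and the fetch and render helpers are all made up for illustration.

```python
import re

# Hypothetical stand-in for the backend: maps a URL to the HTML it returns.
BACKEND = {
    "/home/123": (
        "<html><body>Some notes about this home\n"
        '<esi:include src="/esi-listing-trackbacks?listing-id=123" />\n'
        '<esi:include src="/esi-listing-regions?listing-id=123" />\n'
        "</body></html>"
    ),
    "/esi-listing-trackbacks?listing-id=123": "<ul><li>a blog post</li></ul>",
    "/esi-listing-regions?listing-id=123": "<p>Median house values</p>",
}

# Matches a self-closing esi:include tag and captures its src URL.
ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]+)"\s*/>')

backend_requests = []


def fetch(url):
    """Pretend to make one HTTP request to the backend."""
    backend_requests.append(url)
    return BACKEND[url]


def render(url):
    """Fetch a page and replace each esi:include tag with the fragment it names."""
    body = fetch(url)
    return ESI_INCLUDE.sub(lambda match: render(match.group(1)), body)


page = render("/home/123")
print(len(backend_requests))  # 3: the main page plus two fragments
```

Because render calls itself on every fragment, nested includes would be expanded the same way.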

The workflow of a request that’s partially answered from cache might look something like this:

1. The browser requests http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622
2. Varnish receives that request, and looks up the URL in its cache
3. Varnish finds a match in the cache, so it doesn’t send the request for /CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622 through to the backend. Instead it retrieves the content from the cache, and searches it for ESI tags.
4. Varnish finds the ESI include for /esi-listing-trackbacks?listing-id=123
5. Varnish looks up /esi-listing-trackbacks?listing-id=123 in the cache. There’s no entry, so Varnish requests /esi-listing-trackbacks?listing-id=123 from the backend.
6. The backend calculates the content for /esi-listing-trackbacks?listing-id=123 and returns it (along with cache control headers specifying that the results should not be cached)
7. Varnish likewise retrieves the results for /esi-listing-regions?listing-id=123
8. Varnish knits the three HTML snippets together and returns the results to the browser

The big win here is that ESI allows us to cache the main body of the page, even though the trackbacks cannot be cached. This is a tricky bit, so I’ll repeat it. The “outer” HTML, which is the main body of the page, is cached. But the “inner” HTML, the HTML for trackbacks, is NOT cached. The cache of the outer content doesn’t include the inner content; it just includes a token saying “fill in this inner content before you use this cache entry.”
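The “token” idea can be sketched in a few lines of Python. This is only a model under stated assumptions: the BACKEND table, the /home/123 path, and the lookup and serve helpers are hypothetical, and a plain dict stands in for the Varnish cache. The point is that the cache stores the outer body with its esi:include tag still in place, and the tag is expanded on every response.

```python
import re

ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]+)"\s*/>')

# Hypothetical backend: URL -> (body, whether the response may be cached).
BACKEND = {
    "/home/123": (
        '<p>Home notes</p>'
        '<esi:include src="/esi-listing-trackbacks?listing-id=123" />',
        True,
    ),
    "/esi-listing-trackbacks?listing-id=123": ("<ul><li>trackback</li></ul>", False),
}

cache = {}
backend_log = []


def lookup(url):
    """Return the raw body for url, from the cache when possible."""
    if url in cache:
        return cache[url]
    backend_log.append(url)
    body, cacheable = BACKEND[url]
    if cacheable:
        cache[url] = body  # stored with its ESI tags unexpanded
    return body


def serve(url):
    """Expand ESI tags on the way out; the cache only ever holds raw bodies."""
    return ESI_INCLUDE.sub(lambda match: serve(match.group(1)), lookup(url))


serve("/home/123")  # cold cache: both URLs go to the backend
serve("/home/123")  # warm cache: only the trackbacks fragment goes to the backend
print(backend_log)
```

After the two calls, the backend log holds the main page once and the trackbacks fragment twice, which matches the workflow in the numbered steps.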

Of course, that’s just the simplest case. In practice, we faced a number of minor challenges while implementing this.

1. Recording every hit

We have two conflicting goals. On the one hand, we’d like to serve content up from cache as often as reasonable: users get the content faster, and our backend systems scale better. On the other hand, we’d like to record every page hit. Whenever a user views a page describing a listing, we record various information. We would like every request to get through Varnish and into our backend, so that we can record this information.

As with nearly every problem in Computer Science, this is solved by adding a layer of code. In this case, the “outer” request is NEVER cached, but all it does is record the hit and generate an ESI include. The “inner” request does the heavy lifting, but its responses are cached. For example, the user might request http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622 which would result in this “outer” response:

Cache-Control: max-age=0

<esi:include src="/esi-display-listing?cache-for-logged-out&listing-id=604622" />

which would in turn generate a cache lookup for /esi-display-listing?cache-for-logged-out&listing-id=604622. If that’s cached, it’s fast. If it’s not cached, we gotta do all the work.
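A rough Python sketch of what such an “outer” handler could look like follows. Our backend is not written in Python, and outer_response and hit_log are hypothetical names; the sketch only shows the shape of the idea: record the hit, mark the response uncacheable, and emit nothing but an include for the cacheable fragment.

```python
hit_log = []  # hypothetical stand-in for wherever page views get recorded


def outer_response(listing_id):
    """Record the hit, then hand the real work to a cacheable ESI fragment."""
    hit_log.append(listing_id)
    # max-age=0 keeps this cheap outer response out of the cache,
    # so every page view reaches the backend and gets recorded.
    headers = {"Cache-Control": "max-age=0"}
    body = (
        '<esi:include src="/esi-display-listing'
        f'?cache-for-logged-out&listing-id={listing_id}" />'
    )
    return headers, body


headers, body = outer_response(604622)
print(body)
```

The outer handler stays cheap because it never renders any listing HTML itself.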

2. Caching public content without caching user-specific content

The main page content for a home (e.g. http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622) is the same for all anonymous users. However, users who are logged in will see additional details, such as whether or not that home is a “favorite.” Thus, it’s easy to cache for anonymous users, but harder to cache for logged-in users (we don’t cache the main page content for logged-in users).

It’s easy enough to set the Cache-Control response headers such that Varnish won’t cache content for logged-in users. But we wanted to optimize a bit more: we wanted to avoid even attempting cache lookups when the user is logged in. We did this by adding VCL which examines the incoming request. If the request includes cookies that indicate the user is logged in, we skip the cache lookup. We also put a special token into the URL to make it easy for the VCL logic to know that it should do this magic for the request (since the URLs are ESI URLs, they’re never seen by outside clients). Here’s what the VCL looks like:

sub vcl_recv {
    ...
    # RF_AUTH in the cookies means the user is logged in. Record that in a
    # private request header, because the cookies are removed just below.
    if (req.http.Cookie ~ "RF_AUTH") {
        set req.http._rf_login = regsub( req.http.Cookie, "^.*?RF_PARTY_ID=([^;]*?);*.*$", "1" );
    }

    # cookies by default make requests in Varnish uncacheable
    unset req.http.Cookie;
    ...
    if (req.url ~ "cache-for-logged-out") {
        # Directive says to use cache for logged out users, but not for logged in users
        if (req.http._rf_login) {
            # Since there's an RF_AUTH, the user is logged in- do not use cache
            pass;
        }
        else {
            # The user is NOT logged in- use cache (but do not look up based on cookies)
            lookup;
        }
    }
    ...
}
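The branch logic is small enough to model outside Varnish, which can be handy for reasoning about edge cases. The Python below is not VCL and is not part of our configuration; cache_action is a hypothetical helper that mirrors only the two branches shown above, and it returns None for URLs without the token because the excerpt elides what happens to them.

```python
def cache_action(url, cookie_header):
    """Mirror the vcl_recv branch: 'pass', 'lookup', or None for other URLs."""
    # VCL's `~` is a regex match; for a literal like RF_AUTH that is
    # equivalent to a substring test.
    logged_in = cookie_header is not None and "RF_AUTH" in cookie_header
    if "cache-for-logged-out" not in url:
        return None  # the VCL excerpt does not show this case
    return "pass" if logged_in else "lookup"


fragment = "/esi-display-listing?cache-for-logged-out&listing-id=604622"
print(cache_action(fragment, "RF_AUTH=abc; RF_PARTY_ID=42"))  # pass
print(cache_action(fragment, None))                           # lookup
```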

3. Cache busting

We’d like to cache HTML describing a listing for a long time (24 hours), but when we get new listing data, we want to show that to users immediately.

One approach is to explicitly invalidate any cache entries that refer to the listing. We could identify all Varnish instances that might cache the data and individually invalidate the content in each one. However, that’s a little difficult to do from Java, it may be unreliable (it requires that we keep good records about all Varnish instances), and it’s generally a PITA.

Instead, we include the last modified time of the listing in the URL. Again, the ESI URLs are internal, so this doesn’t dirty our extranet URLs. My earlier example was incomplete. A request for http://www.redfin.com/CA/San-Francisco/830-El-Camino-Del-Mar-94121/home/604622 might generate a response that looks like this:

<esi:include src="/esi-display-listing?cache-for-logged-out&listing-id=604622&last-mod=1272651333452" />

(Note the “last-mod” argument, which represents the last modification date of the listing.) That way, whenever the listing changes, the URL to the main ESI fragment will change, and stale cache entries will be orphaned.
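Here is a small Python sketch of the idea. The listing_fragment_url helper is hypothetical, a dict stands in for the Varnish cache, and the second timestamp is made up. Once the modification time changes, the new URL simply misses the cache; nothing has to be invalidated.

```python
def listing_fragment_url(listing_id, last_modified_ms):
    """Build the internal ESI URL, keyed on the listing's last-modified time."""
    return (
        "/esi-display-listing?cache-for-logged-out"
        f"&listing-id={listing_id}&last-mod={last_modified_ms}"
    )


cache = {}

# The fragment gets cached under the URL for the current version of the listing.
old_url = listing_fragment_url(604622, 1272651333452)
cache[old_url] = "<p>old listing HTML</p>"

# New listing data arrives, so the last-modified time (and the URL) changes.
new_url = listing_fragment_url(604622, 1272651399999)

print(new_url in cache)  # False: the next request goes to the backend
print(old_url in cache)  # True: the stale entry is orphaned, not deleted
```

The trade-off is that orphaned entries keep occupying cache space until they expire or are evicted.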

4. Tuning Varnish

When we initially deployed Varnish, it was returning 503 Service Unavailable errors. Michael Young (our intrepid CTO) changed many of the Varnish settings, including connect_timeout, sess_workspace, thread_pool_min, and thread_pool_max. The most important thing he did was match the Varnish threads to our expected traffic, and the 503 errors went away (pretty much).

P.S. Thanks to Odalaigh for the gorgeous image
