January 22, 2012
First off, let me disclaim one thing: I’m not a developer at Redfin. I’m actually just a member of the product management and design team. I have been a dev in my past life, but I’m rusty to say the least. Even though it’s not part of my job, I still love to code a little on the side when I can. I’m always afraid my coding muscles will atrophy and I’ll end up being like this guy from the movie Beetlejuice. And, well, that wouldn’t be a good look for me.
If you’re a Redfin fan, you can rest easy though. My code is usually either for internal use or just some automation to complete a task. It’s almost never something you’d interact with on the website or our mobile apps. I’m not sure I’d let me do that and I’m definitely sure our developers would be rightfully concerned if I did.
The Problem
Just this past week, I had a task to accomplish: I needed to download just over 2,000 photos from our website to a local directory with a very specific naming convention on the final file. This was for an internal project, so it wasn’t worth distracting a real developer with so I decided to tackle it myself.
Google turned up a few folks who had written single calls to download a file using CURL and Bash, but I didn’t find one source that had all the pieces. So, using Applescript, I wrote a small Mac OS X application that accomplished what I needed. I probably used a trick or two I found via all that Googling, so I figured I’d better pay it forward and post my solution in case someone else is in my boat.
The Solution
My app takes in a text file that has one row for each photo to get. Each line contains just two things, separated by a single space: the photo source URL and the exact output file name for the local copy. It then loops over that list and downloads each URL to the file name specified in its row. If that local file already exists, it skips it and moves to the next one. I had some issues with some of my output file names around row 500 that made the app exit. But since it skips the ones it already has, you can just restart it from the beginning and it will get back to where it left off super quickly.
And without further ado, here’s the full script for your downloading pleasure:
on theSplit(theString, theDelimiter)
    -- save delimiters to restore old settings
    set oldDelimiters to AppleScript's text item delimiters
    -- set delimiters to delimiter to be used
    set AppleScript's text item delimiters to theDelimiter
    -- create the array
    set theArray to every text item of theString
    -- restore the old setting
    set AppleScript's text item delimiters to oldDelimiters
    -- return the result
    return theArray
end theSplit

on run
    tell application "Finder"
        set Names to paragraphs of (read (choose file with prompt ¬
            "Pick text file containing photo URLs and output file names"))
        -- downloads land in the folder that contains this app
        set myFilePath to path to me
        set myFolder to (container of myFilePath) as string
        set nTotalRows to count of Names
        set nCurrentRow to 0
        set nCount to 0
        repeat with nextLine in Names
            set nCurrentRow to nCurrentRow + 1
            set myArray to my theSplit(nextLine, " ")
            -- skip blank or malformed rows: each row needs a URL and a file name
            if (count of myArray) is greater than 1 then
                set itemURL to item 1 of myArray
                set itemfileName to item 2 of myArray
                set itemFullPath to POSIX path of myFolder & itemfileName
                -- msg becomes 1 when the output file already exists
                set msg to 0
                tell application "Finder" to if exists (itemFullPath as POSIX file) ¬
                    then set msg to 1
                if msg = 0 then
                    set strCommand to "curl " & quoted form of itemURL ¬
                        & " -o " & quoted form of itemFullPath
                    do shell script strCommand
                    set nCount to nCount + 1
                end if
            end if
        end repeat
    end tell
    return
end run
Download: DownloadListOfPhotoURLsToLocalFile.zip
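If you are not on a Mac, the same skip-what-you-already-have loop can be sketched in Node.js. This is not a port of the app above, just a rough equivalent under a few assumptions: the function names (parseRows, pendingRows, downloadAll) are made up for illustration, rows are "URL, space, file name" as described earlier, and curl is available on your path.

```javascript
const fs = require('fs');
const path = require('path');
const { execFileSync } = require('child_process');

// Parse "URL<space>fileName" rows; blank or malformed rows are skipped.
function parseRows(listText) {
  return listText
    .split(/\r?\n/)
    .map(function (line) { return line.trim().split(' '); })
    .filter(function (parts) { return parts.length >= 2 && parts[0] && parts[1]; })
    .map(function (parts) { return { url: parts[0], fileName: parts[1] }; });
}

// Keep only the rows whose output file is not already present.
// `exists` is passed in so the check can be swapped out (fs.existsSync in real use).
function pendingRows(rows, outputDir, exists) {
  return rows.filter(function (row) {
    return !exists(path.join(outputDir, row.fileName));
  });
}

// Hypothetical entry point: shells out to curl once per missing file.
function downloadAll(listPath, outputDir) {
  const rows = parseRows(fs.readFileSync(listPath, 'utf8'));
  pendingRows(rows, outputDir, fs.existsSync).forEach(function (row) {
    execFileSync('curl', [row.url, '-o', path.join(outputDir, row.fileName)]);
  });
}
```

Because files that already exist are filtered out before any download starts, rerunning downloadAll after a crash picks up where the last run stopped, just like the AppleScript version.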
I’m sure there’s probably a more elegant version that could have been written, but as I said, this isn’t my day job. This worked great for my 2,000+ photo download so I was satisfied, but I’m always up for learning how to be better. If you have a tweak to the script that would make it better, please let us know!
Let us know if you found it useful as well!
March 17, 2011
Stoyan is totally right and I’m totally wrong (see his comment below, which reads “The thing about google maps you load is that it’s an html page. When you load html page in object tag it’s as if you put it in an iframe. It includes all markup and extra css/js/img resources.”) My test was incorrect. I was testing with a Google Maps URL, but I should have been testing with a Google Maps API URL. I can’t explain how I used the wrong URL- I THOUGHT I copied that URL directly from our web site, but apparently not.
I’m sorry for the mistake and any confusion it may have created.
[Click below for the full content of the original post.]
Read the rest of this entry »
March 11, 2011
We use JDBC to connect to our database, but most of our code doesn’t connect directly to JDBC. Instead, we go through Hibernate, which is great for most purposes, but can make it difficult to do low level tweaks. We might want to do things like:
- Generate performance metrics per-thread, to get a SQL oriented performance profile for individual controllers
- Provide a single, central location to tweak SQL before running it
- Write unit tests that make assertions about the number or type of SQL statements that higher level code runs
- Trace Queries and ResultSet sizes
- Debug the SQL generated by third party libraries, and how those libraries use JDBC
I wrote wrappers for the most relevant interfaces (Driver, Connection, Statement, CallableStatement, PreparedStatement, ResultSet). It’s 100% boilerplate (I didn’t implement any USEFUL functionality; that’s for you to do!). It took me a few hours, so I thought I’d share. There’s no point in us all writing the same boilerplate over and over! Obviously, you’ll have to tweak this code for your own purposes.
To use it, you would use ‘redfin’ in your JDBC URL scheme, like this: ‘jdbc:redfin://blahblah’. You’d also set your JDBC driver class to ‘redfin.util.jdbc.DriverWrapper’. The exact mechanism you use to do this is obviously dependent on your environment.
[Click through to the full post for the code.]
Read the rest of this entry »
January 21, 2011
As of today, I am a (minor) contributor to Apache Pig! My bug report and patch were accepted into Pig, for version 0.9.0.
The issue was that Pig’s contrib(uted extension) function ISOToDay would truncate datetimes based on Coordinated Universal Time (UTC), with no way to specify a preference for local time zone that better matches a natural “day” in the source data. For example, if my friend in California browsed http://redfin.com from 3pm through 6pm yesterday in Pacific Time, that’s 11pm yesterday through 2am today in UTC. If we want to count the number of unique website visitors each day, then it would be bad to count my friend’s session as visits on 2 different days.
My patch allows Pig users to tag data with a timezone, and to have that timezone respected when datetimes are truncated to whole days, so that a visitor to a website on a California afternoon gets counted as a single-day visit.
There’s still a problem of miscounting a person who visits the site from 2am to 4am in New York (which spans two days in Pacific Time), but that is a much smaller problem than using 5pm Pacific Time as the daily cut-off. (The best we could do would be to find the hour of the day when the fewest visits are happening, and do our processing in the timezone where that is midnight. That would turn out to be pretty close to Pacific time, anyway.)
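To make the day-boundary problem concrete, here is a small JavaScript sketch (not Pig code, and not the patch itself). The helper name dayInZone is made up for illustration, and the dates are an arbitrary January afternoon chosen so that Pacific Time is 8 hours behind UTC, matching the 3pm to 6pm example above.

```javascript
// Truncate an instant to a calendar day as seen in a given time zone.
// Uses the standard Intl API; the zone names are IANA identifiers.
function dayInZone(isoString, timeZone) {
  var formatter = new Intl.DateTimeFormat('en-US', {
    timeZone: timeZone,
    year: 'numeric',
    month: '2-digit',
    day: '2-digit'
  });
  var parts = {};
  formatter.formatToParts(new Date(isoString)).forEach(function (part) {
    parts[part.type] = part.value;
  });
  return parts.year + '-' + parts.month + '-' + parts.day;
}

// A visit from 3pm to 6pm Pacific, written as UTC instants.
var visitStart = '2011-01-20T23:00:00Z';
var visitEnd = '2011-01-21T02:00:00Z';

// Truncating in UTC splits the visit across two days...
var utcDays = [dayInZone(visitStart, 'UTC'), dayInZone(visitEnd, 'UTC')];
// ...while truncating in Pacific Time keeps it on one day.
var pacificDays = [
  dayInZone(visitStart, 'America/Los_Angeles'),
  dayInZone(visitEnd, 'America/Los_Angeles')
];
```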
On a slightly political note, it is nice that my current employer has a policy of allowing open source contributions. I needed this bug fix to solve a real work problem, and since I was permitted to contribute the fix with no legal department overhead, I can continue to use the official common version of Apache Pig, instead of maintaining our own forked version.
More technical details at the Apache Pig Issues site. Happy hacking!
November 4, 2010
In Dojo 1.4, the Dojo Toolkit team introduced a new “dojo.hash” library for managing the back button in AJAX applications. It’s a replacement for “dojo.back,” which was available in Dojo 1.0. If you’re deciding whether to use dojo.hash vs. dojo.back for your next web application, you should use dojo.hash.
Background: Back Button in AJAX
AJAX applications can update a page inline without navigating to a new page. This makes the page much more responsive, but since everything happens in one page, it makes the back button much less useful.
Way back in 2005, someone figured out a clever trick that would allow AJAX applications to support the back button. If a page at http://www.example.com/#beds=4 changed the URL to http://www.example.com/#beds=3, the “#” in the URL would guarantee that the page wouldn’t reload, but the user could still click the back button to go back to “#beds=4”. The user could then click the forward button to go forward to “#beds=3”, navigating back and forth through the browser’s history.
In general, modifying the URL of a page would add an entry to the browser’s history, even if the change was in the URL’s “hash fragment”, the part of the URL that appears after the “#” sign. Since hash fragment changes don’t cause the page to reload, AJAX applications could use the hash fragment to add browser history entries dynamically.
To detect these changes, the page would use a “setInterval” timer to automatically poll the URL for updates every 100ms or so. (In IE8, Microsoft introduced the “onhashchange” event, eliminating the need to poll for changes in the hash fragment; all modern browsers now support “onhashchange”.)
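A rough sketch of that polling trick, with the native event used when it is available, might look like the following. The names createHashPoller and watchHash are made up for illustration, and the hash getter is passed in as a function only so that the comparison logic can be exercised outside a browser.

```javascript
// getHash: returns the current hash fragment.
// onChange: called with (newHash, oldHash) whenever the fragment differs
// from the value seen on the previous check.
function createHashPoller(getHash, onChange) {
  var lastHash = getHash();
  return function check() {
    var current = getHash();
    if (current !== lastHash) {
      var previous = lastHash;
      lastHash = current;
      onChange(current, previous);
    }
  };
}

// Browser wiring: use onhashchange where the browser has it,
// otherwise fall back to polling every 100ms.
function watchHash(win, onChange) {
  if ('onhashchange' in win) {
    win.onhashchange = function () { onChange(win.location.hash); };
    return null;
  }
  var check = createHashPoller(function () { return win.location.hash; }, onChange);
  return win.setInterval(check, 100);
}
```

In a page you would call watchHash(window, callback); the poller only fires when the fragment actually changes, so repeated checks between changes cost almost nothing.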
It wasn’t quite that easy, of course. On some browsers, the page must use a hidden iframe to add entries to the browser’s history. In 2007, Brad Neuberg worked out most of the thorny details and wrote the Really Simple History library, which is now in wide use across many JavaScript libraries. His insights were incorporated into the dojo.back library, which was released with Dojo v1.0.
What’s Wrong with dojo.back
There are two problems with dojo.back: first, dojo.back loses history information when the user refreshes the page, and second, dojo.back uses document.write, which makes it difficult to use dojo.back correctly.
1) Refreshing the Page and Bookmarking the URL
dojo.back allows us to pass it a JavaScript object, storing the object in a hashmap in memory. dojo.back modifies the URL to include a random unique string. When the user clicks the back button, dojo.back fires a “back” event; when the user clicks the “forward” button, dojo.back fires a “forward” event.
So, if a user navigates to our site at http://www.example.com/ and performs some action that the user should be allowed to undo, we can pass a memento to dojo.back, e.g. {beds: 4, baths: 2}. dojo.back will modify the URL to http://www.example.com/#1288732596876 and keep a record in memory that “#1288732596876” corresponds to {beds: 4, baths: 2}. If the user clicks the back button, the URL will revert to http://www.example.com/, and dojo will notify our code.
But that introduces a problem: what if the user refreshes the page? The hashmap in memory is then erased, and all of those entries in the browser’s history are now useless.
A similar problem occurs if the user wants to bookmark your URL for later, or share the URL with another user over email. Since the URL doesn’t contain the data we need to re-create the original object, the data is lost when the current window closes.
There seems to be an obvious fix for this problem: couldn’t we just use “#beds=4&baths=2” in the URL instead of a random string? Instead of purely unique hash values, we could use a hash value that has meaning, i.e. a “semantically-named” hash value.
That is the correct fix, but there’s no way to do this correctly with dojo.back. dojo.back allows us to configure the “unique” string of the hash value, but we can’t instruct dojo.back to reconstruct mementos from the hash. If the user refreshes the page, dojo.back will erase its in-memory hashmap; it’s not smart enough to read “#beds=4&baths=2” and re-inflate that into {beds: 4, baths: 2}.
Worse, if we use “#beds=4&baths=2” as our semantically-named hash value, there’s a good chance that it won’t be unique. So if the user starts at “#beds=4&baths=2” and then goes on to “#beds=4&baths=1” and then finally to “#beds=4&baths=2”, dojo.back has no way to know whether the user went FORWARD to “baths=2” or BACKWARD to “baths=2”.
For this reason, dojo.back includes this cryptic warning at the top of its documentation:
NOTE: There are problems with using dojo.back with semantically-named fragment identifiers (“hash values” on an URL). In most browsers it will be hard for dojo.back to know distinguish a back from a forward event in those cases. For back/forward support to work best, the fragment ID should always be a unique value (something using new Date().getTime() for example). If you want to detect hash changes using semantic fragment IDs, then consider using dojo.hash instead (in Dojo 1.4+).
In other words, if we use dojo.back, our hash values need to be dates, e.g. “#1288732596876”, not meaningful strings like “#beds=4&baths=2”, or Dojo will confuse “back” navigation with “forward” navigation.
So, if you want your back button to keep working after refresh, or if you want users to share your URLs with each other, you should use dojo.hash instead of dojo.back.
2) dojo.back is harder to deploy because it uses document.write
Most of what we now know about back-button support was figured out in 2005, when the web was younger. At that time, no browser supported the onhashchange event; the only way to get AJAX back button to work properly in some browsers was to include the embedded iframe before the onload event.
Brad Neuberg figured out that the magic trick was to add the iframe using document.write; this technique was incorporated into dojo.back. document.write is a very dangerous technique on modern browsers, because the specified behavior is for document.write to modify the page if it’s called before the onload event, or to completely erase the entire page if it’s called after the onload event.
As a result, dojo.back has another cryptic note in its documentation:
WARNING: dojo.back.init() must be called before the page’s DOM is finished loading. Otherwise it will not work. Be careful with xdomain loading or djConfig.debugAtAllCosts scenarios, in order for this method to work, dojo.back will need to be part of a build layer.
Fortunately, this document.write hack is not required in any of Dojo’s currently supported browsers; notably, it’s not required in IE6+ or Safari 3+. In fact, if you’re reading this blog post in IE, you’re probably using IE8 or IE9, which has “onhashchange” built-in and requires no iframe at all.
dojo.hash uses an iframe, but does not attempt to create it using document.write; as a result, it’s a lot safer to use than dojo.back.
dojo.hash Is the Way of the Future
dojo.hash is a very simple library compared to dojo.back. In fact, if the browser supports the “onhashchange” event, then it does almost nothing more than attach the “onhashchange” event to the “/dojo/hashchange” topic. (You can also use dojo.hash to get/set the hash fragment in a convenient way.)
dojo.hash makes no attempt to store state objects in memory; instead, anyone who uses dojo.hash must serialize their memento into a hash string before passing it to dojo.hash. (dojo.objectToQuery is particularly useful for this serialization.) When you subscribe to the “/dojo/hashchange” topic, Dojo will invoke your callback function on each change, passing it the current hash fragment.
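To show what “serialize your memento into a hash string” means in practice, here is a minimal sketch that round-trips {beds: 4, baths: 2} through a hash fragment. It deliberately uses the standard URLSearchParams class rather than Dojo, so it is a stand-in for dojo.objectToQuery, not a copy of it; the helper names stateToHash and hashToState are made up for illustration.

```javascript
// Turn a flat state object into a query-style hash string.
function stateToHash(state) {
  return new URLSearchParams(state).toString();
}

// Rebuild a state object from a hash fragment, with or without the leading "#".
// Note that every value comes back as a string.
function hashToState(hash) {
  var state = {};
  new URLSearchParams(hash.replace(/^#/, '')).forEach(function (value, key) {
    state[key] = value;
  });
  return state;
}

var hash = stateToHash({ beds: 4, baths: 2 });
var restored = hashToState('#' + hash);
```

Because the state lives entirely in the URL, it survives a refresh, a bookmark, or being pasted into an email, which is exactly what the in-memory hashmap in dojo.back cannot do.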
We think you will find dojo.hash more useful than dojo.back for almost all situations. If dojo.back works for you today, you may not need to upgrade, but if you’re building something new and deciding which library to use, we strongly recommend dojo.hash.
October 26, 2010
Internet Explorer 9 Beta sometimes gives the wrong answer when we ask for the size of the viewport (viewable area of the browser window). On an HTML document with “X-UA-Compatible” set to IE=7, with Windows font DPI set to 125% or 150% and the browser window maximized, IE9 claims the viewport is a few pixels larger than it actually is.
Soon after Microsoft released IE9 Beta, we began receiving numerous tech support e-mails regarding our map page being stuck in an endless loop where the map continually refreshes. None of our engineers were able to reproduce this problem, so we classified this as a non-issue because the problem only occurred in IE9 beta, which is not an official release. But the tech support e-mails kept coming in, so we decided to find out exactly what was going on. We did not know if it was a problem in our code or a bug on Microsoft’s part.
To get a better understanding of the situation, we turned to our troubled users for help. Our first lead came from Nitin G. who mentioned that setting the display font very small cured his problem. This helped, but we needed more information. Mike P., another Redfin user, noticed that the problem only occurred when the browser window is maximized. He also reported seeing scroll bars appear and disappear during the infinite loop, and the loop stopping when he opened developer tools (F12). Not only did he provide us with screenshots, but he sent us a video capture of the bug as it was happening. Here are a couple of the screenshots:

(Figure 1)

(Figure 2)
The first thing we wanted to do was to match his exact search and display. We tried changing the display resolution, but with no luck. Then we remembered Nitin’s comment about shrinking the font size, and it was clear what we needed to do to reproduce the problem. Change the font DPI settings in Windows!
We successfully reproduced the problem by setting Windows’ display text DPI to a percentage larger than standard and maximizing the window. With the DPI set to 125% or 150%, this bug caused scroll bars to appear (Figure 1) and disappear (Figure 2) on the map page, causing our map to resize, which in turn executed a new search. Scroll bars only appeared during a search, not when a new search was kicked off. So now we had a reason for the infinite loop, but what was the underlying cause? It turns out that during a search, we lock our UI by creating an HTML element spanning the entire browser’s viewport, preventing users from searching the map as listings are being populated. It is when we lock our UI that the scroll bars appear, and this only happens when “X-UA-Compatible” is set to IE=7 in the HTML document.
We use Dojo extensively in our site, and one of its uses is to obtain the browser’s viewport size by calling dojo.window.getBox(), which contains the viewport’s height and width. These properties return values larger than the actual size of the viewport when the browser window is maximized, which cause scroll bars to appear when we call on the UI locker. We created a reduced test case where we used dojo.window.getBox() to make a <div> that spans the entire browser viewport:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="X-UA-Compatible" content="IE=7">
  <title>Dojo Viewport and Popup Test</title>
  <style type="text/css">
    @import "http://ajax.googleapis.com/ajax/libs/dojo/1.5/dojo/resources/dojo.css";
    @import "http://ajax.googleapis.com/ajax/libs/dojo/1.5/dijit/themes/nihilo/Dialog.css";
    @import "http://ajax.googleapis.com/ajax/libs/dojo/1.5/dijit/themes/dijit.css";
    @import "http://ajax.googleapis.com/ajax/libs/dojo/1.5/dijit/themes/nihilo/nihilo.css";
  </style>
  <script src="http://ajax.googleapis.com/ajax/libs/dojo/1.5/dojo/dojo.xd.js"></script>
  <script>
    dojo.require("dijit.form.Button");
    dojo.require("dijit.Dialog");
    var viewportShowing = false;
    var dialog = null;
    // Toggle the div between its small size and the reported viewport size
    function viewportPressed() {
      var viewport = dijit.getViewport();
      var viewportNode = dojo.byId("viewportID");
      if(!this.viewportShowing) {
        viewportNode.style.backgroundColor = "gray";
        viewportNode.style.width = viewport.w + "px";
        viewportNode.style.height = viewport.h + "px";
        dojo.style(viewportNode, "opacity", 1);
      }
      else {
        viewportNode.style.backgroundColor = "white";
        viewportNode.style.width = "150px";
        viewportNode.style.height = "";
      }
      this.viewportShowing = !this.viewportShowing;
    }
    function popupPressed() {
      if(!this.dialog) {
        this.dialog = new dijit.Dialog({
          title: "Testing Dojo Pop-up",
          content: "test content",
          style: "width: 200px"
        });
        this.dialog.startup();
      }
      this.dialog.show();
    }
  </script>
</head>
<body class="nihilo">
  <div id="viewportID" style="background-color:white;width:150px">
    <button dojoType="dijit.form.Button" onClick="viewportPressed();">Viewport</button>
    <button dojoType="dijit.form.Button" onClick="popupPressed();">Pop-up</button>
  </div>
</body>
</html>
Sure enough, we reproduced the problem. This led us to believe that this is a bug with Dojo, but we needed to be certain. Digging deeper into Dojo’s code, dojo.window.getBox() uses document.documentElement and calls on the element’s clientHeight and clientWidth properties to get the viewport dimensions. We created another, similar reduced test case, but without Dojo:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="X-UA-Compatible" content="IE=7">
  <title>IE9 documentElement height/width Test</title>
  <script type="text/javascript">
    // Toggle the div between its small size and documentElement's reported size
    function buttonPressed() {
      var docElement = document.documentElement;
      var viewportNode = document.getElementById("viewportID");
      if(!this.pressedButton) {
        viewportNode.style.backgroundColor = "gray";
        viewportNode.style.width = docElement.clientWidth + "px";
        viewportNode.style.height = docElement.clientHeight + "px";
      }
      else {
        viewportNode.style.backgroundColor = "white";
        viewportNode.style.width = "150px";
        viewportNode.style.height = "";
      }
      this.pressedButton = !this.pressedButton;
    }
  </script>
</head>
<body style="margin:0px">
  <div id="viewportID" style="background-color:white;width:150px">
    <button onClick="buttonPressed();">Click me!</button>
  </div>
</body>
</html>
The problem was still there! This confirms that there is a bug with IE9’s viewport code, and not Dojo. Although the problem is not with Dojo, we fixed this bug by modifying Dojo’s method for returning the viewport because Dojo calls on the DOM for the dimensions. Specifically, we subtract a few pixels from the height and width.
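The shape of that workaround can be sketched as below. This is not our actual Dojo patch: the function name getSafeViewport and the fudgePx parameter are made up for illustration, and the exact number of pixels to subtract is left as an input rather than stated.

```javascript
// Return viewport dimensions shrunk by a small safety margin, so an
// overlay sized from them cannot overflow and trigger scroll bars.
// docElement: anything with clientWidth/clientHeight (document.documentElement in a page).
// fudgePx: hypothetical margin in pixels; the right value is an assumption to tune.
function getSafeViewport(docElement, fudgePx) {
  return {
    w: Math.max(0, docElement.clientWidth - fudgePx),
    h: Math.max(0, docElement.clientHeight - fudgePx)
  };
}

// Example with a stand-in object instead of a real document element.
var safeBox = getSafeViewport({ clientWidth: 1280, clientHeight: 720 }, 4);
```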
Special thanks go to Nitin G. and Mike P. for helping us solve the mysterious IE9 infinite loop bug. Without their help and the help of our other avid Redfin users, this bug would have gone unnoticed. We rely on feedback and tech support reports like these to improve our site time and time again. We appreciate everyone’s help.
July 29, 2010
Ryan Dahl, the creator of a high-performing web server written in JavaScript, came by Redfin’s San Francisco office to talk about his creation, Node.JS. It was a very funny, thoughtful talk, particularly because Ryan is somehow both opinionated and careful with the truth. He is the latest in a long line of speakers for Engineer-to-Engineer, a series of technical talks hosted by Redfin, Digg, Pandora and Greylock on topics such as Hadoop, Scala, HTML5, Cassandra and Clusto.
Ryan’s presentation is here, and below is a summary of what he said.
This is going to be very introduction-level, with apologies to anyone who has dived deeper.
The goal of Node is to do easy network programming, to be able to create servers and clients that can be thrown together in a fairly simple way, using JavaScript.
Node.JS is a set of C++ bindings for network I/O and socket I/O. The strong focus is on putting together network servers.
Node is a command-line tool. You need to compile it. There are no binaries available. It’s something that runs from your terminal. It doesn’t have any dependencies other than Python to build it.
Let’s understand it by example… The first example is a program that prints Hello and then in 2 seconds, says World.
setTimeout(function () {
  console.log('world');
}, 2000);
console.log('hello');
Node has a lot of browser-like APIs. When you’re in JavaScript, you expect it to be Browser-ey, even if it’s not Browser…ish, that is, even if it doesn’t run in the browser.
Node exits automatically. The program drops out when there’s nothing else to do. If there’s a callback pending it keeps running. In the example, after World, the program exits.
Now let’s make this more complicated. What if we want Hello every half second, then on an interrupt signal we want the program to print Bye?
setInterval(function () {
  console.log('hello');
}, 500);

process.on('SIGINT', function () {
  console.log('bye');
  process.exit(0);
});
In the browser your central object is the window; in node it’s a process. This global variable exists always.
It’s like a browser listening for a click event. And it’s also like a UNIX program in that you have to end the program. The process object emits an event when it receives a signal; you only have to listen for it. You can get the pid, the program arguments, you can grab memory usage, you can get the executable path.
A TCP server emits a connection event whenever someone connects.
Now let’s create an event…
1. var net = require('net');
2.
3. var s = net.createServer();
4.
5. s.on('connection', function (c) {
6.   c.end('hello!');
7. });
8.
9. s.listen(8000);
You can load a module; browser-based JavaScript doesn’t support this. You create a server on line 3; on lines 5 to 7 you add an event listener to it; and finally on line 9 you bind a port so the server is actually listening.
File I/O is non-blocking too. Node does File I/O. Here’s a program that outputs the last time /etc/passwd was modified.
var stat = require('fs').stat;

stat('/etc/passwd', function (err, s) {
  if (err) throw err;
  console.log('modified: %s', s.mtime);
});
If you’re on a server being hit by thousands of people, you can’t just wait for the disk to spin, so Node takes the pragmatic view that you should never wait for something to happen. Set up the action to occur, but don’t wait for this action to occur. Give a callback and then drop back. There are two parameters. There’s an error object if the file is not there. Otherwise, you print out the time modified.
Node can do HTTP too. If it was just TCP and file stuff, that would be very limiting. Load the HTTP module and create a server with a callback; the callback is called every time you have a request, and it writes the header, then Hello and World, to the response.
var http = require('http');

var server = http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.write('Hello\r\n');
  res.end('World\r\n');
});

server.listen(8000);
The HTTP response is chunked because we don’t know how long it will end up being, so we can’t put a Content-Length header at the top. Node is very good at streaming: we’re not limited to “Here’s this movie, buffer it all.” Node streams up to memory, down to disk.
Here’s a streaming HTTP server… it can stream responses without introducing a large amount of weight, you don’t use a thread for each of these. If you curl it, you get Hello, then two seconds later, you get World.
var http = require('http');
var server = http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});

  res.write('Hello\r\n');

  setTimeout(function () {
    res.end('World\r\n');
  }, 2000);
});

server.listen(8000);
This is low-level. It allows streaming requests, and requests can be hung while waiting for other things. With AJAX polling, clients are continually asking “Do you have anything new?”, which can be very taxing on the server. Long polling, on the other hand, only involves asking once and then getting a response when the server wants to send you one.
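Long polling is easier to picture with a little code. The sketch below is not from Ryan’s slides; createLongPollHub is a made-up helper that parks each waiting request until the server has something to say, and the commented-out wiring at the bottom is only one assumed way to hook it into the HTTP server shown above.

```javascript
// A tiny hub that holds on to waiting clients until there is news.
function createLongPollHub() {
  var waiting = [];
  return {
    // Called when a request arrives: park its responder instead of answering now.
    wait: function (respond) {
      waiting.push(respond);
    },
    // Called when the server has news: answer every parked request exactly once.
    // Returns how many clients were notified.
    publish: function (message) {
      var toNotify = waiting;
      waiting = [];
      toNotify.forEach(function (respond) { respond(message); });
      return toNotify.length;
    }
  };
}

// Hypothetical wiring into an HTTP server (left commented out so nothing listens here):
// var hub = createLongPollHub();
// require('http').createServer(function (req, res) {
//   hub.wait(function (message) { res.end(message + '\r\n'); });
// }).listen(8000);
```

Each parked request costs only a callback in an array, not a thread, which is why hanging thousands of them is practical in this model.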
Node’s HTTP server is enabled by the HTTP parser. You can check out http://github.com/ry/http-parser
You might be thinking: “HTTP, Jeez, how hard could it be, it’s a simple protocol.” You’re wrong. HTTP in the real world is extremely complicated. It’s difficult to be able to parse the headers and be able to expose this streaming nature without buffering. This HTTP server buffers nothing. It’s totally callback-based.
The HTTP server only uses 28 bytes per HTTP connection, which is important when you have 1,000 people chatting on a server. 28 bytes is acceptable for overhead; 4 megabytes isn’t.
Now let’s do inter-process communication with other processes. In this example, you pull out the child process. This is something that can spin the disk. Your CPU is much, much faster than the disk. Don’t wait for the disk.
exec = require('child_process').exec;
exec('ls /', function (err, output) {
  if (err) throw err;
  console.log(output);
});
It’s worth noting that Node never forces you to buffer output. You can also stream data through the standard in and out of a child process.
Now we spawn the program cat, and we get a reference to that program. Whatever you send to cat, it sends back. You type in Hello, wait 2 seconds, then type Bye. You get Hello, then wait 2 seconds, then get Bye.
spawn = require('child_process').spawn;

cat = spawn('cat');

cat.stdin.write('hello\n');

setTimeout(function () {
  cat.stdin.end('bye\n');
}, 2000);

cat.stdout.on('data', function (d) {
  console.log(d);
});
Connecting streams is common. Where I want to go with Node is thinking of everything in terms of streams. There’s standard in and out, there’s file streams, HTTP connections. But mainly we deal a lot with streams. Generally we’re proxying streams and modifying them in the middle.
So this is JavaScript outside the browser. Yes! That’s almost what everybody wants. We’re interacting with the OS in a browser-like way.
We have an HTTP library for streaming. But wait there’s more… here’s a contrived but interesting web-server benchmark. We’ve set up four web servers. They’re all going to respond with a 1 megabyte file. 100 concurrent clients connect.
- Node can handle 822 requests per second
- Nginx (web server written in C, popular with the Ruby crowd, consider this as good as it gets): 708
- Thin: 85
- Mongrel: 4
This should be shocking to you. You should be urinating right now. Or getting angry. It shocks me.
There are some caveats. NGINX peaked at 4mb of memory, and Node 60mb of memory. I also didn’t sit down for hours and try to make NGINX fast, as I did with Node.
There are a lot of places in Node where the opposite is true, where it sucks while everything else is good. SSL for example.
Node is written on Google’s V8, the JavaScript engine in Chrome. V8 is a masterpiece of engineering. Google took the 14 best VM engineers and locked them in a closet in Denmark. They were given the JavaScript spec and then told to make it fast.
It’s an amazing VM. Much better than Ruby or Python. Incomparable. Or comparable I guess… All these callbacks must seem weird to you but that is where our speed increase comes from.
var result = query('select * from T'); // use result
If you’ve done traditional web programming, you’ve probably used activerecord and you access some record. You use a function to do the I/O, but what does your software do while it’s accessing the database? In many cases, nothing. It’s the year 2010, we’re using Rails, and when you access a database, it stops, the world stops for who knows how long, the database might be in LA, and it takes 2 seconds to respond.
To mitigate that, we load balance with multiple processes, all waiting 2 seconds. That’s a form of concurrency to be sure, I guess that’s what processes are for.
When you access stuff in the CPU, it’s very fast. You can assume any operation to take zero amount of time, until you access the disk or the network. It’s not appropriate to treat operations in the CPU in the same way as operations on disk or I/O. Abstracting I/O as a function doesn’t make sense when the time-frames are so different.
- 3 cycles for L1
- 14 cycles for L2
- 250 cycles for RAM
- 41M cycles for disk
- 240M cycles for network
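To get a feel for these numbers, the cycle counts can be converted to wall-clock time. The sketch below is only an illustration: the cycle counts are the ones quoted above, but the 3 GHz clock rate (`CLOCK_HZ`) and the helper names are assumptions, not figures from the talk.

```javascript
// Cycle counts quoted in the talk.
const cycles = {
  L1: 3,
  L2: 14,
  RAM: 250,
  disk: 41e6,
  network: 240e6,
};

// Assumption for illustration only: a 3 GHz CPU (3e9 cycles per second).
const CLOCK_HZ = 3e9;

// Convert a cycle count to milliseconds at the assumed clock rate.
function cyclesToMs(n) {
  return (n / CLOCK_HZ) * 1000;
}

// How many RAM accesses fit into one access of the given kind.
function ramAccessesPer(kind) {
  return cycles[kind] / cycles.RAM;
}

for (const kind of Object.keys(cycles)) {
  console.log(kind, cyclesToMs(cycles[kind]).toFixed(6) + " ms");
}
```

At the assumed clock rate a network access works out to about 80 ms and a disk access to roughly 14 ms, while one network access costs as much as 960,000 trips to RAM. That gap is the reason for treating I/O differently from ordinary function calls.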
It’s unacceptable to wait for the database when you’re serving many clients. You can fork a thread – it’s hard in Ruby because its threading system is utterly crap, but Java can – so when one thread blocks while accessing the database, you can start new threads. That’s fine. But you can’t use an OS thread for each socket when you want good concurrency. Threads have weight to them, and context-switching is costly too. Each thread takes 4 meg of memory, which is a lot when you have 1,000 concurrent users.
The alternative to using threads is to structure your code like this:
query('select..', function (result) {
  // use result
});
Node is fast because it never blocks on I/O. And JavaScript is great for this. In Ruby there’s EventMachine, in Python there’s Twisted, but somehow it doesn’t jibe: you sit down to write the code and it doesn’t work the way that programming language is meant to work, and it doesn’t work with all the modules out there that do I/O, like a MySQL library. But the browser was already set up to be an event loop. Brendan Eich was a genius. Yes it does one thing at a time, but also many things very quickly, because you never block on I/O.
And there’s a culture of JavaScript, an entire generation of programmers who grew up programming browsers, and now they can code on a server, without forking a thread and blocking on accept. Java people on the other hand find this callback concept difficult to grasp. “What do you mean? What is it doing while it’s doing nothing?”
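To make "one thing at a time, but many things very quickly" concrete, here is a minimal sketch of the control flow. It is a toy simulation, not Node's real machinery: `query`, `pending`, and `runEventLoop` are hypothetical names invented for this example, and the "database" just echoes the SQL back.

```javascript
// Hypothetical stand-ins: a toy event loop and a fake non-blocking query().
// None of this is Node's real API; it only mimics the control flow.
const pending = []; // I/O completions waiting to be delivered
const log = [];

// Start a "query" and return immediately; the callback runs later.
function query(sql, callback) {
  log.push("started: " + sql);
  pending.push(() => callback("rows for " + sql));
}

// Deliver completions one at a time, as a single-threaded loop would.
function runEventLoop() {
  while (pending.length > 0) {
    const deliver = pending.shift();
    deliver();
  }
}

query("select * from A", (result) => log.push("got " + result));
query("select * from B", (result) => log.push("got " + result));
log.push("main script finished without waiting");

runEventLoop();
console.log(log.join("\n"));
```

Both queries are started before either result arrives, and the main script reaches its last line without waiting on the database. The callbacks then run one after another on the same thread.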
Node jails you into this evented-style programming. You can’t do things in a blocking way, you can’t write slow programs.
Node consists of 3 C libraries: V8; event loop (libev) so you don’t have to write something for every OS; a thread pool (libeio), which is necessary for file I/O. There’s a layer for bindings, C++ glue, then the standard library is written in JavaScript. It’s not a thin binding to a C web server, it actually goes through a lot of JavaScript – that’s impressive – V8 is up to the task. I used to write web servers in Ruby, it was awful, every line of Ruby hurts performance; it’s a beautiful language, but a crappy virtual machine. V8 is not that way.
JavaScript can only access the main thread, the C layer has access to blocking functions – we don’t want to have a global interpreter lock – let’s let the experts have access to the threads. To use the threads, program in C.
I wouldn’t use Node.JS to make big websites, but it is one of the only solutions for making real-time, long-polling things. You’ll probably have a bunch of Rails servers and one Node server for a specialized function. As frameworks mature, you can use Node to build the whole website. You won’t have to load-balance it because it’s very fast but you’ll probably have to put it behind a web server, because you don’t trust it, or because SSL support still sucks. The bottleneck will be your gigabit connection into that machine, not memory or anything else.
And that was it! Many thanks to Ryan for a dazzling talk, and to everyone who came. Thanks too to Greylock, Digg and Pandora for helping us put on the event…
July 8, 2010
At Google IO 2009, our fearless leader Sasha Aickin (my boss) demonstrated our high-performance Google Maps utility library to the world; we provided directions explaining how to code it, but we didn’t actually ship the code.
Today, I’m proud to announce that we’ve made our MultiMarker utility library code (formerly known as SuperMarker) available to everyone under the Apache License 2.0. As far as we know, it’s the fastest way to add many hundreds or thousands of markers on Google Maps.
HOW FAST IS IT?
Below, we compare the time required to add 1000 markers in four ways:
- V2 map using the default GMarker
- V2 map using Pamela Fox’s MarkerLight
- V3 map using the default google.maps.Marker
- Using our MultiMarker library
Note that the GMarker timings you see here were recorded by stopwatch, NOT by automated timer. Also note that these browsers were running on different machines, so you shouldn’t use this table to compare browsers with each other; just compare the columns within each row.
UPDATE: This table was updated on April 22nd to reflect the faster markers available in Google Maps v3.4 nightly.
| Browser | V2 GMarker | V2 MarkerLight | V3 GMarker | V3 MultiMarker |
| --- | --- | --- | --- | --- |
| IE6 | 44 seconds | 9.0 seconds | 56 seconds | 1.5 seconds |
| IE7 | 42 seconds | 6 seconds | 3.5 seconds | 1 second |
| IE8 | 32 seconds | 3.1 seconds | 1.5 seconds | <1 second |
| IE9 | 3 seconds | 0.9 seconds | <1 second | <1 second |
| FF4 | 5.1 seconds | 1.2 seconds | <1 second | <1 second |
| GC10 | 2.4 seconds | <1 second | <1 second | <1 second |
| iPhone 4 | 40 seconds | 6 seconds | 3 seconds | 2 seconds |
As you can see, a speedup of 10-100x is possible using the MultiMarker technique, depending on which version of GMaps you’re using.
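As a rough check on that range, the speedup can be worked out for the rows where the MultiMarker column gives an exact figure. The sketch below only divides numbers copied from the table; the field and function names are made up for the example, and rows reported as "<1s" are left out because no exact ratio can be computed for them.

```javascript
// Timings in seconds copied from the table above, for rows where the
// MultiMarker column has an exact value (entries of "<1s" are left out).
const rows = [
  { browser: "IE6", v2GMarker: 44, multiMarker: 1.5 },
  { browser: "IE7", v2GMarker: 42, multiMarker: 1 },
  { browser: "iPhone 4", v2GMarker: 40, multiMarker: 2 },
];

// Ratio of the default V2 GMarker time to the MultiMarker time.
function speedup(row) {
  return row.v2GMarker / row.multiMarker;
}

for (const row of rows) {
  console.log(row.browser + ": " + speedup(row).toFixed(1) + "x");
}
```

Keep in mind that the GMarker timings were taken by stopwatch, so these ratios are approximate.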
Try the examples for yourself:
Google Maps V2 Comparison Test
Google Maps V3 Comparison Test
Enjoy!
June 14, 2010
As we talked about before, Redfin uses Varnish to implement Edge Side Includes (ESI). This involved breaking a single big (and expensive) page into individual chunks; each chunk would be generated by separate code, and would be cached on a different schedule.
Once we broke our expensive page into chunks that could be individually cached, it seemed pretty easy to have those chunks served up by different backend servers. Voilà, a monolithic app became “service oriented”! This would let us run the different software components on different machines (with different performance characteristics, different SLAs, even implementations in different languages/environments!)
Of course, nothing is actually that easy, and we made a number of mis-steps before we figured out how to do it.

How To
Varnish allows you to define multiple backends in your VCL. And in your vcl_recv function, you can decide which backend should handle a particular request. At Redfin, we added a new Varnish backend for each of our ESI endpoints, and we added logic to choose the relevant backend by URI. In practice, we actually only have one pool of machines handling our ESI requests, so all of our Varnish backends actually point to the same place.
So the first piece of the puzzle is on our main web servers. On the main web servers, requests go through Varnish. Requests for “normal” pages are sent through to Tomcat, but requests for ESIs are sent to one of the SOA backends. Here’s an example of what the VCL file might look like:
# Local Tomcat instance.
backend default {
    .host = "localhost";
    .port = "8080";
}
# Varnish instances on the remote ESI machines.
backend similars {
    .host = "similars.redfin.com";
    .port = "6081";
}
backend relevantlinks {
    .host = "relevantlinks.redfin.com";
    .port = "6081";
}
...
sub vcl_recv {
    # Choose the backend by URI; everything else goes to the default backend.
    if (req.url ~ "^/esi-listing-similars" || req.url ~ "^/esi-property-similars") {
        set req.backend = similars;
    }
    else if (req.url ~ "^/esi-listing-trackbacks") {
        set req.backend = relevantlinks;
    }
}
You might have noticed that the “localhost” backend is associated with port 8080 (where Tomcat is running), but the ESI backends are associated with port 6081 (where Varnish is running on those remote machines.)
We want the instance of Varnish on the main web server to cache content from the main web server, and the instances of Varnish on the ESI backends to cache the content from those backends. This has a few benefits:
- Our effective cache is bigger, since we have caches on multiple machines, each of which has fixed memory
- Having independent caches prevents one set of items from pushing another set out of the cache. If all the data were in a single cache, then cache entries holding similars information (which is small, but expensive to recreate) could be pushed out of the cache by cache entries of “main page” content (which is big and relatively cheap to recreate, but we’d still like to cache.)
- It’s easy to flush individual caches without having to worry about performance problems with other parts of the site
We have another design goal: we’d like to have a single distribution of our software. We’d like to have a single WAR that we can put on any machine; we do NOT want to have to deal with multiple builds, with figuring out which build has been installed on which machine, etc. We’d like to be able to switch a single machine from being a standard web server to being an ESI endpoint without having to redeploy or reconfigure.
This creates a conundrum. We want our main web servers and our ESI servers to be identical, but we also want them to act differently. In particular, when an instance of Varnish on a web server gets a request for an ESI fragment, it should redirect that request to an ESI server (more precisely, to the Varnish instance running on an ESI server). But when an instance of Varnish on an ESI server gets a request for an ESI fragment, it should forward the request to the local Tomcat instance. It should NOT forward the request to ITSELF. Forwarding port 6081 to port 6081 creates an infinite loop, which is not good.
We want to break the symmetry between the standard web servers and the ESI servers, and we do that by messing with the URIs.
We prepend our ESI URIs with a known prefix, which means “forward this to the ESI server.” But when we process the URI (while forwarding it), we strip off that prefix, so that the ESI server does not also forward it to itself. That’s harder to say than it is to code. The VCL code looks like this:
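sub vcl_recv {
    # Only URIs carrying the "/backend/" prefix are forwarded to an ESI server.
    if (req.url ~ "^/backend/") {
        # Strip the prefix so the ESI server does not forward the request again.
        set req.url = regsub(req.url, "^/backend/", "/");
        if (req.url ~ "^/esi-listing-similars" || req.url ~ "^/esi-property-similars") {
            set req.backend = similars;
        }
        else if (req.url ~ "^/esi-listing-trackbacks") {
            set req.backend = relevantlinks;
        }
    }
}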
This breaks the circularity. The path of requests looks like:
- A request comes into Varnish on the standard web server for /path/to/a/page
- Varnish forwards the request to the local Tomcat instance
- Tomcat responds with HTML that includes <esi:include src="/backend/esi-listing-similars" />
- Varnish processes the ESI, and must make a request for /backend/esi-listing-similars
- The Varnish instance on the standard web server strips off “/backend”, and sends a request for “/esi-listing-similars” to the ESI server
- The Varnish instance on the ESI server gets the request for “/esi-listing-similars”
- Since there’s no “/backend” prefix, the Varnish instance on the ESI server forwards the request to its local Tomcat instance
- The Tomcat instance on the ESI server processes the request, and responds with the relevant HTML fragment
- The Varnish instance on the ESI server caches the HTML fragment and returns it
- The Varnish instance on the standard web server parses the HTML fragment into the main page content and returns it to the browser
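The two Varnish hops in this path can be traced with a small simulation. This is a sketch that mirrors the vcl_recv logic above in JavaScript, not something Varnish runs; the `route` function is hypothetical, and the backend names are taken from the VCL.

```javascript
// Hypothetical model of the vcl_recv logic above; backend names mirror the VCL.
function route(url) {
  let backend = "default"; // local Tomcat on port 8080
  if (url.startsWith("/backend/")) {
    // Equivalent of regsub(req.url, "^/backend/", "/")
    url = url.replace(/^\/backend\//, "/");
    if (/^\/esi-listing-similars/.test(url) || /^\/esi-property-similars/.test(url)) {
      backend = "similars";
    } else if (/^\/esi-listing-trackbacks/.test(url)) {
      backend = "relevantlinks";
    }
  }
  return { url, backend };
}

// Hop 1: Varnish on the standard web server handles the ESI include.
const hop1 = route("/backend/esi-listing-similars");
// Hop 2: Varnish on the ESI server sees the stripped URL.
const hop2 = route(hop1.url);
console.log(hop1, hop2);
```

The first hop picks the `similars` backend and drops the prefix. The second hop sees no prefix, so it falls through to the default backend, which is the local Tomcat instance. Running the same rules on both machines therefore never produces a loop.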
This example points out another tricky bit: how do we ensure that the HTML fragment is cached by the Varnish service on the ESI server, but not by the Varnish service on the standard web server? To handle this correctly, we add a header to the response which indicates if it’s already been cached:
sub vcl_fetch {
    if (req.url ~ "^/esi-") {
        # Already cached by an earlier Varnish hop: do not cache it again.
        if (obj.http.X-RF-Cached ~ "true") {
            pass;
        }
        # First hop to see this fragment: mark it before caching.
        set obj.http.X-RF-Cached = "true";
    }
}
This code says “If there’s an X-RF-Cached header present, then don’t attempt to cache. If there is NOT an X-RF-Cached header present, then add one, and attempt to cache.” With this addition, the HTML fragments will only be cached on the first Varnish instance they pass through, which is on the ESI server in our case.
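The effect of the header across the two hops can be sketched as follows. This is an illustrative model only: `handleEsiResponse` is a made-up name, headers are a plain object, and the regex match in the VCL is simplified to an equality check.

```javascript
// Decide whether this Varnish hop should cache an ESI fragment response.
// Returns the decision plus the headers passed on to the next hop.
function handleEsiResponse(headers) {
  if (headers["X-RF-Cached"] === "true") {
    // Already cached by an earlier hop: pass it through uncached.
    return { cache: false, headers };
  }
  // First hop to see the fragment: mark it and cache it.
  return { cache: true, headers: { ...headers, "X-RF-Cached": "true" } };
}

// The response travels Tomcat -> ESI-server Varnish -> web-server Varnish.
const fromTomcat = {};
const atEsiServer = handleEsiResponse(fromTomcat);
const atWebServer = handleEsiResponse(atEsiServer.headers);
console.log(atEsiServer.cache, atWebServer.cache);
```

The ESI server is the first hop, so it caches the fragment and stamps the header. The web server sees the stamp and passes the fragment through without caching it.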
How NOT To
The solution described above works, and meets our requirements. But we also tried some solutions that did NOT work. Perhaps you can learn from our failures…
Putting Absolute URIs into ESI Includes
Our first thought was that we’d put absolute URIs into our ESI includes in the HTML. For instance, we tried to put <esi:include src="http://similars.redfin.com:6081/esi-listing-similars" /> into the main HTML of our page. Varnish simply (and correctly, I think) ignores the host name and port. Including http://similars.redfin.com:6081/esi-listing-similars will cause Varnish to act as if you included /esi-listing-similars, and Varnish will use whichever backend it thinks is relevant, regardless of the host name or port in the URI.
Using a Single Server as both a Standard Web Server and an ESI Server
When doing testing, or when some of our servers were unavailable, we were tempted to use a single server as both the standard web server and the ESI server. It seemed like this should work: the trick with the “/backend” prefix should prevent infinite circularity. However, it didn’t work. It seems that Varnish is doing its own checks for circularity, and noticing that a single request passed through the same Varnish instance multiple times (which NORMALLY would be a problematic example of circularity, but we’ve got our clever symmetry breaker in there!). Anyway, Varnish doesn’t allow it, and causes those semi-circular requests to fail.
P.S.
Thanks to D’Arcy Norman for the photo!
June 4, 2010
In the second installment of our San Francisco series of engineer-to-engineer lectures, Jeff Hammerbacher described the challenges of building data-intensive, distributed applications and how using Hadoop saved the day at Facebook. Speaking to an audience of approximately thirty Hadoop experts and enthusiasts hailing from all around the Bay Area, the Valley, and even Seattle, he also discussed what’s wrong with today’s analytical platforms and what will shape the platform of the future.
And Jeff should know. After studying Mathematics at Harvard and wearing a suit as a quantitative analyst on Wall Street, he conceived, built, and led the data team at Facebook. He then went on to start Cloudera, the leader in commercializing Apache Hadoop, where he currently works as Chief Scientist and VP of Products. Jeff also served as Contributing Editor for a book: Beautiful Data: The Stories Behind Elegant Data Solutions, the proceeds of which are split between Creative Commons and Sunlight Labs.
The Scoop on Hadoop
Hadoop is an open source framework that enables data-intensive distributed applications to efficiently process gigantic amounts of data. It’s an open source implementation of the MapReduce approach to processing data. MapReduce was invented at Google to deal with the massive quantities of data necessary to index the web. There are two main components to the system: the Hadoop Distributed File System (HDFS) which stores and maintains data across many machines, and the MapReduce engine which processes the data.
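For readers new to the MapReduce approach, here is a minimal in-memory sketch of its three phases using the classic word-count example. It runs on a single machine and does not use Hadoop's actual API; all function names are invented for illustration.

```javascript
// Map phase: turn one input record into a list of [key, value] pairs.
function mapWords(line) {
  return line
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((word) => [word, 1]);
}

// Shuffle phase: group all values by key.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce phase: collapse the values for one key into a single result.
function reduceCounts(values) {
  return values.reduce((sum, v) => sum + v, 0);
}

function wordCount(lines) {
  const pairs = lines.flatMap(mapWords);
  const counts = {};
  for (const [word, ones] of shuffle(pairs)) {
    counts[word] = reduceCounts(ones);
  }
  return counts;
}

console.log(wordCount(["the quick fox", "the lazy dog", "the fox"]));
```

In a real cluster the map and reduce calls would be spread across many machines, with the input records read from HDFS. The sketch keeps everything in one process to show only the shape of the computation.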
But the talk didn’t really go into Hadoop internals — as Jeff pointed out, the documentation is readily available online. Rather, the talk was about how and why Hadoop will provide the foundation on which the next generation platform for analytics will be built. Making bold predictions about technology is hard. Jeff quoted Larry Ellison’s quip that “the computer industry is the only industry that is more fashion-driven than women’s fashion.” And yet, using real-world examples from his experience at Facebook, Jeff makes a compelling sell.
Bottlenecks, Costs, the Black Box, and the Kitchen Sink
A typical architecture for large-scale data analysis includes a data source, a data warehouse, ETL (aka: “extract-transform-load”; the step that gets data out of and into RDBMSs and converts source data to the data warehouse’s format), and business intelligence and analytics systems – all of which are usually centered around relational databases. However, Jeff stressed that a relational database is a specialty and not a foundation, arguing that the abstractions provided by them are no longer useful on their own for analytical data management.
One reason is that over the past few years, there has been an explosion in data volume primarily originating from machine-generated logs. By simply tweaking an Apache log, you can grow your data volume and complexity by several orders of magnitude. As we’ll see in Facebook’s case, their relational database approach simply didn’t scale and they soon needed new tools to handle the load.
Another point Jeff made was that the percentage of data that actually gets stored in a relational database is shrinking. What do you do with all the unstructured data (accounting for 95% of the digital universe) that doesn’t necessarily make sense to persist relationally? Do you still need expensive relational data warehouses and proprietary boutique servers? Jeff’s team at Facebook made a bet on commodity hardware which turned out to be a good move, ultimately pushing the complexity out of hardware and into the software layer.
They also bet on open source data stores built by consumer web firms, arguing that web properties have the most representative problems: scalability and unstructured data management. Jeff stated that most production-quality data stores came from enterprise software firms in the mid-1990s, but now a growing percentage of the world’s data is persisted in open source data stores. He also mentioned that a nice side-effect of adopting open source solutions is that it’s much better to have a modular collection of open tools rather than an opaque abstraction. Why? Because there’s great benefit in being able to pick and choose solutions and understand what’s going on under the hood.
Jeff noted that another problem is that, in many cases, enterprise software does not service developers well. Many relational data warehouses simply expose SQL; but to get real traction/adoption from developers, you need more than that… You need open applications for analysis, not just a SQL interface. He feels that “in addition, these data stores often expose a proprietary interface for application programming (e.g. PL/SQL or TSQL), but not the full power of procedural programming. More programmer-friendly parallel dataflow languages await discovery, I think. MapReduce is one (small) step in that direction.”
Where is this new platform going to come from? Any new platform must be centered around addressing these new user needs, which is hard to achieve by re-implementing an old spec in a new, clever way. He cautioned that implementing a new, successful cut of the ANSI SQL spec would be a real undertaking. Not only would it take ages before you had anything to show, but it would likely suffer the same scalability problems of previous implementations.
Facebook and Hadoop are Now Friends
Using Facebook as a real-world example, Jeff described the challenge of measuring how changes to the site improved or impaired user experience. Their original data analysis system featured source data living on a horizontally partitioned MySQL tier and a cron job running Python scripts that pinged stats back to a central MySQL database. The main problem with this setup was that it made intensive historical analysis difficult since the source data was spread over many machines and aggregating the data to the analytics database was a slow, inefficient process. Plus, when it barfed, it took three days to replay the edit logs in order to diagnose the problem.
So Facebook hired a data warehouse engineer to build a 10TB Oracle warehouse. This worked for a bit and would’ve been fine for small and medium-sized businesses, but ultimately didn’t scale — particularly when they turned on impression logging which generated over 400GB of data on the first day! This quickly grew to 1TB of data per day in 2007.
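A back-of-the-envelope calculation shows why those volumes were a problem for a warehouse of that size. The sketch below is deliberately crude and rests on assumptions not stated in the talk: an empty warehouse, 1 TB counted as 1000 GB, and no compression, aggregation, or expiry of old data. The function name is made up for the example.

```javascript
const WAREHOUSE_TB = 10; // warehouse size mentioned in the talk

// Days until a warehouse of the given size fills at a constant daily rate.
// Assumes 1 TB = 1000 GB, an empty warehouse, and no compression or expiry.
function daysToFill(capacityTb, gbPerDay) {
  return (capacityTb * 1000) / gbPerDay;
}

console.log(daysToFill(WAREHOUSE_TB, 400) + " days at 400 GB per day");
console.log(daysToFill(WAREHOUSE_TB, 1000) + " days at 1 TB per day");
```

Under those assumptions, raw impression logs alone would fill the warehouse in a few weeks at the first-day rate, and in about ten days at the 2007 rate.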
You might suggest that since disks are cheap, why not throw more storage at the warehouse? It turns out that, in addition to the problem of data volume, there was also a bottlenecking CPU utilization problem. The ETL process ended up taking more than a day to aggregate, import, and load the necessary data for analysis. Jeff went on to explain that proprietary ETL vendors have lots of downsides and generally don’t scale well for large sets of databases (on the order of thousands, in Facebook’s case). In addition, when “warts” start to show up in proprietary vendors, the closed nature of the software prohibits developers from tinkering with the source to diagnose and resolve problems.
Meanwhile, his team started to play with Hadoop on the side as an open source alternative. They got a Hadoop cluster to replace the data collection and processing tiers. So the new architecture still has multiple data sources (log files, MySQL) but is now fed into HDFS instead. Work is done via MapReduce and the artifacts are then published to Oracle RAC servers for consumption by business intelligence and analysis. It also simultaneously publishes results back to the MySQL tier.

From “Facebook’s Petabyte Scale Data Warehouse using Hive and Hadoop”, slide 21
Initially, this shift was met with a lot of resistance mostly because Hadoop is Java-based and, since the majority of Facebook’s services were written in C++, the developers there weren’t comfortable in Java. But it wasn’t long before the new platform showed its strengths:
- Switching to this system greatly reduced latency because the ETL process is no longer done in flight – it’s done after persistence in Hadoop.
- Hadoop enabled Facebook to efficiently crunch extremely large data sets, on the order of multiple petabytes, which was previously impractical under the old system.
- The Hadoop data warehouse became easily accessible to developers which turned out to be a real bonus. Developers previously found SQL to be an unfriendly environment because they couldn’t predict the impact of running SQL (it was easy for them to hose themselves and others) and because the dev environment for SQL was crude. After switching, however, they found that a lot of Facebook’s developers started freely playing with the data set which fostered innovation and led to new features.
Shaping a New Platform
Jeff emphasized that while Hadoop provides a great foundation for data analysis, it’s not the whole story. Today, there are many technologies built on top of Hadoop that need to be considered for your system. For example, there is Hive, a system for offline analysis, and HBase, an open source implementation of Google’s BigTable, to name a couple. He remarked that the abstraction layer needs to be redrawn to include the functionality provided by ETL, master data management (MDM), stream management, reporting, online analytical processing (OLAP), and search tools; all with a unified UI.
Jeff explained that SQL Server 2008 R2 is a good model. SQL Server is no longer just a database – there are a bunch of associated products in the box offering a full suite of features. You still have the old features like SQL Server Integration Services (ETL), SQL Server (data warehouse), SQL Server Reporting Services, SQL Server Analysis Services, and full-text search. But now you also get a bunch of new features such as stream management (StreamInsight) providing real-time analytics, OLAP (PowerPivot) enabling rapid navigation of subsets of data, collaboration via SharePoint, MDM for integrating disparate data sources and entity resolution, and features that aid in scaling your servers out to a many-node SQL solution. Jeff remarked that it’s “kind of scary that Microsoft has started to do a lot right within the last 5 years.”
Providing a full suite of features is also what Cloudera does well, but for Hadoop. They’re not the primary developers of this stuff (currently only 3 out of 17 contributors on HDFS), but they do an excellent job at packaging and polishing Hadoop and make their money in training, services, and support. And, like Microsoft, they eat their own dogfood: using the tools they build to solve their own business problems. Jeff joked that it’s “interesting being a vendor now – I can see what we put these other vendors through [while at Facebook].”
Many thanks to Jeff for the great talk, Greylock for helping with the logistics (and providing the delicious pizza and beer), and to everybody that came out! Be sure to check out the next talk on June 10th when our own Sasha Aickin, Redfin’s head of user experience, will weigh HTML 5 vs. Native Apps.