Archive for the ‘Engineer-to-Engineer Series’ Category

July 29, 2010

Ryan Dahl Introduces Node.JS

Ryan Dahl, the creator of a high-performing web server written in JavaScript, came by Redfin’s San Francisco office to talk about his creation, Node.JS. It was a very funny, thoughtful talk, particularly because Ryan is somehow both opinionated and careful with the truth. He is the latest in a long line of speakers for Engineer-to-Engineer, a series of technical talks hosted by Redfin, Digg, Pandora and Greylock on topics such as Hadoop, Scala, HTML5, Cassandra and Clusto.

Ryan’s presentation is here, and below is a summary of what he said.

This is going to be very introduction-level, with apologies to anyone who has dived deeper.

The goal of Node is to do easy network programming, to be able to create servers and clients that can be thrown together in a fairly simple way, using JavaScript.

Node.JS is a set of C++ bindings for network I/O and socket I/O. The strong focus is on putting together network servers.

Node is a command-line tool. You need to compile it. There are no binaries available. It’s something that runs from your terminal. It doesn’t have any dependencies other than Python to build it.

Let’s understand it by example… The first example is a program that prints Hello and then in 2 seconds, says World.

setTimeout(function () {
  console.log('world');
}, 2000);
console.log('hello');

Node has a lot of browser-like APIs. When you’re in JavaScript, you expect it to be Browser-ey, even if it’s not Browser…ish, that is, even if it doesn’t run in the browser.

Node exits automatically. The program drops out when there’s nothing else to do. If there’s a callback pending it keeps running. In the example, after World, the program exits.
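The exit rule can be seen in a small sketch. The tick count and the 100 ms interval are arbitrary choices for illustration: the interval keeps the process alive, and once it is cleared nothing is pending, so Node exits on its own.

```javascript
// Sketch: a pending timer keeps the process alive; clearing it lets Node exit.
var ticks = 0;

var timer = setInterval(function () {
  ticks += 1;
  console.log('tick ' + ticks);
  if (ticks === 3) {
    // No callbacks remain after this, so the program drops out by itself.
    clearInterval(timer);
  }
}, 100);
```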

Now let’s make this more complicated. What if we want Hello every half second, then on an interrupt signal we want the program to print Bye?

setInterval(function () {
  console.log('hello');
}, 500);

process.on('SIGINT', function () {
  console.log('bye');
  process.exit(0);
});

In the browser your central object is the window; in node it’s a process. This global variable exists always.

It’s like a browser listening for a click event. And it’s also like a UNIX program in that you have to end the program. The process object emits an event when it receives a signal; you only have to listen for it. From the same object you can get the pid, the program arguments, the memory usage and the executable path.
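As a rough illustration of the properties just listed, the sketch below reads them off the global process object. The helper name describeProcess is made up for this example; the properties themselves are standard Node.

```javascript
// Sketch: a few properties of the global process object.
function describeProcess() {
  return {
    pid: process.pid,                     // process id
    args: process.argv.slice(2),          // program arguments (after node and the script path)
    rssBytes: process.memoryUsage().rss,  // resident memory, in bytes
    executable: process.execPath          // path to the node binary
  };
}

console.log(describeProcess());
```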

A TCP server emits a connection event whenever someone connects; you listen for that event and handle the connection.

Now let’s create an event…

1. net = require('net');
2.
3. s = net.createServer();
4.
5. s.on('connection', function (c) {
6.   c.end('hello!');
7. });
8.
9. s.listen(8000);

Line 1 loads a module, something browser-based JavaScript doesn’t support. Line 3 creates a server, lines 5 to 7 add an event listener on it, and finally line 9 sets up a port so the server is actually listening.
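To try the server without a separate client, here is a minimal sketch that starts it, connects once, and shuts down. The helper name helloOnce and the use of port 0 (which asks the OS for any free port) are assumptions of this sketch, not part of the talk.

```javascript
var net = require('net');

// Sketch: start the greeting server, connect once as a client, then shut down.
// Port 0 (an assumption for this sketch) asks the OS for any free port.
function helloOnce(callback) {
  var server = net.createServer();

  server.on('connection', function (c) {
    c.end('hello!');               // write the greeting and close the connection
  });

  server.listen(0, '127.0.0.1', function () {
    var received = '';
    var client = net.connect(server.address().port, '127.0.0.1');
    client.setEncoding('utf8');
    client.on('data', function (chunk) { received += chunk; });
    client.on('end', function () {
      server.close();              // nothing left pending, so Node can exit
      callback(null, received);
    });
    client.on('error', callback);
  });
}

helloOnce(function (err, greeting) {
  if (err) throw err;
  console.log(greeting);
});
```

Because the server is closed once the client has read the greeting, the program exits by itself, in line with the exit rule described earlier.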

File I/O is non-blocking too. Node does File I/O. Here’s a program that outputs the last time /etc/passwd was modified.

var stat = require('fs').stat;

stat('/etc/passwd', function (err, s) {
  if (err) throw err;
  console.log('modified: %s', s.mtime);
});

If you’re on a server being hit by thousands of people, you can’t just wait for the disk to spin, so Node takes the pragmatic view that you should never wait for something to happen. Set up the action to occur, but don’t wait for this action to occur. Give a callback and then drop back. There are two parameters. There’s an error object if the file is not there. Otherwise, you print out the time modified.
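The two-parameter callback can be sketched a little further. In the example below, the helper name modifiedTime and the path /no/such/file are hypothetical; the point is that the error arrives as the first argument instead of being thrown, and that the call returns before the disk has answered.

```javascript
var fs = require('fs');

// Sketch: hand the error to the caller's callback instead of throwing.
function modifiedTime(path, callback) {
  fs.stat(path, function (err, s) {
    if (err) return callback(err);   // e.g. ENOENT when the file is not there
    callback(null, s.mtime);         // mtime is a Date
  });
}

modifiedTime('/no/such/file', function (err, mtime) {
  if (err) {
    console.log('stat failed: ' + err.code);
    return;
  }
  console.log('modified: ' + mtime);
});

console.log('stat requested');   // printed first: the call above does not block
```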

Node can do HTTP too. If it was just TCP and file stuff, that would be very limiting. Load the HTTP module and create a server with a callback; the callback is called every time you have a request, and it writes the header, then Hello and World, to the response.

var http = require('http');

var server = http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.write('Hello\r\n');
  res.end('World\r\n');
});

server.listen(8000);

The HTTP response is chunked because we don’t know how long it will end up being, so we can’t put a Content-Length header at the top. Node is very good at streaming: we’re not limited to “Here’s this movie, buffer it all.” Node streams up to memory, down to disk.

Here’s a streaming HTTP server… it can stream responses without introducing a large amount of weight, because you don’t use a thread for each of these. If you curl it, you get Hello, then two seconds later, you get World.

var http = require('http');

var server = http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});

  res.write('Hello\r\n');

  // The response stays open; no thread is tied up while we wait.
  setTimeout(function () {
    res.end('World\r\n');
  }, 2000);
});

server.listen(8000);

This is low-level. It allows streaming requests, and requests can be hung while waiting for other things. With AJAX, connections are continually asking “Do you have anything new?”, which can be very taxing on the server. Long polling, on the other hand, only involves asking once and then getting a response when the server wants to send you one.
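A minimal sketch of long polling follows, under some assumptions: the names waiting and publish, the port 0 binding and the half-second delay are all made up for illustration. Requests are parked without a reply until the server has something to say.

```javascript
var http = require('http');

// Sketch of long polling: requests are parked until there is news to send.
var waiting = [];   // responses held open, waiting for the next message

var server = http.createServer(function (req, res) {
  // 'Connection: close' only so that this demo exits promptly afterwards.
  res.writeHead(200, {'Content-Type': 'text/plain', 'Connection': 'close'});
  waiting.push(res);               // no reply yet; the connection just hangs
});

// Answer every parked request with the message and clear the list.
function publish(message) {
  var parked = waiting;
  waiting = [];
  parked.forEach(function (res) {
    res.end(message + '\r\n');
  });
  return parked.length;            // how many clients were answered
}

// Demo: one client asks once, the server publishes half a second later.
server.listen(0, '127.0.0.1', function () {
  var url = 'http://127.0.0.1:' + server.address().port + '/';
  http.get(url, function (res) {
    res.setEncoding('utf8');
    res.on('data', function (chunk) { console.log('got: ' + chunk.trim()); });
    res.on('end', function () { server.close(); });
  });
  setTimeout(function () { publish('something new'); }, 500);
});
```

The client asks exactly once; the cost of a parked request on the server is one entry in an array, not a thread.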

Node’s HTTP server is enabled by the HTTP parser. You can check out http://github.com/ry/http-parser

You might be thinking: “HTTP, Jeez, how hard could it be, it’s a simple protocol.” You’re wrong. HTTP in the real world is extremely complicated. It’s difficult to be able to parse the headers and be able to expose this streaming nature without buffering. This HTTP server buffers nothing. It’s totally callback-based.

The HTTP server only uses 28 bytes per HTTP connection, which is important when you have 1,000 people chatting on a server. 28 bytes is acceptable for overhead; 4 megabytes isn’t.
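The arithmetic behind that comparison, as a quick sketch. It takes the 4 megabytes to be the per-thread cost mentioned later in the talk, and counts a megabyte as 1024 × 1024 bytes (an assumption).

```javascript
// Back-of-the-envelope: per-connection overhead for 1,000 concurrent clients.
var CLIENTS = 1000;
var PARSER_BYTES = 28;                 // per connection, figure from the talk
var THREAD_BYTES = 4 * 1024 * 1024;    // 4 MB per thread, figure from the talk

var parserTotal = CLIENTS * PARSER_BYTES;   // 28,000 bytes
var threadTotal = CLIENTS * THREAD_BYTES;   // 4,194,304,000 bytes

console.log('parser state: ' + (parserTotal / 1024).toFixed(1) + ' KB');
console.log('thread memory: ' + (threadTotal / (1024 * 1024 * 1024)).toFixed(1) + ' GB');
```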

Now let’s do inter-process communication with other processes. In this example, you pull out the child process. This is something that can spin the disk. Your CPU is much, much faster than the disk. Don’t wait for the disk.

exec = require('child_process').exec;
exec('ls /', function (err, output) {
  if (err) throw err;
  console.log(output);
});

It’s worth noting that Node never forces you to buffer output. You can also stream data through the standard in and out of a child process.

Now we spawn the program cat, and we get a reference to that program. Whatever you send to cat, it sends back. You type in Hello, wait 2 seconds, then type Bye. You get Hello, then wait 2 seconds, then get Bye.

spawn = require('child_process').spawn;

cat = spawn('cat');

cat.stdin.write('hello\n');

setTimeout(function () {
  cat.stdin.end('bye\n');
}, 2000);

cat.stdout.on('data', function (d) {
  console.log(d);
});

Connecting streams is common. Where I want to go with Node is thinking of everything in terms of streams. There’s standard in and out, there’s file streams, HTTP connections. But mainly we deal a lot with streams. Generally we’re proxying streams and modifying them in the middle.
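Proxying a stream and modifying it in the middle might look like the sketch below. It uses stream.Transform and stream.PassThrough from current Node releases, which are not the 2010 API shown in the talk, and the upperCaser helper is made up for illustration.

```javascript
var stream = require('stream');

// Sketch: a stream that modifies data in the middle of a pipeline.
// stream.Transform is from current Node releases, not the 2010 API in the talk.
function upperCaser() {
  return new stream.Transform({
    transform: function (chunk, encoding, done) {
      done(null, chunk.toString().toUpperCase());
    }
  });
}

var source = new stream.PassThrough();   // stands in for a socket or file stream
var collected = '';

source.pipe(upperCaser()).on('data', function (d) {
  collected += d.toString();
  process.stdout.write(d);
});

source.write('hello\n');
source.end('bye\n');
```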

So this is JavaScript outside the browser. Yes! That’s almost what everybody wants. We’re interacting with the OS in a browser-like way.

We have an HTTP library for streaming. But wait, there’s more… here’s a contrived but interesting web-server benchmark. We’ve set up four web servers. They’re all going to respond with a 1 megabyte file. 100 concurrent clients connect.

  • Node: 822 requests per second
  • Nginx (web server written in C, popular with the Ruby crowd, consider this as good as it gets): 708
  • Thin: 85
  • Mongrel: 4

This should be shocking to you. You should be urinating right now. Or getting angry. It shocks me.

There are some caveats. NGINX peaked at 4 MB of memory, and Node at 60 MB. I also didn’t sit down for hours and try to make NGINX fast, as I did with Node.

There are a lot of places in Node where the opposite is true, where it sucks while everything else is good. SSL for example.

Node is written on Google’s V8, the JavaScript engine in Chrome. V8 is a masterpiece of engineering. Google took the 14 best VM engineers and locked them in a closet in Denmark. They were given the JavaScript spec and then told to make it fast.

It’s an amazing VM. Much better than Ruby or Python. Incomparable. Or comparable I guess… All these callbacks must seem weird to you but that is where our speed increase comes from.

var result = query('select * from T'); // use result

If you’ve done traditional web programming, you’ve probably used ActiveRecord to access some record. You use a function to do the I/O, but what does your software do while it’s accessing the database? In many cases, nothing. It’s the year 2010, we’re using Rails, and when you access a database, it stops; the world stops for who knows how long. The database might be in LA, and it takes 2 seconds to respond.

To mitigate that, we load balance with multiple processes, all waiting 2 seconds. That’s a form of concurrency to be sure, I guess that’s what processes are for.

When you access stuff in the CPU, it’s very fast. You can assume any operation to take zero amount of time, until you access the disk or the network. It’s not appropriate to treat operations in the CPU in the same way as operations on disk or I/O. Abstracting I/O as a function doesn’t make sense when the time-frames are so different.

  • 3 cycles for L1
  • 14 cycles for L2
  • 250 cycles for RAM
  • 41M cycles for disk
  • 240M cycles for network
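To put those figures on one scale, the short sketch below expresses each one as a multiple of an L1 access. The cycle counts are the ones quoted in the talk; the helper name l1ReadsPer is made up.

```javascript
// Cycle counts quoted in the talk.
var cycles = { L1: 3, L2: 14, RAM: 250, disk: 41e6, network: 240e6 };

// How many L1 cache reads fit in the time of one slower operation?
function l1ReadsPer(kind) {
  return Math.round(cycles[kind] / cycles.L1);
}

Object.keys(cycles).forEach(function (kind) {
  console.log(kind + ': ' + l1ReadsPer(kind) + ' L1 reads');
});
```

On these numbers a single network round trip costs as much as 80 million L1 reads, which is the mismatch the talk is pointing at.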

It’s unacceptable to wait for the database when you’re serving many clients. You can fork a thread – it’s hard in Ruby because its threading system is utterly crap, but Java can – so when one thread blocks while accessing the database, you can start new threads. That’s fine. But you can’t use an OS thread for each socket when you want good concurrency. Threads have weight to them, and context-switching is costly too. Each thread takes 4 meg of memory, which is a lot when you have 1,000 concurrent users.

The alternative to using threads is to structure your code like this:

query('select..', function (result) {
  // use result
});

Node is fast because it never blocks on I/O. And JavaScript is great for this. In Ruby there’s EventMachine, in Python there’s Twisted, but somehow it doesn’t jibe: you sit down to write the code and it doesn’t work the way that programming language is meant to work, and it doesn’t work with all the modules out there that do I/O, like a MySQL library. But the browser was already set up to be an event loop. Brendan Eich was a genius. Yes, it does one thing at a time, but also many things very quickly, because you never block on I/O.
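The callback style can be tried without a database. In this sketch, query is a hypothetical stand-in and setTimeout plays the part of the slow database; the 200 ms delay and the rows are made up.

```javascript
// Sketch: a stand-in for a database call. `query` and its 200 ms delay are
// hypothetical; setTimeout plays the part of the slow database.
function query(sql, callback) {
  setTimeout(function () {
    callback([{ id: 1 }, { id: 2 }]);   // made-up rows
  }, 200);
}

var log = [];

query('select * from T', function (result) {
  log.push('got ' + result.length + ' rows');
  console.log(log.join(' -> '));
});

// This runs immediately, while the "database" is still working.
log.push('handled another request');
```

The other work is logged before the rows arrive, even though the query was issued first: one thing at a time, but never waiting on I/O.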

And there’s a culture of JavaScript, an entire generation of programmers who grew up programming browsers, and now they can code on a server, without forking a thread and blocking on accept. Java people, on the other hand, find this callback concept difficult to grasp. “What do you mean? What is it doing while it’s doing nothing?”

Node jails you into this evented-style programming. You can’t do things in a blocking way, you can’t write slow programs.

Node consists of 3 C libraries: V8; event loop (libev) so you don’t have to write something for every OS; a thread pool (libeio), which is necessary for file I/O. There’s a layer for bindings, C++ glue, then the standard library is written in JavaScript. It’s not a thin binding to a C web server, it actually goes through a lot of JavaScript – that’s impressive – V8 is up to the task. I used to write web servers in Ruby, it was awful, every line of Ruby hurts performance; it’s a beautiful language, but a crappy virtual machine. V8 is not that way.

JavaScript can only access the main thread, the C layer has access to blocking functions – we don’t want to have a global interpreter lock – let’s let the experts have access to the threads. To use the threads, program in C.

I wouldn’t use Node.JS to make big websites, but it is one of the only solutions for making real-time, long-polling things. You’ll probably have a bunch of Rails servers and one Node server for a specialized function. As frameworks mature, you can use Node to build the whole website. You won’t have to load-balance it because it’s very fast but you’ll probably have to put it behind a web server, because you don’t trust it, or because SSL support still sucks. The bottleneck will be your gigabit connection into that machine, not memory or anything else.

And that was it! Many thanks to Ryan for a dazzling talk, and to everyone who came. Thanks too to Greylock, Digg and Pandora for helping us put on the event…


June 4, 2010

Engineer-to-Engineer: Evolving a New Analytical Platform with Hadoop

In the second installment of our San Francisco series of engineer-to-engineer lectures, Jeff Hammerbacher described the challenges of building data-intensive, distributed applications and how using Hadoop saved the day at Facebook.  Speaking to an audience of approximately thirty Hadoop experts and enthusiasts hailing from all around the Bay Area, the Valley, and even Seattle, he also discussed what’s wrong with today’s analytical platforms and what will shape the platform of the future.

And Jeff should know.  After studying Mathematics at Harvard and wearing a suit as a quantitative analyst on Wall Street, he conceived, built, and led the data team at Facebook.  He then went on to start Cloudera, the leader in commercializing Apache Hadoop, where he currently works as Chief Scientist and VP of Products.  Jeff also served as Contributing Editor for a book: Beautiful Data: The Stories Behind Elegant Data Solutions, the proceeds of which are split between Creative Commons and Sunlight Labs.

The Scoop on Hadoop

Hadoop is an open source framework that enables data-intensive distributed applications to efficiently process gigantic amounts of data.  It’s an open source implementation of the MapReduce approach to processing data.  MapReduce was invented at Google to deal with the massive quantities of data necessary to index the web.  There are two main components to the system: the Hadoop Distributed File System (HDFS) which stores and maintains data across many machines, and the MapReduce engine which processes the data.
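The MapReduce approach can be sketched in a few lines, in JavaScript to match the other examples in this archive. This is a toy that runs in one process; real Hadoop spreads these phases over many machines, and the function names here are illustrative, not Hadoop’s API.

```javascript
// Toy MapReduce in one process. Real Hadoop spreads these phases over many
// machines; the function names here are illustrative, not Hadoop's API.
function mapReduce(records, map, reduce) {
  var groups = Object.create(null);   // key -> list of values

  // Map phase: each record emits zero or more [key, value] pairs.
  records.forEach(function (record) {
    map(record).forEach(function (pair) {
      // Shuffle: collect the values for each key together.
      (groups[pair[0]] = groups[pair[0]] || []).push(pair[1]);
    });
  });

  // Reduce phase: fold each key's values into one result.
  var output = {};
  Object.keys(groups).forEach(function (key) {
    output[key] = reduce(key, groups[key]);
  });
  return output;
}

// Word count, the usual first example.
var counts = mapReduce(
  ['the web is big', 'the index is bigger'],
  function (line) {
    return line.split(' ').map(function (word) { return [word, 1]; });
  },
  function (word, ones) {
    return ones.reduce(function (a, b) { return a + b; }, 0);
  }
);

console.log(counts);
```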

But the talk didn’t really go into Hadoop internals — as Jeff pointed out, the documentation is readily available online.  Rather, the talk was about how and why Hadoop will provide the foundation on which the next generation platform for analytics will be built.  Making bold predictions about technology is hard.  Jeff quoted Larry Ellison’s quip that “the computer industry is the only industry that is more fashion-driven than women’s fashion.”  And yet, using real-world examples from his experience at Facebook, Jeff makes a compelling sell.

Bottlenecks, Costs, the Black Box, and the Kitchen Sink

A typical architecture for large-scale data analysis includes a data source, a data warehouse, ETL (aka: “extract-transform-load”; the step that gets data out of and into RDBMSs and converts source data to the data warehouse’s format), and business intelligence and analytics systems – all of which are usually centered around relational databases.  However, Jeff stressed that a relational database is a specialty and not a foundation, arguing that the abstractions provided by them are no longer useful on their own for analytical data management.

One reason is that over the past few years, there has been an explosion in data volume primarily originating from machine-generated logs.  By simply tweaking an Apache log, you can grow your data volume and complexity by several orders of magnitude.  As we’ll see in Facebook’s case, their relational database approach simply didn’t scale and they soon needed new tools to handle the load.

Another point Jeff made was that the percentage of data that actually gets stored in a relational database is shrinking.  What do you do with all the unstructured data (accounting for 95% of the digital universe) that doesn’t necessarily make sense to persist relationally?  Do you still need expensive relational data warehouses and proprietary boutique servers?  Jeff’s team at Facebook made a bet on commodity hardware which turned out to be a good move, ultimately pushing the complexity out of hardware and into the software layer.

They also bet on open source data stores built by consumer web firms, arguing that web properties have the most representative problems: scalability and unstructured data management.  Jeff stated that most production-quality data stores came from enterprise software firms in the mid-1990s, but now a growing percentage of the world’s data is persisted in open source data stores.  He also mentioned that a nice side-effect of adopting open source solutions is that it’s much better to have a modular collection of open tools rather than an opaque abstraction.  Why?  Because there’s great benefit in being able to pick and choose solutions and understand what’s going on under the hood.

Jeff noted that another problem is that, in many cases, enterprise software does not service developers well. Many relational data warehouses simply expose SQL, but to get real traction and adoption from developers, you need more than that: you need open applications for analysis, not just a SQL interface.  He feels that “in addition, these data stores often expose a proprietary interface for application programming (e.g. PL/SQL or TSQL), but not the full power of procedural programming.  More programmer-friendly parallel dataflow languages await discovery, I think.  MapReduce is one (small) step in that direction.”

Where is this new platform going to come from?  Any new platform must be centered around addressing these new user needs, which is hard to achieve by re-implementing an old spec in a new, clever way.  He cautioned that implementing a new, successful cut of the ANSI SQL spec would be a real undertaking.  Not only would it take ages before you had anything to show, but it would likely suffer the same scalability problems of previous implementations.

Facebook and Hadoop are Now Friends

Using Facebook as a real-world example, Jeff described the challenge of measuring how changes to the site improved or impaired user experience.  Their original data analysis system featured source data living on a horizontally partitioned MySQL tier and a cron job running Python scripts that pinged stats back to a central MySQL database.  The main problem with this setup was that it made intensive historical analysis difficult since the source data was spread over many machines and aggregating the data to the analytics database was a slow, inefficient process.  Plus, when it barfed, it took three days to replay the edit logs in order to diagnose the problem.

So Facebook hired a data warehouse engineer to build a 10TB Oracle warehouse.  This worked for a bit and would’ve been fine for small and medium-sized businesses, but ultimately didn’t scale — particularly when they turned on impression logging which generated over 400GB of data on the first day!  This quickly grew to 1TB of data per day in 2007.

You might suggest that since disks are cheap, why not throw more storage at the warehouse?  It turns out that, in addition to the problem of data volume, there was also a bottlenecking CPU utilization problem. The ETL process ended up taking more than a day to aggregate, import, and load the necessary data for analysis.  Jeff went on to explain that proprietary ETL vendors have lots of downsides and generally don’t scale well for large sets of databases (on the order of thousands, in Facebook’s case).  In addition, when “warts” start to show up in proprietary vendors, the closed nature of the software prohibits developers from tinkering with the source to diagnose and resolve problems.

Meanwhile, his team started to play with Hadoop on the side as an open source alternative.  They got a Hadoop cluster to replace the data collection and processing tiers.  So the new architecture still has multiple data sources (log files, MySQL) but is now fed into HDFS instead.  Work is done via MapReduce and the artifacts are then published to Oracle RAC servers for consumption by business intelligence and analysis.  It also simultaneously publishes results back to the MySQL tier.

[Figure: Data Flow Architecture at Facebook. From “Facebook’s Petabyte Scale Data Warehouse using Hive and Hadoop”, slide 21]

Initially, this shift was met with a lot of resistance mostly because Hadoop is Java-based and, since the majority of Facebook’s services were written in C++, the developers there weren’t comfortable in Java. But it wasn’t long before the new platform showed its strengths:

  • Switching to this system greatly reduced latency because the ETL process is no longer done in flight – it’s done after persistence in Hadoop.
  • Hadoop enabled Facebook to efficiently crunch extremely large data sets on the order of multi-petabytes, previously impractical under the old system.
  • The Hadoop data warehouse became easily accessible to developers which turned out to be a real bonus.  Developers previously found SQL to be an unfriendly environment because they couldn’t predict the impact of running SQL (it was easy for them to hose themselves and others) and because the dev environment for SQL was crude.  After switching, however, they found that a lot of Facebook’s developers started freely playing with the data set which fostered innovation and led to new features.

Shaping a New Platform

Jeff emphasized that while Hadoop provides a great foundation for data analysis, it’s not the whole story.  Today, there are many technologies built on top of Hadoop that need to be considered for your system.  For example, there is Hive, a system for offline analysis, and HBase, an open source implementation of Google’s BigTable, to name a couple.  He remarked that the abstraction layer needs to be redrawn to include the functionality provided by ETL, master data management (MDM), stream management, reporting, online analytical processing (OLAP), and search tools, all with a unified UI.

Jeff explained that SQL Server 2008 R2 is a good model.  SQL Server is no longer just a database – there are a bunch of associated products in the box offering a full suite of features.  You still have the old features like SQL Server Integration Services (ETL), SQL Server (data warehouse), SQL Server Reporting Services, SQL Server Analysis Services, and full-text search.  But now you also get a bunch of new features such as stream management (StreamInsight) providing real-time analytics, OLAP (PowerPivot) enabling rapid navigation of subsets of data, collaboration via SharePoint, MDM for integrating disparate data sources and entity resolution, and features that aid in scaling your servers out to a many-node SQL solution.  Jeff remarked that it’s “kind of scary that Microsoft has started to do a lot right within the last 5 years.”

Providing a full suite of features is also what Cloudera does well, but for Hadoop.  They’re not the primary developers of this stuff (currently only 3 out of 17 contributors on HDFS), but they do an excellent job at packaging and polishing Hadoop and make their money in training, services, and support.  And, like Microsoft, they eat their own dogfood: using the tools they build to solve their own business problems.  Jeff joked that it’s “interesting being a vendor now – I can see what we put these other vendors through [while at Facebook].”

Many thanks to Jeff for the great talk, Greylock for helping with the logistics (and providing the delicious pizza and beer), and to everybody that came out!  Be sure to check out the next talk on June 10th when our own Sasha Aickin, Redfin’s head of user experience, will weigh HTML 5 vs. Native Apps.


May 1, 2010

Engineer-to-Engineer Talk: How and Why Twitter Uses Scala

To kick off our San Francisco series of engineer-to-engineer lectures on new technologies and interesting problems in consumer software, we invited in the Great Alex Payne to talk about how Twitter uses Scala, a programming language that combines traits of object-oriented languages and functional languages with an eye toward supporting concurrency better in large-scale software.

Alex started at Twitter in 2007, working remotely in Washington DC, when there were “only one and a half engineers.” Now, Twitter has 170 engineers. “It has been an interesting process,” Alex said. Right after his talk, Alex packed up his cats and headed for Portland, where he’ll still work for Twitter, but ensconced in a smaller, more closely-knit community. Here are his thoughts on Scala (Alex talks fast, and doesn’t waste many words, so my hands were in a rictus of agony from trying to type what he said):

Best, Glenn at Redfin

I started working on the programming interface when we were at this very early stage. Now, it handles a couple billion operations every day. It is being baked into more and more of the Web.

I’ve spent the past year working on Twitter’s infrastructure. For that, we use a weird language called Scala. I worked on a book for O’Reilly about Scala that you could sit down with over a three-day weekend to get up to speed on the language.

Why Use Scala?
Why use Scala when you have Ruby and Ruby on Rails? Well, we still use Rails. It works great for front-end stuff. The productivity is worth the tradeoff for working in a slower-performing dynamic language. When you think about what a web framework is doing under the hood, it’s tons and tons of string concatenation. Ruby on Rails can handle that.

What we needed as Twitter grew was support for long-running heavy processes, message-queuing, and caching layers doing 20,000 operations a second. Ruby garbage-collection is tough, and Ruby doesn’t do really well with long-running processes.

Languages Twitter Considered
We knew we needed another language. How did we pick a language that was really fun for us? We considered Java, C/C++ of course. And we looked at Haskell and OCaml for functional programming, though neither has gotten much commercial use. Erlang developers are doing stuff with a lot of network I/O but not with a lot of disk I/O; the knowledge-base around the language wasn’t great though, and the community seemed inaccessible.

Java is easy to use, but it’s not very fun, especially if you’ve been using Ruby for a while. Java’s productive, but it’s just not sexy anymore. C++ was barely considered as an option. Some guys said, if I have to work in C++ again, I’m going to stab my eyes out with a shrimp fork. JavaScript on the server side via Rhino had performance problems, and it wasn’t quite there yet when we were evaluating it.

So what were our criteria for choosing Scala? Well first we asked, was it fast, and fun, and good for long-running processes? Does it have advanced features? Can you be productive quickly? Developers of the language itself had to be accessible to us as we’d been burned by Ruby in that respect. Ruby’s developers had been clear about focusing it on fun, even sometimes at the expense of performance. They understood our concerns about enterprise-class support and sometimes had other priorities.

We wanted to be able to talk to the guys building the language, not to steer the language, but at least to have a conversation with them.

Was Scala Fast?
And did Scala turn out to be fast? Well, what’s your definition of fast? About as fast as Java. It doesn’t have to be as fast as C or Assembly. Python is not significantly faster than Ruby. We wanted to do more with fewer machines, taking better advantage of concurrency; we wanted it to be compiled so it’s not burning CPU doing the wrong stuff.

What Alex Likes About Scala
Scala is a lot of fun to work in; yes, you can write staid, Java-like code when you start. Later, you can write Scala code that almost looks like Haskell. It can be very idiomatic, very functional — there’s a lot of flexibility there.

And it’s fast. The principal language developer at Scala worked on the JVM at Sun. When Java started, it was clearly a great language, but the VM was slow. The JVM has been brought to the modern age and we don’t think twice about using it.

Scala can borrow from Java libraries; you’re compiling down to Java byte code, and it’s all calling back and forth in a way that is really efficient. We haven’t run into any library dependencies that cause problems. We can hire people with Java experience and they can do pretty well.

The community is small but growing, and it’s really accessible. We got to sit down with Martin and ask him and his team about funding for Scala, how problems with Scala will get solved. We’ve never really had to call on that level of access, but it’s really nice to know it’s there.

The Grand Unified Theory of Scala
The grand unified theory of Scala is that it combines object-oriented programming (OOP) and functional programming (FP). Scala’s goal is to essentially say OOP and FP don’t have to be these separate worlds. It’s kind of zen, and you don’t get it when you first start. It’s really, really powerful; it’s nice to have a language with a thesis, rather than trying to appeal to every programmer out there. Scala is trying to solve a specific intellectual problem.

You have methods that take anything between a string and several points away on the inheritance chain from a string. The syntax is more flexible than Java; it’s very human-readable, as you can leave out the period between method calls so it looks like a series of words. Your program can make nice declarative statements about the logic of what you’re trying to do.

Traits, Pattern-Matching, Mutability
With Scala, you can also use traits. This is handy because of course you have cross-cutting concerns in your application. For example, every object needs to be able to log stuff, but you don’t want everything extending from a logger class — that’s crazy. With Scala, you can use a trait to shove that right in, and you can add as many traits as you like to a given class or object.

You can choose between mutability and immutability. This can be dangerous. 9 out of 10 times you use immutable variables when you want predictability, especially when you have stuff running concurrently. But Scala trusts the programmer for mutability when he or she needs it.

Scala has the concept of lazy values – you can say lazy val x = a really complicated function. That isn’t going to be calculated until the last second, when you need that value. This is nice.
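The idea of computing a value only on first use can be sketched as follows. This is JavaScript, to match the other examples in this archive, not Scala’s lazy val; the lazy helper and the stand-in computation are made up.

```javascript
// Sketch of lazy evaluation in JavaScript (not Scala): compute on first use,
// then reuse the cached result.
function lazy(compute) {
  var done = false;
  var value;
  return function () {
    if (!done) {
      value = compute();
      done = true;
    }
    return value;
  };
}

var calls = 0;
var x = lazy(function () {
  calls += 1;        // counts how often the expensive work really runs
  return 6 * 7;      // stands in for a really complicated function
});

// Nothing has been computed yet; the first x() does the work, the second reuses it.
console.log(x(), x(), calls);
```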

Pattern-matching is nice too. It lets you dive into a data structure so you can, for example, explode out a collection that matches an array with “2” as its third element. You can break out strings and regular expressions, and you can pattern-match groups with regular expressions.

An oddball feature that is really useful is the ability to use XML literals, so that you can make something equal to an XML literal, as if the XML literal is a string. You don’t have to import Sax or some crazy XML library.

The Concurrency Story
When people read about Scala, it’s almost always in the context of concurrency. Concurrency can be solved by a good programmer in many languages, but it’s a tough problem to solve. Scala has an Actor library that is commonly used to solve concurrency problems, and it makes that problem a lot easier to solve.

An Actor is an object that has a mailbox; it queues messages and deals with them in a loop, and it can leave a message on the floor when it doesn’t know what to do with it.
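The mailbox idea can be sketched in a few lines. This is JavaScript, to match the other examples in this archive, and it is not Scala’s Actor library; the Actor constructor, the handlers map and the message shapes are all made up for illustration.

```javascript
// Sketch of the mailbox idea in JavaScript; this is not Scala's Actor library.
// `handlers` maps a message type to a function; unknown types are dropped.
function Actor(handlers) {
  this.handlers = handlers;
  this.mailbox = [];
  this.dropped = 0;
  this.scheduled = false;
}

// Sending only queues the message; processing happens later.
Actor.prototype.send = function (message) {
  this.mailbox.push(message);
  if (!this.scheduled) {
    this.scheduled = true;
    setTimeout(this.drain.bind(this), 0);
  }
};

// The loop: take messages one at a time, in arrival order.
Actor.prototype.drain = function () {
  this.scheduled = false;
  while (this.mailbox.length > 0) {
    var message = this.mailbox.shift();
    var known = Object.prototype.hasOwnProperty.call(this.handlers, message.type);
    if (known) {
      this.handlers[message.type](message);
    } else {
      this.dropped += 1;          // left on the floor
    }
  }
};

var seen = [];
var logger = new Actor({
  log: function (m) { seen.push(m.text); }
});

logger.send({ type: 'log', text: 'first' });
logger.send({ type: 'ping' });                 // no handler for this one
logger.send({ type: 'log', text: 'second' });
```

The sender never waits for the work to be done, which is what makes the model feel like a queuing system.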

You can model concurrency as messages, each a unit of work, sent to actors, which is really nice. It’s like using a queuing system. You can also use java.util.concurrent, Netty and Apache Mina, dropping them right in. You can rewrite the Actor implementation, and some folks have gone so far as rolling their own software transactional memory libraries.

Java interoperability is a big, big win. There are ten years of great libraries, things like Joda-Time. We use a lot of Hadoop and it has been easy to wire Scala to the Hadoop libraries. We use Thrift, without having to patch it; we use libraries from the Apache Commons and from Google.

How Twitter Uses Scala
So that’s why we use Scala, but how do we use it?

In the enterprise world, a service-oriented architecture is not new, but in Web 2.0 it is crazy new science. With PHP or Ruby on Rails, when you need more functionality, you just include more plugins and libraries, shoving them all into the server. The result is a giant ball of mud.

So anything that has to do heavy lifting in our stack is going to be an independent service. We can load-test it independently; it’s a nice way to decompose our architecture.

What services at Twitter are Scala-powered? We have a queuing system called Kestrel. It uses a souped-up version of the memcache protocol. We originally wrote it in Ruby, which got us through a few weeks, but because Ruby is a dynamic language, the service began to show its performance weak spots.

Flock to Store the Social Graph
We use Flock to store our social graph, as a denormalized list of user ids. It’s not a graph database, so you can’t perform random walks along the graph. But it’s great for quickly storing denormalized sets of user ids, and doing intersections. We’re doing 20,000 operations a second right now, backed by a MySQL schema designed to keep as much as possible in memory. It has been very efficient — not many servers are needed.

Hawkwind for People Search
Our people-search is powered by a Scala-built service we called Hawkwind. It’s a bunch of user objects dumped out by Hadoop, where the request is fanned out to multiple machines and then pulled back together.

Hosebird for Streaming
We stream out tweets to public search engines, using a low-latency, HTTP-based, persistent connection system called Hosebird. We looked at queuing systems that financial-services companies use, but couldn’t find anything that could handle the volume of the load. We built something on top of Jetty using Scala. We have more Scala-powered services in the works that I can’t talk about.

Thrift for Transferring Data
We also use Thrift, built at Facebook then open-sourced at Apache. With Thrift, you can define data structures and methods, and it deals with everything you don’t want to deal with to efficiently represent data and get it from point A to point B. As your system evolves, your method signatures change, and Thrift has a nice system for creating positional arguments and being backwards compatible.

These services make our life a lot easier. We often staff projects with two people who are pair programming, sitting together for six or eight weeks. These guys can build something like people-search in a couple of months.

The only problem with so many different teams is that there is some divergence in terms of operational approaches – we have to work with ops guys to monitor the right stuff, be it disk or memory or what have you — but we can resolve that jitter over time. We’re ok with the tradeoffs.

The Development Environment
OK, now let’s talk about the tools. The IDEs for Scala are not up to snuff, that is true. IntelliJ IDEA is good but it’s shockingly buggy. The solution we’ve settled on is just using a plain text editor. We use Emacs, as there’s a really nice mode for the build tool. That takes compile/test BS out of your workflow. Of course, you can give the IDEs a try. Even though I’m an IDE cynic, maybe they’ve improved; that said, a plain text editor can be really productive.

Simple Build Tool
sbt is our Simple Build Tool, but it’s not simple or limited in any way. It’s Scala’s answer to Ant and Maven, and really it’s a superset of Ant and Maven. It’ll set up a new project, create a nice project structure for you and manage dependencies — you can slap ‘em right in by copying XML.

You can write your own build-tasks. We added support for Thrift in an afternoon; it’s got a library for shelling out, as Java is not so great at shell operations because it targets so many platforms. sbt is well documented. And the absolutely coolest feature is that it’s got an interactive console interface where you can type in code and see how it works.

So that means sbt can insert you in an interactive way into your running program. This is great for debugging, great for sketching code out. You have a nice workflow where you don’t have to worry about compilation.

specs
We’re very test-driven. We’re not wedded to behavior-driven development (BDD), but the best testing library in Scala, specs, is BDD-oriented. You can throw in different mocking libraries, and it works just as well in Scala as in Java.

Libraries
We’ve built a bunch of libraries. We gather a lot of stats, I mean, A LOT. We spent the first year of Twitter pushing forward on features, but never thinking about what we were building scientifically. That bit us in the ass in a big way.

You’ve probably seen a gradual increase in stability. At conferences, people ask us if it was the switch from Ruby to Scala, or if it was more machines. But really what did it was gathering numbers on everything, setting metrics and trying to improve.

Ostrich helps here. It is an in-process statistics gatherer, with counters, gauges, timers. You can share stats via JMX, JSON-over-HTTP etc. Hopefully it’s pretty simple to use and easy to integrate.

Configgy manages configuration files and logging in a really nice, flexible way. You can include config files in one another and you can do inheritance; it throws in a really nice logging wrapper, with lazy evaluation on the values you’re trying to log so you don’t burn machine-time generating log statements. It has a subscription API for pushing out a new config file. It’s a little crazy to have our own config file format, but Scala makes it work.
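
The lazy-evaluation trick can be sketched with a by-name parameter. This is a hypothetical logger written for illustration; it is not Configgy's actual API:

```scala
// Hypothetical logger, not Configgy's API.
class LazyLogger(val debugEnabled: Boolean) {
  // `msg: => String` is a by-name parameter: the expression passed in
  // is only evaluated if the body actually uses it.
  def debug(msg: => String): Unit =
    if (debugEnabled) println(msg)
}
```

With debug logging switched off, a call like `logger.debug(expensiveDump())` never runs `expensiveDump()`, so no machine-time goes into building a string nobody will read.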

xrayspecs: this is an extension to specs, because we need a way to test concurrent operations. Some of the extensions in xrayspecs have been merged back into specs. We can freeze and unfreeze time.

scala-json: this is a better Scala JSON codec. We’ve used this really heavily in production for a while. If you need something like this, hopefully it’ll do the job.

Other Twitter Scala libraries: Naggati (protocol builder for Apache Mina), Smile (Actor-powered memcached client), Querulous (a nice SQL database client) and Jackhammer (a load testing framework in its early stages). Check out GitHub for more.

How Do we Teach People?
I think we’re employing at Twitter about half the people in the world who know the Scala language. The other half are academics or at Foursquare. Even though Scala’s getting more and more popular, fundamentally we can’t hire people with experience in the language.

Pair Programming, Code Reviews
To start people out, we pair program. It isn’t mandatory at Twitter, but it’s a great way to learn Scala. We’ve come up with a bunch of style guides. The good and bad thing is that Scala’s going to be C++ in ten years, because there’s just a lot of surface area and it can get complicated. For that reason, we are pretty rigorous about code style.

We do code reviews; it doesn’t go into the master branch if it hasn’t been reviewed by your peers. Right now, I’m working with a guy we hired from Google. He’s an amazing engineer, far better than I am, but at first he didn’t know Scala.

When I looked at his code, there was absolutely nothing wrong under the hood. But we’d go through and say, “Here’s where this line could be a little more idiomatic from a Scala perspective.” I do classes over lunch, but you need a big group to commit to come every week. Then there’s my book, and there are other books: Dave Pollak’s book, the Odersky book (Programming in Scala, aka “the stairway book”). If you learn by example and need a desk reference, grab “the stairway book.” Or search Google for a talk by my co-worker on “The Seductions of Scala” for lots of examples.

What Version of Scala Does Twitter Use?
We use 2.7. It’s got a couple of warts, particularly in the collections classes. Scala 2.8 fixes a lot of those warts, and there’s a bunch of performance work in there too, plus the ability to have named arguments in your functions.
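
For a sense of what named arguments look like in 2.8, here is a tiny sketch (the `Timeline.fetch` method and its parameters are invented for the example):

```scala
object Timeline {
  // Hypothetical method with several parameters that are easy to mix up.
  def fetch(userId: Long, count: Int, includeRetweets: Boolean): String =
    s"user=$userId count=$count retweets=$includeRetweets"
}
```

A caller can then write `Timeline.fetch(count = 20, includeRetweets = false, userId = 12L)`, naming each argument and passing them in any order.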

I’m co-organizing a Scala summit at the OSCON conference in Portland this summer; come to that if you want to learn more! There’s a great blog called DailyScala, where an engineer writes about what he’s learning. I learn stuff from that guy all the time…

And that was it! Many thanks to Alex for his magnificent talk, and to all the lovely folks who visited our offices! We had a lot of fun, we learned a ton, and now we’re looking forward to hearing from Cloudera’s Jeff Hammerbacher, the man who conceived of and built the data team at Facebook, on Hadoop on May 20. Everyone’s invited!

