Engineer-to-Engineer: Evolving a New Analytical Platform with Hadoop

In the second installment of our San Francisco series of engineer-to-engineer lectures, Jeff Hammerbacher described the challenges of building data-intensive, distributed applications and how using Hadoop saved the day at Facebook.  Speaking to an audience of approximately thirty Hadoop experts and enthusiasts hailing from all around the Bay Area, the Valley, and even Seattle, he also discussed what’s wrong with today’s analytical platforms and what will shape the platform of the future.

And Jeff should know.  After studying Mathematics at Harvard and wearing a suit as a quantitative analyst on Wall Street, he conceived, built, and led the data team at Facebook.  He then went on to start Cloudera, the leader in commercializing Apache Hadoop, where he currently works as Chief Scientist and VP of Products.  Jeff also served as Contributing Editor for a book: Beautiful Data: The Stories Behind Elegant Data Solutions, the proceeds of which are split between Creative Commons and Sunlight Labs.

The Scoop on Hadoop

Hadoop is an open source framework that enables data-intensive distributed applications to efficiently process gigantic amounts of data.  It’s an open source implementation of the MapReduce approach to processing data.  MapReduce was invented at Google to deal with the massive quantities of data necessary to index the web.  There are two main components to the system: the Hadoop Distributed File System (HDFS) which stores and maintains data across many machines, and the MapReduce engine which processes the data.

But the talk didn’t really go into Hadoop internals — as Jeff pointed out, the documentation is readily available online.  Rather, the talk was about how and why Hadoop will provide the foundation on which the next generation platform for analytics will be built.  Making bold predictions about technology is hard.  Jeff quoted Larry Ellison’s quip that “the computer industry is the only industry that is more fashion-driven than women’s fashion.”  And yet, using real-world examples from his experience at Facebook, Jeff makes a compelling sell.

Bottlenecks, Costs, the Black Box, and the Kitchen Sink

A typical architecture for large-scale data analysis includes a data source, a data warehouse, ETL (aka: “extract-transform-load”; the step that gets data out of and into RDBMSs and converts source data to the data warehouse’s format), and business intelligence and analytics systems – all of which are usually centered around relational databases.  However, Jeff stressed that a relational database is a specialty and not a foundation, arguing that the abstractions provided by them are no longer useful on their own for analytical data management.

One reason is that over the past few years, there has been an explosion in data volume primarily originating from machine-generated logs.  By simply tweaking an Apache log, you can grow your data volume and complexity by several orders of magnitude.  As we’ll see in Facebook’s case, their relational database approach simply didn’t scale and they soon needed new tools to handle the load.

Another point Jeff made was that the percentage of data that actually gets stored in a relational database is shrinking.  What do you do with all the unstructured data (accounting for 95% of the digital universe) that doesn’t necessarily make sense to persist relationally?  Do you still need expensive relational data warehouses and proprietary boutique servers?  Jeff’s team at Facebook made a bet on commodity hardware which turned out to be a good move, ultimately pushing the complexity out of hardware and into the software layer.

They also bet on open source data stores built by consumer web firms, arguing that web properties have the most representative problems: scalability and unstructured data management.  Jeff stated that most production-quality data stores came from enterprise software firms in the mid-1990s, but now a growing percentage of the world’s data is persisted in open source data stores.  He also mentioned that a nice side-effect of adopting open source solutions is that it’s much better to have a modular collection of open tools rather than an opaque abstraction.  Why?  Because there’s great benefit in being able to pick and choose solutions and understand what’s going on under the hood.

Jeff noted that another problem is that, in many cases, enterprise software does not service developers well. Many relational data warehouses simply just expose SQL; but to get real traction/adoption from developers, you need more than that…  You need open applications for analysis, not just a SQL interface.  He feels that “in addition, these data stores often expose a proprietary interface for application programming (e.g. PL/SQL or TSQL), but not the full power of procedural programming.  More programmer-friendly parallel dataflow languages await discovery, I think.  MapReduce is one (small) step in that direction.”

Where is this new platform going to come from?  Any new platform must be centered around addressing these new user needs, which is hard to achieve by re-implementing an old spec in a new, clever way.  He cautioned that implementing a new, successful cut of the ANSI SQL spec would be a real undertaking.  Not only would it take ages before you had anything to show, but it would likely suffer the same scalability problems of previous implementations.

Facebook and Hadoop are Now Friends

Using Facebook as a real-world example, Jeff described the challenge of measuring how changes to the site improved or impaired user experience.  Their original data analysis system featured source data living on a horizontally partitioned MySQL tier and a cron job running Python scripts that pinged stats back to a central MySQL database.  The main problem with this setup was that it made intensive historical analysis difficult since the source data was spread over many machines and aggregating the data to the analytics database was a slow, inefficient process.  Plus, when it barfed, it took three days to replay the edit logs in order to diagnose the problem.

So Facebook hired a data warehouse engineer to build a 10TB Oracle warehouse.  This worked for a bit and would’ve been fine for small and medium-sized businesses, but ultimately didn’t scale — particularly when they turned on impression logging which generated over 400GB of data on the first day!  This quickly grew to 1TB of data per day in 2007.

You might suggest that since disks are cheap, why not throw more storage at the warehouse?  It turns out that, in addition to the problem of data volume, there was also a bottlenecking CPU utilization problem. The ETL process ended up taking more than a day to aggregate, import, and load the necessary data for analysis.  Jeff went on to explain that proprietary ETL vendors have lots of downsides and generally don’t scale well for large sets of databases (on the order of thousands, in Facebook’s case).  In addition, when “warts” start to show up in proprietary vendors, the closed nature of the software prohibits developers from tinkering with the source to diagnose and resolve problems.

Meanwhile, his team started to play with Hadoop on the side as an open source alternative.  They got a Hadoop cluster to replace the data collection and processing tiers.  So the new architecture still has multiple data sources (log files, MySQL) but is now fed into HDFS instead.  Work is done via MapReduce and the artifacts are then published to Oracle RAC servers for consumption by business intelligence and analysis.  It also simultaneously publishes results back to the MySQL tier.

Data Flow Architecture at Facebook

From “Facebook’s Petabyte Scale Data Warehouse using Hive and Hadoop“, slide 21

Initially, this shift was met with a lot of resistance mostly because Hadoop is Java-based and, since the majority of Facebook’s services were written in C++, the developers there weren’t comfortable in Java. But it wasn’t long before the new platform showed its strengths:

  • Switching to this system greatly reduced latency because the ETL process is no longer done in flight – it’s done after persistence in Hadoop.
  • Hadoop enabled Facebook to efficiently crunch extremely large data sets on the order of multi-petabytes, previously impractical under the old system.
  • The Hadoop data warehouse became easily accessible to developers which turned out to be a real bonus.  Developers previously found SQL to be an unfriendly environment because they couldn’t predict the impact of running SQL (it was easy for them to hose themselves and others) and because the dev environment for SQL was crude.  After switching, however, they found that a lot of Facebook’s developers started freely playing with the data set which fostered innovation and led to new features.

Shaping a New Platform

Jeff emphasized that while Hadoop provides a great foundation for data analysis, it’s not the whole story.  Today, there are many technologies built on top of Hadoop that need to be considered for your system.  For example: there is Hive, a system for offline analysis; there is HBase, an open source implementation of Google’s BigTable to name a couple.  He remarked that the abstraction layer needs to be redrawn to include the functionality provided by ETL, master data management (MDM), stream management, reporting, online analytical processing (OLAP), and search tools; all with a unified UI.

Jeff explained that SQL Server 2008 R2 is a good model.  SQL Server is no longer just a database – there are a bunch of associated products in the box offering a full suite of features.  You still have the old features like SQL Server Integration Services (ETL), SQL Server (data warehouse), SQL Server Reporting Services, SQL Server Analysis Services, and full-text search.  But now you also get a bunch of new features such as stream management (StreamInsight) providing real-time analytics, OLAP (PowerPivot) enabling rapid navigation of subsets of data, collaboration via SharePoint, MDM for integrating disparate data sources and entity resolution, and features that aid in scaling your servers out to a many-node SQL solution.  Jeff remarked that it’s “kind of scary that Microsoft has started to do a lot right within the last 5 years.”

Providing a full suite of features is also what Cloudera does well, but for Hadoop.  They’re not the primary developers of this stuff (currently only 3 out of 17 contributors on HDFS), but they do an excellent job at packaging and polishing Hadoop and make their money in training, services, and support.  And, like Microsoft, they eat their own dogfood: using the tools they build to solve their own business problems.  Jeff joked that it’s “interesting being a vendor now – I can see what we put these other vendors through [while at Facebook].”

Many thanks to Jeff for the great talk, Greylock for helping with the logistics (and providing the delicious pizza and beer), and to everybody that came out!  Be sure to check out the next talk on June 10th when our own Sasha Aickin, Redfin’s head of user experience, will weigh HTML 5 vs. Native Apps.