A/B Testing Framework: Build or Buy?

Six months ago Redfin needed an A/B testing framework. The big question: “Do we use a service or do we build our own?” After a lot of investigating, we chose to build our own. This summary of our investigation should give some useful context to anyone facing the same decision.

What flavor do you need?

One thing I should point out is that A/B testing is a very broad term. It can range from testing marketing copy, to tracking session behaviors, to testing interactive features that require custom-defined behavior. Third-party solutions usually do one of these really well, but can’t do something else that you really need. Be clear on what you need your framework to do vs. what would be nice to have.

Buying or selling a home is a multi-month (sometimes multi-year) journey, which makes a customer’s relationship with Redfin pretty unique. We need to measure experiment results over months, not just within a single web session. Of course we don’t want to run A/B tests for months, so we use proxy measurements that we’ve correlated with long-term customer behavior. We knew we needed a framework that could test interactive features and track custom-defined behaviors.

Bandits (so hot right now)

One other consideration when picking a framework is your need for multi-armed bandit testing. At the highest level, multi-armed bandit algorithms automatically direct more traffic to the winning variant(s). This means you start getting more of the benefits of your change sooner and never end a test too soon, because you never really have to end a test (the algorithm just directs all the traffic to the winning variant). Bandit testing is “Oh so hot right now” and, like most things that are hot, there are strong opinions on why you’d be stupid not to use it and why you might be smart not to. Initially we were interested in bandit testing, but we decided we didn’t need it.
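To make the idea concrete, here is a minimal sketch of one simple bandit strategy, epsilon-greedy, written in Python. This is purely illustrative and is not the algorithm any particular vendor uses; the class name, the `simulate` helper, the conversion rates, and the 10% exploration rate are all assumptions made up for the example.

```python
import random


class EpsilonGreedyBandit:
    """Toy epsilon-greedy bandit over named variants (illustrative only)."""

    def __init__(self, variants, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.shows = {v: 0 for v in variants}
        self.conversions = {v: 0 for v in variants}

    def rate(self, variant):
        # Observed conversion rate so far (0.0 before any traffic).
        shows = self.shows[variant]
        return self.conversions[variant] / shows if shows else 0.0

    def choose(self):
        # Explore: with probability epsilon, pick a variant uniformly at random.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.shows))
        # Exploit: otherwise show the variant with the best observed rate.
        return max(self.shows, key=self.rate)

    def record(self, variant, converted):
        self.shows[variant] += 1
        if converted:
            self.conversions[variant] += 1


def simulate(true_rates, visitors=10_000, epsilon=0.1, seed=0):
    """Run simulated visitors through the bandit; return shows per variant."""
    bandit = EpsilonGreedyBandit(list(true_rates), epsilon, seed)
    for _ in range(visitors):
        variant = bandit.choose()
        # Hypothetical ground truth: each variant converts at a fixed rate.
        converted = bandit.rng.random() < true_rates[variant]
        bandit.record(variant, converted)
    return bandit.shows
```

Calling `simulate({"A": 0.05, "B": 0.20})` should send most of the simulated traffic to B once the algorithm has seen enough conversions, which is the "more benefit sooner" argument in a nutshell. It also hints at the catch: the traffic split keeps changing, so the results are harder to reason about than a fixed 50/50 test.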

 

Peer Research

We knew that all the big guys (Amazon, Google, Facebook, etc.) built their own framework, but they are waaay bigger than us and they started before commercial solutions were available. What we really wanted to know was “What are other startups our size doing? Were they buying or building?”

Luckily we’re backed by a couple of great venture capital firms who helped us connect with other companies in their networks. In total we talked with engineers and product managers at 11 companies including Etsy, Quora, Decide, Payscale, Zynga, Icanhascheezburger, Vizify, and more.

We learned that nine of the eleven startups we talked to had built their own A/B testing framework.

Surprised? We were.

Dan McKinley: Principal Engineer at Etsy and occasional sea captain.

The most useful insights from talking with other companies came from Dan McKinley at Etsy. If you haven’t watched his talk, you should. Dan was kind of this “purveyor of truth” around A/B testing for us. In our conversation with him (and in the video) he frequently drops these nuggets of knowledge (simply stated, yet very profound). We continually found out how true they were.

My favorites:

  • Counting is really hard.

  • Being able to dig into the data is important. Most of the time you’ll use the first round of data to figure out how you set up your experiment incorrectly.

Trying the Buy Route

While talking with other companies we also began a technical investigation into the new Google Content Experiments (GCE) API and an up-and-coming A/B testing framework provider. At Redfin, before we decide to build a large system, we always take a hard look at what already exists. We’re a small company, so in many cases we’ve been able to get further faster by relying on third-party tools.

One team at Redfin was already using GCE to do simple A/B tests, but had run into a lot of limitations. We were excited about the new “more robust” API, but we quickly found GCE was still too limited for our needs (e.g., poor support for testing on dynamic, AJAX-heavy pages).

 

The other third-party service we tried included multi-armed bandit testing and was great to work with: responsive, understanding, and agile. They even let us talk with one of their customers. It was very easy to set up, and we got it running quickly. However, we started having trouble with the system during the reporting phase, and debugging those problems was time-consuming and frustrating. Once we thought we had everything working, we ran an AA test (a test where both variants are identical) to verify our setup. Since the two variants were the same, we expected that the framework wouldn’t pick a winner. To our surprise, the algorithm picked a winner overnight. We repeated the AA test a couple more times with the same result. The provider explained this was actually expected because there isn’t any difference between the two variants. While this is “correct”, it blocked us from validating the accuracy of the framework. Bottom line: if you don’t trust the results from your A/B framework, it’s not going to work.
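For a framework with a fixed traffic split, one common way to sanity-check an AA test is a plain significance test on the two conversion counts. The Python sketch below uses a pooled two-proportion z-test; it is an illustration of the idea, not what the provider ran, and the function name and the counts in the usage note are hypothetical.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test.

    conv_a / conv_b are conversion counts; n_a / n_b are visitor counts.
    """
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # No conversions at all (or all conversions): nothing to distinguish.
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / std_err
    # For a standard normal, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))
```

With made-up AA counts such as `two_proportion_p_value(510, 10_000, 495, 10_000)`, the p-value comes out well above the usual 0.05 threshold, so no winner is declared. If a healthy setup keeps "finding" winners in AA tests far more often than the chosen threshold allows, something in the bucketing or counting is probably off.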

Building

Kevin & Wei-Ting.  The two engineers who built the framework.

We decided to switch to building our in-house A/B testing framework for three main reasons:

  1. From our experience using a third party, we had gained a deep appreciation for the ability to dig into the raw data to debug surprising results.

  2. We were more convinced we wanted the ability to connect our A/B results with internal business metrics.

  3. Nine of the eleven companies we talked with had built their own and were happy with their choice.

 

Building the first version of our own A/B testing framework took two developers (one senior, one junior) about six weeks.

That estimate comes with an important caveat: we already had several pieces of infrastructure we could reuse.

  • Bouncer: Code that gives us the ability to turn part of the site on and off for different groups of people. We added controls for experiments and the algorithm for putting users into buckets.

  • Apache weblog processing: We already have a great team of data scientists who process the Apache logs on an Amazon Redshift cluster. We added our own processing to pull out the A/B test results.

  • Google Spreadsheets: One of the best decisions we made was to use Google Spreadsheets for reporting. Our engineers wrote scripts that populate Google Spreadsheets with the test results and easy-to-read graphs. We can’t be sure, but we think creating our own HTML reports would have taken longer.
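The bucketing piece mentioned under Bouncer is the part people ask about most, so here is a minimal Python sketch of one common approach: hashing the user ID together with the experiment name. This is not Bouncer’s actual code; the function name, the variant labels, and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib


def assign_bucket(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant for one experiment."""
    # Including the experiment name means the same user can land in
    # different buckets across experiments, instead of always being "control".
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Read the first 8 bytes as an integer and spread it across the variants.
    index = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[index]
```

Because the assignment is a pure function of the user ID and experiment name, a returning user always sees the same variant without any stored state, and the log processing can recompute a user’s bucket later when debugging surprising results.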

Benefits

  • The big benefit of building our own framework has been the ability to connect user behavior with proprietary information. This would be impossible with a third-party solution without giving them access to our internal data or pulling in their data. We’ve integrated several custom events into our reports (e.g., number of homes viewed per visit).

  • Because the A/B framework uses familiar systems and languages and the code is in the same repository, any engineer at Redfin can debug it, which has helped streamline internal adoption.

  • We’re planning to use the same system for adding A/B testing in our emails and mobile apps. It will be nice to have the results in the same place.

Shortcomings

  • Our framework doesn’t get better unless we spend engineering time on it.

  • We haven’t implemented multi-armed bandit testing.

Conclusion

Thanks to Madrona Ventures and the engineers and product managers at the 11 companies we talked to, especially Dan McKinley, Chris Han, and Eli Tucker. Thus far, building our own framework feels like the right decision. We’re currently enabling A/B testing in our emails and plan to enable A/B testing in our mobile app this spring. At some point in the future we may find that third-party A/B testing solutions have matured enough that we’ll want to switch, but for now the ease of use and integration with our business data wins.

 

We’d love to hear your questions and comments.

Quinn Hawkins @qhawk

Lead Product Manager @Redfin

 

 
