Getting started with Mechanical Turk for data cleanup

In the months leading up to the launch of Redfin Open Book, we embarked on an ambitious data cleanup project: we had 7,500+ free-form text fields from which we needed to extract structured vendor information, and we ended up with 3,000+ cleaned database records. In this post I’ll walk you through how we used Amazon Mechanical Turk (MTurk), and share the lessons I learned along the way.

The source data

These free-form text fields were notes taken by folks in the field about the various vendors they worked with in client transactions. As with any free-form text field, the content was about as unique as the person entering it. Some of the notes were fairly suitable for computer processing, containing information in a format similar to this:

Jane Doe
ABC Company
(206) 555-5555
123 Main St, #75
Seattle, WA

The challenge was that they frequently included “notes to self” about the vendor, like “Jane will be at the client’s on 4/15” or “Quoted 675 to the client”, etc. Variations in order were also frequent.

A dead end

Having a software engineering background (and reminiscing about my year-long school project on natural language processing), I first attempted to extract and structure the data with automated tools.

I spent a couple of days brushing up on text mining tools, and found RapidMiner to be the most suitable (a great tool; I hope to do a write-up on it later).

However, I realized I wouldn’t be able to extract much more than emails and phone numbers without a significant investment in coding.

Tapping the collective: getting started with MTurk

I had researched MTurk in the past and was somewhat familiar with the premise of the service. However, I had never used it beyond small trial jobs.

I won’t go into detail about how it works; that’s what the documentation is for. Here are some basic concepts that should get you up and running with little prior experience:

  • At the core of MTurk is the concept of a “HIT”, or Human Intelligence Task: the unit of work you’ll request from your workers
  • When defining a HIT, you need to think of:
    • A set of input variables (what data will you provide?)
    • A set of output variables (what data do you want to extract?)
    • The layout the workers will see
  • You define the job parameters in the HIT template; these include the qualifications you’ll require of workers, how much you’ll pay per HIT, and how many times you want each HIT to be completed (you want at least 2; more on this below)
  • You publish HITs in “batches”. You can use the command-line interface, the API, or simply upload .CSV files; I used the latter
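
For illustration, here’s roughly what a small batch file could look like for this project. The column names and values are placeholders; the real header comes from the template download described in the steps below:

```
record_id,note
1042,"Jane Doe / ABC Company / (206) 555-5555 / 123 Main St, #75 / Seattle, WA"
1043,"John Smith / Quoted 675 to the client / will be at the client's on 4/15"
```

Each row becomes one HIT, completed as many times as your assignments setting specifies.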

Step-by-step guide

Here are the key steps. For brevity I’ve skipped the ones that are self-explanatory (funding your account, etc.).

  1. Create your account at https://requester.mturk.com/, if you don’t already have one
  2. Click on the “Design” tab, and select any category
  3. From the list of templates presented, select the “Data Correction” template, as it has a few examples of how to work with input/output fields
  4. Create a title for the HIT that describes what’s required from the worker. Besides the reward per HIT, this is the information workers will use to decide whether or not to work on your HIT. Here’s what I used: “You will see a small paragraph of text that contains contact information about a vendor and you’ll split it into separate fields (name, address, phone, etc.)”
  5. Update any parameters you’d like. I changed the reward per assignment (I started at 8¢) and set the number of assignments to 2 per HIT
  6. Click on the layout; using HTML form syntax, you can define the input and output parameters (see the sketch after this list)
    • Input parameters are strings of the form ${Parameter}. In my case I had two inputs: the free-form text field, and a unique ID that I could join back to our database records
    • Output parameters are the names you give to the form elements (INPUT, TEXTAREA, etc.) that appear in your layout. In my case they were: First Name, Last Name, Phone, Address, etc.
  7. After saving, you’re ready to load your first batch. Go to the “Publish” tab
  8. Click the “Download” link to the right of the “Upload” button. You will receive a .CSV file with a column for each input.
  9. Make your data match that .CSV format, filling out the input parameters. I did some pre-processing in Excel to discard rows that weren’t worth processing (e.g. empty or single-word notes), and I replaced newlines with “/” characters using Excel’s SUBSTITUTE function (a scripted version of this step is sketched after this list)
  10. Upload an initial file with a small sample of your data (100 rows or fewer)
  11. Go to “Manage” and in a few minutes you should be able to download a .CSV with the results
  12. The results .CSV has tons of columns, but starting on column AA you will see a column for each input and each output (in alphabetical order)
  13. If you chose two assignments per HIT, you will see two rows of output for each row you submitted
  14. For post-processing, I highly recommend Excel. In the spreadsheet I used, columns AA & AB held the data shown to the workers, the next block of columns held the results from the HITs (two rows per HIT), and columns AS through BH compared the content, using conditional formatting to highlight fields where the two outputs from the same HIT disagreed (a scripted version of this comparison appears after this list)
  15. Fixing discrepancies can be labor-intensive, and simple Excel macros can be of great help. An alternative that occurred to me was to feed the discrepancies back into MTurk to identify the correct output. I ended up not pursuing it, since I was able to enlist the help of some folks at work. Let me know if you try this alternative!
  16. Once all discrepancies are resolved, voilà! You have structured data you can upload to your database
  17. Make any updates necessary to the design template and publish subsequent batches to process the rest of your data
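
To make step 6 concrete, here’s a minimal layout sketch. The input variable ${note} and the output field names are illustrative, not the exact ones I used:

```html
<!-- Instructions shown to the worker -->
<p>Split the vendor notes below into separate fields.
   Leave a field blank if the information isn't present.</p>

<!-- ${note} is an input variable, filled in from a column of the uploaded .CSV -->
<p><strong>Vendor notes:</strong> ${note}</p>

<!-- Each form element's "name" becomes an output column in the results .CSV -->
<p>First name: <input type="text" name="first_name" /></p>
<p>Middle name: <input type="text" name="middle_name" /></p>
<p>Last name: <input type="text" name="last_name" /></p>
<p>Phone (888-888-8888 format): <input type="text" name="phone" /></p>
<p>Address: <input type="text" name="address" /></p>
<p>Suite #: <input type="text" name="suite" /></p>
```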
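
Step 9 can also be scripted instead of done with Excel formulas. A minimal sketch, assuming the raw export has record_id and note columns (hypothetical names; match them to your template’s header):

```python
import csv

# Pre-processing sketch: discard rows that aren't worth sending to workers
# and flatten multi-line notes, mirroring Excel's SUBSTITUTE step.
with open("raw_notes.csv", newline="", encoding="utf-8") as src, \
     open("batch_input.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["record_id", "note"])
    writer.writeheader()
    for row in reader:
        note = row["note"].strip()
        if len(note.split()) < 2:  # skip empty or single-word notes
            continue
        # Replace newlines with "/" so each note fits on a single line
        flat = note.replace("\r\n", "/").replace("\n", "/")
        writer.writerow({"record_id": row["record_id"], "note": flat})
```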
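
Finally, here’s the discrepancy check from step 14 as a script. This sketch assumes the results file uses the batch-results column layout (HITId, WorkerId, and Answer.<field>) and the illustrative field names from the layout sketch above; verify against the header of your own download:

```python
import csv
from collections import defaultdict

FIELDS = ["first_name", "middle_name", "last_name", "phone", "address", "suite"]

# Group the assignments for each HIT by its HITId
hits = defaultdict(list)
with open("batch_results.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        hits[row["HITId"]].append(row)

# Flag fields where the two workers' answers disagree
for hit_id, assignments in hits.items():
    if len(assignments) != 2:
        continue  # HIT not completed twice yet
    a, b = assignments
    for field in FIELDS:
        va = a.get(f"Answer.{field}", "").strip()
        vb = b.get(f"Answer.{field}", "").strip()
        if va.lower() != vb.lower():
            # Printing worker IDs makes repeat offenders easy to spot
            print(f"{hit_id} {field}: {va!r} ({a['WorkerId']}) vs {vb!r} ({b['WorkerId']})")
```

Filtering this output by worker ID is also a quick way to spot the repeated-error pattern mentioned in the tips below.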

Tips and Tricks

  • The one tip to remember: run small batches first to see if there’s a pattern in the discrepancies you’re getting. Go back to your layout design and address those patterns by being specific about what you want from the workers. Using this iterative approach, I reduced the error rate from 11% to 4%
  • Less ambiguity yields better results. Initially I asked only for “First Name” & “Last Name”, but I saw differences in how cases like “Arthur C. Clarke” were handled. I decided to add a “Middle Name” column, and discrepancies in names dropped from 8% to under 2%. A similar improvement applied to Address when I added “Suite #” as a separate field to extract.
  • Be clear about the format you expect for the extracted data. Phones had a high discrepancy rate until I requested that phone numbers be entered in the 888-888-8888 format
  • The more specific you are (i.e. the more columns), the *faster* workers get through your HITs, because you’re taking the guesswork out of how you want the data entered. The hourly rate dropped from $4.20 for the first batches to $2.40 for the last batch, which had five additional columns.
  • When you see an odd pattern of errors (in my case it was first and last names swapped), take a look at the worker ID. Some workers were making the same error over and over, so I was able to fix those errors quickly by filtering on that worker’s ID
  • Reject the HITs where the worker clearly didn’t follow instructions. You’ll be able to require a minimum HIT approval rate in future batches, and you’re helping future requesters.
