Reeling in the years, stowing away the time

Defining data feeds that need to be consumed elsewhere might seem like a “solved” problem, simply because there are so many data feeds and standards out there in the wild.  But, as the story below demonstrates, there are a lot of subtleties that need to be considered when determining how to represent each piece of data.

Redfin recently implemented a feature to allow people to search based on the monthly amount of Homeowner’s Association Dues (HOA Dues).  One of the most difficult aspects of this feature has been determining whether to interpret the dues as a monthly or a yearly amount (e.g. $100/month or $100/year).

First, a bit of background… Redfin pulls in real estate data from dozens of feeds provided by regional databases called Multiple Listing Services (MLS).  Each of these data feeds has its own (usually XML) structure.  So, for example, in one feed, the HOA Dues field might be known as “MONTHLY_HOA_DUES” – in that case, it’s pretty straightforward to figure out that we’re dealing with a monthly amount.  And some of the data feeds include a description in the metadata that tells us whether the interval is monthly or yearly.

But in another feed, the HOA Dues data field might be called simply “FEE_AMT,” with no additional metadata description.  Which means we need to do some sleuthing.  In a few cases, we had to ask one of our real estate agents to log into their MLS’s user interface to see whether it specified that the amount should be monthly or yearly.  But then we found that for one MLS, neither the data feed nor the user interface specified whether the real estate agent should be entering a monthly or yearly amount.  Which means that the data from that MLS is essentially a messy mix of monthly and yearly amounts (ugh).  (Note: that MLS has said they are planning to add an interval field to the user interface and data feed, which we are eagerly awaiting!)
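To make the problem concrete, here is a minimal Python sketch of how a consumer might normalize dues to a monthly amount.  This is not Redfin's actual code: the feed names ("feed_a", "feed_b", "feed_c"), the lookup table, and the HoaDues class are all hypothetical, and only the field names MONTHLY_HOA_DUES and FEE_AMT come from the examples above.  The idea is to record the interval per feed and field, and to refuse to guess when nobody knows it.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import Optional


class Interval(Enum):
    MONTHLY = "monthly"
    YEARLY = "yearly"
    UNKNOWN = "unknown"


# Hypothetical per-feed configuration: (feed name, field name) -> interval.
# The feed names and the intervals assigned to them are illustrative only.
FEED_FIELD_INTERVALS = {
    ("feed_a", "MONTHLY_HOA_DUES"): Interval.MONTHLY,  # the field name says it all
    ("feed_b", "FEE_AMT"): Interval.YEARLY,    # assumed: learned from the MLS user interface
    ("feed_c", "FEE_AMT"): Interval.UNKNOWN,   # assumed: neither the feed nor the UI says
}


@dataclass(frozen=True)
class HoaDues:
    amount: Decimal
    interval: Interval

    def monthly_amount(self) -> Optional[Decimal]:
        """Return the dues per month, or None if the interval is unknown."""
        if self.interval is Interval.MONTHLY:
            return self.amount
        if self.interval is Interval.YEARLY:
            # Round to cents after dividing the yearly amount by 12.
            return (self.amount / 12).quantize(Decimal("0.01"))
        return None  # do not guess: the value could be monthly or yearly


def parse_dues(feed: str, field: str, raw_value: str) -> HoaDues:
    """Attach the configured interval to a raw amount from a feed."""
    interval = FEED_FIELD_INTERVALS.get((feed, field), Interval.UNKNOWN)
    return HoaDues(Decimal(raw_value), interval)
```

With this shape, a search on monthly dues can simply skip listings whose monthly_amount() is None, instead of silently treating a yearly figure as a monthly one.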

Passing data from a human to a user interface to a database to a data feed to a data consumer (Redfin) is like a game of telephone, with bits of information falling off at each step.  When you are modeling your data, make sure to think about how it might ultimately be consumed!