HubSpot Dev Blog


Data Processing -- Be Afraid, Be Very Afraid


So, here at HubSpot, and at my previous gig at Lookery, I've been writing a pile of Hadoop code to take log files, pull out key info, sum it up in various ways, and store the results.

Overall, this has been a lot of fun, but I've developed a sort of healthy fear. Because of one thing: it's terrifyingly easy to make invisible mistakes.

E.g. say you're counting up unique visitors from paid search ads on Google. You run it on a month of data. Your program churns along, and then spits out: 12335.

And now we come to the problem: is 12335 the Right Answer?

If you've been writing programs for more than about ten minutes, you've discovered that your code usually has errors. If you're writing something which generates a web page (possibly by talking to a db), you find those errors relatively quickly. Even in the worst case, where you release it with a bug, some customer says "Hey, when I click on thing X, it doesn't do what it should." This is Not That Bad.

But with data processing, errors can linger for a while, and, worse yet, can easily infect all the numbers you're collecting. Invisibly making everything wrong. Then, later, your customer says "Hey, why didn't these ads show up on my search marketing screen?" And it's totally non-obvious at which stage of your multipart data pipeline things went awry. It's also not clear how to fix all the existing data you've already collected, which is now suspect. This is Very, Very Bad Indeed.

Here's how I'm currently dealing with the Fear:

  • Test-Driven Development is Your Friend

    Without tests, why on earth would you trust the 12335 above? Also, data-processing programs tend to be very simple to test, because they have defined inputs and outputs. You're going to be a whole heck of a lot happier if you start with tests.

  • Kill, Kill, Kill the Whole Pipeline

    This is a Toyota-inspired idea, and, again, differs from other kinds of coding. Basically, in any kind of error or unexpected situation, it's really good to just kill the whole pipeline immediately. This encourages the developers to immediately deal with issues, and work towards an entirely defect-free pipeline. The alternative makes it very easy for downstream data to become corrupted, again, in ways you can't easily remedy.

  • Reentrancy Will Save Your A**

    Even with your careful testing, and your aggressive pipeline stopping, you're still going to get into situations where you need to drop partial data and re-run. If you can make sure that every step can be run multiple times without causing trouble, you're going to be so, so much happier. E.g. if you're writing to a directory in HDFS, blow the entire directory away and recreate the whole thing. If you're writing summary rows into the db, record enough info to be able to either rewrite entire rows or skip ones which have already been entered.
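
To make the first two points a bit more concrete, here's a minimal Python sketch (not the actual Hadoop code). The tab-separated log format, the `google_paid` source label, and every function name are assumptions made up for illustration. The counting step has a defined input and output, so it's easy to test, and a malformed line raises instead of being quietly skipped, which stops the run.

```python
class MalformedRecord(Exception):
    """Raised when a log line cannot be parsed; meant to stop the whole run."""


def parse_line(line):
    # Hypothetical log format: "<visitor_id>\t<source>\t<url>"
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3 or not fields[0]:
        # Fail loudly rather than skipping the line and undercounting.
        raise MalformedRecord(f"unparseable log line: {line!r}")
    visitor_id, source, url = fields
    return visitor_id, source, url


def count_unique_visitors(lines, source="google_paid"):
    """Count distinct visitor ids whose traffic source matches `source`."""
    visitors = set()
    for line in lines:
        visitor_id, line_source, _url = parse_line(line)
        if line_source == source:
            visitors.add(visitor_id)
    return len(visitors)


if __name__ == "__main__":
    sample = [
        "v1\tgoogle_paid\t/pricing",
        "v1\tgoogle_paid\t/signup",   # same visitor, counted once
        "v2\torganic\t/pricing",      # different source, not counted
        "v3\tgoogle_paid\t/pricing",
    ]
    assert count_unique_visitors(sample) == 2
```

A test like the one at the bottom is what lets you believe a number such as 12335: you know what the code does on a small input you can check by hand.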

That's my current set of takeaways -- anyone else have experiences on these fronts they'd like to talk about?

===

Update: the nice folks over at Hacker News point out that when I say "reentrant", I really mean "idempotent".  They are, in fact, totally right.
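
Here's a rough sketch of what idempotent steps can look like, using the two examples above. It assumes a local directory as a stand-in for HDFS and SQLite as a stand-in for the db. The table layout, the `part-00000` file name, and the function names are all made up for illustration.

```python
import shutil
import sqlite3
from pathlib import Path


def rewrite_output_dir(out_dir, counts):
    """Blow away `out_dir` and recreate it, so a re-run leaves no stale files."""
    out_dir = Path(out_dir)
    if out_dir.exists():
        # Drop whatever a previous (possibly partial) run left behind.
        shutil.rmtree(out_dir)
    out_dir.mkdir(parents=True)
    with open(out_dir / "part-00000", "w") as f:
        for key in sorted(counts):
            f.write(f"{key}\t{counts[key]}\n")


def upsert_summary(conn, day, source, visitors):
    """Write one summary row keyed on (day, source); re-running rewrites it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS summary ("
        " day TEXT, source TEXT, visitors INTEGER,"
        " PRIMARY KEY (day, source))"
    )
    # The primary key records enough info to replace the row instead of
    # inserting a duplicate when the step runs a second time.
    conn.execute(
        "INSERT OR REPLACE INTO summary (day, source, visitors) VALUES (?, ?, ?)",
        (day, source, visitors),
    )
    conn.commit()
```

Running either function twice with the same arguments leaves the same state as running it once, which is the property you want when you have to drop partial data and re-run.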

Comments

A trick I've used: Add a signal into the input and check that it appears in the output, removing it from the final report. So for example you might put a fake visitor into the logs and make sure he shows up correctly in the output stats. 
 
A naive variation of that is to run the beast twice, but add ten and be sure you get ten more on the output. With a bit of work you can usually thread these test signals so they are always in the runs. You then run both all the time - to assure the testing is always on - but only on a sample. 
 
I'm such a bad programmer I often write it two ways until I get the same results out of both - sort of double entry accounting. You don't really need for both of these to be extremely efficient, eh? This lets me use more recreational languages for the 2nd version. 
 
The other thing I do a lot is look at the distributions of the outputs, since those shouldn't change much over time. The most naive variation of that is Benford's law.
Posted @ Wednesday, July 01, 2009 11:51 AM by Ben Hyde
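
The injected-signal trick from the comment above might look something like this in Python. It's only a sketch: `hits_per_visitor` is a stand-in for a real pipeline, and the canary id and function names are hypothetical.

```python
from collections import Counter

CANARY_ID = "canary-visitor"  # hypothetical id that can never be a real visitor


def hits_per_visitor(visitor_ids):
    """Stand-in for the real pipeline: count hits per visitor id."""
    return Counter(visitor_ids)


def run_with_canary(visitor_ids, canary_hits=10):
    # Thread a known signal through the same code path as the real data.
    counts = hits_per_visitor(list(visitor_ids) + [CANARY_ID] * canary_hits)
    # Remove the canary so it never reaches the final report.
    seen = counts.pop(CANARY_ID, 0)
    if seen != canary_hits:
        raise RuntimeError(f"canary mismatch: expected {canary_hits}, saw {seen}")
    return dict(counts)
```

If the pipeline drops or double-counts records, the canary total comes out wrong and the run fails instead of producing a quietly bad report.
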
I have been doing analytics for a year, and most of the time the people who use the numbers you produce don't *trust* your numbers, because third parties like Omniture and comScore have more authority than you. Even when their systems are broken, the way they count numbers is by sampling/caching.  
 
And most of the time, even if you somehow get people to agree on definitions (week, visitor, visit, day) and standardize on time, they still make up numbers.
Posted @ Thursday, July 02, 2009 2:14 PM by Sidharth Shah
with our stats app, we parse the logs, create sql files that drop the data into the db, create sql files that can rollback and remove that same data from the db, and then gzip the log, insert and rollback files all together. 
 
when we have run into issues, we just run the rollback files and then re-parse the logs. it's worked well so far (fingers crossed). nice article.
Posted @ Thursday, July 02, 2009 5:28 PM by todd g