HubSpot Dev Blog

Who Loves the Magic Undocumented Hive Mapjoin? This Guy.

So, I've got this nice Hive join statement, joining a tiny little partition from one table against a sizable set of partitions from another.  And I'm running it, and it's taking a while.  And I can tell, from looking at the job, that it's doing the join reduce-side. Meaning, the mappers just tag every row from both tables with its join key, and then all of it gets shuffled over the network to the reducers, which do the actual matching.
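To make the reduce-side flow concrete, here is a minimal single-process sketch in Python. It is not Hive's implementation, just the shape of the technique. The join key (`portal_id`) and the table roles are assumptions for illustration, since the full statement isn't shown here; the column names are borrowed from the SELECT list.

```python
from collections import defaultdict


def map_phase(visits, leads):
    """Tag every row of both tables with its join key (assumed: portal_id)."""
    for row in visits:
        yield row["portal_id"], ("t1", row)
    for row in leads:
        yield row["portal_id"], ("t2", row)


def shuffle(pairs):
    """Stand-in for the network shuffle: group tagged rows by join key."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_phase(groups):
    """Per key, pair up the rows that came from each table."""
    for key, tagged_rows in groups.items():
        left = [row for tag, row in tagged_rows if tag == "t1"]
        right = [row for tag, row in tagged_rows if tag == "t2"]
        for l in left:
            for r in right:
                yield {
                    "portal_id": key,
                    "lead_id": r["lead_id"],
                    "visit_time": l["visit_time"],
                }


def reduce_side_join(visits, leads):
    return list(reduce_phase(shuffle(map_phase(visits, leads))))
```

The thing to notice is that every row of the big table goes through `shuffle`, whether or not it ends up matching anything. That's the expensive part.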

But, clearly, this is a perfect fit for a map-side hash join (meaning, hold the entire tiny partition in memory in each map task + run no reducers at all).  If I were coding it myself, I could make this happen with a bunch of code + some configuration trickery.  But, surely, Hive will make it easier, no?
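For comparison, a map-side hash join boils down to something like the sketch below. Again, this is a toy Python illustration under assumptions, not what Hive actually runs: the function names are made up, and the join key is passed in as a parameter because the real one isn't shown in this post.

```python
def build_hash_table(small_rows, key):
    """Load the small table into an in-memory dict keyed by the join column."""
    table = {}
    for row in small_rows:
        table.setdefault(row[key], []).append(row)
    return table


def map_side_join(big_rows, small_table, key):
    """What each map task would do: stream the big table and probe the dict.

    Nothing is shuffled and no reducer runs; matches are emitted directly.
    """
    for big in big_rows:
        for small in small_table.get(big[key], []):
            # Merge the two rows; on a column-name clash the big row wins.
            yield {**small, **big}
```

Each map task would call `build_hash_table` once on the tiny partition, then run `map_side_join` over its own slice of the big table. This only works as long as the small side really does fit in each task's memory.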

The docs had little to tell me, but I saw Jira tickets about adding this ability, and finally found a mailing list message which had the magic incantation.  It's a hint within the statement, naming the table to hold in memory.  Just convert this:

  SELECT t1.portal_id, t2.lead_id, t1.visit_time,

to this:

  SELECT /*+ MAPJOIN(t2)*/ t1.portal_id, t2.lead_id, t1.visit_time,

Done, and now my entire job is running in the mapper and is taking about 30% of the time it used to.  Woo.  Big points for Hive, for damn sure.