So, I've got this nice Hive join statement, joining a tiny little partition from one table against a sizable set of partitions from another.  And I'm running it, and it's taking a while.  And I can tell,from looking at the job, that it's doing the join reduce-side --meaning, it's generating the cross-product in the mapper, and then sending it over to the reducer to filter it down. 

But, clearly, this is a perfect fit for a map-side hash join (meaning, hold the entire tiny partition in memory in each map task + run no reducers at all).  If I was coding it myself, I could make this happen via a bunch of coding +some configuration trickery.  But, surely, Hive will make it easier, no?

The docs had little to tell me, but I saw Jira tickets about adding this ability, and finally found a mailing list message which had the magic incantation.  It's a hint within the statement, just convert this:

  SELECT t1.portal_id, t2.lead_id, t1.visit_time,

to this:

  SELECT /*+ MAPJOIN(t2)*/ t1.portal_id, t2.lead_id, t1.visit_time,

Done, and now my entire job is running in the mapper and is taking about 30% of the time it used to.  Woo.  Big points for Hive, for damn sure.

so-hiring

Recommended Articles

Join our subscribers

Sign up here and we'll keep you updated on the latest in product, UX, and engineering from HubSpot.

Subscribe to the newsletter