LOAD function vs. UDF eval
We're analyzing sessions using Pig and HBase. The session data is
currently stored in a single HBase table whose rowkey is a
sessionid-eventid combination (a tall table).  I'm trying to optimize the
"extract all events for a given session" step of our workflow.

This could be a simple JOIN.  But that seems inefficient even as a
replicated join, since the vast majority of the repjoin mappers would
produce no output.  It also seems inefficient because all events for a
session are stored next to each other in the table (the rowkey starts with
the sessionid), and a de facto JOIN would ignore this locality.
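
To make the locality point concrete, here is a minimal sketch in plain Python (not Pig or the HBase client API). It assumes a sorted list of made-up rowkeys as a stand-in for the HBase table; the names `rowkeys`, `events_by_filter`, and `events_by_range` are hypothetical. The first function touches every row, as a join effectively would; the second seeks to the session's prefix and stops as soon as the prefix ends.

```python
from bisect import bisect_left

# Hypothetical rowkeys in "sessionid-eventid" form, kept sorted the way
# HBase keeps rows sorted by key. This list only simulates the table.
rowkeys = sorted([
    "s001-e1", "s001-e2", "s002-e1", "s003-e1", "s003-e2", "s003-e3",
])


def events_by_filter(keys, session_id):
    """Join-like approach: inspect every row and keep the matches."""
    prefix = session_id + "-"
    return [k for k in keys if k.startswith(prefix)]


def events_by_range(keys, session_id):
    """Scan-like approach: seek to the prefix, read until it ends."""
    prefix = session_id + "-"
    i = bisect_left(keys, prefix)  # first key >= prefix
    events = []
    while i < len(keys) and keys[i].startswith(prefix):
        events.append(keys[i])
        i += 1
    return events
```

Both functions return the same events; the difference is that the range version reads only the rows belonging to the requested session, which is the property a join over the whole table would not exploit.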

Possibly it'd be cleaner to model this as a LOAD function, which would use
the HBase client API to issue several scans in parallel.  But to handle our
use case, I'd have to be able to pass tens of thousands of sessionids to
the loader (perhaps via UDFContext?)
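
The "several scans in parallel" idea could look roughly like the following sketch. It is plain Python with a thread pool, not a Pig LoadFunc and not the HBase client API; `TABLE`, `scan_session`, and `load_sessions` are hypothetical names, and the in-memory sorted list only stands in for the real table. It runs one bounded prefix scan per sessionid and returns one bag of events per session.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the HBase table: sorted "sessionid-eventid" keys.
TABLE = sorted(["s001-e1", "s001-e2", "s002-e1", "s003-e1", "s003-e2"])


def scan_session(session_id):
    """Run one bounded range scan and return (session_id, events)."""
    prefix = session_id + "-"
    i = bisect_left(TABLE, prefix)  # seek to the first key >= prefix
    events = []
    while i < len(TABLE) and TABLE[i].startswith(prefix):
        events.append(TABLE[i])
        i += 1
    return session_id, events


def load_sessions(session_ids, max_workers=4):
    """Issue one scan per sessionid, several at a time; one bag per session."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(scan_session, session_ids))
```

For example, `load_sessions(["s001", "s003"])` yields a dict mapping each sessionid to its list of event rowkeys. How the list of sessionids would reach a real loader is exactly the open question above; this sketch simply takes it as an argument.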

So I'm back to writing a UDF eval function to handle this, which seems

But this made me think -- is there any performance benefit to modeling
these "load" style steps as LOAD functions vs. generic UDFs?  In both
cases, they'd return a bag per row.