We're analyzing sessions using Pig and HBase. The session data is
currently stored in a single HBase table, where the rowkey is a
sessionid-eventid combination (a tall table). I'm trying to optimize the
"extract-all-events-for-a-given-session" step of our workflow.
This could be a simple JOIN, but that seems inefficient even for a
replicated join, since the vast majority of the repjoin mappers would
produce no output. It also seems wasteful because all events for a session
are stored contiguously (the rowkeys sort by sessionid), and a generic
JOIN would ignore this locality.
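The JOIN formulation I'm worried about is roughly the following (again a sketch; table name, column family, and the `target_sessions.txt` input are hypothetical):

```pig
-- The full event table, with sessionid split out of the compound rowkey.
events  = LOAD 'hbase://SessionEvents'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
              'cf:payload', '-loadKey true')
          AS (rowkey:chararray, payload:chararray);
keyed   = FOREACH events GENERATE
              STRSPLIT(rowkey, '-', 2).$0 AS sessionid, payload;

-- Small relation holding the sessionids we actually want.
targets = LOAD 'target_sessions.txt' AS (sessionid:chararray);

-- Replicated join: targets is broadcast to every map task, but any mapper
-- whose input split contains none of the target sessions emits nothing,
-- even though it still scanned its whole split.
wanted  = JOIN keyed BY sessionid, targets BY sessionid USING 'replicated';
```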
Possibly it'd be cleaner to model this as a LOAD function, which would use
the HBase client API to issue several scans in parallel. But to handle our
use case, I'd have to be able to pass 10000s of sessionids to the loader,
which doesn't map cleanly onto LOAD's constant constructor arguments.
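As a custom loader it would have to be invoked something like this, which is where the problem shows up (the `SessionScanLoader` class and its argument format are hypothetical):

```pig
-- Hypothetical custom LoadFunc that scans only the rowkey ranges for the
-- given sessions. The sessionids can only be passed as a constant string
-- in the script, which breaks down at 10000s of ids.
events = LOAD 'hbase://SessionEvents'
         USING com.example.SessionScanLoader('sess001,sess002,sess003')
         AS (sessionid:chararray, eventid:chararray, payload:chararray);
```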
So I'm back to writing an eval UDF to handle this instead.
But this made me think -- is there any performance benefit to modeling
these "load" style steps as LOAD functions vs. generic UDFs? In both
cases, they'd return a bag per row.
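To make the comparison concrete, the eval-UDF version would be used something like this (the `FetchSessionEvents` class, its jar, and its scan-per-sessionid behavior are all hypothetical):

```pig
REGISTER my-udfs.jar;
-- Hypothetical eval UDF: given a sessionid, issues an HBase scan over that
-- session's rowkey range and returns the matching events as a bag.
DEFINE FetchSessionEvents com.example.FetchSessionEvents();

targets        = LOAD 'target_sessions.txt' AS (sessionid:chararray);
session_events = FOREACH targets GENERATE
                     sessionid,
                     FLATTEN(FetchSessionEvents(sessionid))
                     AS (eventid:chararray, payload:chararray);
```

The scan logic is identical either way; the question is only whether Pig treats the two shapes differently at planning or execution time.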