Pig >> mail # user >> LOAD function vs. UDF eval

Norbert Burger 2012-05-29, 17:20
Re: LOAD function vs. UDF eval
I would still use a UDF; it's a lot more flexible.

Passing a large number of ids to the loader is part of the problem.

Your UDF would take a bag of ids and return bag{(session, events:bag{})}.

You can pass the bag of ids in various ways:
   - load the ids as a relation, then GROUP ALL to put all of them in a single bag
        - in fact, you can use RANDOM() to batch 10 ids into a bag (this
avoids buffering all of the output in the UDF)
   - or put them in JSON and load the JSON
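A minimal Pig Latin sketch of the two bag-passing variants above (the relation names, the `myudfs.SessionEvents` UDF, and the id file path are all hypothetical, not from this thread):

```pig
-- load the session ids as a one-column relation (hypothetical path/schema)
ids = LOAD 'session_ids.txt' AS (sessionid:chararray);

-- variant 1: GROUP ALL puts every id into a single bag, so the
-- (hypothetical) UDF is invoked once with the full bag
all_ids = GROUP ids ALL;
events  = FOREACH all_ids GENERATE FLATTEN(myudfs.SessionEvents(ids));

-- variant 2: group by a random key so each bag holds only a small
-- batch of ids; the UDF then emits results per batch instead of
-- buffering everything in one call
batched = GROUP ids BY (int)(RANDOM() * 100);
events2 = FOREACH batched GENERATE FLATTEN(myudfs.SessionEvents(ids));
```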


On Tue, May 29, 2012 at 10:20 AM, Norbert Burger wrote:

> We're analyzing sessions using Pig and HBase, and this session data is
> currently stored in a single HBase table, where rowkey is a
> sessionid-eventid combo (tall table).  I'm trying to optimize the
> "extract-all-events-for-a-given-session" step of our workflow.
> This could be a simple JOIN.  But this seems inefficient even for a
> replicated join, since the vast majority of repjoin mappers will have no
> output.  It also seems inefficient because all events for a session are
> stored together in clusters, and a generic JOIN would ignore this locality.
> Possibly it'd be cleaner to model this as a LOAD function, which would use
> the HBase client API to issue several scans in parallel.  But to handle our
> use case, I'd have to be able to pass 10,000s of sessionids to the loader
> (perhaps via UDFContext?).
> So I'm back to writing a UDF eval function to handle this, which seems
> wonky.
> But this made me think -- is there any performance benefit to modeling
> these "load" style steps as LOAD functions vs. generic UDFs?  In both
> cases, they'd return a bag per row.
> Norbert
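For comparison, the simple-JOIN baseline described in the question might be sketched like this (the HBase table name, column family, and rowkey delimiter are all assumptions, not details from the thread):

```pig
-- events live in a tall HBase table whose rowkey is sessionid-eventid
events = LOAD 'hbase://session_events'
         USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:data', '-loadKey true')
         AS (rowkey:chararray, data:chararray);

-- recover the sessionid from the composite rowkey ('-' delimiter assumed)
by_session = FOREACH events GENERATE STRSPLIT(rowkey, '-', 2).$0 AS sessionid, data;

ids = LOAD 'session_ids.txt' AS (sessionid:chararray);

-- a replicated join ships the small id list to every mapper, but mappers
-- over regions with no matching sessions still run and produce no output
joined = JOIN by_session BY sessionid, ids BY sessionid USING 'replicated';
```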
Norbert Burger 2012-05-30, 02:22