Pig >> mail # user >> LOAD function vs. UDF eval

Re: LOAD function vs. UDF eval
Thanks, Raghu.  Maybe another benefit of the UDF route is that it could
support the accumulator interface.

Since both approaches would use the HBase client API directly, there's no
Pig-specific benefit to using a loader, right?


On Tue, May 29, 2012 at 8:37 PM, Raghu Angadi <[EMAIL PROTECTED]> wrote:

> I would still use a UDF; it is a lot more flexible.
> Passing a large number of ids to the loader is part of the problem.
> Your UDF would take a bag of ids and return bag{(session, events:bag{})}
> You can pass the bag of ids in various ways:
>   - load the ids as a relation, then GROUP ALL to put all of them in a
>     single bag
>       - in fact, you can use RANDOM() to batch 10 ids in a bag (this
>         avoids buffering all of the output in the UDF)
>   - or put them in a JSON file and load the JSON
>   - etc.
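
[Editor's sketch] The RANDOM() batching idea in the list above can be illustrated in plain Java, without any Pig dependency. This is only a rough sketch of the logic, not code from the thread: the names IdBatcher, batch, and targetBatchSize are hypothetical, and in an actual Pig script the same effect would presumably come from grouping on a key derived from RANDOM(), for example (int)(RANDOM() * numBuckets). Each id gets a random bucket number, so each bucket holds about targetBatchSize ids on average and the UDF would see one small bag per call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical helper: mimics grouping ids by a random bucket key.
final class IdBatcher {

    // Assigns every id to a random bucket. Bucket sizes are only
    // approximately targetBatchSize, because the assignment is random.
    static Map<Integer, List<String>> batch(List<String> ids, int targetBatchSize, Random rng) {
        // Round up so that the average bucket size stays at or below the target.
        int numBuckets = Math.max(1, (ids.size() + targetBatchSize - 1) / targetBatchSize);
        Map<Integer, List<String>> buckets = new TreeMap<>();
        for (String id : ids) {
            int bucket = rng.nextInt(numBuckets);
            buckets.computeIfAbsent(bucket, k -> new ArrayList<>()).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            ids.add("session" + i);
        }
        // Fixed seed only so that repeated runs print the same batches.
        Map<Integer, List<String>> batches = batch(ids, 10, new Random(42));
        for (Map.Entry<Integer, List<String>> e : batches.entrySet()) {
            System.out.println("bucket " + e.getKey() + ": " + e.getValue().size() + " ids");
        }
    }
}
```

Because the buckets are random, some will hold more than 10 ids and some fewer; that is usually acceptable when the goal is only to bound how much output the UDF buffers per call.
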
> On Tue, May 29, 2012 at 10:20 AM, Norbert Burger wrote:
> > We're analyzing session(s) using Pig and HBase, and this session data is
> > currently stored in a single HBase table, where rowkey is a
> > sessionid-eventid combo (tall table).  I'm trying to optimize the
> > "extract-all-events-for-a-given-session" step of our workflow.
> >
> > This could be a simple JOIN.  But this seems inefficient even for a
> > replicated join, since the vast majority of repjoin mappers will have no
> > output.  It also seems inefficient because all events are grouped together
> > in clusters, and a de facto JOIN would ignore this locality.
> >
> > Possibly it'd be cleaner to model this as a LOAD function, which would use
> > the HBase client API to issue several scans in parallel.  But to handle our
> > use case, I'd have to be able to pass 10000s of sessionids to the loader
> > (perhaps UDFContext?)
> >
> > So I'm back to writing a UDF eval function to handle this, which seems
> > wonky.
> >
> > But this made me think -- is there any performance benefit to modeling
> > these "load" style steps as LOAD functions vs. generic UDFs?  In both
> > cases, they'd return a bag per row.
> >
> > Norbert
> >
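
[Editor's sketch] Whichever route is taken (LOAD function or UDF), the per-session lookup against the tall table described in the original question comes down to a row-range scan over one rowkey prefix. The plain-Java sketch below shows one way the range bounds could be computed. It assumes the rowkey is literally sessionid + "-" + eventid encoded as UTF-8, which the thread does not confirm, and the names SessionScanRange, startRow, and stopRow are hypothetical. The two byte arrays would then be used as the start row and the exclusive stop row of an HBase scan.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: computes [start, stop) row bounds for one session.
final class SessionScanRange {

    // Every rowkey of the session begins with "<sessionid><separator>".
    // Including the separator keeps "abc123" from also matching "abc1234-...".
    static byte[] startRow(String sessionId, char separator) {
        return (sessionId + separator).getBytes(StandardCharsets.UTF_8);
    }

    // Smallest key that sorts after every key starting with the prefix,
    // under unsigned lexicographic byte order: drop trailing 0xFF bytes,
    // then increment the last remaining byte.
    static byte[] stopRow(byte[] prefix) {
        int end = prefix.length;
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            // No upper bound exists; an empty stop row means "scan to the end".
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }

    public static void main(String[] args) {
        byte[] start = startRow("abc123", '-');
        byte[] stop = stopRow(start);
        // prints "abc123- .. abc123."
        System.out.println(new String(start, StandardCharsets.UTF_8)
                + " .. " + new String(stop, StandardCharsets.UTF_8));
    }
}
```

Issuing one such bounded scan per sessionid is what lets the lookup exploit the locality mentioned in the question, since all events of a session sit next to each other in rowkey order.
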