Pig, mail # user - LOAD function vs. UDF eval

Re: LOAD function vs. UDF eval
Norbert Burger 2012-05-30, 02:22
Thanks, Raghu.  Maybe another benefit of the UDF route is that it could
support the accumulator interface.

Since both approaches would use the HBase client API directly, there's no
Pig-specific benefit to using a loader, right?


On Tue, May 29, 2012 at 8:37 PM, Raghu Angadi <[EMAIL PROTECTED]> wrote:

> I would still use a UDF; it is a lot more flexible.
> Passing a large number of ids to the loader is part of the problem.
> Your UDF would take a bag of ids and return bag{(session, events:bag{})}.
> You can pass the bag of ids in various ways:
>   - load the ids as a relation, then GROUP ALL to put them all in a single bag
>        - in fact, you can use RANDOM() to batch 10 ids per bag (this avoids
>          buffering all of the output in the UDF)
>   - or put them in a JSON file and load the JSON
>  etc.
> On Tue, May 29, 2012 at 10:20 AM, Norbert Burger
> > We're analyzing session(s) using Pig and HBase, and this session data is
> > currently stored in a single HBase table, where rowkey is a
> > sessionid-eventid combo (tall table).  I'm trying to optimize the
> > "extract-all-events-for-a-given-session" step of our workflow.
> >
> > This could be a simple JOIN.  But this seems inefficient even for a
> > replicated join, since the vast majority of repjoin mappers will have no
> > output.  It also seems inefficient because all events are grouped together
> > in clusters, and a de facto JOIN would ignore this locality.
> >
> > Possibly it'd be cleaner to model this as a LOAD function, which would use
> > the HBase client API to issue several scans in parallel.  But to handle our
> > use case, I'd have to be able to pass 10000s of sessionids to the loader
> > (perhaps UDFContext?)
> >
> > So I'm back to writing a UDF eval function to handle this, which seems
> > wonky.
> >
> > But this made me think -- is there any performance benefit to modeling
> > these "load" style steps as LOAD functions vs. generic UDFs?  In both
> > cases, they'd return a bag per row.
> >
> > Norbert
> >
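
The batching idea from Raghu's reply (load the ids as a relation, then use RANDOM() to split them into small bags) could be sketched in Pig Latin roughly as follows. This is only a sketch under assumptions: the jar name, the UDF class com.example.pig.FetchSessionEvents, its 'sessions' table-name argument, the input path, and the bucket count of 1000 are all hypothetical and not taken from the thread. The UDF is assumed to take a bag of session ids and return bag{(session, events:bag{})}, as described above.

```pig
-- Hypothetical jar and UDF; neither is named in the thread.
REGISTER session-udfs.jar;
DEFINE FetchSessionEvents com.example.pig.FetchSessionEvents('sessions');

-- Hypothetical input: one session id per line.
ids = LOAD 'session_ids.txt' AS (sessionid:chararray);

-- Tag each id with a random bucket number. With ~10000 ids and 1000
-- buckets, each bucket holds about 10 ids on average.
keyed = FOREACH ids GENERATE sessionid, (int)(RANDOM() * 1000) AS bucket;

-- One small bag of ids per bucket, instead of one huge bag from GROUP ALL.
batched = GROUP keyed BY bucket;

-- Each UDF call receives a small bag of ids, so it only needs to buffer
-- the events for that batch before returning.
sessions = FOREACH batched GENERATE FLATTEN(FetchSessionEvents(keyed.sessionid));
```

Replacing the GROUP ... BY bucket step with GROUP keyed ALL gives the simpler single-bag variant, at the cost of one UDF call handling every id and buffering all of the output.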