-
LOAD function vs. UDF eval
Norbert Burger 2012-05-29, 17:20
We're analyzing sessions using Pig and HBase. The session data is currently stored in a single HBase table whose rowkey is a sessionid-eventid combo (a tall table). I'm trying to optimize the "extract-all-events-for-a-given-session" step of our workflow.
This could be a simple JOIN. But that seems inefficient even as a replicated join, since the vast majority of repjoin mappers will have no output. It also seems inefficient because all events for a session are clustered together by rowkey, and a plain JOIN would ignore this locality.
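For reference, a minimal sketch of what the replicated-join version might look like. The table name 'sessions', the column 'e:type', the '-' rowkey separator, and the file 'wanted_ids.txt' are all assumptions made for illustration, not details from this thread.

```
-- Hypothetical names throughout: 'sessions', 'e:type', 'wanted_ids.txt'.
events = LOAD 'hbase://sessions'
         USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('e:type', '-loadKey true')
         AS (rowkey:chararray, type:chararray);

-- Split "sessionid-eventid" into its two parts (assumes '-' is the separator).
parsed = FOREACH events GENERATE
         FLATTEN(STRSPLIT(rowkey, '-', 2)) AS (sessionid:chararray, eventid:chararray),
         type;

wanted = LOAD 'wanted_ids.txt' AS (sessionid:chararray);

-- The small relation goes last so that it is the one held in memory by each mapper.
joined = JOIN parsed BY sessionid, wanted BY sessionid USING 'replicated';
```

Every mapper here still scans its slice of the whole table, which is why most of them would produce no output when only a small share of sessions is wanted.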
Possibly it'd be cleaner to model this as a LOAD function, which would use the HBase client API to issue several scans in parallel. But to handle our use case, I'd have to be able to pass tens of thousands of sessionids to the loader (perhaps via UDFContext?)
So I'm back to writing a UDF eval function to handle this, which seems wonky.
But this made me think -- is there any performance benefit to modeling these "load" style steps as LOAD functions vs. generic UDFs? In both cases, they'd return a bag per row.
Norbert
-
Re: LOAD function vs. UDF eval
Raghu Angadi 2012-05-30, 00:37
I would still use a UDF; it is a lot more flexible.
Passing a large number of ids to the loader is part of the problem.
Your UDF would take a bag of ids and return bag{(session, events:bag{})}
You can pass the bag of ids in various ways:
- load ids as a relation, group all to put all of them in a single bag.
- in fact you can use RANDOM() to batch 10 ids in a bag (this avoids buffering all of the output in UDF).
- or put them in a json and load json.
etc...
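A rough sketch of the RANDOM() batching idea, under some assumptions: the jar name, the UDF class com.example.FetchSessionEvents, its constructor argument, the input file, and the batch count of 1000 are all hypothetical. The UDF is assumed to take a bag of ids and return bag{(session, events:bag{})} as described above.

```
-- Hypothetical names: session-udfs.jar, com.example.FetchSessionEvents, 'session_ids.txt'.
REGISTER session-udfs.jar;
DEFINE FetchSessionEvents com.example.FetchSessionEvents('sessions');

ids = LOAD 'session_ids.txt' AS (sessionid:chararray);

-- Tag each id with a random batch number. With N ids, using about N/10 batches
-- gives roughly 10 ids per bag; 1000 is only a placeholder.
keyed = FOREACH ids GENERATE sessionid, (int)(RANDOM() * 1000) AS batch;
batched = GROUP keyed BY batch;

-- Each UDF call receives one small bag of ids and returns one (session, events) tuple per id.
result = FOREACH batched GENERATE
         FLATTEN(FetchSessionEvents(keyed.sessionid)) AS (session, events);
```

Batch sizes are only approximately equal since the assignment is random, but each UDF call handles a small bag instead of buffering the output for every session at once.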
-
Re: LOAD function vs. UDF eval
Norbert Burger 2012-05-30, 02:22
Thanks, Raghu. Maybe another benefit of the UDF route is that it could support the accumulator interface.
Since both approaches would use the HBase client API directly, there's no Pig-specific benefit to using a loader, right?
Norbert