Pig, mail # user - Passing a BAG to Pig UDF constructor?


Re: Passing a BAG to Pig UDF constructor?
Abhinav Neelam 2012-06-28, 09:54
You're not passing a bag to your UDF, you're passing a relation. I believe
FOREACH ... GENERATE looks for columns within the relation being iterated
over, meaning that it's looking for 'bag1' within the schema of 'a'.

One way of doing this is generating a bag containing all the tuples in
relation 'bag1', and passing that to the UDF.
bag1 = LOAD 'somefile' AS (f1, f2, f3);
bag_grouped = GROUP bag1 ALL;
-- build your hash here
bag_dummy = FOREACH bag_grouped GENERATE myUDF(bag1);
-- write some logic into the UDF to check if it's receiving a bag or two
-- scalars, if you wish to reuse it
b = FOREACH a GENERATE myUDF(a1,a2);

The problem here is the GROUP ... ALL statement, as it uses only one reducer
in the reduce phase. You can make your myUDF algebraic (if possible) to speed
up the hash-building FOREACH ... GENERATE step.
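
To make the "bag or two scalars" check concrete, here is a minimal sketch.
It is not a real Pig UDF: to keep it runnable without the Pig jars, the
class name LookupSketch is hypothetical and the bag is modeled as a plain
List of rows. In an actual UDF this logic would sit in EvalFunc.exec and
read a DataBag out of the input Tuple. Keying the hash on f1 and storing f2
is an assumption, and so is the idea that the instance that built the hash
is the same one that later receives the scalar calls.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for myUDF; a real Pig UDF would extend EvalFunc.
class LookupSketch {
    private Map<Object, Object> lookup; // built lazily from the first bag seen

    // Mimics exec(): args is either (bag) or (a1, a2).
    Object exec(List<Object> args) {
        if (args.size() == 1 && args.get(0) instanceof List) {
            // Bag case: each row is (f1, f2, f3); key on f1, keep f2 (assumed layout).
            if (lookup == null) {
                lookup = new HashMap<>();
                for (Object row : (List<?>) args.get(0)) {
                    List<?> fields = (List<?>) row;
                    lookup.put(fields.get(0), fields.get(1));
                }
            }
            return lookup.size();
        }
        // Scalar case: look up a1; null if the hash was never built or has no match.
        return lookup == null ? null : lookup.get(args.get(0));
    }
}
```

Calling exec with a single bag argument fills the hash once; later calls
with (a1, a2) only read from it.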

Another way of doing this (I'm just throwing this one out there) is maybe
to simply FOREACH ... GENERATE over the relation 'bag1', and in the exec
function build your hash using the input tuples of bag1 (f1, f2, f3).
(Do you need all the tuples in bag1 at one time to build your hash?)

bag1 = LOAD 'somefile' AS (f1, f2, f3);
-- build your hash here, perhaps use some identifier if you wish to reuse
-- your UDF
bag_dummy = FOREACH bag1 GENERATE myUDF('build',f1, f2, f3);
-- now use the hash
b = FOREACH a GENERATE myUDF('check',a1,a2);
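
A rough sketch of what the 'build' / 'check' dispatch inside exec might look
like, again without the Pig classes so it runs on its own. ModeSketch is a
hypothetical name, and keying on the second argument is an assumption. It
also assumes the same UDF instance sees every 'build' row before the first
'check' row, which nothing in this thread confirms for a real Pig job.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for myUDF with a mode flag as the first argument.
class ModeSketch {
    private final Map<Object, Object> lookup = new HashMap<>();

    // args[0] is the mode string; the remaining args are the row fields.
    Object exec(Object... args) {
        String mode = (String) args[0];
        switch (mode) {
            case "build": // ('build', f1, f2, f3): add one row to the hash
                lookup.put(args[1], args[2]);
                return null;
            case "check": // ('check', a1, a2): look a1 up in the hash
                return lookup.get(args[1]);
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }
}
```

Because rows are added one at a time, this variant never needs the whole of
bag1 in a single call.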
Regards,
Abhinav
On 28 June 2012 04:38, Dexin Wang <[EMAIL PROTECTED]> wrote:

> Actually how do you pass a bag to UDF? I did this:
>
>    a = LOAD 'file_a' AS (a1, a2, a3);
>
>    bag1 = LOAD 'somefile' AS (f1, f2, f3);
>
>    b = FOREACH a GENERATE myUDF(bag1, a1, a2);
>
> But I got this error:
>
>     Invalid scalar projection: bag1 : A column needs to be projected from
> a relation for it to be used as a scalar
>
> What is the right way of doing this? Thanks.
>
>
> On Wed, Jun 27, 2012 at 10:30 AM, Dexin Wang <[EMAIL PROTECTED]> wrote:
>
> > That's a good idea (to pass the bag to UDF and initialize it on first UDF
> > invocation). Thanks.
> >
> > Why do you think it is expensive Mridul?
> >
> >
> > On Tue, Jun 26, 2012 at 2:50 PM, Mridul Muralidharan <
> > [EMAIL PROTECTED]> wrote:
> >
> >>
> >>
> >> > -----Original Message-----
> >> > From: Jonathan Coveney [mailto:[EMAIL PROTECTED]]
> >> > Sent: Wednesday, June 27, 2012 3:12 AM
> >> > To: [EMAIL PROTECTED]
> >> > Subject: Re: Passing a BAG to Pig UDF constructor?
> >> >
> >> > You can also just pass the bag to the UDF, and have a lazy initializer
> >> > in exec that loads the bag into memory.
> >>
> >>
> >> Can you elaborate what you mean by pass the bag to the UDF ?
> >> Pass it as part of the input to the udf in exec and initialize it only
> >> once (first time) ? (If yes, this is expensive)
> >> Or something else ?
> >>
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >>
> >> >
> >> > 2012/6/26 Mridul Muralidharan <[EMAIL PROTECTED]>
> >> >
> >> > > You could dump the data in a dfs file and pass the location of the
> >> > > file as param to your udf in define - so that it initializes itself
> >> > > using that data ...
> >> > >
> >> > >
> >> > > - Mridul
> >> > >
> >> > >
> >> > > > -----Original Message-----
> >> > > > From: Dexin Wang [mailto:[EMAIL PROTECTED]]
> >> > > > Sent: Tuesday, June 26, 2012 10:58 PM
> >> > > > To: [EMAIL PROTECTED]
> >> > > > Subject: Passing a BAG to Pig UDF constructor?
> >> > > >
> >> > > > Is it possible to pass a bag to a Pig UDF constructor?
> >> > > >
> >> > > > Basically in the constructor I want to initialize some hash map so
> >> > > > that on every exec operation, I can use the hashmap to do a lookup
> >> > > > and find the value I need, and apply some algorithm to it.
> >> > > >
> >> > > > I realize I could just do a replicated join to achieve similar
> >> > > > things but the algorithm is more than a few lines and there are some
> >> > > > edge cases so I would rather wrap that logic inside a UDF function.
> >> > > > I also realize I could just pass a file path to the constructor