Pig, mail # user - Passing a BAG to Pig UDF constructor?


Re: Passing a BAG to Pig UDF constructor?
Jonathan Coveney 2012-06-29, 17:25
I would run a perf test, but compared to the many other costs, I think the
overhead will be minimal (unless it's a really massive bag). Pig should
probably allow for more graceful initialization in cases like this, but in
my experience I haven't noticed any serious degradation from this sort of
thing.
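
For reference, the lazy-initialization idea discussed in this thread can be
sketched roughly as follows. This is a minimal, Pig-free illustration under
stated assumptions, not a definitive implementation: the class name
LazyLookupSketch, the buildCount field, and the use of List<String[]> as a
stand-in for a bag of two-field tuples are all hypothetical. A real UDF would
extend org.apache.pig.EvalFunc and read the key and the bag out of the input
Tuple.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Pig UDF; no Pig classes are used so the
// sketch runs on its own.
class LazyLookupSketch {
    // Stays null until the first exec call builds it.
    private Map<String, String> lookup;

    // Only here to show that the map is built a single time.
    int buildCount = 0;

    // bag: each element is a {key, value} pair, standing in for a bag of
    // two-field tuples that is passed along with every input record.
    String exec(String key, List<String[]> bag) {
        if (lookup == null) {
            lookup = new HashMap<>();
            for (String[] pair : bag) {
                lookup.put(pair[0], pair[1]);
            }
            buildCount++;
        }
        // Later calls ignore the bag and reuse the cached map.
        return lookup.get(key);
    }
}
```

The bag is still handed over on every call, which is the per-tuple cost
Mridul mentions; only the map construction is skipped after the first call.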

2012/6/29 Mridul Muralidharan <[EMAIL PROTECTED]>

>
>
> > -----Original Message-----
> > From: Dexin Wang [mailto:[EMAIL PROTECTED]]
> > Sent: Wednesday, June 27, 2012 11:00 PM
> > To: [EMAIL PROTECTED]
> > Subject: Re: Passing a BAG to Pig UDF constructor?
> >
> > That's a good idea (to pass the bag to UDF and initialize it on first
> > UDF invocation). Thanks.
> >
> > Why do you think it is expensive, Mridul?
>
>
> You will be passing the bag with each tuple, but using it only for the
> first invocation per mapper/reducer.
> If other computations are more expensive, then it will get amortized over
> time; but it is a cost nonetheless ... only a perf test will tell you if it
> is small enough to ignore!
>
>
> Regards,
> Mridul
>
>
> >
> > On Tue, Jun 26, 2012 at 2:50 PM, Mridul Muralidharan
> > <[EMAIL PROTECTED]> wrote:
> >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jonathan Coveney [mailto:[EMAIL PROTECTED]]
> > > > Sent: Wednesday, June 27, 2012 3:12 AM
> > > > To: [EMAIL PROTECTED]
> > > > Subject: Re: Passing a BAG to Pig UDF constructor?
> > > >
> > > > You can also just pass the bag to the UDF, and have a lazy
> > > > initializer in exec that loads the bag into memory.
> > >
> > >
> > > Can you elaborate on what you mean by "pass the bag to the UDF"?
> > > Pass it as part of the input to the UDF in exec and initialize it only
> > > once (the first time)? (If yes, this is expensive.) Or something else?
> > >
> > >
> > > Regards,
> > > Mridul
> > >
> > >
> > >
> > > >
> > > > 2012/6/26 Mridul Muralidharan <[EMAIL PROTECTED]>
> > > >
> > > > > You could dump the data in a DFS file and pass the location of the
> > > > > file as a param to your UDF in DEFINE, so that it initializes
> > > > > itself using that data ...
> > > > >
> > > > >
> > > > > - Mridul
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Dexin Wang [mailto:[EMAIL PROTECTED]]
> > > > > > Sent: Tuesday, June 26, 2012 10:58 PM
> > > > > > To: [EMAIL PROTECTED]
> > > > > > Subject: Passing a BAG to Pig UDF constructor?
> > > > > >
> > > > > > Is it possible to pass a bag to a Pig UDF constructor?
> > > > > >
> > > > > > > Basically in the constructor I want to initialize some hash map
> > > > > > > so that on every exec operation, I can use the hashmap to do a
> > > > > > > lookup and find the value I need, and apply some algorithm to it.
> > > > > >
> > > > > > > I realize I could just do a replicated join to achieve similar
> > > > > > > things, but the algorithm is more than a few lines and there are
> > > > > > > some edge cases, so I would rather wrap that logic inside a UDF
> > > > > > > function.
> > > > > > > I also realize I could just pass a file path to the constructor
> > > > > > > and read the files to initialize the hashmap, but my files are on
> > > > > > > Amazon's S3 and I don't want to deal with the S3 API to read the
> > > > > > > file.
> > > > > >
> > > > > > > Is this possible, or are there alternative ways to achieve
> > > > > > > the same thing?
> > > > > >
> > > > > > Thanks.
> > > > > > Dexin
> > > > >
> > >
>
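
The other suggestion in the thread, passing a file location to the UDF
constructor so that the UDF initializes itself from that file, might look
roughly like the sketch below. Everything here is an assumption made for
illustration: the class name FileBackedLookupSketch, the tab-separated
key/value file layout, and the use of java.nio on a local path. On a cluster
the file would live on the DFS and would be opened through Hadoop's file
system API instead, and the class would extend org.apache.pig.EvalFunc.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a UDF whose constructor receives a file location.
class FileBackedLookupSketch {
    private final String path;          // location handed to the constructor
    private Map<String, String> lookup; // loaded on the first exec call

    // The constructor only records the path, so constructing the object is cheap.
    FileBackedLookupSketch(String path) {
        this.path = path;
    }

    String exec(String key) {
        if (lookup == null) {
            lookup = new HashMap<>();
            try {
                // Assumed layout: one "key<TAB>value" pair per line.
                for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return lookup.get(key);
    }
}
```

Compared with passing the bag on every call, this variant moves nothing
extra with each record, at the price of having to read the file from inside
the UDF.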