Pig >> mail # user >> Passing a BAG to Pig UDF constructor?


Re: Passing a BAG to Pig UDF constructor?
I would run a perf test, but compared to the many other costs, I think it
will be minimal (unless it's a really massive bag). Pig should probably
allow for more graceful initialization in cases like this, but in my
experience I haven't noticed any serious degradation from this sort of
thing.
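The lazy-initialization pattern described here can be sketched as follows. This is a simplified, Pig-free illustration in plain Java (the class name `LookupUdf` and the pair layout are assumptions): in a real Pig UDF the class would extend `EvalFunc`, the pairs would be Tuples inside a DataBag passed as one of the input fields, and the entry point would be `exec(Tuple)`. The point is only that the map is built on the first invocation and reused afterwards.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the lazy-init pattern: the "bag" of (key, value) pairs
// arrives with every call, but the lookup map is built only once,
// on the first invocation; later calls reuse it.
public class LookupUdf {
    private Map<String, String> lookup;      // null until first call

    public String exec(List<String[]> bag, String key) {
        if (lookup == null) {                // first invocation only
            lookup = new HashMap<>();
            for (String[] pair : bag) {      // pair[0] = key, pair[1] = value
                lookup.put(pair[0], pair[1]);
            }
        }
        return lookup.get(key);              // later calls skip the build
    }

    public static void main(String[] args) {
        List<String[]> bag = new ArrayList<>();
        bag.add(new String[] {"a", "1"});
        bag.add(new String[] {"b", "2"});
        LookupUdf udf = new LookupUdf();
        System.out.println(udf.exec(bag, "a"));
        System.out.println(udf.exec(bag, "b"));
    }
}
```

This also makes Mridul's cost concern below concrete: the bag is still shipped with every tuple even though only the first call uses it.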

2012/6/29 Mridul Muralidharan <[EMAIL PROTECTED]>

>
>
> > -----Original Message-----
> > From: Dexin Wang [mailto:[EMAIL PROTECTED]]
> > Sent: Wednesday, June 27, 2012 11:00 PM
> > To: [EMAIL PROTECTED]
> > Subject: Re: Passing a BAG to Pig UDF constructor?
> >
> > That's a good idea (to pass the bag to UDF and initialize it on first
> > UDF invocation). Thanks.
> >
> > Why do you think it is expensive, Mridul?
>
>
> You will be passing the bag with each tuple, but using it only for the
> first invocation per mapper/reducer.
> If other computations are more expensive, then it will get amortized over
> time; but it is a cost nonetheless ... only a perf test will tell you if it
> is small enough to ignore!
>
>
> Regards,
> Mridul
>
>
> >
> > On Tue, Jun 26, 2012 at 2:50 PM, Mridul Muralidharan
> > <[EMAIL PROTECTED]> wrote:
> >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jonathan Coveney [mailto:[EMAIL PROTECTED]]
> > > > Sent: Wednesday, June 27, 2012 3:12 AM
> > > > To: [EMAIL PROTECTED]
> > > > Subject: Re: Passing a BAG to Pig UDF constructor?
> > > >
> > > > You can also just pass the bag to the UDF, and have a lazy
> > > > initializer in exec that loads the bag into memory.
> > >
> > >
> > > Can you elaborate what you mean by pass the bag to the UDF?
> > > Pass it as part of the input to the udf in exec and initialize it
> > > only once (first time)? (If yes, this is expensive) Or something else?
> > >
> > >
> > > Regards,
> > > Mridul
> > >
> > >
> > >
> > > >
> > > > 2012/6/26 Mridul Muralidharan <[EMAIL PROTECTED]>
> > > >
> > > > > You could dump the data in a dfs file and pass the location of
> > > > > the file as param to your udf in define - so that it initializes
> > > > > itself using that data ...
> > > > >
> > > > >
> > > > > - Mridul
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Dexin Wang [mailto:[EMAIL PROTECTED]]
> > > > > > Sent: Tuesday, June 26, 2012 10:58 PM
> > > > > > To: [EMAIL PROTECTED]
> > > > > > Subject: Passing a BAG to Pig UDF constructor?
> > > > > >
> > > > > > Is it possible to pass a bag to a Pig UDF constructor?
> > > > > >
> > > > > > Basically in the constructor I want to initialize some hash map
> > > > > > so that on every exec operation, I can use the hashmap to do a
> > > > > > lookup and find the value I need, and apply some algorithm to it.
> > > > > >
> > > > > > I realize I could just do a replicated join to achieve similar
> > > > > > things but the algorithm is more than a few lines and there are
> > > > > > some edge cases so I would rather wrap that logic inside a UDF
> > > > > > function. I also realize I could just pass a file path to the
> > > > > > constructor and read the files to initialize the hashmap but my
> > > > > > files are on Amazon's S3 and I don't want to deal with
> > > > > > S3 API to read the file.
> > > > > >
> > > > > > Is this possible, or is there some alternative way to achieve
> > > > > > the same thing?
> > > > > >
> > > > > > Thanks.
> > > > > > Dexin
> > > > >
> > >
>
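Mridul's first suggestion in the thread (dump the data to a file and pass its location to the UDF via DEFINE) would look roughly like this. This is a stdlib-only sketch (the class name `FileLookupUdf` and the tab-separated format are assumptions): the constructor here reads a local file, whereas a real Pig UDF would open an HDFS path, e.g. via Hadoop's FileSystem API. The trade-off versus the lazy-init pattern above is that the map is built once per UDF instance and nothing extra is shipped with each tuple.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Sketch of constructor-time initialization: the file location is passed
// as a string argument via DEFINE, so the lookup map is built once per
// UDF instance instead of arriving with every tuple. In Pig Latin the
// wiring would be something like (hypothetical class name):
//   DEFINE MyLookup com.example.FileLookupUdf('/dfs/path/lookup.tsv');
public class FileLookupUdf {
    private final Map<String, String> lookup = new HashMap<>();

    // Reads a tab-separated key/value file. A real UDF would read the
    // path from HDFS (FileSystem.open) rather than the local disk.
    public FileLookupUdf(String path) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    public String exec(String key) {
        return lookup.get(key);
    }
}
```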