Re: Passing a BAG to Pig UDF constructor?
You're right I guess. There's no reason why the two steps should happen on
the same nodes. To get around this, you'd have to make the hash available
on all the nodes - through the distributed cache or by putting it on HDFS
as Mridul suggested. Speaking of which, what's wrong with Mridul's
solution? (#2 in this thread)
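
For reference, the put-it-on-HDFS route might look roughly like the sketch below. This is only an illustration, not Mridul's actual code: the class name MyLookupUDF, the tab-separated file layout, and the containsKey() check are all assumptions, and error handling is omitted. The constructor takes the file's path, and the hash is built lazily the first time exec runs in each task:

// Sketch only: a UDF whose constructor takes the HDFS path of the lookup file
// (class name, file format, and lookup logic are hypothetical).
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class MyLookupUDF extends EvalFunc<Boolean> {
    private final String lookupPath;      // passed in from the script via DEFINE
    private Map<String, String> hash;     // built lazily, once per task JVM

    public MyLookupUDF(String lookupPath) {
        this.lookupPath = lookupPath;
    }

    private void buildHash() throws IOException {
        hash = new HashMap<String, String>();
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path(lookupPath))));
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t");   // assuming tab-separated f1, f2, f3
            hash.put(parts[0], parts[1]);
        }
        in.close();
    }

    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (hash == null) {
            buildHash();          // every task that runs the UDF loads its own copy
        }
        return hash.containsKey((String) input.get(0));
    }
}

In the script it would be wired up with a DEFINE that passes the path to the constructor, e.g. DEFINE myUDF MyLookupUDF('/path/to/somefile');, and then called like any other UDF. Since each task reads the file itself, it doesn't matter which JVM or node the UDF runs on.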

If you absolutely have to go with the bag+UDF approach, you can try this if
your 'bag1' has a small number of tuples.
bag1 = LOAD 'somefile' AS (f1, f2, f3);
bag_grouped = GROUP bag1 ALL;
-- copy bag_grouped into each tuple of a
a_bag = FOREACH a GENERATE bag_grouped.bag1, $0 ..;
-- build and use your hash
b = FOREACH a_bag GENERATE myUDF($0 ..);

I don't think this'd be a good approach because you'd 1) be passing 'bag1'
with every tuple in a, and 2) building your hash every time. Maybe you can
avoid 2) by saving the hash on the first run in a globally available store
and reusing it, but this wouldn't be much different from Mridul's solution
of making the contents of 'bag1' available on HDFS in the first place.
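
If you do go the bag+UDF way, the UDF side could look something like this sketch (made-up names again). It caches the hash in the UDF instance, so the rebuild is only avoided within a single task's JVM, not globally:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<Boolean> {
    private Map<Object, Tuple> hash;   // built from the bag on the first call in this task

    @Override
    public Boolean exec(Tuple input) throws IOException {
        DataBag bag1 = (DataBag) input.get(0);   // the bag copied into every tuple of 'a'
        if (hash == null) {
            hash = new HashMap<Object, Tuple>();
            for (Tuple t : bag1) {
                hash.put(t.get(0), t);           // keyed on f1; adjust to your real key
            }
        }
        // the remaining fields are a's original columns ($0 .., i.e. a1, a2, a3)
        return hash.containsKey(input.get(1));
    }
}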

Thanks,
Abhinav

On 29 June 2012 00:16, Dexin Wang <[EMAIL PROTECTED]> wrote:

> This (your second method) is very neat, thanks a lot Abhinav.
>
> Some problems though. First, I would have to do a STORE or DUMP of
> bag_dummy. Otherwise, Pig won't even run the bag_dummy line.
>
> Another problem: is it possible that the invocation of the "build" step (that
> iterates through bag1) and the "check" step (that iterates through the
> data bag "a") happen in different JVMs or even on different compute nodes? If
> that happens, the "check" step will not have access to all the hashes it
> needs.
>
> Hi Jonathan, you initially mentioned passing a BAG to a UDF. How would you do
> that? Is what Abhinav said something similar to what you had in mind?
>
> Thanks.
>
> On Thu, Jun 28, 2012 at 2:54 AM, Abhinav Neelam <[EMAIL PROTECTED]> wrote:
>
> > You're not passing a bag to your UDF, you're passing a relation. I believe
> > the FOREACH.. GENERATE looks for columns within the relation being iterated
> > on, meaning that it's looking for 'bag1' within the schema of 'a'.
> >
> > One way of doing this is generating a bag containing all the tuples in
> > the relation 'bag1', and passing that to the UDF.
> > bag1 = LOAD 'somefile' AS (f1, f2, f3);
> > bag_grouped = GROUP bag1 ALL;
> > -- build your hash here
> > bag_dummy = FOREACH bag_grouped GENERATE myUDF(bag1);
> > -- write some logic into the UDF to check if it's receiving a bag or two
> > -- scalars, if you wish to reuse it
> > b = FOREACH a GENERATE myUDF(a1,a2);
> >
> > The problem here is the GROUP... ALL statement, as it uses only one
> > reducer in the reduce phase. You can make your myUDF algebraic (if
> > possible) to speed up the hash-building FOREACH...GENERATE step.
> >
> > Another way of doing this (I'm just throwing this one out there) is maybe
> > to simply FOREACH..GENERATE over the relation 'bag1', and in the exec
> > function build your hash using the input tuples of bag1 (f1, f2, f3).
> > (Do you need all the tuples in bag1 at one time to build your hash?)
> >
> > bag1 = LOAD 'somefile' AS (f1, f2, f3);
> > -- build your hash here, perhaps use some identifier if you wish to reuse
> > -- your UDF
> > bag_dummy = FOREACH bag1 GENERATE myUDF('build',f1, f2, f3);
> > -- now use the hash
> > b = FOREACH a GENERATE myUDF('check',a1,a2);
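> >
> > (Sketching this out: myUDF's exec could branch on its first argument,
> > roughly as below. The names are illustrative, and it relies on the 'build'
> > and 'check' calls landing in the same task JVM, which isn't guaranteed.)
> >
> > import java.io.IOException;
> > import java.util.HashMap;
> > import java.util.Map;
> >
> > import org.apache.pig.EvalFunc;
> > import org.apache.pig.data.Tuple;
> >
> > public class MyUDF extends EvalFunc<Boolean> {
> >     // shared between the 'build' and 'check' calls -- but only inside one task JVM
> >     private static final Map<Object, Object> HASH = new HashMap<Object, Object>();
> >
> >     @Override
> >     public Boolean exec(Tuple input) throws IOException {
> >         String mode = (String) input.get(0);
> >         if ("build".equals(mode)) {
> >             HASH.put(input.get(1), input.get(2));   // called with ('build', f1, f2, f3)
> >             return true;                            // dummy output for the build pass
> >         }
> >         return HASH.containsKey(input.get(1));      // called with ('check', a1, a2)
> >     }
> > }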
> >
> >
> > Regards,
> > Abhinav
> > On 28 June 2012 04:38, Dexin Wang <[EMAIL PROTECTED]> wrote:
> >
> > > Actually how do you pass a bag to UDF? I did this:
> > >
> > >    a = LOAD 'file_a' AS (a1, a2, a3);
> > >
> > >    bag1 = LOAD 'somefile' AS (f1, f2, f3);
> > >
> > >    b = FOREACH a GENERATE myUDF(bag1, a1, a2);
> > >
> > > But I got this error:
> > >
> > >     Invalid scalar projection: bag1 : A column needs to be projected from
> > >     a relation for it to be used as a scalar
> > >
> > > What is the right way of doing this? Thanks.
> > >
> > >
> > > On Wed, Jun 27, 2012 at 10:30 AM, Dexin Wang <[EMAIL PROTECTED]> wrote: