Pig >> mail # user >> Can I pass an entire relation to a Pig UDF?


Re: Can I pass an entire relation to a Pig UDF?
What you've indicated does require access to the whole relation at once
or at least a way of incrementing a counter and assigning its value to
each tuple. This kind of shared/synchronized state isn't possible with
Pig at the moment as far as I know.

--jacob
@thedatachef
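
For illustration only (this is not from the original messages): once every tuple is available in one place and sorted by score, the rank column Arun describes is just a running counter. Below is a minimal Python sketch of that logic, assuming the rows arrive as (name, score) pairs. `assign_ranks` is a hypothetical name, and the function is plain Python rather than a Pig UDF.

```python
def assign_ranks(rows):
    """Return (name, score, rank) tuples, highest score first.

    `rows` is an iterable of (name, score) pairs. Tied scores still get
    distinct consecutive ranks, because the rank is a plain running counter.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(name, score, rank)
            for rank, (name, score) in enumerate(ordered, start=1)]


# Sample data taken from the thread.
sample = [("Jack", 25), ("Jimmy", 30), ("Sam", 20), ("Hick", 35), ("Tampa", 22)]
for name, score, rank in assign_ranks(sample):
    print(name, score, rank)
```

The sketch only works because all rows sit in a single process, which is exactly the constraint discussed in this thread.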

On Tue, 2011-04-26 at 19:43 -0700, Arun A K wrote:
> Thanks Jacob for the response.
>
> If I run the UDF on each tuple, then how can I preserve the state of the rank
> variable? I mean the UDF won't be able to save the rank value between calls,
> right? Correct me if I am wrong in assuming that the UDF would be
> invoked for each tuple.
>
> What I am looking for in my output is an additional column indicating the
> rank, something like:
>
> Hick    35    1
> Jimmy   30    2
> Jack    25    3
> Tampa   22    4
> Sam     20    5
>
> Thanks.
>
> Arun
>
>
> On Tue, Apr 26, 2011 at 7:18 PM, Jacob Perkins <[EMAIL PROTECTED]> wrote:
>
> > The question is, do you need the entire relation all at once to assign a
> > rank? If so then map-reduce may not be the answer. If not, why not just
> > run the UDF on each tuple of the relation, one at a time, with a
> > projection?
> >
> > If you need some global information, such as the max and min score, then
> > you might look at the MAX and MIN operations. They do require a GROUP
> > ALL but are algebraic so it's not actually going to bring all the data
> > to one machine as it otherwise would.
> >
> > --jacob
> > @thedatachef
> >
> >
> > On Tue, 2011-04-26 at 19:07 -0700, Arun A K wrote:
> > > Hi
> > >
> > > I have the following input relation:
> > > Name    Score
> > > Jack    25
> > > Jimmy   30
> > > Sam     20
> > > Hick    35
> > > Tampa   22
> > >
> > > My goal is to rank the tuples by score.
> > >
> > > Pig script:
> > >
> > > sample_data = LOAD 'sample.txt' USING PigStorage() AS (name:chararray, score:int);
> > > sample_data_group = GROUP sample_data BY score;
> > > sample_data_count = FOREACH sample_data_group GENERATE group AS score, COUNT(sample_data.name) AS countVal;
> > > sample_data_order = ORDER sample_data_count BY score DESC;
> > > sample_data_group_all = GROUP sample_data_order all;
> > > sample_data_project = FOREACH sample_data_group_all GENERATE FLATTEN(myUDF.Rank(sample_data_order));
> > > dump sample_data_project;
> > >
> > > Can someone please point me to a UDF example where a relation is read in and
> > > iterated over all its tuples? I plan to iterate over the tuples and assign a
> > > rank to each of them based on the score value.
> > >
> > > Is there any other way to generate rank?
> > >
> > > Thanks much.
> > >
> > > Arun
> >
> >
> >