Pig, mail # user - Can I pass an entire relation to a Pig UDF?

Jacob Perkins 2011-04-27, 02:18
The question is, do you need the entire relation all at once to assign a
rank? If so then map-reduce may not be the answer. If not, why not just
run the UDF on each tuple of the relation, one at a time, with a
FOREACH ... GENERATE?

If you need some global information, such as the max and min score, then
you might look at the MAX and MIN operations. They do require a GROUP
ALL, but because they are algebraic, Pig computes partial results in the
combiner, so it's not actually going to bring all the data to one
machine as it otherwise would.
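
Here is a minimal sketch of that idea, reusing the file name and schema
from the script quoted below. The aliases all_scores, score_stats and
with_stats are made up for illustration, and the last step assumes Pig
0.8 or later, where a field of a single-tuple relation can be used as a
scalar.

```pig
-- Load the data exactly as in the quoted script.
sample_data = LOAD 'sample.txt' USING PigStorage() AS (name:chararray, score:int);

-- GROUP ALL puts every tuple into a single group. MAX and MIN are
-- algebraic, so partial maxima and minima are computed map-side and
-- in the combiner before the final reduce.
all_scores = GROUP sample_data ALL;
score_stats = FOREACH all_scores GENERATE
    MAX(sample_data.score) AS max_score,
    MIN(sample_data.score) AS min_score;

-- score_stats holds exactly one tuple, so its fields can be referenced
-- as scalars (assumes Pig 0.8+) and attached to every input tuple.
with_stats = FOREACH sample_data GENERATE
    name,
    score,
    score_stats.max_score AS max_score,
    score_stats.min_score AS min_score;

DUMP with_stats;
```

With the global values on every tuple, a per-tuple UDF or expression
can use them without ever being handed the whole relation.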

On Tue, 2011-04-26 at 19:07 -0700, Arun A K wrote:
> Hi
> I have the following input relation:
> Name    Score
> Jack    25
> Jimmy   30
> Sam     20
> Hick    35
> Tampa   22
> My goal is to rank the tuples by score.
> Pig script:
> sample_data = LOAD 'sample.txt' USING PigStorage() AS (name:chararray,
>     score:int);
> sample_data_group = GROUP sample_data BY score;
> sample_data_count = FOREACH sample_data_group GENERATE group AS score,
>     COUNT(sample_data.name) AS countVal;
> sample_data_order = ORDER sample_data_count BY score DESC;
> sample_data_group_all = GROUP sample_data_order ALL;
> sample_data_project = FOREACH sample_data_group_all GENERATE
>     FLATTEN(myUDF.Rank(sample_data_order));
> DUMP sample_data_project;
> Can someone please point me to a UDF example where a relation is read in and
> iterated over all its tuples? I plan to iterate over the tuples and assign a
> rank to each of them based on the score value.
> Is there any other way to generate rank?
> Thanks much.
> Arun