Pig, mail # user - Can I pass an entire relation to a Pig UDF?


Re: Can I pass an entire relation to a Pig UDF?
Dexin Wang 2011-04-27, 04:14
If the whole set is not that big, sorting in the shell might be the easiest. I've done that with result sets of millions of records.
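
For reference, the external step can be quite small. Below is a minimal sketch in Python (rather than shell) of a driver that takes name/score lines, sorts them by score in descending order, and appends a rank column. The function name `rank_rows` and the tab-separated input layout are assumptions for illustration, not something taken from the scripts in this thread. Tied scores simply receive consecutive ranks in input order.

```python
def rank_rows(lines, sep="\t"):
    """Sort 'name<sep>score' lines by score (highest first) and add a 1-based rank.

    `rank_rows` and the tab-separated layout are hypothetical, chosen for
    illustration only.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        name, score = line.split(sep)
        rows.append((name, int(score)))

    # Python's sort is stable, so rows with equal scores keep their input order.
    rows.sort(key=lambda row: row[1], reverse=True)

    # Attach the rank as a third column, starting at 1.
    return [(name, score, rank)
            for rank, (name, score) in enumerate(rows, start=1)]


sample = ["Jack\t25", "Jimmy\t30", "Sam\t20", "Hick\t35", "Tampa\t22"]
for name, score, rank in rank_rows(sample):
    print(f"{name}\t{score}\t{rank}")
```

Run on the five sample rows from the original question, this ranks Hick 1, Jimmy 2, Jack 3, Tampa 4 and Sam 5, which is the output Arun asked for. Any iterable of lines works as input, so the same function could be fed the output of a Pig job read from a file or from standard input.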
On Apr 26, 2011, at 8:49 PM, Arun A K <[EMAIL PROTECTED]> wrote:

> Thanks Jacob.
>
> I wonder if it is possible to get the rank of each record or say row number
> using Pig. Or do I need to have an external driver like a shell script which
> augments the sorted output from Pig with a rank?
>
> Thanks
> Arun
>
>
>
> On Tue, Apr 26, 2011 at 7:54 PM, Jacob Perkins <[EMAIL PROTECTED]> wrote:
>
>> What you've indicated does require access to the whole relation at once
>> or at least a way of incrementing a counter and assigning its value to
>> each tuple. This kind of shared/synchronized state isn't possible with
>> Pig at the moment as far as I know.
>>
>> --jacob
>> @thedatachef
>>
>> On Tue, 2011-04-26 at 19:43 -0700, Arun A K wrote:
>>> Thanks Jacob for the response.
>>>
>>> If I run the UDF on each tuple then how can I preserve the state of the
>>> rank variable? I mean the UDF won't be able to save the rank value between
>>> calls, right? Correct me if I am wrong in interpreting that the UDF would be
>>> invoked for each tuple.
>>>
>>> What I am looking for in my output is an additional column indicating the
>>> rank.
>>> Something like
>>>
>>> Hick    35      1
>>> Jimmy   30      2
>>> Jack    25      3
>>> Tampa   22      4
>>> Sam     20      5
>>>
>>> Thanks.
>>>
>>> Arun
>>>
>>>
>>> On Tue, Apr 26, 2011 at 7:18 PM, Jacob Perkins <[EMAIL PROTECTED]> wrote:
>>>
>>>> The question is, do you need the entire relation all at once to assign a
>>>> rank? If so then map-reduce may not be the answer. If not, why not just
>>>> run the UDF on each tuple of the relation, one at a time, with a
>>>> projection?
>>>>
>>>> If you need some global information, such as the max and min score, then
>>>> you might look at the MAX and MIN operations. They do require a GROUP
>>>> ALL but are algebraic so it's not actually going to bring all the data
>>>> to one machine as it otherwise would.
>>>>
>>>> --jacob
>>>> @thedatachef
>>>>
>>>>
>>>> On Tue, 2011-04-26 at 19:07 -0700, Arun A K wrote:
>>>>> Hi
>>>>>
>>>>> I have the following input relation:
>>>>> Name Score
>>>>> Jack    25
>>>>> Jimmy   30
>>>>> Sam     20
>>>>> Hick    35
>>>>> Tampa   22
>>>>>
>>>>> My goal is to rank the tuples by score.
>>>>>
>>>>> Pig script:
>>>>>
>>>>> sample_data = LOAD 'sample.txt' USING PigStorage() AS (name:chararray, score:int);
>>>>> sample_data_group = GROUP sample_data BY score;
>>>>> sample_data_count = FOREACH sample_data_group GENERATE group AS score,
>>>>>     COUNT(sample_data.name) AS countVal;
>>>>> sample_data_order = ORDER sample_data_count BY score DESC;
>>>>> sample_data_group_all = GROUP sample_data_order all;
>>>>> sample_data_project = FOREACH sample_data_group_all GENERATE
>>>>>     FLATTEN(myUDF.Rank(sample_data_order));
>>>>> dump sample_data_project;
>>>>>
>>>>> Can someone please point me to a UDF example where a relation is read in and
>>>>> iterated over all its tuples? I plan to iterate over the tuples and assign a
>>>>> rank to each of them based on the score value.
>>>>>
>>>>> Is there any other way to generate rank?
>>>>>
>>>>> Thanks much.
>>>>>
>>>>> Arun