Re: Can I pass an entire relation to a Pig UDF?
If the whole set is not that big, sorting in shell might be the easiest. I've done that with result sets of millions of records.
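
For illustration, here is a minimal sketch of that shell approach, assuming the Pig output is tab-separated with the name in the first field and the score in the second. The function name `rank_by_score` and the output path in the comment are hypothetical, not something from this thread.

```shell
# Read tab-separated "name<TAB>score" lines on stdin, sort them by score
# (numeric, descending), and append the row number as a rank column.
rank_by_score() {
  sort -t "$(printf '\t')" -k2,2nr | awk 'BEGIN { OFS = "\t" } { print $0, NR }'
}

# Usage with the sample data from the original question:
printf 'Jack\t25\nJimmy\t30\nSam\t20\nHick\t35\nTampa\t22\n' | rank_by_score

# With real Pig output it might look like this (hypothetical path):
#   hadoop fs -cat output/part-* | rank_by_score > ranked.tsv
```

Note that this assigns plain row numbers, so tuples with equal scores get different ranks. If ties should share a rank, the awk step would need to compare each score with the previous one.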
On Apr 26, 2011, at 8:49 PM, Arun A K <[EMAIL PROTECTED]> wrote:

> Thanks Jacob.
>
> I wonder if it is possible to get the rank of each record or say row number
> using Pig. Or do I need to have an external driver like a shell script which
> augments the sorted output from Pig with a rank?
>
> Thanks
> Arun
>
>
>
> On Tue, Apr 26, 2011 at 7:54 PM, Jacob Perkins <[EMAIL PROTECTED]> wrote:
>
>> What you've indicated does require access to the whole relation at once
>> or at least a way of incrementing a counter and assigning its value to
>> each tuple. This kind of shared/synchronized state isn't possible with
>> Pig at the moment as far as I know.
>>
>> --jacob
>> @thedatachef
>>
>> On Tue, 2011-04-26 at 19:43 -0700, Arun A K wrote:
>>> Thanks Jacob for the response.
>>>
>>> If I run the UDF on each tuple, then how can I preserve the state of the
>>> rank variable? I mean the UDF won't be able to save the rank value between
>>> calls, right? Correct me if I am wrong in interpreting that the UDF would be
>>> invoked for each tuple.
>>>
>>> What I am looking for in my output is an additional column indicating the
>>> rank. Something like
>>>
>>> Hick    35    1
>>> Jimmy   30    2
>>> Jack    25    3
>>> Tampa   22    4
>>> Sam     20    5
>>>
>>> Thanks.
>>>
>>> Arun
>>>
>>>
>>> On Tue, Apr 26, 2011 at 7:18 PM, Jacob Perkins <[EMAIL PROTECTED]> wrote:
>>>
>>>> The question is, do you need the entire relation all at once to assign a
>>>> rank? If so then map-reduce may not be the answer. If not, why not just
>>>> run the UDF on each tuple of the relation, one at a time, with a
>>>> projection?
>>>>
>>>> If you need some global information, such as the max and min score, then
>>>> you might look at the MAX and MIN operations. They do require a GROUP
>>>> ALL but are algebraic so it's not actually going to bring all the data
>>>> to one machine as it otherwise would.
>>>>
>>>> --jacob
>>>> @thedatachef
>>>>
>>>>
>>>> On Tue, 2011-04-26 at 19:07 -0700, Arun A K wrote:
>>>>> Hi
>>>>>
>>>>> I have the following input relation:
>>>>> Name Score
>>>>> Jack    25
>>>>> Jimmy   30
>>>>> Sam     20
>>>>> Hick    35
>>>>> Tampa   22
>>>>>
>>>>> My goal is to rank the tuples by score.
>>>>>
>>>>> Pig script:
>>>>>
>>>>> sample_data = LOAD 'sample.txt' USING PigStorage() AS (name:chararray, score:int);
>>>>> sample_data_group = GROUP sample_data BY score;
>>>>> sample_data_count = FOREACH sample_data_group GENERATE group AS score,
>>>>>     COUNT(sample_data.name) AS countVal;
>>>>> sample_data_order = ORDER sample_data_count BY score DESC;
>>>>> sample_data_group_all = GROUP sample_data_order all;
>>>>> sample_data_project = FOREACH sample_data_group_all GENERATE
>>>>>     FLATTEN(myUDF.Rank(sample_data_order));
>>>>> dump sample_data_project;
>>>>>
>>>>> Can someone please point me to a UDF example where a relation is read in
>>>>> and iterated over all its tuples? I plan to iterate over the tuples and
>>>>> assign a rank to each of them based on the score value.
>>>>>
>>>>> Is there any other way to generate rank?
>>>>>
>>>>> Thanks much.
>>>>>
>>>>> Arun