Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Pig >> mail # dev >> illustrate


+
Gianmarco De Francisci Mo... 2012-08-02, 13:52
+
Thejas Nair 2012-08-03, 21:30
+
Gianmarco De Francisci Mo... 2012-08-04, 07:22
+
Allan 2012-08-04, 08:01
+
Allan 2012-08-07, 14:34
+
Gianmarco De Francisci Mo... 2012-08-09, 16:01
I must be missing some tricky detail... Which of these operations could not be done by clever udfs?

On Aug 9, 2012, at 9:01 AM, Gianmarco De Francisci Morales <[EMAIL PROTECTED]> wrote:

> Hi Allan,
> I think I found an answer to your problem:
>
> 1) Modify PhysicalPlanResetter by adding:
>
>    @Override
>
>    public void visitCounter(POCounter counter) throws VisitorException {
>
>        counter.reset();
>
>    }
>
>
> 2) Modify POCounter by adding
>
>    @Override
>
>    public void reset() {
>
>        localCount = 0L;
>
>        taskID = "-1";
>
>        incrementer = 1;
>
>    }
>
>
>
> I get this result on this file + script:
>
> ------------------------------------------
>
> | a     | id:int    | value:chararray    |
>
> ------------------------------------------
>
> |       | 6         | g                  |
>
> ------------------------------------------
>
> -----------------------------------------------------------
>
> | b     | rank_a:long    | id:int    | value:chararray    |
>
> -----------------------------------------------------------
>
> |       | 1              | 6         | g                  |
>
> -----------------------------------------------------------
>
>
>
> grunt> cat file.txt
>
> 1 a
>
> 2 b
>
> 3 c
>
> 3 d
>
> 4 e
>
> 6 f
>
> 6 g
>
> 8 h
>
>
>
>
> grunt> a = load 'file.txt' as (id:int, value:chararray);
>
> grunt> b = rank a;
>
> grunt> illustrate b
>
>
>
>
> Hope it helps.
>
>
> Cheers,
>
> --
> Gianmarco
>
>
>
> On Tue, Aug 7, 2012 at 4:34 PM, Allan <[EMAIL PROTECTED]> wrote:
>
>> Hi to everybody!
>>
>> I'm working on the implementation of rank operator, which successfully
>> passed all the e2e tests on a cluster.
>> Rank operator is composed by two physical operators: POCounter and PORank,
>> and it provides two functionalities:
>>
>> 1) First functionality is similar to ROW NUMBER like on SQL, which
>> provides a sequential number to each tuple.
>> This is implemented by two map-only works (one for each physical
>> operator).
>>
>> - POCounter adds to each tuple the task identifier (which is processing
>> it) and a local counter.  Furthermore, POCounter register the total number
>> of processed tuples by each task, through the used of global counters.
>> After finished the POCounter, it is calculated the cumulative sum, which
>> is the summation of the total tuples processed by previous tasks, i.e. for
>> task0 cumulative sum is 0 (there is not tuples before), task1 cumulative
>> sum is the number of tuples processed by task0 (the only task before it is
>> task0), and so on.
>>
>> - Finally, PORank reads the corresponding cumulative according to the task
>> id of each tuple and sums the local counter at the tuple.
>>
>> An input example for the POCount could be:
>>
>> (1,n,5)
>> (8,a,0)
>> (0,b,9)
>>
>> result of POCounter, and input to the PORank:
>>
>> (0,1,1,n,5)
>> (0,2,8,a,0)
>> (0,3,0,b,9)
>>
>> and result after PORank processing:
>>
>> (1,1,n,5)
>> (2,8,a,0)
>> (3,0,b,9)
>>
>>
>> 2) Second functionality is RANK BY, which is based on set of ordered
>> columns.
>> And it requires another methodology:
>> First, the dataset is group by the desired columns. Then, this result is
>> sorted by the columns specified. And, at the end this result is processed
>> by POCounter and PORank.
>> As in the previous case, POCounter adds to each tuple the task identifier
>> and the local counter. But here, local counter is not sequentially
>> incremented. Instead, it is added the number of tuples in the bag (produced
>> within the previous "group by").
>> Another particular change is the fact of the global counter is also
>> incremented by the size of bags on each tuple.
>>
>> Finally, PORank does the same as the previous implementation without
>> change. After that, the rank column is spread to each component on the bag
>> within a for each operator.
>>
>> An input example for the POCounter (after sorting and grouping):
>> On this case, I would like to rank by the first column.
+
Gianmarco De Francisci Mo... 2012-08-12, 09:39