Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig, mail # dev - illustrate


Copy link to this message
-
Re: illustrate
Gianmarco De Francisci Mo... 2012-08-09, 16:01
Hi Allan,
I think I found an answer to your problem:

1) Modify PhysicalPlanResetter by adding:

    @Override

    public void visitCounter(POCounter counter) throws VisitorException {

        counter.reset();

    }
2) Modify POCounter by adding

    @Override

    public void reset() {

        localCount = 0L;

        taskID = "-1";

        incrementer = 1;

    }

I get this result on this file + script:

------------------------------------------

| a     | id:int    | value:chararray    |

------------------------------------------

|       | 6         | g                  |

------------------------------------------

-----------------------------------------------------------

| b     | rank_a:long    | id:int    | value:chararray    |

-----------------------------------------------------------

|       | 1              | 6         | g                  |

-----------------------------------------------------------

grunt> cat file.txt

1 a

2 b

3 c

3 d

4 e

6 f

6 g

8 h
grunt> a = load 'file.txt' as (id:int, value:chararray);

grunt> b = rank a;

grunt> illustrate b
Hope it helps.
Cheers,

--
Gianmarco

On Tue, Aug 7, 2012 at 4:34 PM, Allan <[EMAIL PROTECTED]> wrote:

> Hi to everybody!
>
> I'm working on the implementation of rank operator, which successfully
> passed all the e2e tests on a cluster.
> Rank operator is composed by two physical operators: POCounter and PORank,
> and it provides two functionalities:
>
> 1) First functionality is similar to ROW NUMBER like on SQL, which
> provides a sequential number to each tuple.
> This is implemented by two map-only works (one for each physical
> operator).
>
> - POCounter adds to each tuple the task identifier (which is processing
> it) and a local counter.  Furthermore, POCounter register the total number
> of processed tuples by each task, through the used of global counters.
> After finished the POCounter, it is calculated the cumulative sum, which
> is the summation of the total tuples processed by previous tasks, i.e. for
> task0 cumulative sum is 0 (there is not tuples before), task1 cumulative
> sum is the number of tuples processed by task0 (the only task before it is
> task0), and so on.
>
> - Finally, PORank reads the corresponding cumulative according to the task
> id of each tuple and sums the local counter at the tuple.
>
> An input example for the POCount could be:
>
> (1,n,5)
> (8,a,0)
> (0,b,9)
>
> result of POCounter, and input to the PORank:
>
> (0,1,1,n,5)
> (0,2,8,a,0)
> (0,3,0,b,9)
>
> and result after PORank processing:
>
> (1,1,n,5)
> (2,8,a,0)
> (3,0,b,9)
>
>
> 2) Second functionality is RANK BY, which is based on set of ordered
> columns.
> And it requires another methodology:
> First, the dataset is group by the desired columns. Then, this result is
> sorted by the columns specified. And, at the end this result is processed
> by POCounter and PORank.
> As in the previous case, POCounter adds to each tuple the task identifier
> and the local counter. But here, local counter is not sequentially
> incremented. Instead, it is added the number of tuples in the bag (produced
> within the previous "group by").
> Another particular change is the fact of the global counter is also
> incremented by the size of bags on each tuple.
>
> Finally, PORank does the same as the previous implementation without
> change. After that, the rank column is spread to each component on the bag
> within a for each operator.
>
> An input example for the POCounter (after sorting and grouping):
> On this case, I would like to rank by the first column.
>
> (0,{(0,b,9)})
> (1,{(1,n,5)})
>  (8,{(8,a,0)})
>
> And after being processed by POCounter, and an input example for the
> PORank:
>
> (0,1,0,{(0,b,9)})
> (0,2,1,{(1,n,5)})
> (0,3,8,{(8,a,0)})
>
> Then, the resulting after PORank:
>
> (1,0,{(0,b,9)})
> (2,1,{(1,n,5)})
> (3,8,{(8,a,0)})
>
> Finally, the rank value is spread to each element at the bag through a for
> each operator, resulting:
>
> (1,0,b,9)