Pig, mail # dev - illustrate


Re: illustrate
Gianmarco De Francisci Mo... 2012-08-12, 09:39
Hi Dmitriy,
I think there are at least a couple of things that would be more difficult to
do with a UDF implementation, namely:
1) AFAIK, you don't have access to the MR task id within the UDF.
2) Using the counters between the two steps of the operation in order to
communicate the cumulative sums.
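
A minimal sketch of both points, assuming only the standard Hadoop mapreduce
API (this is not the actual POCounter code, and the class and counter group
names are hypothetical): a physical operator runs inside the map task and
receives the Hadoop task context, through which it can read the task id and
publish per-task tuple counts as global counters, whereas a plain UDF is not
handed that context.

    import org.apache.hadoop.mapreduce.Mapper;

    public class TaskInfoSketch {
        // hypothetical counter group name, for illustration only
        private static final String COUNTER_GROUP = "rank-per-task-counts";

        // called once per input tuple by the (hypothetical) operator
        void recordTuple(Mapper<?, ?, ?, ?>.Context context) {
            // 1) the id of the map task currently processing the tuple
            int taskId = context.getTaskAttemptID().getTaskID().getId();

            // 2) a global counter, one per task, used to communicate how many
            //    tuples this task processed to the next step of the operation
            context.getCounter(COUNTER_GROUP, "task_" + taskId).increment(1L);
        }
    }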
Cheers,
--
Gianmarco

On Fri, Aug 10, 2012 at 9:13 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> I must be missing some tricky detail... Which of these operations could
> not be done by clever udfs?
>
> On Aug 9, 2012, at 9:01 AM, Gianmarco De Francisci Morales <
> [EMAIL PROTECTED]> wrote:
>
> > Hi Allan,
> > I think I found an answer to your problem:
> >
> > 1) Modify PhysicalPlanResetter by adding:
> >
> >    @Override
> >    public void visitCounter(POCounter counter) throws VisitorException {
> >        counter.reset();
> >    }
> >
> > 2) Modify POCounter by adding:
> >
> >    @Override
> >    public void reset() {
> >        localCount = 0L;
> >        taskID = "-1";
> >        incrementer = 1;
> >    }
> >
> > I get this result on this file + script:
> >
> > ------------------------------------------
> > | a     | id:int    | value:chararray    |
> > ------------------------------------------
> > |       | 6         | g                  |
> > ------------------------------------------
> >
> > -----------------------------------------------------------
> > | b     | rank_a:long    | id:int    | value:chararray    |
> > -----------------------------------------------------------
> > |       | 1              | 6         | g                  |
> > -----------------------------------------------------------
> >
> > grunt> cat file.txt
> > 1 a
> > 2 b
> > 3 c
> > 3 d
> > 4 e
> > 6 f
> > 6 g
> > 8 h
> >
> > grunt> a = load 'file.txt' as (id:int, value:chararray);
> > grunt> b = rank a;
> > grunt> illustrate b
> >
> > Hope it helps.
> >
> >
> > Cheers,
> >
> > --
> > Gianmarco
> >
> >
> >
> > On Tue, Aug 7, 2012 at 4:34 PM, Allan <[EMAIL PROTECTED]> wrote:
> >
> >> Hi everybody!
> >>
> >> I'm working on the implementation of the rank operator, which
> >> successfully passed all the e2e tests on a cluster.
> >> The rank operator is composed of two physical operators, POCounter and
> >> PORank, and it provides two functionalities:
> >>
> >> 1) The first functionality is similar to ROW_NUMBER in SQL: it assigns
> >> a sequential number to each tuple.
> >> This is implemented by two map-only jobs (one for each physical
> >> operator).
> >>
> >> - POCounter adds to each tuple the identifier of the task processing it
> >> and a local counter. Furthermore, POCounter registers the total number
> >> of tuples processed by each task through global counters.
> >> After POCounter finishes, the cumulative sum is calculated, i.e. the sum
> >> of the tuples processed by all previous tasks: for task0 the cumulative
> >> sum is 0 (there are no tuples before it), for task1 it is the number of
> >> tuples processed by task0 (the only task before it), and so on.
> >>
> >> - Finally, PORank reads the corresponding cumulative sum according to
> >> the task id of each tuple and adds it to the local counter of the tuple
> >> (see the sketch after the example below).
> >>
> >> An input example for POCounter could be:
> >>
> >> (1,n,5)
> >> (8,a,0)
> >> (0,b,9)
> >>
> >> result of POCounter, and input to the PORank:
> >>
> >> (0,1,1,n,5)
> >> (0,2,8,a,0)
> >> (0,3,0,b,9)
> >>
> >> and result after PORank processing:
> >>
> >> (1,1,n,5)
> >> (2,8,a,0)
> >> (3,0,b,9)
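
A minimal sketch (not Pig's actual POCounter/PORank code) of the rank
arithmetic just described, assuming the per-task tuple counts have already
been gathered from the first pass's global counters; a hypothetical second
task is included only to show a non-zero cumulative sum, while in the
single-task example above the sum is 0 and the ranks equal the local counters:

    import java.util.ArrayList;
    import java.util.List;

    public class RankSketch {
        public static void main(String[] args) {
            // tuples processed by task0, task1, ... (hypothetical counts)
            long[] tuplesPerTask = {3, 2};

            // cumulative sum for each task = tuples processed by all previous tasks
            long[] cumulative = new long[tuplesPerTask.length];
            for (int i = 1; i < tuplesPerTask.length; i++) {
                cumulative[i] = cumulative[i - 1] + tuplesPerTask[i - 1];
            }
            // here: cumulative[0] = 0, cumulative[1] = 3

            // (taskId, localCounter) pairs prepended by the first pass;
            // the first three rows mirror the example above
            long[][] prefixed = { {0, 1}, {0, 2}, {0, 3}, {1, 1}, {1, 2} };

            List<Long> ranks = new ArrayList<Long>();
            for (long[] t : prefixed) {
                int taskId = (int) t[0];
                long localCounter = t[1];
                // final rank = cumulative sum of previous tasks + local counter
                ranks.add(cumulative[taskId] + localCounter);
            }
            System.out.println(ranks); // prints [1, 2, 3, 4, 5]
        }
    }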
> >>
> >>
> >> 2) The second functionality is RANK BY, which is based on a set of
> >> ordered columns, and it requires a different methodology:
> >> first, the dataset is grouped by the desired columns; then, this result
> >> is sorted by the specified columns; and, at the end, this result is
> >> processed by POCounter and PORank.
> >> As in the previous case, POCounter adds to each tuple the task