Hi to everybody!

I'm working on the implementation of rank operator, which successfully

passed all the e2e tests on a cluster.

Rank operator is composed by two physical operators: POCounter and PORank,

and it provides two functionalities:

1) First functionality is similar to ROW NUMBER like on SQL, which provides

a sequential number to each tuple.

This is implemented by two map-only works (one for each physical operator).

- POCounter adds to each tuple the task identifier (which is processing it)

and a local counter. Furthermore, POCounter register the total number of

processed tuples by each task, through the used of global counters.

After finished the POCounter, it is calculated the cumulative sum, which is

the summation of the total tuples processed by previous tasks, i.e. for

task0 cumulative sum is 0 (there is not tuples before), task1 cumulative

sum is the number of tuples processed by task0 (the only task before it is

task0), and so on.

- Finally, PORank reads the corresponding cumulative according to the task

id of each tuple and sums the local counter at the tuple.

An input example for the POCount could be:

(1,n,5)

(8,a,0)

(0,b,9)

result of POCounter, and input to the PORank:

(0,1,1,n,5)

(0,2,8,a,0)

(0,3,0,b,9)

and result after PORank processing:

(1,1,n,5)

(2,8,a,0)

(3,0,b,9)

2) Second functionality is RANK BY, which is based on set of ordered

columns.

And it requires another methodology:

First, the dataset is group by the desired columns. Then, this result is

sorted by the columns specified. And, at the end this result is processed

by POCounter and PORank.

As in the previous case, POCounter adds to each tuple the task identifier

and the local counter. But here, local counter is not sequentially

incremented. Instead, it is added the number of tuples in the bag (produced

within the previous "group by").

Another particular change is the fact of the global counter is also

incremented by the size of bags on each tuple.

Finally, PORank does the same as the previous implementation without

change. After that, the rank column is spread to each component on the bag

within a for each operator.

An input example for the POCounter (after sorting and grouping):

On this case, I would like to rank by the first column.

(0,{(0,b,9)})

(1,{(1,n,5)})

(8,{(8,a,0)})

And after being processed by POCounter, and an input example for the PORank:

(0,1,0,{(0,b,9)})

(0,2,1,{(1,n,5)})

(0,3,8,{(8,a,0)})

Then, the resulting after PORank:

(1,0,{(0,b,9)})

(2,1,{(1,n,5)})

(3,8,{(8,a,0)})

Finally, the rank value is spread to each element at the bag through a for

each operator, resulting:

(1,0,b,9)

(2,1,n,5)

(3,8,a,0)

After testing some options, I got a way to illustrate the rank operator,

but I have some problems:

1.- I guess that due to the illustrator algorithm, resulting tuples after

POCounter produces numbers high counters values two or three times than

expected, for example:

(0,38,1,n,5)

(0,39,8,a,0)

(0,40,0,b,9)

2.- Until now, I get 1 tuple example after illustrate. How could I get at

least three or four tuples as result?

Thanks in advance for your replies,

--

Allan AvendaĆ±o S.

Computer Engineer

Ex-SWY22 Participant

Rome - Italy

Gmail: [EMAIL PROTECTED]

--

I'm working on the implementation of rank operator, which successfully

passed all the e2e tests on a cluster.

Rank operator is composed by two physical operators: POCounter and PORank,

and it provides two functionalities:

1) First functionality is similar to ROW NUMBER like on SQL, which provides

a sequential number to each tuple.

This is implemented by two map-only works (one for each physical operator).

- POCounter adds to each tuple the task identifier (which is processing it)

and a local counter. Furthermore, POCounter register the total number of

processed tuples by each task, through the used of global counters.

After finished the POCounter, it is calculated the cumulative sum, which is

the summation of the total tuples processed by previous tasks, i.e. for

task0 cumulative sum is 0 (there is not tuples before), task1 cumulative

sum is the number of tuples processed by task0 (the only task before it is

task0), and so on.

- Finally, PORank reads the corresponding cumulative according to the task

id of each tuple and sums the local counter at the tuple.

An input example for the POCount could be:

(1,n,5)

(8,a,0)

(0,b,9)

result of POCounter, and input to the PORank:

(0,1,1,n,5)

(0,2,8,a,0)

(0,3,0,b,9)

and result after PORank processing:

(1,1,n,5)

(2,8,a,0)

(3,0,b,9)

2) Second functionality is RANK BY, which is based on set of ordered

columns.

And it requires another methodology:

First, the dataset is group by the desired columns. Then, this result is

sorted by the columns specified. And, at the end this result is processed

by POCounter and PORank.

As in the previous case, POCounter adds to each tuple the task identifier

and the local counter. But here, local counter is not sequentially

incremented. Instead, it is added the number of tuples in the bag (produced

within the previous "group by").

Another particular change is the fact of the global counter is also

incremented by the size of bags on each tuple.

Finally, PORank does the same as the previous implementation without

change. After that, the rank column is spread to each component on the bag

within a for each operator.

An input example for the POCounter (after sorting and grouping):

On this case, I would like to rank by the first column.

(0,{(0,b,9)})

(1,{(1,n,5)})

(8,{(8,a,0)})

And after being processed by POCounter, and an input example for the PORank:

(0,1,0,{(0,b,9)})

(0,2,1,{(1,n,5)})

(0,3,8,{(8,a,0)})

Then, the resulting after PORank:

(1,0,{(0,b,9)})

(2,1,{(1,n,5)})

(3,8,{(8,a,0)})

Finally, the rank value is spread to each element at the bag through a for

each operator, resulting:

(1,0,b,9)

(2,1,n,5)

(3,8,a,0)

After testing some options, I got a way to illustrate the rank operator,

but I have some problems:

1.- I guess that due to the illustrator algorithm, resulting tuples after

POCounter produces numbers high counters values two or three times than

expected, for example:

(0,38,1,n,5)

(0,39,8,a,0)

(0,40,0,b,9)

2.- Until now, I get 1 tuple example after illustrate. How could I get at

least three or four tuples as result?

Thanks in advance for your replies,

--

Allan AvendaĆ±o S.

Computer Engineer

Ex-SWY22 Participant

Rome - Italy

Gmail: [EMAIL PROTECTED]

--