Pig, mail # user - Ordered partitioned data


Re: Ordered partitioned data
Cheolsoo Park 2013-05-13, 17:18
Hi Ahmed,

Please try this:

grped = GROUP foo BY group_id;
sorted = FOREACH grped {
    ordered = ORDER foo BY position;
    GENERATE group, MyUDF(ordered.name); -- MyUDF concatenates strings in a bag
};

What this will do is:
1) Mappers send all records with the same group_id to the same reducer.
2) Each reducer sorts only the values that belong to its own keys, so no
global sort over the whole dataset is needed.
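
To make the two steps concrete, here is a small plain-Python model of the
data flow. It is only a sketch: the sample rows, the function name
concat_names_by_group, and the comma separator are assumptions made for
illustration, and it mimics the semantics of the script rather than how Pig
actually executes it.

```python
from collections import defaultdict

def concat_names_by_group(rows, sep=","):
    """rows: iterable of (group_id, position, name) tuples (hypothetical data)."""
    # Step 1, the "shuffle": collect all rows with the same group_id together,
    # as GROUP foo BY group_id does.
    groups = defaultdict(list)
    for group_id, position, name in rows:
        groups[group_id].append((position, name))

    # Step 2: sort each group's rows by position only, then concatenate names,
    # as the nested ORDER plus a string-concatenating UDF would.
    result = {}
    for group_id, members in groups.items():
        members.sort(key=lambda m: m[0])
        result[group_id] = sep.join(name for _, name in members)
    return result

rows = [(1, 2, "b"), (2, 1, "x"), (1, 1, "a"), (2, 2, "y"), (1, 3, "c")]
print(concat_names_by_group(rows))
```

Each group is sorted independently, so the work is a set of small per-group
sorts rather than one sort over all rows.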

In fact, Pig could optimize this even further with its secondary key sort
optimization (i.e. Pig can remove the ORDER BY in the reducers and rely
entirely on Hadoop's secondary sorting instead). But there were some bugs in
the secondary key sort optimization for this case, and it was recently
removed from trunk.
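
The idea behind secondary sorting can be sketched in plain Python. This is an
illustrative model under stated assumptions, not Pig or Hadoop code: records
are sorted by the composite key (group_id, position) during the shuffle but
grouped by group_id alone, so each group's values arrive already in position
order. The function name and sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def concat_with_secondary_sort(rows, sep=","):
    """rows: iterable of (group_id, position, name) tuples (hypothetical data)."""
    # Model of the shuffle: sort by the composite key (group_id, position).
    shuffled = sorted(rows, key=itemgetter(0, 1))

    # Model of the reduce side: group by group_id only. Within each group the
    # values already arrive ordered by position, so no extra sort is required.
    return {
        group_id: sep.join(name for _, _, name in members)
        for group_id, members in groupby(shuffled, key=itemgetter(0))
    }

rows = [(1, 2, "b"), (2, 1, "x"), (1, 1, "a"), (2, 2, "y"), (1, 3, "c")]
print(concat_with_secondary_sort(rows))
```

The reduce side here only concatenates; the ordering comes entirely from the
sort on the composite key.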

Thanks,
Cheolsoo

On Mon, May 13, 2013 at 7:52 AM, Ahmed Eldawy <[EMAIL PROTECTED]> wrote:

> Hi,
>   I have a dataset with three columns: group_id, position, and name. I
> need for each group to generate a concatenated string of all names ordered
> by their position. I can do this by sorting all data based on position, (or
> group_id and position), then grouping them by group_id, and finally
> concatenating names in each group. I have two questions here,
> 1- Does this really work? In other words, does the GROUP BY operator retain
> order?
> 2- What is the most efficient way to do it? Is it better, if possible, to
> group first and then sort?  Let's say I order by the pair (group_id,
> position) first, can this be hinted to Pig to make the group by faster?
> Thanks for your help
>
>
> Best regards,
> Ahmed Eldawy
>