Re: Dedupe Logic
Abhishek,

You should be able to do this by grouping by the three columns and then
ordering by the fourth in a nested foreach.

For example:

data = load 'some_url' as (f11, f12, f13, f14);

-- group on the three dedupe columns, then keep the record with the smallest f14
deduped = foreach (group data by (f11,f12,f13)) {
            ordered = order data by f14 asc;
            one_rec = limit ordered 1;
            generate
              flatten(one_rec) as (f11, f12, f13, f14);
          };
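
A slightly fuller sketch of the same approach, in case it helps. The file name 'sample.tsv', the tab delimiter, and the column types are assumptions for illustration only and are not from the original question. Declaring f14 as int is deliberate: fields loaded without a type default to bytearray, and as far as I know the ordering would then not necessarily be numeric. For the real 70-column file the load schema would need to list the remaining columns too; flatten(one_rec) without an "as" clause keeps whatever fields the record has.

```pig
-- Hypothetical sample input, tab-separated (file name and types are assumptions):
--   a  b  c  5
--   a  b  c  2
--   x  y  z  7
data = load 'sample.tsv' using PigStorage('\t')
       as (f11:chararray, f12:chararray, f13:chararray, f14:int);

-- One group per distinct (f11, f12, f13) key.
grouped = group data by (f11, f12, f13);

deduped = foreach grouped {
            -- Within each group, sort on f14 and keep the first (smallest) record.
            ordered = order data by f14 asc;
            one_rec = limit ordered 1;
            generate flatten(one_rec);
          };

dump deduped;
```

With the sample rows above, this should leave one record for the (a, b, c) key, the one with f14 = 2, plus the single (x, y, z) record. The order of the output rows is not guaranteed.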
--jacob
@thedatachef
On Sat, 2013-08-24 at 18:03 +0000, Ambastha, Abhishek wrote:
> Hi,
>
> How can I sort and dedupe on multiple columns?
>
> I have a 5 GB file with 70 columns. I want to sort on four columns: f11, f12, f13 and f14. Then I want to dedupe on three columns (f11, f12 and f13) so that the minimum value of f14 is retained (that is, pick up the first record after the sort). Please suggest how to do this.
>
> Also, can this be done using the rank function?
>
> Regards,
> Abhishek