Pig >> mail # user >> Dedupe Logic

Ambastha, Abhishek 2013-08-24, 18:03
Re: Dedupe Logic

You should be able to do this by grouping on the three key columns, then
ordering each group by the fourth column and keeping only the first
record, all inside a nested foreach.


data = load 'some_url' as (f11, f12, f13, f14);

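deduped = foreach (group data by (f11, f12, f13)) {
            -- within each (f11, f12, f13) group, smallest f14 first
            ordered = order data by f14 asc;
            -- keep only that first record
            one_rec = limit ordered 1;
            generate flatten(one_rec) as (f11, f12, f13, f14);
};
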
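
For reference, here is a slightly fuller sketch of the same idea that also covers the final sort asked about in the original question. The input and output paths, the tab delimiter, and the column types are assumptions for illustration only, and only the four key columns are declared (the real file has 70). Declaring f14 as a numeric type matters: without a type it is treated as a bytearray, so the "minimum" may not be the numeric minimum.

```pig
-- Hypothetical path, delimiter and types; adjust to the real 70-column schema.
data = LOAD 'input/records.tsv' USING PigStorage('\t')
       AS (f11:chararray, f12:chararray, f13:chararray, f14:long);

-- One group per distinct (f11, f12, f13) key.
grouped = GROUP data BY (f11, f12, f13);

-- Inside each group, sort by f14 and keep the first (smallest) record.
deduped = FOREACH grouped {
    ordered = ORDER data BY f14 ASC;
    one_rec = LIMIT ordered 1;
    GENERATE FLATTEN(one_rec) AS (f11, f12, f13, f14);
};

-- After deduping there is one row per key, so sorting on the three key
-- columns already gives the requested overall order.
sorted = ORDER deduped BY f11, f12, f13;

STORE sorted INTO 'output/deduped' USING PigStorage('\t');
```

On the rank question: as far as I know, the RANK operator (Pig 0.11 and later) numbers rows across the whole relation rather than within each group, so by itself it does not pick one record per key; the nested foreach above is the more direct route.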
On Sat, 2013-08-24 at 18:03 +0000, Ambastha, Abhishek wrote:
> Hi,
> How can I sort and dedupe on multiple columns?
> I have a 5 GB file with 70 columns. I want to sort on four columns: f11, f12, f13 and f14. Then I want to dedupe on three columns, f11, f12 and f13, so that the minimum value of f14 is retained (that is, pick the first record after the sort). Please suggest how to do this.
> Also, can this be done using the rank function?
> Regards,
> Abhishek
Ambastha, Abhishek 2013-08-26, 15:40