Pig, mail # user - Dedupe Logic


Ambastha, Abhishek 2013-08-24, 18:03
Re: Dedupe Logic
Jacob Perkins 2013-08-24, 18:19
Abhishek,

You should be able to do this by grouping by the three columns and then
ordering by the fourth in a nested foreach.

For example:

data = load 'some_url' as (f11, f12, f13, f14);

-- group on the three dedupe keys, then keep the row with the smallest f14
deduped = foreach (group data by (f11,f12,f13)) {
            ordered = order data by f14 asc;
            one_rec = limit ordered 1;
            generate
              flatten(one_rec) as (f11, f12, f13, f14);
          };
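
If only the four key columns are needed downstream, the same result can probably be had without the nested sort. A minimal sketch, not part of the original reply, assuming f14 is numeric (declared as long here) and reusing the placeholder path 'some_url'; the alias mins is made up for illustration:

```pig
-- Assumption: f14 is numeric; it is declared as long here for illustration.
data = load 'some_url' as (f11, f12, f13, f14:long);

-- One output row per (f11, f12, f13) key, carrying the smallest f14 in the group.
mins = foreach (group data by (f11, f12, f13)) generate
         flatten(group) as (f11, f12, f13),
         MIN(data.f14) as f14;
```

This drops every column other than the four named, so when the full 70-column record has to be kept, the nested order/limit form is the one to use.
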
--jacob
@thedatachef
On Sat, 2013-08-24 at 18:03 +0000, Ambastha, Abhishek wrote:
> Hi,
>
> How can I sort and dedupe on multiple columns?
>
> I have a 5 GB file with 70 columns. I want to sort on four columns: f11, f12, f13 and f14. Then I want to dedupe on three columns, f11, f12 and f13, so that the minimum value of f14 is retained (that is, pick up the first record after the sort). Please suggest how to do this.
>
> Also, can this be done using the rank function?
>
> Regards,
> Abhishek
Ambastha, Abhishek 2013-08-26, 15:40