Pig >> mail # user >> Does Pig guarantee output won't include duplicated rows?


Re: Does Pig guarantee output won't include duplicated rows?

C = DISTINCT B;

STORE C INTO '$OUTPUT';
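
FILTER works row by row, so it should pass through whatever duplicates are
already in the input. To check whether the raw data is the source, a rough,
untested sketch along these lines could count repeated rows. $DUPES is a
made-up parameter for a second output path, and this assumes your Pig
version has COUNT_STAR (COUNT(A) also works, but it skips rows whose first
field is null):

-- Same load statement as in the original script
A = LOAD '$INPUT' USING PigStorage('\t') AS (entity_id: long, lat: double, lng: double);

-- Group identical rows together by using every field as the key
G = GROUP A BY (entity_id, lat, lng);

-- One output row per distinct input row, with the number of copies seen
N = FOREACH G GENERATE FLATTEN(group) AS (entity_id, lat, lng), COUNT_STAR(A) AS copies;

-- Keep only the rows that occur more than once
D = FILTER N BY copies > 1;

STORE D INTO '$DUPES';

If $DUPES comes out empty, the input has no exact duplicate rows and the
problem is somewhere else.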

-Kris

On Fri, May 18, 2012 at 04:55:23PM +0100, Brendan Gill wrote:
> Hi all,
>
> Some of our Pig jobs have recently been producing odd output that
> contains a lot of duplicated data.  I'm wondering whether Pig could be
> the cause, or whether the duplicates must already be in our raw data set
> (which is very possible).
>
> We're running simple Pig jobs that are just filtering a subset of our data
> based on co-ordinates e.g.:
>
> A =  LOAD '$INPUT' USING PigStorage('\t') as (entity_id: long, lat: double,
> lng: double);
>
> B =  FILTER A BY (lat > 37.708) AND (lat < 37.817) AND (lng > -122.519) AND
> (lng < -122.356);
>
> STORE B INTO '$OUTPUT';
>
> Thanks.

--
Kris Coward http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3