Pig >> mail # user >> Does Pig guarantee output won't include duplicated rows?


Re: Does Pig guarantee output won't include duplicated rows?

C = DISTINCT B;

STORE C INTO '$OUTPUT';
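
FILTER evaluates each row independently and only drops rows, so it cannot create copies by itself; duplicates in B would already be present in whatever '$INPUT' resolves to. DISTINCT removes them at the cost of an extra reduce phase. To confirm where they come from, a rough sketch along these lines could report which raw rows occur more than once. It assumes the same three-column, tab-separated input as the original script, and '$DUP_REPORT' is a hypothetical output parameter that is not part of the original job:

```pig
-- Sketch only: same schema as the original script is assumed.
A = LOAD '$INPUT' USING PigStorage('\t') AS (entity_id: long, lat: double, lng: double);

-- Group identical rows together and count how many times each one occurs.
G = GROUP A BY (entity_id, lat, lng);
N = FOREACH G GENERATE FLATTEN(group) AS (entity_id, lat, lng), COUNT_STAR(A) AS copies;

-- Keep only rows that appear more than once in the raw input.
DUPS = FILTER N BY copies > 1;

-- '$DUP_REPORT' is a hypothetical parameter; pass it with -param like '$INPUT'.
STORE DUPS INTO '$DUP_REPORT';
```

If that report comes back empty while the final output still has repeats, the raw data is not the source and the job setup (for example, the set of paths that '$INPUT' expands to) would be the next thing to check.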

-Kris

On Fri, May 18, 2012 at 04:55:23PM +0100, Brendan Gill wrote:
> Hi all,
>
> We've been getting some odd output from some of our Pig jobs recently that
> contains a lot of duplicated data.  I'm wondering whether Pig could be the
> cause, or whether we must have duplicates in our raw data set (which is
> very possible).
>
> We're running simple Pig jobs that just filter a subset of our data
> based on coordinates, e.g.:
>
> A = LOAD '$INPUT' USING PigStorage('\t') AS (entity_id: long, lat: double, lng: double);
>
> B = FILTER A BY (lat > 37.708) AND (lat < 37.817) AND (lng > -122.519) AND (lng < -122.356);
>
> STORE B INTO '$OUTPUT';
>
> Thanks.

--
Kris Coward http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3