Pig >> mail # user >> Does Pig guarantee output won't include duplicated rows?


Re: Does Pig guarantee output won't include duplicated rows?
Pig doesn't create duplicates (at least it shouldn't) if that's what you're asking.  But if there are duplicates in your data it won't take them out unless you instruct it to (by adding a 'distinct' to your script).
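
To check whether the duplicates are already in the raw data, one rough sketch is to group on every field and count the rows per group. This assumes the same three-column, tab-separated schema as the script quoted below; the `$DUPS_OUTPUT` parameter and the alias names (`G`, `COUNTS`, `DUPS`) are made up for illustration.

```pig
-- Load the raw data with the schema from the original script.
A = LOAD '$INPUT' USING PigStorage('\t') AS (entity_id: long, lat: double, lng: double);

-- Group identical rows together by using all fields as the key.
G = GROUP A BY (entity_id, lat, lng);

-- Count how many times each distinct row occurs in the input.
COUNTS = FOREACH G GENERATE FLATTEN(group) AS (entity_id, lat, lng), COUNT_STAR(A) AS n;

-- Keep only the rows that occur more than once.
DUPS = FILTER COUNTS BY n > 1;

-- $DUPS_OUTPUT is a hypothetical parameter, not part of the original script.
STORE DUPS INTO '$DUPS_OUTPUT';
```

If `DUPS` comes back empty, the input itself has no repeated rows and the cause is somewhere else. If it has rows, a FILTER will simply pass those repeats through unchanged.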

Alan.

On May 18, 2012, at 9:28 AM, Brendan Gill wrote:

> Sure, I know there's an easy fix.  I was just wondering if that is really
> the case - that Pig doesn't guarantee non-duplicated results?
>
> If so, it seems like we'll want to add a distinct filter to almost every
> query we write, which seems a bit redundant.
>
>
>
> On Fri, May 18, 2012 at 4:58 PM, Kris Coward <[EMAIL PROTECTED]> wrote:
>
>>
>> C = DISTINCT B;
>>
>> STORE C INTO '$OUTPUT';
>>
>> -Kris
>>
>> On Fri, May 18, 2012 at 04:55:23PM +0100, Brendan Gill wrote:
>>> Hi all,
>>>
>>> We've been getting some funny output from some Pig jobs recently that
>>> contains a lot of duplicated data.  I'm wondering if the cause of this
>>> could be Pig, or if we must have duplicates in our raw data set (which is
>>> very possible).
>>>
>>> We're running simple Pig jobs that are just filtering a subset of our data
>>> based on co-ordinates e.g.:
>>>
>>> A =  LOAD '$INPUT' USING PigStorage('\t') as (entity_id: long, lat: double,
>>> lng: double);
>>>
>>> B =  FILTER A BY (lat > 37.708) AND (lat < 37.817) AND (lng > -122.519) AND
>>> (lng < -122.356);
>>>
>>> STORE B INTO '$OUTPUT';
>>>
>>> Thanks.
>>
>> --
>> Kris Coward                                     http://unripe.melon.org/
>> GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3
>>