Re: Does Pig guarantee output won't include duplicated rows?
Alan Gates 2012-05-18, 16:35
Pig doesn't create duplicates (at least it shouldn't) if that's what you're asking. But if there are duplicates in your data it won't take them out unless you instruct it to (by adding a 'distinct' to your script).
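If you want to confirm whether the duplicates are already present in the raw data, something along these lines may help. It is only a sketch: it assumes the three-column schema (entity_id, lat, lng) from the original script, and `$DUP_REPORT` is a made-up parameter name for the output path.

```pig
-- Sketch only: report every input row that occurs more than once.
-- Assumes the three-column schema from the original script;
-- $DUP_REPORT is a hypothetical output-path parameter.
A = LOAD '$INPUT' USING PigStorage('\t') AS (entity_id: long, lat: double, lng: double);

-- Group identical rows together by using all fields as the key.
G = GROUP A BY (entity_id, lat, lng);

-- COUNT_STAR counts every tuple in the bag, including ones with null fields.
N = FOREACH G GENERATE FLATTEN(group) AS (entity_id, lat, lng), COUNT_STAR(A) AS n;

-- Keep only the rows that appear more than once.
DUPS = FILTER N BY n > 1;
STORE DUPS INTO '$DUP_REPORT';
```

If that report comes back empty, the input has no exact duplicate rows and the cause is likely elsewhere; if it doesn't, a DISTINCT on the affected relations is the thing to add.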
On May 18, 2012, at 9:28 AM, Brendan Gill wrote:
> Sure, I know there's an easy fix. I was just wondering if that is really
> the case - that Pig doesn't guarantee non duplicated results?
> If so, it seems like we'll want to add a distinct filter to almost every
> query we write, which seems a bit redundant.
> On Fri, May 18, 2012 at 4:58 PM, Kris Coward <[EMAIL PROTECTED]> wrote:
>> C = DISTINCT B;
>> STORE C INTO '$OUTPUT';
>> On Fri, May 18, 2012 at 04:55:23PM +0100, Brendan Gill wrote:
>>> Hi all,
>>> We've been getting some funny outputs from some Pig jobs recently that
>>> contain a lot of duplicated data. I'm wondering if the cause of this
>>> could be Pig, or if we must have duplicates in our raw data set (which is
>>> very possible).
>>> We're running simple Pig jobs that are just filtering a subset of our
>>> data based on co-ordinates, e.g.:
>>> A = LOAD '$INPUT' USING PigStorage('\t') as (entity_id: long, lat: double,
>>> lng: double);
>>> B = FILTER A BY (lat > 37.708) AND (lat < 37.817) AND (lng > -122.519) AND
>>> (lng < -122.356);
>>> STORE B INTO '$OUTPUT';
>> Kris Coward http://unripe.melon.org/
>> GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3