Pig >> mail # user >> DISTINCT with 2 fields in a tuple


Mohit Anchlia 2012-04-11, 20:53
Mehmet Tepedelenlioglu 2012-04-11, 21:04
Prashant Kommireddi 2012-04-11, 20:57
Mohit Anchlia 2012-04-11, 21:06

Re: DISTINCT with 2 fields in a tuple
Hi,

DISTINCT after the FOREACH is more efficient than grouping. As long as you
don't need the rest of the data, you are better off with this solution.

With the syntax A.FORM_ID, A.SET_ID you are invoking scalar projection;
that is, you are telling Pig to treat the value as a scalar. The right
syntax is the first one (without the "A." in front).
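
As a minimal sketch of the two approaches discussed in this thread (it reuses
the LOAD statement from the original question; grouping by FILE_NAME and the
aliases G, D, pairs and uniq are assumptions made only for illustration):

```pig
-- Load the data with the schema from the original question.
A = LOAD '/examples/form_out/part-m-00000' USING PigStorage('\t') AS
    (FILE_NAME:chararray, FORM_ID:chararray, SET_ID:chararray);

-- Approach 1: project the two fields, then de-duplicate the whole relation.
-- Roughly equivalent to: SELECT DISTINCT FORM_ID, SET_ID FROM A;
B = FOREACH A GENERATE FORM_ID, SET_ID;
C = DISTINCT B;
DUMP C;

-- Approach 2 (assumed key: FILE_NAME): a nested DISTINCT needs a bag,
-- which only exists after a GROUP. Inside the nested block, A refers to
-- the bag of records for the current group, so A.(...) is a bag projection.
G = GROUP A BY FILE_NAME;
D = FOREACH G {
    pairs = A.(FORM_ID, SET_ID);
    uniq = DISTINCT pairs;
    GENERATE group, uniq;
};
DUMP D;
```

Approach 1 is the one to prefer when only the distinct (FORM_ID, SET_ID) pairs
are needed; approach 2 is only useful when the distinct pairs are wanted per
key.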

Cheers,
--
Gianmarco

On Wed, Apr 11, 2012 at 23:06, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> Thanks, I tried something like this and it worked, but I have one more
> question:
>
>
> grunt> B = foreach A GENERATE FORM_ID, SET_ID;
>
> grunt> C = DISTINCT B;
>
> What's the difference between foreach A GENERATE FORM_ID, SET_ID; and
> foreach A GENERATE A.FORM_ID, A.SET_ID;? To me they look the same, but the
> results are different.
>
> On Wed, Apr 11, 2012 at 1:57 PM, Prashant Kommireddi <[EMAIL PROTECTED]
> >wrote:
>
> > You are doing a DISTINCT on a tuple, and not a bag?
> >
> > In your example, DISTINCT on a field name in each record/tuple would not
> > make sense, as it's always a single value. You need to group by a certain
> > key before a DISTINCT.
> >
> >
> > On Wed, Apr 11, 2012 at 1:53 PM, Mohit Anchlia <[EMAIL PROTECTED]
> > >wrote:
> >
> > > I am trying to get distinct values from 2 fields in a record, something
> > > like select distinct a, b from c; So I wrote this in Pig, which is
> > > actually not working. I did:
> > >
> > >
> > > A = LOAD '/examples/form_out/part-m-00000' USING PigStorage('\t') AS
> > > (FILE_NAME:chararray,FORM_ID:chararray,SET_ID:chararray);
> > >
> > > B = foreach A {dist = DISTINCT A.FORM_ID, A.SET_ID; GENERATE dist;}
> > >
> > > ERROR 1000: Error during parsing. Invalid alias: A in {FILE_NAME: chararray
> > > ...
> > >
> > > But this doesn't seem to be working. I thought A is a tuple and FORM_ID
> > > and SET_ID are fields that I can do DISTINCT on. I saw a similar example
> > > online, but not exactly the same.
> > >
> >
>
Mohit Anchlia 2012-04-12, 14:55
Gianmarco De Francisci Mo... 2012-04-12, 15:01