

Mohit Anchlia 2012-04-11, 20:53
Mehmet Tepedelenlioglu 2012-04-11, 21:04
Prashant Kommireddi 2012-04-11, 20:57
Mohit Anchlia 2012-04-11, 21:06
Gianmarco De Francisci Mo... 2012-04-12, 14:49
Mohit Anchlia 2012-04-12, 14:55
Re: DISTINCT with 2 fields in a tuple
Exactly like you posted.

Cheers,
--
Gianmarco

On Thu, Apr 12, 2012 at 16:55, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> How can I do distinct with foreach? Are those 2 separate statements, like the
> ones I posted, or something different?
>
> On Thu, Apr 12, 2012 at 7:49 AM, Gianmarco De Francisci Morales <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > Distinct with the foreach is more efficient than grouping. As long as you
> > don't need the rest of the data, you are better off with this solution.
> >
> > With the syntax A.FORM_ID, A.SET_ID you are invoking scalar projection,
> > that is, you are telling Pig to treat the value as a scalar. The right
> > syntax is the first one (without the "A." in front).
> >
> > Cheers,
> > --
> > Gianmarco
> >
> >
> >
> > On Wed, Apr 11, 2012 at 23:06, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> >
> > > > Thanks, I tried something like this and it worked, but I have one more
> > > > question:
> > > >
> > > >
> > > > grunt> B = foreach A GENERATE FORM_ID, SET_ID;
> > > >
> > > > grunt> C = DISTINCT B;
> > > >
> > > > What's the difference between foreach A GENERATE FORM_ID, SET_ID; and
> > > > foreach A GENERATE A.FORM_ID, A.SET_ID;? To me they look the same, but
> > > > the results are different.
> > >
> > > > On Wed, Apr 11, 2012 at 1:57 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
> > >
> > > > > You are doing a distinct on a Tuple, and not a Bag?
> > > > >
> > > > > In your example, DISTINCT on a field name on each record/tuple would not
> > > > > make sense, as it's always a single value. You need to group by a certain
> > > > > key before a distinct.
> > > >
> > > >
> > > > > On Wed, Apr 11, 2012 at 1:53 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > > I am trying to get the distinct values of 2 fields in a record, something
> > > > > > like select distinct a, b from c; So I wrote this in Pig, which is actually
> > > > > > not working. I did:
> > > > >
> > > > >
> > > > > > A = LOAD '/examples/form_out/part-m-00000' USING PigStorage('\t') AS
> > > > > > (FILE_NAME:chararray,FORM_ID:chararray,SET_ID:chararray);
> > > > >
> > > > > B = foreach A {dist = DISTINCT A.FORM_ID, A.SET_ID; GENERATE dist;}
> > > > >
> > > > > > ERROR 1000: Error during parsing. Invalid alias: A in {FILE_NAME: chararray
> > > > > > ...
> > > > >
> > > > > > But this doesn't seem to be working. I thought A is a tuple and form_id and
> > > > > > set_id are fields that I can do DISTINCT on. I saw a similar example online,
> > > > > > but not exactly the same.
> > > > >
> > > >
> > >
> >
>
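
For reference, the approach the thread settles on can be sketched as a single script. This is only a minimal sketch assembled from the statements quoted above: the input path and schema are the ones from the original post, and the final DUMP is an addition for inspecting the result, not something posted in the thread.

```pig
-- Load the tab-separated input with the schema from the original post.
A = LOAD '/examples/form_out/part-m-00000' USING PigStorage('\t')
    AS (FILE_NAME:chararray, FORM_ID:chararray, SET_ID:chararray);

-- Project only the two fields of interest. Note the bare field names:
-- writing A.FORM_ID here would be read as a scalar projection instead.
B = FOREACH A GENERATE FORM_ID, SET_ID;

-- DISTINCT works on whole tuples of a relation, so this removes
-- duplicate (FORM_ID, SET_ID) pairs, much like SQL's SELECT DISTINCT.
C = DISTINCT B;

-- Added for illustration only: print the result.
DUMP C;
```

The nested form in the original post fails because DISTINCT inside a FOREACH block operates on a bag, which typically comes from a preceding GROUP; at the top level, projecting first and then applying DISTINCT to the relation needs no grouping.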