-
DISTINCT with 2 fields in a tuple
Mohit Anchlia 2012-04-11, 20:53
I am trying to get the distinct combinations of 2 fields in a record, something like select distinct a, b from c; in SQL. I wrote the following in Pig, but it does not work:
A = LOAD '/examples/form_out/part-m-00000' USING PigStorage('\t') AS (FILE_NAME:chararray,FORM_ID:chararray,SET_ID:chararray);
B = foreach A {dist = DISTINCT A.FORM_ID, A.SET_ID; GENERATE dist;}
ERROR 1000: Error during parsing. Invalid alias: A in {FILE_NAME: chararray ...
I thought A was a tuple and that FORM_ID and SET_ID were fields I could run DISTINCT on. I saw a similar example online, but not exactly the same one.
-
Re: DISTINCT with 2 fields in a tuple
Prashant Kommireddi 2012-04-11, 20:57
You are doing a DISTINCT on a tuple, and not on a bag?
In your example, DISTINCT on a field of each record/tuple would not make sense, as it is always a single value. You need to group on some key before applying DISTINCT.
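The "group first, then DISTINCT" idea above can be sketched roughly as follows. This is only an illustration: grouping by FILE_NAME is an assumption made for the example (the thread does not say which key is wanted), and the aliases G, R, pairs and dist are hypothetical names.

```pig
-- Same input and schema as in the original post.
A = LOAD '/examples/form_out/part-m-00000' USING PigStorage('\t')
    AS (FILE_NAME:chararray, FORM_ID:chararray, SET_ID:chararray);

-- Assumption: FILE_NAME is the grouping key. After GROUP, each output
-- record holds the key in 'group' and a bag named A with its records.
G = GROUP A BY FILE_NAME;

-- Inside the nested block, A refers to that per-group bag, so it can
-- be projected to two fields and de-duplicated.
R = FOREACH G {
    pairs = A.(FORM_ID, SET_ID);
    dist  = DISTINCT pairs;
    GENERATE group AS FILE_NAME, dist;
};
```

This gives the distinct (FORM_ID, SET_ID) pairs per FILE_NAME, which is not the same as a global select distinct a, b from c.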
-
Re: DISTINCT with 2 fields in a tuple
Mehmet Tepedelenlioglu 2012-04-11, 21:04
Just group on those 2 fields. The 'group' field of the output will contain all the distinct combinations. That is, of course, if that is what you wanted to do in the first place. So no 'DISTINCT' is really necessary.
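A minimal sketch of the grouping approach described above, assuming relation A was loaded with the schema from the original post. The aliases G and D are hypothetical names chosen for the example.

```pig
-- Group on both fields; each distinct (FORM_ID, SET_ID) combination
-- becomes exactly one output record, with the pair stored in 'group'.
G = GROUP A BY (FORM_ID, SET_ID);

-- Flatten the 'group' tuple back into two plain columns.
D = FOREACH G GENERATE FLATTEN(group) AS (FORM_ID, SET_ID);
```

Grouping also keeps the original records in the bag A, so something like COUNT(A) could be added to the GENERATE if a count per combination is needed.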
-
Re: DISTINCT with 2 fields in a tuple
Mohit Anchlia 2012-04-11, 21:06
Thanks. I tried something like this and it worked, but I have one more question:
grunt> B = foreach A GENERATE FORM_ID, SET_ID;
grunt> C = DISTINCT B;
What's the difference between foreach A GENERATE FORM_ID, SET_ID; and foreach A GENERATE A.FORM_ID, A.SET_ID;? To me they look the same, but the results are different.
-
Re: DISTINCT with 2 fields in a tuple
Gianmarco De Francisci Mo... 2012-04-12, 14:49
Hi,
DISTINCT after the FOREACH is more efficient than grouping. As long as you don't need the rest of the data, you are better off with this solution.
With the syntax A.FORM_ID, A.SET_ID you are invoking scalar projection; that is, you are telling Pig to treat the value as a scalar. The right syntax is the first one (without the "A." in front).
Cheers, -- Gianmarco
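To make the distinction above concrete, here is a rough sketch that contrasts plain field projection with a case where scalar projection is actually intended. It assumes relation A from the original post and a Pig version that supports relation-to-scalar casts. The aliases G, N, R and the field name total are hypothetical.

```pig
-- Plain projection: inside FOREACH, field names refer to the current
-- record of A, so no "A." prefix is needed.
B = FOREACH A GENERATE FORM_ID, SET_ID;
C = DISTINCT B;

-- Scalar projection is meant for a relation known to hold a single
-- row, referenced from a different statement. Here N has one row.
G = GROUP C ALL;
N = FOREACH G GENERATE COUNT(C) AS total;

-- N.total is read as a scalar and attached to every record of C.
R = FOREACH C GENERATE FORM_ID, SET_ID, (long)N.total AS total;
```

Writing A.FORM_ID inside a FOREACH over A asks Pig to treat all of A as such a single-row relation, which is why it does not behave like the plain projection.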
-
Re: DISTINCT with 2 fields in a tuple
Mohit Anchlia 2012-04-12, 14:55
How can I do DISTINCT with FOREACH? Are those 2 separate statements, like the ones I posted, or something different?
-
Re: DISTINCT with 2 fields in a tuple
Gianmarco De Francisci Mo... 2012-04-12, 15:01
Exactly like you posted.
Cheers, -- Gianmarco