Pig >> mail # user >> dereference bag of tuples of fields


Rodriguez, John 2010-07-30, 22:10
Thejas M Nair 2010-07-30, 22:38
Rodriguez, John 2010-07-31, 16:35
Scott Carey 2010-07-31, 16:39
Rodriguez, John 2010-08-01, 14:48

Re: dereference bag of tuples of fields
If you are loading data through PigStorage (which is used if you
don't specify a loader), then there should be a comma separating the
tuples in the bag, so your data should look like this:

cat data
{(1,1,1)}
{(2,2,2),(3,3,3)}
{(4,4,4),(5,5,5),(6,6,6)}

then
grunt> A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});
grunt> C = foreach A generate B.t1, B.t2, B.t3;
grunt> dump C;

({(1)},{(1)},{(1)})
({(2),(3)},{(2),(3)},{(2),(3)})
({(4),(5),(6)},{(4),(5),(6)},{(4),(5),(6)})
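
As a hedged aside on the original question of grouping by a field inside
the bag: one approach that should work is to FLATTEN the bag first, so
that each inner tuple becomes its own row, and then GROUP on the
resulting field. This is a minimal sketch, assuming the comma-separated
'data' file above; the aliases D and E are made up for illustration.

```pig
A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});
-- FLATTEN un-nests the bag: one output row per tuple in B
D = FOREACH A GENERATE FLATTEN(B) AS (t1:int, t2:int, t3:int);
-- t1 is now an ordinary top-level field, so it can serve as a group key
E = GROUP D BY t1;
DUMP E;
```

The same pattern would presumably apply to the schema in the original
question (FLATTEN(data) AS isValid, then GROUP ... BY isValid), though
it has not been run against MyLoader.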

Ashutosh

On Sun, Aug 1, 2010 at 07:48, Rodriguez, John <[EMAIL PROTECTED]> wrote:
> Does this mean there is no way to access the fields t1, t2, t3?
>
> cat data
> {(1,1,1)}
> {(2,2,2)(3,3,3)}
> {(4,4,4)(5,5,5)(6,6,6)}
>
> A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});
>
> From: Scott Carey [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, July 31, 2010 9:39 AM
> To: [EMAIL PROTECTED]; Rodriguez, John
> Subject: Re: dereference bag of tuples of fields
>
>
>
> data.isValid
>
> All bags are bags of tuples. The tuple is intrinsic and invisible at
> the syntax level, though it is visible to UDFs. If you nest one more
> tuple in that nested tuple, Pig gets confused. So 'bag.field' is
> actually a double dereference: one for the bag and one for the
> intrinsic tuple.
>
> ----- Reply message -----
> From: "Rodriguez, John" <[EMAIL PROTECTED]>
> Date: Fri, Jul 30, 2010 3:11 pm
> Subject: dereference bag of tuples of fields
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>
> I have built a bag of tuples where the tuples contain fields.
>
>
>
> I am reading SequenceFiles and have written MyLoader to do this. I
> created a subset of all the fields, "isValid", to make the example
> simpler.
>
>
>
> I am not sure how to apply a dereference operator to this?
>
>
>
> A = LOAD '/data/NetFlowDigests/rk/DigestMessage/part-r-00000' using
> MyLoader() AS (data: bag{t: tuple(isValid:int)});
>
> DESCRIBE A;
>
> A: {data: {t: (isValid: int)}}
>
>
>
> So all the ways that I have tried to dereference have syntax errors.
>
>
>
> B = GROUP A BY (data.t);
>
> 2010-07-30 21:51:29,881 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1028: Access to the tuple (t) of the bag is disallowed. Only
> access to the elements of the tuple in the bag is allowed.
>
>
>
> B = GROUP A BY (data.t.isValid);
>
> 2010-07-30 21:54:11,157 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1028: Access to the tuple (t) of the bag is disallowed. Only
> access to the elements of the tuple in the bag is allowed.
>
>
>
> B = GROUP A BY (t.isValid);
>
> 2010-07-30 21:55:31,475 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1000: Error during parsing. Invalid alias: t in {data: {t:
> (isValid: int)}}
>
>
>
> What is the proper way to do this?
>
>
>
> John Rodriguez
>
>
>
>
Rodriguez, John 2010-08-02, 17:04
Rodriguez, John 2010-08-02, 19:35
Xiaomeng Wan 2010-08-03, 17:16