-
"Exploding" a Hive array<string> in Pig from an RCFile
Malcolm Tye 2012-04-05, 12:58
Hi, I'm storing data into a partitioned table using Hive in RCFile format, but I want to use Pig to do the aggregation of that data.
In my array<string> column in Hive, I have colon-delimited data, e.g.
:0:12:21:99:
With the lateral view and explode functions in Hive, I can output each value as a separate row.
In Pig, I think I need to use FLATTEN, but it just outputs the array as a single field, and I can't see where to specify the delimiter that separates the values.
register /opt/pig/trunk/bin/piggybank.jar
mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID string,seg_ids array<string>');
opt = foreach mt generate C_SUB_ID, flatten(seg_ids) as s_seg_id;
dump opt;
Thanks
Malc
-
Re: "Exploding" a Hive array<string> in Pig from an RCFile
Norbert Burger 2012-04-06, 15:00
Malcolm -- typically, you'd use STRSPLIT and an optional FLATTEN to tokenize a chararray on some delimiter. So the following should work:
opt = foreach mt generate C_SUB_ID, flatten(STRSPLIT(seg_ids,':')) as s_seg_id;
Norbert
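Editorial sketch: the snippet below models, in plain Python rather than Pig, what STRSPLIT followed by FLATTEN does to a single record. The helper names `strsplit` and `flatten_tuple` are hypothetical stand-ins, and the trailing-empty-string handling is an assumption based on STRSPLIT following Java's `String.split` with its default limit. It is an approximation for illustration, not Pig's implementation.

```python
import re


def strsplit(text, regex):
    """Rough model of STRSPLIT with its default limit: split on a regex and
    drop trailing empty strings, as Java's String.split does (assumption)."""
    parts = re.split(regex, text)
    while parts and parts[-1] == "":
        parts.pop()
    return tuple(parts)


def flatten_tuple(prefix, fields):
    """Flattening a tuple splices its fields into the same row."""
    return prefix + fields


# One record: the id plus its colon-delimited segment ids.
row = ("1133957209", "61:0:1")
print(flatten_tuple((row[0],), strsplit(row[1], ":")))

# With the sample value from the question, the leading delimiter
# leaves an empty first token that may need filtering out.
print(strsplit(":0:12:21:99:", ":"))
```

In this model the split values end up as extra columns of the same row, so a single input record still produces a single output record.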
-
RE: "Exploding" a Hive array<string> in Pig from an RCFile
Malcolm Tye 2012-04-11, 22:59
Hi Norbert, I don't seem to be getting what I'm after. If my data looks like this
1133957209,61:0:1
4524524233,21:0
I want to produce
1133957209,61
1133957209,0
1133957209,1
4524524233,21
4524524233,0
I changed the LOAD statement to
mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID string,seg_ids array');
opt = foreach mt generate C_SUB_ID, FLATTEN(STRSPLIT(seg_ids,':')) as s_seg_id;
I don't seem to be getting the cross product, just something like the following
1133957209,61,0,1
4524524233,21,0
Any ideas?
Thanks
Malc
-
Re: "Exploding" a Hive array<string> in Pig from an RCFile
Norbert Burger 2012-04-12, 03:14
A little wonky, but try wrapping the flattened tuple elements in a bag, and then re-flattening that:
A = LOAD 'test.txt' USING PigStorage(',') AS (C_SUB_ID:chararray,seg_ids:chararray);
B = FOREACH A GENERATE C_SUB_ID,FLATTEN(STRSPLIT(seg_ids,':'));
C = FOREACH B GENERATE $0,FLATTEN(TOBAG($1..));
Only flattened bags generate the columns-to-rows transformation that you're trying to make. Flattened tuples, on the other hand, simply explode the tuple into its component elements, without creating the multiple rows (the "cross product") in your relation. A custom UDF would be another option here.
Norbert
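Editorial sketch: the difference between flattening a tuple and flattening a bag can be modeled in a few lines of plain Python. This is not Pig code. `flatten_tuple` and `flatten_bag` are hypothetical helpers, `str.split` stands in for STRSPLIT, and a Python list stands in for the bag built by TOBAG. Real Pig bags are unordered, so the row order shown here is not guaranteed in Pig.

```python
def flatten_tuple(key, fields):
    """Model of FLATTEN on a tuple: the fields are spliced into one row."""
    return [(key, *fields)]


def flatten_bag(key, bag):
    """Model of FLATTEN on a bag: one output row per element of the bag."""
    return [(key, element) for element in bag]


records = [("1133957209", "61:0:1"), ("4524524233", "21:0")]

wide_rows = []  # like B: the STRSPLIT result flattened as a tuple
tall_rows = []  # like C: the same fields wrapped in a bag, then flattened
for key, seg_ids in records:
    fields = seg_ids.split(":")  # stands in for STRSPLIT(seg_ids, ':')
    wide_rows += flatten_tuple(key, fields)
    tall_rows += flatten_bag(key, fields)  # stands in for TOBAG($1..)

for row in wide_rows:
    print(",".join(row))
print("--")
for row in tall_rows:
    print(",".join(row))
```

The first group of printed lines matches the output Malcolm reported (one row per id), and the second matches the output he wanted (one row per id and segment pair).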
-
Re: "Exploding" a Hive array<string> in Pig from an RCFile
Aniket Mokashi 2012-04-12, 09:38
Hi Malcolm,
Arrays are converted to tuples, and FLATTEN should work on them directly. I don't think you need to worry about the delimiter (assuming Hive knows how to deserialize it). By the way, does RCFile require a delimiter to store arrays? I am not sure about that.
Thanks,
Aniket
-- "...:::Aniket:::... Quetzalco@tl"
-
RE: "Exploding" a Hive array<string> in Pig from an RCFile
Malcolm Tye 2012-05-03, 12:29
Hi Norbert,
Thanks for your answer. I'm just documenting the problems I experienced and will reply to the list soon with a detailed answer.
Thanks for your help
Malc