Pig user mailing list: "Exploding" a Hive array<string> in Pig from an RCFile

Malcolm Tye 2012-04-05, 12:58
Norbert Burger 2012-04-06, 15:00
Malcolm Tye 2012-04-11, 22:59
Norbert Burger 2012-04-12, 03:14
Aniket Mokashi 2012-04-12, 09:38

RE: "Exploding" a Hive array<string> in Pig from an RCFile
Hi Norbert,
 Thanks for your answer. I'm just documenting the problems I
experienced and will reply to the list soon with a detailed answer.
Thanks for your help.
Malc
-----Original Message-----
From: Norbert Burger [mailto:[EMAIL PROTECTED]]
Sent: 12 April 2012 04:14
To: [EMAIL PROTECTED]
Subject: Re: "Exploding" a Hive array<string> in Pig from an RCFile

A little wonky, but try wrapping the flattened tuple elements in a bag, and
then re-flattening that:

A = LOAD 'test.txt' USING PigStorage(',') AS
(C_SUB_ID:chararray,seg_ids:chararray);
B = FOREACH A GENERATE C_SUB_ID,FLATTEN(STRSPLIT(seg_ids,':'));
C = FOREACH B GENERATE $0,FLATTEN(TOBAG($1..));

Only flattened bags generate the cols -> rows transformation that you're
trying to make.  Flattened tuples, on the other hand, simply explode the
tuple into its composite elements, but without creating the multiple rows
("cross product") in your relation.  A custom UDF would be another option
here.
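
The tuple-versus-bag distinction above can be modelled outside Pig. The following Python sketch is not Pig code and was not part of the original message; the helper names flatten_tuple, flatten_bag and explode are hypothetical and only mimic, under that assumption, how FLATTEN treats a tuple versus a bag. It uses the two sample rows from the quoted message further down.

```python
# A rough Python model of Pig's FLATTEN semantics (an illustration, not Pig itself).

def flatten_tuple(row_key, fields):
    """Model FLATTEN on a tuple: the fields become extra columns of ONE row."""
    return [(row_key, *fields)]


def flatten_bag(row_key, bag):
    """Model FLATTEN on a bag: every tuple in the bag produces its OWN row."""
    return [(row_key, *tup) for tup in bag]


def explode(row_key, seg_ids, sep=":"):
    """Split, wrap each element in a one-field tuple (like TOBAG), then flatten the bag."""
    parts = seg_ids.split(sep)        # plays the role of STRSPLIT
    bag = [(p,) for p in parts]       # plays the role of TOBAG($1..)
    return flatten_bag(row_key, bag)


rows = [("1133957209", "61:0:1"), ("4524524233", "21:0")]

# Flattening the tuple only widens each row.
wide = [r for key, segs in rows for r in flatten_tuple(key, segs.split(":"))]

# Flattening the bag multiplies the rows, one per segment id.
tall = [r for key, segs in rows for r in explode(key, segs)]

for r in tall:
    print(",".join(r))
```

In this model, wide keeps two rows with a varying number of columns, while tall holds one (C_SUB_ID, segment) pair per row, which is the shape being asked for.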

Norbert

On Wed, Apr 11, 2012 at 6:59 PM, Malcolm Tye
<[EMAIL PROTECTED]> wrote:

> Hi Norbert,
>            I don't seem to be getting what I'm after. If my data looks
> like this
>
> 1133957209,61:0:1
> 4524524233,21:0
>
> I want to produce
>
> 1133957209,61
> 1133957209,0
> 1133957209,1
> 4524524233,21
> 4524524233,0
>
> I changed the LOAD statement to
>
> mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
> org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID
> string,seg_ids
> array');
> opt = foreach mt generate C_SUB_ID, FLATTEN(STRSPLIT(seg_ids,':')) as
> s_seg_id;
>
> I don't seem to be getting the cross product, just something like the
> following
>
> 1133957209,61,0,1
> 4524524233,21,0
>
> Any ideas?
>
>
> Thanks
>
> Malc
>
>
> -----Original Message-----
> From: Norbert Burger [mailto:[EMAIL PROTECTED]]
> Sent: 06 April 2012 16:01
> To: [EMAIL PROTECTED]
> Subject: Re: "Exploding" a Hive array<string> in Pig from an RCFile
>
> Malcolm -- typically, you'd use a STRSPLIT and optional FLATTEN to
> tokenize a chararray on some delimiter.  So the following should work:
>
> opt = foreach mt generate C_SUB_ID, flatten(STRSPLIT(seg_ids,':')) as
> s_seg_id;
>
> Norbert
>
> On Thu, Apr 5, 2012 at 8:58 AM, Malcolm Tye
> <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >    I'm storing data into a partitioned table using Hive in RCFile
> > format, but I want to use Pig to do the aggregation of that data.
> >
> > In my array<string> in Hive, I have colon-delimited data, e.g.
> >
> > :0:12:21:99:
> >
> > With the lateral view and explode functions in Hive, I can output
> > each value as a separate row.
> >
> > In Pig, I think I need to use flatten, but it just outputs the array
> > as a single field, and I can't see where to specify that the
> > colon is the delimiter/value separator.
> >
> > register /opt/pig/trunk/bin/piggybank.jar;
> > mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
> > org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID string,seg_ids array<string>');
> > opt = foreach mt generate C_SUB_ID, flatten(seg_ids) as s_seg_id;
> > dump opt;
> >
> >
> >
> > Thanks
> >
> > Malc
> >
> >
> >
>
>