Pig >> mail # user >> "Exploding" a Hive array<string> in Pig from an RCFile


Re: "Exploding" a Hive array<string> in Pig from an RCFile
Hi Malcolm,

Arrays are converted to tuples, so FLATTEN should work on them directly. I
think you need not worry about the delimiter (assuming Hive knows how to
deserialize it). By the way, does RCFile require a delimiter to store arrays?
I am not sure about that.

Thanks,
Aniket
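
The difference between flattening a tuple and flattening a bag, which the rest of this thread turns on, can be sketched as below. This is only an illustration under stated assumptions: the input is a comma-delimited text file (the 'test.txt' name is borrowed from the example further down, not from the original RCFile setup), and the Pig version in use has a TOKENIZE that accepts a delimiter argument. It is not necessarily what anyone in the thread ended up running.

```pig
-- Assumption: 'test.txt' holds lines such as 1133957209,61:0:1
A = LOAD 'test.txt' USING PigStorage(',')
    AS (C_SUB_ID:chararray, seg_ids:chararray);

-- STRSPLIT returns a tuple. Flattening a tuple widens the row:
-- each element becomes another column of the same row.
wide = FOREACH A GENERATE C_SUB_ID, FLATTEN(STRSPLIT(seg_ids, ':'));

-- TOKENIZE returns a bag of single-field tuples. Flattening a bag
-- emits one output row per element, paired with C_SUB_ID.
-- Assumption: this Pig version supports the delimiter argument.
tall = FOREACH A GENERATE C_SUB_ID, FLATTEN(TOKENIZE(seg_ids, ':')) AS s_seg_id;

DUMP tall;
```

When the field is already a tuple, as with a loaded Hive array, wrapping its elements with TOBAG before flattening (as in the quoted reply below) gives the same row-per-element behaviour.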
On Wed, Apr 11, 2012 at 8:14 PM, Norbert Burger <[EMAIL PROTECTED]> wrote:

> A little wonky, but try wrapping the flattened tuple elements in a bag, and
> then re-flattening that:
>
> A = LOAD 'test.txt' USING PigStorage(',') AS
> (C_SUB_ID:chararray,seg_ids:chararray);
> B = FOREACH A GENERATE C_SUB_ID,FLATTEN(STRSPLIT(seg_ids,':'));
> C = FOREACH B GENERATE $0,FLATTEN(TOBAG($1..));
>
> Only flattened bags generate the cols -> rows transformation that you're
> trying to make.  Flattened tuples, on the other hand, simply explode the
> tuple into its composite elements, but without creating the multiple rows
> ("cross product") in your relation.  A custom UDF would be another option
> here.
>
> Norbert
>
> On Wed, Apr 11, 2012 at 6:59 PM, Malcolm Tye <[EMAIL PROTECTED]> wrote:
>
> > Hi Norbert,
> > I don't seem to be getting what I'm after. If my data looks like this:
> >
> > 1133957209,61:0:1
> > 4524524233,21:0
> >
> > I want to produce
> >
> > 1133957209,61
> > 1133957209,0
> > 1133957209,1
> > 4524524233,21
> > 4524524233,0
> >
> > I changed the LOAD statement to
> >
> > mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
> > org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID string,seg_ids array');
> > opt = foreach mt generate C_SUB_ID, FLATTEN(STRSPLIT(seg_ids,':')) as s_seg_id;
> >
> > I don't seem to be getting the cross product, just something like the
> > following
> >
> > 1133957209,61,0,1
> > 4524524233,21,0
> >
> > Any ideas ?
> >
> >
> > Thanks
> >
> > Malc
> >
> >
> > -----Original Message-----
> > From: Norbert Burger [mailto:[EMAIL PROTECTED]]
> > Sent: 06 April 2012 16:01
> > To: [EMAIL PROTECTED]
> > Subject: Re: "Exploding" a Hive array<string> in Pig from an RCFile
> >
> > Malcolm -- typically, you'd use STRSPLIT and an optional FLATTEN to tokenize
> > a chararray on some delimiter.  So the following should work:
> >
> > opt = foreach mt generate C_SUB_ID, flatten(STRSPLIT(seg_ids,':')) as
> > s_seg_id;
> >
> > Norbert
> >
> > On Thu, Apr 5, 2012 at 8:58 AM, Malcolm Tye <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >    I'm storing data into a partitioned table using Hive in RCFile
> > > format, but I want to use Pig to do the aggregation of that data.
> > >
> > > In my array<string> column in Hive, I have colon-delimited data, e.g.
> > >
> > > :0:12:21:99:
> > >
> > > With the lateral view and explode functions in Hive, I can output each
> > > value as a separate row.
> > >
> > > In Pig, I think I need to use flatten, but it just outputs the array
> > > as a single field, and I can't see where to specify which character
> > > is the delimiter/value separator.
> > >
> > > register /opt/pig/trunk/bin/piggybank.jar;
> > > mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
> > > org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID string,seg_ids array<string>');
> > > opt = foreach mt generate C_SUB_ID, flatten(seg_ids) as s_seg_id;
> > > dump opt;
> > >
> > >
> > >
> > > Thanks
> > >
> > > Malc
> > >
> > >
> > >
> >
> >
>

--
"...:::Aniket:::... Quetzalco@tl"