Pig >> mail # user >> "Exploding" a Hive array<string> in Pig from an RCFile


RE: "Exploding" a Hive array<string> in Pig from an RCFile
Hi Norbert,
   I don't seem to be getting what I'm after. If my data looks like
this

1133957209,61:0:1
4524524233,21:0

I want to produce

1133957209,61
1133957209,0
1133957209,1
4524524233,21
4524524233,0

I changed the LOAD statement to

mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*'
    USING org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID string,seg_ids array');
opt = foreach mt generate C_SUB_ID, FLATTEN(STRSPLIT(seg_ids,':')) as s_seg_id;

I don't seem to be getting the cross product, just something like the
following

1133957209,61,0,1
4524524233,21,0

Any ideas?
Thanks

Malc
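
The result above is consistent with how Pig's FLATTEN treats its argument: STRSPLIT returns a tuple, and flattening a tuple spreads its fields across columns of the same row, whereas flattening a bag emits one row per element. The following is only a rough Python model of that difference using the sample data from this message; the helper names flatten_tuple and flatten_bag are hypothetical and are not Pig APIs. On the Pig side, a bag-returning function such as TOKENIZE may be the thing to try, but whether it accepts ':' as a delimiter depends on the Pig version, so treat that as an assumption to verify.

```python
def flatten_tuple(sub_id, fields):
    """Model FLATTEN on a tuple: the fields become extra columns of ONE row."""
    return [(sub_id, *fields)]


def flatten_bag(sub_id, items):
    """Model FLATTEN on a bag: ONE output row per element (the cross product)."""
    return [(sub_id, item) for item in items]


# Sample records from the message: (C_SUB_ID, seg_ids as a colon-delimited string)
records = [("1133957209", "61:0:1"), ("4524524233", "21:0")]

# What FLATTEN(STRSPLIT(...)) resembles: one wide row per input record
as_columns = [
    row
    for sub_id, segs in records
    for row in flatten_tuple(sub_id, segs.split(":"))
]

# What flattening a bag of tokens resembles: one row per segment id
as_rows = [
    row
    for sub_id, segs in records
    for row in flatten_bag(sub_id, segs.split(":"))
]

for row in as_rows:
    print(",".join(row))
```

Here as_columns matches the two wide rows reported above, while as_rows matches the five rows listed as the desired output.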
-----Original Message-----
From: Norbert Burger [mailto:[EMAIL PROTECTED]]
Sent: 06 April 2012 16:01
To: [EMAIL PROTECTED]
Subject: Re: "Exploding" a Hive array<string> in Pig from an RCFile

Malcolm -- typically, you'd use a STRSPLIT and optional FLATTEN to tokenize
a chararray on some delimiter.  So the following should work:

opt = foreach mt generate C_SUB_ID, flatten(STRSPLIT(seg_ids,':')) as s_seg_id;

Norbert

On Thu, Apr 5, 2012 at 8:58 AM, Malcolm Tye <[EMAIL PROTECTED]> wrote:

> Hi,
>    I'm storing data into a partitioned table using Hive in RCFile
> format, but I want to use Pig to do the aggregation of that data.
>
> In my array<string> column in Hive, I have colon-delimited data, e.g.
>
> :0:12:21:99:
>
> With the lateral view and explode functions in Hive, I can output each
> value as a separate row.
>
> In Pig, I think I need to use flatten, but it just outputs the array
> as a single field, and I can't see where to specify that the colon
> is the value separator.
>
> register /opt/pig/trunk/bin/piggybank.jar;
> mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*'
>     USING org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID string,seg_ids array<string>');
> opt = foreach mt generate C_SUB_ID, flatten(seg_ids) as s_seg_id;
> dump opt;
>
>
>
> Thanks
>
> Malc
>
>
>