-
Read Hive LazySimpleSerde with Pig
Shawn Hermans 2013-03-12, 18:17
All, Is there an easy way to read Hive LazySimpleSerde encoded files in Pig? I did some research and found support for Hive's columnar format and for SequenceFiles, but did not see anything for LazySimpleSerde.
Thanks, Shawn
-
Re: Read Hive LazySimpleSerde with Pig
Dmitriy Ryaboy 2013-03-12, 21:53
How does LazySimpleSerde store data?
-
Re: Read Hive LazySimpleSerde with Pig
Shawn Hermans 2013-03-12, 23:35
It uses ^A as the field separator, with one record per line. That would be easy enough, as I could just use PigStorage('\u0001') to pull in the records. The only issue is how to extract maps. It uses ^B to separate entries within the map and ^C to separate each key from its value. It wouldn't be too difficult to write a UDF to parse the map entries; I was just wondering if there was a built-in way of doing that.
Thanks, Shawn
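To make the layout concrete, here is a minimal Python sketch of what one such line looks like and how the delimiters nest. The sample row (an id column followed by a map column) and all of the names in it are made up for illustration; only the three control characters come from the description above.

```python
# Default text delimiters described above, written as Python escapes.
FIELD_SEP = '\x01'  # ^A, between columns
ITEM_SEP = '\x02'   # ^B, between entries of a map
KEY_SEP = '\x03'    # ^C, between a key and its value

# Hypothetical row: an id column, then a map column with two entries.
map_column = ITEM_SEP.join(['color' + KEY_SEP + 'red', 'size' + KEY_SEP + 'L'])
sample_line = FIELD_SEP.join(['42', map_column])

# Peel the layers apart from the outside in.
columns = sample_line.split(FIELD_SEP)              # columns of the row
entries = columns[1].split(ITEM_SEP)                # raw map entries
pairs = [entry.split(KEY_SEP) for entry in entries]  # [key, value] lists
```

Splitting on ^A alone, which is all PigStorage does here, leaves the whole map as one string that still contains the ^B and ^C characters.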
-
Re: Read Hive LazySimpleSerde with Pig
Shawn Hermans 2013-03-13, 15:39
Solved the issue with a Jython UDF.
The Pig script:

    REGISTER 'lazysimpleserde.py' USING jython AS myfuncs;
    A = LOAD '000000_0' USING PigStorage('\\u001') AS (params:chararray);
    B = FOREACH A GENERATE myfuncs.extractMap(params);

And lazysimpleserde.py:

    @outputSchema("params:map[]")
    def extractMap(lazy_map):
        extracted = {}
        # Map entries are separated by ^B.
        entries = lazy_map.split('\x02')
        for entry in entries:
            # Within an entry, ^C separates the key from the value.
            split_entry = entry.split('\x03')
            # Skip anything that is not exactly one key and one value.
            if len(split_entry) == 2:
                extracted[split_entry[0]] = split_entry[1]
        return extracted
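For anyone who wants to check the parsing logic without starting a Pig job, the sketch below runs the same function under plain Python. The outputSchema defined here is only a stand-in written for this example, on the assumption that Pig supplies the real decorator when it loads the script through Jython; the sample input string is also made up.

```python
def outputSchema(schema):
    """Stand-in decorator for local runs; it only records the schema string."""
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap

@outputSchema("params:map[]")
def extractMap(lazy_map):
    extracted = {}
    # Map entries are separated by ^B.
    for entry in lazy_map.split('\x02'):
        # Within an entry, ^C separates the key from the value.
        split_entry = entry.split('\x03')
        if len(split_entry) == 2:
            extracted[split_entry[0]] = split_entry[1]
    return extracted

# Hypothetical input: two well-formed entries and one with no ^C.
result = extractMap('color\x03red\x02size\x03L\x02malformed')
empty = extractMap('')
```

The entry without a ^C is dropped silently, and an empty string yields an empty map. Note that the function calls split on its argument directly, so a null field would need a separate check before it reaches this code.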