RE: Converting types from java HashMap, Long and Array to BytesWritable for RCFileOutputFormat
Viraj Bhat 2010-06-11, 21:47
Thanks again for your help. Using a serde tailored for Zebra is
definitely a convenient way without having to convert the data.
When you mention complex types, do you mean bags and hashmaps?
I am also interested in whether there is a way to do this in M/R by
calling the appropriate objects from the serde to convert to
BytesRefWritable. I need to investigate.
Does anyone else in this group have experience writing M/R programs
that convert data to RC format?
From: Yongqiang He [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 10, 2010 12:24 AM
To: Viraj Bhat; [EMAIL PROTECTED]
Cc: Harmeek Singh Bedi
Subject: Re: Converting types from java HashMap, Long and Array to
BytesWritable for RCFileOutputFormat
Please see my inline comments, and please correct me if I am wrong
about the serde layer.
On 6/9/10 11:24 PM, "Viraj Bhat" <[EMAIL PROTECTED]> wrote:
> Hi Yongqiang and Hive users,
> In my Map Reduce program I have HashMaps and Arrays of HashMaps that
> I need to convert to BytesRefWritable for use with RCFileOutputFormat
> (which uses values as BytesRefWritable). I am then planning to re-read
> this data using the "ROW FORMAT SERDE
> Here are the questions I have about the steps to be followed:
> 1) Should I take the columnarserde code and write my own serde since I
> have HashMaps and Array of HashMaps?
I think you do not need to write your own serde. Hive's serdes support
complex types and nested complex types.
> 2) Where should I specify the separators I need to use for the HashMaps
> and Array of HashMaps I am creating?
If you are writing out the data and want to use Hive's serde to read
it, you can just use Hive's default separators (which are defined in
> 3) Should I be using LazyArray, LazyMap objects in my M/R program to do
> the required serializations?
If you want to use hive's built-in serde, you don't need to.
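To illustrate the point about default separators, here is a minimal sketch in plain Java, with no Hive classes, of how a map and an array of maps might be laid out as delimited bytes. It assumes the commonly documented defaults of \001, \002, \003 and \004 for successive nesting levels, and that a column value starts at level 1 because level 0 separates columns in a text row. The class and method names are hypothetical, and escaping and NULL handling are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hive: lays out complex values as delimited text.
class DefaultSeparatorSketch {
    // Assumed default separators by nesting level: \001, \002, \003, \004.
    static final char[] SEP = {1, 2, 3, 4};

    // Map entries are joined with SEP[level]; key and value with SEP[level + 1].
    // Keys and values containing separator characters are NOT escaped here.
    static String encodeMap(Map<String, String> map, int level) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (Map.Entry<String, String> e : map.entrySet()) {
            if (!first) {
                sb.append(SEP[level]);
            }
            sb.append(e.getKey()).append(SEP[level + 1]).append(e.getValue());
            first = false;
        }
        return sb.toString();
    }

    // Array items are joined with SEP[level]; each map is encoded one level deeper.
    static String encodeListOfMaps(List<Map<String, String>> list, int level) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < list.size(); i++) {
            if (i > 0) {
                sb.append(SEP[level]);
            }
            sb.append(encodeMap(list.get(i), level + 1));
        }
        return sb.toString();
    }

    // The bytes that would be stored for one column value.
    static byte[] columnBytes(String encoded) {
        return encoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("a", "1");
        m.put("b", "2");
        byte[] mapColumn = columnBytes(encodeMap(m, 1));
        byte[] arrayColumn = columnBytes(encodeListOfMaps(Arrays.asList(m, m), 1));
        System.out.println(mapColumn.length + " bytes, " + arrayColumn.length + " bytes");
    }
}
```

In an M/R job, each byte array produced this way could presumably be wrapped in a BytesRefWritable, one per column, before being handed to RCFileOutputFormat; that step is omitted so the sketch runs without Hive on the classpath.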
> 4) If I write out my original data using TextFormat instead of
> RCFileOutputFormat and make Hive read it as an external table and then
> store the corresponding results to RCFormat using Hive DDL commands,
> does Hive convert to RC here? a) Can it do that? b) If it did that, what
> are the separators that are used in this case?
A) Yes, it can do that.
B) The separators used come from the table's metadata. If none are
defined, Hive uses the defaults defined in LazySimpleSerDe.
As long as the data can be parsed by Hive, Hive can convert it into the
format you want. So you need Hive to be able to parse your text format
(be careful of separators).
Basically, Hive uses a deserializer to deserialize the input data into
built-in types and a serializer to serialize the data out to HDFS.
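The deserialize-then-serialize flow can be sketched in plain Java as follows. This is only an illustration of the idea, not Hive's implementation: a map column written with a source table's separators (here assumed to be ',' and ':') is parsed into a Java Map and then written back out with different separators (here \002 and \003, assumed to be the defaults). All names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of "deserialize to built-in types, then serialize".
class ReencodeSketch {
    // Deserialize: split a delimited map column into a Java Map.
    // Entries that contain no key/value separator are skipped.
    static Map<String, String> parseMap(String text, char entrySep, char kvSep) {
        Map<String, String> out = new LinkedHashMap<>();
        int start = 0;
        while (true) {
            int end = text.indexOf(entrySep, start);
            String entry = end < 0 ? text.substring(start) : text.substring(start, end);
            int kv = entry.indexOf(kvSep);
            if (kv >= 0) {
                out.put(entry.substring(0, kv), entry.substring(kv + 1));
            }
            if (end < 0) {
                break;
            }
            start = end + 1;
        }
        return out;
    }

    // Serialize: write the Map back out with the target separators (no escaping).
    static String writeMap(Map<String, String> map, char entrySep, char kvSep) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : map.entrySet()) {
            if (sb.length() > 0) {
                sb.append(entrySep);
            }
            sb.append(e.getKey()).append(kvSep).append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Source text uses ',' between entries and ':' between key and value.
        Map<String, String> parsed = parseMap("k1:v1,k2:v2", ',', ':');
        String rewritten = writeMap(parsed, (char) 2, (char) 3);
        System.out.println(parsed + " -> " + rewritten.length() + " chars");
    }
}
```

The point of the sketch is that the input and output separators are independent: the input side only has to match what the source table declares, and the output side is whatever the target serde expects.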
Attached is some code that lets Hive parse a Zebra table, which uses
Pig's tuple as its data type. Right now it works well with primitive Pig
types, but it should not be very difficult to extend it to work with
complex types. I hope this code is helpful to you. The code most related
to the serde is under zebra/serde and ZebraUtils.java
> Any insights would be appreciated.
> Thanks Viraj
> -----Original Message-----
> From: Yongqiang He [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, June 08, 2010 2:25 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Converting types from java HashMap, Long and Array to
> BytesWritable for RCFileOutputFormat
> Hi Viraj
> I recommend that you use Hive's columnserde/lazyserde code to
> deserialize the data. This can help you avoid writing your own way to
> serialize/deserialize the data.
> Basically, for primitives, it is easy to serialize and deserialize. For
> complex types, you need to use separators.
> On 6/8/10 10:50 AM, "Viraj Bhat" <[EMAIL PROTECTED]> wrote:
>> Hi all,
>> I am working on an M/R program to convert Zebra data to Hive RC