RE: Converting types from java HashMap, Long and Array to BytesWritable for RCFileOutputFormat
Hi Yongqiang,
 Thanks again for your help. Using a serde tailored for Zebra is
definitely a convenient approach, since it avoids converting the data.
When you mention complex types, do you mean bags and hashmaps?

I am also interested in whether this can be done in M/R by calling the
appropriate serde objects to convert to BytesRefWritable. I need to
investigate.

Meanwhile, does anyone else in this group have experience writing M/R
programs that convert data to RC format?
Viraj

-----Original Message-----
From: Yongqiang He [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 10, 2010 12:24 AM
To: Viraj Bhat; [EMAIL PROTECTED]
Cc: Harmeek Singh Bedi
Subject: Re: Converting types from java HashMap, Long and Array to
BytesWritable for RCFileOutputFormat

Please see my inline comments, and please correct me if I am wrong
about the serde layer.

Thanks
Yongqiang
On 6/9/10 11:24 PM, "Viraj Bhat" <[EMAIL PROTECTED]> wrote:

> Hi Yongqiang and Hive users,
>  In my Map Reduce program I have HashMaps and Arrays of HashMaps, which
> I need to convert to BytesRefWritable for use with RCFileOutputFormat
> (which takes its values as BytesRefWritable). I am then planning to
> re-read this data using ROW FORMAT SERDE
> "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe".
>
> Here are the questions I have about the steps to be followed:
>
> 1) Should I take the columnarserde code and write my own serde since I
> have HashMaps and Array of HashMaps?

I think you do not need to write your own serde. Hive's serdes support
complex types and nested complex types.

>
> 2) Where should I specify the separators I need to use for the HashMaps
> and Array of HashMaps I am creating?
If you are writing out the data and want to use Hive's serde to read it,
you can just use Hive's default separators (which are defined in
LazySimpleSerDe.java).
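
To make the separator point concrete, here is a minimal sketch of writing
one column's bytes by hand. It assumes Hive's default separators (Ctrl-B,
Ctrl-C, Ctrl-D for nesting levels 1, 2, 3), assumes that the contents of
an RCFile column start at level 1, and assumes no escaping is configured.
The class and method names are hypothetical and are not part of Hive.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; this is not a Hive class.
final class DefaultSeparatorEncoder {
    // Assumed Hive default separators by nesting level inside one column.
    static final char LEVEL1 = '\u0002'; // Ctrl-B: array items, or map entries of a top-level map
    static final char LEVEL2 = '\u0003'; // Ctrl-C: key/value of a top-level map, or entries of a nested map
    static final char LEVEL3 = '\u0004'; // Ctrl-D: key/value of a map nested inside an array

    // Joins map entries with entrySep and each key/value pair with keyValueSep.
    static String joinMap(Map<String, String> map, char entrySep, char keyValueSep) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (Map.Entry<String, String> e : map.entrySet()) {
            if (!first) {
                sb.append(entrySep);
            }
            first = false;
            sb.append(e.getKey()).append(keyValueSep).append(e.getValue());
        }
        return sb.toString();
    }

    // Bytes for a column declared as map<string,string>.
    static byte[] encodeMapColumn(Map<String, String> map) {
        return joinMap(map, LEVEL1, LEVEL2).getBytes(StandardCharsets.UTF_8);
    }

    // Bytes for a column declared as array<map<string,string>>.
    static byte[] encodeArrayOfMapsColumn(List<Map<String, String>> maps) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < maps.size(); i++) {
            if (i > 0) {
                sb.append(LEVEL1);
            }
            sb.append(joinMap(maps.get(i), LEVEL2, LEVEL3));
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("a", "1");
        first.put("b", "2");
        Map<String, String> second = new LinkedHashMap<>();
        second.put("c", "3");

        byte[] mapColumn = encodeMapColumn(first);
        byte[] arrayColumn = encodeArrayOfMapsColumn(Arrays.asList(first, second));
        System.out.println(mapColumn.length + " bytes, " + arrayColumn.length + " bytes");
    }
}
```

In an M/R job, each resulting byte array would then be wrapped in a
BytesRefWritable, one per column. Because this sketch does no escaping, a
key or value that itself contains one of the separator characters would
not read back correctly.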
>
> 3) Should I be using LazyArray, LazyMap objects in my M/R program to get
> the required serializations?
If you want to use hive's built-in serde, you don't need to.
>
> 4) If I write out my original data using TextFormat instead of
> RCFileOutputFormat and make Hive read it as an external table and then
> store the corresponding results to RCFormat using Hive DDL commands, how
> does Hive convert to RC here? A) Can it do that? B) If it did that, what
> are the separators that are used in this case?
A) Yes, it can do that.
B) The separators used come from the table's metadata. If none are
defined, Hive uses the defaults defined in LazySimpleSerDe.
  
As long as the data can be parsed by Hive, Hive can convert it into
whatever format you want. So you need Hive to be able to parse your text
format (again, be careful with separators).
Basically, Hive uses a deserializer to deserialize the input data into
Hive's built-in types, and a serializer to serialize the data out to HDFS.
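
The deserialize-then-serialize idea can be sketched in plain Java, without
any Hive classes. The sketch below assumes a text table whose metadata
declares ',' between map entries and ':' between key and value, and it
re-emits the same map with the assumed default separators Ctrl-B and
Ctrl-C. The class and method names are hypothetical, and real serdes also
handle nulls, escaping and typed values, which this sketch leaves out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of "deserialize to in-memory types, then serialize".
final class ReencodeSketch {
    // Deserializer side: parse "k:v,k:v" style text into a Java map.
    static Map<String, String> deserializeMap(String text, char entrySep, char keyValueSep) {
        Map<String, String> out = new LinkedHashMap<>();
        if (text.isEmpty()) {
            return out;
        }
        String entryPattern = Pattern.quote(String.valueOf(entrySep));
        for (String entry : text.split(entryPattern, -1)) {
            int idx = entry.indexOf(keyValueSep);
            if (idx < 0) {
                // Simplification: reject entries that have no key/value separator.
                throw new IllegalArgumentException("Malformed map entry: " + entry);
            }
            out.put(entry.substring(0, idx), entry.substring(idx + 1));
        }
        return out;
    }

    // Serializer side: write the map back out with the target separators.
    static String serializeMap(Map<String, String> map, char entrySep, char keyValueSep) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (Map.Entry<String, String> e : map.entrySet()) {
            if (!first) {
                sb.append(entrySep);
            }
            first = false;
            sb.append(e.getKey()).append(keyValueSep).append(e.getValue());
        }
        return sb.toString();
    }

    // Full path: input separators come from the source table, output separators from the target.
    static String reencode(String text, char inEntry, char inKeyValue, char outEntry, char outKeyValue) {
        return serializeMap(deserializeMap(text, inEntry, inKeyValue), outEntry, outKeyValue);
    }

    public static void main(String[] args) {
        String converted = reencode("a:1,b:2", ',', ':', '\u0002', '\u0003');
        System.out.println(converted.length() + " characters after re-encoding");
    }
}
```

The point of the sketch is that the input and output separators are
independent: only the deserializer needs to know how the text was
written, which is why the separators in the source table's metadata have
to match the data.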

I have attached some code that lets Hive parse a Zebra table, which uses
Pig's tuple as its data type. Right now it works well with primitive Pig
types, but it should not be very difficult to extend it to work with
complex types.
I hope this code is helpful to you. The code most related to the serde is
under zebra/serde and in ZebraUtils.java.
Thanks
Yongqiang
> Any insights would be appreciated.
>
> Thanks Viraj
>
>
> -----Original Message-----
> From: Yongqiang He [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, June 08, 2010 2:25 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Converting types from java HashMap, Long and Array to
> BytesWritable for RCFileOutputFormat
>
> Hi Viraj
>
> I recommend using the code in Hive's columnserde/lazyserde to serialize
> and deserialize the data. This saves you from writing your own way to
> serialize/deserialize the data.
>
> Basically, for primitives, it is easy to serialize and de-serialize. But
> for complex types, you need to use separators.
>
> Thanks
> Yongqiang
> On 6/8/10 10:50 AM, "Viraj Bhat" <[EMAIL PROTECTED]> wrote:
>
>> Hi all,
>>   I am working on an M/R program to convert Zebra data to Hive RC