Hive >> mail # user >> Converting types from java HashMap, Long and Array to BytesWritable for RCFileOutputFormat

RE: Converting types from java HashMap, Long and Array to BytesWritable for RCFileOutputFormat
Hi Yongqiang and Hive users,
 In my MapReduce program I have HashMaps and arrays of HashMaps, which
I need to convert to BytesRefWritable in order to use RCFileOutputFormat
(which takes its values as BytesRefWritable). I then plan to re-read
this data using the "ROW FORMAT SERDE

Here are the questions I have about the steps to be followed:

1) Should I take the ColumnarSerDe code and write my own SerDe, since I
have HashMaps and arrays of HashMaps?

2) Where should I specify the separators I need to use for the HashMaps
and Array of HashMaps I am creating?

3) Should I be using LazyArray, LazyMap objects in my M/R program to get
the required serializations?

4) If I write out my original data using TextFormat instead of
RCFileOutputFormat, have Hive read it as an external table, and then
store the corresponding results in RC format using Hive DDL commands, how
does Hive convert to RC here? a) Can it do that? b) If so, what
separators are used in this case?

Any insights would be appreciated.

Thanks Viraj
-----Original Message-----
From: Yongqiang He [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 08, 2010 2:25 PM
Subject: Re: Converting types from java HashMap, Long and Array to
BytesWritable for RCFileOutputFormat

Hi Viraj

I recommend using Hive's ColumnarSerDe/LazySimpleSerDe code to serialize
and deserialize the data. This saves you from writing your own
serialization/deserialization code.

Basically, primitives are easy to serialize and deserialize. For
complex types, however, you need to use separators.
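
As a rough illustration of the separator idea, here is a minimal sketch in
plain Java. It assumes Hive's default delimiter bytes (\001, \002, \003, ...
by nesting level), the default NULL marker "\N", and that each RCFile column
is encoded as delimited text starting at nesting level 1; all of these should
be checked against the SerDe source of your Hive version. The names
DelimitedColumnSketch and serializeColumn are hypothetical and are not Hive
APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; it is not part of Hive.
final class DelimitedColumnSketch {
    // ASSUMPTION: Hive's default separators by nesting level (\001, \002, \003, ...).
    private static final char[] SEPARATORS = {'\u0001', '\u0002', '\u0003', '\u0004', '\u0005'};
    // ASSUMPTION: Hive's default text marker for NULL values.
    private static final String NULL_SEQUENCE = "\\N";

    /** Encodes one column value as delimited UTF-8 text, starting at nesting level 1. */
    static byte[] serializeColumn(Object value) {
        StringBuilder sb = new StringBuilder();
        append(sb, value, 1);
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Returns the separator for a nesting level, failing loudly when the sketch runs out.
    private static char separator(int level) {
        if (level >= SEPARATORS.length) {
            throw new IllegalArgumentException("nesting too deep for this sketch");
        }
        return SEPARATORS[level];
    }

    private static void append(StringBuilder sb, Object value, int level) {
        if (value == null) {
            sb.append(NULL_SEQUENCE);
        } else if (value instanceof Map) {
            // Entries are split by this level's separator, key and value by the next one.
            char entrySep = separator(level);
            char keyValueSep = separator(level + 1);
            boolean first = true;
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                if (!first) sb.append(entrySep);
                first = false;
                append(sb, e.getKey(), level + 2);
                sb.append(keyValueSep);
                append(sb, e.getValue(), level + 2);
            }
        } else if (value instanceof List) {
            // List items are split by this level's separator; items nest one level deeper.
            char itemSep = separator(level);
            boolean first = true;
            for (Object item : (List<?>) value) {
                if (!first) sb.append(itemSep);
                first = false;
                append(sb, item, level + 1);
            }
        } else {
            // Primitives (Long, String, ...) are written in their text form.
            sb.append(value);
        }
    }
}
```

Under these assumptions, the resulting bytes could be wrapped in a
BytesRefWritable and set on the BytesRefArrayWritable for the row. Note that
HashMap.toString() produces text of the form {a=1, b=x}, with braces, "=" and
", ", so a delimiter-based SerDe would most likely not split it into map
entries.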

On 6/8/10 10:50 AM, "Viraj Bhat" <[EMAIL PROTECTED]> wrote:

> Hi all,
>   I am working on an M/R program to convert Zebra data to Hive RC
> format.
> The TableInputFormat (Zebra) returns keys and values in the form of
> BytesWritable and (Pig) Tuple.
> In order to convert it to the RCFileOutputFormat, whose key is
> "BytesWritable" and value is "BytesRefArrayWritable", I need to take the
> Pig Tuple, iterate over each of its contents and convert it to
> "BytesRefWritable".
> The easy part is Strings, which can be converted as:
> myvalue = new BytesRefArrayWritable(10);
> //value is a Pig Tuple and get returns a string
> String s = (String)value.get(0);
> myvalue.set(0, new BytesRefWritable(s.getBytes("UTF-8")));
> How do I do it for Java "Long", "HashMap" and "Arrays"?
> //value is a Pig tuple
> Long l = (Long) value.get(1);
> // Long has no getBytes(); encode its decimal string form instead
> myvalue.set(1, new BytesRefWritable(l.toString().getBytes("UTF-8")));
> HashMap<String, Object> hm = new
> HashMap<String,Object>((HashMap)value.get(2));
> myvalue.set(iter, new
> BytesRefWritable(hm.toString().getBytes("UTF-8")));
> Would the toString() method work? If I need to re-read the RC format back
> through the "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe", would
> it interpret the data correctly?
> Is there any documentation for the same?
> Any suggestions would be beneficial.
> Viraj