Hadoop, mail # user - Avro vs Protocol Buffer


Re: Avro vs Protocol Buffer
Edward Capriolo 2012-07-20, 22:03
We just open sourced our protobuf support for Hive. We built it out
because in our line of work protobuf is very common and it gave us the
ability to log protobufs directly to files and then query them.

https://github.com/edwardcapriolo/hive-protobuf

I did not do any heavy benchmarking vs avro. However I did a few
things, sorry that I do not have exact numbers here.

A compressed SequenceFile of Text versus a SequenceFile of protobufs
is maybe 5-10 percent smaller depending on the data. That is pretty
good compression, so space-wise you are not hurting there.

Speed-wise I have to do some more analysis. Our input format uses
reflection, so that will have its cost (although we tried to cache
things where possible). Protobuf has some DynamicMessage components
which I need to explore to possibly avoid reflection. Also, you have to
consider that protobufs do more than TextInputFormat, such as validating
data, so if you are comparing raw speed you have to watch out for
apples-to-oranges comparisons.
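
To make the reflection-caching point concrete, here is a minimal sketch of
caching a reflective getter lookup so it runs once per class and field rather
than once per record. This is not the hive-protobuf code: CachedGetters,
LogLine, and the "host" field are hypothetical names used only for
illustration, and a real input format may cache differently.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper (not from hive-protobuf): caches getter Methods so the
// reflective lookup happens once per (class, field) pair.
final class CachedGetters {
    private static final Map<String, Method> CACHE = new ConcurrentHashMap<>();
    // Counts real reflective lookups, only to make the caching visible.
    static final AtomicInteger LOOKUPS = new AtomicInteger();

    static Object get(Object record, String field) {
        Class<?> type = record.getClass();
        String key = type.getName() + "#" + field;
        try {
            Method getter = CACHE.get(key);
            if (getter == null) {
                // Slow path: build the getter name and search for the method.
                // Two threads may race here and both look it up; that is harmless.
                String name = "get" + Character.toUpperCase(field.charAt(0))
                        + field.substring(1);
                getter = type.getMethod(name);
                LOOKUPS.incrementAndGet();
                CACHE.put(key, getter);
            }
            // invoke() is still a reflective call; only the lookup is cached.
            return getter.invoke(record);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no usable getter for " + key, e);
        }
    }
}

// Hypothetical stand-in for a generated protobuf message class.
final class LogLine {
    private final String host;
    LogLine(String host) { this.host = host; }
    public String getHost() { return host; }
}
```

Calling CachedGetters.get(new LogLine("web01"), "host") for many records
performs the method search only for the first one. The per-record invoke()
call remains, which is the part a non-reflective approach would remove.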

I never put our ProtoBuf format head to head with the AvroFormat.
Generally I hate those types of benchmarks, but I would be curious to
know.

Overall, if you have no global (company-wide) serialization format, you
have to look at what tools you have and what they support. For example,
Hive has Avro and protobuf support, but maybe Pig only has one or the
other. Are you using Sqoop, and can it output files in the format that
you want? Are you using a language like Ruby, and what support do you
have there?

In my mind speed is important but compatibility is more so. For
example, even if reading Avro were 2 times slower than reading Thrift
(which it is not), your jobs might be doing some very complex logic with
a long shuffle, sort, and reduce phase. Then the performance of
physically reading the file is not as important as it may seem.
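
The arithmetic behind that last point can be sketched as follows. The class
name and the 10 percent read fraction are assumptions chosen for illustration,
not measurements from any real job.

```java
// Hypothetical helper: estimates how much slower a whole job gets when only
// the input-reading phase slows down.
final class ReadCostEstimate {
    /**
     * @param readFraction share of total job time spent reading input (0 to 1)
     * @param readSlowdown factor by which reading gets slower (2.0 = twice as slow)
     * @return factor by which the whole job gets slower
     */
    static double jobSlowdown(double readFraction, double readSlowdown) {
        // The rest of the job (map logic, shuffle, sort, reduce) is unchanged.
        return (1.0 - readFraction) + readFraction * readSlowdown;
    }

    public static void main(String[] args) {
        // Assumed: reading is 10% of job time and becomes 2x slower.
        System.out.println(jobSlowdown(0.10, 2.0));
    }
}
```

Under those assumed numbers the job as a whole gets roughly 10 percent slower,
not 2 times slower. The larger the shuffle, sort, and reduce share, the less
the read speed of the format matters.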

On Thu, Jul 19, 2012 at 12:34 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> +1 to what Bruno's pointed you at. I personally like Avro for its data
> files (schema's stored on file, and a good, splittable container for
> typed data records). I think speed for serde is on-par with Thrift, if
> not faster today. Thrift offers no optimized data container format
> AFAIK.
>
> On Thu, Jul 19, 2012 at 1:57 PM, Bruno Freudensprung
> <[EMAIL PROTECTED]> wrote:
>> Once new results are available, you might be interested in:
>> https://github.com/eishay/jvm-serializers/wiki/
>> https://github.com/eishay/jvm-serializers/wiki/Staging-Results
>>
>> My 2 cents,
>>
>> Bruno.
>>
>> On 16/07/2012 22:49, Mike S wrote:
>>
>>> Strictly from a speed and performance perspective, is Avro as fast as
>>> Protocol Buffers?
>>>
>>
>
>
>
> --
> Harsh J