Avro >> mail # user >> Compressed Avro vs. compressed Sequence - unexpected results?


Re: Compressed Avro vs. compressed Sequence - unexpected results?
For your Avro files, double-check that Snappy is actually being used. Use
avro-tools to peek at the file's metadata, or simply view the head of the
file in a text editor; the compression codec is recorded in the header.
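
If avro-tools is not at hand, the header check can be sketched in a few lines of Python. This is only a minimal sketch, assuming the standard Avro object container layout (the 4-byte magic "Obj" plus 0x01, followed by a metadata map of string keys to byte values); the function names are hypothetical, and the file has to be copied off HDFS first since this reads a local path.

```python
MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


def _read_long(stream):
    """Decode one Avro long (variable-length, zigzag-encoded)."""
    shift = 0
    result = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        value = byte[0]
        result |= (value & 0x7F) << shift
        if not (value & 0x80):
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def _read_bytes(stream):
    """Read a length-prefixed byte string."""
    length = _read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_avro_metadata(stream):
    """Return the header metadata map of an Avro container file."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = _read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a block byte size follows
            _read_long(stream)
            count = -count
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            meta[key] = _read_bytes(stream)
    return meta


def avro_codec(path):
    """Return the codec name stored in the header of a local file."""
    with open(path, "rb") as f:
        meta = read_avro_metadata(f)
    # A missing avro.codec entry means the "null" (uncompressed) codec.
    return meta.get("avro.codec", b"null").decode("utf-8")
```

Calling avro_codec() on one of the 45 output files should return "snappy" if compression was really applied; "null" or "deflate" would mean the Hive session setting did not take effect for the Avro table.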

Snappy is very fast, so the read time is most likely dominated by
deserialization.  Avro will be slower than a trivial deserializer (though
more compact), but being many times slower is not expected.  I am not
entirely sure how Hive's Avro SerDe works -- it is possible there is a
performance issue there.  If you can get a handful of stack traces
(kill -3 or jstack) from the mapper tasks, or profiler output, that
would be very insightful.
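
One rough way to collect and summarize those stack traces is sketched below. It assumes jstack is on the PATH of the task node and that the script runs as the same user as the mapper JVM; the function names, sample count, and interval are illustrative choices, not anything Hive or Hadoop provides. It counts the top frame of each RUNNABLE thread across several dumps, which is usually enough to see whether the time goes to Avro decoding, the SerDe, or I/O.

```python
import subprocess
import time
from collections import Counter


def top_runnable_frames(dump_text):
    """Count the topmost frame of every RUNNABLE thread in one thread dump."""
    counts = Counter()
    want_frame = False
    for raw in dump_text.splitlines():
        line = raw.strip()
        if line.startswith("java.lang.Thread.State:"):
            # Only the first frame after a RUNNABLE state line is of interest.
            want_frame = line.endswith("RUNNABLE")
        elif want_frame and line.startswith("at "):
            counts[line[3:]] += 1
            want_frame = False
        elif not line:
            # A blank line ends the current thread's section.
            want_frame = False
    return counts


def sample_jstack(pid, samples=5, interval=1.0):
    """Take several jstack dumps of a JVM and aggregate the top frames."""
    totals = Counter()
    for i in range(samples):
        result = subprocess.run(
            ["jstack", str(pid)],
            capture_output=True, text=True, check=True,
        )
        totals.update(top_runnable_frames(result.stdout))
        if i < samples - 1:
            time.sleep(interval)
    return totals
```

For example, sample_jstack(12345).most_common(10), with 12345 standing in for the PID of a running mapper, lists the frames seen most often. Frames that show up in nearly every sample are the likely hot spots.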
On 5/23/13 12:42 AM, "nir_zamir" <[EMAIL PROTECTED]> wrote:

>Hi,
>
>We're examining the storage of our data in Snappy-compressed files. Since we
>want the data's structure to be self-contained, we tested both Avro and
>Sequence files (both are splittable, which should best utilize our cluster).
>
>We tested the performance on a 12 GB CSV data file, on a 4-node cluster in a
>production environment (very strong machines).
>
>Compression
>
>What we did here (for test simplicity) was create two Hive tables, one
>Avro-based and one Sequence-based. Then we enabled Snappy compression and
>INSERTed the data from the RAW table (consisting of the 12 GB file).
>
>In terms of compression rate, Avro was better: 72% vs. 57%.
>In both cases there were 45 mappers, and CPU/Mem were very far from their
>limit on all machines.
>Since there was no reduce operator, this created 45 files.
>
>Compression took longer for Avro: 1.75 minutes vs. 1.2 minutes for
>sequence files.
>
>Decompression
>
>What we did here was this Hive query:
>SELECT COUNT(1) FROM table-name;
>
>Here was the real difference: Avro took about *6 times as long* to perform
>this (3 minutes vs. 0.5 minute).
>This was very surprising, since with our strong machines we would expect I/O
>to be the bottleneck, and since the Avro files are smaller, we expected them
>to be faster to decompress.
>The number of mappers in both cases was similar (14 vs. 17) and again,
>CPU/Mem didn't seem to be exhausted.
>Since reading is our most time-critical operation, this issue makes it hard
>for us to use Avro.
>
>Maybe we're doing something wrong - your input would be much appreciated!
>
>Thanks,
>Nir
>
>
>
>--
>View this message in context:
>http://apache-avro.679487.n3.nabble.com/Compressed-Avro-vs-compressed-Sequence-unexpected-results-tp4027467.html
>Sent from the Avro - Users mailing list archive at Nabble.com.