Avro, mail # user - Compressed Avro vs. compressed Sequence - unexpected results?


Re: Compressed Avro vs. compressed Sequence - unexpected results?
Scott Carey 2013-05-23, 18:38
For your Avro files, double-check that Snappy is actually being used: use
avro-tools to peek at the metadata in the file, or simply view the start of
the file in a text editor. The compression codec is recorded in the header.
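
If avro-tools is not at hand, the header check can also be scripted. Below is a minimal sketch in Python (standard library only) that reads the metadata map at the start of an Avro object container file and reports the `avro.codec` entry. It assumes the layout described in the Avro specification (4-byte magic `Obj\x01`, then a map of string keys to bytes values); the function names are made up for illustration and are not part of any Avro library.

```python
MAGIC = b"Obj\x01"  # magic bytes that open every Avro object container file


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag-encoded)."""
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        value = byte[0]
        raw |= (value & 0x7F) << shift
        if not (value & 0x80):  # high bit clear: last byte of the varint
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed byte string (Avro 'bytes' and 'string')."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_avro_metadata(stream):
    """Return the header metadata map as {str: bytes}."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero-count block terminates the map
            break
        if count < 0:  # negative count: a byte size follows, which we skip
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def avro_codec(path):
    """Return the codec name; a missing entry means 'null' (uncompressed)."""
    with open(path, "rb") as handle:
        meta = read_avro_metadata(handle)
    return meta.get("avro.codec", b"null").decode("utf-8")
```

Calling `avro_codec(...)` on one of the 45 output files (copied locally from the table's directory) should return `snappy` if compression was really applied, and `null` if the files were written uncompressed.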

Snappy is very fast, so the time to read is most likely dominated by
deserialization.  Avro will be slower than a trivial deserializer (but
more compact), but being many times slower is not expected.  I am not
entirely sure how Hive's Avro SerDe works -- it is possible there is a
performance issue there.  If you were able to get a handful of stack
traces (kill -3 or jstack) from the mapper tasks (or a profiler output),
that would be very helpful.
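
One rough way to turn a handful of stack traces into something readable is to count which frame sits on top of each thread across several dumps. The sketch below is only an illustration: it assumes `jstack` is on the PATH of the task node, that the mapper's JVM pid is already known, and that the dump follows the usual HotSpot layout (a quoted thread name line followed by indented `at ...` frames). The function names are hypothetical.

```python
import subprocess
import time
from collections import Counter


def count_top_frames(dump_text, depth=1):
    """Count the top `depth` frames of every thread in one thread dump."""
    counts = Counter()
    frames_seen = None  # None until the first thread header is reached
    for line in dump_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('"'):  # thread header, e.g. "main" #1 ...
            frames_seen = 0
        elif (stripped.startswith("at ")
              and frames_seen is not None
              and frames_seen < depth):
            counts[stripped[3:]] += 1  # drop the leading "at "
            frames_seen += 1
    return counts


def sample_top_frames(pid, samples=5, interval_s=2.0, depth=1):
    """Run jstack several times and sum the top-frame counts."""
    totals = Counter()
    for i in range(samples):
        result = subprocess.run(
            ["jstack", str(pid)],
            capture_output=True, text=True, check=True,
        )
        totals.update(count_top_frames(result.stdout, depth))
        if i < samples - 1:
            time.sleep(interval_s)
    return totals
```

If most samples land in decoding or object-construction frames rather than in decompression or I/O, that would support the guess that deserialization dominates; `sample_top_frames(pid).most_common(10)` gives a quick ranking.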

On 5/23/13 12:42 AM, "nir_zamir" <[EMAIL PROTECTED]> wrote:

>Hi,
>
>We're examining the storage of our data in Snappy-compressed files. Since we
>want the data's structure to be self-contained, we tested both Avro and
>Sequence files (both are splittable, which should make the best use of our
>cluster).
>
>We tested the performance on a 12 GB CSV data file, on a 4-node cluster
>in a production environment (very strong machines).
>
>Compression
>
>What we did here (for test simplicity) was create two Hive tables: one
>Avro-based and one Sequence-based. Then we enabled Snappy compression and
>INSERTed the data from the RAW table (consisting of the 12 GB file).
>
>In terms of compression rate, Avro was better: 72% vs. 57%.
>In both cases there were 45 mappers, and CPU/Mem were very far from their
>limit on all machines.
>Since there was no reduce operator, this created 45 files.
>
>Compression took longer for Avro: 1.75 minutes vs. 1.2 minutes for
>Sequence files.
>
>Decompression
>
>What we did here was this Hive query:
>SELECT COUNT(1) FROM table-name;
>
>Here was the real difference: it took Avro about *75% longer* to perform
>this (3 minutes vs. 0.5 minute).
>This was very surprising: on our strong machines we would expect I/O to be
>the bottleneck, and since the Avro files are smaller, we expected them to
>be faster to decompress.
>The number of mappers in both cases was similar (14 vs. 17) and again,
>CPU/Mem didn't seem to be exhausted.
>Since read time is the most critical for us, this issue makes it hard for
>us to use Avro.
>
>Maybe we're doing something wrong - your input would be much appreciated!
>
>Thanks,
>Nir
>
>
>
>--
>View this message in context:
>http://apache-avro.679487.n3.nabble.com/Compressed-Avro-vs-compressed-Sequence-unexpected-results-tp4027467.html
>Sent from the Avro - Users mailing list archive at Nabble.com.