nir_zamir 2013-05-23, 07:42
Re: Compressed Avro vs. compressed Sequence - unexpected results?
Scott Carey 2013-05-23, 18:38
For your Avro files, double-check that Snappy is actually being used: use
avro-tools to peek at the file's metadata, or simply view the head of the
file in a text editor. The compression codec is recorded in the header.
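
If avro-tools is not at hand, the header check can also be scripted. The following is only a minimal sketch in Python, assuming the standard Avro object container layout (the 4-byte magic "Obj" plus 0x01, then a metadata map in which the "avro.codec" key names the codec, with an absent key meaning "null"). The function names read_long, read_avro_metadata and avro_codec are illustrative, not part of any Avro library.

```python
import sys

AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag-encoded)."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated Avro header")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def read_bytes(stream):
    """Read a length-prefixed byte string (Avro 'bytes' / 'string')."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("truncated Avro header")
    return data


def read_avro_metadata(stream):
    """Return the header metadata map as {str: bytes}."""
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a byte size follows, which we skip
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def avro_codec(stream):
    """Name of the block compression codec; 'null' when the key is absent."""
    meta = read_avro_metadata(stream)
    return meta.get("avro.codec", b"null").decode("ascii")


if __name__ == "__main__":
    # Usage: python avro_codec.py file1.avro file2.avro ...
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(path, avro_codec(f))
```

Running it over a few of the files Hive wrote should print "snappy" for each one. If it prints "null" or "deflate", the compression setting did not take effect for the Avro table.
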
Snappy is very fast, so the read time is most likely dominated by
deserialization. Avro will be slower than a trivial deserializer (but
more compact), but it should not be many times slower. I am not
entirely sure how Hive's Avro SerDe works -- it is possible there is a
performance issue there. If you could get a handful of stack
traces (kill -3 or jstack) from the mapper tasks, or profiler output,
it would be very insightful.
On 5/23/13 12:42 AM, "nir_zamir" <[EMAIL PROTECTED]> wrote:
>We're examining the storage of our data in Snappy-compressed files. Since we
>want the data's structure to be self-contained, we checked it with Avro and
>with Sequence (both are splittable, which should best utilize our cluster).
>We tested the performance on a 12 GB data (CSV) file, on a 4-node cluster
>in a production environment (very strong machines).
>What we did here (for test simplicity) is create two Hive tables: Avro-based
>and Sequence-based. Then we enabled Snappy compression and INSERTed the data
>from the RAW table (consisting of the 12 GB file).
>In terms of compression rate, Avro was better: 72% vs. 57%.
>In both cases there were 45 mappers, and CPU/Mem were very far from their
>limit on all machines.
>Since there was no reduce operator, this created 45 files.
>Compression took longer for Avro: 1.75 minutes vs. 1.2 minutes for Sequence.
>To test reading, we ran this Hive query:
>SELECT COUNT(1) FROM table-name;
>Here was the real difference: it took Avro about *6 times longer* to perform
>this (3 minutes vs. 0.5 minute).
>This was very surprising since, for our strong machines, the I/O would be
>expected to be the bottleneck, and since Avro files are smaller, we expected
>them to be faster to decompress.
>The number of mappers in both cases was similar (14 vs. 17) and again,
>CPU/Mem didn't seem to be exhausted.
>Since our most critical time is reading, this issue makes it hard for us to
>use Avro.
>Maybe we're doing something wrong - your input would be much appreciated!