We're looking into storing our data in Snappy-compressed files. Since we
want the data's structure to be self-contained, we tested both Avro and
Sequence files (both are splittable, which should make the best use of our
cluster).
We tested performance on a 12GB (CSV) data file, on a 4-node cluster in our
production environment (very strong machines).
What we did here (for test simplicity) is create two Hive tables: Avro-based
and Sequence-based. Then we enabled Snappy compression and INSERTed the data
from the RAW table (consisting of the 12GB file).
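A minimal sketch of what such a setup could look like, not the exact statements used in this test. The table names (raw_events, events_avro, events_seq) and the two columns are hypothetical, and the sketch assumes Hive 0.14 or later (for STORED AS AVRO) with the Snappy native libraries installed on every node:

```sql
-- Hypothetical names: raw_events stands in for the RAW table,
-- and (id, payload) stands in for its real columns.

-- Compress the files written by the INSERT jobs below.
SET hive.exec.compress.output=true;

-- Avro table: the codec is applied per block inside the Avro container file.
SET avro.output.codec=snappy;
CREATE TABLE events_avro (id BIGINT, payload STRING) STORED AS AVRO;
INSERT OVERWRITE TABLE events_avro SELECT id, payload FROM raw_events;

-- Sequence table: BLOCK compression compresses many records together.
-- Older property names are shown; newer Hadoop releases call them
-- mapreduce.output.fileoutputformat.compress.codec and
-- mapreduce.output.fileoutputformat.compress.type.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
CREATE TABLE events_seq (id BIGINT, payload STRING) STORED AS SEQUENCEFILE;
INSERT OVERWRITE TABLE events_seq SELECT id, payload FROM raw_events;
```

Since neither INSERT has a reduce stage, each mapper writes its own output file.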
In terms of compression rate, Avro was better: 72% vs. 57%.
In both cases there were 45 mappers, and CPU/Mem were very far from their
limit on all machines.
Since there was no reduce operator, this created 45 files.
Compression took longer for Avro: 1.75 minutes vs. 1.2 minutes for Sequence.
To test reading, we ran this Hive query:
SELECT COUNT(1) FROM table-name;
Here was the real difference: it took Avro about *six times as long* to
perform this (3 minutes vs. 0.5 minutes).
This was very surprising: with machines this strong we would expect I/O to
be the bottleneck, and since the Avro files are smaller, we expected them
to be faster to read and decompress.
The number of mappers in both cases was similar (14 vs. 17) and again,
CPU/Mem didn't seem to be exhausted.
Since read time is what matters most to us, this issue makes it hard for us
to use Avro.
Maybe we're doing something wrong - your input would be much appreciated!