Avro user mailing list: Avro file size is too big

Ruslan Al-Fakikh 2012-07-04, 13:32
Russell Jurney 2012-07-04, 21:58

Re: Avro file size is too big
Hi Russell,

I am not sure what flushing means here. I am creating the Avro files
from Pig and from Hive (and getting basically the same results in both).
I've already seen that post, but my question differs. That poster had
40G of raw text, and after he resolved his problem he got 4.5G of
Avro with the deflate codec, i.e. 8.8x compression.
My results are even better from the start:
I have 27G of raw text that becomes 2.2G of Avro with the deflate codec
(deflate level 9), so my compression is about 12x.
But my question is: why are the gzip files and sequence files (with the
default codec) about 0.72x the size of the Avro files with deflate
level 9 (e.g. 1.6G vs 2.2G below)?

Thanks

On Thu, Jul 5, 2012 at 1:58 AM, Russell Jurney <[EMAIL PROTECTED]> wrote:
> This thread looks useful. Are you flushing too often?
> http://apache-avro.679487.n3.nabble.com/avro-compression-using-snappy-and-deflate-td3870167.html
>
> Russell Jurney http://datasyndrome.com
>
> On Jul 4, 2012, at 6:33 AM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
>
>> Hello,
>>
>> In my organization we are currently evaluating Avro as a format. Our
>> concern is file size. I've done some comparisons on a piece of our
>> data.
>> Say we have compressed sequence files. The payload (values) is just
>> lines of text. As far as I know we use the line number as the key and
>> the default codec for compression inside the sequence files. The size
>> is 1.6G; when I convert it to Avro with the deflate codec at deflate
>> level 9, it becomes 2.2G.
>> This is interesting, because the values in the seq files are just
>> strings, while Avro has a proper schema with primitive types, and
>> those are stored in binary. Shouldn't the Avro files be smaller?
>> Also I took another dataset which is 28G (gzip files, plain
>> tab-delimited text; I don't know the deflate level), converted it to
>> Avro, and it became 38G.
>> Why are the Avro files so big? Am I missing some size optimization?
>>
>> Thanks in advance!
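
A note on the flushing question raised above: Avro data files are written
block by block, and each block is compressed independently, so an explicit
flush after every record (or a tiny sync interval) produces many small,
poorly compressed blocks. A minimal Java sketch of the relevant knobs,
assuming a hypothetical schema file "record.avsc" and placeholder record
contents, neither of which is from this thread:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema file; any record schema will do.
    Schema schema = new Schema.Parser().parse(new File("record.avsc"));

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.setCodec(CodecFactory.deflateCodec(9)); // deflate level 9, as in the thread
    writer.setSyncInterval(1 << 20);               // ~1 MB blocks; the default is 64000 bytes
    writer.create(schema, new File("data.avro"));

    GenericRecord rec = new GenericData.Record(schema);
    // ... populate the record's fields here ...
    writer.append(rec);
    // Avoid calling writer.flush() after every record: flush() ends the
    // current block, and many tiny blocks compress poorly.
    writer.close();
  }
}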
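
On why whole-file gzip and the sequence files can come out around 0.72x
the Avro size: gzip and Avro's deflate codec use the same underlying
algorithm, but gzip compresses the file as one continuous stream, while
an Avro container restarts compression at every block boundary and also
stores the schema in the header plus a 16-byte sync marker per block.
A self-contained toy comparison of one-stream versus per-block deflate,
using made-up repetitive data rather than the actual dataset:

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class BlockVsStream {
  // Builds some repetitive, text-like sample data.
  static byte[] sample() {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 100000; i++) {
      sb.append("2012-07-04\tuser").append(i % 50).append("\tsome repeated payload\n");
    }
    return sb.toString().getBytes();
  }

  // Deflates the data in independent blocks of the given size and
  // returns the total compressed size.
  static int deflated(byte[] data, int blockSize) throws Exception {
    int total = 0;
    for (int off = 0; off < data.length; off += blockSize) {
      int len = Math.min(blockSize, data.length - off);
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      DeflaterOutputStream dos = new DeflaterOutputStream(out, new Deflater(9));
      dos.write(data, off, len);
      dos.close(); // finishes this independent deflate stream
      total += out.size();
    }
    return total;
  }

  public static void main(String[] args) throws Exception {
    byte[] data = sample();
    System.out.println("one stream   : " + deflated(data, data.length));
    System.out.println("64 KB blocks : " + deflated(data, 64000));
    System.out.println("4 KB blocks  : " + deflated(data, 4096));
  }
}

The exact gap depends on the data and the block size, but per-block
deflate always pays a restart cost, which is one reason raising the sync
interval (see the previous sketch) usually shrinks Avro files.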
Later replies in this thread:

Doug Cutting 2012-07-05, 17:24
Ruslan Al-Fakikh 2012-07-05, 22:11
Doug Cutting 2012-07-05, 22:19
Ey-Chih chow 2012-07-18, 23:59
Harsh J 2012-07-20, 02:07
Ey-Chih chow 2012-07-20, 17:02
Ey-Chih chow 2012-07-20, 17:12
Doug Cutting 2012-07-20, 20:00
Ey-Chih chow 2012-07-20, 20:32