Avro, mail # user - Avro file size is too big


Re: Avro file size is too big
Ruslan Al-Fakikh 2012-07-05, 22:11
Hey Doug,

Here is a little more explanation:
http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E
I'll answer your questions later, after some investigation.

Thank you!
On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Ruslan,
>
> This is unexpected.  Perhaps we can understand it if we have more information.
>
> What Writable class are you using for keys and values in the SequenceFile?
>
> What schema are you using in the Avro data file?
>
> Can you provide small sample files of each and/or code that will reproduce this?
>
> Thanks,
>
> Doug
>
> On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
>> Hello,
>>
>> My organization is currently evaluating Avro as a data format. Our
>> concern is file size. I've done some comparisons on a sample of our
>> data.
>> Say we have compressed sequence files. The payload (the values) is just
>> lines of text. As far as I know, we use the line number as the key and
>> the default codec for compression inside the sequence files. The size is
>> 1.6G; when I convert it to Avro with the deflate codec at deflate level 9,
>> it becomes 2.2G.
>> This is surprising, because the values in the sequence files are just
>> strings, whereas the Avro file has a proper schema with primitive types,
>> which are stored in binary. Shouldn't the Avro file be smaller?
>> I also took another dataset of 28G (gzip files containing plain
>> tab-delimited text; I don't know the deflate level) and converted it
>> to Avro, and it became 38G.
>> Why is the Avro output so large? Am I missing some size optimization?
>>
>> Thanks in advance!
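
One factor that may be worth checking when comparing these sizes: an Avro data file compresses each data block independently with the chosen codec, while a gzip'd text file is a single deflate stream. The sketch below is not a diagnosis of the numbers in this thread. It uses only Python's standard `zlib` module and synthetic tab-delimited rows (the row layout, the helper names, and the block sizes are all made up for illustration) to show how the same deflate level can give different totals depending on how large the independently compressed blocks are.

```python
import zlib


def make_rows(n):
    """Build n synthetic tab-delimited lines (a hypothetical stand-in for real data)."""
    return "".join(
        f"{i}\tuser{i % 97}\t{(i * 7919) % 100003}\n" for i in range(n)
    ).encode("utf-8")


def whole_stream_size(data, level=9):
    """Compressed size when the entire payload is one deflate stream."""
    return len(zlib.compress(data, level))


def blockwise_size(data, block_bytes, level=9):
    """Total compressed size when each block_bytes chunk is compressed on its own."""
    return sum(
        len(zlib.compress(data[i:i + block_bytes], level))
        for i in range(0, len(data), block_bytes)
    )


data = make_rows(20000)
whole = whole_stream_size(data)
small_blocks = blockwise_size(data, 4 * 1024)    # many small independent blocks
large_blocks = blockwise_size(data, 256 * 1024)  # a few large independent blocks

print("raw bytes:          ", len(data))
print("single stream:      ", whole)
print("4 KiB blocks total: ", small_blocks)
print("256 KiB blocks total:", large_blocks)
```

Each independent block pays its own header and checksum and starts with an empty compression history, so smaller blocks tend to compress worse. In the Avro Java API the block size is controlled, as far as I know, through `DataFileWriter.setSyncInterval`, so trying a larger value is one experiment to run. The schema and field encoding can matter just as much, which is why the sample files and schema requested above would help.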