Re: AvroStorage compression ratio
For me it was 27.5G for the uncompressed tab-delimited plain text.
When compressed:

Format                       Size
sequence files               1.6G
avro deflate with level 1    2.9G
avro deflate with level 5    2.4G
avro deflate with level 9    2.2G
avro snappy                  4.1G
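
Since the question was about the ratio, here is a rough sketch that turns the sizes above into compression ratios. It only does arithmetic on the numbers I reported (27.5G input); the dictionary keys and the helper name compression_ratio are made up for illustration, and the figures are rounded to 0.1G, so treat the ratios as approximate.

```python
# Sizes in GB as reported in this message; input was 27.5G of tab-delimited text.
UNCOMPRESSED_GB = 27.5

# Labels are illustrative; values are the rounded sizes from the list above.
sizes_gb = {
    "sequence files": 1.6,
    "avro deflate level 1": 2.9,
    "avro deflate level 5": 2.4,
    "avro deflate level 9": 2.2,
    "avro snappy": 4.1,
}


def compression_ratio(original_gb, compressed_gb):
    """Return how many times smaller the compressed output is than the input."""
    return original_gb / compressed_gb


ratios = {
    name: compression_ratio(UNCOMPRESSED_GB, size)
    for name, size in sizes_gb.items()
}

# Print the formats from best to worst ratio, with the space saved in percent.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    saved_pct = 100.0 * (1.0 - 1.0 / ratio)
    print(f"{name:<22} {ratio:5.1f}x  ({saved_pct:4.1f}% smaller)")
```

Even the weakest result here (snappy) comes out several times smaller than the input, which is very different from the roughly 1:1 ratio described in the quoted message.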

I was using the setup described here, with CDH 3:
https://ccp.cloudera.com/display/CDHDOC/Avro+Usage#AvroUsage-Pig

Best Regards

On Tue, Oct 23, 2012 at 2:51 AM, Thejas Nair <[EMAIL PROTECTED]> wrote:
> What was the compression ratio you saw?
> I get the correct results, but the data size is almost the same as the
> uncompressed text.
>
> searches = load '/user/testuser/aol_search_logs.txt'
>     as (ID : int, Query : chararray, QueryTime : chararray,
>         ItemRank : int, ClickURL : chararray);
> store searches into '/user/testuser/aol_search_logs.avro'
>     using AvroStorage();
>
> I also tried -
>
> SET avro.output.codec snappy
> SET mapred.output.compress true
> searches = load '/user/testuser/aol_search_logs.avro'
>     using org.apache.pig.piggybank.storage.avro.AvroStorage();
> store searches into '/user/testuser/aol_search_logs.snappy.avro'
>     using org.apache.pig.piggybank.storage.avro.AvroStorage();
>
> -Thejas
>
> On 10/22/12 6:02 AM, Ruslan Al-Fakikh wrote:
>>
>> How do you generate your Avro files?
>> It worked OK for me with:
>>
>> SET avro.mapred.deflate.level 5
>> inputData = LOAD 'input path'
>>     USING org.apache.pig.piggybank.storage.avro.AvroStorage();
>> STORE inputData INTO 'output path'
>>     USING org.apache.pig.piggybank.storage.avro.AvroStorage();
>>
>> But I did these tests a long time ago with an old version.
>>
>> Ruslan
>>
>> On Sun, Oct 21, 2012 at 9:22 AM, Thejas Nair <[EMAIL PROTECTED]>
>> wrote:
>>>
>>> Based on the AvroStorage code and documentation, it looks like compression
>>> is enabled by default, with the codec set to "deflate". But the file size
>>> is almost the same as that of the uncompressed tab-separated text data.
>>>
>>> This is probably a bug in AvroStorage, but I wanted to check if this is
>>> somehow expected, before I open a jira to track it.
>>>
>>> Uncompressed txt              2.12 GB
>>> avro (default compression)    2.09 GB
>>> avro + snappy compression     2.09 GB
>>> lzo compressed txt            0.69 GB
>>>
>>>
>>> Thanks,
>>> Thejas
>>>
>