Pig, mail # user - AvroStorage compression ratio


Re: AvroStorage compression ratio
Thejas Nair 2012-10-22, 22:51
What was the compression ratio you saw?
I get the correct results, but the data size is almost the same as that of
the uncompressed text.

searches = load '/user/testuser/aol_search_logs.txt'
    as (ID : int, Query : chararray, QueryTime : chararray,
        ItemRank : int, ClickURL : chararray);
store searches into '/user/testuser/aol_search_logs.avro'
    using AvroStorage();

I also tried -

SET avro.output.codec snappy
SET mapred.output.compress true
searches = load '/user/testuser/aol_search_logs.avro'
    using org.apache.pig.piggybank.storage.avro.AvroStorage();
store searches into '/user/testuser/aol_search_logs.snappy.avro'
    using org.apache.pig.piggybank.storage.avro.AvroStorage();

-Thejas

On 10/22/12 6:02 AM, Ruslan Al-Fakikh wrote:
> How do you generate your Avro files?
> It worked OK for me with:
>
> SET avro.mapred.deflate.level 5
> inputData = LOAD 'input path' USING
> org.apache.pig.piggybank.storage.avro.AvroStorage();
> STORE inputData INTO 'output path' USING
> org.apache.pig.piggybank.storage.avro.AvroStorage();
>
> But I did these tests a long time ago with an old version.
>
> Ruslan
>
> On Sun, Oct 21, 2012 at 9:22 AM, Thejas Nair <[EMAIL PROTECTED]> wrote:
>> Based on the AvroStorage code and documentation, it looks like compression is
>> enabled by default, with the codec set to "deflate". But the file size is almost
>> the same as that of the uncompressed tab-separated text data.
>>
>> This is probably a bug in AvroStorage, but I wanted to check if this is
>> somehow expected, before I open a jira to track it.
>>
>> Uncompressed txt              2.12 GB
>> avro (default compression)    2.09 GB
>> avro + snappy compression     2.09 GB
>> lzo compressed txt            0.69 GB
>>
>>
>> Thanks,
>> Thejas
>>
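
For reference, the sizes reported in the thread can be turned into compression ratios with a short calculation. This is only a sketch: the figures are the ones from the message above, while the names `sizes_gb` and `compression_ratio` are made up for illustration and are not part of Pig or AvroStorage.

```python
# Sizes in GB, copied from the figures reported in the thread.
sizes_gb = {
    "uncompressed txt": 2.12,
    "avro (default compression)": 2.09,
    "avro + snappy compression": 2.09,
    "lzo compressed txt": 0.69,
}


def compression_ratio(original_gb, stored_gb):
    """Return original size divided by stored size (higher means smaller output)."""
    return original_gb / stored_gb


# Use the uncompressed text file as the baseline for every comparison.
baseline = sizes_gb["uncompressed txt"]
ratios = {
    name: compression_ratio(baseline, size) for name, size in sizes_gb.items()
}

for name, ratio in ratios.items():
    saved = 1 - sizes_gb[name] / baseline
    print(f"{name:28s} ratio {ratio:.2f}x, space saved {saved:.1%}")
```

Both Avro outputs come out at roughly 1.01x, against roughly 3.07x for the LZO-compressed text, which is consistent with the suspicion that the Avro files were not actually being compressed.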