Thejas Nair 2012-10-21, 05:22
Ruslan Al-Fakikh 2012-10-22, 13:02
Thejas Nair 2012-10-22, 22:51
For me it was:
uncompressed tab-delimited plain txt 27.5G
sequence files 1.6G
avro deflate with level 1 2.9G
avro deflate with level 5 2.4G
avro deflate with level 9 2.2G
avro snappy 4.1G
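
To put those sizes in terms of a compression ratio, here is a small sketch in Python. It only does the arithmetic on the numbers listed above; the names `UNCOMPRESSED_GB`, `sizes_gb` and `compression_ratio` are made up for illustration and are not part of Pig or Avro.

```python
# Sizes in GB, copied from the list above.
UNCOMPRESSED_GB = 27.5

sizes_gb = {
    "sequence files": 1.6,
    "avro deflate level 1": 2.9,
    "avro deflate level 5": 2.4,
    "avro deflate level 9": 2.2,
    "avro snappy": 4.1,
}


def compression_ratio(uncompressed_gb, compressed_gb):
    """Return how many times smaller the compressed output is."""
    return uncompressed_gb / compressed_gb


# Ratio of the uncompressed text size to each stored format.
ratios = {
    name: compression_ratio(UNCOMPRESSED_GB, size)
    for name, size in sizes_gb.items()
}

for name, ratio in ratios.items():
    # Also show the output size as a percentage of the uncompressed input.
    percent_of_input = 100.0 / ratio
    print(f"{name}: {ratio:.1f}x smaller ({percent_of_input:.0f}% of input)")
```

Applying the same division to the 2.12 GB and 2.09 GB figures quoted further down gives a ratio very close to 1, which is what "almost the same as uncompressed" amounts to.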
I was using this:
with CDH 3
On Tue, Oct 23, 2012 at 2:51 AM, Thejas Nair <[EMAIL PROTECTED]> wrote:
> What was the compression ratio you saw?
> I get the correct results, but the data size is almost the same as uncompressed
> searches = load '/user/testuser/aol_search_logs.txt' as (ID : int, Query :
> chararray, QueryTime : chararray, ItemRank : int, ClickURL : chararray);
> store searches into '/user/testuser/aol_search_logs.avro' using
> I also tried -
> SET avro.output.codec snappy
> SET mapred.output.compress true
> searches = load '/user/testuser/aol_search_logs.avro' using
> store searches into '/user/testuser/aol_search_logs.snappy.avro' using
> On 10/22/12 6:02 AM, Ruslan Al-Fakikh wrote:
>> How do you generate your Avro files?
>> It worked OK for me with:
>> SET avro.mapred.deflate.level 5
>> inputData = LOAD 'input path' USING
>> STORE inputData INTO 'output path' USING
>> But I did these tests a long time ago with an old version.
>> On Sun, Oct 21, 2012 at 9:22 AM, Thejas Nair <[EMAIL PROTECTED]> wrote:
>>> Based on AvroStorage code and documentation, it looks like compression is
>>> enabled by default, with the codec set to "deflate". But the file size is almost
>>> the same as that of uncompressed tab-separated text data.
>>> This is probably a bug in AvroStorage, but I wanted to check if this is
>>> somehow expected, before I open a jira to track it.
>>> Uncompressed txt 2.12 GB
>>> avro (default compression) 2.09 GB
>>> avro + snappy compression 2.09 GB
>>> lzo compressed txt 0.69 GB