On Sun, Dec 19, 2010 at 6:14 PM, Scott Carey <[EMAIL PROTECTED]> wrote:
> On Dec 18, 2010, at 1:05 PM, Joe Crobak wrote:
> > AvroOutputFormat supports setting deflate level, but not the sync interval.
> > Was this a conscious decision (i.e. would there be drawbacks of making
> > sync interval larger)?
> > In some tests that I've done, Avro data files were over 50% smaller when
> > I upped the sync interval to 2MB (default is 16000 bytes). I also saw a
> > modest speedup in building the files (I suspect my program was IO-bound).
> > Would folks support a patch to add setting a sync interval as a static
> > configuration option to AvroOutputFormat?
> Yes, it makes sense to expose that.
In that case, I'd be happy to file a ticket and create a patch.
> Out of curiosity, how much of an improvement do you get for going to 64000
> bytes? A larger default for the MapReduce case makes sense, but 2MB may be
> on the large side. M/R has to split the file at sync boundaries and you
> don't want those to end up too far from the HDFS block boundaries.
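
To put rough numbers on the boundary concern above, here is a back-of-the-envelope sketch. It assumes a 64 MiB HDFS block size, which is not stated in this thread, and the function name is hypothetical. The sync interval is treated as the worst-case distance between an HDFS block boundary and the next sync marker; with compression the on-disk spacing would usually be smaller.

```python
# Assumption: 64 MiB HDFS blocks (not stated in the thread).
ASSUMED_HDFS_BLOCK_SIZE = 64 * 1024 * 1024


def boundary_overshoot_fraction(sync_interval, hdfs_block_size=ASSUMED_HDFS_BLOCK_SIZE):
    """Rough worst-case gap to the next sync marker, as a fraction of one HDFS block."""
    return sync_interval / hdfs_block_size


if __name__ == "__main__":
    for interval in (16_000, 64_000, 256_000, 2 * 1024 * 1024):
        frac = boundary_overshoot_fraction(interval)
        print(f"{interval:>9} bytes: up to {frac:.3%} of one HDFS block")
```

Under these assumptions even a 2MB interval stays at a few percent of a block, so the cost depends mostly on how expensive the extra non-local reads are for a given job.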
Here are the compression ratios I'm seeing (block size, compression ratio):
So the sweet-spot for this data seems to be around 128K-256K, which is
within 7.7% - 16% of "optimal" (where optimal is the uncompressed file
compressed with command-line gzip).
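
For anyone who wants to run a similar comparison on their own data, here is a minimal sketch of the measurement. It is an illustration only: it uses Python's zlib on synthetic records rather than Avro's container format, ignores the per-block sync marker and counts, and all function names are hypothetical. It compresses each block independently and reports the total relative to compressing the whole payload in one pass.

```python
import random
import zlib


def make_records(n, seed=0):
    """Build synthetic, repetitive record-like bytes (a stand-in, not real Avro data)."""
    rng = random.Random(seed)
    words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon", b"zeta"]
    rows = []
    for i in range(n):
        row = b",".join(rng.choice(words) for _ in range(8))
        rows.append(b"%d," % i + row + b"\n")
    return b"".join(rows)


def blockwise_size(data, block_size, level=6):
    """Total compressed bytes when each block_size chunk is compressed on its own."""
    total = 0
    for start in range(0, len(data), block_size):
        total += len(zlib.compress(data[start:start + block_size], level))
    return total


def ratios(data, block_sizes, level=6):
    """Map block size -> blockwise size relative to whole-payload compression."""
    whole = len(zlib.compress(data, level))
    return {bs: blockwise_size(data, bs, level) / whole for bs in block_sizes}


if __name__ == "__main__":
    data = make_records(50_000)
    sizes = [16_000, 64_000, 128_000, 256_000, 2_000_000]
    for bs, r in ratios(data, sizes).items():
        print(f"{bs:>9} bytes/block: {r:.3f}x the size of whole-payload deflate")
```

The numbers from synthetic input say nothing about real files; swapping `make_records` for a read of an actual uncompressed data file is the useful version of this test.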
> The file format default is moderately sized because for many non M/R use
> cases, syncing to disk more regularly is a good idea. With the default
> deflate lookback window of 32k, compression ratio as a function of block size
> tends to have a sharp elbow near that size. In my experiments, compression
> ratio did not go up after blocks that are about 120k in size, and was only
> moderately better than 16000 byte blocks. But my data isn't your data.
Thanks for this suggestion -- I had only looked at the two extremes. If the
ability to configure the size is added, then I should be able to do some tests
to see how these window sizes affect performance for our application.