-
sync interval for AvroOutputFormat
Joe Crobak 2010-12-18, 21:05
AvroOutputFormat supports setting deflate level, but not the sync interval. Was this a conscious decision (i.e. would there be drawbacks of making the sync interval larger)?
In some tests that I've done, Avro data files were over 50% smaller when I upped the sync interval to 2MB (default is 16000 bytes). I also saw a modest speedup in building the files (I suspect my program was IO-bound).
Would folks support a patch to add setting a sync interval as a static configuration option to AvroOutputFormat?
Best, Joe
-
Re: sync interval for AvroOutputFormat
Scott Carey 2010-12-19, 23:14
On Dec 18, 2010, at 1:05 PM, Joe Crobak wrote:
> AvroOutputFormat supports setting deflate level, but not the sync interval.
> Was this a conscious decision (i.e. would there be drawbacks of making the
> sync interval larger)?
>
> In some tests that I've done, Avro data files were over 50% smaller when I
> upped the sync interval to 2MB (default is 16000 bytes). I also saw a
> modest speedup in building the files (I suspect my program was IO-bound).
>
> Would folks support a patch to add setting a sync interval as a static
> configuration option to AvroOutputFormat?
Yes, it makes sense to expose that.
Out of curiosity, how much of an improvement do you get for going to 64000 bytes? A larger default for the MapReduce case makes sense, but 2MB may be on the large side. M/R has to split the file at sync boundaries and you don't want those to end up too far from the HDFS block boundaries.
The file format default is moderately sized because for many non-M/R use cases, syncing to disk more regularly is a good idea. With deflate's default 32k lookback window, compression ratio as a function of block size tends to have a sharp elbow near that size. In my experiments, compression ratio did not improve beyond blocks of about 120k, and was only moderately better than with 16000 byte blocks. But my data isn't your data.

> Best,
> Joe
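For anyone who wants to reproduce the block-size experiment on their own data, here is a minimal sketch. It does not use Avro at all: it only compresses a byte string in independent chunks with Python's `zlib`, which approximates how each data file block is deflated on its own. The helper names (`blockwise_ratio`, `make_sample`) and the synthetic records are assumptions made for illustration, and the zlib wrapper adds a few bytes per chunk, so treat the numbers as indicative only.

```python
import zlib


def blockwise_ratio(data: bytes, block_size: int, level: int = 6) -> float:
    """Compress `data` in independent chunks of `block_size` bytes and
    return total compressed size divided by original size."""
    if not data:
        raise ValueError("data must be non-empty")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    compressed = 0
    for start in range(0, len(data), block_size):
        chunk = data[start:start + block_size]
        # Each chunk is compressed separately, so no history carries
        # over from one chunk to the next.
        compressed += len(zlib.compress(chunk, level))
    return compressed / len(data)


def make_sample(n_records: int = 20000) -> bytes:
    """Build a synthetic, repetitive record stream (hypothetical data)."""
    statuses = (b"ok", b"fail", b"retry")
    return b"".join(
        b"user=%d;status=%s;score=%d\n"
        % (i % 997, statuses[i % 3], (i * 7919) % 10007)
        for i in range(n_records)
    )


if __name__ == "__main__":
    sample = make_sample()
    for size in (16000, 64000, 128000, 2 * 1024 * 1024):
        print(size, round(blockwise_ratio(sample, size), 3))
```

Swapping `make_sample()` for a real sample of serialized records should show where the elbow sits for a given data set.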
-
Re: sync interval for AvroOutputFormat
Joe Crobak 2010-12-20, 19:15
On Sun, Dec 19, 2010 at 6:14 PM, Scott Carey <[EMAIL PROTECTED]> wrote:
> On Dec 18, 2010, at 1:05 PM, Joe Crobak wrote:
>
> > AvroOutputFormat supports setting deflate level, but not the sync interval.
> > Was this a conscious decision (i.e. would there be drawbacks of making the
> > sync interval larger)?
> >
> > In some tests that I've done, Avro data files were over 50% smaller when I
> > upped the sync interval to 2MB (default is 16000 bytes). I also saw a
> > modest speedup in building the files (I suspect my program was IO-bound).
> >
> > Would folks support a patch to add setting a sync interval as a static
> > configuration option to AvroOutputFormat?
>
> Yes, it makes sense to expose that.
In that case, I'd be happy to file a ticket and create a patch.

> Out of curiosity, how much of an improvement do you get for going to 64000
> bytes? A larger default for the MapReduce case makes sense, but 2MB may be
> on the large side. M/R has to split the file at sync boundaries and you
> don't want those to end up too far from the HDFS block boundaries.
Here are the compression ratios I'm seeing (block size, compression ratio):
block size (bytes)   compression ratio
16384                0.217
32768                0.164
65536                0.132
131072               0.116
262144               0.108
524288               0.104
1048576              0.102
2097152              0.100
So the sweet-spot for this data seems to be around 128K-256K, which is within 7.7% - 16% of "optimal" (where optimal is the uncompressed file compressed with command-line gzip).

> The file format default is moderately sized because for many non M/R use
> cases, syncing to disk more regularly is a good idea. With the default
> deflate lookback window 32k, compression ratio as a function of block size
> tends to have a sharp elbow near that size. In my experiments, compression
> ratio did not go up after blocks that are about 120k in size, and was only
> moderately better than 16000 byte blocks. But my data isn't your data.
Thanks for this suggestion -- I had only looked at the two extremes. If the ability to configure the size is added, I should be able to run some tests to see how these window sizes affect performance for our application.
Thanks, Joe
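To make the diminishing returns in Joe's table easier to see, the following sketch computes how much smaller the output gets with each doubling of the block size. The ratios are copied from the message above; the function names are hypothetical, and nothing here touches Avro itself.

```python
# Compression ratios reported in the thread, keyed by block size in bytes.
RATIOS = {
    16384: 0.217,
    32768: 0.164,
    65536: 0.132,
    131072: 0.116,
    262144: 0.108,
    524288: 0.104,
    1048576: 0.102,
    2097152: 0.100,
}


def relative_gains(ratios):
    """Return (larger_size, fractional size reduction) for each step
    from one block size to the next larger one."""
    sizes = sorted(ratios)
    return [
        (big, (ratios[small] - ratios[big]) / ratios[small])
        for small, big in zip(sizes, sizes[1:])
    ]


def total_gain(ratios):
    """Fractional size reduction from the smallest to the largest block size."""
    sizes = sorted(ratios)
    return (ratios[sizes[0]] - ratios[sizes[-1]]) / ratios[sizes[0]]


if __name__ == "__main__":
    for size, gain in relative_gains(RATIOS):
        print(f"{size:>8} bytes: {gain:.1%} smaller than previous step")
    print(f"overall: {total_gain(RATIOS):.1%} smaller")
```

The overall reduction from 16384 to 2097152 bytes works out to about 54%, consistent with the "over 50% smaller" figure in the original message, while each step beyond 262144 bytes saves less than 4%.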