Avro >> mail # dev >> sync interval for AvroOutputFormat


Re: sync interval for AvroOutputFormat
On Sun, Dec 19, 2010 at 6:14 PM, Scott Carey <[EMAIL PROTECTED]> wrote:

>
> On Dec 18, 2010, at 1:05 PM, Joe Crobak wrote:
>
> > AvroOutputFormat supports setting deflate level, but not the sync interval.
> > Was this a conscious decision (i.e. would there be drawbacks of making the
> > sync interval larger)?
> >
> > In some tests that I've done, Avro data files were over 50% smaller when I
> > upped the sync interval to 2MB (default is 16000 bytes).  I also saw a
> > modest speedup in building the files (I suspect my program was IO-bound).
> >
> > Would folks support a patch to add setting a sync interval as a static
> > configuration option to AvroOutputFormat?
>
> Yes, it makes sense to expose that.
>

In that case, I'd be happy to file a ticket and create a patch.
>
> Out of curiosity, how much of an improvement do you get for going to 64000
> bytes?  A larger default for the MapReduce case makes sense, but 2MB may be
> on the large side.  M/R has to split the file at sync boundaries and you
> don't want those to end up too far from the HDFS block boundaries.
>
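
To put the split-boundary concern in rough numbers, here is a minimal sketch. It assumes a simplified model in which a reader may have to read up to about one sync interval past the end of its split before it reaches the next sync marker. The 64 MB HDFS block size and the function name are assumptions for illustration; neither comes from this thread. On-disk distances will usually be smaller than the sync interval, since blocks are compressed.

```python
# Simplified model (an assumption, not Avro's actual split logic):
# a split reader may read up to roughly one sync interval beyond
# the end of its split before it hits the next sync marker.

HDFS_BLOCK_BYTES = 64 * 1024 * 1024  # assumed HDFS block size, not stated in the thread


def worst_case_overrun_fraction(sync_interval_bytes, hdfs_block_bytes=HDFS_BLOCK_BYTES):
    """Upper bound on the share of one HDFS block read past a split boundary."""
    return sync_interval_bytes / hdfs_block_bytes


# Compare the file-format default, the 64000 bytes asked about, and 2MB.
for sync_interval in (16000, 64000, 2 * 1024 * 1024):
    fraction = worst_case_overrun_fraction(sync_interval)
    print(f"{sync_interval:>8} bytes -> up to {fraction:.3%} of one HDFS block")
```

Under these assumptions even a 2MB interval is a small share of one HDFS block, though the extra bytes may have to come from a block stored on another node.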

Here are the compression ratios I'm seeing:

block size (bytes)   compression ratio
             16384   0.217
             32768   0.164
             65536   0.132
            131072   0.116
            262144   0.108
            524288   0.104
           1048576   0.102
           2097152   0.100

So the sweet spot for this data seems to be around 128K-256K, which is
within 16% (at 128K) to 7.7% (at 256K) of "optimal" (where optimal is the
uncompressed file compressed with command-line gzip).
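
The diminishing returns in those numbers can be checked with a little arithmetic. The sketch below uses only the ratios listed above. Because the command-line gzip figure is not given here, it uses the 2MB row as a stand-in for the best achievable ratio, which is an assumption; the function names are made up for illustration.

```python
# Block size in bytes -> compression ratio, copied from the figures above.
RATIOS = {
    16384: 0.217,
    32768: 0.164,
    65536: 0.132,
    131072: 0.116,
    262144: 0.108,
    524288: 0.104,
    1048576: 0.102,
    2097152: 0.100,
}


def gain_per_doubling(ratios):
    """Relative reduction in compressed size each time the block size doubles."""
    sizes = sorted(ratios)
    return [
        (bigger, (ratios[smaller] - ratios[bigger]) / ratios[smaller])
        for smaller, bigger in zip(sizes, sizes[1:])
    ]


def excess_over_best(ratios):
    """How far each ratio sits above the smallest one (stand-in for 'optimal')."""
    best = min(ratios.values())
    return {size: (ratio - best) / best for size, ratio in ratios.items()}


for size, gain in gain_per_doubling(RATIOS):
    print(f"doubling to {size:>8}: {gain:.1%} smaller")

for size, excess in sorted(excess_over_best(RATIOS).items()):
    print(f"{size:>8}: {excess:.1%} above the 2MB ratio")
```

Each doubling past 128K buys less than the one before it, which is why the larger sizes add little for this data.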
>
> The file format default is moderately sized because for many non-M/R use
> cases, syncing to disk more regularly is a good idea.  With the default
> deflate lookback window of 32k, compression ratio as a function of block size
> tends to have a sharp elbow near that size.  In my experiments, compression
> ratio did not go up after blocks that are about 120k in size, and was only
> moderately better than 16000 byte blocks.  But my data isn't your data.
>
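
A quick way to see how block size interacts with the deflate window on a given data set is to compress it in independent blocks and compare the totals. This is only a sketch: it uses Python's zlib on synthetic CSV-like records, not Avro's file writer, so the absolute numbers will not match real Avro files. The record layout and function names are made up for illustration.

```python
import random
import zlib


def make_sample_records(n_records=20000, seed=0):
    """Build synthetic, moderately compressible text records (illustrative only)."""
    rng = random.Random(seed)
    lines = [
        f"user{rng.randrange(1000)},event{rng.randrange(50)},{rng.random():.6f}\n"
        for _ in range(n_records)
    ]
    return "".join(lines).encode("ascii")


def blockwise_ratio(data, block_size, level=6):
    """Compress each block independently and return compressed/uncompressed size."""
    compressed = 0
    for start in range(0, len(data), block_size):
        compressed += len(zlib.compress(data[start:start + block_size], level))
    return compressed / len(data)


data = make_sample_records()
for block_size in (1024, 16384, 32768, 65536, 131072, 262144):
    print(f"{block_size:>7} {blockwise_ratio(data, block_size):.3f}")
```

Running the same loop over a sample of real records, rather than the synthetic ones here, should show where the elbow falls for that data.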

Thanks for this suggestion -- I had only looked at the two extremes.  Once the
ability to configure the size is available, I should be able to run some tests
to see how these window sizes affect performance for our application.

Thanks,
Joe