Avro >> mail # dev >> sync interval for AvroOutputFormat


Re: sync interval for AvroOutputFormat
On Sun, Dec 19, 2010 at 6:14 PM, Scott Carey <[EMAIL PROTECTED]> wrote:

>
> On Dec 18, 2010, at 1:05 PM, Joe Crobak wrote:
>
> > AvroOutputFormat supports setting deflate level, but not the sync
> > interval.  Was this a conscious decision (i.e. would there be drawbacks
> > of making the sync interval larger)?
> >
> > In some tests that I've done, Avro data files were over 50% smaller when
> > I upped the sync interval to 2MB (default is 16000 bytes).  I also saw a
> > modest speedup in building the files (I suspect my program was IO-bound).
> >
> > Would folks support a patch to add setting a sync interval as a static
> > configuration option to AvroOutputFormat?
>
> Yes, it makes sense to expose that.
>

In that case, I'd be happy to file a ticket and create a patch.
>
> Out of curiosity, how much of an improvement do you get for going to 64000
> bytes?  A larger default for the MapReduce case makes sense, but 2MB may be
> on the large side.  M/R has to split the file at sync boundaries and you
> don't want those to end up too far from the HDFS block boundaries.
>
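
The concern about sync boundaries and HDFS block boundaries can be put in rough numbers. The sketch below assumes a 64 MB HDFS block size (a hypothetical value, not stated in this thread) and treats the sync interval as an upper bound on the distance from a block boundary to the next sync marker. That bound is pessimistic, since compressed blocks on disk are smaller than the uncompressed interval.

```python
# Illustrative arithmetic only. HDFS_BLOCK_SIZE is an assumed value,
# not something taken from the thread.
HDFS_BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, assumed


def worst_case_overrun(sync_interval, block_size=HDFS_BLOCK_SIZE):
    """Fraction of one HDFS block that a reader may have to skip or read
    past a block boundary before reaching the next sync marker, using the
    sync interval itself as a rough upper bound on that distance."""
    return sync_interval / block_size


for interval in (16_000, 64_000, 2 * 1024 * 1024):
    share = worst_case_overrun(interval)
    print(f"{interval:>8} bytes -> at most {share:.4%} of a block")
```

Under these assumptions even a 2MB interval stays at a few percent of a block, so the trade-off depends mostly on how large the HDFS blocks actually are.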

Here are the compression ratios I'm seeing (block size, compression ratio):

16384 0.217
32768 0.164
65536 0.132
131072 0.116
262144 0.108
524288 0.104
1048576 0.102
2097152 0.100

So the sweet spot for this data seems to be around 128K-256K, which is
within 16% and 7.7% of "optimal", respectively (where optimal is the
uncompressed file compressed with command-line gzip).
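
One way to see where the curve flattens is to compute how much each doubling of the block size buys. The sketch below uses only the figures listed above. The helper name `marginal_gains` is made up for illustration, and the gzip baseline is left out because its value is not given here.

```python
# (block size in bytes, compression ratio) copied from the list above.
measurements = [
    (16384, 0.217), (32768, 0.164), (65536, 0.132), (131072, 0.116),
    (262144, 0.108), (524288, 0.104), (1048576, 0.102), (2097152, 0.100),
]


def marginal_gains(rows):
    """For each doubling of the block size, return (new size, relative
    reduction in compressed size compared with the previous size)."""
    return [
        (size, (prev_ratio - ratio) / prev_ratio)
        for (_, prev_ratio), (size, ratio) in zip(rows, rows[1:])
    ]


for size, gain in marginal_gains(measurements):
    print(f"{size:>8}: {gain:.1%} smaller than the previous block size")
```

The gains shrink with every doubling, and past 256K each step saves less than 5% of the previous file size.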
>
> The file format default is moderately sized because for many non M/R use
> cases, syncing to disk more regularly is a good idea.  With the default
> deflate lookback window 32k, compression ratio as a function of block size
> tends to have a sharp elbow near that size.  In my experiments,  compression
> ratio did not go up after blocks that are about 120k in size, and was only
> moderately better than 16000 byte blocks.  But my data isn't your data.
>
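
The effect of block size on deflate can be approximated outside Avro. The following is a minimal sketch, not Avro's writer: it compresses fixed-size chunks independently with Python's `zlib` (which adds a small header and checksum per chunk that raw deflate would not), and it runs on synthetic records that stand in for real data. The helpers `make_records` and `blockwise_ratio` are hypothetical names.

```python
import random
import zlib


def make_records(n, seed=0):
    """Synthetic, repetitive text records; an assumption standing in
    for real application data."""
    rng = random.Random(seed)
    words = [f"field{i}" for i in range(200)]
    lines = [" ".join(rng.choice(words) for _ in range(12)) for _ in range(n)]
    return "\n".join(lines).encode("ascii")


def blockwise_ratio(data, block_size, level=6):
    """Compress each block on its own, so no history is shared between
    blocks, and return total compressed size / uncompressed size."""
    compressed = sum(
        len(zlib.compress(data[i:i + block_size], level))
        for i in range(0, len(data), block_size)
    )
    return compressed / len(data)


data = make_records(20_000)
for block_size in (4_096, 16_384, 65_536, 262_144):
    print(block_size, round(blockwise_ratio(data, block_size), 3))
```

Because each block starts with an empty history, small blocks pay the start-up cost repeatedly. Where the curve levels off depends on the data, so the numbers from this synthetic input should not be read as predictions for real files.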

Thanks for this suggestion -- I had only looked at the two extremes.  If the
ability to configure the size is added, then I should be able to do some
tests to see how these window sizes affect performance for our application.

Thanks,
Joe