Yes, Avro Data Files are always splittable.
If this is for MapReduce, you may want to increase the block size in the
files above the default. The block size often has a bigger impact on the
compression ratio than the compression level setting does.
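
If you want to check this against your own data before changing anything, here is a rough sketch using Python's zlib module. It only imitates the idea of compressing fixed-size blocks independently; it does not read or write Avro files, and the helper names (compressed_size, make_records) and the sample records are made up for illustration. In the Avro Java API the corresponding knob is, as far as I know, DataFileWriter.setSyncInterval.

```python
import zlib


def compressed_size(data: bytes, block_size: int, level: int) -> int:
    """Total size after deflating each block_size chunk independently."""
    total = 0
    for start in range(0, len(data), block_size):
        total += len(zlib.compress(data[start:start + block_size], level))
    return total


def make_records(count: int) -> bytes:
    """Synthetic, repetitive log-like records (hypothetical sample data)."""
    return b"".join(
        b"user_id=%d,event=click,page=/home\n" % (i % 5000)
        for i in range(count)
    )


if __name__ == "__main__":
    data = make_records(200_000)
    # Compare a few block sizes against a few deflate levels.
    for block_size in (4 * 1024, 64 * 1024, 1024 * 1024):
        for level in (1, 6, 9):
            size = compressed_size(data, block_size, level)
            ratio = len(data) / size
            print(f"block={block_size:>8} level={level} ratio={ratio:.2f}")
```

Swap make_records for a sample of your real serialized records; the numbers depend heavily on how repetitive the data is.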
If you are sensitive to write performance, you might want a lower
deflate compression level as well. Read performance is relatively
constant for deflate as the compression level changes (except for
uncompressed level 0), but write performance varies quite a bit
between compression levels 1 and 9 -- typically by a factor of 5 or 6.
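
The write/read asymmetry is easy to measure yourself. The sketch below times zlib compression and decompression at a few levels; time_levels is a hypothetical helper, it uses plain zlib rather than Avro's deflate codec, and the timings you get will depend on your data and hardware.

```python
import time
import zlib


def time_levels(data: bytes, levels=(1, 6, 9)) -> dict:
    """Map each level to (compress_seconds, decompress_seconds, packed_size)."""
    results = {}
    for level in levels:
        t0 = time.perf_counter()
        packed = zlib.compress(data, level)
        t1 = time.perf_counter()
        unpacked = zlib.decompress(packed)
        t2 = time.perf_counter()
        # Sanity check: the round trip must reproduce the input.
        assert unpacked == data
        results[level] = (t1 - t0, t2 - t1, len(packed))
    return results


if __name__ == "__main__":
    # Replace this with a sample of real records for meaningful numbers.
    sample = b"".join(b"row %d: some repeated text\n" % i for i in range(500_000))
    for level, (write_s, read_s, size) in time_levels(sample).items():
        print(f"level={level} write={write_s:.3f}s read={read_s:.3f}s bytes={size}")
```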
On 9/30/11 6:42 PM, "Eric Hauser" <[EMAIL PROTECTED]> wrote:
>A coworker and I were having a conversation today about choosing a
>compression algorithm for some data we are storing in Hadoop. We have
>been using (https://github.com/tomslabs/avro-utils) for our Map/Reduce
>jobs and Haivvreo for integration with Hive. By default, the
>avro-utils OutputFormat uses deflate compression. Even though
>deflate/zlib/gzip files are not splittable, we decided that Avro data
>files are always splittable because individual blocks within the file
>are compressed instead of the entire file.
>Is this accurate? Thanks.