Hadoop, mail # dev - Snappy compression block sizes


Snappy compression block sizes
Tim Broberg 2012-01-26, 20:56
I'm confused about the disparity of block sizes between BlockCompressorStream and SnappyCompressor.

BlockCompressorStream has a default MAX_INPUT_SIZE on the order of 512 bytes, whereas SnappyCompressor has an IO_COMPRESSION_CODEC_SNAPPY_BUFFERSIZE_DEFAULT of 256 kB.

In BlockCompressorStream.write() (reproduced below), I see no case where we can ever write more than MAX_INPUT_SIZE to the compressor before calling compressor.finish(), flushing the output, and resetting.

So, if we only ever process 512 bytes at a time, why do we have 256 kB of buffer in the compressor?

Shouldn't we be flushing every 256 kB, not every 1/2 kB?

I feel like I must be missing something obvious; otherwise we would be getting terrible compression, since Snappy (and lz4) would have only 256 bytes of compression history available on average.

What am I missing?

TIA,
    - Tim.

    long limlen = compressor.getBytesRead();
    if (len + limlen > MAX_INPUT_SIZE && limlen > 0) {
      // Adding this segment would exceed the maximum size.
      // Flush data if we have it.
      finish();
      compressor.reset();
    }

    if (len > MAX_INPUT_SIZE) {
      // The data we're given exceeds the maximum size. Any data
      // we had has been flushed, so we write out this chunk in segments
      // not exceeding the maximum size until it is exhausted.
      rawWriteInt(len);
      do {
        int bufLen = Math.min(len, MAX_INPUT_SIZE);
        compressor.setInput(b, off, bufLen);
        compressor.finish();
        while (!compressor.finished()) {
          compress();
        }
        compressor.reset();
        off += bufLen;
        len -= bufLen;
      } while (len > 0);
      return;
    }
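
To make the concern concrete, here is a minimal sketch of the segmenting arithmetic in the loop above. It is not the Hadoop code: the class ChunkingSketch and the method countChunks are hypothetical names made up for illustration, no compressor is involved, and the 512-byte limit is simply the value assumed in this message (the effective MAX_INPUT_SIZE depends on how the stream is actually constructed).

```java
// Hypothetical illustration only: mirrors the do/while segmenting loop
// in the excerpt above, but just counts segments instead of compressing.
class ChunkingSketch {

    /**
     * Counts how many independently compressed segments a single write of
     * len bytes would be split into, given a per-segment input limit.
     * Assumes len > 0 and maxInputSize > 0.
     */
    static int countChunks(int len, int maxInputSize) {
        int chunks = 0;
        do {
            // Same clamp as in the excerpt: never feed more than the limit.
            int bufLen = Math.min(len, maxInputSize);
            chunks++;          // stands in for setInput/finish/compress/reset
            len -= bufLen;
        } while (len > 0);
        return chunks;
    }

    public static void main(String[] args) {
        int len = 256 * 1024;  // one 256 kB write

        // With the 512-byte limit assumed in this message.
        System.out.println(countChunks(len, 512));        // prints 512

        // With a limit equal to the 256 kB compressor buffer.
        System.out.println(countChunks(len, 256 * 1024)); // prints 1
    }
}
```

Under those assumptions, a single 256 kB write is cut into 512 separately reset segments, which is the behaviour being questioned here.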
