On Wed, Feb 29, 2012 at 19:13, Robert Evans <[EMAIL PROTECTED]> wrote:
> What I really want to know is how well does this new CompressionCodec
> perform in comparison to the regular gzip codec in
> various different conditions and what type of impact does it have on
> network traffic and datanode load. My gut feeling is that
> the speedup is going to be relatively small except when there is a lot of
> computation happening in the mapper
I agree; I made the same assessment.
In the javadoc I wrote under "When is this useful?":
*"Assume you have a heavy map phase for which the input is a 1GiB Apache
httpd logfile. Now assume this map takes 60 minutes of CPU time to run."*
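To put rough numbers on that: with a replication factor of 3 the codec can
hand out 3 splits, so each mapper carries about 20 of those 60 minutes of
map CPU. The extra decompression work per split is small compared to that,
so the heavier the map phase, the bigger the win.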
> and the added load and network traffic outweighs the speedup in most
No, the trick to solve that one is to upload the gzipped files with an HDFS
block size equal to (or 1 byte larger than) the file size.
This setting helps speed up gzipped input files in any situation: the whole
file sits in a single block, so there is no more network overhead when
reading it.
From there, the HDFS replication factor of the file dictates the
optimal number of splits for this codec.
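Something along these lines does it with the Java API (a quick, untested
sketch; the class name, paths, and replication value are just examples):

import java.io.File;
import java.io.InputStream;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SingleBlockGzipUpload {
  public static void main(String[] args) throws Exception {
    File local = new File("access_log.gz");   // hypothetical local file

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // HDFS wants the block size to be a multiple of the checksum chunk
    // (io.bytes.per.checksum, 512 bytes by default), so round the file
    // size up to the next multiple: "equal or 1 byte larger" in spirit.
    long blockSize = ((local.length() + 511) / 512) * 512;

    short replication = 3;  // replication factor dictates the number of splits
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);

    try (FSDataOutputStream out = fs.create(new Path("/logs/access_log.gz"),
                                            true, bufferSize, replication,
                                            blockSize);
         InputStream in = Files.newInputStream(local.toPath())) {
      // Streams are closed by try-with-resources, hence 'false' here.
      IOUtils.copyBytes(in, out, conf, false);
    }
  }
}

From the shell you can get the same effect with something like
"hadoop fs -D dfs.block.size=1073741824 -put access_log.gz /logs/"
(the property is called dfs.blocksize in the 2.x line).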
> but like all performance on a complex system gut feelings are
> almost worthless and hard numbers are what is needed to make a judgment
> Niels, I assume you have tested this on your cluster(s). Can you share
> with us some of the numbers?
No, I haven't tested it beyond a multi-core system.
The simple reason is that when this was under review last summer, the whole
"YARN" transition happened, and I was unable to run it at all for a long
time. I only got it running again last December, when the restructuring of
the source tree was mostly done.
At this moment I'm building an experimentation setup at work that can be
used for various things.
Given the current state of Hadoop 2.0, I think it's time to produce some
hard numbers.
Best regards / Kind regards,