Re: Should splittable Gzip be a "core" hadoop feature?

On Wed, Feb 29, 2012 at 19:13, Robert Evans <[EMAIL PROTECTED]> wrote:
> What I really want to know is how well does this new CompressionCodec
> perform in comparison to the regular gzip codec in various different
> conditions and what type of impact does it have on network traffic and
> datanode load.  My gut feeling is that the speedup is going to be
> relatively small except when there is a lot of computation happening in
> the mapper
I agree, I made the same assessment.
In the javadoc I wrote under "When is this useful?"
*"Assume you have a heavy map phase for which the input is a 1GiB Apache
httpd logfile. Now assume this map takes 60 minutes of CPU time to run."*
> and the added load and network traffic outweighs the speedup in most
> cases,
No, the trick to solve that one is to upload the gzipped files with an HDFS
block size equal to (or 1 byte larger than) the file size.
This setting will help in speeding up gzipped input files in any situation
(no more network overhead).
From there the HDFS file replication factor of the file dictates the
optimal number of splits for this codec.
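
To make that arithmetic concrete, here is a minimal sketch in Python. It is
not part of the codec: the function name, the default replication of 3 and
the 512-byte alignment are assumptions for illustration only. The alignment
is there because HDFS commonly requires the block size to be a multiple of
the checksum chunk size; pass align=1 to follow the description above
literally. It also assumes one split per replica, which is one reading of
"the replication factor dictates the optimal number of splits".

```python
def single_block_layout(file_size, replication=3, align=512):
    """Return (block_size, splits, split_size) for a file stored as one HDFS block.

    Hypothetical helper: `replication` and `align` are illustrative defaults,
    not values taken from the codec or from any particular cluster.
    """
    if file_size <= 0 or replication <= 0 or align <= 0:
        raise ValueError("file_size, replication and align must be positive")

    # Smallest multiple of `align` that is >= file_size, so the whole
    # gzipped file fits in exactly one block.
    block_size = ((file_size + align - 1) // align) * align

    # Assumption: one split per replica, so every mapper can read the
    # single block from a node that holds a local copy.
    splits = replication

    # Split size in bytes, rounded up so the splits cover the whole file.
    split_size = (file_size + splits - 1) // splits

    return block_size, splits, split_size


if __name__ == "__main__":
    one_gib = 1024 ** 3
    block_size, splits, split_size = single_block_layout(one_gib)
    print(f"block size: {block_size} bytes")
    print(f"splits:     {splits}")
    print(f"split size: {split_size} bytes")
```

The resulting block size would then be supplied when uploading the file, for
example through the dfs.block.size property (renamed dfs.blocksize in
Hadoop 2.x).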
> but like all performance on a complex system gut feelings are almost
> worthless and hard numbers are what is needed to make a judgment call.
> Niels, I assume you have tested this on your cluster(s).  Can you share
> with us some of the numbers?

No, I haven't tested it beyond a multi-core system.
The simple reason for that is that when this was under review last summer
the whole "Yarn" thing happened
and I was unable to run it at all for a long time.
I only got it running again last December when the restructuring of the
source tree was mostly done.

At this moment I'm building an experimentation setup at work that can be
used for various things.
Given the current state of Hadoop 2.0 I think it's time to produce some
actual results.

Best regards / Met vriendelijke groeten,

Niels Basjes