Re: Should splittable Gzip be a "core" hadoop feature?
Hi,

On Wed, Feb 29, 2012 at 19:13, Robert Evans <[EMAIL PROTECTED]> wrote:
> What I really want to know is how well does this new CompressionCodec
> perform in comparison to the regular gzip codec in various different
> conditions and what type of impact does it have on network traffic and
> datanode load. My gut feeling is that the speedup is going to be
> relatively small except when there is a lot of computation happening in
> the mapper
I agree; I made the same assessment.
In the javadoc, under "When is this useful?", I wrote:
*"Assume you have a heavy map phase for which the input is a 1GiB Apache
httpd logfile. Now assume this map takes 60 minutes of CPU time to run."*
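
As a side note for readers, here is a minimal sketch of how such a codec
could be registered for a job. The class name SplittableGzipCodec is my
assumption; this thread only speaks of "this new CompressionCodec".

    // A sketch, not from this thread: registering a splittable gzip codec.
    // The class name SplittableGzipCodec is an assumption.
    import org.apache.hadoop.conf.Configuration;

    public class RegisterSplittableGzip {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Both the splittable codec and the stock GzipCodec claim the
        // ".gz" suffix; verify which one your CompressionCodecFactory
        // version resolves for that extension.
        conf.set("io.compression.codecs",
            "nl.basjes.hadoop.io.compress.SplittableGzipCodec,"
            + "org.apache.hadoop.io.compress.GzipCodec,"
            + "org.apache.hadoop.io.compress.DefaultCodec");
      }
    }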
> and the added load and network traffic outweighs the speedup in most
> cases,
No, the trick to solve that one is to upload the gzipped files with an HDFS
blocksize equal to (or 1 byte larger than) the filesize.
This setting speeds up gzipped input files in any situation, because it
removes the network overhead entirely.
From there, the HDFS replication factor of the file dictates the optimal
number of splits for this codec.
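
A minimal sketch of that upload trick, with made-up file names, using the
FileSystem.create overload that takes an explicit block size. Note that
HDFS typically requires the block size to be a multiple of the checksum
chunk (io.bytes.per.checksum, default 512 bytes), so I round up to the
next multiple rather than adding exactly one byte:

    // A sketch under assumed file names: store a gzipped log in a single
    // HDFS block so every map task can read it from a local replica.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class UploadAsSingleBlock {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long fileSize = new java.io.File("access_log.gz").length();
        // Round the block size up to the next multiple of the checksum
        // chunk so the whole file fits in one block and HDFS accepts it.
        long chunk = 512;
        long blockSize = ((fileSize / chunk) + 1) * chunk;

        InputStream in = new FileInputStream("access_log.gz");
        // create(path, overwrite, bufferSize, replication, blockSize);
        // the replication factor (3 here) is what then dictates the
        // optimal number of splits for the codec.
        OutputStream out = fs.create(new Path("/logs/access_log.gz"),
            true, 4096, (short) 3, blockSize);
        IOUtils.copyBytes(in, out, 4096, true); // closes both streams
      }
    }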
> but like all performance on a complex system gut feelings are almost
> worthless and hard numbers are what is needed to make a judgment call.
Yes
> Niels, I assume you have tested this on your cluster(s).  Can you share
> with us some of the numbers?
>

No, I haven't tested it beyond a single multi-core system.
The simple reason is that when this was under review last summer, the
whole "YARN" thing happened, and I was unable to run it at all for a
long time. I only got it running again last December, when the
restructuring of the source tree was mostly done.

At this moment I'm building an experimentation setup at work that can be
used for various things. Given the current state of Hadoop 2.0, I think
it's time to produce some actual results.

--
Best regards / Met vriendelijke groeten,

Niels Basjes