Pig >> mail # dev >> Is there a good benchmark to evaluate the CPU time/space tradeoff in the shuffle stage of hadoop?


Re: Is there a good benchmark to evaluate the CPU time/space tradeoff in the shuffle stage of hadoop?
You might post this same question to mapred-user@hadoop. I know Owen and Arun have done a lot of analysis of this kind of tradeoff while optimizing terasort. Others there may have valuable feedback as well.

Alan.

On May 22, 2012, at 12:23 PM, Jonathan Coveney wrote:

> I've been working with the intermediate serialization in Pig, and will
> probably be working with it more in the future. When serializing, there is
> generally a tradeoff between time to serialize and space on disk (an extreme
> example being compression vs. no compression, a more nuanced one being
> varint vs. full int, that sort of thing). With Hadoop, network I/O is
> generally the bottleneck, but I'm not sure of the best way to evaluate
> something like: method X takes 3x as long to serialize, but is potentially
> 1/2 as large on disk.
>
> What are people doing in the wild?
> Jon
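
One rough way to frame the "3x as long, 1/2 as large" question is a back-of-envelope cost model: estimated time = CPU time to serialize + bytes / effective bandwidth. The sketch below is only an illustration under stated assumptions. The function names and every number in it are hypothetical, and it assumes serialization and transfer do not overlap, which a real shuffle may well violate. It is not a substitute for measuring an actual job.

```python
def shuffle_cost(cpu_seconds, size_bytes, bandwidth_bytes_per_s):
    """Estimated time to serialize a chunk of map output and then move it.

    Simplifying assumption: CPU work and I/O happen one after the other.
    """
    return cpu_seconds + size_bytes / bandwidth_bytes_per_s


def break_even_bandwidth(base_cpu, base_size, alt_cpu, alt_size):
    """Bandwidth (bytes/s) below which the slower-but-smaller method wins.

    Solves base_cpu + base_size / B == alt_cpu + alt_size / B for B.
    """
    extra_cpu = alt_cpu - base_cpu
    bytes_saved = base_size - alt_size
    if extra_cpu <= 0 or bytes_saved <= 0:
        raise ValueError("expects the alternative to be slower and smaller")
    return bytes_saved / extra_cpu


# Hypothetical numbers: the baseline spends 1 s of CPU to write 100 MB;
# method X takes 3x as long to serialize and produces half the bytes.
base_cpu, base_size = 1.0, 100e6
alt_cpu, alt_size = 3.0, 50e6

threshold = break_even_bandwidth(base_cpu, base_size, alt_cpu, alt_size)

# Compare both methods on a slow link and on a fast link (also hypothetical).
slow, fast = 10e6, 100e6
base_slow = shuffle_cost(base_cpu, base_size, slow)
alt_slow = shuffle_cost(alt_cpu, alt_size, slow)
base_fast = shuffle_cost(base_cpu, base_size, fast)
alt_fast = shuffle_cost(alt_cpu, alt_size, fast)

print(f"break-even bandwidth: {threshold / 1e6:.1f} MB/s")
print(f"slow link: baseline {base_slow:.1f} s, method X {alt_slow:.1f} s")
print(f"fast link: baseline {base_fast:.1f} s, method X {alt_fast:.1f} s")
```

With these made-up inputs, method X only pays off when the effective per-task bandwidth is below the break-even value. Plugging in measured serialization times and output sizes from a real job would give a first estimate before running a full benchmark.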