Pig, mail # dev - Is there a good benchmark to evaluate the CPU time/space tradeoff in the shuffle stage of hadoop?


Re: Is there a good benchmark to evaluate the CPU time/space tradeoff in the shuffle stage of hadoop?
Alan Gates 2012-05-22, 21:26
You might post this same question to mapred-user@hadoop.  I know Owen and Arun have done a lot of analysis of these kinds of things when optimizing the terasort.  Others may have valuable feedback there as well.

Alan.

On May 22, 2012, at 12:23 PM, Jonathan Coveney wrote:

> I've been working on the intermediate serialization in Pig, and will
> probably be working on it more in the future. When serializing, there is
> generally a tradeoff between time to serialize and space on disk (an extreme
> example being compression vs. no compression, a more nuanced one being
> varint vs. full int, that sort of thing). With Hadoop, network I/O is
> generally the bottleneck, but I'm not sure of the best way to evaluate
> something like: method X takes 3x as long to serialize, but is potentially
> 1/2 as large on disk.
>
> What are people doing in the wild?
> Jon
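
One rough way to frame the question above is to measure both sides of the tradeoff for each encoding and plug them into a simple cost model. The sketch below is illustrative only and is not taken from Pig or Hadoop: the function names are hypothetical, the varint is a generic LEB128-style encoding (not Hadoop's own VInt format), and the model assumes that total cost is just CPU time plus bytes divided by an assumed effective bandwidth, ignoring deserialization, spills, and overlap between CPU and I/O.

```python
import struct
import time


def encode_fixed(values):
    """Encode each integer as a 4-byte big-endian signed int ("full int")."""
    return b"".join(struct.pack(">i", v) for v in values)


def encode_varint(values):
    """Encode non-negative integers as LEB128-style varints (7 payload bits per byte)."""
    out = bytearray()
    for v in values:
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            v >>= 7
        out.append(v)  # final byte, continuation bit clear
    return bytes(out)


def measure(encoder, values):
    """Return (cpu_seconds, size_bytes) for one run of the encoder."""
    start = time.perf_counter()
    data = encoder(values)
    return time.perf_counter() - start, len(data)


def estimated_cost_seconds(cpu_seconds, size_bytes, bandwidth_bytes_per_sec):
    """Crude model (an assumption): serialize time plus time to move the bytes."""
    return cpu_seconds + size_bytes / bandwidth_bytes_per_sec


if __name__ == "__main__":
    values = list(range(0, 1_000_000, 7))  # mostly small ints, which favors varint
    bandwidth = 50 * 1024 * 1024           # assumed effective shuffle bandwidth, bytes/s
    for name, encoder in [("fixed", encode_fixed), ("varint", encode_varint)]:
        cpu, size = measure(encoder, values)
        cost = estimated_cost_seconds(cpu, size, bandwidth)
        print(f"{name}: cpu={cpu:.4f}s size={size}B est_cost={cost:.4f}s")
```

Under this model, a method that takes 3x as long to serialize but produces half the bytes only wins when the bandwidth is low enough that the bytes saved outweigh the extra CPU time. Real numbers would have to come from runs on representative data and a representative cluster.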