Most likely the map stage will output 10% of the total input, and the reduce
stage will output 40% of the intermediate results (i.e., 40% of that 10%).
For example, with 500GB of input, the map stage will emit 50GB, which will
shrink to 20GB after the reduce stage.
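To make the arithmetic concrete, here is a small sketch (assuming the "keep" ratio applies uniformly to the data volume at each stage) that chains the three Compute stages from the gridmix2 README quoted below:

```python
# Sketch of how gridmix2 "keep" ratios shrink (or grow) data through a
# pipelined job. Assumes the ratio applies uniformly to data volume;
# stage ratios are taken from the gridmix2 README quoted in this thread.

def run_stage(input_gb, map_keep, reduce_keep):
    """Map emits map_keep of its input; reduce emits reduce_keep of
    the intermediate (map) output."""
    intermediate_gb = input_gb * map_keep
    output_gb = intermediate_gb * reduce_keep
    return intermediate_gb, output_gb

input_gb = 500.0  # compressed SequenceFile input

# Compute1: keep 10% map, 40% reduce
inter1, out1 = run_stage(input_gb, 0.10, 0.40)  # 50.0 GB intermediate, 20.0 GB out

# Compute2: keep 100% map, 77% reduce (input from Compute1)
inter2, out2 = run_stage(out1, 1.00, 0.77)      # 20.0 GB intermediate, 15.4 GB out

# Compute3: keep 116% map, 91% reduce (input from Compute2);
# a ratio above 100% means the stage emits more data than it reads
inter3, out3 = run_stage(out2, 1.16, 0.91)

print(inter1, out1, out2, out3)
```

Note that Compute3's 116% map ratio expands the data again, so the final output is larger than Compute2's.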
It may be similar to the loadgen example in the Hadoop test suite.
Does anyone have suggestions?
System Architect Intern @ ZData
PhD student@CSE Dept.
On Thu, Jun 14, 2012 at 1:58 AM, Nan Zhu <[EMAIL PROTECTED]> wrote:
> Hi, all
> I'm using gridmix2 to test my cluster, and in its README file there are
> statements like the following:
> +1) Three stage map/reduce job
> + Input: 500GB compressed (2TB uncompressed) SequenceFile
> + (k,v) = (5 words, 100 words)
> + hadoop-env: FIXCOMPSEQ
> + *Compute1: keep 10% map, 40% reduce
> + Compute2: keep 100% map, 77% reduce
> + Input from Compute1
> + Compute3: keep 116% map, 91% reduce
> + Input from Compute2
> + *Motivation: Many user workloads are implemented as pipelined
> + jobs, including Pig workloads
> Can anyone tell me what "keep 10% map, 40% reduce" means here?
> Nan Zhu
> School of Electronic, Information and Electrical Engineering, 229
> Shanghai Jiao Tong University
> 800 Dongchuan Road, Shanghai, China
> E-Mail: [EMAIL PROTECTED]