-
reducing mappers for a job
Jay Vyas 2011-11-16, 18:05
Hi guys: In a shared cluster environment, what's the best way to reduce the number of mappers per job? Should you do it with InputSplits? Or simply toggle the values in the JobConf (i.e. increase the number of bytes in an input split)?
-- Jay Vyas MMSB/UCHC
-
Re: reducing mappers for a job
ke yuan 2011-11-17, 03:42
Just set the block size to 128M or 256M; it may reduce the number of mappers per job.
-
Re: reducing mappers for a job
He Chen 2011-11-17, 04:00
Hi Jay Vyas
Ke yuan's method may decrease the number of mappers because, by default,
the number of mappers for a job = the number of blocks in this job's input file.
Make sure you change the block size only for your specific job's input file, not in the Hadoop cluster's configuration.
If you change the block size in your Hadoop cluster configuration (in the hdfs-site.xml file), it may bring some side effects:
1) waste of disk space; 2) difficulty balancing HDFS; 3) low map-stage data locality.
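As a rough illustration of the rule of thumb above (one mapper per block of input), here is a small Python sketch that estimates the mapper count for a single file at a few block sizes. The function name and the 1 GiB file size are made up for the example, and the estimate deliberately ignores min/max split settings and other details of the real split computation.

```python
MIB = 1024 * 1024

def estimate_mappers(file_size_bytes, block_size_bytes):
    """Rough estimate: one map task per HDFS block of a single input file."""
    # Integer ceiling division: a partially filled last block still counts.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

file_size = 1024 * MIB  # hypothetical 1 GiB input file
for block_mib in (64, 128, 256):
    mappers = estimate_mappers(file_size, block_mib * MIB)
    print(block_mib, "MiB blocks ->", mappers, "mappers")
```

Under these assumptions, doubling the block size halves the estimated number of mappers for the same file.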
Bests!
Chen
-
Re: reducing mappers for a job
ke yuan 2011-11-17, 07:29
Yes, you're right, but:
1) Waste of disk space: this is not right. It will not waste the DataNode's disk space; if you don't believe it, you can check the code.
2) Difficulty balancing HDFS: this may be true.
3) Low map-stage data locality: why?
-
Re: reducing mappers for a job
Harsh J 2011-11-17, 07:42
On Thu, Nov 17, 2011 at 12:59 PM, ke yuan <[EMAIL PROTECTED]> wrote:
> yes, you're right, but
> 1) waste of disk space: this is not right, this will not waste the disk
> space of the datanode; if you don't believe it, you can see the code
Agreed that the "waste of disk space" claim is wrong; there should be zero wastage. You only store what you have; there are no "whole" block allocations.
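To make the "zero wastage" point concrete, here is a minimal Python sketch of how a file is cut into blocks. The helper names, the 300 MiB file size, and the replication factor of 3 are assumptions chosen for illustration; the only idea taken from the message is that the last block holds just the remaining bytes instead of occupying a full block.

```python
MIB = 1024 * 1024

def block_lengths(file_size_bytes, block_size_bytes):
    """Lengths of the blocks a file is cut into; the last one may be short."""
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    lengths = [block_size_bytes] * full_blocks
    if remainder:
        lengths.append(remainder)  # partial block stores only what is left
    return lengths

def stored_bytes(file_size_bytes, block_size_bytes, replication=3):
    """Total bytes on disk across replicas (replication=3 is an assumption)."""
    return sum(block_lengths(file_size_bytes, block_size_bytes)) * replication

file_size = 300 * MIB  # hypothetical input file
for block_mib in (64, 256):
    blocks = block_lengths(file_size, block_mib * MIB)
    print(block_mib, "MiB blocks:", [b // MIB for b in blocks],
          "stored MiB:", stored_bytes(file_size, block_mib * MIB) // MIB)
```

In this model the block size changes how many blocks there are, but not how many bytes are stored.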
-- Harsh J
-
Re: reducing mappers for a job
Paolo Rodeghiero 2011-11-17, 10:42
On 17/11/2011 05:00, He Chen wrote:
> Hi Jay Vyas
>
> Ke yuan's method may decrease the number of mappers because by default
> the number of mappers for a job = the number of blocks in this job's input file.
Hi, I'm not in production yet, so I'm just referencing things that I have read. First, it may be obvious, but remember that if you are using multiple input files, at least one map task is assigned to each file.
So to minimize the number of map tasks you can:
- aggregate the input data into a single file, perhaps a splittable sequence file (using the SequenceFile class)
- increase the HDFS block size and the input split size, which are controlled by different properties (dfs.block.size, mapred.max.split.size and mapred.min.split.size)
Note that locality can decrease if mapred.min.split.size > dfs.block.size, because you are forcing each split to cover more than one block. Finally, the mapred.*.split.size properties behave slightly differently depending on which API and which FileInputFormat subclass you are using.
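The interaction between these properties can be sketched in Python. This is a simplified model based on my understanding of the new-API FileInputFormat rule, splitSize = max(minSize, min(maxSize, blockSize)), with a 10% slop on the last split; the function names and file sizes are hypothetical, and the sketch ignores empty files, non-splittable formats and the old-API "goal size" behaviour.

```python
MIB = 1024 * 1024
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than the split size
NO_LIMIT = 2**63 - 1

def split_size(block_size, min_size=1, max_size=NO_LIMIT):
    """Model of the split size chosen from block size and min/max settings."""
    return max(min_size, min(max_size, block_size))

def count_splits(file_sizes, block_size, min_size=1, max_size=NO_LIMIT):
    """Count splits (roughly, map tasks) over several splittable input files."""
    size = split_size(block_size, min_size, max_size)
    total = 0
    for length in file_sizes:
        remaining = length
        while remaining / size > SPLIT_SLOP:
            total += 1
            remaining -= size
        if remaining > 0:
            total += 1  # every non-empty file yields at least one split
    return total

# Ten small files still need ten splits, whatever the block size.
print(count_splits([10 * MIB] * 10, 64 * MIB))
# One 100 MiB file on 64 MiB blocks: two splits, each within one block.
print(count_splits([100 * MIB], 64 * MIB))
# Raising the minimum split size above the block size gives one split that
# spans two blocks, which is where locality can suffer.
print(count_splits([100 * MIB], 64 * MIB, min_size=128 * MIB))
```

This is why aggregating many small files helps more than tuning split sizes alone: in this model the per-file minimum of one split cannot be configured away.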
References
----
Tom White, Hadoop: The Definitive Guide (Second Edition), pp. 116-120, 202-203