Short answer: FileInputFormat & friends generate splits based on byte
ranges, not record counts.
Assuming your records are all equally sized, you'll get half your records in
each mapper. If your records have many different sizes represented, then
your mileage may vary.
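To make that concrete, here is a rough Python sketch (not Hadoop code) of the
old-API split arithmetic described in the quoted message below, plus a count
of how many records start in each split. The function names, the 1.1 slop
factor standing in for "give or take", treating 12k as 12000 bytes, and the
rule that a record belongs to the split containing its first byte are all
simplifying assumptions of mine; the real record readers handle lines that
straddle a boundary with more care.

```python
def compute_splits(file_size, num_splits, block_size=64 * 1024 * 1024,
                   min_split_size=1, slop=1.1):
    """Return (start, length) byte ranges for one splittable file.

    Follows the arithmetic in the quoted message:
    split size = max(minSplitSize, min(goalSize, blockSize)).
    """
    goal_size = file_size // max(num_splits, 1)
    split_size = max(min_split_size, min(goal_size, block_size))
    splits = []
    remaining = file_size
    # Assumption: allow the last split to run up to `slop` times the
    # split size so a tiny tail does not become a split of its own.
    while remaining / split_size > slop:
        splits.append((file_size - remaining, split_size))
        remaining -= split_size
    if remaining:
        splits.append((file_size - remaining, remaining))
    return splits


def records_per_split(record_sizes, splits):
    """Count records per split, assigning each record to the split
    that contains its first byte (a simplification)."""
    counts = [0] * len(splits)
    offset = 0
    for size in record_sizes:
        for i, (start, length) in enumerate(splits):
            if start <= offset < start + length:
                counts[i] += 1
                break
        offset += size
    return counts


splits = compute_splits(12000, 2)

# 1000 equally sized records of 12 bytes each (newline included).
print(records_per_split([12] * 1000, splits))

# Same total size, but 500 short records followed by 500 long ones.
print(records_per_split([4] * 500 + [20] * 500, splits))
```

Under those assumptions the equal-size case prints [500, 500], while the
skewed case prints [700, 300] even though both splits are 6000 bytes long.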
On Fri, Jun 18, 2010 at 4:27 PM, Eric Sammer <[EMAIL PROTECTED]> wrote:
> In general, you should let Hadoop pick the number of mappers to use.
> In the case of only 1000 records @ 12k, performance will be better
> with a single mapper for IO bound jobs. When you force the number of
> map tasks, Hadoop will do the following:
> (Assuming FileInputFormat#getSplits(conf, numSplits) gets called)
>
>   totalSize is sum size of all input files in bytes
>   goalSize is totalSize / numSplits
>   minSplitSize is conf value mapred.min.split.size (default 1)
>
>   For each input file:
>     length = file.size()
>     while isSplitable(file) and length != 0
>       fileBlockSize is the block size of the file
>       minOfGoalBlock is min(goalSize, fileBlockSize)
>       realSplitSize is max(minSplitSize, minOfGoalBlock)
>       length is length minus realSplitSize (give or take)
>
> Note that it's actually more confusing than this, but this is the
> general idea. Let's plug in some numbers:
> 1 file
> totalSize = 12k file size
> blockSize = 64MB block
> numSplits = 2
> goalSize = 6k (12k / 2)
> minSplitSize = 1 (for FileInputFormat)
> minOfGoalBlock = 6k (6k < 64MB)
> realSplitSize = 6k (6k > 1)
> We end up with 2 splits, 6k each. RecordReaders then parse this into
> records.
> Note that this applies to the old APIs. The newer APIs work slightly
> differently, but I think the result is equivalent.
> (If anyone wants to double check my summation, I welcome it. This is
> some hairy code and these questions frequently come up.)
> Hope this helps.
> On Wed, Jun 16, 2010 at 8:10 AM, Karan Jindal
> <[EMAIL PROTECTED]> wrote:
> > Hi all,
> > Given a scenario in which an input file contains a total of 1000 records
> > (one record per line) with a total size of 12k, and I set the number of
> > map tasks to 2. How many records will be passed to each map task? Is it
> > an equal distribution?
> > InputFormat = Text
> > Block size = default block of hdfs
> > Hoping for a reply..
> > Regards
> > Karan
> Eric Sammer
> twitter: esammer
> data: www.cloudera.com