Re: number of mapper tasks
Vinod Kumar Vavilapalli 2013-01-29, 20:08
I tried looking at your code, but it's a bit involved. Instead of running
the job, try unit-testing your input format. Test getSplits(): the number
of splits that method returns is the number of mappers that will run.
You can also use LocalJobRunner for this: set mapred.job.tracker to
local and run your job locally on your machine instead of trying on a
cluster.
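
To illustrate the unit-testing advice, here is a minimal sketch in plain Java. It assumes a line-based splitting rule like the one described in the quoted message below. It does not use any Hadoop classes: LineSplitSketch, its getSplits() signature, and the {firstLine, lineCount} pair are hypothetical stand-ins for the real InputFormat and InputSplit types. The point is only that the split count can be checked in isolation, before any job is submitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the split logic of a line-based input format.
// This is NOT a Hadoop class; it only models the arithmetic.
class LineSplitSketch {

    // Returns one {firstLine, lineCount} pair per split.
    static List<long[]> getSplits(long totalLines, int linesPerMap) {
        if (linesPerMap <= 0) {
            throw new IllegalArgumentException("linesPerMap must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < totalLines; start += linesPerMap) {
            // The last split may hold fewer lines than linesPerMap.
            long count = Math.min(linesPerMap, totalLines - start);
            splits.add(new long[] {start, count});
        }
        return splits;
    }

    public static void main(String[] args) {
        // 10 lines at 3 lines per map: three full splits plus one short one.
        List<long[]> splits = getSplits(10, 3);
        System.out.println("splits = " + splits.size()); // prints "splits = 4"
    }
}
```

If a check like this returns the expected count but the job still starts fewer mappers, the job is probably not using the input format under test.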
On Tue, Jan 29, 2013 at 4:53 AM, Marcelo Elias Del Valle <[EMAIL PROTECTED]> wrote:
> I have been able to make this work. I don't know why, but when the
> input file is zipped (read as an input stream) it creates only 1 mapper.
> However, when it's not zipped, it creates more mappers (running 3 instances,
> it created 4 mappers, and running 5 instances, it created 8 mappers).
> I would really like to know why this happens and, even with this number
> of mappers, why more mappers aren't created. I was
> reading part of the book "Hadoop - The definitive guide" (
> which says:
> "The JobClient calls the getSplits() method, passing the desired number
> of map tasks as the numSplits argument. This number is treated as a hint,
> as InputFormat implementations are free to return a different number of
> splits to the number specified in numSplits. Having calculated the
> splits, the client sends them to the jobtracker, which uses their storage
> locations to schedule map tasks to process them on the tasktrackers. ..."
> I am not sure how to get more info.
> Would you recommend that I look for the answer in the book? Or
> should I read the Hadoop source code directly?
> Best regards,
> 2013/1/29 Marcelo Elias Del Valle <[EMAIL PROTECTED]>
>> I implemented my custom input format. Here is how I used it:
>> As you can see, I do:
>> And here is the Input format and the linereader:
>> In this input format, I completely ignore these other parameters and compute
>> the splits by the number of lines. The number of lines per map can be
>> controlled by the same parameter used in NLineInputFormat:
>> public static final String LINES_PER_MAP =
>> "mapreduce.input.lineinputformat.linespermap";
>> However, it really has no effect on the number of maps.
>> 2013/1/29 Vinod Kumar Vavilapalli <[EMAIL PROTECTED]>
>>> Regarding your original question, you can use the min and max split
>>> settings to control the number of maps:
>>> http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html. See #setMinInputSplitSize and #setMaxInputSplitSize. Or
>>> use mapred.min.split.size directly.
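
To make the effect of the min and max split settings concrete, here is a rough sketch in plain Java. It assumes the split size is chosen as max(minSize, min(maxSize, blockSize)), which is my understanding of FileInputFormat in the new API; check it against the source for your Hadoop version. SplitSizeSketch and countSplits are hypothetical names, and the count is simplified: it covers a single splittable file and ignores block locations and any tolerance applied to the final split.

```java
// Hypothetical model of how min/max split sizes change the number of maps.
// Not a Hadoop class; verify the rule against your Hadoop version.
class SplitSizeSketch {

    // Assumed rule: clamp the block size between minSize and maxSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Simplified split count for one splittable file (ceiling division).
    static long countSplits(long fileLength, long splitSize) {
        return fileLength == 0 ? 0 : (fileLength - 1) / splitSize + 1;
    }

    public static void main(String[] args) {
        long mib = 1L << 20;
        long fileLength = 1024 * mib; // 1 GiB input file
        long blockSize = 64 * mib;    // 64 MiB blocks

        // Defaults: split size equals block size.
        long byBlock = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println(countSplits(fileLength, byBlock));  // prints 16

        // Lower max split size: more, smaller splits.
        long smaller = computeSplitSize(blockSize, 1L, 16 * mib);
        System.out.println(countSplits(fileLength, smaller));  // prints 64

        // Raise min split size: fewer, larger splits.
        long larger = computeSplitSize(blockSize, 256 * mib, Long.MAX_VALUE);
        System.out.println(countSplits(fileLength, larger));   // prints 4
    }
}
```

Under this model, lowering the max split size is what produces more map tasks. These settings only matter if the input format in use actually honors them.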
>>> W.r.t. your custom InputFormat, are you sure your job is using this
>>> InputFormat and not the default one?
>>> +Vinod Kumar Vavilapalli
>>> Hortonworks Inc.
>>> On Jan 28, 2013, at 12:56 PM, Marcelo Elias Del Valle wrote:
>>> Just to complement the last question, I have implemented the getSplits
>>> method in my input format:
>>> However, it still doesn't create more than 2 map tasks. Is there
>>> something I could do to ensure more map tasks are created?
>>> 2013/1/28 Marcelo Elias Del Valle <[EMAIL PROTECTED]>
>>>> Sorry for asking too many questions, but the answers are really