MapReduce, mail # user - number of mapper tasks


Re: number of mapper tasks
Marcelo Elias Del Valle 2013-01-29, 10:52
I implemented my custom input format. Here is how I used it:
https://github.com/mvallebr/CSVInputFormat/blob/master/src/test/java/org/apache/hadoop/mapreduce/lib/input/test/CSVTestRunner.java

As you can see, I do:
importerJob.setInputFormatClass(CSVNLineInputFormat.class);

And here is the Input format and the linereader:
https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVLineRecordReader.java

In this input format, I completely ignore these other parameters and get
the splits by the number of lines. The amount of lines per map can be
controlled by the same parameter used in NLineInputFormat:

public static final String LINES_PER_MAP = "mapreduce.input.lineinputformat.linespermap";
However, it really has no effect on the number of maps.
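For reference, the core of an NLineInputFormat-style getSplits() is just this arithmetic: one split (and hence one map task) per linesPerMap lines, plus a final partial split for any remainder. A minimal sketch in plain Java, with the Hadoop classes omitted; countSplits is a hypothetical helper for illustration, not part of the Hadoop API:

```java
// Sketch: how many map tasks an NLineInputFormat-style getSplits()
// would produce, given mapreduce.input.lineinputformat.linespermap.
public class SplitCount {
    // Ceiling division: one split per linesPerMap lines,
    // with a final partial split for any leftover lines.
    static int countSplits(long totalLines, int linesPerMap) {
        return (int) ((totalLines + linesPerMap - 1) / linesPerMap);
    }

    public static void main(String[] args) {
        System.out.println(countSplits(1000, 100)); // 10 splits -> 10 map tasks
        System.out.println(countSplits(1001, 100)); // 11 splits (one partial)
    }
}
```

If the job still launches only two maps, the usual suspects are the parameter not reaching the job configuration, or a different InputFormat being used than the one intended.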

2013/1/29 Vinod Kumar Vavilapalli <[EMAIL PROTECTED]>

>
> Regarding your original question, you can use the min and max split
> settings to control the number of maps:
> http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html. See #setMinInputSplitSize and #setMaxInputSplitSize. Or
> use mapred.min.split.size directly.
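[Editor's note: the formula FileInputFormat uses to pick a split size, and therefore the map count, is splitSize = max(minSize, min(maxSize, blockSize)). A self-contained sketch of that arithmetic (plain Java, Hadoop classes omitted; computeSplitSize here mirrors the FileInputFormat method of the same name):]

```java
// Sketch: FileInputFormat chooses a split size by clamping the HDFS
// block size between the configured min and max split sizes.
public class SplitSize {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // e.g. a 128 MB HDFS block
        // Lowering the max split size below the block size yields
        // more, smaller splits -> more map tasks.
        System.out.println(computeSplitSize(blockSize, 1L, 32L * 1024 * 1024));
    }
}
```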
>
> W.r.t. your custom InputFormat, are you sure your job is using this
> InputFormat and not the default one?
>
> HTH,
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
>
> On Jan 28, 2013, at 12:56 PM, Marcelo Elias Del Valle wrote:
>
> Just to complement the last question, I have implemented the getSplits
> method in my input format:
>
> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
>
> However, it still doesn't create more than 2 map tasks. Is there something
> I could do about it to assure more map tasks are created?
>
> Thanks
> Marcelo.
>
>
> 2013/1/28 Marcelo Elias Del Valle <[EMAIL PROTECTED]>
>
>> Sorry for asking so many questions, but the answers are really helping.
>>
>>
>> 2013/1/28 Harsh J <[EMAIL PROTECTED]>
>>
>>> This seems CPU-oriented. You probably want the NLineInputFormat? See
>>>
>>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>> .
>>> This should let you spawn more maps as well, based on your N factor.
>>>
>>
>> Indeed, CPU is my bottleneck. That's why I want more things in parallel.
>> Actually, I wrote my own InputFormat, to be able to process multiline
>> CSVs: https://github.com/mvallebr/CSVInputFormat
>> I could change it to read several lines at a time, but would this alone
>> allow more tasks running in parallel?
>>
>>
>>> Not really - "Slots" are capacities, rather than split factors
>>> themselves. You can have N slots always available, but your job has to
>>> supply as many map tasks (based on its input/needs/etc.) to use them
>>> up.
>>>
>>
>> But how can I do that (supply map tasks) in my job? By changing its code?
>> Through Hadoop config?
>>
>>
>>> Unless your job sets the number of reducers to 0 manually, 1 default
>>> reducer is always run that waits to see if it has any outputs from
>>> maps. If it does not receive any outputs after maps have all
>>> completed, it dies out with behavior equivalent to a NOP.
>>>
>> OK, I did job.setNumReduceTasks(0); I guess this will solve this part,
>> thanks!
>>
>>
>> --
>> Marcelo Elias Del Valle
>> http://mvalle.com - @mvallebr
>>
>
>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>
>
>
--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr