Marcelo Elias Del Valle 2013-01-28, 15:54
Hello,

I am using Hadoop with TextInputFormat, a mapper and no reducers. I am running my jobs on Amazon EMR. When I run my job, I set both of the following options:

    -s,mapred.tasktracker.map.tasks.maximum=10
    -jobconf,mapred.map.tasks=10

When I run my job with just 1 instance, I see it only creates 1 mapper. When I run my job with 5 instances (1 master and 4 cores), I can see only 2 mapper slots are used and 6 stay open.

I am trying to figure out why I am not able to run more mappers in parallel. When I look at the logs, I find messages like these:

    INFO org.apache.hadoop.mapred.ReduceTask (main): attempt_201301281437_0001_r_000003_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
    org.apache.hadoop.mapred.ReduceTask (main): attempt_201301281437_0001_r_000003_0 Need another 1 map output(s) where 0 is already in progress

Any hints? They would be highly appreciated.

Best regards,
--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
-
Re: number of mapper tasks
Harsh J 2013-01-28, 16:02
I'm unfamiliar with EMR myself (perhaps the question fits EMR's own boards) but here's my take anyway:
On Mon, Jan 28, 2013 at 9:24 PM, Marcelo Elias Del Valle <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I am using hadoop with TextInputFormat, a mapper and no reducers. I am
> running my jobs at Amazon EMR. When I run my job, I set both following
> options:
> -s,mapred.tasktracker.map.tasks.maximum=10
> -jobconf,mapred.map.tasks=10
The first property you've given refers to a single TaskTracker's maximum concurrency. This means that if you have 4 TaskTrackers, each with this property set to 10, you have 40 concurrent map slots available in total - perhaps more than you intended to configure?
Again, this may be EMR-specific and I may be wrong, since I haven't seen anyone pass this via the CLI before; it is generally configured at the service level.
The second property has more to do with your problem. MR typically decides the number of map tasks a job requires based on the input size. In the stable API (the org.apache.hadoop.mapred one), mapred.map.tasks can be passed the way you seem to be passing it above, and the input format takes it as a 'hint' for how many map splits to carve out of the input, even if the input isn't large enough to need that many maps.
However, the new API code accepts no such config-based hints (any such logic has to be implemented in the program's own code).
So depending on how your job is implemented, you may or may not see the setting take effect. Hope this helps.
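To make the 'hint' concrete, here is a minimal, self-contained sketch (plain Java, not Hadoop code) of the split-size arithmetic that the old-API FileInputFormat applies, as far as I understand it: the hint becomes a goal size, which is then capped by the block size and floored by the minimum split size. The class name, method names and the 100 MiB / 64 MiB figures are made up for illustration, and the sketch ignores the small slop factor Hadoop allows on the last split.

```java
// Illustrative sketch only: mirrors the old-API (org.apache.hadoop.mapred)
// split arithmetic; names and sizes here are hypothetical.
final class OldApiSplitSketch {

    /** splitSize = max(minSize, min(goalSize, blockSize)), goalSize = totalSize / hint. */
    static long splitSize(long totalSize, int numSplitsHint, long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(numSplitsHint, 1); // avoid division by zero
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    /** Rough number of splits for one file (ceiling division, no slop handling). */
    static long estimatedSplits(long totalSize, long splitSize) {
        return (totalSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long totalSize = 100L * 1024 * 1024; // assumed 100 MiB input file
        long blockSize = 64L * 1024 * 1024;  // assumed 64 MiB block size
        for (int hint : new int[] {1, 10}) {
            long size = splitSize(totalSize, hint, 1L, blockSize);
            System.out.println("hint=" + hint + " splitSize=" + size
                    + " splits=" + estimatedSplits(totalSize, size));
        }
    }
}
```

With a hint of 1 the block size dominates and the file yields only a couple of splits; with a hint of 10 the goal size drops below the block size and the number of splits follows the hint.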
> When I run my job with just 1 instance, I see it only creates 1 mapper.
> When I run my job with 5 instances (1 master and 4 cores), I can see only 2
> mapper slots are used and 6 stay open.
Perhaps the job itself launched with 2 total map tasks? You can check this on the JobTracker UI or whatever EMR offers as a job viewer.
> I am trying to figure why I am not being able to run more mappers in
> parallel. When I see the logs, I find some messages like these:
>
> INFO org.apache.hadoop.mapred.ReduceTask (main):
> attempt_201301281437_0001_r_000003_0 Scheduled 0 outputs (0 slow hosts and0
> dup hosts)
> org.apache.hadoop.mapred.ReduceTask (main):
> attempt_201301281437_0001_r_000003_0 Need another 1 map output(s) where 0 is
> already in progress
This is a typical log from a waiting reduce task. What are you asking here specifically?
-- Harsh J
-
Re: number of mapper tasks
Marcelo Elias Del Valle 2013-01-28, 16:31
Hello Harsh,

First of all, thanks for the answer!

2013/1/28 Harsh J <[EMAIL PROTECTED]>
> So depending on your implementation of the job here, you may or may
> not see it act in effect. Hope this helps.

Is there anything I can do in my job, my code or my InputFormat so that Hadoop would choose to run more mappers? My text file has 10 million lines and each mapper task processes 1 line at a time, very quickly. I would like to have 40 threads, or even more, processing those lines in parallel.

> > When I run my job with just 1 instance, I see it only creates 1 mapper.
> > When I run my job with 5 instances (1 master and 4 cores), I can see only 2
> > mapper slots are used and 6 stay open.
>
> Perhaps the job itself launched with 2 total map tasks? You can check
> this on the JobTracker UI or whatever EMR offers as a job viewer.

I am trying to figure this out. Here is what I have from EMR:
http://mvalle.com/downloads/hadoop_monitor.png
I will try to get their support to help me understand this, but I didn't understand what you said about the job being launched with 2 total map tasks... If I have 8 slots, shouldn't all of them always be filled?

> This is a typical waiting reduce task log, what are you asking here
> specifically?

I have no reduce tasks. My map does the job without putting anything in the output. Is it happening because reduce tasks receive nothing as input?

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
-
Re: number of mapper tasks
Harsh J 2013-01-28, 16:41
Hi again,

(Inline)

On Mon, Jan 28, 2013 at 10:01 PM, Marcelo Elias Del Valle <[EMAIL PROTECTED]> wrote:
> Is there anything I can do in my job, my code or in my inputFormat so that
> hadoop would choose to run more mappers? My text file and 10 million lines
> and each mapper task process 1 line at a time, very fastly. I would like to
> have 40 threads in parallel or even more processing those lines.

This seems CPU-oriented. You probably want the NLineInputFormat? See
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
This should let you spawn more maps as well, based on your N factor.

> I am trying to figure this out. Here is what I have from EMR:
> http://mvalle.com/downloads/hadoop_monitor.png
> I will try to get their support to understand this, but I didn't understand
> what you said about the job being launched with 2 total map tasks... if I
> have 8 slots, shouldn't all of them be filled always?

Not really - "slots" are capacities, not split factors. You can have N slots always available, but your job has to supply enough map tasks (based on its input/needs/etc.) to use them up.

> I have no reduce tasks. My map does the job without putting anything in the
> output. Is it happening because reduce tasks receive nothing as input?

Unless your job manually sets the number of reducers to 0, 1 default reducer is always run, and it waits to see whether it gets any outputs from the maps. If it has received no outputs once all the maps have completed, it exits with behavior equivalent to a NOP.

Hope this helps!

--
Harsh J
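The slots-versus-tasks point can be written down as a tiny sketch: the number of maps running at once is bounded both by the total slot capacity and by the number of map tasks the job actually supplies. This is plain Java for illustration only, not Hadoop code; the class and method names are hypothetical, and the result is an upper bound, since the scheduler may run fewer at any given moment.

```java
// Illustrative sketch only: an upper bound on concurrently running map tasks.
final class SlotUsageSketch {

    /** Capacity across the cluster: one TaskTracker's slot maximum times the number of TaskTrackers. */
    static int totalSlots(int taskTrackers, int slotsPerTracker) {
        return taskTrackers * slotsPerTracker;
    }

    /** A job cannot occupy more slots than it has map tasks, or more than the cluster offers. */
    static int maxConcurrentMaps(int taskTrackers, int slotsPerTracker, int mapTasks) {
        return Math.min(totalSlots(taskTrackers, slotsPerTracker), mapTasks);
    }

    public static void main(String[] args) {
        // 4 core nodes with 2 slots each and a job that produced only 2 splits:
        System.out.println(maxConcurrentMaps(4, 2, 2));
        // Same job on 4 nodes with 10 slots each: extra capacity does not help.
        System.out.println(maxConcurrentMaps(4, 10, 2));
    }
}
```

In other words, raising mapred.tasktracker.map.tasks.maximum only raises the ceiling; filling the slots requires the input format to produce more splits.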
-
Re: number of mapper tasks
Marcelo Elias Del Valle 2013-01-28, 16:55
Sorry for asking so many questions, but the answers are really helping.

2013/1/28 Harsh J <[EMAIL PROTECTED]>
> This seems CPU-oriented. You probably want the NLineInputFormat? See
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
> This should let you spawn more maps as we, based on your N factor.

Indeed, CPU is my bottleneck. That's why I want more things running in parallel. Actually, I wrote my own InputFormat to be able to process multiline CSVs:
https://github.com/mvallebr/CSVInputFormat
I could change it to read several lines at a time, but would this alone allow more tasks to run in parallel?

> Not really - "Slots" are capacities, rather than split factors
> themselves. You can have N slots always available, but your job has to
> supply as many map tasks (based on its input/needs/etc.) to use them
> up.

But how can I do that (supply map tasks) in my job? By changing its code? Through the Hadoop config?

> Unless your job sets the number of reducers to 0 manually, 1 default
> reducer is always run that waits to see if it has any outputs from
> maps. If it does not receive any outputs after maps have all
> completed, it dies out with behavior equivalent to a NOP.

OK, I called job.setNumReduceTasks(0); I guess this will solve this part, thanks!

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
-
Re: number of mapper tasks
Marcelo Elias Del Valle 2013-01-28, 20:56
Just to complement the last question, I have implemented the getSplits method in my input format:
https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java

However, it still doesn't create more than 2 map tasks. Is there something I could do to ensure more map tasks are created?

Thanks,
Marcelo.

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
-
Re: number of mapper tasks
Vinod Kumar Vavilapalli 2013-01-29, 02:08
Regarding your original question, you can use the min and max split settings to control the number of maps:
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html
See #setMinInputSplitSize and #setMaxInputSplitSize. Or use mapred.min.split.size directly.

W.r.t. your custom InputFormat, are you sure your job is using this InputFormat and not the default one?

HTH,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On Jan 28, 2013, at 12:56 PM, Marcelo Elias Del Valle wrote:
> Just to complement the last question, I have implemented the getSplits method in my input format:
> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
>
> However, it still doesn't create more than 2 map tasks. Is there something I could do about it to assure more map tasks are created?
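As a rough illustration of how the min and max split settings interact, here is a self-contained sketch (plain Java, not Hadoop code) of the split-size rule the new-API FileInputFormat uses, as far as I know: the split size is the block size clamped between the configured minimum and maximum. The class name and the 64 MiB / 8 MiB figures are assumptions chosen for the example; in a real job the limits would be set through FileInputFormat.setMinInputSplitSize and FileInputFormat.setMaxInputSplitSize.

```java
// Illustrative sketch only: mirrors the new-API (org.apache.hadoop.mapreduce)
// split-size rule; class name and sizes are hypothetical.
final class NewApiSplitSketch {

    /** splitSize = max(minSize, min(maxSize, blockSize)). */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // assumed 64 MiB block size

        // No limits configured: one split per block.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));

        // Max split size lowered to 8 MiB: smaller splits, hence more map tasks.
        System.out.println(computeSplitSize(blockSize, 1L, 8L * 1024 * 1024));
    }
}
```

Lowering the maximum below the block size is what produces more maps for the same input; raising the minimum above the block size does the opposite.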
-
Re: number of mapper tasks
Marcelo Elias Del Valle 2013-01-29, 10:52
I implemented my custom input format. Here is how I used it:
https://github.com/mvallebr/CSVInputFormat/blob/master/src/test/java/org/apache/hadoop/mapreduce/lib/input/test/CSVTestRunner.java

As you can see, I do:
importerJob.setInputFormatClass(CSVNLineInputFormat.class);

And here are the input format and the line reader:
https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVLineRecordReader.java

In this input format, I completely ignore these other parameters and compute the splits by the number of lines. The number of lines per map can be controlled by the same parameter used in NLineInputFormat:

public static final String LINES_PER_MAP = "mapreduce.input.lineinputformat.linespermap";

However, it really has no effect on the number of maps.

2013/1/29 Vinod Kumar Vavilapalli <[EMAIL PROTECTED]>
> Regarding your original question, you can use the min and max split
> settings to control the number of maps:
> http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html
> See #setMinInputSplitSize and #setMaxInputSplitSize. Or
> use mapred.min.split.size directly.
>
> W.r.t your custom inputformat, are you sure you job is using this
> InputFormat and not the default one?

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
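For reference, the expected effect of a lines-per-map setting can be sketched as simple arithmetic: an input format that splits purely by line count should produce roughly ceil(totalLines / linesPerMap) splits. The snippet below is plain Java for illustration, not the code of CSVNLineInputFormat or NLineInputFormat; the class and method names are hypothetical.

```java
// Illustrative sketch only: expected split count when splitting by line count.
final class LinesPerMapSketch {

    /** Ceiling of totalLines / linesPerMap. */
    static long expectedSplits(long totalLines, long linesPerMap) {
        if (linesPerMap <= 0) {
            throw new IllegalArgumentException("linesPerMap must be positive");
        }
        return (totalLines + linesPerMap - 1) / linesPerMap;
    }

    public static void main(String[] args) {
        long totalLines = 10_000_000L; // the 10 million lines mentioned in the thread
        // 250,000 lines per map would be needed to end up with 40 map tasks.
        System.out.println(expectedSplits(totalLines, 250_000L));
    }
}
```

If the observed number of maps stays the same regardless of this setting, that suggests the line-based getSplits is either not being invoked or is not seeing the configured value, which is one more thing worth checking in the job setup.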
-
Re: number of mapper tasks
Marcelo Elias Del Valle 2013-01-29, 12:53
Hello,

I have been able to make this work. I don't know why, but when the input file is zipped (read as an input stream) it creates only 1 mapper. However, when it's not zipped, it creates more mappers (running 3 instances it created 4 mappers, and running 5 instances it created 8 mappers).

I would really like to know why this happens and, even with this number of mappers, why more mappers aren't created. I was reading part of the book "Hadoop: The Definitive Guide" (https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats), which says:

"The JobClient calls the getSplits() method, passing the desired number of map tasks as the numSplits argument. This number is treated as a hint, as InputFormat implementations are free to return a different number of splits to the number specified in numSplits. Having calculated the splits, the client sends them to the jobtracker, which uses their storage locations to schedule map tasks to process them on the tasktrackers. ..."

I am not sure how to get more information. Would you recommend that I look for the answer in the book? Or should I read the Hadoop source code directly?

Best regards,
Marcelo.

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr