|
|
-
Idle nodes with terasort and MRv2/YARN (0.23.1)
Trevor Robinson 2012-05-29, 21:33
Hello,
I'm trying to tune terasort on a small cluster (4 identical slave nodes w/ 4 disks and 16GB RAM each), but I'm having problems with very uneven load.
For teragen, I specify 24 mappers, but for some reason, only 2 nodes out of 4 run them all, even though the web UI (for both YARN and HDFS) shows all 4 nodes available. Similarly, I specify 16 reducers for terasort, but the reducers seem to run on 3 nodes out of 4. Do I have something configured wrong, or does the scheduler not attempt to spread out the load? In addition to performing sub-optimally, this also causes me to run out of disk space for large jobs, since the data is not being spread out evenly.
Currently, I'm using these settings (not shown as XML for brevity):
yarn-site.xml: yarn.nodemanager.resource.memory-mb=13824
mapred-site.xml: mapreduce.map.memory.mb=768 mapreduce.map.java.opts=-Xmx512M mapreduce.reduce.memory.mb=2304 mapreduce.reduce.java.opts=-Xmx2048M mapreduce.task.io.sort.mb=512
In case it's significant, I've scripted the cluster setup and terasort jobs, so everything runs back-to-back instantly, except that I poll to ensure that HDFS is up and has active data nodes before running teragen. I've also tried adding delays, but they didn't seem to have any effect, so I don't *think* it's a start-up race issue.
Thanks for any advice, Trevor
-
RE: Idle nodes with terasort and MRv2/YARN (0.23.1)
Jeffrey Buell 2012-05-29, 22:10
I ran into the same issue. In the end I gave up and went back to 0.20 where I can specify the number of mappers and reducers per node (6 and 4 in your case). You can try increasing the memory.mb parameters which should force fewer map/reduce tasks per node, but then you won't be able to run your desired number of both kinds of tasks at the same time. If you find a solution please let the list know!
Jeff
> -----Original Message----- > From: Trevor Robinson [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, May 29, 2012 2:34 PM > To: [EMAIL PROTECTED] > Subject: Idle nodes with terasort and MRv2/YARN (0.23.1) > > Hello, > > I'm trying to tune terasort on a small cluster (4 identical slave > nodes w/ 4 disks and 16GB RAM each), but I'm having problems with very > uneven load. > > For teragen, I specify 24 mappers, but for some reason, only 2 nodes > out of 4 run them all, even though the web UI (for both YARN and HDFS) > shows all 4 nodes available. Similarly, I specify 16 reducers for > terasort, but the reducers seem to run on 3 nodes out of 4. Do I have > something configured wrong, or does the scheduler not attempt to > spread out the load? In addition to performing sub-optimally, this > also causes me to run out of disk space for large jobs, since the data > is not being spread out evenly. > > Currently, I'm using these settings (not shown as XML for brevity): > > yarn-site.xml: > yarn.nodemanager.resource.memory-mb=13824 > > mapred-site.xml: > mapreduce.map.memory.mb=768 > mapreduce.map.java.opts=-Xmx512M > mapreduce.reduce.memory.mb=2304 > mapreduce.reduce.java.opts=-Xmx2048M > mapreduce.task.io.sort.mb=512 > > In case it's significant, I've scripted the cluster setup and terasort > jobs, so everything runs back-to-back instantly, except that I poll to > ensure that HDFS is up and has active data nodes before running > teragen. I've also tried adding delays, but they didn't seem to have > any effect, so I don't *think* it's a start-up race issue. > > Thanks for any advice, > Trevor
-
Re: Idle nodes with terasort and MRv2/YARN (0.23.1)
Arun C Murthy 2012-05-29, 23:21
What is the minimum container size? i.e. yarn.scheduler.minimum-allocation-mb. I'd bump it up to at least 1G and use the CapacityScheduler for performance tests: http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.htmlIn case of teragen, the job has no locality at all (since it's just generating data from 'random' input-splits) and hence you are getting them stuck on fewer nodes since you have so many containers on each node. The reduces should be better spread if you are using CapacityScheduler and have https://issues.apache.org/jira/browse/MAPREDUCE-3641 in your build i.e. hadoop-0.23.1 or hadoop-2.0.0-alpha (I'd use the latter). Also, FYI, currently the CS makes the tradeoff that node-locality is almost same as rack-locality and hence you might see maps not spread out for terasort. I'll fix that one soon. hth, Arun On May 29, 2012, at 2:33 PM, Trevor Robinson wrote: > Hello, > > I'm trying to tune terasort on a small cluster (4 identical slave > nodes w/ 4 disks and 16GB RAM each), but I'm having problems with very > uneven load. > > For teragen, I specify 24 mappers, but for some reason, only 2 nodes > out of 4 run them all, even though the web UI (for both YARN and HDFS) > shows all 4 nodes available. Similarly, I specify 16 reducers for > terasort, but the reducers seem to run on 3 nodes out of 4. Do I have > something configured wrong, or does the scheduler not attempt to > spread out the load? In addition to performing sub-optimally, this > also causes me to run out of disk space for large jobs, since the data > is not being spread out evenly. > > Currently, I'm using these settings (not shown as XML for brevity): > > yarn-site.xml: > yarn.nodemanager.resource.memory-mb=13824 > > mapred-site.xml: > mapreduce.map.memory.mb=768 > mapreduce.map.java.opts=-Xmx512M > mapreduce.reduce.memory.mb=2304 > mapreduce.reduce.java.opts=-Xmx2048M > mapreduce.task.io.sort.mb=512 > > In case it's significant, I've scripted the cluster setup and terasort > jobs, so everything runs back-to-back instantly, except that I poll to > ensure that HDFS is up and has active data nodes before running > teragen. I've also tried adding delays, but they didn't seem to have > any effect, so I don't *think* it's a start-up race issue. > > Thanks for any advice, > Trevor -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
-
Re: Idle nodes with terasort and MRv2/YARN (0.23.1)
Trevor Robinson 2012-05-30, 20:38
Jeff: Thanks for the corroboration and advice. I can't retreat to 0.20, and must forge ahead with 2.0, so I'll share any progress. Arun: I haven't set the minimum container size. Do you know the default? Is there a way to easily find defaults (more complete/reliable than docs)? Thanks, I'll give that a try. How does minimum container size relate to settings mapreduce.map.memory.mb? Would it essentially raise my 768M map memory allocation to 1G? Does CapacityScheduler need any additional configuration to function optimally? BTW, what is the default scheduler? I should have MAPREDUCE-3641. I'm using 0.23.1 with CDH4b2 patches (and a few Java 7/Ubuntu 12.04 build patches). How does 2.0.0-alpha compare to 0.23.1? If there's anything I can do to assist with the issue of spreading out map tasks, please let me know. Is there a JIRA issue for it (or if not, should there be)? Incidentally, my current benchmarking work on x86 is only a training ground and baseline before moving onto ARM-based systems, which have 4GB RAM and generally fewer, smaller (2.5" form factor) disks per node. It sounds like the smaller RAM will force better distribution, but the disk capacity/utilization situation will be more severe. Thanks, Trevor On Tue, May 29, 2012 at 6:21 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote: > What is the minimum container size? i.e. > yarn.scheduler.minimum-allocation-mb. > > I'd bump it up to at least 1G and use the CapacityScheduler for performance > tests: > http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html> > In case of teragen, the job has no locality at all (since it's just > generating data from 'random' input-splits) and hence you are getting them > stuck on fewer nodes since you have so many containers on each node. > > The reduces should be better spread if you are using CapacityScheduler and > haveĀ https://issues.apache.org/jira/browse/MAPREDUCE-3641 in your build i.e. > hadoop-0.23.1 or hadoop-2.0.0-alpha (I'd use the latter). > > Also, FYI, currently the CS makes the tradeoff that node-locality is almost > same as rack-locality and hence you might see maps not spread out for > terasort. I'll fix that one soon. > > hth, > Arun > > On May 29, 2012, at 2:33 PM, Trevor Robinson wrote: > > Hello, > > I'm trying to tune terasort on a small cluster (4 identical slave > nodes w/ 4 disks and 16GB RAM each), but I'm having problems with very > uneven load. > > For teragen, I specify 24 mappers, but for some reason, only 2 nodes > out of 4 run them all, even though the web UI (for both YARN and HDFS) > shows all 4 nodes available. Similarly, I specify 16 reducers for > terasort, but the reducers seem to run on 3 nodes out of 4. Do I have > something configured wrong, or does the scheduler not attempt to > spread out the load? In addition to performing sub-optimally, this > also causes me to run out of disk space for large jobs, since the data > is not being spread out evenly. > > Currently, I'm using these settings (not shown as XML for brevity): > > yarn-site.xml: > yarn.nodemanager.resource.memory-mb=13824 > > mapred-site.xml: > mapreduce.map.memory.mb=768 > mapreduce.map.java.opts=-Xmx512M > mapreduce.reduce.memory.mb=2304 > mapreduce.reduce.java.opts=-Xmx2048M > mapreduce.task.io.sort.mb=512 > > In case it's significant, I've scripted the cluster setup and terasort > jobs, so everything runs back-to-back instantly, except that I poll to > ensure that HDFS is up and has active data nodes before running > teragen. I've also tried adding delays, but they didn't seem to have > any effect, so I don't *think* it's a start-up race issue. > > Thanks for any advice, > Trevor > > > -- > Arun C. Murthy > Hortonworks Inc. > http://hortonworks.com/> >
-
Re: Idle nodes with terasort and MRv2/YARN (0.23.1)
Robert Evans 2012-05-31, 14:22
Trevor, The best way to know what the configs were for your job is to look it up on the web UI. Once you have selected the job you want there should be a Configuration link on the left hand side. It will give you access to all of the configs used to run your job. This means the config that the AM saw it will not have any of the config values that are set by the task itself to indicate which partition etc that task is handling. --Bobby Evans On 5/30/12 3:38 PM, "Trevor Robinson" <[EMAIL PROTECTED]> wrote: Jeff: Thanks for the corroboration and advice. I can't retreat to 0.20, and must forge ahead with 2.0, so I'll share any progress. Arun: I haven't set the minimum container size. Do you know the default? Is there a way to easily find defaults (more complete/reliable than docs)? Thanks, I'll give that a try. How does minimum container size relate to settings mapreduce.map.memory.mb? Would it essentially raise my 768M map memory allocation to 1G? Does CapacityScheduler need any additional configuration to function optimally? BTW, what is the default scheduler? I should have MAPREDUCE-3641. I'm using 0.23.1 with CDH4b2 patches (and a few Java 7/Ubuntu 12.04 build patches). How does 2.0.0-alpha compare to 0.23.1? If there's anything I can do to assist with the issue of spreading out map tasks, please let me know. Is there a JIRA issue for it (or if not, should there be)? Incidentally, my current benchmarking work on x86 is only a training ground and baseline before moving onto ARM-based systems, which have 4GB RAM and generally fewer, smaller (2.5" form factor) disks per node. It sounds like the smaller RAM will force better distribution, but the disk capacity/utilization situation will be more severe. Thanks, Trevor On Tue, May 29, 2012 at 6:21 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote: > What is the minimum container size? i.e. > yarn.scheduler.minimum-allocation-mb. > > I'd bump it up to at least 1G and use the CapacityScheduler for performance > tests: > http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html> > In case of teragen, the job has no locality at all (since it's just > generating data from 'random' input-splits) and hence you are getting them > stuck on fewer nodes since you have so many containers on each node. > > The reduces should be better spread if you are using CapacityScheduler and > have https://issues.apache.org/jira/browse/MAPREDUCE-3641 in your build i.e. > hadoop-0.23.1 or hadoop-2.0.0-alpha (I'd use the latter). > > Also, FYI, currently the CS makes the tradeoff that node-locality is almost > same as rack-locality and hence you might see maps not spread out for > terasort. I'll fix that one soon. > > hth, > Arun > > On May 29, 2012, at 2:33 PM, Trevor Robinson wrote: > > Hello, > > I'm trying to tune terasort on a small cluster (4 identical slave > nodes w/ 4 disks and 16GB RAM each), but I'm having problems with very > uneven load. > > For teragen, I specify 24 mappers, but for some reason, only 2 nodes > out of 4 run them all, even though the web UI (for both YARN and HDFS) > shows all 4 nodes available. Similarly, I specify 16 reducers for > terasort, but the reducers seem to run on 3 nodes out of 4. Do I have > something configured wrong, or does the scheduler not attempt to > spread out the load? In addition to performing sub-optimally, this > also causes me to run out of disk space for large jobs, since the data > is not being spread out evenly. > > Currently, I'm using these settings (not shown as XML for brevity): > > yarn-site.xml: > yarn.nodemanager.resource.memory-mb=13824 > > mapred-site.xml: > mapreduce.map.memory.mb=768 > mapreduce.map.java.opts=-Xmx512M > mapreduce.reduce.memory.mb=2304 > mapreduce.reduce.java.opts=-Xmx2048M > mapreduce.task.io.sort.mb=512 > > In case it's significant, I've scripted the cluster setup and terasort > jobs, so everything runs back-to-back instantly, except that I poll to > ensure that HDFS is up and has active data nodes before running
-
Re: Idle nodes with terasort and MRv2/YARN (0.23.1)
Arun C Murthy 2012-06-05, 12:35
Trevor, On May 30, 2012, at 1:38 PM, Trevor Robinson wrote: > Jeff: > > Thanks for the corroboration and advice. I can't retreat to 0.20, and > must forge ahead with 2.0, so I'll share any progress. > > Arun: > > I haven't set the minimum container size. Do you know the default? Is > there a way to easily find defaults (more complete/reliable than > docs)? The default is way too low (128M) which is non-optimal. I've opened https://issues.apache.org/jira/browse/MAPREDUCE-4316 to fix it. > > Thanks, I'll give that a try. How does minimum container size relate > to settings mapreduce.map.memory.mb? Would it essentially raise my > 768M map memory allocation to 1G? Does CapacityScheduler need any > additional configuration to function optimally? BTW, what is the > default scheduler? The min. container size should be >= mapreduce.map.memory.mb. If you have a heap size of 768M for your maps, it should be perfectly ok to have min-container size as 1024M. Hmm... I'd also look into bumping mapreduce.reduce.memory.mb to be 2*min-container-size i.e. use 2 containers. The default scheduler is FifoScheduler which isn't as mature as CS (which is what we use for tuning etc., see http://hortonworks.com/blog/delivering-on-hadoop-next-benchmarking-performance/ for more details). > > I should have MAPREDUCE-3641. I'm using 0.23.1 with CDH4b2 patches > (and a few Java 7/Ubuntu 12.04 build patches). How does 2.0.0-alpha > compare to 0.23.1? > I'm not familiar with patchsets on CDH, but hadoop-2.0.0-alpha is a significant improvement on hadoop-0.23.1... > If there's anything I can do to assist with the issue of spreading out > map tasks, please let me know. Is there a JIRA issue for it (or if > not, should there be)? > For smaller clusters https://issues.apache.org/jira/browse/MAPREDUCE-3210 could help. > Incidentally, my current benchmarking work on x86 is only a training > ground and baseline before moving onto ARM-based systems, which have > 4GB RAM and generally fewer, smaller (2.5" form factor) disks per > node. It sounds like the smaller RAM will force better distribution, > but the disk capacity/utilization situation will be more severe. > Right, smaller RAM should force better distribution. Love to get more f/b from your ARM work, thanks! Arun > Thanks, > Trevor > > On Tue, May 29, 2012 at 6:21 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote: >> What is the minimum container size? i.e. >> yarn.scheduler.minimum-allocation-mb. >> >> I'd bump it up to at least 1G and use the CapacityScheduler for performance >> tests: >> http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html>> >> In case of teragen, the job has no locality at all (since it's just >> generating data from 'random' input-splits) and hence you are getting them >> stuck on fewer nodes since you have so many containers on each node. >> >> The reduces should be better spread if you are using CapacityScheduler and >> have https://issues.apache.org/jira/browse/MAPREDUCE-3641 in your build i.e. >> hadoop-0.23.1 or hadoop-2.0.0-alpha (I'd use the latter). >> >> Also, FYI, currently the CS makes the tradeoff that node-locality is almost >> same as rack-locality and hence you might see maps not spread out for >> terasort. I'll fix that one soon. >> >> hth, >> Arun >> >> On May 29, 2012, at 2:33 PM, Trevor Robinson wrote: >> >> Hello, >> >> I'm trying to tune terasort on a small cluster (4 identical slave >> nodes w/ 4 disks and 16GB RAM each), but I'm having problems with very >> uneven load. >> >> For teragen, I specify 24 mappers, but for some reason, only 2 nodes >> out of 4 run them all, even though the web UI (for both YARN and HDFS) >> shows all 4 nodes available. Similarly, I specify 16 reducers for >> terasort, but the reducers seem to run on 3 nodes out of 4. Do I have >> something configured wrong, or does the scheduler not attempt to >> spread out the load? In addition to performing sub-optimally, this Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
-
Re: Idle nodes with terasort and MRv2/YARN (0.23.1)
Trevor 2012-06-29, 22:48
Thanks, Arun. Switching to CapacityScheduler seems to have fixed much of the issue: TeraGen and TeraSort are now evenly distributed and run almost twice as fast. However, TeraValidate only ran on one node, leaving 3 completely idle (except for the AM). I browsed the block locations of the output partitions and they seem to be pretty evenly distributed. Any idea why all 17 containers (16 map, 1 reduce) would run on a single node for this job? I left the minimum container size at 768 MB (the mapper size), so this allocation is valid for my 16 GB machine, but I don't know why the CS didn't distribute this job like it did the others. BTW, I just noticed MAPREDUCE-4335< https://issues.apache.org/jira/browse/MAPREDUCE-4335>(Changethe default scheduler to the CapacityScheduler), so it looks like this will be fixed by default in the next release. I'll keep you posted on the ARM work. Just keeping it building is a bit of work (e.g. HADOOP-8538), but now that I'm getting a handle on x86 performance tuning with MRv2, I should have some ARM results soon. -Trevor On Tue, Jun 5, 2012 at 7:35 AM, Arun C Murthy <[EMAIL PROTECTED]> wrote: > Trevor, > > On May 30, 2012, at 1:38 PM, Trevor Robinson wrote: > > Jeff: > > Thanks for the corroboration and advice. I can't retreat to 0.20, and > must forge ahead with 2.0, so I'll share any progress. > > Arun: > > I haven't set the minimum container size. Do you know the default? Is > there a way to easily find defaults (more complete/reliable than > docs)? > > > The default is way too low (128M) which is non-optimal. > > I've opened https://issues.apache.org/jira/browse/MAPREDUCE-4316 to fix > it. > > > Thanks, I'll give that a try. How does minimum container size relate > to settings mapreduce.map.memory.mb? Would it essentially raise my > 768M map memory allocation to 1G? Does CapacityScheduler need any > additional configuration to function optimally? BTW, what is the > default scheduler? > > > The min. container size should be >= mapreduce.map.memory.mb. > > If you have a heap size of 768M for your maps, it should be perfectly ok > to have min-container size as 1024M. > > Hmm... I'd also look into bumping mapreduce.reduce.memory.mb to be > 2*min-container-size i.e. use 2 containers. > > The default scheduler is FifoScheduler which isn't as mature as CS (which > is what we use for tuning etc., see > http://hortonworks.com/blog/delivering-on-hadoop-next-benchmarking-performance/for more details). > > > I should have MAPREDUCE-3641. I'm using 0.23.1 with CDH4b2 patches > (and a few Java 7/Ubuntu 12.04 build patches). How does 2.0.0-alpha > compare to 0.23.1? > > > I'm not familiar with patchsets on CDH, but hadoop-2.0.0-alpha is a > significant improvement on hadoop-0.23.1... > > If there's anything I can do to assist with the issue of spreading out > map tasks, please let me know. Is there a JIRA issue for it (or if > not, should there be)? > > > For smaller clusters https://issues.apache.org/jira/browse/MAPREDUCE-3210could help. > > Incidentally, my current benchmarking work on x86 is only a training > ground and baseline before moving onto ARM-based systems, which have > 4GB RAM and generally fewer, smaller (2.5" form factor) disks per > node. It sounds like the smaller RAM will force better distribution, > but the disk capacity/utilization situation will be more severe. > > > Right, smaller RAM should force better distribution. > > Love to get more f/b from your ARM work, thanks! > > Arun > > Thanks, > Trevor > > On Tue, May 29, 2012 at 6:21 PM, Arun C Murthy <[EMAIL PROTECTED]> > wrote: > > What is the minimum container size? i.e. > > yarn.scheduler.minimum-allocation-mb. > > > I'd bump it up to at least 1G and use the CapacityScheduler for performance > > tests: > > > http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html> > > In case of teragen, the job has no locality at all (since it's just > > generating data from 'random' input-splits) and hence you are getting them
|
|