|
edward choi
2010-07-08, 09:07
Vitaliy Semochkin
2010-07-08, 10:15
Jones, Nick
2010-07-08, 13:49
Yu Li
2010-07-08, 15:57
Ken Goodhope
2010-07-08, 16:08
edward choi
2010-07-09, 01:14
Michael Segel
2010-07-09, 01:35
edward choi
2010-07-09, 01:42
Vitaliy Semochkin
2010-07-21, 16:17
Allen Wittenauer
2010-07-21, 21:07
Vitaliy Semochkin
2010-07-22, 11:40
Allen Wittenauer
2010-07-22, 17:23
Vitaliy Semochkin
2010-07-23, 12:25
|
-
How to control the number of map tasks for each nodes?edward choi 2010-07-08, 09:07
Hi,
I have a cluster consisting of 11 slaves and a single master. The thing is that 3 of my slaves have i7 cpu which means that they can have up to 8 simultaneous processes. But other slaves only have dual core cpus. So I was wondering if I can specify the number of map tasks for each of my slaves. For example, I want to give 8 map tasks to the slaves that have i7 cpus and only two map tasks to the others. Is there a way to do this?
-
Re: How to control the number of map tasks for each nodes?Vitaliy Semochkin 2010-07-08, 10:15
Hi,
in mapred-site.xml you should place <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>8</value> <description>the number of available cores on the tasktracker machines for map tasks </description> </property> <property> <name>mapred.tasktracker.reduce.tasks.maximum</name> <value>8</value> <description>the number of available cores on the tasktracker machines for reduce tasks </description> </property> where 8 is number of your CORES not CPUS, if you have 8 dual core processors place 16 there. I found out that having number of map tasks a bit bigger than number of cores is better cause sometimes hadoop waits for IO operations and task do nothing. Regards, Vitaliy S On Thu, Jul 8, 2010 at 1:07 PM, edward choi <[EMAIL PROTECTED]> wrote: > Hi, > > I have a cluster consisting of 11 slaves and a single master. > > The thing is that 3 of my slaves have i7 cpu which means that they can have > up to 8 simultaneous processes. > But other slaves only have dual core cpus. > > So I was wondering if I can specify the number of map tasks for each of my > slaves. > For example, I want to give 8 map tasks to the slaves that have i7 cpus and > only two map tasks to the others. > > Is there a way to do this? >
-
RE: How to control the number of map tasks for each nodes?Jones, Nick 2010-07-08, 13:49
Vitaliy/Edward,
One thing to keep in mind is that overcommitting the number of cores can lead to map timeouts unless the map task submits progress updates to jobtracker. I found out the hard way that with a few computationally expensive maps. Nick Jones -----Original Message----- From: Vitaliy Semochkin [mailto:[EMAIL PROTECTED]] Sent: Thursday, July 08, 2010 5:15 AM To: [EMAIL PROTECTED] Subject: Re: How to control the number of map tasks for each nodes? Hi, in mapred-site.xml you should place <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>8</value> <description>the number of available cores on the tasktracker machines for map tasks </description> </property> <property> <name>mapred.tasktracker.reduce.tasks.maximum</name> <value>8</value> <description>the number of available cores on the tasktracker machines for reduce tasks </description> </property> where 8 is number of your CORES not CPUS, if you have 8 dual core processors place 16 there. I found out that having number of map tasks a bit bigger than number of cores is better cause sometimes hadoop waits for IO operations and task do nothing. Regards, Vitaliy S On Thu, Jul 8, 2010 at 1:07 PM, edward choi <[EMAIL PROTECTED]> wrote: > Hi, > > I have a cluster consisting of 11 slaves and a single master. > > The thing is that 3 of my slaves have i7 cpu which means that they can have > up to 8 simultaneous processes. > But other slaves only have dual core cpus. > > So I was wondering if I can specify the number of map tasks for each of my > slaves. > For example, I want to give 8 map tasks to the slaves that have i7 cpus and > only two map tasks to the others. > > Is there a way to do this? >
-
Re: How to control the number of map tasks for each nodes?Yu Li 2010-07-08, 15:57
Hi all,
Just a reminder that there're also datanode and tasktracker daemon which also take CPU resource, so if we have 8 dual cores, the sum of the map and reduce maximum task number should be no more than 8*2-2=14. Best Regards, Carp 2010/7/8 Vitaliy Semochkin <[EMAIL PROTECTED]> > Hi, > > in mapred-site.xml you should place > > <property> > <name>mapred.tasktracker.map.tasks.maximum</name> > <value>8</value> > <description>the number of available cores on the tasktracker machines > for map tasks > </description> > </property> > <property> > <name>mapred.tasktracker.reduce.tasks.maximum</name> > <value>8</value> > <description>the number of available cores on the tasktracker machines > for reduce tasks > </description> > </property> > > where 8 is number of your CORES not CPUS, if you have 8 dual core > processors > place 16 there. > I found out that having number of map tasks a bit bigger than number of > cores is better cause sometimes hadoop waits for IO operations and task do > nothing. > > Regards, > Vitaliy S > > On Thu, Jul 8, 2010 at 1:07 PM, edward choi <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > I have a cluster consisting of 11 slaves and a single master. > > > > The thing is that 3 of my slaves have i7 cpu which means that they can > have > > up to 8 simultaneous processes. > > But other slaves only have dual core cpus. > > > > So I was wondering if I can specify the number of map tasks for each of > my > > slaves. > > For example, I want to give 8 map tasks to the slaves that have i7 cpus > and > > only two map tasks to the others. > > > > Is there a way to do this? > > >
-
Re: How to control the number of map tasks for each nodes?Ken Goodhope 2010-07-08, 16:08
If you want to have a different number of tasks for different nodes, you
will need to look at one of the more advanced schedulers. FairScheduler and CapacityScheduler are the most common. FairScheduler has extensibility points where you can add your own logic for deciding if a particular node can schedule another task. I believe CapacityScheduler does this too, but i haven't used it as much. On Thu, Jul 8, 2010 at 6:49 AM, Jones, Nick <[EMAIL PROTECTED]> wrote: > Vitaliy/Edward, > One thing to keep in mind is that overcommitting the number of cores can > lead to map timeouts unless the map task submits progress updates to > jobtracker. I found out the hard way that with a few computationally > expensive maps. > > Nick Jones > > -----Original Message----- > From: Vitaliy Semochkin [mailto:[EMAIL PROTECTED]] > Sent: Thursday, July 08, 2010 5:15 AM > To: [EMAIL PROTECTED] > Subject: Re: How to control the number of map tasks for each nodes? > > Hi, > > in mapred-site.xml you should place > > <property> > <name>mapred.tasktracker.map.tasks.maximum</name> > <value>8</value> > <description>the number of available cores on the tasktracker machines > for map tasks > </description> > </property> > <property> > <name>mapred.tasktracker.reduce.tasks.maximum</name> > <value>8</value> > <description>the number of available cores on the tasktracker machines > for reduce tasks > </description> > </property> > > where 8 is number of your CORES not CPUS, if you have 8 dual core > processors > place 16 there. > I found out that having number of map tasks a bit bigger than number of > cores is better cause sometimes hadoop waits for IO operations and task do > nothing. > > Regards, > Vitaliy S > > On Thu, Jul 8, 2010 at 1:07 PM, edward choi <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > I have a cluster consisting of 11 slaves and a single master. > > > > The thing is that 3 of my slaves have i7 cpu which means that they can > have > > up to 8 simultaneous processes. > > But other slaves only have dual core cpus. > > > > So I was wondering if I can specify the number of map tasks for each of > my > > slaves. > > For example, I want to give 8 map tasks to the slaves that have i7 cpus > and > > only two map tasks to the others. > > > > Is there a way to do this? > > > >
-
Re: How to control the number of map tasks for each nodes?edward choi 2010-07-09, 01:14
Thanks you all for the info.
I have successfully configured my cluster. Thanks again. 2010/7/8 edward choi <[EMAIL PROTECTED]> > Hi, > > I have a cluster consisting of 11 slaves and a single master. > > The thing is that 3 of my slaves have i7 cpu which means that they can have > up to 8 simultaneous processes. > But other slaves only have dual core cpus. > > So I was wondering if I can specify the number of map tasks for each of my > slaves. > For example, I want to give 8 map tasks to the slaves that have i7 cpus and > only two map tasks to the others. > > Is there a way to do this? >
-
RE: How to control the number of map tasks for each nodes?Michael Segel 2010-07-09, 01:35
Just a nit. i7s are quad core. While Intel has 2 'threads' per core. (I'm assuming you're using the i7's (800 series and below) in a single socket mother board.) Based on experience with tuning other databases, I am a bit skeptical about going beyond the number of cores. Also your memory is probably going to be more of a driver on the number of map/reduce tasks you can run simultaneously. JMHO, YMMV. -Mike > Date: Fri, 9 Jul 2010 10:14:14 +0900 > Subject: Re: How to control the number of map tasks for each nodes? > From: [EMAIL PROTECTED] > To: [EMAIL PROTECTED] > > Thanks you all for the info. > > I have successfully configured my cluster. > Thanks again. > > 2010/7/8 edward choi <[EMAIL PROTECTED]> > > > Hi, > > > > I have a cluster consisting of 11 slaves and a single master. > > > > The thing is that 3 of my slaves have i7 cpu which means that they can have > > up to 8 simultaneous processes. > > But other slaves only have dual core cpus. > > > > So I was wondering if I can specify the number of map tasks for each of my > > slaves. > > For example, I want to give 8 map tasks to the slaves that have i7 cpus and > > only two map tasks to the others. > > > > Is there a way to do this? > > _________________________________________________________________ The New Busy is not the old busy. Search, chat and e-mail from your inbox. http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_3
-
Re: How to control the number of map tasks for each nodes?edward choi 2010-07-09, 01:42
Mike,
Thanks for the heads up. Actually, my i7s are 930. Does this make any difference? Ed 2010/7/9 Michael Segel <[EMAIL PROTECTED]> > > Just a nit. > > i7s are quad core. While Intel has 2 'threads' per core. (I'm assuming > you're using the i7's (800 series and below) in a single socket mother > board.) > Based on experience with tuning other databases, I am a bit skeptical about > going beyond the number of cores. Also your memory is probably going to be > more of a driver on the number of map/reduce tasks you can run > simultaneously. > > JMHO, YMMV. > > -Mike > > > > Date: Fri, 9 Jul 2010 10:14:14 +0900 > > Subject: Re: How to control the number of map tasks for each nodes? > > From: [EMAIL PROTECTED] > > To: [EMAIL PROTECTED] > > > > Thanks you all for the info. > > > > I have successfully configured my cluster. > > Thanks again. > > > > 2010/7/8 edward choi <[EMAIL PROTECTED]> > > > > > Hi, > > > > > > I have a cluster consisting of 11 slaves and a single master. > > > > > > The thing is that 3 of my slaves have i7 cpu which means that they can > have > > > up to 8 simultaneous processes. > > > But other slaves only have dual core cpus. > > > > > > So I was wondering if I can specify the number of map tasks for each of > my > > > slaves. > > > For example, I want to give 8 map tasks to the slaves that have i7 cpus > and > > > only two map tasks to the others. > > > > > > Is there a way to do this? > > > > > _________________________________________________________________ > The New Busy is not the old busy. Search, chat and e-mail from your inbox. > > http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_3
-
Re: How to control the number of map tasks for each nodes?Vitaliy Semochkin 2010-07-21, 16:17
Hi,
might I ask how did you come to such result? In my cluster I use number of mappers and reducers twice more than I have cpu*cores How did I come to this solution, - first I noticed that in TOP avg load is very law (3-4%) and I noticed that cpu do a lot of WA. After several experiments I found out that having number of mappers reducers TWICE more than I have cpu*core does the best result (the result was almost TWICE BETTER). That I can explain by the fast that I do relativly simple log counting (count number of visitors,hits, etc) and in this case I have relativly huge amount of IO (logs are huge) and small amount computation. I also use mapred.job.reuse.jvm.num.tasks=-1 What I do not understand is why mapred.child.java.opts=-Xmx256m boosts performance in comparison to -Xmx160m how bigger amount of RAM can give me any benefit even if I don't receive out of memory errors with smaller -Xmx values?! Regards, Vitaliy S On Thu, Jul 8, 2010 at 7:57 PM, Yu Li <[EMAIL PROTECTED]> wrote: > Hi all, > > Just a reminder that there're also datanode and tasktracker daemon which > also take CPU resource, so if we have 8 dual cores, the sum of the map and > reduce maximum task number should be no more than 8*2-2=14. > > Best Regards, > Carp > > 2010/7/8 Vitaliy Semochkin <[EMAIL PROTECTED]> > > > Hi, > > > > in mapred-site.xml you should place > > > > <property> > > <name>mapred.tasktracker.map.tasks.maximum</name> > > <value>8</value> > > <description>the number of available cores on the tasktracker machines > > for map tasks > > </description> > > </property> > > <property> > > <name>mapred.tasktracker.reduce.tasks.maximum</name> > > <value>8</value> > > <description>the number of available cores on the tasktracker machines > > for reduce tasks > > </description> > > </property> > > > > where 8 is number of your CORES not CPUS, if you have 8 dual core > > processors > > place 16 there. > > I found out that having number of map tasks a bit bigger than number of > > cores is better cause sometimes hadoop waits for IO operations and task > do > > nothing. > > > > Regards, > > Vitaliy S > > > > On Thu, Jul 8, 2010 at 1:07 PM, edward choi <[EMAIL PROTECTED]> wrote: > > > > > Hi, > > > > > > I have a cluster consisting of 11 slaves and a single master. > > > > > > The thing is that 3 of my slaves have i7 cpu which means that they can > > have > > > up to 8 simultaneous processes. > > > But other slaves only have dual core cpus. > > > > > > So I was wondering if I can specify the number of map tasks for each of > > my > > > slaves. > > > For example, I want to give 8 map tasks to the slaves that have i7 cpus > > and > > > only two map tasks to the others. > > > > > > Is there a way to do this? > > > > > >
-
Re: How to control the number of map tasks for each nodes?Allen Wittenauer 2010-07-21, 21:07
On Jul 21, 2010, at 9:17 AM, Vitaliy Semochkin wrote: > might I ask how did you come to such result? > > In my cluster I use number of mappers and reducers twice more than I have > cpu*cores This is probably a sign that your data is in too many small files. > How did I come to this solution, - first I noticed that in TOP avg load is > very law (3-4%) and I noticed that cpu do a lot of WA. > > After several experiments I found out that having number of mappers reducers > TWICE more than I have cpu*core does the best result (the result was almost > TWICE BETTER). But that should put more strain on the IO system since now more tasks are waiting for input.... so chances are good that your wait isn't IO, but in context switching.... Another good sign that you have too many files in too many blocks. > That I can explain by the fast that I do relativly simple log counting > (count number of visitors,hits, etc) > and in this case I have relativly huge amount of IO (logs are huge) and > small amount computation. > I also use mapred.job.reuse.jvm.num.tasks=-1 How many files, and what is your block count, and how large is the average file? 'huge' is fairly relative. :) > What I do not understand is why > mapred.child.java.opts=-Xmx256m boosts performance in comparison to -Xmx160m > how bigger amount of RAM can give me any benefit even if I don't receive out > of memory errors with smaller -Xmx values?! More memory means that Hadoop doesn't have to spill to disk as often due to being able to use a larger buffer in RAM.
-
Re: How to control the number of map tasks for each nodes?Vitaliy Semochkin 2010-07-22, 11:40
On Thu, Jul 22, 2010 at 1:07 AM, Allen Wittenauer
<[EMAIL PROTECTED]>wrote: > > On Jul 21, 2010, at 9:17 AM, Vitaliy Semochkin wrote: > > might I ask how did you come to such result? > > > > In my cluster I use number of mappers and reducers twice more than I have > > cpu*cores > > This is probably a sign that your data is in too many small files. > > > How did I come to this solution, - first I noticed that in TOP avg load > is > > very law (3-4%) and I noticed that cpu do a lot of WA. > > > > After several experiments I found out that having number of mappers > reducers > > TWICE more than I have cpu*core does the best result (the result was > almost > > TWICE BETTER). > > But that should put more strain on the IO system since now more tasks are > waiting for input.... so chances are good that your wait isn't IO, but in > context switching.... Another good sign that you have too many files in too > many blocks. > If it was a context switching would the increasing number of mappers/reducers lead to performance improvement? > > > That I can explain by the fast that I do relativly simple log counting > > (count number of visitors,hits, etc) > > and in this case I have relativly huge amount of IO (logs are huge) and > > small amount computation. > > I also use mapred.job.reuse.jvm.num.tasks=-1 > > How many files, and what is your block count, and how large is the average > file? 'huge' is fairly relative. :) I have one log file ~140GB I use default hdfs block size (64mb) also I set dfs.replication=1 Am I right that the higher dfs.replication the faster map reduce will work because the probability that split will be on a local node will be equal to 1? Also, is it correct that it will slow down put operations? (technically put operations will run in parallel so I'm not sure if it will slow down performance or not) > > What I do not understand is why > > mapred.child.java.opts=-Xmx256m boosts performance in comparison to > -Xmx160m > > how bigger amount of RAM can give me any benefit even if I don't receive > out > > of memory errors with smaller -Xmx values?! > > More memory means that Hadoop doesn't have to spill to disk as often due to > being able to use a larger buffer in RAM. > > Does hadoop check if it has enough memory for such operation? Regards, Vitaliy S
-
Re: How to control the number of map tasks for each nodes?Allen Wittenauer 2010-07-22, 17:23
On Jul 22, 2010, at 4:40 AM, Vitaliy Semochkin wrote: >> > If it was a context switching would the increasing number of > mappers/reducers lead to performance improvement? Woops, I misspoke. I meant process switching (which I guess is a form of context switching). More on that later. > I have one log file ~140GB I use default hdfs block size (64mb) So that's not 'huge' by Hadoop standards at all. If I did my math correctly and it really is one big file, you're likely seeing somewhere on the order of 2300 or so maps to process that file, right? What is the average time per task? What scheduler is being used? What version of Hadoop? How many machines? > also I set dfs.replication=1 Eek. I wonder what your locality hit rate is. > Am I right that the higher dfs.replication the faster map reduce will work > because the probability that split will be on a local node will be equal to > 1? If you mena the block for the map input, yes, you have a much higher probability of it being local and therefore faster. > Also, is it correct that it will slow down put operations? (technically put > operations will run in parallel so I'm not sure if it will slow down > performance or not) I don't know if anyone has studied output replications factor on job performance. I'm sure someone has though. I'm not fully awake yet (despite it being 10am), but I'm fairly certain that the job of replicating the local block falls onto the DN not the client, so the client may not be held up by replication at all. >> >> More memory means that Hadoop doesn't have to spill to disk as often due to >> being able to use a larger buffer in RAM. >> >> > Does hadoop check if it has enough memory for such operation? That depends upon what you mean by 'check'. By default, Hadoop will spawn whatever size heap you want. The OS, however, may have different ideas as to what is allowable. :) ... What I suspect is really happening is that your tasks are not very CPU intensive and don't take long to run through 64mb of data. So your task turn around time is very very fast. So fast, in fact, that the scheduler can't keep up. Boosting the number of tasks per node actually helps because the *initial* scheduling puts you that much farther ahead. An interesting test to perform is to bump the block size up to 128mb. You should see fewer tasks that stick around a bit longer. But you could very well see overall throughput go up because you are spending less time cycling through JVMs. [Even with reuse turned on.]
-
Re: How to control the number of map tasks for each nodes?Vitaliy Semochkin 2010-07-23, 12:25
On Thu, Jul 22, 2010 at 9:23 PM, Allen Wittenauer
<[EMAIL PROTECTED]>wrote: > > On Jul 22, 2010, at 4:40 AM, Vitaliy Semochkin wrote: > >> > > If it was a context switching would the increasing number of > > mappers/reducers lead to performance improvement? > > Woops, I misspoke. I meant process switching (which I guess is a form of > context switching). More on that later. > > > I have one log file ~140GB I use default hdfs block size (64mb) > > So that's not 'huge' by Hadoop standards at all. > > If I did my math correctly and it really is one big file, you're likely > seeing somewhere on the order of 2300 or so maps to process that file, > right? What is the average time per task? What scheduler is being used? > What version of Hadoop? How many machines? > I use hadoop 0.20.2 with default scheduler on 5 servers (Xeon(R) CPU E5320 2 CPU 4 CORES 4 GM RAM) I start Datanode/Tasktracker on NameNode/JobTracker Your maths is perfect 2100 maps, an avg time per task is about 8sec (btw, does hadoop have tools to count avg value spent on task?) > > > also I set dfs.replication=1 > > Eek. I wonder what your locality hit rate is. Is it possible to check it? > > Am I right that the higher dfs.replication the faster map reduce will > work > > because the probability that split will be on a local node will be equal > to > > 1? > > If you mena the block for the map input, yes, you have a much higher > probability of it being local and therefore faster. > I mean the block. (still getting used to terminology) > > > Also, is it correct that it will slow down put operations? (technically > put > > operations will run in parallel so I'm not sure if it will slow down > > performance or not) > > I don't know if anyone has studied output replications factor on job > performance. I'm sure someone has though. I'm not fully awake yet (despite > it being 10am), but I'm fairly certain that the job of replicating the local > block falls onto the DN not the client, so the client may not be held up by > replication at all. > > >> > >> More memory means that Hadoop doesn't have to spill to disk as often due > to > >> being able to use a larger buffer in RAM. > >> > >> > > Does hadoop check if it has enough memory for such operation? > > That depends upon what you mean by 'check'. By default, Hadoop will spawn > whatever size heap you want. The OS, however, may have different ideas as > to what is allowable. :) > > ... > > > What I suspect is really happening is that your tasks are not very CPU > intensive and don't take long to run through 64mb of data. So your task > turn around time is very very fast. So fast, in fact, that the scheduler > can't keep up. Boosting the number of tasks per node actually helps because > the *initial* scheduling puts you that much farther ahead. > > An interesting test to perform is to bump the block size up to 128mb. You > should see fewer tasks that stick around a bit longer. But you could very > well see overall throughput go up because you are spending less time cycling > through JVMs. [Even with reuse turned on.] > Now most tasks are in state initializing, but overall time spent didn't changed. Should I reformat HDFS after I changed the block size? Looks like you have big experience in this field, what are the books/articles you recommend to check? Thanks in advance, Vitaliy S |