|
Raj V
2011-01-11, 05:06
bharath vissapragada
2011-01-11, 05:46
Phil Whelan
2011-01-11, 06:39
Adarsh Sharma
2011-01-11, 06:40
Raj V
2011-01-11, 16:05
Ted Dunning
2011-01-11, 16:22
Raj V
2011-01-11, 16:40
Niels Basjes
2011-01-11, 19:07
Raj V
2011-01-11, 20:31
Steve Loughran
2011-01-13, 11:05
Raj V
2011-01-13, 16:51
|
-
TeraSort question. Raj V 2011-01-11, 05:06
All,
I have been running terasort on a 480-node Hadoop cluster. I have also collected CPU, memory, disk, and network statistics during this run. The system stats are quite interesting. I can post them once I have put them together in some presentable format (if there is interest). However, while looking at the data, I noticed something interesting.

I thought, intuitively, that all the systems in the cluster would show more or less similar behaviour (possibly shifted in time), and that the overall graph would look the same.

Just to confirm it, I took 5 random nodes and looked at the CPU, disk, network, etc. activity while the sort was running. Strangely enough, it was not so. Two of the 5 systems were seriously busy, with big I/O and lots of disk and network activity. On the other three systems, the CPU was more or less 100% idle, with slight network and I/O activity.

Is that normal and/or expected? Shouldn't all the nodes be utilized in more or less the same manner over the length of the run?

I generated the data for the sort using teragen (128 MB block size, replication = 3).

I would also be interested in other people's sort timings. Is there some place where people can post sort numbers (not just the record)?

I will post the actual graphs of the 5 nodes tomorrow, if there is interest. (Some logistical issues about posting them tonight.)

I am using CDH3B3, even though I think this is not specific to CDH3B3.

Sorry for the cross post.

Raj
-
Re: TeraSort question. bharath vissapragada 2011-01-11, 05:46
Ravi,
Please post the figures and graphs. Figures for large clusters (> 200 nodes) are certainly interesting.

Thanks
-
Re: TeraSort question. Phil Whelan 2011-01-11, 06:39
Hi Raj,
> Two of the 5 systems were seriously busy, big IO with lots of disk and network activity. The other three systems, CPU was more or less 100% idle, slight network and I/O.

This process defaults to just 2 map tasks, so only 2 nodes are utilized. Did you try the mapred.map.tasks option?

I found a very similar question and answer here:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg00005.html

>> 1. The data is generated in a fashion to where it is not balanced
>> across my cluster. This is because the data is generated with 2 maps.
>
> These are due to the default #maps/#reduces in Map-Reduce.
> Use:
> $ bin/hadoop jar hadoop-*-dev-examples.jar teragen -Dmapred.map.tasks=8000 10000000000 /tera/in
> $ bin/hadoop jar hadoop-*-dev-examples.jar terasort -Dmapred.reduce.tasks=5300 /tera/in /tera/out
>
> Arun

Hope that helps.

Thanks,
Phil
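To put rough numbers on the teragen settings quoted above, here is a minimal Python sketch of how much data each map task would write. It relies on TeraGen's fixed 100-byte rows; the helper name `bytes_per_map` is made up for this illustration, and it assumes rows are divided evenly across map tasks.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def bytes_per_map(total_rows: int, num_maps: int) -> float:
    """Average bytes written per map task, assuming an even division of rows."""
    return total_rows * ROW_BYTES / num_maps


rows = 10_000_000_000  # row count used in the quoted teragen command
for maps in (2, 8000):
    megabytes = bytes_per_map(rows, maps) / 1e6
    print(f"{maps} maps: {megabytes:,.0f} MB written per map task")
```

With only 2 map tasks, all of the generation work falls on at most two task slots, which is the imbalance the quoted answer describes.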
-
Re: TeraSort question. Adarsh Sharma 2011-01-11, 06:40
If possible, please also post your configuration parameters like *dfs.data.dir*, *mapred.local.dir*, map and reduce parameters, Java, etc.

Thanks
-
Re: TeraSort question. Raj V 2011-01-11, 16:05
I used 9500 maps.
The number of maps defaults to 2 for teragen. For terasort, it would depend on the number of input files, the dfs.block.size, and the number of nodes.

Raj
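As a rough illustration of how the number of input files and the block size drive the terasort map count, the Python sketch below counts one split per block-sized chunk of each file. This is a simplification, not Hadoop's actual split logic: it ignores split slop and any minimum/maximum split-size settings, and it does not model the number of nodes. The 1 TB input size and the function name `estimate_map_tasks` are assumptions made for the example; the thread does not state the data size.

```python
import math


def estimate_map_tasks(file_sizes, block_size):
    """Rough map count: one split per block-sized chunk of each non-empty file.

    Simplified model; real split computation has extra rules (slop, min/max size).
    """
    return sum(math.ceil(size / block_size) for size in file_sizes if size > 0)


BLOCK = 128 * 1024 * 1024   # 128 MB dfs.block.size, as mentioned in the thread
TOTAL = 10**12              # hypothetical 1 TB of input

one_file = estimate_map_tasks([TOTAL], BLOCK)
many_files = estimate_map_tasks([TOTAL // 9500] * 9500, BLOCK)

print("single 1 TB file:", one_file, "map tasks")
print("9500 equal files:", many_files, "map tasks")
```

Under these assumptions, input written by 9500 teragen maps yields files smaller than one block, so the sort would get one map task per file.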
-
Re: TeraSort question. Ted Dunning 2011-01-11, 16:22
Raj,
Do you have the job history files? That would be very useful. I would be happy to create some swimlane and related graphs for you if you can send me the history files.
-
Re: TeraSort question. Raj V 2011-01-11, 16:40
Ted
Thanks. I have all the graphs I need, including the map-reduce timeline and the system activity for all the nodes while the sort was running. I will publish them once I have them in some presentable format.

For legal reasons, I really don't want to send the complete job history files.

My question is still this: when running terasort, would the CPU, disk, and network utilization of all the nodes be more or less similar, or completely different?

Sometime during the day, I will post the system data from 5 nodes, and that would probably explain my question better.

Raj
-
Re: TeraSort question. Niels Basjes 2011-01-11, 19:07
Raj,
Have a look at the graph shown here:
http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_1.1_--_Generating_Task_Timelines

It should make clear that the number of tasks varies greatly over the lifetime of a job. Depending on the nodes available, this may leave nodes idle.

Niels

--
With kind regards,
Niels Basjes
-
Re: TeraSort question. Raj V 2011-01-11, 20:31
Can't attach the PDF file that shows the different maps.
The file is too big.
-
Re: TeraSort question. Steve Loughran 2011-01-13, 11:05
On 11/01/11 16:40, Raj V wrote:
> My question is still this. When running terasort, would the CPU, disk and network utilization of all the nodes be more or less similar or completely different.

They can be different. The JT pushes out work to machines when they report in; some may get more work than others, and so generate more local data. This will have follow-on consequences. In a live system things are different, as the work tends to follow the data, so machines with (or near) the data you need get the work.

It's really hard to say whether the cluster is "working right": when bringing it up, everyone is really guessing about expected performance.

-Steve
-
Re: TeraSort question. Raj V 2011-01-13, 16:51
Steve
Let me plot the graphs for all the nodes. I picked 6 random nodes out of 480; 2 of these were really busy and the other 4 were idle. Either that makes me very lucky or the cluster was underutilized. I would have found it acceptable if different nodes were utilized in different ways, but in my case, 2 nodes had serious CPU, network, and disk activity and the others were completely idle.
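The "very lucky or underutilized" reasoning can be checked with a small hypergeometric calculation. The Python sketch below computes the chance that a random sample of 6 nodes out of 480 contains at most 2 busy ones, for a few hypothetical busy fractions. The fractions are illustrative assumptions, not measurements from this cluster, and `prob_at_most_busy` is a name invented for the example.

```python
from math import comb


def prob_at_most_busy(total_nodes, busy_nodes, sample, max_busy):
    """P(a random sample without replacement holds <= max_busy busy nodes)."""
    idle_nodes = total_nodes - busy_nodes
    ways = sum(
        comb(busy_nodes, k) * comb(idle_nodes, sample - k)
        for k in range(max_busy + 1)
    )
    return ways / comb(total_nodes, sample)


# Hypothetical scenarios: 90%, 50%, and one third of the 480 nodes busy.
for busy in (432, 240, 160):
    p = prob_at_most_busy(480, busy, 6, 2)
    print(f"{busy} of 480 busy: P(at most 2 busy in a sample of 6) = {p:.4f}")
```

If nearly all nodes were busy, drawing 4 idle nodes out of 6 would be very unlikely, so a sample like the one described points more toward underutilization than toward luck.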