-
monitor the hadoop cluster
Da Zheng 2010-11-11, 19:52
Hello,
I wrote a MapReduce program and ran it on a 3-node hadoop cluster, but its running time varies a lot, from 2 minutes to 3 minutes. I want to understand how time is used by the map phase and the reduce phase, and hope to find the place to improve the performance.
Also the current input data is sorted, so I wrote a customized partitioner to reduce the data shuffling across the network. I need some means to help me observe the data movement.
I know the Hadoop community developed Chukwa for monitoring, but it seems very immature right now. I wonder how people monitor Hadoop clusters today. Is there a good way to solve the problems listed above?
Thanks, Da
-
Re: monitor the hadoop cluster
David Rosenstrauch 2010-11-11, 20:00
Just my $0.02, but IMO you're working on some faulty assumptions here. Hadoop is explicitly *not* a real-time system, and so it's not reasonable for you to expect to have such fine-grained control over its processing speed. It's a distributed system, where many things can affect how long a job takes, such as: how many nodes are in the cluster, how many other jobs are running, the technical specs of each node, whether/how Hadoop implements "speculative execution" during your job, whether your job has any task failures/retries, whether you have any hardware failures during your job, ...
You can have control over performance on a Hadoop cluster, via things like adding nodes, tweaking some config params, etc. But you're much more likely to be able to make performance improvements like cutting a job down from 3 hours to 2 hours, not from 3 minutes to 2 minutes. You're just not going to get that kind of fine-grained control with Hadoop. Nor should you be looking for it, IMO. If that's what you want, then Hadoop is probably the wrong tool for your job.
DR
-
Re: monitor the hadoop cluster
Da Zheng 2010-11-11, 20:23
Hello,
I'm not really trying to cut the time from 3 minutes to 2 minutes. I was asking whether there are tools to monitor the Hadoop cluster and possibly find spots for performance improvement. I'm very new to Hadoop, and I hope to get a good view of how time is used by each mapper and reducer, so I'll have more confidence running it on a much larger dataset.
More importantly, I want to see how much data shuffling can be saved if I use the customized partitioner.
Best, Da
-
Re: monitor the hadoop cluster
Luke Lu 2010-11-11, 20:31
The job detail page from the JobTracker shows a lot of information about any given job: the start/finish times of each task and various counters (such as time spent in various phases, input/output bytes and records, etc.).
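As a rough illustration of how per-task start/finish times like those on the job detail page could be turned into a per-phase summary, here is a minimal Python sketch. The task records are made-up sample values, not output from a real JobTracker, and `summarize_phases` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical task records: (phase, start_second, finish_second).
# In practice these would be copied by hand from the job detail page.
TASKS = [
    ("map", 0, 40),
    ("map", 0, 55),
    ("map", 5, 48),
    ("reduce", 30, 120),
    ("reduce", 30, 150),
]


def summarize_phases(tasks):
    """Return {phase: (task_count, mean_task_duration, wall_clock_span)}."""
    by_phase = defaultdict(list)
    for phase, start, finish in tasks:
        by_phase[phase].append((start, finish))

    summary = {}
    for phase, spans in by_phase.items():
        durations = [finish - start for start, finish in spans]
        # Wall-clock span: earliest start to latest finish within the phase.
        wall = max(f for _, f in spans) - min(s for s, _ in spans)
        summary[phase] = (len(spans), sum(durations) / len(durations), wall)
    return summary


if __name__ == "__main__":
    for phase, (count, mean, wall) in summarize_phases(TASKS).items():
        print(f"{phase}: {count} tasks, mean {mean:.1f}s, span {wall}s")
```

Comparing the mean task duration with the wall-clock span gives a quick hint of whether a phase is dominated by one slow task or by tasks waiting for free slots.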
For monitoring the aggregate performance of a cluster, the Hadoop metrics system can send a lot of information to standard monitoring tools (Ganglia etc.), which can graph and monitor aggregate metrics such as running/waiting maps and reduces.
__Luke
-
Re: monitor the hadoop cluster
Da Zheng 2010-11-12, 03:24
I guess it might make more sense to measure performance only when a large dataset is used. The JobTracker page doesn't update very frequently.
Could you tell me which Hadoop metrics can be gathered by Ganglia? It's not trivial for me to install and set up Ganglia, especially since I don't have root permission right now.
Best, Da
-
Re: monitor the hadoop cluster
Harsh J 2010-11-12, 05:29
Hello,
On Fri, Nov 12, 2010 at 1:22 AM, Da Zheng <[EMAIL PROTECTED]> wrote:
> I wrote a MapReduce program and ran it on a 3-node hadoop cluster, but its running time varies a lot, from 2 minutes to 3 minutes. I want to understand how time is used by the map phase and the reduce phase, and hope to find the place to improve the performance.
If you are looking to profile the job (your mapper/reducer code), see: http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Profiling
> Also the current input data is sorted, so I wrote a customized partitioner to reduce the data shuffling across the network. I need some means to help me observe the data movement.
A reducer will always pick up all the files the partitioner assigned to it. ReduceTask uses HTTP to retrieve these files and then merges them to prepare for the Reducer to run. I don't see how a partitioner can help you reduce data movement, as data still needs to be transferred around. A combiner helps, though.
--
Harsh J
www.harshj.com
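To illustrate the combiner point, here is a small Python model of a word count, not Hadoop code. It only shows why summing counts on the map side shrinks what has to be shuffled; the function names `map_words`, `combine`, and `reduce_counts` are hypothetical.

```python
from collections import Counter


def map_words(line):
    """Map step: emit one (word, 1) pair per word in the input line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts per word locally, before the shuffle."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_counts(pairs):
    """Reduce step: sum all counts per word into the final result."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


raw = map_words("a b a c a b")   # one pair per word occurrence
combined = combine(raw)          # one pair per distinct word

if __name__ == "__main__":
    print(len(raw), "pairs without a combiner,", len(combined), "with one")
    print(reduce_counts(combined))
```

The reducer sees the same totals either way; only the number of records that would cross the network changes. This works because addition is associative and commutative, which is the usual condition for using a reduce function as a combiner.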
-
Re: monitor the hadoop cluster
Da Zheng 2010-11-12, 05:40
On 11/12/10 12:29 AM, Harsh J wrote:
> I don't see how a partitioner can help you reduce data movement, as data still needs to be transferred around? A combiner helps though.
It can, if the output of each mapper can be fed to a reducer on the same machine. As I said, the input data is sorted, so this is possible if I partition the mapper output very carefully.
Best, Da
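The idea can be sketched with a toy Python model, not Hadoop code. It rests on a strong assumption: that mapper i and reducer i run on the same node, which Hadoop's scheduler does not promise. The split layout, the boundaries, and all function names are hypothetical, and `modulo_partition` merely stands in for a hash partitioner on integer keys.

```python
import bisect

NUM_NODES = 3


def make_sorted_splits(num_keys, num_nodes):
    """Cut the sorted keys 0..num_keys-1 into contiguous per-node splits."""
    size = num_keys // num_nodes
    return [list(range(i * size, (i + 1) * size)) for i in range(num_nodes)]


def range_partition(key, boundaries):
    """Range partitioner: reducer index is the key's slot among the boundaries."""
    return bisect.bisect_right(boundaries, key)


def modulo_partition(key, num_reducers):
    """Stand-in for a hash partitioner on integer keys."""
    return key % num_reducers


def count_remote(splits, partition):
    """Count records whose reducer index differs from the mapper's node index."""
    return sum(
        1
        for node, keys in enumerate(splits)
        for key in keys
        if partition(key) != node
    )


splits = make_sorted_splits(300, NUM_NODES)
boundaries = [100, 200]  # chosen to match the split edges of the sorted input

remote_with_range = count_remote(splits, lambda k: range_partition(k, boundaries))
remote_with_modulo = count_remote(splits, lambda k: modulo_partition(k, NUM_NODES))

if __name__ == "__main__":
    print("remote records, range partitioner: ", remote_with_range)
    print("remote records, modulo partitioner:", remote_with_modulo)
```

In this toy run the range partitioner leaves no record on a "remote" reducer, while the modulo partitioner sends 198 of the 300 records elsewhere. Whether a real cluster sees a similar saving depends on where the reduce tasks actually get scheduled.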