The job detail page from the jobtracker shows a lot of information
about any given job: the start/finish times of each task and various
counters (like time spent in each phase, input/output record and
byte counts, etc.).
For monitoring the aggregate performance of a cluster, the Hadoop
metrics system can send a lot of information to standard monitoring
tools (Ganglia etc.) to graph and monitor various aggregate metrics
(running/waiting maps and reduces, etc.).
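For example, in the 0.20 line the metrics contexts are configured in
conf/hadoop-metrics.properties. A minimal sketch for pushing mapred
metrics to Ganglia (the host:port value below is a placeholder — point
it at your own gmond):

```properties
# conf/hadoop-metrics.properties -- send mapred metrics to Ganglia
# every 10 seconds (ganglia-host:8649 is a placeholder)
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=ganglia-host:8649
```

The same pattern (dfs.class, jvm.class, etc.) applies to the other
metrics contexts if you want those graphed as well.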
On Thu, Nov 11, 2010 at 12:23 PM, Da Zheng <[EMAIL PROTECTED]> wrote:
> On 11/11/2010 03:00 PM, David Rosenstrauch wrote:
>> On 11/11/2010 02:52 PM, Da Zheng wrote:
>>> I wrote a MapReduce program and ran it on a 3-node hadoop cluster, but
>>> its running time varies a lot, from 2 minutes to 3 minutes. I want to
>>> understand how time is used by the map phase and the reduce phase, and
>>> hope to find the place to improve the performance.
>>> Also the current input data is sorted, so I wrote a customized
>>> partitioner to reduce the data shuffling across the network. I need some
>>> means to help me observe the data movement.
> I know the Hadoop community developed Chukwa for monitoring, but it seems
> very immature right now. I wonder how people monitor Hadoop clusters
> right now. Is there a good way to solve my problems listed above?
>> Just my $0.02, but IMO you're working on some faulty assumptions here.
>> Hadoop is explicitly *not* a real-time system, and so it's not reasonable
>> for you to expect to have such fine-grained control over its processing
>> speed. It's a distributed system, where many things can affect how long a
>> job takes, such as: how many nodes in the cluster, how many other jobs are
>> running, the technical specs of each node, whether/how Hadoop implements
>> "speculative execution" during your job, whether your job has any task
>> failures/retries, whether you have any hardware failures during your job, etc.
>> You can have control over performance on a Hadoop cluster, via things like
>> adding nodes, tweaking some config params, etc. But you're much more likely
>> to be able to make performance improvements like cutting a job down from 3
>> hours to 2 hours, not from 3 minutes to 2 minutes. You're just not going to
>> get that kind of fine-grained control with Hadoop. Nor should you be
>> looking for it, IMO. If that's what you want, then Hadoop is probably the
>> wrong tool for your job.
> I'm not really trying to cut the time from 3 minutes to 2 minutes. I was asking
> whether I can have some tools to monitor the hadoop cluster and possibly
> find the spot for performance improvement. I'm very new to hadoop, and I
> hope to have a good view of how time is used by each mapper and reducer, so
> I'll have more confidence to run it on a much larger dataset.
> More importantly, I want to see how much data shuffling can be saved if I
> use the customized partitioner.
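
On the shuffle question: the per-job counters (the reduce-side shuffle
bytes counter in particular) let you compare a run with the default
HashPartitioner against a run with your custom one, and that difference
is exactly the shuffling you saved. Since your input is already sorted,
the usual approach is a range partitioner that keeps contiguous key
ranges on the same reducer. A rough sketch of just the core logic in
plain Java (the class and field names here are made up for illustration;
a real one would extend Hadoop's Partitioner class and get the split
points from sampling or config):

```java
// Sketch: range partitioning for input whose keys are already sorted.
// Contiguous key ranges map to the same partition, so neighboring keys
// stay together and less data crosses the network during the shuffle.
public class RangePartition {

    // splitPoints must be sorted ascending, e.g. sampled from the input.
    // Keys below splitPoints[0] go to partition 0, keys between
    // splitPoints[0] and splitPoints[1] go to partition 1, and so on.
    static int partitionFor(long key, long[] splitPoints) {
        for (int i = 0; i < splitPoints.length; i++) {
            if (key < splitPoints[i]) {
                return i;
            }
        }
        return splitPoints.length; // everything above the last split point
    }

    public static void main(String[] args) {
        long[] splits = {100L, 200L}; // 3 partitions
        System.out.println(partitionFor(50L, splits));  // 0
        System.out.println(partitionFor(150L, splits)); // 1
        System.out.println(partitionFor(250L, splits)); // 2
    }
}
```

In the real Partitioner you'd implement getPartition(key, value,
numPartitions) and delegate to something like this, making sure the
number of split points is numPartitions - 1.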