MapReduce >> mail # user >> understanding performance


understanding performance
Hi there,

I've been doing some performance testing with Hadoop and have been seeing highly variable results which I am trying to understand. I've been measuring how long it takes to perform a particular MR job, and am finding that the time taken varies by a factor of 2 when I repeat the job. Note that the data, algorithm, cluster, etc. are completely the same (and I am the only person on the cluster).

The way I do the test is from a simple shell script that just runs the job again and again. I find that the job is as fast as 5 mins, but as slow as 10 mins, with everything in between.
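To be concrete, the harness is essentially the loop below (the actual job command is elided to a placeholder here):

```shell
# Sketch of the timing loop; JOB_CMD is a placeholder for the real
# "hadoop jar ..." invocation, and defaults to a dummy command here.
JOB_CMD=${JOB_CMD:-"sleep 1"}   # e.g. hadoop jar my-job.jar com.example.MyJob in out
RUNS=${RUNS:-3}

for i in $(seq 1 "$RUNS"); do
  start=$(date +%s)
  sh -c "$JOB_CMD"              # run the MR job
  end=$(date +%s)
  echo "run $i: $((end - start)) s"
done
```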

I've examined the logs of two jobs and can see that the performance difference comes from the map and shuffle phases. For a sample 'fast' job the map phases take on average 2 mins 34 secs, whereas for a sample 'slow' job they take on average 4 mins 12 secs. Interestingly, if I then look at the counters for random map tasks (one each from the fast and slow jobs) I find that all counters, including CPU time, are pretty much equal. This suggests the slowdown comes from a bottleneck in disk I/O or the network.

Since I am the only user on the network (it's a dedicated GB switch) and the only one using the disks, I don't understand what can be happening. Also, the total data is not that huge: the job analyses 21GB, with replication 2, spread across 8 disks on 4 nodes, and the total disk output from the reducers is about 300MB.

I'm not sure how to investigate further. Is there some other diagnostic within Hadoop that can tell me where the code is waiting (e.g. for network or disk I/O), or perhaps some system tool that can indicate performance hits in specific places?
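One thing I have started sketching myself is diffing the kernel's per-disk counters around a run on each node (iostat/sar from sysstat report the same numbers, sampled over time). This is a Linux-specific sketch, and the job invocation is again a placeholder:

```shell
# Linux-specific sketch: compare per-disk sector counters before and
# after a run, to see if slow runs coincide with more disk traffic.
# The real job command would replace the placeholder sleep below.
snap_disk() {
  # fields 6 and 10 of /proc/diskstats are sectors read / written
  awk '{r += $6; w += $10} END {print r+0, w+0}' /proc/diskstats
}

before=$(snap_disk)
sleep 1                          # placeholder for: hadoop jar my-job.jar ...
after=$(snap_disk)

echo "before (sectors read written): $before"
echo "after  (sectors read written): $after"
```

But I don't know if that is the right level to be looking at, hence the question.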

Thanks for any suggestions

Peter