I've been doing some performance testing with Hadoop and am seeing highly variable results that I'm trying to understand. I've been measuring how long a particular MR job takes, and the time varies by a factor of 2 across repeated runs. Note that the data, algorithm, cluster etc. are completely the same (and I am the only person on the cluster).
I run the test from a simple shell script that just launches the job again and again. The job completes in as little as 5 mins, but takes as long as 10 mins, with everything in between.
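For reference, my test loop looks roughly like the sketch below (the `time_runs 3 sleep 1` call is a placeholder, not my actual job; substitute your own `hadoop jar ...` invocation):

```shell
#!/bin/sh
# Timing harness sketch: runs a given command N times and prints each
# run's wall-clock duration, so run-to-run variance is easy to eyeball.

time_runs() {
    n=$1; shift
    i=1
    while [ "$i" -le "$n" ]; do
        start=$(date +%s)
        "$@" > /dev/null 2>&1          # the command under test
        end=$(date +%s)
        echo "run $i: $((end - start)) s"
        i=$((i + 1))
    done
}

# Placeholder job -- replace `sleep 1` with the real `hadoop jar ...` command.
time_runs 3 sleep 1
```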
I've examined the logs of two runs and can see that the performance difference comes from the map and shuffle phases. For a sample 'fast' job, the map phases take on average 2 mins 34 secs, whereas for a sample 'slow' job they take on average 4 mins 12 secs.

Interestingly, if I then compare the counters for random map tasks (one each from the fast and slow jobs), I find that all the counters are pretty much equal, including CPU time. This suggests the slowdown comes from a bottleneck in disk I/O or the network. But since I am the only user on the network (it's a dedicated gigabit switch) and the only one using the disks, I don't understand what could be happening. The total data isn't that large either: the job analyses 21GB with replication 2, spread across 8 disks on 4 nodes, and the total disk output from the reducers is about 300MB.

I'm not sure how to investigate further. Is there some other diagnostic within Hadoop that can tell me where the code is waiting (e.g. for network or disk I/O), or perhaps a system tool that can indicate performance hits in specific places?
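For concreteness, this is the kind of crude system-level check I had in mind (my own sketch, assuming a Linux node with `/proc/stat`; `iostat -x` or `sar` from the sysstat package would give the same information per device and more thoroughly). The idea: sample aggregate CPU counters twice during the map/shuffle phases and see what fraction of the interval was spent in iowait. High iowait would point at the disks; a slow run with low iowait would point more toward the network.

```shell
#!/bin/sh
# Crude iowait sampler (assumes Linux /proc/stat is available).
# Prints the integer percentage of the sample interval that the CPUs
# spent waiting on I/O.

iowait_pct() {
    interval=${1:-1}
    # First line of /proc/stat: cpu user nice system idle iowait ...
    read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
    sleep "$interval"
    read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
    total=$(( (u2 + n2 + s2 + i2 + w2) - (u1 + n1 + s1 + i1 + w1) ))
    wait=$(( w2 - w1 ))
    # Integer percentage is plenty for spotting an I/O-bound phase.
    echo $(( total > 0 ? 100 * wait / total : 0 ))
}

# Sample over one second while the job is running.
iowait_pct 1
```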
Thanks for any suggestions