I have a job that's hitting 600s task timeouts during the copy phase of the reduce step. I see a lot of copy tasks all moving at about 2.5MB/sec, and the copy is taking longer than 10 minutes.
The process starts copying when the reduce step is 80% complete. This is a very I/O-bound job, as I'm just joining 1.5TB of data via 2 map/reduce steps on 6 nodes (each node has one 4TB disk and 24GB of RAM).
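To put rough numbers on the figures above, here is a quick back-of-the-envelope sketch in Python. It assumes decimal units (1 MB = 10^6 bytes) and that the 1.5TB is spread evenly across the 6 nodes; both are assumptions rather than measurements, and the function names are made up for illustration.

```python
# Back-of-the-envelope numbers using the figures quoted in the question.
# Assumptions (not measured): decimal units, data spread evenly over nodes.

COPY_RATE_MB_S = 2.5   # observed throughput of a single copy task
TIMEOUT_S = 600        # task timeout being hit
TOTAL_DATA_TB = 1.5    # total data being joined
NODES = 6              # cluster size


def mb_copied_before_timeout(rate_mb_s, timeout_s):
    """Data one copy stream can move before the timeout expires, in MB."""
    return rate_mb_s * timeout_s


def per_node_share_gb(total_tb, nodes):
    """Even per-node share of the data set, in GB (1 TB = 1000 GB)."""
    return total_tb * 1000 / nodes


if __name__ == "__main__":
    window_mb = mb_copied_before_timeout(COPY_RATE_MB_S, TIMEOUT_S)
    share_gb = per_node_share_gb(TOTAL_DATA_TB, NODES)
    print(f"One copy stream moves about {window_mb:.0f} MB in {TIMEOUT_S}s")
    print(f"Even per-node share of the data: about {share_gb:.0f} GB")
```

So a single copy that has to move more than roughly 1.5GB at this rate cannot finish inside one 600s window, while each node's share of the data is on the order of 250GB.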
What should I be thinking about in terms of fixing this?
Increase the timeout? (It seems odd that it would time out on the internal copy.)