I have a job that's hitting 600s task timeouts during the copy phase of the
reduce step. I see a lot of copy tasks all moving at about 2.5 MB/sec, and
the copy is taking longer than 10 minutes.
The process starts copying when the reduce step is 80% complete. This is a
very IO-bound job, as I'm just joining 1.5TB of data via 2 map/reduce steps
on 6 nodes (each node has 1x 4TB disk and 24GB of RAM).
What should I be thinking in terms of fixing this?
- Increase the timeout? (seems odd that it would time out on the internal copy)
- Reduce the number of tasks? (I've got 8 reducers, 1 per core, with
  io.sort.factor=25 and io.sort.mb=256)
  - Can I do that per job?
- Increase the number of copy threads?
- Don't start the reducers until the mappers are 100% complete?
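For reference, all of these knobs can be set per job on the command line. A rough sketch, assuming the Hadoop 1.x property names (consistent with the io.sort.* names above), with `myjob.jar`/`MyJob` as placeholders; the job must implement Tool so GenericOptionsParser picks up the `-D` flags:

```shell
hadoop jar myjob.jar MyJob \
  -D mapred.task.timeout=1200000 \
  -D mapred.reduce.parallel.copies=20 \
  -D mapred.reduce.slowstart.completed.maps=1.00 \
  input/ output/

# mapred.task.timeout: ms before a non-reporting task is killed
#   (default 600000, i.e. the 600s you're seeing)
# mapred.reduce.parallel.copies: shuffle copier threads per reducer (default 5)
# mapred.reduce.slowstart.completed.maps: fraction of maps that must finish
#   before reducers launch (1.00 = wait for all mappers)
```

The same properties can go in the job's Configuration object instead if you'd rather not pass flags.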