What sorts of causes might be responsible for a long or slow shuffle stage? For example, I have a job of 266 maps (each emitting 4 records) and 17 reduces (each ingesting about 60 records) that takes 72 minutes to complete. The maps tend to run in about 9-13 minutes (the value in parentheses under the Finish Time column of the map task list in the job tracker and the reduces run in about 37 minutes (same column). If I click into a specific reduce task, I see a Finish Time of 37 minutes of course, and a Shuffle time of about 27 minutes.
So, 11 minutes were spent in the maps, 10 in the reduces, and 27 shuffling. Note that the 72 minute overall job time is considerably longer than the sum of these three averages because of a few outlier maps (25 minutes, one even took 37 minutes) that held up the later stages).
Disregarding the outliers, it's still spending more than 50% of the job time (27 out of 48 minutes) shuffling instead of doing actual computation in the maps and reducers. This feels inefficient to me.
What causes this and what can be done to improve it?
Keith Wiley [EMAIL PROTECTED] keithwiley.com music.keithwiley.com
"Luminous beings are we, not this crude matter."