HDFS, mail # user - Re: 600s timeout during copy phase of job


Re: 600s timeout during copy phase of job
Harsh J 2013-05-13, 16:42
Hi David,

A reduce task's percentage is a good indicator of the phase it is in:
the first 33% is the COPY phase, 33-66% is the SORT phase, and 66-100%
is the user-code (reduce) phase.

If your reduce task, at an individual level, is hanging at 80%, then
the cause is not the COPY mechanism - that's already completed long
ago. Also, given the timeouts etc. built into the COPY phase (which
isn't user code, by the way), it'd be extremely surprising if the task
hung and timed out in that phase rather than failing outright.

You could do (1) at a per-job level, or alternately you can
investigate what is causing the sudden hang (memory filling up?
slowing I/O?) and try to address that. If you have an operation that
may take over 10 minutes to return before proceeding to the next
key/value iteration, then it's better to send a status update from
within that operation, or from a background daemon thread that reports
a different status at least every 10 minutes, so that the JT knows the
task is still alive.
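A minimal sketch of that background-reporter idea, in plain Java. The
`report` callback here is a hypothetical stand-in for whatever updates the
task's status (e.g. the Hadoop context's progress call); only the
daemon-thread scheduling is shown:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Background heartbeat: invokes a progress callback on a fixed interval
// from a daemon thread, so a long-running operation keeps signalling
// liveness. The callback is whatever reports status to the framework;
// the wiring to Hadoop is assumed, not shown.
class Heartbeat implements AutoCloseable {
    private final ScheduledExecutorService scheduler;

    Heartbeat(Runnable report, long interval, TimeUnit unit) {
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "heartbeat");
            t.setDaemon(true); // never keeps the JVM alive on its own
            return t;
        });
        scheduler.scheduleAtFixedRate(report, 0, interval, unit);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

You would wrap the slow operation in a try-with-resources, e.g.
`try (Heartbeat hb = new Heartbeat(reportCallback, 5, TimeUnit.MINUTES)) { slowOperation(); }`,
where `reportCallback` and `slowOperation` are placeholders for your own code.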

I'm not sure if (2), (2.5) and (3) are relevant here, but that's a yes
to (2.5) - the number of reducers is a purely per-job setting.

I guess (4) helps improve COPY phase speeds, but per your post I doubt
you're seeing any performance problem with COPY itself here.
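For reference, all of these knobs can be overridden per job from the
command line, assuming the driver uses ToolRunner/GenericOptionsParser.
The jar and driver names below are placeholders, and the property names
are the MR1-era ones; check your version's mapred-default.xml:

```shell
# Per-job overrides via -D flags (MR1-era property names).
hadoop jar myjob.jar MyDriver \
  -D mapred.task.timeout=1200000 \
  -D mapred.reduce.tasks=4 \
  -D mapred.reduce.parallel.copies=10 \
  -D mapred.reduce.slowstart.completed.maps=1.00 \
  input/ output/
# mapred.task.timeout: (1) task timeout in milliseconds (here 20 min)
# mapred.reduce.tasks: (2)/(2.5) number of reducers for this job
# mapred.reduce.parallel.copies: (4) parallel shuffle copier threads
# mapred.reduce.slowstart.completed.maps: (3) fraction of maps that must
#   finish before reducers launch (1.00 = wait for all maps)
```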

On Mon, May 13, 2013 at 11:35 AM, David Parks <[EMAIL PROTECTED]> wrote:
> I have a job that’s getting 600s task timeouts during the copy phase of the
> reduce step. I see a lot of copy tasks all moving at about 2.5MB/sec, and
> it’s taking longer than 10 min to do that copy.
>
>
>
> The process starts copying when the reduce step is 80% complete. This is a
> very IO bound task as I’m just joining 1.5TB of data via 2 map/reduce steps
> on 6 nodes (each node has 1x 4TB disk, and 24GB of ram).
>
>
>
> What should I be thinking in terms of fixing this?
>
> - Increase the timeout? (seems odd that it would time out on the
>   internal copy)
> - Reduce the # of tasks? (I've got 8 reducers, 1-per-core, 25
>   io.sort.factor & 256 io.sort.mb)
>   - Can I do that per job?
> - Increase copy threads?
> - Don't start the reducers until the mappers are 100% complete?

--
Harsh J