Re: 600s timeout during copy phase of job
Hi David,

A reduce task's progress percentage is a good indicator of the phase it
is in: the first 33% is the COPY phase, 33-66% is the SORT phase, and
66-100% is the user-code (reduce()) progress phase.

If your reduce task, at an individual level, is hanging at 80%, then
the cause is not the COPY mechanism - that completed long ago (at 80%,
the task is already about 40% of the way through your reduce() code).
Also, given the timeouts etc. built into the COPY phase (which isn't
user code, btw), it'd be extremely surprising if the task hung and
timed out in that phase rather than failing outright.

You could do (1) at a per-job level, or alternately you can
investigate what is causing the sudden hang (memory filling up? slowing
I/O?) and try to address that. If you have an operation that may take
over 10 minutes before it returns and proceeds to the next key/value
iteration, then it's better to send a status update from within that
operation, or from a background daemon thread that reports a different
status every < 10 minutes, so that the JobTracker (JT) knows the task
is still alive.
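As a concrete illustration, here's a minimal sketch of that
daemon-thread approach against the new (org.apache.hadoop.mapreduce)
API - the class name and the 60s interval are just placeholders:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: a daemon thread reports progress/status every minute so
// the framework never sees 600s (mapred.task.timeout) of silence, even
// when a single reduce() invocation runs long.
public class HeartbeatReducer extends Reducer<Text, Text, Text, Text> {

  private volatile boolean done = false;

  @Override
  protected void setup(final Context context) {
    Thread heartbeat = new Thread(new Runnable() {
      public void run() {
        while (!done) {
          context.progress();                // resets the task's timeout clock
          context.setStatus("still alive");  // visible in the JT web UI
          try {
            Thread.sleep(60 * 1000L);        // well under the 10 min limit
          } catch (InterruptedException ie) {
            return;
          }
        }
      }
    });
    heartbeat.setDaemon(true);  // don't keep the child JVM alive after the task
    heartbeat.start();
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // ... the potentially slow per-key work goes here ...
  }

  @Override
  protected void cleanup(Context context) {
    done = true;  // stop the heartbeat once the task's own code is finished
  }
}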

I'm not sure if (2), (2.5) and (3) are relevant here, but that's a yes
to (2.5) - the # of reduces is a purely per-job setting.

I guess (4) helps improve COPY phase speeds, but going by your post I
doubt you're seeing any performance problems with COPY here.
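In case it's useful, here's a rough sketch of setting the knobs from
your list per job via the Job's Configuration (Hadoop 1.x property
names; the values are placeholders, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch only: per-job overrides of the settings discussed above.
public class JobSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("mapred.task.timeout", 20 * 60 * 1000L);  // (1) 20 min, up from 600s
    conf.setInt("mapred.reduce.tasks", 4);                 // (2)/(2.5) per-job reducer count
    conf.setInt("mapred.reduce.parallel.copies", 10);      // (3) shuffle copier threads (default 5)
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 1.0f); // (4) reducers wait for all maps
    Job job = new Job(conf, "my-join-job");
    // ... set mapper/reducer classes and input/output paths, then:
    // job.waitForCompletion(true);
  }
}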

On Mon, May 13, 2013 at 11:35 AM, David Parks <[EMAIL PROTECTED]> wrote:
> I have a job that’s getting 600s task timeouts during the copy phase of the
> reduce step. I see a lot of copy tasks all moving at about 2.5MB/sec, and
> it’s taking longer than 10 min to do that copy.
>
> The process starts copying when the reduce step is 80% complete. This is a
> very IO bound task as I’m just joining 1.5TB of data via 2 map/reduce steps
> on 6 nodes (each node has 1x 4TB disk, and 24GB of ram).
>
> What should I be thinking in terms of fixing this?
>
> 1. Increase the timeout? (seems odd that it would time out on the
>    internal copy)
> 2. Reduce the # of tasks? (I’ve got 8 reducers, 1-per-core, 25
>    io.sort.factor & 256 io.sort.mb)
>    2.5. Can I do that per job??
> 3. Increase copy threads?
> 4. Don’t start the reducers until the mappers are 100% complete?

--
Harsh J