Re: Pig 0.11.1 on AWS EMR/S3 fails to cleanup failed task output file before retrying that task
Hi Alan,

I believe this is expected behavior with EMR and S3.  A path cannot already
exist in S3 when a task attempt commits its output; in your case the
colliding object looks like bucket: n2ygk, path: reduced.1/useful/part-m-00009.
On EMR, to mitigate hanging tasks, a given job may spawn duplicate attempts
of the same task (distinguished by a trailing _0, _1, etc.), and those
attempts then race to commit to the same bucket/path in S3.

In addition, you may also consider adjusting the task timeout from the
default 600s (higher to time out less often, lower to time out sooner; I
think the lower bound is 60000 ms).  I've had jobs that required a *two
hour* timeout in order to succeed.  This can be done with a bootstrap
action, e.g. --bootstrap-action
s3://elasticmapreduce/bootstrap-actions/configure-hadoop
--args -m,mapred.task.timeout=2400000
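
If you'd rather not reconfigure the whole cluster, I believe the same Hadoop
property can also be set per script with Pig's SET command.  A minimal sketch
(the two-hour value here is just illustrative; pick a bound that fits your
longest task):

    -- Raise the Hadoop task timeout for the jobs this script launches.
    -- 7200000 ms = 2 hours; tasks silent for longer than this get killed.
    SET mapred.task.timeout '7200000';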

As for cleaning up the intermediate output, I'm not sure.  You could try
inserting EXEC <https://pig.apache.org/docs/r0.11.1/cmds.html#exec>
breakpoints before the problem blocks, but this weakens Pig's job chaining
(multi-query optimization) and increases total execution time; see the
sketch below.
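
Roughly what I have in mind (untested sketch; the aliases, schema, and S3
paths below are made up for illustration):

    -- Everything up to the exec runs as its own batch, so the STORE below
    -- is committed to S3 before the later, failure-prone statements start.
    raw    = LOAD 's3n://mybucket/input' AS (id:chararray, val:int);
    useful = FILTER raw BY val > 0;
    STORE useful INTO 's3n://mybucket/checkpoint/useful';
    exec;

    -- Re-load the checkpoint rather than reusing the alias, since exec
    -- starts a fresh batch; if this half fails, the checkpoint is already
    -- on S3 and the script can be restarted from here.
    chk     = LOAD 's3n://mybucket/checkpoint/useful' AS (id:chararray, val:int);
    grouped = GROUP chk BY id;
    counts  = FOREACH grouped GENERATE group, COUNT(chk) AS n;
    STORE counts INTO 's3n://mybucket/output/counts';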

Hope this helps.

-Dan
On Wed, Jun 12, 2013 at 11:21 PM, Alan Crosswell <[EMAIL PROTECTED]> wrote:

> Is this expected behavior or improper error recovery:
>
> *Task attempt_201306130117_0001_m_000009_0 failed to report status for 602
> seconds. Killing!*
>
> This was then followed by the retries of the task failing due to the
> existence of the S3 output file that the dead task had started writing:
>
> *org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable
> to setup the store function.
> *
> *...*
> *Caused by: java.io.IOException: File already
> exists:s3n://n2ygk/reduced.1/useful/part-m-00009*
>
> Seems like this is exactly the kind of task restart that should "just work"
> if the garbage from the failed task were properly cleaned up.
>
> Is there a way to tell Pig to just clobber output files?
>
> Is there a technique for checkpointing Pig scripts so that I don't have to
> keep resubmitting this job and losing hours of work? I was even doing
> "STORE" of intermediate aliases so I could restart later, but the job
> failure causes the intermediate files to be deleted from S3.
>
> Thanks.
> /a
>