Re: Pig 0.11.1 on AWS EMR/S3 fails to cleanup failed task output file before retrying that task
Alan Crosswell 2013-06-13, 14:40
The file did not exist until the first task attempt created it, before
that attempt was killed. So when I launched the Pig script, there was no
file in the way. The subsequent task attempts were guaranteed to fail only
because the killed task's output file had not been cleaned up.
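
To make that sequence concrete, here is a minimal sketch of the failure mode using the local filesystem as a stand-in. It is not Pig or S3 code: the function names, the file names, and the "killed" simulation are all hypothetical, and the only assumption is a writer that refuses to overwrite an existing file.

```python
import os
import tempfile


def write_attempt(path: str, data: str, fail_midway: bool = False) -> None:
    """Create `path` exclusively, like a store that refuses to overwrite."""
    with open(path, "x") as f:  # raises FileExistsError if path already exists
        f.write(data[: len(data) // 2])
        if fail_midway:
            # Stand-in for a task attempt that is killed before finishing.
            raise TimeoutError("attempt killed before finishing")
        f.write(data[len(data) // 2:])


def run_with_retries(path, data, attempts=3, cleanup=False):
    """Return one outcome per attempt; the first attempt is always killed."""
    outcomes = []
    for n in range(attempts):
        try:
            write_attempt(path, data, fail_midway=(n == 0))
            outcomes.append("ok")
            break
        except TimeoutError:
            outcomes.append("killed")
            if cleanup:
                os.remove(path)  # remove the partial output before retrying
        except FileExistsError:
            outcomes.append("already exists")
    return outcomes


with tempfile.TemporaryDirectory() as d:
    # No cleanup: every retry trips over the killed attempt's leftover file.
    print(run_with_retries(os.path.join(d, "part-a"), "some rows"))
    # With cleanup: the first retry succeeds.
    print(run_with_retries(os.path.join(d, "part-b"), "some rows", cleanup=True))
```

Without cleanup the outcomes are killed, already exists, already exists; with cleanup they are killed, ok.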
I'll take a look at upping the timeout.
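
Since mapred.task.timeout is given in milliseconds, a small sketch of the conversion may help when picking a value. The helper name is hypothetical; the "-m,key=value" form is simply the one quoted in the message below.

```python
def task_timeout_arg(minutes: float) -> str:
    """Build the bootstrap argument for a task timeout given in minutes."""
    millis = int(minutes * 60 * 1000)  # mapred.task.timeout is in milliseconds
    return f"-m,mapred.task.timeout={millis}"


print(task_timeout_arg(10))   # the 600 s timeout mentioned in the thread
print(task_timeout_arg(40))   # matches the 2400000 value quoted below
print(task_timeout_arg(120))  # a two-hour timeout
```

By this arithmetic a two-hour timeout would be 7200000, while the quoted 2400000 is 40 minutes.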
On Thu, Jun 13, 2013 at 9:57 AM, Dan DeCapria, CivicScience <
[EMAIL PROTECTED]> wrote:
> Hi Alan,
> I believe this is expected behavior wrt EMR and S3. There cannot exist a
> duplicate file path in S3 prior to commit; in your case it looks like
> bucket: n2ygk, path: reduced.1/useful/part-m-00009*/file -> file. On EMR,
> to mitigate hanging tasks, a given job may spawn duplicate tasks
> (referenced by a trailing _0, _1, etc.). This then becomes a race
> condition issue wrt duplicate tasks (_0, _1, etc.) committing to the same
> bucket/path in S3.
> In addition, you may also consider changing the task timeout from 600s:
> higher to time out less often, or lower to time out more often (I think
> the lowest bound is 60000ms). I've had jobs which required a *two hour*
> timeout in order to succeed. This can be done with a bootstrap, e.g.
> (the value is in milliseconds, so this one is 40 minutes):
> --args -m,mapred.task.timeout=2400000
> As for the cleaning up of intermediate steps, I'm not sure. Possibly try
> implementing EXEC <https://pig.apache.org/docs/r0.11.1/cmds.html#exec>
> breakpoints prior to problem blocks, but this will weaken Pig's job
> chaining and increase the execution time.
> Hope this helps.
> On Wed, Jun 12, 2013 at 11:21 PM, Alan Crosswell <[EMAIL PROTECTED]> wrote:
> > Is this expected behavior or improper error recovery:
> > *Task attempt_201306130117_0001_m_000009_0 failed to report status for
> > seconds. Killing!*
> > This was then followed by the retries of the task failing due to the
> > existence of the S3 output file that the dead task had started writing:
> > *org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable
> > to setup the store function.
> > *
> > *...*
> > *Caused by: java.io.IOException: File already
> > exists:s3n://n2ygk/reduced.1/useful/part-m-00009*
> > Seems like this is exactly the kind of task restart that should "just
> > work" if the garbage from the failed task were properly cleaned up.
> > Is there a way to tell Pig to just clobber output files?
> > Is there a technique for checkpointing Pig scripts so that I don't have
> > to keep resubmitting this job and losing hours of work? I was even doing
> > "STORE" of intermediate aliases so I could restart later, but the job
> > failure causes the intermediate files to be deleted from S3.
> > Thanks.
> > /a