Re: Pig 0.11.1 on AWS EMR/S3 fails to cleanup failed task output file before retrying that task
Thanks for the suggestion, Cheolsoo.
/a
On Thu, Jun 13, 2013 at 2:18 PM, Cheolsoo Park <[EMAIL PROTECTED]> wrote:

> Hi Alan,
>
> >> Seems like this is exactly the kind of task restart that should "just
> work" if the garbage from the failed task were properly cleaned up.
>
> Unfortunately, this is not the case because of S3 eventual consistency. Even
> though the failed task cleans up its files on S3, the delete is not
> propagated immediately, so the next task attempt may still see them and
> fail. As far as I know, EMR Pig/S3 integration is not as good as EMR Hive/S3
> integration, so you will have to handle S3 eventual consistency yourself in
> Pig.
>
> One workaround is to write a StoreFunc that stages data to HDFS until the
> task completes and then copies it to S3 in the commit-task step. This will
> minimize the number of S3 eventual-consistency issues you see.
>
> Thanks,
> Cheolsoo
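
A minimal sketch of the staging approach Cheolsoo describes above: the class
name StagingToS3Committer, its constructor arguments, and the idea of wiring
it in through a custom StoreFunc's OutputFormat are illustrative assumptions,
not code from this thread. The task writes to its HDFS work directory as
usual, and files are copied to the s3n:// destination only when the task
commits, so a killed or duplicate attempt never leaves partial part files on
S3.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

// Sketch only: stage task output in HDFS, copy it to S3 at commitTask.
public class StagingToS3Committer extends FileOutputCommitter {

    private final Path finalS3Dir;  // e.g. an s3n:// output directory (assumed)

    public StagingToS3Committer(Path hdfsStagingDir, Path finalS3Dir,
                                TaskAttemptContext context) throws IOException {
        // The parent committer manages the HDFS staging directory as the
        // "real" output; S3 is not touched until commit time.
        super(hdfsStagingDir, context);
        this.finalS3Dir = finalS3Dir;
    }

    @Override
    public void commitTask(TaskAttemptContext context) throws IOException {
        Configuration conf = context.getConfiguration();
        Path workPath = getWorkPath();                 // this attempt's HDFS dir
        FileSystem hdfs = workPath.getFileSystem(conf);
        FileSystem s3 = finalS3Dir.getFileSystem(conf);

        // Copy this attempt's files to S3 only now that the task is known good.
        // Killed or duplicate attempts never reach this point, so they leave no
        // partial part files behind on S3.
        for (FileStatus stat : hdfs.listStatus(workPath)) {
            Path dst = new Path(finalS3Dir, stat.getPath().getName());
            FileUtil.copy(hdfs, stat.getPath(), s3, dst, false, conf);
        }

        // Let FileOutputCommitter do its normal HDFS-side bookkeeping.
        super.commitTask(context);
    }
}

In a real StoreFunc, getOutputFormat() would return a FileOutputFormat subclass
whose getOutputCommitter() constructs this committer; that wiring, and any
handling of S3 list inconsistency on the final output, is omitted here.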
>
> On Thu, Jun 13, 2013 at 7:40 AM, Alan Crosswell <[EMAIL PROTECTED]> wrote:
>
> > The file did not exist until the first task attempt created it before it
> > was killed. As such, the subsequent task attempts were guaranteed to fail,
> > since the killed task's output file had not been cleaned up. So when I
> > launched the Pig script, there was no file in the way.
> >
> > I'll take a look at upping the timeout.
> >
> > Thanks.
> >
> >
> > On Thu, Jun 13, 2013 at 9:57 AM, Dan DeCapria, CivicScience <
> > [EMAIL PROTECTED]> wrote:
> >
> > > Hi Alan,
> > >
> > > I believe this is expected behavior wrt EMR and S3. There cannot exist a
> > > duplicate file path in S3 prior to commit; in your case it looks like
> > > bucket: n2ygk, path: reduced.1/useful/part-m-00009*/file -> file. On EMR,
> > > to mitigate hanging tasks, a given job may spawn duplicate tasks
> > > (referenced by a trailing _0, _1, etc.). This then becomes a race
> > > condition issue wrt duplicate tasks (_0, _1, etc.) committing to the same
> > > bucket/path in S3.
> > >
> > > In addition, you may also consider increasing the task timeout from 600s
> > > to something higher to time out less often (or lower it to time out
> > > sooner); I think the lowest bound is 60000ms. I've had jobs which required
> > > a *two hour* timeout in order to succeed. This can be done with a
> > > bootstrap action, i.e.
> > > --bootstrap-action
> > > s3://elasticmapreduce/bootstrap-actions/configure-hadoop
> > > --args -m,mapred.task.timeout=2400000
> > >
> > > As for cleaning up the intermediate steps, I'm not sure. You could
> > > possibly try implementing EXEC
> > > <https://pig.apache.org/docs/r0.11.1/cmds.html#exec> breakpoints prior to
> > > the problem blocks, but this will weaken Pig's job chaining and increase
> > > the execution time.
> > >
> > > Hope this helps.
> > >
> > > -Dan
> > >
> > >
> > > On Wed, Jun 12, 2013 at 11:21 PM, Alan Crosswell <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > Is this expected behavior or improper error recovery:
> > > >
> > > > *Task attempt_201306130117_0001_m_000009_0 failed to report status for
> > > > 602 seconds. Killing!*
> > > >
> > > > This was then followed by the retries of the task failing due to the
> > > > existence of the S3 output file that the dead task had started writing:
> > > >
> > > > *org.apache.pig.backend.executionengine.ExecException: ERROR 2081:
> > > > Unable to setup the store function.*
> > > > *...*
> > > > *Caused by: java.io.IOException: File already
> > > > exists:s3n://n2ygk/reduced.1/useful/part-m-00009*
> > > >
> > > > Seems like this is exactly the kind of task restart that should "just
> > > > work" if the garbage from the failed task were properly cleaned up.
> > > >
> > > > Is there a way to tell Pig to just clobber output files?
> > > >
> > > > Is there a technique for checkpointing Pig scripts so that I don't have
> > > > to keep resubmitting this job and losing hours of work? I was even doing
> > > > "STORE" of intermediate aliases so I could restart later, but the job
> > > > failure causes the intermediate files to be deleted from S3.