Re: Pig 0.11.1 on AWS EMR/S3 fails to cleanup failed task output file before retrying that task
Russell Jurney 2013-06-13, 18:51
One thing I've done regarding timeouts is to insert prints to STDERR more
often in my UDF. If I recall correctly, this takes care of the timeout
problem.
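
For example, something along these lines in the UDF (a sketch only; the class
name SlowTransform and the 10,000-tuple interval are made up, and calling
EvalFunc's progress() is the more direct way to reset the status timeout):

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Hypothetical UDF showing the heartbeat idea: print to STDERR (and call
    // progress()) every so often so a slow task is not killed for failing to
    // report status within mapred.task.timeout.
    public class SlowTransform extends EvalFunc<String> {
        private long tuples = 0;

        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) {
                return null;
            }
            if (++tuples % 10000 == 0) {
                System.err.println("SlowTransform: processed " + tuples + " tuples");
                progress(); // EvalFunc helper that pings Hadoop's status reporter
            }
            // ... the actual (slow) per-tuple work would go here ...
            Object value = input.get(0);
            return value == null ? null : value.toString();
        }
    }
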
On Thu, Jun 13, 2013 at 11:37 AM, Alan Crosswell <[EMAIL PROTECTED]> wrote:

> Thanks for the suggestion, Cheolsoo.
> /a
>
>
> On Thu, Jun 13, 2013 at 2:18 PM, Cheolsoo Park <[EMAIL PROTECTED]>
> wrote:
>
> > Hi Alan,
> >
> > >> Seems like this is exactly the kind of task restart that should "just
> > >> work" if the garbage from the failed task were properly cleaned up.
> >
> > Unfortunately, this is not the case because of S3 eventual consistency. Even
> > though the failed task cleans up its files on S3, the delete is not
> > propagated immediately, so the next task attempt may still see them and fail.
> > As far as I know, EMR Pig/S3 integration is not as good as EMR Hive/S3
> > integration, so you will have to handle S3 eventual consistency yourself in
> > Pig.
> >
> > One workaround is to write a StoreFunc that stages data in HDFS until the
> > task completes and then copies it to S3 at the commit-task step. This will
> > minimize the number of S3 eventual consistency issues you see.
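> >
> > Roughly, the key piece would be a committer along these lines (a sketch
> > only; the class name HdfsStagingS3Committer and the staging/final paths are
> > made up for illustration, and a real StoreFunc would return an OutputFormat
> > whose getOutputCommitter() builds one of these):
> >
> >     import java.io.IOException;
> >     import org.apache.hadoop.fs.FileStatus;
> >     import org.apache.hadoop.fs.FileSystem;
> >     import org.apache.hadoop.fs.FileUtil;
> >     import org.apache.hadoop.fs.Path;
> >     import org.apache.hadoop.mapreduce.TaskAttemptContext;
> >     import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
> >
> >     // Sketch: task attempts write only to an HDFS staging directory; a task
> >     // copies its output to S3 only when it commits, so killed or retried
> >     // attempts never leave partial files behind on S3.
> >     public class HdfsStagingS3Committer extends FileOutputCommitter {
> >         private final Path s3Final; // e.g. s3n://bucket/path (hypothetical)
> >
> >         public HdfsStagingS3Committer(Path hdfsStaging, Path s3Final,
> >                                       TaskAttemptContext context) throws IOException {
> >             super(hdfsStaging, context); // stage under HDFS, not S3
> >             this.s3Final = s3Final;
> >         }
> >
> >         @Override
> >         public void commitTask(TaskAttemptContext context) throws IOException {
> >             Path work = getWorkPath(); // this attempt's private HDFS dir
> >             FileSystem hdfs = work.getFileSystem(context.getConfiguration());
> >             FileSystem s3 = s3Final.getFileSystem(context.getConfiguration());
> >             if (hdfs.exists(work)) {
> >                 for (FileStatus stat : hdfs.listStatus(work)) {
> >                     // Copy this attempt's finished part files to S3.
> >                     FileUtil.copy(hdfs, stat.getPath(), s3,
> >                                   new Path(s3Final, stat.getPath().getName()),
> >                                   false, context.getConfiguration());
> >                 }
> >             }
> >             super.commitTask(context); // normal HDFS commit as well
> >         }
> >     }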
> >
> > Thanks,
> > Cheolsoo
> >
> > On Thu, Jun 13, 2013 at 7:40 AM, Alan Crosswell <[EMAIL PROTECTED]> wrote:
> >
> > > The file did not exist until the first task attempt created it before being
> > > killed. As such, the subsequent task attempts were guaranteed to fail,
> > > since the killed task's output file had not been cleaned up. So when I
> > > launched the Pig script, there was no file in the way.
> > >
> > > I'll take a look at upping the timeout.
> > >
> > > Thanks.
> > >
> > >
> > > On Thu, Jun 13, 2013 at 9:57 AM, Dan DeCapria, CivicScience <
> > > [EMAIL PROTECTED]> wrote:
> > >
> > > > Hi Alan,
> > > >
> > > > I believe this is expected behavior wrt EMR and S3. A duplicate file path
> > > > cannot exist in S3 prior to commit; in your case it looks like bucket:
> > > > n2ygk, path: reduced.1/useful/part-m-00009*/file -> file. On EMR, to
> > > > mitigate hanging tasks, a given job may spawn duplicate tasks (referenced
> > > > by a trailing _0, _1, etc.). This then becomes a race condition wrt the
> > > > duplicate tasks (_0, _1, etc.) committing to the same bucket/path in S3.
> > > >
> > > > In addition, you may also consider changing the task timeout from 600s to
> > > > something higher or lower, to time out less or more often (I think the
> > > > lowest bound is 60000ms). I've had jobs which required a *two hour* timeout
> > > > in order to succeed. This can be done with a bootstrap action, i.e.:
> > > > --bootstrap-action
> > > > s3://elasticmapreduce/bootstrap-actions/configure-hadoop
> > > > --args -m,mapred.task.timeout=2400000
> > > >
> > > > As for the cleaning up of intermediate steps, I'm not sure. Possibly try
> > > > implementing EXEC breakpoints
> > > > <https://pig.apache.org/docs/r0.11.1/cmds.html#exec> prior to problem
> > > > blocks, but this will weaken Pig's job chaining and increase the execution
> > > > time.
> > > >
> > > > Hope this helps.
> > > >
> > > > -Dan
> > > >
> > > >
> > > > On Wed, Jun 12, 2013 at 11:21 PM, Alan Crosswell <[EMAIL PROTECTED]>
> > > > wrote:
> > > >
> > > > > Is this expected behavior or improper error recovery:
> > > > >
> > > > > *Task attempt_201306130117_0001_m_000009_0 failed to report status for
> > > > > 602 seconds. Killing!*
> > > > >
> > > > > This was then followed by the retries of the task failing due to the
> > > > > existence of the S3 output file that the dead task had started writing:
> > > > >
> > > > > *org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable
> > > > > to setup the store function.*
> > > > > *...*
> > > > > *Caused by: java.io.IOException: File already
> > > > > exists:s3n://n2ygk/reduced.1/useful/part-m-00009*
> > > > >
> > > > > Seems like this is exactly the kind of task restart that should "just
> > > > > work" if the garbage from the failed task were properly cleaned up.

Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com