Pig user mailing list: Pig 0.11.1 on AWS EMR/S3 fails to cleanup failed task output file before retrying that task


Re: Pig 0.11.1 on AWS EMR/S3 fails to cleanup failed task output file before retrying that task
Hi Alan,

I believe this is expected behavior wrt EMR and S3.  A duplicate file path
cannot exist in S3 prior to commit; in your case that looks like bucket
n2ygk, path reduced.1/useful/part-m-00009*.  On EMR, to mitigate hanging
tasks, a given job may spawn duplicate attempts of a task (referenced by a
trailing _0, _1, etc.).  This then becomes a race condition between those
duplicate attempts (_0, _1, etc.) committing to the same bucket/path in S3.
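
One knob worth checking, if the duplicates are coming from speculative
execution rather than plain retries after a kill (I haven't verified which
applies to your job): you can turn speculative attempts off from inside the
Pig script, so only one attempt writes each part file to S3.  A rough
sketch, assuming the Hadoop 1.x property names used on EMR AMIs of this
vintage:

    -- Sketch: stop the job from launching duplicate map/reduce attempts
    -- that race to commit the same S3 part file.
    SET mapred.map.tasks.speculative.execution 'false';
    SET mapred.reduce.tasks.speculative.execution 'false';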

In addition, you may also consider adjusting the task timeout from the
default 600s: raise it to time out less often, or lower it to time out
sooner (I think the lowest bound is 60000ms).  I've had jobs which required
a *two hour* timeout in order to succeed.  This can be done with a
bootstrap action, i.e.:

    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args -m,mapred.task.timeout=2400000
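
If re-running the configure-hadoop bootstrap action is inconvenient, the
same property can usually be set from inside the Pig script itself, since
Pig passes SET keys it doesn't recognize through to the Hadoop job
configuration.  A sketch (the two-hour value is just an example, not
something your job necessarily needs):

    -- Sketch: raise the task timeout for this script only.
    -- mapred.task.timeout is in milliseconds; 7200000 ms = 2 hours.
    SET mapred.task.timeout '7200000';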

As for the cleaning up of intermediate steps, I'm not sure.  Possibly try
adding exec breakpoints <https://pig.apache.org/docs/r0.11.1/cmds.html#exec>
prior to the problem blocks, but this will weaken Pig's job chaining
(multi-query optimization) and grow the execution time.
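
Roughly, a checkpointed script might look like this (a sketch only; the
aliases and S3 paths are made up, and the rmf lines just clobber any
half-written output from an earlier failed attempt so a rerun doesn't hit
the "File already exists" error):

    -- Sketch: checkpoint a Pig script with STORE + exec (made-up paths).
    raw    = LOAD 's3n://mybucket/input/';
    useful = FILTER raw BY $0 IS NOT NULL;

    -- Clobber leftovers from any earlier failed attempt, then checkpoint.
    rmf s3n://mybucket/checkpoint/useful;
    STORE useful INTO 's3n://mybucket/checkpoint/useful';

    -- Breakpoint: force everything above to run to completion before the
    -- rest of the script executes, so a later failure does not invalidate
    -- this intermediate result.
    exec;

    -- The second block reloads the checkpoint instead of recomputing it.
    useful2 = LOAD 's3n://mybucket/checkpoint/useful';
    reduced = GROUP useful2 BY $0;
    rmf s3n://mybucket/output/reduced;
    STORE reduced INTO 's3n://mybucket/output/reduced';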

Hope this helps.

-Dan
On Wed, Jun 12, 2013 at 11:21 PM, Alan Crosswell <[EMAIL PROTECTED]> wrote:

> Is this expected behavior or improper error recovery:
>
> *Task attempt_201306130117_0001_m_000009_0 failed to report status for 602
> seconds. Killing!*
>
> This was then followed by the retries of the task failing due to the
> existence of the S3 output file that the dead task had started writing:
>
> *org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable
> to setup the store function.
> *
> *...*
> *Caused by: java.io.IOException: File already
> exists:s3n://n2ygk/reduced.1/useful/part-m-00009*
>
> Seems like this is exactly the kind of task restart that should "just work"
> if the garbage from the failed task were properly cleaned up.
>
> Is there a way to tell Pig to just clobber output files?
>
> Is there a technique for checkpointing Pig scripts so that I don't have to
> keep resubmitting this job and losing hours of work? I was even doing
> "STORE" of intermediate aliases so I could restart later, but the job
> failure causes the intermediate files to be deleted from S3.
>
> Thanks.
> /a
>
Subsequent replies in this thread:
Alan Crosswell 2013-06-13, 14:40
Cheolsoo Park 2013-06-13, 18:18
Alan Crosswell 2013-06-13, 18:37
Russell Jurney 2013-06-13, 18:51