Pig 0.11.1 on AWS EMR/S3 fails to cleanup failed task output file before retrying that task
Is this expected behavior, or improper error recovery? First, a task timed out and was killed:

Task attempt_201306130117_0001_m_000009_0 failed to report status for 602
seconds. Killing!

The retries of that task then failed because the S3 output file the killed
task had started writing still existed:

org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the store function.
...
Caused by: java.io.IOException: File already exists: s3n://n2ygk/reduced.1/useful/part-m-00009

This seems like exactly the kind of task restart that should "just work"
if the partial output left behind by the failed task were properly cleaned up.

Is there a way to tell Pig to just clobber output files?
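The only workaround I can think of is to force-remove the output path at the
top of the script before the STORE, roughly like the sketch below (the path is
just the one from the error above, and the alias name is made up). But that
only helps when I resubmit the whole script myself; it doesn't help when
Hadoop retries a failed task attempt mid-job:

    -- force-remove any old output directory; rmf does not fail
    -- if the path is missing
    rmf s3n://n2ygk/reduced.1/useful;
    STORE useful INTO 's3n://n2ygk/reduced.1/useful';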

Is there a technique for checkpointing Pig scripts so that I don't have to
keep resubmitting this job and losing hours of work? I was even doing a
"STORE" of intermediate aliases so I could restart from them later, but the
job failure causes those intermediate files to be deleted from S3.
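For context, my checkpointing attempt looked roughly like this (the paths,
schema, and alias names here are made up for illustration):

    -- first run: compute an expensive intermediate and checkpoint it to S3
    raw = LOAD 's3n://n2ygk/input' USING PigStorage('\t') AS (id:chararray, value:long);
    useful = FILTER raw BY value > 0;
    STORE useful INTO 's3n://n2ygk/checkpoint/useful';

    -- intended restart: reload the checkpoint instead of recomputing it
    useful = LOAD 's3n://n2ygk/checkpoint/useful' USING PigStorage('\t') AS (id:chararray, value:long);

But because the job failure deletes those checkpoint files from S3, there is
nothing left to reload.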

Thanks.
/a