Pig 0.11.1 on AWS EMR/S3 fails to clean up failed task output file before retrying that task
Is this expected behavior or improper error recovery? First, a task attempt was killed for failing to report status:

    Task attempt_201306130117_0001_m_000009_0 failed to report status
    for 602 seconds. Killing!

The retries of that task then failed because the S3 output file that the
dead attempt had started writing still existed:

    org.apache.pig.backend.executionengine.ExecException: ERROR 2081:
    Unable to setup the store function.
    ...
    Caused by: java.io.IOException: File already exists:
    s3n://n2ygk/reduced.1/useful/part-m-00009

Seems like this is exactly the kind of task restart that should "just work"
if the garbage from the failed task were properly cleaned up.

Is there a way to tell Pig to just clobber output files?
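
(The closest workaround I can think of is an explicit remove before each
STORE -- a minimal sketch, assuming the alias is called "useful" to match
the path above; "rmf" is the Grunt shell's force-remove, which doesn't
error if the path is absent:)

    rmf s3n://n2ygk/reduced.1/useful
    STORE useful INTO 's3n://n2ygk/reduced.1/useful';

(Though I suspect that only helps when re-running the whole script, not
for a task retry inside a single job like the one above.)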

Is there a technique for checkpointing Pig scripts so that I don't have to
keep resubmitting this job and losing hours of work? I was even doing
"STORE" of intermediate aliases so I could restart later, but the job
failure causes the intermediate files to be deleted from S3.
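
(For the record, the checkpoint/restart pattern I was attempting looks
roughly like this -- a sketch with illustrative paths and a stand-in
transform, not my actual script:)

    -- first run: compute the expensive intermediate and checkpoint it
    raw = LOAD 's3n://n2ygk/input' USING PigStorage('\t');
    useful = FILTER raw BY $0 IS NOT NULL; -- stand-in for the real pipeline
    STORE useful INTO 's3n://n2ygk/checkpoints/useful';

    -- on a rerun, comment out the pipeline above and reload the checkpoint:
    -- useful = LOAD 's3n://n2ygk/checkpoints/useful' USING PigStorage('\t');

(Maybe disabling multi-query optimization (pig -no_multiquery) would also
help, by making each STORE run as its own job so earlier checkpoints
survive a later failure -- but I haven't verified that on EMR.)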

Thanks.
/a
Replies:
  Dan DeCapria, CivicScienc... 2013-06-13, 13:57
  Alan Crosswell 2013-06-13, 14:40
  Cheolsoo Park 2013-06-13, 18:18
  Alan Crosswell 2013-06-13, 18:37
  Russell Jurney 2013-06-13, 18:51