Re: Resume failed pig script
We could save some metadata, like CRCs of all the jars... and maybe a hash of the subplan associated with each stored intermediate output... But really we should just do Nectar (MSR's system for automatically caching and reusing intermediate computation results), since it solves all this and more :)

On Jun 15, 2012, at 10:43 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> Well, you can do this physically by adding load/store boundaries to your
> code. Thinking out loud, such a thing could be possible...
>
> At any M/R boundary, you store the intermediate output in HDFS, and Pig is
> aware of this and doesn't automatically delete it (that part in and of
> itself is not trivial -- what manages the garbage collection? Perhaps that
> could be part of the configuration of such a feature). Then, when you rerun
> a job, it checks whether the intermediate outputs it would have saved
> (which it knows at compile time) already exist, and skips recomputing them.
>
> There are some tricky caveats here... what if your code changes affect
> intermediate data? You could save the logical plan as well, but what if you
> make a change to a UDF? I am not sure whether the benefit of automating
> this in the language, compared to building a workflow like yours external
> to Pig, is worth the complexity.
>
> But it is intriguing, and is a subset of data caching that we have thought
> a lot about here.
>
> 2012/6/15 Russell Jurney <[EMAIL PROTECTED]>
>
>> In production I use short Pig scripts and schedule them with Azkaban
>> with dependencies set up, so that I can use Azkaban to restart long
>> data pipelines at the point of failure. I edit the failing Pig script,
>> usually towards the end of the data pipeline, and restart the Azkaban
>> job. This saves hours and hours of repeated processing.
>>
>> I wish Pig could do this itself: resume at its point of failure when
>> re-run from the command line. Is this feasible?
>>
>> Russell Jurney
>> twitter.com/rjurney
>> [EMAIL PROTECTED]
>> datasyndrome.com
>>
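
To make the "load/store boundaries" idea concrete, here is a minimal sketch of manual checkpointing in Pig Latin. All relation names, schemas, and HDFS paths are hypothetical; the point is only the STORE/LOAD boundary between stages:

-- stage1.pig: do the expensive work once and checkpoint it to HDFS.
events = LOAD '/data/events' AS (user_id:chararray, url:chararray);
users  = LOAD '/data/users'  AS (user_id:chararray, country:chararray);
joined = JOIN events BY user_id, users BY user_id;
STORE joined INTO '/tmp/checkpoints/joined';

-- stage2.pig: after a failure here, re-run only this script; it resumes
-- from the stored checkpoint instead of recomputing the join.
joined  = LOAD '/tmp/checkpoints/joined'
          AS (user_id:chararray, url:chararray,
              user_id2:chararray, country:chararray);
grouped = GROUP joined BY country;
counts  = FOREACH grouped GENERATE group AS country, COUNT(joined) AS n;
STORE counts INTO '/output/counts_by_country';

This also surfaces the garbage-collection question raised above: nothing ever deletes /tmp/checkpoints/joined, so the workflow (or the user) has to clean up stale checkpoints whenever the upstream data or the script changes.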
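For reference, the Azkaban setup Russell describes amounts to one small job file per Pig script, chained by dependencies. A hypothetical two-stage flow (the file names and scripts are assumptions) would look like:

# stage1.job
type=command
command=pig -f stage1.pig

# stage2.job -- runs only after stage1 succeeds; restarting the flow
# after a failure in stage2 skips the already-successful stage1.
type=command
command=pig -f stage2.pig
dependencies=stage1

Restarting the flow in Azkaban then re-executes only the failed job and everything downstream of it, which is exactly the resume-at-point-of-failure behavior being asked of Pig itself.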