We could save some metadata, like CRCs of all the jars, and maybe a hash of the subplan associated with each stored intermediate output. But really we should just do Nectar, since it solves all this and more :)
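A rough sketch of the metadata idea above, in Python rather than inside Pig itself. Everything here is an assumption for illustration: the function names `jar_crc`, `cache_key` and `can_reuse` are hypothetical, the subplan is treated as plain text, and nothing below is an existing Pig API. It only shows how a key built from the subplan plus jar CRCs would change whenever either one changes.

```python
import hashlib
import zlib
from pathlib import Path


def jar_crc(path):
    """CRC32 of a jar file's bytes, as an unsigned 32-bit integer."""
    return zlib.crc32(Path(path).read_bytes()) & 0xFFFFFFFF


def cache_key(subplan_text, jar_paths):
    """Hash the subplan text together with the name and CRC of every jar.

    Hypothetical helper: a real implementation would need a canonical
    serialization of the plan, not just its text.
    """
    h = hashlib.sha256()
    h.update(subplan_text.encode("utf-8"))
    # Sort by file name so the key does not depend on argument order.
    for path in sorted(jar_paths, key=lambda p: Path(p).name):
        h.update(b"\0")
        h.update(Path(path).name.encode("utf-8"))
        h.update(b"\0")
        h.update(f"{jar_crc(path):08x}".encode("ascii"))
    return h.hexdigest()


def can_reuse(stored_key, subplan_text, jar_paths):
    """True if a stored intermediate was produced by the same plan and jars."""
    return stored_key == cache_key(subplan_text, jar_paths)
```

The key would be stored next to each intermediate output; on a rerun, a mismatch (say, a rebuilt UDF jar) means the intermediate has to be recomputed. This does not catch everything, for example changes in the input data itself.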
On Jun 15, 2012, at 10:43 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> Well, you can do this physically by adding load/store boundaries to your
> code. Thinking out loud, such a thing could be possible...
> At any M/R boundary, you store the intermediate in HDFS, and Pig is aware
> of this and doesn't automatically delete it (this part in and of itself is
> not trivial -- what manages the garbage collection? perhaps that could be
> part of the configuration of such a feature). Then, when you rerun a job,
> it will check whether the intermediates it would have saved (it knows
> these at compile time) already exist.
> There are some tricky caveats here... what if your code changes affect
> intermediate data? You could save the logical plan as well, but what if you
> make a change to a UDF? I am not sure if the benefit of automating this in
> the language compared to developing a workflow similar to yours external to
> Pig is worth the complexity.
> But it is intriguing, and is a subset of data caching that we have thought
> a lot about here.
> 2012/6/15 Russell Jurney <[EMAIL PROTECTED]>
>> In production I use short Pig scripts and schedule them with Azkaban
>> with dependencies set up, so that I can use Azkaban to restart long
>> data pipelines at the point of failure. I edit the failing Pig script,
>> usually towards the end of the data pipeline, and restart the Azkaban
>> job. This saves hours and hours of repeated processing.
>> I wish Pig could do this: resume at its point of failure when
>> re-run from the command line. Is this feasible?
>> Russell Jurney
>> [EMAIL PROTECTED]