Re: Resume failed pig script
Russell Jurney 2012-06-16, 18:55
I'd like this feature because Pig is easier to read than Oozie XML or
Azkaban YAML/JSON, where one must manually specify dependencies.
Lipstick is a good example of using Pig this way?
On Jun 16, 2012, at 8:27 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> Could save some metadata like CRCs of all the jars, and maybe a hash of the subplan associated with each stored intermediate output. But really we should just do Nectar since it solves all this and more :)
> On Jun 15, 2012, at 10:43 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
>> Well, you can do this physically by adding load/store boundaries to your
>> code. Thinking out loud, such a thing could be possible...
>> At any M/R boundary, you store the intermediate in HDFS, and pig is aware
>> of this and doesn't automatically delete it (this part in and of itself is
>> not trivial -- what manages the garbage collection? perhaps that could be
>> part of the configuration of such a feature). Then, when you rerun a job,
>> it will check whether the nodes that it would have saved (it knows these
>> at compile time) already exist.
>> There are some tricky caveats here... what if your code changes affect
>> intermediate data? You could save the logical plan as well, but what if you
>> make a change to a UDF? I am not sure if the benefit of automating this in
>> the language compared to developing a workflow similar to yours external to
>> pig is worth the complexity.
>> But it is intriguing, and is a subset of data caching that we have thought
>> a lot about here.
>> 2012/6/15 Russell Jurney <[EMAIL PROTECTED]>
>>> In production I use short Pig scripts and schedule them with Azkaban
>>> with dependencies set up, so that I can use Azkaban to restart long
>>> data pipelines at the point of failure. I edit the failing pig script,
>>> usually towards the end of the data pipeline, and restart the Azkaban
>>> job. This saves hours and hours of repeated processing.
>>> I wish Pig could do this: resume at its point of failure when
>>> re-run from the command line. Is this feasible?
>>> Russell Jurney
>>> [EMAIL PROTECTED]
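
The bookkeeping discussed in this thread (keep the intermediate output at each M/R boundary, and key it by a hash of the subplan plus the CRCs of the jars) could be sketched roughly as below. This is an illustration in Python, not how Pig works: the function names, the linear list of stages, and the choice to chain each stage's key to the previous stage's key are all assumptions made for the example. Chaining means that a change to an earlier stage or to any jar also changes the key of every later stage, which addresses the "what if your code changes affect intermediate data?" caveat for this simple case.

```python
import hashlib
import zlib
from pathlib import Path


def jar_crcs(jar_paths):
    """Return {jar file name: CRC32 of its bytes} for the given jar paths."""
    return {Path(p).name: zlib.crc32(Path(p).read_bytes()) for p in jar_paths}


def checkpoint_key(parent_key, subplan_text, crcs):
    """Hash the upstream key, this stage's subplan text, and the jar CRCs."""
    h = hashlib.sha256()
    h.update(parent_key.encode("utf-8"))
    h.update(b"\0")
    h.update(subplan_text.encode("utf-8"))
    for name, crc in sorted(crcs.items()):
        h.update(f"\0{name}={crc:08x}".encode("utf-8"))
    return h.hexdigest()


def stage_keys(stages, crcs):
    """Compute one key per stage for a linear pipeline.

    stages: list of (stage_name, subplan_text) in execution order
    (hypothetical representation; a real plan is a DAG).
    """
    keys = []
    parent = ""
    for name, text in stages:
        parent = checkpoint_key(parent, text, crcs)
        keys.append((name, parent))
    return keys


def first_stage_to_run(stages, crcs, stored_keys):
    """Return the index of the first stage that has to be executed.

    stored_keys: set of keys whose intermediate output still exists.
    The latest stage with a matching stored output is the resume point;
    0 means nothing can be reused, len(stages) means nothing needs to run.
    """
    keys = stage_keys(stages, crcs)
    for i in range(len(keys) - 1, -1, -1):
        if keys[i][1] in stored_keys:
            return i + 1
    return 0
```

On a re-run, a driver would compute the keys for the edited script, call `first_stage_to_run` with the set of keys found in storage, and start from that index. Garbage collection of old intermediates, raised earlier in the thread as a non-trivial problem, is left out of this sketch.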