

Re: Resume failed pig script
What's Nectar?

I'd like this feature because Pig is easier to read than Oozie XML or
Azkaban YAML/JSON, where one must manually specify dependencies.
Isn't Lipstick a good example of using Pig this way?

Russell Jurney
twitter.com/rjurney
[EMAIL PROTECTED]
datasyndrome.com

On Jun 16, 2012, at 8:27 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> Could save some metadata like CRCs of all the jars... and maybe a hash of the subplan associated with each stored intermediate output... But really we should just do Nectar, since it solves all this and more :)
>
> On Jun 15, 2012, at 10:43 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
>
>> Well, you can do this physically by adding load/store boundaries to your
>> code. Thinking out loud, such a thing could be possible...
>>
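For illustration, a minimal sketch of such a manual store boundary (all
paths and aliases below are made up):

    -- Checkpoint by hand: materialize the intermediate at an M/R boundary.
    events  = LOAD '/data/events' AS (user:chararray, url:chararray);
    by_user = GROUP events BY user;    -- GROUP forces a map/reduce boundary
    counts  = FOREACH by_user GENERATE group AS user, COUNT(events) AS n;
    STORE counts INTO '/checkpoints/counts';    -- kept for a later resume run
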
>> At any M/R boundary, you store the intermediate data in HDFS, and Pig is
>> aware of this and doesn't automatically delete it (this part in and of
>> itself is not trivial -- what manages the garbage collection? Perhaps that
>> could be part of the configuration of such a feature). Then, when you
>> rerun a job, it will check whether the nodes it would have saved (it knows
>> them at compile time) already exist.
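
Concretely, a resume run would behave as if the upstream subplan had been
replaced by a load of the saved intermediate; the hand-written equivalent
today would be something like this (continuing the sketch above, paths
still hypothetical):

    -- Resume run: load the checkpoint instead of recomputing the subplan.
    counts = LOAD '/checkpoints/counts' AS (user:chararray, n:long);
    top    = ORDER counts BY n DESC;
    STORE top INTO '/output/top_users';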
>>
>> There are some tricky caveats here... what if your code changes affect the
>> intermediate data? You could save the logical plan as well, but what if you
>> make a change to a UDF? I am not sure whether the benefit of automating
>> this in the language, compared to an external workflow like yours, is
>> worth the complexity.
>>
>> But it is intriguing, and is a subset of data caching that we have thought
>> a lot about here.
>>
>> 2012/6/15 Russell Jurney <[EMAIL PROTECTED]>
>>
>>> In production I use short Pig scripts scheduled with Azkaban, with
>>> dependencies set up so that I can use Azkaban to restart long
>>> data pipelines at the point of failure. I edit the failing Pig script,
>>> usually towards the end of the data pipeline, and restart the Azkaban
>>> job. This saves hours and hours of repeated processing.
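
For illustration, this pattern amounts to short scripts handing off through
HDFS, with the step2-depends-on-step1 edge declared in Azkaban rather than
in Pig (script names and paths below are made up):

    -- step1.pig: first Azkaban job in the pipeline
    raw     = LOAD '/pipeline/raw' AS (line:chararray);
    cleaned = FILTER raw BY line IS NOT NULL;
    STORE cleaned INTO '/pipeline/cleaned';

    -- step2.pig: depends on step1; on failure, edit and restart from here
    cleaned = LOAD '/pipeline/cleaned' AS (line:chararray);
    grouped = GROUP cleaned ALL;
    total   = FOREACH grouped GENERATE COUNT(cleaned) AS n;
    STORE total INTO '/pipeline/total';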
>>>
>>> I wish Pig could do this itself: resume at its point of failure when
>>> re-run from the command line. Is this feasible?
>>>
>>> Russell Jurney
>>> twitter.com/rjurney
>>> [EMAIL PROTECTED]
>>> datasyndrome.com
>>>