Ruslan Al-Fakikh 2012-10-22, 12:55
Basically it would be perfect if you first test with a small amount of
data in local mode, and then run the script on the big data to verify.
If this is not possible, you can store a relation at any point of your
script with a STORE statement, so as not to lose intermediate results.
Then you can remove the STOREs after debugging.
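For example, a mid-script checkpoint might look like this (the relation and path names here are just placeholders, not from the original script):

```pig
-- ... earlier, already-debugged steps ...
joined = JOIN users BY id, events BY user_id;

-- checkpoint: persist the intermediate relation so the steps below
-- can be debugged without re-running the expensive join above
STORE joined INTO 'tmp/checkpoint_joined' USING PigStorage('\t');

-- ... later steps under development ...
```

For the small-data test, the same script can be run with `pig -x local myscript.pig` against a local sample file before submitting it to the cluster.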
Best Regards, Ruslan
On Fri, Oct 19, 2012 at 1:18 PM, Jagat Singh <[EMAIL PROTECTED]> wrote:
> Hello,
> I understand the pain :)
> Have you seen PigUnit and Penny?
> On Fri, Oct 19, 2012 at 8:09 PM, Yang <[EMAIL PROTECTED]> wrote:
>> One of the greatest pains I face when debugging Pig code is that the
>> iteration cycles are really long:
>> the applications for which we use Pig typically deal with large datasets,
>> and if a Pig script involves many JOIN/GENERATE/FILTER steps, every step
>> takes a lot of time, but every time I fix one step, I have to re-run from
>> the start, which is wasted work.
>> What I am doing so far to avoid re-running already-debugged steps is to
>> manually divide my script into many small scripts and save the last
>> relation out to HDFS; once a small script is debugged, I load that
>> relation back in at the start of the next small script.
>> After all the small scripts are done, I connect them back manually into
>> the original big script.
>> Is there a way to automate this? For example, add a mark around a
>> particular step that tells Pig the result is to be saved and all
>> following steps are not to be executed; then, when we move on to the next
>> step, it knows where to pick up the last-saved data.
>> Writing a preprocessor to do the above is not trivial, so I can't whip
>> up something immediately, because it needs to figure out the schemas of
>> the relations that propagate through the steps.
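The split-and-resume workflow described above can be sketched as a pair of scripts (the paths and the schema are illustrative, not from the original thread); the manual part Yang mentions is exactly that the LOAD in the second script must re-declare the schema the stored relation had, since plain PigStorage output does not carry it:

```pig
-- script 1 (already debugged): checkpoint the last relation to HDFS
filtered = FILTER raw BY score > 0;
STORE filtered INTO 'tmp/step1_out';

-- script 2: resume from the checkpoint instead of re-running script 1;
-- the schema has to be restated by hand on the LOAD
filtered = LOAD 'tmp/step1_out'
           AS (id:long, name:chararray, score:double);
grouped  = GROUP filtered BY name;
counts   = FOREACH grouped GENERATE group, COUNT(filtered);
STORE counts INTO 'tmp/step2_out';
```

Once every small script works, the STORE/LOAD pairs are deleted and the bodies are concatenated back into the original big script.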