Re: debug feature??
Yes, this is what I'm doing.

But manually adding and removing the STORE and LOAD statements is tedious, and
more importantly it risks introducing bugs during the code change. The ideal
would be to put a "marker" on certain variables so that they are either stored,
or their computation is skipped and they are LOADed back instead.
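
For concreteness, a minimal sketch of the swap I mean (the aliases, paths, and
schemas below are made up for illustration):

users  = LOAD 'input/users'  AS (id:long, name:chararray);
events = LOAD 'input/events' AS (user_id:long, ts:long);

-- first debugging pass: materialize the intermediate result
joined = JOIN users BY id, events BY user_id;
STORE joined INTO 'tmp/joined_ckpt';

-- later passes: comment out the two lines above and reload the checkpoint,
-- re-declaring the schema by hand (downstream field references may need fixing)
-- joined = LOAD 'tmp/joined_ckpt'
--          AS (id:long, name:chararray, user_id:long, ts:long);

counts = FOREACH (GROUP joined BY id) GENERATE group AS id, COUNT(joined) AS n;
STORE counts INTO 'output/counts';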

On Fri, Oct 19, 2012 at 6:04 AM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:

> Hi,
>
> Basically it would be perfect if you first tested with a small amount of
> data in local mode and then ran the script on the big data to verify
> correctness.
> If this is not possible, you can store a relation at any point of your
> script with a STORE statement, so as not to lose intermediate results.
> You can then remove the STOREs after debugging.
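>
> A rough sketch of that flow (the file names and aliases here are invented;
> the local-mode run is just pig -x local myscript.pig against a small sample):
>
> raw     = LOAD 'sample/input.tsv' AS (id:long, amount:double);
> grouped = GROUP raw BY id;
> totals  = FOREACH grouped GENERATE group AS id, SUM(raw.amount) AS total;
>
> -- keep the intermediate around while debugging; remove this line afterwards
> STORE grouped INTO 'tmp/grouped_ckpt';
>
> STORE totals INTO 'output/totals';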
>
> Best Regards, Ruslan
>
> On Fri, Oct 19, 2012 at 1:18 PM, Jagat Singh <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > I understand the pain :)
> >
> > Have you seen PigUnit and Penny?
> >
> > http://pig.apache.org/docs/r0.10.0/test.html
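> >
> > Besides PigUnit and Penny, the diagnostic operators are handy for cutting
> > the iteration time, since they don't need a full run over the real data.
> > A small sketch (aliases and paths are made up):
> >
> > raw     = LOAD 'input/events' AS (user_id:long, ts:long);
> > by_user = GROUP raw BY user_id;
> > counts  = FOREACH by_user GENERATE group AS user_id, COUNT(raw) AS n;
> >
> > DESCRIBE counts;    -- print the schema of an intermediate alias
> > ILLUSTRATE counts;  -- run the pipeline over a small sampled subset
> > EXPLAIN counts;     -- show the logical/physical/MapReduce plans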
> >
> >
> >
> > On Fri, Oct 19, 2012 at 8:09 PM, Yang <[EMAIL PROTECTED]> wrote:
> >
> >> One of the greatest pains I face when debugging Pig code is that the
> >> iteration cycles are really long:
> >> the applications for which we use Pig typically deal with large datasets,
> >> and if a Pig script involves many
> >> JOIN/GENERATE/FILTER steps, every step takes a lot of time, but every time
> >> I fix one step, I have to re-run from the start,
> >> which is wasted work.
> >>
> >> What I am doing so far to avoid wasting time re-running
> >> already-debugged steps is to
> >> manually divide my script into many small scripts and save the last
> >> variable out to HDFS; once a
> >> small script is debugged, I load that saved variable in the next
> >> small script.
> >>
> >> After all the small scripts are done, I manually connect them back into
> >> the original big script.
> >>
> >>
> >> Is there a way to automate this? For example, add a marker around a
> >> particular step that tells Pig
> >> that the result is to be saved and that all following steps are not to be
> >> executed; and when we move
> >> on to the next step, it knows where to pick up the last-saved data.
> >>
> >> Writing a preprocessor to do the above is not trivial, so I can't whip
> >> up something immediately, because it needs to figure out the
> >> schemas of the variables that propagate through the steps.
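> >>
> >> One partial workaround that avoids a full preprocessor (untested, and the
> >> alias, path, and parameter names below are invented) is to make the step's
> >> right-hand side a Pig parameter, so switching between recomputing and
> >> reloading the saved data is a command-line change rather than a script edit:
> >>
> >> -- in the script:
> >> joined = $JOINED_SRC;
> >>
> >> -- while debugging downstream steps, reload the checkpoint:
> >> --   pig -param "JOINED_SRC=LOAD 'tmp/joined_ckpt' AS (id:long, ts:long)" myscript.pig
> >> -- for the full run, substitute the real computation:
> >> --   pig -param "JOINED_SRC=JOIN users BY id, events BY user_id" myscript.pig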
> >>
> >>
> >> Thanks
> >> Yang
> >>
>