Pig, mail # user - debug feature??

Re: debug feature??
Yang 2012-10-19, 19:01
I am using PigUnit, but it's somewhat limited. It can run only in local mode,
so I can't find issues that show up only with fairly large test data. You also
have to create small snippets that you cut out manually from your original
code, and once a snippet tests fine you have to copy-paste it back into the
production code, which can introduce copy-paste errors.

Compared to JUnit in Java, this is really very crude: in Java you have a
class and you can unit-test its individual methods, instead of having to
copy-paste and create a special "test version" of that class.

Overall, I feel that testability is an area where Pig could invest a lot
more effort, and doing so would greatly benefit its wider adoption. Some
other tools (Cascading, Cascalog, etc.) advertise testability as one of their
important features.

Let me check out Penny... thanks
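
For reference, the manual checkpointing described in the quoted message below could be sketched roughly like this. It is only an illustration, not an existing Pig feature: the `Stage` class, `checkpoint_path`, and `stage_script` are hypothetical names, the schema of each step is declared by hand rather than inferred (inference is the hard part mentioned below), and each step is assumed to depend only on the relation produced by the step just before it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """One step of a larger Pig script (hypothetical helper type)."""
    name: str    # label used to build the checkpoint path
    body: str    # Pig Latin statements for this step
    alias: str   # relation produced by this step
    schema: str  # Pig schema of that relation, written by hand


def checkpoint_path(base: str, stage: Stage) -> str:
    """Directory where a stage's output is saved."""
    return f"{base}/{stage.name}"


def stage_script(stages: List[Stage], index: int, base: str) -> str:
    """Build a standalone script for stages[index].

    Instead of recomputing the earlier steps, the script reloads the
    previous stage's saved output, runs this stage, and stores the result.
    """
    lines = []
    if index > 0:
        prev = stages[index - 1]
        lines.append(
            f"{prev.alias} = LOAD '{checkpoint_path(base, prev)}' "
            f"AS ({prev.schema});"
        )
    cur = stages[index]
    lines.append(cur.body.strip())
    lines.append(f"STORE {cur.alias} INTO '{checkpoint_path(base, cur)}';")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative stages; the paths and field names are made up.
    stages = [
        Stage("load",
              "raw = LOAD 'input/events' AS (user:chararray, n:int);",
              "raw", "user:chararray, n:int"),
        Stage("filter",
              "big = FILTER raw BY n > 10;",
              "big", "user:chararray, n:int"),
    ]
    print(stage_script(stages, 1, "/tmp/ckpt"))
```

Each generated script can then be debugged on its own. Before rerunning a stage, its old checkpoint directory has to be removed, since the job will not write into an existing output location.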

On Fri, Oct 19, 2012 at 2:18 AM, Jagat Singh <[EMAIL PROTECTED]> wrote:

> Hello,
> I understand the pain :)
> Have you seen PigUnit and Penny?
> http://pig.apache.org/docs/r0.10.0/test.html
> On Fri, Oct 19, 2012 at 8:09 PM, Yang <[EMAIL PROTECTED]> wrote:
> > One of the greatest pains I face when debugging Pig code is that the
> > iteration cycles are really long.
> > The applications for which we use Pig typically deal with large datasets,
> > and if a Pig script involves many JOIN/GENERATE/FILTER steps, every step
> > takes a lot of time. But every time I fix one step, I have to rerun from
> > the start, which is pointless.
> >
> > What I am doing so far to avoid rerunning already-debugged steps is to
> > manually divide my script into many small scripts and save the last
> > variable of each out to HDFS. Once a small script is debugged, I load
> > that saved variable in the next small script.
> >
> > After all the small scripts are done, I connect them back manually into
> > the original big script.
> >
> >
> > Is there a way to automate this? For example, add a mark around a
> > particular step that tells Pig the result is to be saved and that all
> > following steps are not to be executed. Then, when we move on to the
> > next step, it would know where to pick up the last-saved data.
> >
> > Writing a preprocessor to do the above is not trivial, so I can't whip
> > something up immediately, because it needs to figure out the schemas of
> > the variables that propagate through the steps.
> >
> >
> > Thanks
> > Yang
> >