Pig, mail # user - debug feature??


Re: debug feature??
Yang 2012-10-23, 18:11
Nice, thanks.

Macros and mock.Storage() are both new to me; I believe they will help a lot.

On Mon, Oct 22, 2012 at 5:32 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> Some testing tips:
>
> 1) parametrize your load/store statements so that if you have to run
> in hadoop mode, it's easy to switch to debug inputs / outputs (and
> debug input/output loaders and storers). It's vastly preferable to
> test in local mode when possible, since the iterations are so much
> faster.
>
> 2) it's a good thing that PigUnit makes you test small pieces of code!
> Factor out macros so that you can create unit tests; don't copy and
> paste code, use macros and the import statement.
>
> 3) Try using mock.Storage (see
> https://issues.apache.org/jira/browse/PIG-2650) to automatically
> create inputs and examine outputs in your unit tests, if you are on
> Pig 0.11.
>
> D
>
> On Fri, Oct 19, 2012 at 12:01 PM, Yang <[EMAIL PROTECTED]> wrote:
> > I am using PigUnit, but it's somewhat limited. It can run only in local
> > mode, so I can't find issues that show up with fairly large test data. You
> > also have to create small snippets of code that you cut out manually from
> > your original code, so after you have tested a snippet, you have to
> > copy-paste it back into the production code, which introduces possible
> > copy-paste errors. Compared to JUnit in Java, this is really very crude:
> > in Java, you have a class, and you can unit-test individual methods of
> > the class instead of having to copy-paste and create a special "test
> > version" of that class.
> >
> >
> > Overall, I feel that testability is an area where Pig could spend a lot
> > more effort, and doing so would greatly benefit its wider adoption. Some
> > other tools (Cascading, Cascalog, etc.) advertise testability as one of
> > their important features.
> >
> > Let me check out Penny... thanks
> >
> > On Fri, Oct 19, 2012 at 2:18 AM, Jagat Singh <[EMAIL PROTECTED]> wrote:
> >
> >> Hello,
> >>
> >> I understand the pain :)
> >>
> >> Have you seen PigUnit and Penny?
> >>
> >> http://pig.apache.org/docs/r0.10.0/test.html
> >>
> >>
> >>
> >> On Fri, Oct 19, 2012 at 8:09 PM, Yang <[EMAIL PROTECTED]> wrote:
> >>
> >> > One of the greatest pains I face when debugging Pig code is that the
> >> > iteration cycles are really long. The applications for which we use Pig
> >> > typically deal with large datasets, and if a Pig script involves many
> >> > JOIN/GENERATE/FILTER steps, every step takes a lot of time. But every
> >> > time I fix one step, I have to rerun from the start, which is pointless.
> >> >
> >> > What I am doing so far, to avoid wasting time re-running
> >> > already-debugged steps, is to manually divide my script into many
> >> > small scripts and save the last variable of each out to HDFS. Once a
> >> > small script is debugged, I load that saved variable in the next
> >> > small script.
> >> >
> >> > After all the small scripts are done, I connect them back manually into
> >> > the original big script.
> >> >
> >> >
> >> > Is there a way to automate this? For example, add a mark around a
> >> > particular step that tells Pig the result is to be saved and that the
> >> > following steps are not to be executed. Then, when we move on to the
> >> > next step, Pig knows where to pick up the last-saved data.
> >> >
> >> > Writing a preprocessor to do the above is not trivial, so I can't
> >> > whip something up immediately, because it needs to figure out the
> >> > schemas of the variables that propagate through the steps.
> >> >
> >> >
> >> > Thanks
> >> > Yang
> >> >
> >>
>
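
Tips 1 and 2 from Dmitriy's message could look roughly like the sketch below. The file names (macros.pig, clicks.pig), the macro name top_clickers, the parameter names INPUT, OUTPUT and MIN_CLICKS, the paths, and the schema are all made up for illustration. Only the DEFINE ... RETURNS macro syntax, IMPORT, %default, and $param substitution are standard Pig (0.9 and later), and this has not been run against any particular Pig release.

```pig
-- macros.pig (hypothetical file): the logic lives in a macro, so a unit
-- test can IMPORT and call it instead of copy-pasting a snippet.
DEFINE top_clickers(rel, min_clicks) RETURNS result {
    $result = FILTER $rel BY clicks >= $min_clicks;
};
```

```pig
-- clicks.pig (hypothetical file): the production script.
IMPORT 'macros.pig';

-- Defaults point at small debug data; override them with -param for
-- a real run. All names and paths here are assumptions.
%default INPUT 'test/clicks_sample.tsv'
%default OUTPUT 'out/top_clickers'
%default MIN_CLICKS 10

raw = LOAD '$INPUT' USING PigStorage('\t') AS (user:chararray, clicks:int);
-- Parameters are passed to the macro explicitly as arguments.
top = top_clickers(raw, $MIN_CLICKS);
STORE top INTO '$OUTPUT' USING PigStorage('\t');
```

With this layout, `pig -x local clicks.pig` would iterate quickly on the sample file, while something like `pig -param INPUT=/data/clicks -param OUTPUT=/data/top_clickers clicks.pig` would run the same script unchanged on the cluster. The same parameters could also carry the intermediate HDFS paths from the split-script workflow described in the original question.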