Pig >> mail # user >> debug feature??


Re: debug feature??
I am using PigUnit, but it's somewhat limited. It can run only in local
mode, so I can't find issues that show up only with fairly large test data.
You also have to cut small snippets out of your original code by hand, and
once a snippet tests fine you have to copy-paste it back into the
production code, which can introduce copy-paste errors.
Compared to JUnit in Java, this is really crude: in Java you have a class
and can unit-test its individual methods directly, instead of having to
copy-paste code into a special "test version" of that class.

Overall, I feel that testability is an area where Pig could invest a lot
more effort, and doing so would greatly help its wider adoption. Some
other tools (Cascading, Cascalog, etc.) advertise testability as one of
their important features.

Let me check out Penny. Thanks.

On Fri, Oct 19, 2012 at 2:18 AM, Jagat Singh <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I understand the pain :)
>
> Have you seen PigUnit and Penny?
>
> http://pig.apache.org/docs/r0.10.0/test.html
>
>
>
> On Fri, Oct 19, 2012 at 8:09 PM, Yang <[EMAIL PROTECTED]> wrote:
>
> > One of the greatest pains I face when debugging Pig code is that the
> > iteration cycles are really long. The applications for which we use Pig
> > typically deal with large datasets, and if a Pig script involves many
> > JOIN/GENERATE/FILTER steps, every step takes a lot of time. But every
> > time I fix one step, I have to rerun from the start, which is wasteful.
> >
> > What I do so far to avoid wasting time rerunning already-debugged
> > steps is to manually divide my script into many small scripts and save
> > the last variable of each out to HDFS. Once a small script is debugged,
> > I load that saved variable in the next small script.
> >
> > After all the small scripts are done, I connect them back manually into
> > the original big script.
> >
> >
> > Is there a way to automate this? For example, add a mark around a
> > particular step that tells Pig to save the result and not to execute
> > any of the following steps. Then, when we move on to the next step, it
> > knows where to pick up the last-saved data.
> >
> > Writing a preprocessor to do the above is not trivial, so I can't whip
> > up something immediately, because it needs to figure out the schemas
> > of the variables that propagate through the steps.
> >
> >
> > Thanks
> > Yang
> >
>
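
The preprocessor idea from the original question could be sketched roughly
as below, in Python. This is only an illustration under stated assumptions:
the "-- CHECKPOINT <alias>" marker, the function name split_at_checkpoints,
and the /tmp/pig_ckpt directory are all hypothetical and are not Pig
features. The sketch sidesteps the schema problem by passing the '-schema'
option to PigStorage on both STORE and LOAD, which, as far as I know, Pig
0.10 supports for writing and reading a schema file next to the data. Only
the marked alias is carried across a stage boundary, so a later stage that
refers to any other earlier alias would fail.

```python
import re

# Hypothetical marker: a comment line "-- CHECKPOINT <alias>" placed right
# after the statement that defines <alias>. This is not Pig syntax.
MARKER = re.compile(r"^\s*--\s*CHECKPOINT\s+(\w+)\s*$")


def split_at_checkpoints(script, tmp_dir="/tmp/pig_ckpt"):
    """Split a Pig script into stage scripts at each marker line.

    Each stage ends by storing the marked alias, and the next stage
    begins by loading it back. Returns a list of stage script strings.
    """
    stages = []
    current = []
    has_statements = False
    for line in script.splitlines():
        match = MARKER.match(line)
        if match is None:
            current.append(line)
            has_statements = has_statements or bool(line.strip())
            continue
        alias = match.group(1)
        path = f"{tmp_dir}/{alias}"
        # Assumption: '-schema' makes PigStorage save the schema with the data.
        current.append(
            f"STORE {alias} INTO '{path}' USING PigStorage('\\t', '-schema');"
        )
        stages.append("\n".join(current))
        # The next stage starts from the saved data instead of recomputing it.
        current = [f"{alias} = LOAD '{path}' USING PigStorage('\\t', '-schema');"]
        has_statements = False
    if has_statements:
        stages.append("\n".join(current))
    return stages


example_script = """\
a = LOAD 'input' AS (k:chararray, v:int);
b = FILTER a BY v > 0;
-- CHECKPOINT b
c = GROUP b BY k;
STORE c INTO 'output';
"""

example_stages = split_at_checkpoints(example_script)
```

Each string in example_stages could be written to its own file and run
separately, so that after fixing a later stage only that stage needs to be
rerun against the data saved by the previous one.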