Pig >> mail # user >> debug feature??

debug feature??
One of the biggest pains I face when debugging Pig code is that the
iteration cycles are really long. The applications we use Pig for
typically deal with large datasets, and if a script involves many
JOIN/GENERATE/FILTER steps, every step takes a long time. Yet every time
I fix one step, I have to rerun the script from the start, which wastes
time on steps that already work.

What I do so far to avoid re-running already-debugged steps is to
manually divide my script into many small scripts. Each small script
saves its last variable out to HDFS, and once that script is debugged,
I load the saved variable in the next small script.
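
To make the workaround concrete, here is a rough sketch of what one such
split looks like. The paths, aliases, and fields are made up for
illustration; only the STORE/LOAD pattern matters.

```pig
-- step1.pig: the part that is already debugged
-- (all paths and field names here are hypothetical)
raw    = LOAD '/data/events' USING PigStorage('\t')
         AS (user_id:chararray, url:chararray, ts:long);
valid  = FILTER raw BY user_id IS NOT NULL;
users  = LOAD '/data/users' USING PigStorage('\t')
         AS (user_id:chararray, country:chararray);
joined = JOIN valid BY user_id, users BY user_id;
step1  = FOREACH joined GENERATE valid::user_id AS user_id,
                                 users::country AS country,
                                 valid::ts      AS ts;
-- checkpoint: save the last relation so the next script can start here
STORE step1 INTO '/tmp/debug/step1' USING PigStorage('\t');

-- step2.pig: the part still being debugged
-- reload the checkpoint; the schema has to be repeated by hand
step1      = LOAD '/tmp/debug/step1' USING PigStorage('\t')
             AS (user_id:chararray, country:chararray, ts:long);
by_country = GROUP step1 BY country;
counts     = FOREACH by_country GENERATE group AS country,
                                         COUNT(step1) AS n;
DUMP counts;
```

Only step2.pig has to be rerun while I debug the GROUP, but the AS clause
in its LOAD has to be kept in sync with step1.pig by hand.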

After all the small scripts are done, I connect them back manually into
the original big script. Is there a way to automate this? For example,
add a mark around a particular step that tells Pig the result is to be
saved and that all following steps are not to be executed. Then, when we
move on to the next step, Pig knows where to pick up the last-saved data.

Writing a preprocessor to do the above is not trivial, so I can't whip
something up immediately, because it needs to figure out the schemas of
the variables that propagate through the steps.