Yang 2012-10-19, 09:09
Jagat Singh 2012-10-19, 09:18
Yang 2012-10-19, 19:01
Some testing tips:
1) parametrize your load/store statements so that if you have to run
in hadoop mode, it's easy to switch to debug inputs / outputs (and
debug input/output loaders and storers). It's vastly preferable to
test in local mode when possible, since the iterations are so much faster.
2) it's a good thing that PigUnit makes you test small pieces of code!
Factor out macros so that you can create unit tests; don't copy and
paste code, use macros and the import statement.
3) Try using mock.Storage (see
https://issues.apache.org/jira/browse/PIG-2650) to automatically
create inputs and examine outputs in your unit tests, if you are on
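
A minimal sketch of tips 1 and 2 in Pig Latin, in case it helps. The file names (macros.pig, main.pig), the macro name top_users, the paths and the schema are all made up for illustration; only the mechanisms (parameter substitution with %default and -param, DEFINE ... RETURNS macros, IMPORT) come from Pig itself.

```pig
-- macros.pig (hypothetical file): logic factored into a macro, so a unit
-- test and the production script can both import the same code.
DEFINE top_users(events, min_count) RETURNS result {
    grouped = GROUP $events BY user;
    counted = FOREACH grouped GENERATE group AS user, COUNT($events) AS n;
    $result = FILTER counted BY n >= $min_count;
};

-- main.pig (hypothetical file): load/store locations are parameters with
-- small debug defaults, so switching inputs needs no edit to the script.
%default INPUT 'debug/events.tsv';
%default OUTPUT 'debug/top_users';

IMPORT 'macros.pig';

events = LOAD '$INPUT' USING PigStorage('\t') AS (user:chararray, url:chararray);
top = top_users(events, 2);
STORE top INTO '$OUTPUT';

-- Local run against the debug defaults:
--   pig -x local main.pig
-- Cluster run against real data (paths are placeholders):
--   pig -param INPUT=/data/events -param OUTPUT=/data/top_users main.pig
```

Because the macro lives in its own file, a test script only needs the IMPORT line plus its own small LOAD, so nothing has to be copied back into the production script afterwards.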
On Fri, Oct 19, 2012 at 12:01 PM, Yang <[EMAIL PROTECTED]> wrote:
> I am using PigUnit, but it's somewhat limited: it can run only in local mode,
> so I can't find issues that show up only with fairly large test data; you have to
> create small snippets of code that you cut out manually from your original
> code, so after you have tested a snippet, you have to copy-paste it
> back into the production code, which introduces possible copy-paste errors.
> if you compare this to Java JUnit, this is really very crude: in Java, you
> have a class, and you can do JUnit testing on individual methods of the
> class, instead of having to copy paste and create a special "test version"
> of that class.
> overall, I feel that testability is an area where Pig could spend a lot
> more effort, and it would greatly benefit its wider adoption. Some
> other tools (Cascading, Cascalog, etc.) advertise testability as one of their
> important features.
> let me check out penny... thanks
> On Fri, Oct 19, 2012 at 2:18 AM, Jagat Singh <[EMAIL PROTECTED]> wrote:
>> Hello,
>> I understand the pain :)
>> Have you seen PigUnit and Penny?
>> On Fri, Oct 19, 2012 at 8:09 PM, Yang <[EMAIL PROTECTED]> wrote:
>> > one of the greatest pains I face when debugging Pig code is that the
>> > iteration cycles are really long:
>> > the applications for which we use Pig typically deal with large datasets,
>> > and if a Pig script involves many
>> > JOIN/generate/filter steps, every step takes a lot of time, but every
>> > time I fix one step, I have to run from the start,
>> > which is meaningless.
>> > what I am doing so far, to reduce the time wasted re-running
>> > already-debugged steps, is to
>> > manually divide my script into many small scripts, and save the last
>> > variable out into HDFS, and once the
>> > small script is debugged, I load the previous variable in the next
>> > small script.
>> > after all small scripts are done, I connect them back manually into the
>> > original big script.
>> > is there a way to automate this? for example, add a mark around a
>> > step that tells Pig
>> > that the result is to be saved, and that all following steps are not to be
>> > executed, so that when we move
>> > on to the next step, it knows where to pick up the last-saved data.
>> > writing a preprocessor to do the above is not trivial, so I can't
>> > whip up something immediately, because it needs to figure out the
>> > schemas of the variables that propagate through the steps.
>> > Thanks
>> > Yang
Yang 2012-10-23, 18:11
Yang 2012-11-07, 21:05
Ruslan Al-Fakikh 2012-10-22, 12:55
Ruslan Al-Fakikh 2012-10-19, 13:04
Yang 2012-10-19, 18:57