Pig >> mail # user >> debug feature??


Re: debug feature??
As for:
>the best scenario is to put a "marker" so that certain variables are stored or
>skipped computation but instead LOADed
I remember there was some discussion on this in the past. Actually,
this is not trivial. What would it do if you changed a UDF's internal
code, for example? How would it know that it should reprocess instead
of load? As far as I remember, some other problems were mentioned as well.

Ruslan
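
To make the invalidation question above concrete, here is a minimal sketch in Python. It is not a Pig feature and not something proposed in this thread. It assumes a checkpoint is keyed by a hash of the step's text, the source of every UDF it calls, and the fingerprints of its inputs, so editing a UDF changes the key and forces a recompute. The names `fingerprint` and `run_step` and the JSON cache files are hypothetical.

```python
import hashlib
import json
import os
import tempfile


def fingerprint(step_source, udf_sources, input_fingerprints):
    """Hash everything assumed to affect a step's output."""
    h = hashlib.sha256()
    h.update(step_source.encode("utf-8"))
    # Sort UDF names so the hash does not depend on dict ordering.
    for name in sorted(udf_sources):
        h.update(b"\0" + name.encode("utf-8"))
        h.update(b"\0" + udf_sources[name].encode("utf-8"))
    for fp in input_fingerprints:
        h.update(b"\0" + fp.encode("utf-8"))
    return h.hexdigest()


def run_step(cache_dir, step_source, udf_sources, input_fingerprints, compute):
    """Return (result, fingerprint, reused). Recompute only on a cache miss."""
    fp = fingerprint(step_source, udf_sources, input_fingerprints)
    path = os.path.join(cache_dir, fp + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f), fp, True
    result = compute()
    with open(path, "w") as f:
        json.dump(result, f)
    return result, fp, False


# Hypothetical step and UDF sources, used only for illustration.
cache_dir = tempfile.mkdtemp()
step = "B = FOREACH A GENERATE my_udf(x);"
udfs_v1 = {"my_udf": "def my_udf(x): return x + 1"}
udfs_v2 = {"my_udf": "def my_udf(x): return x + 2"}

_, fp1, reused_first = run_step(cache_dir, step, udfs_v1, [], lambda: [2, 3])
_, fp2, reused_again = run_step(cache_dir, step, udfs_v1, [], lambda: [2, 3])
_, fp3, reused_after_edit = run_step(cache_dir, step, udfs_v2, [], lambda: [3, 4])
print(reused_first, reused_again, reused_after_edit)  # prints: False True False
```

The sketch covers only the UDF-edit case. It would still miss changes it cannot see, such as input data that changed on disk or a UDF that depends on external state, which may be among the other problems mentioned.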

On Fri, Oct 19, 2012 at 11:01 PM, Yang <[EMAIL PROTECTED]> wrote:
> I am using PigUnit, but it's somewhat limited. It can only run in local mode,
> so I can't find issues that show up only with fairly large test data. You have
> to manually cut small snippets out of your original code, and once a snippet
> tests fine, you have to copy-paste it back into the production code, which
> invites copy-paste errors. Compared to JUnit in Java, this is really crude:
> in Java, you have a class, and you can unit-test its individual methods
> instead of having to copy-paste and create a special "test version" of that
> class.
>
>
> Overall, I feel that testability is an area where Pig could invest a lot
> more effort, and doing so would greatly benefit its wider adoption. Some
> other tools (Cascading, Cascalog, etc.) advertise testability as one of their
> important features.
>
> Let me check out Penny... thanks
>
> On Fri, Oct 19, 2012 at 2:18 AM, Jagat Singh <[EMAIL PROTECTED]> wrote:
>
>> Hello,
>>
>> I understand the pain :)
>>
>> Have you seen PigUnit and Penny?
>>
>> http://pig.apache.org/docs/r0.10.0/test.html
>>
>>
>>
>> On Fri, Oct 19, 2012 at 8:09 PM, Yang <[EMAIL PROTECTED]> wrote:
>>
>> > One of the greatest pains I face when debugging Pig code is that the
>> > iteration cycles are really long. The applications for which we use Pig
>> > typically deal with large datasets, and if a Pig script involves many
>> > JOIN/GENERATE/FILTER steps, every step takes a lot of time. But every time
>> > I fix one step, I have to rerun from the start, which is wasted effort.
>> >
>> > What I am doing so far to avoid wasting time re-running
>> > already-debugged steps is to manually divide my script into many small
>> > scripts and save the last variable of each out to HDFS. Once a small
>> > script is debugged, I load that saved variable in the next small script.
>> >
>> > After all the small scripts are done, I manually connect them back into
>> > the original big script.
>> >
>> >
>> > Is there a way to automate this? For example, add a mark around a particular
>> > step that tells Pig the result is to be saved and that all following steps
>> > are not to be executed. Then, when we move on to the next step, Pig knows
>> > where to pick up the last-saved data.
>> >
>> > Writing a preprocessor to do the above is not trivial, so I can't whip
>> > up something immediately, because it needs to figure out the
>> > schemas of the variables that propagate through the steps.
>> >
>> >
>> > Thanks
>> > Yang
>> >
>>
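
The manual split described in the original question can be sketched as a small text preprocessor. This Python sketch is illustrative only. It assumes the script is already available as a list of single-line statements, that later steps depend only on the checkpointed alias, and that your Pig version's PigStorage supports the `-schema` option (which writes a hidden `.pig_schema` file that a later PigStorage load can read back), so the preprocessor does not have to work out schemas itself. The function name `split_at_checkpoint` and the checkpoint path are hypothetical.

```python
def split_at_checkpoint(statements, alias, path):
    """Split a list of Pig statements after the one that defines `alias`.

    Returns (first_script, second_script) as lists of statements.
    The first script ends by storing `alias`; the second starts by loading it.
    """
    for i, stmt in enumerate(statements):
        # An assignment looks like "ALIAS = ...;". Split on the first "=" only,
        # so comparisons such as "x == 1" later in the line are ignored.
        if "=" in stmt and stmt.split("=", 1)[0].strip() == alias:
            break
    else:
        raise ValueError("alias %r is never assigned" % alias)

    # Assumption: PigStorage with '-schema' saves the schema next to the data.
    store = "STORE %s INTO '%s' USING PigStorage('\\t', '-schema');" % (alias, path)
    load = "%s = LOAD '%s' USING PigStorage('\\t');" % (alias, path)
    return statements[: i + 1] + [store], [load] + statements[i + 1:]


# Hypothetical script, for illustration only.
statements = [
    "A = LOAD 'input' AS (x:int, y:int);",
    "B = FILTER A BY x > 0;",
    "C = FOREACH B GENERATE x + y AS total;",
    "STORE C INTO 'output';",
]
first, second = split_at_checkpoint(statements, "B", "/tmp/checkpoint_B")
print("\n".join(first))
print("\n".join(second))
```

A real tool would also need to carry REGISTER and DEFINE statements into the second script and handle steps that read more than one earlier alias. This sketch does neither.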