-
debug feature?? Yang 2012-10-19, 09:09
One of the greatest pains I face when debugging Pig code is that the iteration cycles are really long. The applications we use Pig for typically deal with large datasets, and if a Pig script involves many JOIN/generate/filter steps, every step takes a lot of time. Yet every time I fix one step, I have to rerun from the start, which is wasted effort.

What I am doing so far to avoid re-running already-debugged steps is to manually divide my script into many small scripts and save the last variable of each out to HDFS. Once a small script is debugged, I load that saved variable in the next small script. After all the small scripts are done, I manually connect them back into the original big script.

Is there a way to automate this? For example, add a mark around a particular step that tells Pig the result is to be saved and all following steps are not to be executed. Then, when we move on to the next step, Pig knows where to pick up the last-saved data.

Writing a preprocessor to do the above is not trivial, so I can't whip something up immediately, because it needs to figure out the schemas of the variables that propagate through the steps.

Thanks
Yang
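Editorial illustration: the manual split-and-checkpoint workflow described in the message above might look roughly like this in Pig Latin. It is only a sketch, not the poster's actual code. The file names, paths, relation names, and fields are all hypothetical, and the two parts are meant to live in separate script files.

```pig
-- step1.pig (hypothetical): run the expensive join once and checkpoint the result.
users  = LOAD '/data/users'  USING PigStorage('\t') AS (id:long, name:chararray);
events = LOAD '/data/events' USING PigStorage('\t') AS (user_id:long, action:chararray);
joined = JOIN users BY id, events BY user_id;
-- Flatten to plain field names so the checkpoint is easy to reload.
flat   = FOREACH joined GENERATE users::id AS id, users::name AS name,
                                 events::action AS action;
STORE flat INTO '/tmp/debug/flat' USING PigStorage('\t');

-- step2.pig (hypothetical): resume from the checkpoint instead of re-joining.
-- The schema has to be restated by hand, which is the bookkeeping the
-- message complains about.
flat   = LOAD '/tmp/debug/flat' USING PigStorage('\t')
         AS (id:long, name:chararray, action:chararray);
clicks = FILTER flat BY action == 'click';
STORE clicks INTO '/tmp/debug/clicks' USING PigStorage('\t');
```

Some Pig releases also document a '-schema' option for PigStorage that saves the schema next to the data. Whether it is available depends on the version, so check the documentation before relying on it.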
-
Re: debug feature?? Jagat Singh 2012-10-19, 09:18
Hello,

I understand the pain :)

Have you seen PigUnit and Penny?

http://pig.apache.org/docs/r0.10.0/test.html
-
Re: debug feature?? Yang 2012-10-19, 19:01
I am using PigUnit, but it's somewhat limited. It can run only in local mode, so I can't find issues that only show up with fairly large test data. You also have to create small snippets of code that you cut out manually from your original code, so after you have tested a snippet, you have to copy-paste it back into the production code, which introduces possible copy-paste errors. Compared to Java JUnit, this is really very crude: in Java, you have a class, and you can run JUnit tests on individual methods of the class instead of having to copy-paste and create a special "test version" of that class.

Overall, I feel that testability is an area where Pig could spend a lot more effort, and it would greatly benefit its wider adoption. Some other tools (Cascading, Cascalog, etc.) advertise testability as one of their important features.

Let me check out Penny... thanks
-
Re: debug feature?? Dmitriy Ryaboy 2012-10-23, 00:32
Some testing tips:

1) Parametrize your load/store statements so that if you have to run in hadoop mode, it's easy to switch to debug inputs / outputs (and debug input/output loaders and storers). It's vastly preferable to test in local mode when possible, since the iterations are so much faster.

2) It's a good thing that PigUnit makes you test small pieces of code! Factor out macros so that you can create unit tests; don't copy and paste code, use macros and the import statement.

3) Try using mock.Storage (see https://issues.apache.org/jira/browse/PIG-2650) to automatically create inputs and examine outputs in your unit tests, if you are on Pig 0.11.

D
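Editorial illustration: tips 1 and 2 above could be combined along the following lines. This is a sketch under assumptions, not code from the thread. The file names, the macro name filter_clicks, the parameter names INPUT and OUTPUT, and the fields are all hypothetical. Parameters are used only at the top level of the script, and the macro receives its input relation as an explicit argument.

```pig
-- macros/filter_clicks.pig (hypothetical macro file): one small, testable step.
DEFINE filter_clicks(events) RETURNS clicks {
    $clicks = FILTER $events BY action == 'click';
};

-- sessions.pig (hypothetical main script).
-- $INPUT and $OUTPUT are supplied on the command line, for example:
--   pig -x local -param INPUT=test/sample.tsv -param OUTPUT=/tmp/out sessions.pig
--   pig -param INPUT=/data/events -param OUTPUT=/results/clicks sessions.pig
IMPORT 'macros/filter_clicks.pig';

events = LOAD '$INPUT' USING PigStorage('\t') AS (user_id:long, action:chararray);
clicks = filter_clicks(events);
STORE clicks INTO '$OUTPUT' USING PigStorage('\t');
```

Because the macro lives in its own file, a unit test can import the same file that production uses, so nothing has to be copied and pasted between the two.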
-
Re: debug feature?? Yang 2012-10-23, 18:11
Nice, thanks.

Macros and mock.Storage() are both new to me; I believe they will help a lot.
-
Re: debug feature?? Yang 2012-11-07, 21:05
OK, I found this practice to be useful:

I divide my code into sections, each implemented as a macro, and debug each macro separately. At the end of each macro, I manually write its output vars into tmp storage. Then, for each macro, I write a corresponding "***_fake.pig" macro, which has the same signature but populates the same return vars by loading them from the tmp storage. After I am done with one section, I swap the IMPORT statement to import the ***_fake.pig script instead, so that the same computation is not done again.
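Editorial illustration: the real-versus-fake macro practice described above might be laid out as follows. This is a sketch only. The three file names, the macro name clean_events, the checkpoint path, and the fields are hypothetical, and the poster's actual scripts were not shared.

```pig
-- clean_events.pig (hypothetical): the real section, debugged on its own.
DEFINE clean_events(events) RETURNS cleaned {
    $cleaned = FILTER $events BY user_id IS NOT NULL AND action IS NOT NULL;
};

-- clean_events_fake.pig (hypothetical): same name and signature, but the
-- result is read back from temporary storage instead of being recomputed.
DEFINE clean_events(events) RETURNS cleaned {
    $cleaned = LOAD '/tmp/debug/cleaned' USING PigStorage('\t')
               AS (user_id:long, action:chararray);
};

-- main.pig (hypothetical): only the IMPORT line changes between runs.
-- IMPORT 'clean_events.pig';        -- while debugging this section
IMPORT 'clean_events_fake.pig';      -- once this section is known to be good

events  = LOAD '/data/events' USING PigStorage('\t') AS (user_id:long, action:chararray);
cleaned = clean_events(events);
-- While the real macro is active, checkpoint its output for the fake to load:
-- STORE cleaned INTO '/tmp/debug/cleaned' USING PigStorage('\t');
by_user = GROUP cleaned BY user_id;
counts  = FOREACH by_user GENERATE group AS user_id, COUNT(cleaned) AS n;
STORE counts INTO '/tmp/debug/counts' USING PigStorage('\t');
```

The fake macro has to declare the same field names and types that the real one produces, otherwise the downstream statements stop matching. As a later reply in this thread points out, a stale checkpoint is also not detected automatically if the real macro or a UDF changes.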
-
Re: debug feature?? Ruslan Al-Fakikh 2012-10-22, 12:55
As for:

> the best scenario is to put a "marker" so that certain variables are stored or
> skipped computation but instead LOADed

I remember there was some discussion on this in the past. Actually, this is not trivial. What would it do if you changed a UDF's internal code, for example? How would it know that it should reprocess instead of load? As far as I remember, some other problems were mentioned as well.

Ruslan
-
Re: debug feature?? Ruslan Al-Fakikh 2012-10-19, 13:04
Hi,

Ideally, you would first test with a small amount of data in local mode and then run the script on the big data to verify correctness. If this is not possible, you can store a relation at any point of your script with a STORE statement, so as not to lose intermediate results. Then you can remove the STOREs after debugging.

Best Regards, Ruslan
-
Re: debug feature?? Yang 2012-10-19, 18:57
Yes, this is what I'm doing, but manually adding and removing the STORE and LOAD commands is difficult and, more importantly, risks introducing bugs during the code change. The best scenario would be to put a "marker" so that certain variables are stored, or their computation is skipped and they are LOADed instead.
|