-
UDF FilterFunc and logical OR
Johannes Schwenk 2012-05-21, 16:36
Hello list,

I am using Cloudera's distribution (cdh3u3), which ships with pig-0.8.1.

I have written a UDF extending FilterFunc that checks whether the provided string is contained in the specified column of the current tuple:
http://pastebin.com/Uwje7v1V

I have also written some test cases:
http://pastebin.com/uA4LHB4Q

The odd thing is that only the test case testFilteringClusterWithOR1 fails: the result has length 2 instead of the expected 3 (line 177 in http://pastebin.com/Uwje7v1V). After a lot of investigating I still cannot work out why testFilteringCluster and testFilteringClusterWithOR2 succeed but testFilteringClusterWithOR1 does not.

Is there a special prerequisite for making my FilterFunc usable within OR? Maybe I have missed something very obvious... Please help me figure this out!

Greetings,
Johannes Schwenk

--
Software Developer (Reporting)
________________________________________________________
ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg
http://www.adition.com
T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49 / (0)1805 - ADITION
(landline rate 14 ct/min; mobile rates at most 42 ct/min)
Registered at the Amtsgericht Düsseldorf under HRB 54076
Executive board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Chairman of the supervisory board: Attorney Daniel Raimer
VAT ID: DE 218 858 434
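As a rough model of the behaviour described in this message (not the UDF from the pastebin link), the Python sketch below shows what a containment filter combined with OR is expected to return, and how a shorter result arises if only the left-hand operand is applied. The records, the keywords and the `contains` helper are all made up for illustration.

```python
# Hypothetical stand-in for the CONTAINS filter: true when `needle`
# occurs in the given column of the record. This is NOT the code from
# the thread, only a model of its described behaviour.
def contains(record, column, needle):
    value = record[column]
    return value is not None and needle in value


# Made-up input: six records with a single string column.
records = [
    ("alpha-cluster",),
    ("beta-cluster",),
    ("gamma-node",),
    ("delta",),
    ("epsilon",),
    ("zeta",),
]

# Expected semantics of a filter of the form
#   CONTAINS(col, 'cluster') OR CONTAINS(col, 'node')
expected = [r for r in records
            if contains(r, 0, "cluster") or contains(r, 0, "node")]

# Result if the right-hand operand of the OR is never applied.
lhs_only = [r for r in records if contains(r, 0, "cluster")]

print(len(expected), len(lhs_only))
```

With this made-up data the two lists differ by exactly one record, which mirrors the "expected 3, got 2" symptom of the failing test.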
-
Re: UDF FilterFunc and logical OR
Jonathan Coveney 2012-05-21, 17:11
Not sure why it is failing, though I will mention two things:

1) You should use local mode if possible, especially just to test UDFs :)
2) You could use the builtin matches operator to achieve this (i.e. matches '.*keyword.*').

Besides that it is odd indeed, and I'd have to dig in more.
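To illustrate why `matches '.*keyword.*'` can stand in for a containment check: as far as I understand it, Pig's `matches` follows Java regex semantics and requires the whole string to match, so the pattern needs `.*` on both sides. The Python sketch below models that with `re.fullmatch`; the helper names `matches` and `contains_pattern` are hypothetical, and `re.escape` plays the role that `Pattern.quote` would play in Java.

```python
import re


def matches(value, pattern):
    # Model of whole-string regex matching, which is an assumption
    # about how Pig's `matches` behaves (Java String.matches style).
    return re.fullmatch(pattern, value) is not None


def contains_pattern(keyword):
    # Escape the keyword so regex metacharacters in it are taken
    # literally, then allow arbitrary text before and after it.
    return ".*" + re.escape(keyword) + ".*"


print(matches("foo keyword bar", contains_pattern("keyword")))
print(matches("foo bar", contains_pattern("keyword")))
```

One caveat of this approach: by default `.` does not match a newline, so a value containing line breaks would need extra care, whereas a plain substring test would not.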
-
Re: UDF FilterFunc and logical OR
Johannes Schwenk 2012-05-22, 16:37
Thank you for your quick suggestions!

- I am now using local mode - good point!
- I know of the builtin matches; the CONTAINS filter was just a way to get into programming UDFs...
- Whatever I do, the problem persists. I tried:
  * turning off all optimizations (-t All): no effect
  * reordering the statements: the outcome still contains only the tuples matching the lhs of the OR
  * using different data (just in case...): no effect
  * finally, counting how many times the exec() function gets called while processing the script: exactly *six times*, once for every record!

That last observation leads me to believe that this is a bug!? I think the exec function should be called at least *ten times*.

Do you have any suggestions on how to verify this?

Greetings
Johannes Schwenk
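The "at least ten" figure is consistent with short-circuit evaluation of OR, under the assumption that there are six records of which two match the left-hand operand: six calls for the left side plus four for the right side. The Python sketch below counts calls for such a case. The data and the `contains` helper are hypothetical, and whether Pig actually short-circuits here is an assumption; without short-circuiting the count would be twelve.

```python
# Counter for how often the predicate is evaluated.
calls = 0


def contains(value, needle):
    # Hypothetical stand-in for the UDF's exec(); counts every call.
    global calls
    calls += 1
    return needle in value


# Made-up data: six values, two of which match the left-hand keyword.
values = ["alpha-cluster", "beta-cluster", "gamma-node",
          "delta", "epsilon", "zeta"]

# Python's `or` short-circuits: the right-hand call only happens
# when the left-hand call returned False.
kept = [v for v in values
        if contains(v, "cluster") or contains(v, "node")]

print(len(kept), calls)
```

A count equal to the number of records, as observed in the thread, would then suggest that only one of the two operands is being evaluated at all.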
-
Re: UDF FilterFunc and logical OR
Jonathan Coveney 2012-05-22, 19:26
If this is a bug, it's an annoying one, so I definitely appreciate your help in getting to the bottom of it. So let's get to the bottom of it :)

First, I would clone the trunk version of pig, run the same tests against it, and compare. It is always good to test any bug against trunk to see if it is version specific.

Right off the bat, I would say that you should dump the data used in your test to a file, make a short script that does exactly what your test does, and paste the EXPLAIN plan generated for your script (ideally in both your version of pig and trunk). We should be able to see if there is something weird going on.

Let me know if you need any help with any of that. If it persists I'll try to recreate it on my end.
-
Re: UDF FilterFunc and logical OR
Johannes Schwenk 2012-05-23, 09:42
Hi Jonathan,

thanks again for your help!

I have cloned the current git head and created this pig script:
http://pastebin.com/Gc9C9ZPS

TestCONTAINS-testFilteringCluster-input.txt contains:
http://pastebin.com/h5MC695F

The adition.jar has been built against the Cloudera cdh3u3 distribution and contains the filter function CONTAINS:
http://pastebin.com/Uwje7v1V

Output from running my script with both versions of pig:

pig 0.11.0-SNAPSHOT
http://pastebin.com/Cr5CkHui
=> Correct results!!

pig 0.8.1-cdh3u3
http://pastebin.com/yXY17mXx
=> Incorrect results!!

It seems like the new logical plan in pig 0.8.1 optimizes the OR operator away. So it's a bug, right?

Johannes Schwenk
-
Re: UDF FilterFunc and logical OR
Jonathan Coveney 2012-05-23, 16:20
Thanks for being thorough! It's indeed a bug, but backporting a fix may be hard. The parser and logical plan changed a lot from 0.8 to 0.9, so if at all possible, I would try to use 0.10 (the latest release). We use it in production and it is stable, and it has a lot of benefits over 0.8. I will warn that the parser changed, so if you have many existing jobs, it may be worth running them on a test cluster with 0.10 first; but if you don't, it is definitely better to make the jump now.
-
Re: UDF FilterFunc and logical OR
Johannes Schwenk 2012-05-24, 12:54
Ok then. We are trying to use pig 0.10.0 now. We hit some errors when running our tests, but see my new mail for that...

Should I file a bug for the issue we found, just for completeness?

Thanks!

Johannes Schwenk
-
Re: UDF FilterFunc and logical OR
Jonathan Coveney 2012-05-24, 16:55
I think that there are a lot of known issues like that in pig 0.8... I don't know that anyone is really actively fixing them. Pig 0.8 is now pretty ancient and a ton of big stuff has changed since then. I'm all about "file a bug for everything," but in this case I don't see us rolling out a new version of 0.8 any time soon.

Can any other committers comment on this?
-
Re: UDF FilterFunc and logical OR
Alan Gates 2012-05-24, 17:15
It's always good to file the bug, if nothing else so people know what land mines are out there instead of spending several days figuring out the problem (like Johannes just had the joy of doing).
Whether there will be a 0.8.3 is a separate question. If some committer feels the need for it and is willing to drive it forward, then it will happen. If not, then not. If some non-committer feels the need for it and is willing to drive it forward, I'm sure one of the committers could be convinced to help.

Alan.