Re: UDF FilterFunc and logical OR
Johannes Schwenk 2012-05-24, 12:54
Ok then. We are trying to use pig 0.10.0 now. We hit some errors in
running our tests - but see my new mail for that...

Should I file a bug for the found issue - just for completeness?

Thanks!

On 23.05.2012 18:20, Jonathan Coveney wrote:
> Thanks for being thorough! It's indeed a bug, but backporting a fix may be
> hard. The parser and logical plan changed a lot from .8-.9, so if at all
> possible, I would try to use 0.10 (the last release). We use it in
> production and it is stable, and has a lot of benefits over .8. I will warn
> that the parser changed so if you have many existing jobs, it may be worth
> running them on a test cluster with 0.10, but if you don't, definitely
> better to make the jump now.
>
> 2012/5/23 Johannes Schwenk <[EMAIL PROTECTED]>
>
>> Hi Jonathan,
>>
>> thanks again for your help!
>>
>> I have cloned the current git head and created this pig script
>> http://pastebin.com/Gc9C9ZPS
>>
>> TestCONTAINS-testFilteringCluster-input.txt contains
>> http://pastebin.com/h5MC695F
>>
>> The adition.jar has been built against the cloudera cdh3u3 distribution
>> and contains the filter function CONTAINS
>> http://pastebin.com/Uwje7v1V
>>
>>
>> Output from running my script with both versions of pig:
>>
>> pig 0.11.0-SNAPSHOT
>> http://pastebin.com/Cr5CkHui
>>
>> => Correct results!!
>>
>>
>> pig 0.8.1-cdh3u3
>> http://pastebin.com/yXY17mXx
>>
>> => Incorrect results!!
>>
>>
>> It seems like the new logical plan in pig 0.8.1 optimizes the OR
>> operator away. So it's a bug, right?
>>
>>
>>
>> On 22.05.2012 21:26, Jonathan Coveney wrote:
>>> If this is a bug, it's an annoying one, so I definitely appreciate your
>>> help in getting to the bottom of it. So let's get to the bottom of it :)
>>>
>>> First, I would clone the trunk version of pig and run the same tests
>>> against it and compare. Always good to test any bugs against trunk to see
>>> if it is version specific.
>>>
>>> Right off the bat, I would say that you should dump the files in your
>>> test to a file, make a short script that does exactly what your test
>>> does, and paste the EXPLAIN plan generated for your script (ideally in
>>> both your version of pig and trunk). We should be able to see if there
>>> is something weird going on.
>>>
>>> Let me know if you need any help with any of that. If it persists I'll
>>> try and recreate on my end.
>>>
>>> 2012/5/22 Johannes Schwenk <[EMAIL PROTECTED]>
>>>
>>>> Thank you for your quick suggestions!
>>>>
>>>> - I am now using local mode - good point!
>>>> - I know of builtin matches, the CONTAINS filter was just to get into
>>>>   programming UDFs...
>>>> - Whatever I do the problem persists. I tried:
>>>>   * turning off all optimizations (-t All) : no effect
>>>>   * reordering the statements : the outcome still contains only the
>>>>     tuples matching the lhs of the OR
>>>>   * using different data (just in case...) : no effect
>>>>   * finally counted how many times the exec() function gets called
>>>>     processing the script... : exactly *six times* - once for every
>>>>     record!
>>>>
>>>> That last observation leads me to believe that this is a bug!? The exec
>>>> function should be called at least *ten times* I think.
>>>>
>>>> Do you have any suggestions on how to verify this?
>>>>
>>>> Greetings
>>>>
>>>> On 21.05.2012 19:11, Jonathan Coveney wrote:
>>>>> Not sure why it is failing... though I will mention two things. 1) you
>>>>> should use local mode if possible, especially just to test UDFs :) 2)
>>>>> you could use the builtin matches function to achieve this (ie matches
>>>>> '.*keyword.*')
>>>>>
>>>>> Besides that it is odd indeed, and I'd have to dig in more.
>>>>>
>>>>> 2012/5/21 Johannes Schwenk <[EMAIL PROTECTED]>
>>>>>
>>>>>> Hello List,
>>>>>>
>>>>>> I am using Cloudera's distribution (cdh3u3) which comes with pig-0.8.1.
>>>>>>
>>>>>> I have written a UDF extending FilterFunc that checks if the provided
>>>>>> string is contained within the specified column of the current tuple:

Johannes Schwenk
Software Developer (Reporting)
________________________________________________________
ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg
http://www.adition.com
T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49 / (0)1805 - ADITION
(landline rate 14 ct/min; mobile rates at most 42 ct/min)
Registered at the Amtsgericht Düsseldorf under HRB 54076
Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Chairman of the Supervisory Board: Attorney Daniel Raimer
VAT ID: DE 218 858 434
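
A note on the exec() call count discussed in the thread: the reasoning can be sketched in plain Java, without the Pig API. This is only an illustration under assumptions. It assumes the OR is evaluated with short-circuiting, and the class name, method names, keywords and the six sample records are all invented here (the real input is only in the pastebin links). Under that assumption, an OR of two filter calls runs exec() once per record for the left side, plus once more for every record the left side rejects. For six records that gives between 6 and 12 calls, and exactly 6 only if every record matches the left side or the right side is never evaluated.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the CONTAINS filter from the thread.
// It does NOT use the Pig API; all names here are hypothetical.
class OrCallCountSketch {

    // Counts how often exec() has been invoked.
    static int execCalls = 0;

    // Mimics a FilterFunc-style exec(): true if the field contains the keyword.
    static boolean exec(String field, String keyword) {
        execCalls++;
        return field != null && field.contains(keyword);
    }

    // Evaluates "exec(r, left) OR exec(r, right)" for every record,
    // with short-circuiting (an assumption), and returns the number kept.
    static int filterWithOr(List<String> records, String left, String right) {
        int kept = 0;
        for (String r : records) {
            if (exec(r, left) || exec(r, right)) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented sample data: two records match "foo", one matches "bar".
        List<String> records =
            Arrays.asList("foo1", "foo2", "bar1", "baz1", "baz2", "baz3");
        int kept = filterWithOr(records, "foo", "bar");
        // 2 records stop after the left call, 4 need both calls: 2 + 8 = 10.
        System.out.println("kept=" + kept + " execCalls=" + execCalls);
    }
}
```

With this invented data the sketch prints kept=3 execCalls=10. A count of exactly six calls for six records, together with output that matches only the left-hand side, is what one would expect if the right-hand side of the OR were never evaluated.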