Pig, mail # user - UDF FilterFunc and logical OR


Re: UDF FilterFunc and logical OR
Johannes Schwenk 2012-05-23, 09:42
Hi Jonathan,

thanks again for your help!

I have cloned the current git head and created this pig script
http://pastebin.com/Gc9C9ZPS

TestCONTAINS-testFilteringCluster-input.txt contains
http://pastebin.com/h5MC695F

The adition.jar has been built against the cloudera cdh3u3 distribution
and contains the filter function CONTAINS
http://pastebin.com/Uwje7v1V

Output from running my script with both versions of pig:

pig 0.11.0-SNAPSHOT
http://pastebin.com/Cr5CkHui

=> Correct results!!

pig 0.8.1-cdh3u3
http://pastebin.com/yXY17mXx

=> Incorrect results!!

It seems like the new logical plan in pig 0.8.1 optimizes the OR
operator away. So it's a bug, right?
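
For reference, the call-count reasoning from the quoted message below can be sketched in plain Java, without Pig. This is only an illustration under stated assumptions: the class OrCallCountSketch, its methods, and the six input strings are hypothetical stand-ins (not the pastebin data or the real CONTAINS UDF), and it assumes the OR is evaluated left to right with short-circuiting. Under those assumptions, six records of which two match the first keyword lead to ten predicate calls, not six.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for counting exec() calls of a filter UDF used on
// both sides of an OR. Nothing here uses the Pig API.
class OrCallCountSketch {
    // Incremented on every predicate call, like a counter placed in exec().
    static int execCalls = 0;

    // Stand-in for the CONTAINS filter: true if the field contains the keyword.
    static boolean contains(String field, String keyword) {
        execCalls++;
        return field != null && field.contains(keyword);
    }

    // Keeps records matching either keyword. Java's || short-circuits, so the
    // second check only runs when the first one returns false.
    static List<String> filterWithOr(List<String> records, String first, String second) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (contains(record, first) || contains(record, second)) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up six-record input: two match "alpha", one more matches "beta".
        List<String> records = Arrays.asList(
                "alpha one", "alpha two", "beta one", "gamma", "delta", "epsilon");
        execCalls = 0;
        List<String> kept = filterWithOr(records, "alpha", "beta");
        // Two records need one call each, the other four need two calls each.
        System.out.println(kept.size() + " records kept, " + execCalls + " predicate calls");
    }
}
```

If only the left-hand predicate were ever evaluated, the counter would stop at six and only the two "alpha" records would be kept, which is the pattern described for pig 0.8.1 in this thread.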

On 22.05.2012 21:26, Jonathan Coveney wrote:
> If this is a bug, it's an annoying one, so I definitely appreciate your
> help in getting to the bottom of it. So let's get to the bottom of it :)
>
> First, I would clone the trunk version of pig and run the same tests
> against it and compare. Always good to test any bugs against trunk to see
> if it is version specific.
>
> Right off the bat, I would say that you should dump the files in your test
> to a file, make a short script that does exactly what your test does, and
> paste the EXPLAIN plan generated for your script (ideally in both your
> version of pig and trunk). We should be able to see if there is something
> weird going on.
>
> Let me know if you need any help with any of that. If it persists I'll try
> and recreate on my end.
>
> 2012/5/22 Johannes Schwenk <[EMAIL PROTECTED]>
>
>> Thank you for your quick suggestions!
>>
>> - I am now using local mode - good point!
>> - I know of builtin matches, the CONTAINS filter was just to get into
>> programming UDFs...
>> - Whatever I do the problem persists. I tried:
>>  * turning off all optimizations (-t All) : no effect
>>  * reordering the statements : the outcome still contains only the
>> tuples matching the lhs of the OR
>>  * using different data (just in case...) : no effect
>>  * finally counted how many times the exec() function gets called
>> processing the script... : exactly *six times* - once for every record!
>>
>> That last observation leads me to believe that this is a bug!? The exec
>> function should be called at least *ten times* I think.
>>
>> Do you have any suggestions on how to verify this?
>>
>> Greetings
>>
>> On 21.05.2012 19:11, Jonathan Coveney wrote:
>>> Not sure why it is failing... though I will mention two things. 1) you
>>> should use local mode if possible, especially just to test UDFs :) 2) you
>>> could use the builtin matches function to achieve this (ie matches
>>> '.*keyword.*')
>>>
>>> Besides that it is odd indeed, and I'd have to dig in more.
>>>
>>> 2012/5/21 Johannes Schwenk <[EMAIL PROTECTED]>
>>>
>>>> Hello List,
>>>>
>>>> I am using Cloudera's distribution (cdh3u3) which comes with pig-0.8.1.
>>>>
>>>> I have written a UDF extending FilterFunc that checks if the provided
>>>> string is contained within the specified column of the current tuple:
>>>> http://pastebin.com/Uwje7v1V
>>>>
>>>> I have also written some TestCases:
>>>> http://pastebin.com/uA4LHB4Q
>>>>
>>>> The odd thing is that only the test case testFilteringClusterWithOR1 fails,
>>>> because the result is of length 2 instead of the expected length 3
>>>> (line 177 in http://pastebin.com/Uwje7v1V). After a lot of
>>>> investigating I still cannot find out why testFilteringCluster and
>>>> testFilteringClusterWithOR2 succeed but not testFilteringClusterWithOR1.
>>>> Is there a special prerequisite for making my FilterFunc usable within
>>>> OR? Maybe I have missed something very obvious... Please help me figure
>>>> this out!
>>>>
>>>> Greetings,
>>>> Johannes Schwenk
>>>>
>>>> --
>>>> Software Developer (Reporting)
>>>> ________________________________________________________
>>>>
>>>> ADITION technologies AG
>>>> Schwarzwaldstraße 78b
>>>> 79117 Freiburg
>>>>
>>>> http://www.adition.com
>>>>
>>>> T +49 / (0)761 / 88147 - 30

Johannes Schwenk

Software Developer (Reporting)
________________________________________________________

ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg

http://www.adition.com

T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49  / (0)1805 - ADITION

(Landline rate 14 ct/min; mobile rates at most 42 ct/min)

Registered at Amtsgericht Düsseldorf (district court) under HRB 54076
Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Chairman of the Supervisory Board: Daniel Raimer, attorney at law
VAT ID No.: DE 218 858 434