Pig, mail # user - UDF FilterFunc and logical OR


Johannes Schwenk 2012-05-21, 16:36
Jonathan Coveney 2012-05-21, 17:11
Johannes Schwenk 2012-05-22, 16:37
Jonathan Coveney 2012-05-22, 19:26
Johannes Schwenk 2012-05-23, 09:42
Jonathan Coveney 2012-05-23, 16:20

Re: UDF FilterFunc and logical OR
Johannes Schwenk 2012-05-24, 12:54
Ok then. We are trying to use pig 0.10.0 now. We hit some errors in
running our tests - but see my new mail for that...

Should I file a bug for the issue we found, just for completeness?

Thanks!

On 23.05.2012 18:20, Jonathan Coveney wrote:
> Thanks for being thorough! It's indeed a bug, but backporting a fix may be
> hard. The parser and logical plan changed a lot from .8-.9, so if at all
> possible, I would try to use 0.10 (the last release). We use it in
> production and it is stable, and has a lot of benefits over .8. I will warn
> that the parser changed so if you have many existing jobs, it may be worth
> running them on a test cluster with 0.10, but if you don't, definitely
> better to make the jump now.
>
> 2012/5/23 Johannes Schwenk <[EMAIL PROTECTED]>
>
>> Hi Jonathan,
>>
>> thanks again for your help!
>>
>> I have cloned the current git head and created this pig script
>> http://pastebin.com/Gc9C9ZPS
>>
>> TestCONTAINS-testFilteringCluster-input.txt contains
>> http://pastebin.com/h5MC695F
>>
>> The adition.jar has been built against the cloudera cdh3u3 distribution
>> and contains the filter function CONTAINS
>> http://pastebin.com/Uwje7v1V
>>
>>
>> Output from running my script with both versions of pig:
>>
>> pig 0.11.0-SNAPSHOT
>> http://pastebin.com/Cr5CkHui
>>
>> => Correct results!!
>>
>>
>> pig 0.8.1-cdh3u3
>> http://pastebin.com/yXY17mXx
>>
>> => Incorrect results!!
>>
>>
>> It seems like the new logical plan in pig 0.8.1 optimizes the OR
>> operator away. So it's a bug, right?
>>
>>
>>
>> On 22.05.2012 21:26, Jonathan Coveney wrote:
>>> If this is a bug, it's an annoying one, so I definitely appreciate your
>>> help in getting to the bottom of it. So let's get to the bottom of it :)
>>>
>>> First, I would clone the trunk version of pig and run the same tests
>>> against it and compare. Always good to test any bugs against trunk to see
>>> if it is version specific.
>>>
>>> Right off the bat, I would say that you should dump the files in your test
>>> to a file, make a short script that does exactly what your test does, and
>>> paste the EXPLAIN plan generated for your script (ideally in both your
>>> version of pig and trunk). We should be able to see if there is something
>>> weird going on.
>>>
>>> Let me know if you need any help with any of that. If it persists I'll try
>>> and recreate on my end.
>>>
>>> 2012/5/22 Johannes Schwenk <[EMAIL PROTECTED]>
>>>
>>>> Thank you for your quick suggestions!
>>>>
>>>> - I am now using local mode - good point!
>>>> - I know of the builtin matches, the CONTAINS filter was just to get into
>>>> programming UDFs...
>>>> - Whatever I do, the problem persists. I tried:
>>>>  * turning off all optimizations (-t All) : no effect
>>>>  * reordering the statements : the output still contains only the
>>>> tuples matching the lhs of the OR
>>>>  * using different data (just in case...) : no effect
>>>>  * finally counted how many times the exec() function gets called
>>>> processing the script... : exactly *six times*, once for each record!
>>>>
>>>> That last observation leads me to believe that this is a bug!? The exec
>>>> function should be called at least *ten times* I think.
>>>>
>>>> Do you have any suggestions on how to verify this?
>>>>
>>>> Greetings
>>>>
>>>> On 21.05.2012 19:11, Jonathan Coveney wrote:
>>>>> Not sure why it is failing... though I will mention two things. 1) you
>>>>> should use local mode if possible, especially just to test UDFs :) 2) you
>>>>> could use the builtin matches function to achieve this (ie matches
>>>>> '.*keyword.*')
>>>>>
>>>>> Besides that it is odd indeed, and I'd have to dig in more.
>>>>>
>>>>> 2012/5/21 Johannes Schwenk <[EMAIL PROTECTED]>
>>>>>
>>>>>> Hello List,
>>>>>>
>>>>>> I am using Cloudera's distribution (cdh3u3) which comes with pig-0.8.1.
>>>>>>
>>>>>> I have written a UDF extending FilterFunc that checks if the provided
>>>>>> string is contained within the specified column of the current tuple:
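
The UDF source is not included in this archived message (it lives in the pastebin links above). Purely as an illustration, here is a plain-Java sketch, with no Pig dependencies, of the check described and of the exec() call counting mentioned earlier in the thread. The class name, method names, and sample rows are hypothetical, and the sketch assumes Java-style short-circuit evaluation of OR, which may not match what a given Pig version does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the thread's CONTAINS filter; not the author's code.
public class ContainsFilterSketch {
    // Counts predicate invocations, similar to counting exec() calls on a UDF.
    static final AtomicInteger calls = new AtomicInteger();

    // True if the field contains the keyword; null inputs never match.
    static boolean contains(String field, String keyword) {
        calls.incrementAndGet();
        return field != null && keyword != null && field.contains(keyword);
    }

    // Plain-Java analogue of: FILTER rows BY CONTAINS(row, a) OR CONTAINS(row, b).
    // Java's || short-circuits, so the second check only runs when the first fails.
    static List<String> filterEither(List<String> rows, String a, String b) {
        List<String> kept = new ArrayList<>();
        for (String row : rows) {
            if (contains(row, a) || contains(row, b)) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample data, not the input file from the thread.
        List<String> rows = Arrays.asList(
                "apple pie", "banana split", "cherry tart", "apple banana");
        calls.set(0);
        List<String> kept = filterEither(rows, "apple", "banana");
        System.out.println("kept:  " + kept);
        System.out.println("calls: " + calls.get());
    }
}
```

Under that short-circuit assumption, a correct OR over n records calls the predicate somewhere between n and 2n times, and rows matching only the right-hand side must appear in the output. A result that only ever contains left-hand matches points at the OR branch being dropped rather than at the predicate itself.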

Johannes Schwenk

Software Developer (Reporting)
________________________________________________________

ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg

http://www.adition.com

T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49  / (0)1805 - ADITION

(Landline rate 14 ct/min; mobile rates at most 42 ct/min)

Registered at the Düsseldorf District Court (Amtsgericht Düsseldorf) under HRB 54076
Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Chairman of the Supervisory Board: Attorney Daniel Raimer
VAT ID: DE 218 858 434

Jonathan Coveney 2012-05-24, 16:55
Alan Gates 2012-05-24, 17:15