Pig >> mail # user >> UDF FilterFunc and logical OR


Johannes Schwenk 2012-05-21, 16:36
Jonathan Coveney 2012-05-21, 17:11
Johannes Schwenk 2012-05-22, 16:37
Jonathan Coveney 2012-05-22, 19:26
Re: UDF FilterFunc and logical OR
Hi Jonathan,

thanks again for your help!

I have cloned the current git head and created this Pig script:
http://pastebin.com/Gc9C9ZPS

TestCONTAINS-testFilteringCluster-input.txt contains
http://pastebin.com/h5MC695F

The adition.jar has been built against the cloudera cdh3u3 distribution
and contains the filter function CONTAINS
http://pastebin.com/Uwje7v1V
Output from running my script with both versions of pig:

pig 0.11.0-SNAPSHOT
http://pastebin.com/Cr5CkHui

=> Correct results!!
pig 0.8.1-cdh3u3
http://pastebin.com/yXY17mXx

=> Incorrect results!!
It seems like the new logical plan in Pig 0.8.1 optimizes the OR
operator away. So it's a bug, right?
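
To make the call-count argument from the quoted message below concrete, here is a minimal plain-Java sketch. It does not use the Pig API: the contains() method is only a hypothetical stand-in for the exec() method of the CONTAINS UDF, the six records are made-up data (not the pastebin input), and it assumes the OR is evaluated with short-circuit semantics, so the right-hand side runs only when the left-hand side is false.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OrCallCount {
    // Counts every invocation of contains(), standing in for calls to exec().
    static int calls = 0;

    // Hypothetical stand-in for the CONTAINS UDF: true if field contains needle.
    static boolean contains(String field, String needle) {
        calls++;
        return field != null && field.contains(needle);
    }

    // Keeps records where contains(r, lhs) OR contains(r, rhs) holds.
    // Java's || short-circuits: rhs is only checked when lhs is false.
    static List<String> filterWithOr(List<String> records, String lhs, String rhs) {
        List<String> kept = new ArrayList<>();
        for (String r : records) {
            if (contains(r, lhs) || contains(r, rhs)) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up data: six records, two match "foo", one matches "bar".
        List<String> records = Arrays.asList(
            "foo one", "foo two", "bar three", "baz four", "qux five", "quux six");
        List<String> kept = filterWithOr(records, "foo", "bar");
        // Two records match the lhs (1 call each); four do not (2 calls each).
        System.out.println("kept=" + kept.size() + " calls=" + calls); // kept=3 calls=10
    }
}
```

Under these assumptions a correct OR yields three records and ten calls. If only the left-hand side were evaluated, the same data would give two records and six calls, which is the pattern reported for pig 0.8.1-cdh3u3.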

On 22.05.2012 21:26, Jonathan Coveney wrote:
> If this is a bug, it's an annoying one, so I definitely appreciate your
> help in getting to the bottom of it. So let's get to the bottom of it :)
>
> First, I would clone the trunk version of pig and run the same tests
> against it and compare. Always good to test any bugs against trunk to see
> if it is version specific.
>
> Right off the bat, I would say that you should dump the files in your test
> to a file, make a short script that does exactly what your test does, and
> paste the EXPLAIN plan generated for your script (ideally in both your
> version of pig and trunk). We should be able to see if there is something
> weird going on.
>
> Let me know if you need any help with any of that. If it persists I'll try
> and recreate on my end.
>
> 2012/5/22 Johannes Schwenk <[EMAIL PROTECTED]>
>
>> Thank you for your quick suggestions!
>>
>> - I am now using local mode - good point!
>> - I know of the builtin matches; the CONTAINS filter was just a way to
>> get into programming UDFs...
>> - Whatever I do the problem persists. I tried:
>>  * turning off all optimizations (-t All) : no effect
>>  * reordering the statements : the outcome still contains only the
>> tuples matching the LHS of the OR
>>  * using different data (just in case...) : no effect
>>  * finally counted how many times the exec() function gets called
>> processing the script... : exactly *six times* - once for every record!
>>
>> That last observation leads me to believe that this is a bug!? The exec
>> function should be called at least *ten times* I think.
>>
>> Do you have any suggestions on how to verify this?
>>
>> Greetings
>>
>> On 21.05.2012 19:11, Jonathan Coveney wrote:
>>> Not sure why it is failing... though I will mention two things. 1) you
>>> should use local mode if possible, especially just to test UDFs :) 2) you
>>> could use the builtin matches function to achieve this (i.e. matches
>>> '.*keyword.*')
>>>
>>> Besides that it is odd indeed, and I'd have to dig in more.
>>>
>>> 2012/5/21 Johannes Schwenk <[EMAIL PROTECTED]>
>>>
>>>> Hello List,
>>>>
>>>> I am using Cloudera's distribution (cdh3u3), which comes with pig-0.8.1.
>>>>
>>>> I have written a UDF extending FilterFunc that checks if the provided
>>>> string is contained within the specified column of the current tuple:
>>>> http://pastebin.com/Uwje7v1V
>>>>
>>>> I have also written some test cases:
>>>> http://pastebin.com/uA4LHB4Q
>>>>
>>>> The odd thing is that only the test case testFilteringClusterWithOR1
>>>> fails, because the result does not have the expected length of 3 but has
>>>> length 2 instead (line 177 in http://pastebin.com/Uwje7v1V). After a lot of
>>>> investigating I still cannot find out why testFilteringCluster and
>>>> testFilteringClusterWithOR2 succeed but testFilteringClusterWithOR1 does
>>>> not. Is there a special prerequisite for making my FilterFunc usable within
>>>> OR? Maybe I have missed something very obvious... Please help me figure
>>>> this out!
>>>>
>>>> Greetings,
>>>> Johannes Schwenk
>>>>
>>>> --
>>>> Software Developer (Reporting)
>>>> ________________________________________________________
>>>>
>>>> ADITION technologies AG
>>>> Schwarzwaldstraße 78b
>>>> 79117 Freiburg
>>>>
>>>> http://www.adition.com
>>>>
>>>> T +49 / (0)761 / 88147 - 30

Johannes Schwenk

Software Developer (Reporting)
________________________________________________________

ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg

http://www.adition.com

T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49  / (0)1805 - ADITION

(Landline rate 14 ct/min; mobile rates at most 42 ct/min)

Registered at the Amtsgericht Düsseldorf under HRB 54076
Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Chairman of the Supervisory Board: Daniel Raimer, attorney
VAT ID No.: DE 218 858 434
Jonathan Coveney 2012-05-23, 16:20
Johannes Schwenk 2012-05-24, 12:54
Jonathan Coveney 2012-05-24, 16:55
Alan Gates 2012-05-24, 17:15