Pig >> mail # user >> UDF FilterFunc and logical OR


Re: UDF FilterFunc and logical OR
Ok then. We are trying to use pig 0.10.0 now. We hit some errors in
running our tests - but see my new mail for that...

Should I file a bug for the issue we found - just for completeness?

Thanks!

On 23.05.2012 18:20, Jonathan Coveney wrote:
> Thanks for being thorough! It's indeed a bug, but backporting a fix may be
> hard. The parser and logical plan changed a lot from .8-.9, so if at all
> possible, I would try to use 0.10 (the last release). We use it in
> production and it is stable, and has a lot of benefits over .8. I will warn
> that the parser changed so if you have many existing jobs, it may be worth
> running them on a test cluster with 0.10, but if you don't, definitely
> better to make the jump now.
>
> 2012/5/23 Johannes Schwenk <[EMAIL PROTECTED]>
>
>> Hi Jonathan,
>>
>> thanks again for your help!
>>
>> I have cloned the current git head and created this pig script
>> http://pastebin.com/Gc9C9ZPS
>>
>> TestCONTAINS-testFilteringCluster-input.txt contains
>> http://pastebin.com/h5MC695F
>>
>> The adition.jar has been built against the cloudera cdh3u3 distribution
>> and contains the filter function CONTAINS
>> http://pastebin.com/Uwje7v1V
>>
>>
>> Output from running my script with both versions of pig:
>>
>> pig 0.11.0-SNAPSHOT
>> http://pastebin.com/Cr5CkHui
>>
>> => Correct results!!
>>
>>
>> pig 0.8.1-cdh3u3
>> http://pastebin.com/yXY17mXx
>>
>> => Incorrect results!!
>>
>>
>> It seems like the new logical plan in pig 0.8.1 optimizes the OR
>> operator away. So it's a bug, right?
>>
>>
>>
>> On 22.05.2012 21:26, Jonathan Coveney wrote:
>>> If this is a bug, it's an annoying one, so I definitely appreciate your
>>> help in getting to the bottom of it. So let's get to the bottom of it :)
>>>
>>> First, I would clone the trunk version of pig and run the same tests
>>> against it and compare. Always good to test any bugs against trunk to see
>>> if it is version specific.
>>>
>>> Right off the bat, I would say that you should dump the files in your test
>>> to a file, make a short script that does exactly what your test does, and
>>> paste the EXPLAIN plan generated for your script (ideally in both your
>>> version of pig and trunk). We should be able to see if there is something
>>> weird going on.
>>>
>>> Let me know if you need any help with any of that. If it persists I'll try
>>> and recreate on my end.
>>>
>>> 2012/5/22 Johannes Schwenk <[EMAIL PROTECTED]>
>>>
>>>> Thank you for your quick suggestions!
>>>>
>>>> - I am now using local mode - good point!
>>>> - I know of the builtin matches, the CONTAINS filter was just to get into
>>>> programming UDFs...
>>>> - Whatever I do the problem persists. I tried:
>>>>  * turning off all optimizations (-t All) : no effect
>>>>  * reordering the statements : the output still contains only the
>>>> tuples matching the lhs of the OR
>>>>  * using different data (just in case...) : no effect
>>>>  * finally counted how many times the exec() function gets called
>>>> processing the script... : exactly *six times* - once for every record!
>>>>
>>>> That last observation leads me to believe that this is a bug!? The exec
>>>> function should be called at least *ten times* I think.
>>>>
>>>> Do you have any suggestions on how to verify this?
>>>>
>>>> Greetings
>>>>
>>>> On 21.05.2012 19:11, Jonathan Coveney wrote:
>>>>> Not sure why it is failing... though I will mention two things. 1) you
>>>>> should use local mode if possible, especially just to test UDFs :) 2) you
>>>>> could use the builtin matches function to achieve this (ie matches
>>>>> '.*keyword.*')
>>>>>
>>>>> Besides that it is odd indeed, and I'd have to dig in more.
>>>>>
>>>>> 2012/5/21 Johannes Schwenk <[EMAIL PROTECTED]>
>>>>>
>>>>>> Hello List,
>>>>>>
>>>>>> I am using Cloudera's distribution (cdh3u3) which comes with pig-0.8.1.
>>>>>>
>>>>>> I have written a UDF extending FilterFunc that checks if the provided
>>>>>> string is contained within the specified column of the current tuple:

Johannes Schwenk

Software Developer (Reporting)
________________________________________________________

ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg

http://www.adition.com

T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49  / (0)1805 - ADITION

(Landline rate 14 ct/min; mobile rates up to 42 ct/min)

Registered at the Amtsgericht Düsseldorf (district court) under HRB 54076
Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Chairman of the Supervisory Board: Daniel Raimer, attorney-at-law
VAT ID No.: DE 218 858 434
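
The call-count argument made earlier in the thread (six exec() calls seen, at least ten expected) can be checked with a small standalone sketch. It does not use Pig at all: the class OrCallCount, the method names, the keywords and the six sample records are all made up for illustration, and the real input and UDF are only available through the pastebin links above. It also assumes that the OR is evaluated with short-circuiting, so the right-hand side runs only when the left-hand side is false.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class OrCallCount {
    // Counts every invocation, like a counter incremented inside a UDF's exec().
    static final AtomicInteger calls = new AtomicInteger();

    // Hypothetical stand-in for the CONTAINS filter: true if field contains keyword.
    static boolean contains(String field, String keyword) {
        calls.incrementAndGet();
        return field != null && field.contains(keyword);
    }

    // Evaluates "contains(r, a) OR contains(r, b)" per record.
    // Java's || short-circuits: the rhs is only called when the lhs is false.
    static long countMatches(List<String> records, String a, String b) {
        long kept = 0;
        for (String r : records) {
            if (contains(r, a) || contains(r, b)) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented sample data: 2 records match the lhs, 2 match the rhs, 2 match neither.
        List<String> records =
            List.of("foo 1", "foo 2", "bar 3", "bar 4", "baz 5", "baz 6");
        calls.set(0);
        long kept = countMatches(records, "foo", "bar");
        // prints kept=4 calls=10
        System.out.println("kept=" + kept + " calls=" + calls.get());
    }
}
```

With this made-up data, a correctly evaluated OR needs 2 + 4 * 2 = 10 calls for six records. Exactly six calls, one per record, would mean the right-hand side was never evaluated, which is consistent with the output containing only the tuples that match the lhs.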