@Deterministic was added after many of the UDFs some that are
deterministic may not be marked as such.
On Tue, May 1, 2012 at 9:38 AM, Ramkumar <[EMAIL PROTECTED]> wrote:
> Thanks. Yes, thats what the UDF writers seemed to have intended. The way
> this has been done is by using the 'deterministic' attribute in UDFType
> annotation.
> See
>
http://javasourcecode.org/html/open-source/hive/hive-0.7.1/org/apache/hadoop/hive/ql/udf/UDFUnixTimeStamp.java.html>
> ________________________________
> From: Philip Tromans <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Thursday, April 19, 2012 4:38 PM
> Subject: Re: nested UDFs on Partition column
>
> In fact, it's just not a reasonable thing to do partition pruning on.
> Imagine a situation where you had:
>
> WHERE partition_column = f(unix_timestamp()) AND ordinary_column > f(unix_timestamp).
>
> The right hand side of the predicate has to be evaluated at map-time,
> whereas you're assuming that left hand side should be evaluated at
> compile time, which means you have two different values of
> unix_timestamp() floating around, which can only end badly.
>
> Cheers,
>
> Phil.
>
> On 19 April 2012 16:35, Philip Tromans <[EMAIL PROTECTED]> wrote:
>> I don't know what the state of Hive's partition pruning is, but I
>> would imagine that the problem is that the two example you're giving
>> are fundamentally different.
>>
>> 1) WHERE local_date = =date_add('2011-12-07',3) ,
>>
>> the udf is a function of some constants, so the constant gets
>> evaluated at compile time.
>>
>> 2) WHERE local_date >> date_sub(to_date(from_unixtime(unix_timestamp())),3),
>>
>> unix_timestamp() is not explicitly a constant, and so the expression
>> won't be simplified.
>>
>> I would imagine that the constant simplification code probably doesn't
>> know the difference between a partition column and a real column, and
>> so treats everything as a real column. If local_date wasn't a
>> partition column, then there's no reasonable way of simplifying that
>> predicate at compile time.
>>
>> Cheers,
>>
>> Phil.
>>
>> On 19 April 2012 11:50, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>>> as per my understanding,
>>>
>>> In this case hive needs to look for all the partitions because it does
>>> not
>>> have the value before hand on the partition check and note the udfs are
>>> executed on the mapred and not on hive client side.
>>>
>>> I would suggest you write a hive query in a file and replace the
>>> partition
>>> value with a variable
>>> something like
>>>
>>> for partitionValue in values
>>>
>>> hive $HIVEPARAMS -hiveconf partition=$partition -e
>>> hivequery.hql
>>>
>>> and then in hivequery.sql you can refer the variable with
>>>
>>> where column_name = '${hiveconf:partition}'
>>>
>>> I may be wrong in interpreting the execution pattern of hivequery but
>>> this
>>> approach solved my problem
>>> Thanks,
>>> nitin
>>>
>>>
>>> On Thu, Apr 19, 2012 at 3:27 PM, Ramkumar <[EMAIL PROTECTED]>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have a table partitioned by local_date. When I write a query with
>>>>
>>>> WHERE local_date = =date_add('2011-12-07',3) ,
>>>>
>>>> hive executes the UDF ahead and looks only into the specific partitions.
>>>> But when the udf becomes more complex like
>>>>
>>>> WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>>>>
>>>> hive looks through all the partitions even though the above function
>>>> can
>>>> very well be computed ahead of time and optimize the query. Is this
>>>> behaviour intentional ? And is there a workaround other than hardcoding
>>>> the
>>>> date or using a param?
>>>>
>>>> Thanks,
>>>> Ramkumar
>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>
>