Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> nested UDFs on Partition column


Copy link to this message
-
Re: nested UDFs on Partition column
In fact, it's just not a reasonable thing to do partition pruning on.
Imagine a situation where you had:

WHERE partition_column = f(unix_timestamp()) AND ordinary_column f(unix_timestamp).

The right hand side of the predicate has to be evaluated at map-time,
whereas you're assuming that left hand side should be evaluated at
compile time, which means you have two different values of
unix_timestamp() floating around, which can only end badly.

Cheers,

Phil.

On 19 April 2012 16:35, Philip Tromans <[EMAIL PROTECTED]> wrote:
> I don't know what the state of Hive's partition pruning is, but I
> would imagine that the problem is that the two example you're giving
> are fundamentally different.
>
> 1) WHERE local_date = =date_add('2011-12-07',3) ,
>
> the udf is a function of some constants, so the constant gets
> evaluated at compile time.
>
> 2) WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>
> unix_timestamp() is not explicitly a constant, and so the expression
> won't be simplified.
>
> I would imagine that the constant simplification code probably doesn't
> know the difference between a partition column and a real column, and
> so treats everything as a real column. If local_date wasn't a
> partition column, then there's no reasonable way of simplifying that
> predicate at compile time.
>
> Cheers,
>
> Phil.
>
> On 19 April 2012 11:50, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>> as per my understanding,
>>
>> In this case hive needs to look for all the partitions because it does not
>> have the value before hand on the partition check and note the udfs are
>> executed on the mapred and not on hive client side.
>>
>> I would suggest you write a hive query in a file and replace the partition
>> value with a variable
>> something like
>>
>> for partitionValue in values
>>
>>           hive $HIVEPARAMS -hiveconf  partition=$partition -e hivequery.hql
>>
>> and then in hivequery.sql you can refer the variable with
>>
>> where column_name = '${hiveconf:partition}'
>>
>> I may be wrong in interpreting the execution pattern of hivequery but this
>> approach solved my problem
>> Thanks,
>> nitin
>>
>>
>> On Thu, Apr 19, 2012 at 3:27 PM, Ramkumar <[EMAIL PROTECTED]>
>> wrote:
>>>
>>> Hi,
>>>
>>> I have a table partitioned by local_date.  When I write a query with
>>>
>>> WHERE local_date = =date_add('2011-12-07',3) ,
>>>
>>> hive executes the UDF ahead and looks only into the specific partitions.
>>> But when the udf becomes more complex like
>>>
>>> WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>>>
>>> hive looks through all the partitions even though the above function  can
>>> very well be computed ahead of time and optimize the query.  Is this
>>> behaviour intentional ? And is there a workaround other than hardcoding the
>>> date or using a param?
>>>
>>> Thanks,
>>> Ramkumar
>>
>>
>>
>>
>> --
>> Nitin Pawar
>>