Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Re: Custom UserDefinedFunction in Hive


Copy link to this message
-
Re: Custom UserDefinedFunction in Hive
Given the implementation of the UDF, I don't think hive would be able
to use partition pruning. Especially the version you're using. I'd
really recommend upgrading to a later version that has the hiveconf
support. That can save a lot of trouble rather than trying to get
things working on 0.6

On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <[EMAIL PROTECTED]> wrote:
> Hi Jan,
>
> I have date in different format also, so that is the reason I was thinking
> to do by this approach. How can I make sure this will work on the selected
> partition only and it will not scan the entire table. I will add your
> suggestion in my UDF as deterministic thing.
>
> My simple question here is- How to get the Yesterdays date which I can use
> on the Date Partition I cannot use hiveconf here as I am working with Hive
> 0.6
>
>
>
>
> Raihan Jamal
>
>
>
> On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote:
>>
>> I'm afraid that  he query
>>
>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>
>> will scan entire table, because the functions is evaluated at runtime, so
>> Hive doesn't know what the value is when it decides which files to scan. I
>> am not 100% sure though, you should try it.
>>
>> Also, you might want to try to add annotation to your UDF saying that the
>> function is deterministic:
>> @UDFType(deterministic=false)
>>
>> I think Hive might be clever enough to evaluate it early enough to use the
>> partition pruning correctly, since it operates on constant expression. But
>> again, I'm not really sure, maybe someone with deeper knowledge of Hive
>> optimizations will tell us more. It is actually quite interesting question.
>>
>> Another way to help Hive with the optimizations might be to skip passing
>> the format string argument, if you have all dates in same format, you can
>> call the function just like 'yesterdaydate()' and hardcode the format in the
>> function.
>>
>> Jan
>>
>>
>> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <[EMAIL PROTECTED]>
>> wrote:
>>>
>>> Hi Jan,
>>>
>>>
>>>
>>> I figured that out, it is working fine for me now. The only question I
>>> have is, if I am doing like this-
>>>
>>>
>>>
>>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>>
>>>
>>>
>>> Then the above query will be evaluated as below right?
>>>
>>>
>>>
>>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>>>
>>>
>>>
>>> So that means it will look for data in the corresponding dt partition
>>> (20120806) only right as above table is partitioned on dt column ? And it
>>> will not scan the whole table right?
>>>
>>>
>>>
>>>
>>> Raihan Jamal
>>>
>>>
>>>
>>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Hi Jamal,
>>>>
>>>> Check if the function really returns what it should and that your data
>>>> are really in yyyyMMdd format. You can do this by simple query like this:
>>>>
>>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>>>
>>>> I don't see anything wrong with the function itself, it works well for
>>>> me (although I tested it in hive 0.7.1). The only thing I would change about
>>>> it would be to optimize it by calling 'new' only at the time of construction
>>>> and reusing the object when the function is called, but that should not
>>>> affect the functionality at all.
>>>>
>>>> Best regards,
>>>> Jan
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <[EMAIL PROTECTED]>
>>>> wrote:
>>>>>
>>>>> Problem
>>>>>
>>>>> I created the below UserDefinedFunction to get the yesterday's day in
>>>>> the format I wanted as I will be passing the format into this below method
>>>>> from the query.
>>>>>
>>>>>
>>>>>
>>>>> public final class YesterdayDate extends UDF {
>>>>>
>>>>>
>>>>>
>>>>>                 public String evaluate(final String format) {
>>>>>
>>>>>                                 DateFormat dateFormat = new
>>>>> SimpleDateFormat(format);
>>>>>
>>>>>                                 Calendar cal = Calendar.getInstance();
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB