Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Re: Custom UserDefinedFunction in Hive


Copy link to this message
-
Re: Custom UserDefinedFunction in Hive
Hi Jamal,

date is standard linux/unix tool, see the manual page:
http://linux.die.net/man/1/date.

The $(...) tells the shell to execute the command and insert it's output
into the string. So in this case it will execute command
date -d -1day +%Y%m%d
which returns yesterday date in the format you need.

Jan
On Tue, Aug 7, 2012 at 8:24 PM, Raihan Jamal <[EMAIL PROTECTED]> wrote:

> Yes it supports -e option, but in your query what is date?
>
> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
> AS 'com.example.hive.udf.YesterdayDate';
> SELECT * FROM REALTIME where dt=$(*date* -d -1day +%Y%m%d) LIMIT 10;"
>
>
>
> *Raihan Jamal*
>
>
>
> On Tue, Aug 7, 2012 at 11:18 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote:
>
>> By the way, even without hiveconf, you can run hive from shell like this
>> to achieve what you want using shell capabilities:
>>
>> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
>> AS 'com.example.hive.udf.YesterdayDate';
>> SELECT * FROM REALTIME where dt=$(date -d -1day +%Y%m%d) LIMIT 10;"
>>
>> At least if hive 6.0 supports -e option, I don't have where to check that.
>>
>> Jan
>>
>>
>> On Tue, Aug 7, 2012 at 8:11 PM, Vijay <[EMAIL PROTECTED]> wrote:
>>
>>> Given the implementation of the UDF, I don't think hive would be able
>>> to use partition pruning. Especially the version you're using. I'd
>>> really recommend upgrading to a later version that has the hiveconf
>>> support. That can save a lot of trouble rather than trying to get
>>> things working on 0.6
>>>
>>> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <[EMAIL PROTECTED]>
>>> wrote:
>>> > Hi Jan,
>>> >
>>> > I have date in different format also, so that is the reason I was
>>> thinking
>>> > to do by this approach. How can I make sure this will work on the
>>> selected
>>> > partition only and it will not scan the entire table. I will add your
>>> > suggestion in my UDF as deterministic thing.
>>> >
>>> > My simple question here is- How to get the Yesterdays date which I can
>>> use
>>> > on the Date Partition I cannot use hiveconf here as I am working with
>>> Hive
>>> > 0.6
>>> >
>>> >
>>> >
>>> >
>>> > Raihan Jamal
>>> >
>>> >
>>> >
>>> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <[EMAIL PROTECTED]>
>>> wrote:
>>> >>
>>> >> I'm afraid that  he query
>>> >>
>>> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>> >>
>>> >> will scan entire table, because the functions is evaluated at
>>> runtime, so
>>> >> Hive doesn't know what the value is when it decides which files to
>>> scan. I
>>> >> am not 100% sure though, you should try it.
>>> >>
>>> >> Also, you might want to try to add annotation to your UDF saying that
>>> the
>>> >> function is deterministic:
>>> >> @UDFType(deterministic=false)
>>> >>
>>> >> I think Hive might be clever enough to evaluate it early enough to
>>> use the
>>> >> partition pruning correctly, since it operates on constant
>>> expression. But
>>> >> again, I'm not really sure, maybe someone with deeper knowledge of
>>> Hive
>>> >> optimizations will tell us more. It is actually quite interesting
>>> question.
>>> >>
>>> >> Another way to help Hive with the optimizations might be to skip
>>> passing
>>> >> the format string argument, if you have all dates in same format, you
>>> can
>>> >> call the function just like 'yesterdaydate()' and hardcode the format
>>> in the
>>> >> function.
>>> >>
>>> >> Jan
>>> >>
>>> >>
>>> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <[EMAIL PROTECTED]>
>>> >> wrote:
>>> >>>
>>> >>> Hi Jan,
>>> >>>
>>> >>>
>>> >>>
>>> >>> I figured that out, it is working fine for me now. The only question
>>> I
>>> >>> have is, if I am doing like this-
>>> >>>
>>> >>>
>>> >>>
>>> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>> >>>
>>> >>>
>>> >>>
>>> >>> Then the above query will be evaluated as below right?
>>> >>>
>>> >>>
>>> >>>
>>> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>>> >>>
>>> >>>
>>> >>>
>>> >>> So that means it will look for data in the corresponding dt partition