Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - Re: Custom UserDefinedFunction in Hive


Copy link to this message
-
Re: Custom UserDefinedFunction in Hive
Jan Dolinár 2012-08-07, 18:18
By the way, even without hiveconf, you can run hive from shell like this to
achieve what you want using shell capabilities:

hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
AS 'com.example.hive.udf.YesterdayDate';
SELECT * FROM REALTIME where dt=$(date -d -1day +%Y%m%d) LIMIT 10;"

At least if hive 6.0 supports -e option, I don't have where to check that.

Jan
On Tue, Aug 7, 2012 at 8:11 PM, Vijay <[EMAIL PROTECTED]> wrote:

> Given the implementation of the UDF, I don't think hive would be able
> to use partition pruning. Especially the version you're using. I'd
> really recommend upgrading to a later version that has the hiveconf
> support. That can save a lot of trouble rather than trying to get
> things working on 0.6
>
> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <[EMAIL PROTECTED]>
> wrote:
> > Hi Jan,
> >
> > I have date in different format also, so that is the reason I was
> thinking
> > to do by this approach. How can I make sure this will work on the
> selected
> > partition only and it will not scan the entire table. I will add your
> > suggestion in my UDF as deterministic thing.
> >
> > My simple question here is- How to get the Yesterdays date which I can
> use
> > on the Date Partition I cannot use hiveconf here as I am working with
> Hive
> > 0.6
> >
> >
> >
> >
> > Raihan Jamal
> >
> >
> >
> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <[EMAIL PROTECTED]>
> wrote:
> >>
> >> I'm afraid that  he query
> >>
> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
> >>
> >> will scan entire table, because the functions is evaluated at runtime,
> so
> >> Hive doesn't know what the value is when it decides which files to
> scan. I
> >> am not 100% sure though, you should try it.
> >>
> >> Also, you might want to try to add annotation to your UDF saying that
> the
> >> function is deterministic:
> >> @UDFType(deterministic=false)
> >>
> >> I think Hive might be clever enough to evaluate it early enough to use
> the
> >> partition pruning correctly, since it operates on constant expression.
> But
> >> again, I'm not really sure, maybe someone with deeper knowledge of Hive
> >> optimizations will tell us more. It is actually quite interesting
> question.
> >>
> >> Another way to help Hive with the optimizations might be to skip passing
> >> the format string argument, if you have all dates in same format, you
> can
> >> call the function just like 'yesterdaydate()' and hardcode the format
> in the
> >> function.
> >>
> >> Jan
> >>
> >>
> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <[EMAIL PROTECTED]>
> >> wrote:
> >>>
> >>> Hi Jan,
> >>>
> >>>
> >>>
> >>> I figured that out, it is working fine for me now. The only question I
> >>> have is, if I am doing like this-
> >>>
> >>>
> >>>
> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
> >>>
> >>>
> >>>
> >>> Then the above query will be evaluated as below right?
> >>>
> >>>
> >>>
> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
> >>>
> >>>
> >>>
> >>> So that means it will look for data in the corresponding dt partition
> >>> (20120806) only right as above table is partitioned on dt column ? And
> it
> >>> will not scan the whole table right?
> >>>
> >>>
> >>>
> >>>
> >>> Raihan Jamal
> >>>
> >>>
> >>>
> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <[EMAIL PROTECTED]>
> wrote:
> >>>>
> >>>> Hi Jamal,
> >>>>
> >>>> Check if the function really returns what it should and that your data
> >>>> are really in yyyyMMdd format. You can do this by simple query like
> this:
> >>>>
> >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
> >>>>
> >>>> I don't see anything wrong with the function itself, it works well for
> >>>> me (although I tested it in hive 0.7.1). The only thing I would
> change about
> >>>> it would be to optimize it by calling 'new' only at the time of
> construction
> >>>> and reusing the object when the function is called, but that should
> not
> >>>> affect the functionality at all.