|
Jan Dolinár
2012-08-07, 17:37
Raihan Jamal
2012-08-07, 18:07
Vijay
2012-08-07, 18:11
Raihan Jamal
2012-08-07, 18:26
Vijay
2012-08-07, 18:31
Jan Dolinár
2012-08-07, 18:18
Raihan Jamal
2012-08-07, 18:24
Jan Dolinár
2012-08-07, 18:32
Techy Teck
2012-08-07, 18:39
Jan Dolinár
2012-08-07, 18:42
Raihan Jamal
2012-08-07, 18:41
Jan Dolinár
2012-08-07, 17:56
Raihan Jamal
2012-08-07, 18:03
kulkarni.swarnim@...)
2012-08-07, 17:35
Bejoy Ks
2012-08-08, 10:49
Venkatesh Kavuluri
2012-08-08, 23:41
|
-
Re: Custom UserDefinedFunction in HiveJan Dolinár 2012-08-07, 17:37
I'm afraid that he query
SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; will scan entire table, because the functions is evaluated at runtime, so Hive doesn't know what the value is when it decides which files to scan. I am not 100% sure though, you should try it. Also, you might want to try to add annotation to your UDF saying that the function is deterministic: @*UDFType(deterministic=false*) I think Hive might be clever enough to evaluate it early enough to use the partition pruning correctly, since it operates on constant expression. But again, I'm not really sure, maybe someone with deeper knowledge of Hive optimizations will tell us more. It is actually quite interesting question. Another way to help Hive with the optimizations might be to skip passing the format string argument, if you have all dates in same format, you can call the function just like 'yesterdaydate()' and hardcode the format in the function. Jan On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <[EMAIL PROTECTED]> wrote: > Hi Jan, > > > I figured that out, it is working fine for me now. The only question I > have is, if I am doing like this- > > > > SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; > > > > Then the above query will be evaluated as below right? > > > > SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10; > > > > So that means it will look for data in the corresponding dt partition *(20120806) > *only right as above table is partitioned on dt column ? And it will not > scan the whole table right?** > > > > *Raihan Jamal* > > > > On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote: > >> Hi Jamal, >> >> Check if the function really returns what it should and that your data >> are really in yyyyMMdd format. You can do this by simple query like this: >> >> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1; >> >> I don't see anything wrong with the function itself, it works well for me >> (although I tested it in hive 0.7.1). The only thing I would change about >> it would be to optimize it by calling 'new' only at the time of >> construction and reusing the object when the function is called, but that >> should not affect the functionality at all. >> >> Best regards, >> Jan >> >> >> >> >> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <[EMAIL PROTECTED]>wrote: >> >>> *Problem* >>> >>> I created the below UserDefinedFunction to get the yesterday's day in >>> the format I wanted as I will be passing the format into this below method >>> from the query. >>> >>> >>> >>> *public final class YesterdayDate extends UDF {* >>> >>> * * >>> >>> * public String evaluate(final String format) { * >>> >>> * DateFormat dateFormat = new >>> SimpleDateFormat(format); * >>> >>> * Calendar cal = Calendar.getInstance();* >>> >>> * cal.add(Calendar.DATE, -1); * >>> >>> * return >>> dateFormat.format(cal.getTime()).toString(); * >>> >>> * } * >>> >>> *}* >>> >>> >>> >>> >>> >>> So whenever I try to run the query like below by adding the jar to >>> classpath and creating the temporary function yesterdaydate, I always get >>> zero result back- >>> >>> >>> >>> hive> create temporary function *yesterdaydate* as >>> 'com.example.hive.udf.YesterdayDate'; >>> >>> OK >>> >>> Time taken: 0.512 seconds >>> >>> >>> >>> Below is the query I am running- >>> >>> >>> >>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT >>> 10;* >>> >>> *OK* >>> >>> * * >>> >>> And I always get zero result back but the data is there in that table >>> for Aug 5th.** >>> >>> >>> >>> What wrong I am doing? Any suggestions will be appreciated. >>> >>> >>> >>> >>> >>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable >>> substitution thing, so I cannot use hiveconf here and the above table has >>> been partitioned on dt(date) column.** +
Jan Dolinár 2012-08-07, 17:37
-
Re: Custom UserDefinedFunction in HiveRaihan Jamal 2012-08-07, 18:07
Hi Jan,
I have date in different format also, so that is the reason I was thinking to do by this approach. How can I make sure this will work on the selected partition only and it will not scan the entire table. I will add your suggestion in my UDF as deterministic thing. My simple question here is- How to get the Yesterdays date which I can use on the Date Partition I cannot use hiveconf here as I am working with Hive 0.6 *Raihan Jamal* On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote: > I'm afraid that he query > > SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; > > will scan entire table, because the functions is evaluated at runtime, so > Hive doesn't know what the value is when it decides which files to scan. I > am not 100% sure though, you should try it. > > Also, you might want to try to add annotation to your UDF saying that the > function is deterministic: > @*UDFType(deterministic=false*) > > I think Hive might be clever enough to evaluate it early enough to use the > partition pruning correctly, since it operates on constant expression. But > again, I'm not really sure, maybe someone with deeper knowledge of Hive > optimizations will tell us more. It is actually quite interesting question. > > Another way to help Hive with the optimizations might be to skip passing > the format string argument, if you have all dates in same format, you can > call the function just like 'yesterdaydate()' and hardcode the format in > the function. > > Jan > > > On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <[EMAIL PROTECTED]>wrote: > >> Hi Jan, >> >> >> I figured that out, it is working fine for me now. The only question I >> have is, if I am doing like this- >> >> >> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >> >> >> >> Then the above query will be evaluated as below right? >> >> >> >> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10; >> >> >> >> So that means it will look for data in the corresponding dt partition *(20120806) >> *only right as above table is partitioned on dt column ? And it will not >> scan the whole table right?** >> >> >> >> *Raihan Jamal* >> >> >> >> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote: >> >>> Hi Jamal, >>> >>> Check if the function really returns what it should and that your data >>> are really in yyyyMMdd format. You can do this by simple query like this: >>> >>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1; >>> >>> I don't see anything wrong with the function itself, it works well for >>> me (although I tested it in hive 0.7.1). The only thing I would change >>> about it would be to optimize it by calling 'new' only at the time of >>> construction and reusing the object when the function is called, but that >>> should not affect the functionality at all. >>> >>> Best regards, >>> Jan >>> >>> >>> >>> >>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <[EMAIL PROTECTED]>wrote: >>> >>>> *Problem* >>>> >>>> I created the below UserDefinedFunction to get the yesterday's day in >>>> the format I wanted as I will be passing the format into this below method >>>> from the query. >>>> >>>> >>>> >>>> *public final class YesterdayDate extends UDF {* >>>> >>>> * * >>>> >>>> * public String evaluate(final String format) { * >>>> >>>> * DateFormat dateFormat = new >>>> SimpleDateFormat(format); * >>>> >>>> * Calendar cal = Calendar.getInstance(); >>>> * >>>> >>>> * cal.add(Calendar.DATE, -1); * >>>> >>>> * return >>>> dateFormat.format(cal.getTime()).toString(); * >>>> >>>> * } * >>>> >>>> *}* >>>> >>>> >>>> >>>> >>>> >>>> So whenever I try to run the query like below by adding the jar to >>>> classpath and creating the temporary function yesterdaydate, I always get >>>> zero result back- >>>> >>>> >>>> >>>> hive> create temporary function *yesterdaydate* as +
Raihan Jamal 2012-08-07, 18:07
-
Re: Custom UserDefinedFunction in HiveVijay 2012-08-07, 18:11
Given the implementation of the UDF, I don't think hive would be able
to use partition pruning. Especially the version you're using. I'd really recommend upgrading to a later version that has the hiveconf support. That can save a lot of trouble rather than trying to get things working on 0.6 On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <[EMAIL PROTECTED]> wrote: > Hi Jan, > > I have date in different format also, so that is the reason I was thinking > to do by this approach. How can I make sure this will work on the selected > partition only and it will not scan the entire table. I will add your > suggestion in my UDF as deterministic thing. > > My simple question here is- How to get the Yesterdays date which I can use > on the Date Partition I cannot use hiveconf here as I am working with Hive > 0.6 > > > > > Raihan Jamal > > > > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote: >> >> I'm afraid that he query >> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >> >> will scan entire table, because the functions is evaluated at runtime, so >> Hive doesn't know what the value is when it decides which files to scan. I >> am not 100% sure though, you should try it. >> >> Also, you might want to try to add annotation to your UDF saying that the >> function is deterministic: >> @UDFType(deterministic=false) >> >> I think Hive might be clever enough to evaluate it early enough to use the >> partition pruning correctly, since it operates on constant expression. But >> again, I'm not really sure, maybe someone with deeper knowledge of Hive >> optimizations will tell us more. It is actually quite interesting question. >> >> Another way to help Hive with the optimizations might be to skip passing >> the format string argument, if you have all dates in same format, you can >> call the function just like 'yesterdaydate()' and hardcode the format in the >> function. >> >> Jan >> >> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <[EMAIL PROTECTED]> >> wrote: >>> >>> Hi Jan, >>> >>> >>> >>> I figured that out, it is working fine for me now. The only question I >>> have is, if I am doing like this- >>> >>> >>> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >>> >>> >>> >>> Then the above query will be evaluated as below right? >>> >>> >>> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10; >>> >>> >>> >>> So that means it will look for data in the corresponding dt partition >>> (20120806) only right as above table is partitioned on dt column ? And it >>> will not scan the whole table right? >>> >>> >>> >>> >>> Raihan Jamal >>> >>> >>> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote: >>>> >>>> Hi Jamal, >>>> >>>> Check if the function really returns what it should and that your data >>>> are really in yyyyMMdd format. You can do this by simple query like this: >>>> >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1; >>>> >>>> I don't see anything wrong with the function itself, it works well for >>>> me (although I tested it in hive 0.7.1). The only thing I would change about >>>> it would be to optimize it by calling 'new' only at the time of construction >>>> and reusing the object when the function is called, but that should not >>>> affect the functionality at all. >>>> >>>> Best regards, >>>> Jan >>>> >>>> >>>> >>>> >>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <[EMAIL PROTECTED]> >>>> wrote: >>>>> >>>>> Problem >>>>> >>>>> I created the below UserDefinedFunction to get the yesterday's day in >>>>> the format I wanted as I will be passing the format into this below method >>>>> from the query. >>>>> >>>>> >>>>> >>>>> public final class YesterdayDate extends UDF { >>>>> >>>>> >>>>> >>>>> public String evaluate(final String format) { >>>>> >>>>> DateFormat dateFormat = new >>>>> SimpleDateFormat(format); >>>>> >>>>> Calendar cal = Calendar.getInstance(); +
Vijay 2012-08-07, 18:11
-
Re: Custom UserDefinedFunction in HiveRaihan Jamal 2012-08-07, 18:26
Hi Vijay,
Thanks for the suggestion, If upgrading to Hive was under my control then I would have done for sure, but I am working in a company and they are running Hive 0.6 on all the cluster, And I told them to upgrade the Hive version but they said it will take few months for them to do this. And I don't know why they are saying like this, so that is the reason I was doing like this. Any suggestions will be appreciated to make this thing work *Raihan Jamal* On Tue, Aug 7, 2012 at 11:11 AM, Vijay <[EMAIL PROTECTED]> wrote: > Given the implementation of the UDF, I don't think hive would be able > to use partition pruning. Especially the version you're using. I'd > really recommend upgrading to a later version that has the hiveconf > support. That can save a lot of trouble rather than trying to get > things working on 0.6 > > On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <[EMAIL PROTECTED]> > wrote: > > Hi Jan, > > > > I have date in different format also, so that is the reason I was > thinking > > to do by this approach. How can I make sure this will work on the > selected > > partition only and it will not scan the entire table. I will add your > > suggestion in my UDF as deterministic thing. > > > > My simple question here is- How to get the Yesterdays date which I can > use > > on the Date Partition I cannot use hiveconf here as I am working with > Hive > > 0.6 > > > > > > > > > > Raihan Jamal > > > > > > > > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <[EMAIL PROTECTED]> > wrote: > >> > >> I'm afraid that he query > >> > >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; > >> > >> will scan entire table, because the functions is evaluated at runtime, > so > >> Hive doesn't know what the value is when it decides which files to > scan. I > >> am not 100% sure though, you should try it. > >> > >> Also, you might want to try to add annotation to your UDF saying that > the > >> function is deterministic: > >> @UDFType(deterministic=false) > >> > >> I think Hive might be clever enough to evaluate it early enough to use > the > >> partition pruning correctly, since it operates on constant expression. > But > >> again, I'm not really sure, maybe someone with deeper knowledge of Hive > >> optimizations will tell us more. It is actually quite interesting > question. > >> > >> Another way to help Hive with the optimizations might be to skip passing > >> the format string argument, if you have all dates in same format, you > can > >> call the function just like 'yesterdaydate()' and hardcode the format > in the > >> function. > >> > >> Jan > >> > >> > >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <[EMAIL PROTECTED]> > >> wrote: > >>> > >>> Hi Jan, > >>> > >>> > >>> > >>> I figured that out, it is working fine for me now. The only question I > >>> have is, if I am doing like this- > >>> > >>> > >>> > >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; > >>> > >>> > >>> > >>> Then the above query will be evaluated as below right? > >>> > >>> > >>> > >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10; > >>> > >>> > >>> > >>> So that means it will look for data in the corresponding dt partition > >>> (20120806) only right as above table is partitioned on dt column ? And > it > >>> will not scan the whole table right? > >>> > >>> > >>> > >>> > >>> Raihan Jamal > >>> > >>> > >>> > >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <[EMAIL PROTECTED]> > wrote: > >>>> > >>>> Hi Jamal, > >>>> > >>>> Check if the function really returns what it should and that your data > >>>> are really in yyyyMMdd format. You can do this by simple query like > this: > >>>> > >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1; > >>>> > >>>> I don't see anything wrong with the function itself, it works well for > >>>> me (although I tested it in hive 0.7.1). The only thing I would > change about > >>>> it would be to optimize it by calling 'new' only at the time of > construction +
Raihan Jamal 2012-08-07, 18:26
-
Re: Custom UserDefinedFunction in HiveVijay 2012-08-07, 18:31
You actually don't need hive on the whole cluster. That's the beauty
of it. You only need it on the client machine where you're submitting hive jobs. Of course the metadata store does need to be upgraded for newer versions so that might still be a problem. On Tue, Aug 7, 2012 at 11:26 AM, Raihan Jamal <[EMAIL PROTECTED]> wrote: > Hi Vijay, > > Thanks for the suggestion, If upgrading to Hive was under my control then I > would have done for sure, but I am working in a company and they are running > Hive 0.6 on all the cluster, And I told them to upgrade the Hive version but > they said it will take few months for them to do this. And I don't know why > they are saying like this, so that is the reason I was doing like this. > > Any suggestions will be appreciated to make this thing work > > > Raihan Jamal > > > > On Tue, Aug 7, 2012 at 11:11 AM, Vijay <[EMAIL PROTECTED]> wrote: >> >> Given the implementation of the UDF, I don't think hive would be able >> to use partition pruning. Especially the version you're using. I'd >> really recommend upgrading to a later version that has the hiveconf >> support. That can save a lot of trouble rather than trying to get >> things working on 0.6 >> >> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <[EMAIL PROTECTED]> >> wrote: >> > Hi Jan, >> > >> > I have date in different format also, so that is the reason I was >> > thinking >> > to do by this approach. How can I make sure this will work on the >> > selected >> > partition only and it will not scan the entire table. I will add your >> > suggestion in my UDF as deterministic thing. >> > >> > My simple question here is- How to get the Yesterdays date which I can >> > use >> > on the Date Partition I cannot use hiveconf here as I am working with >> > Hive >> > 0.6 >> > >> > >> > >> > >> > Raihan Jamal >> > >> > >> > >> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <[EMAIL PROTECTED]> >> > wrote: >> >> >> >> I'm afraid that he query >> >> >> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >> >> >> >> will scan entire table, because the functions is evaluated at runtime, >> >> so >> >> Hive doesn't know what the value is when it decides which files to >> >> scan. I >> >> am not 100% sure though, you should try it. >> >> >> >> Also, you might want to try to add annotation to your UDF saying that >> >> the >> >> function is deterministic: >> >> @UDFType(deterministic=false) >> >> >> >> I think Hive might be clever enough to evaluate it early enough to use >> >> the >> >> partition pruning correctly, since it operates on constant expression. >> >> But >> >> again, I'm not really sure, maybe someone with deeper knowledge of Hive >> >> optimizations will tell us more. It is actually quite interesting >> >> question. >> >> >> >> Another way to help Hive with the optimizations might be to skip >> >> passing >> >> the format string argument, if you have all dates in same format, you >> >> can >> >> call the function just like 'yesterdaydate()' and hardcode the format >> >> in the >> >> function. >> >> >> >> Jan >> >> >> >> >> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <[EMAIL PROTECTED]> >> >> wrote: >> >>> >> >>> Hi Jan, >> >>> >> >>> >> >>> >> >>> I figured that out, it is working fine for me now. The only question I >> >>> have is, if I am doing like this- >> >>> >> >>> >> >>> >> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >> >>> >> >>> >> >>> >> >>> Then the above query will be evaluated as below right? >> >>> >> >>> >> >>> >> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10; >> >>> >> >>> >> >>> >> >>> So that means it will look for data in the corresponding dt partition >> >>> (20120806) only right as above table is partitioned on dt column ? And >> >>> it >> >>> will not scan the whole table right? >> >>> >> >>> >> >>> >> >>> >> >>> Raihan Jamal >> >>> >> >>> >> >>> >> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <[EMAIL PROTECTED]> >> >>> wrote: >> >>>> >> +
Vijay 2012-08-07, 18:31
-
Re: Custom UserDefinedFunction in HiveJan Dolinár 2012-08-07, 18:18
By the way, even without hiveconf, you can run hive from shell like this to
achieve what you want using shell capabilities: hive -e "CREATE TEMPORARY FUNCTION yesterdaydate AS 'com.example.hive.udf.YesterdayDate'; SELECT * FROM REALTIME where dt=$(date -d -1day +%Y%m%d) LIMIT 10;" At least if hive 6.0 supports -e option, I don't have where to check that. Jan On Tue, Aug 7, 2012 at 8:11 PM, Vijay <[EMAIL PROTECTED]> wrote: > Given the implementation of the UDF, I don't think hive would be able > to use partition pruning. Especially the version you're using. I'd > really recommend upgrading to a later version that has the hiveconf > support. That can save a lot of trouble rather than trying to get > things working on 0.6 > > On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <[EMAIL PROTECTED]> > wrote: > > Hi Jan, > > > > I have date in different format also, so that is the reason I was > thinking > > to do by this approach. How can I make sure this will work on the > selected > > partition only and it will not scan the entire table. I will add your > > suggestion in my UDF as deterministic thing. > > > > My simple question here is- How to get the Yesterdays date which I can > use > > on the Date Partition I cannot use hiveconf here as I am working with > Hive > > 0.6 > > > > > > > > > > Raihan Jamal > > > > > > > > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <[EMAIL PROTECTED]> > wrote: > >> > >> I'm afraid that he query > >> > >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; > >> > >> will scan entire table, because the functions is evaluated at runtime, > so > >> Hive doesn't know what the value is when it decides which files to > scan. I > >> am not 100% sure though, you should try it. > >> > >> Also, you might want to try to add annotation to your UDF saying that > the > >> function is deterministic: > >> @UDFType(deterministic=false) > >> > >> I think Hive might be clever enough to evaluate it early enough to use > the > >> partition pruning correctly, since it operates on constant expression. > But > >> again, I'm not really sure, maybe someone with deeper knowledge of Hive > >> optimizations will tell us more. It is actually quite interesting > question. > >> > >> Another way to help Hive with the optimizations might be to skip passing > >> the format string argument, if you have all dates in same format, you > can > >> call the function just like 'yesterdaydate()' and hardcode the format > in the > >> function. > >> > >> Jan > >> > >> > >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <[EMAIL PROTECTED]> > >> wrote: > >>> > >>> Hi Jan, > >>> > >>> > >>> > >>> I figured that out, it is working fine for me now. The only question I > >>> have is, if I am doing like this- > >>> > >>> > >>> > >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; > >>> > >>> > >>> > >>> Then the above query will be evaluated as below right? > >>> > >>> > >>> > >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10; > >>> > >>> > >>> > >>> So that means it will look for data in the corresponding dt partition > >>> (20120806) only right as above table is partitioned on dt column ? And > it > >>> will not scan the whole table right? > >>> > >>> > >>> > >>> > >>> Raihan Jamal > >>> > >>> > >>> > >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <[EMAIL PROTECTED]> > wrote: > >>>> > >>>> Hi Jamal, > >>>> > >>>> Check if the function really returns what it should and that your data > >>>> are really in yyyyMMdd format. You can do this by simple query like > this: > >>>> > >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1; > >>>> > >>>> I don't see anything wrong with the function itself, it works well for > >>>> me (although I tested it in hive 0.7.1). The only thing I would > change about > >>>> it would be to optimize it by calling 'new' only at the time of > construction > >>>> and reusing the object when the function is called, but that should > not > >>>> affect the functionality at all. +
Jan Dolinár 2012-08-07, 18:18
-
Re: Custom UserDefinedFunction in HiveRaihan Jamal 2012-08-07, 18:24
Yes it supports -e option, but in your query what is date?
hive -e "CREATE TEMPORARY FUNCTION yesterdaydate AS 'com.example.hive.udf.YesterdayDate'; SELECT * FROM REALTIME where dt=$(*date* -d -1day +%Y%m%d) LIMIT 10;" *Raihan Jamal* On Tue, Aug 7, 2012 at 11:18 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote: > By the way, even without hiveconf, you can run hive from shell like this > to achieve what you want using shell capabilities: > > hive -e "CREATE TEMPORARY FUNCTION yesterdaydate > AS 'com.example.hive.udf.YesterdayDate'; > SELECT * FROM REALTIME where dt=$(date -d -1day +%Y%m%d) LIMIT 10;" > > At least if hive 6.0 supports -e option, I don't have where to check that. > > Jan > > > On Tue, Aug 7, 2012 at 8:11 PM, Vijay <[EMAIL PROTECTED]> wrote: > >> Given the implementation of the UDF, I don't think hive would be able >> to use partition pruning. Especially the version you're using. I'd >> really recommend upgrading to a later version that has the hiveconf >> support. That can save a lot of trouble rather than trying to get >> things working on 0.6 >> >> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <[EMAIL PROTECTED]> >> wrote: >> > Hi Jan, >> > >> > I have date in different format also, so that is the reason I was >> thinking >> > to do by this approach. How can I make sure this will work on the >> selected >> > partition only and it will not scan the entire table. I will add your >> > suggestion in my UDF as deterministic thing. >> > >> > My simple question here is- How to get the Yesterdays date which I can >> use >> > on the Date Partition I cannot use hiveconf here as I am working with >> Hive >> > 0.6 >> > >> > >> > >> > >> > Raihan Jamal >> > >> > >> > >> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <[EMAIL PROTECTED]> >> wrote: >> >> >> >> I'm afraid that he query >> >> >> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >> >> >> >> will scan entire table, because the functions is evaluated at runtime, >> so >> >> Hive doesn't know what the value is when it decides which files to >> scan. I >> >> am not 100% sure though, you should try it. >> >> >> >> Also, you might want to try to add annotation to your UDF saying that >> the >> >> function is deterministic: >> >> @UDFType(deterministic=false) >> >> >> >> I think Hive might be clever enough to evaluate it early enough to use >> the >> >> partition pruning correctly, since it operates on constant expression. >> But >> >> again, I'm not really sure, maybe someone with deeper knowledge of Hive >> >> optimizations will tell us more. It is actually quite interesting >> question. >> >> >> >> Another way to help Hive with the optimizations might be to skip >> passing >> >> the format string argument, if you have all dates in same format, you >> can >> >> call the function just like 'yesterdaydate()' and hardcode the format >> in the >> >> function. >> >> >> >> Jan >> >> >> >> >> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <[EMAIL PROTECTED]> >> >> wrote: >> >>> >> >>> Hi Jan, >> >>> >> >>> >> >>> >> >>> I figured that out, it is working fine for me now. The only question I >> >>> have is, if I am doing like this- >> >>> >> >>> >> >>> >> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >> >>> >> >>> >> >>> >> >>> Then the above query will be evaluated as below right? >> >>> >> >>> >> >>> >> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10; >> >>> >> >>> >> >>> >> >>> So that means it will look for data in the corresponding dt partition >> >>> (20120806) only right as above table is partitioned on dt column ? >> And it >> >>> will not scan the whole table right? >> >>> >> >>> >> >>> >> >>> >> >>> Raihan Jamal >> >>> >> >>> >> >>> >> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <[EMAIL PROTECTED]> >> wrote: >> >>>> >> >>>> Hi Jamal, >> >>>> >> >>>> Check if the function really returns what it should and that your >> data >> >>>> are really in yyyyMMdd format. You can do this by simple query like +
Raihan Jamal 2012-08-07, 18:24
-
Re: Custom UserDefinedFunction in HiveJan Dolinár 2012-08-07, 18:32
Hi Jamal,
date is standard linux/unix tool, see the manual page: http://linux.die.net/man/1/date. The $(...) tells the shell to execute the command and insert it's output into the string. So in this case it will execute command date -d -1day +%Y%m%d which returns yesterday date in the format you need. Jan On Tue, Aug 7, 2012 at 8:24 PM, Raihan Jamal <[EMAIL PROTECTED]> wrote: > Yes it supports -e option, but in your query what is date? > > hive -e "CREATE TEMPORARY FUNCTION yesterdaydate > AS 'com.example.hive.udf.YesterdayDate'; > SELECT * FROM REALTIME where dt=$(*date* -d -1day +%Y%m%d) LIMIT 10;" > > > > *Raihan Jamal* > > > > On Tue, Aug 7, 2012 at 11:18 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote: > >> By the way, even without hiveconf, you can run hive from shell like this >> to achieve what you want using shell capabilities: >> >> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate >> AS 'com.example.hive.udf.YesterdayDate'; >> SELECT * FROM REALTIME where dt=$(date -d -1day +%Y%m%d) LIMIT 10;" >> >> At least if hive 6.0 supports -e option, I don't have where to check that. >> >> Jan >> >> >> On Tue, Aug 7, 2012 at 8:11 PM, Vijay <[EMAIL PROTECTED]> wrote: >> >>> Given the implementation of the UDF, I don't think hive would be able >>> to use partition pruning. Especially the version you're using. I'd >>> really recommend upgrading to a later version that has the hiveconf >>> support. That can save a lot of trouble rather than trying to get >>> things working on 0.6 >>> >>> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <[EMAIL PROTECTED]> >>> wrote: >>> > Hi Jan, >>> > >>> > I have date in different format also, so that is the reason I was >>> thinking >>> > to do by this approach. How can I make sure this will work on the >>> selected >>> > partition only and it will not scan the entire table. I will add your >>> > suggestion in my UDF as deterministic thing. >>> > >>> > My simple question here is- How to get the Yesterdays date which I can >>> use >>> > on the Date Partition I cannot use hiveconf here as I am working with >>> Hive >>> > 0.6 >>> > >>> > >>> > >>> > >>> > Raihan Jamal >>> > >>> > >>> > >>> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <[EMAIL PROTECTED]> >>> wrote: >>> >> >>> >> I'm afraid that he query >>> >> >>> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >>> >> >>> >> will scan entire table, because the functions is evaluated at >>> runtime, so >>> >> Hive doesn't know what the value is when it decides which files to >>> scan. I >>> >> am not 100% sure though, you should try it. >>> >> >>> >> Also, you might want to try to add annotation to your UDF saying that >>> the >>> >> function is deterministic: >>> >> @UDFType(deterministic=false) >>> >> >>> >> I think Hive might be clever enough to evaluate it early enough to >>> use the >>> >> partition pruning correctly, since it operates on constant >>> expression. But >>> >> again, I'm not really sure, maybe someone with deeper knowledge of >>> Hive >>> >> optimizations will tell us more. It is actually quite interesting >>> question. >>> >> >>> >> Another way to help Hive with the optimizations might be to skip >>> passing >>> >> the format string argument, if you have all dates in same format, you >>> can >>> >> call the function just like 'yesterdaydate()' and hardcode the format >>> in the >>> >> function. >>> >> >>> >> Jan >>> >> >>> >> >>> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <[EMAIL PROTECTED]> >>> >> wrote: >>> >>> >>> >>> Hi Jan, >>> >>> >>> >>> >>> >>> >>> >>> I figured that out, it is working fine for me now. The only question >>> I >>> >>> have is, if I am doing like this- >>> >>> >>> >>> >>> >>> >>> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >>> >>> >>> >>> >>> >>> >>> >>> Then the above query will be evaluated as below right? >>> >>> >>> >>> >>> >>> >>> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10; >>> >>> >>> >>> >>> >>> >>> >>> So that means it will look for data in the corresponding dt partition +
Jan Dolinár 2012-08-07, 18:32
-
Re: Custom UserDefinedFunction in HiveTechy Teck 2012-08-07, 18:39
Then that means I don't need to create that userdefinedfunction right?
On Tue, Aug 7, 2012 at 11:32 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote: > Hi Jamal, > > date is standard linux/unix tool, see the manual page: > http://linux.die.net/man/1/date. > > The $(...) tells the shell to execute the command and insert it's output > into the string. So in this case it will execute command > date -d -1day +%Y%m%d > which returns yesterday date in the format you need. > > Jan > > > On Tue, Aug 7, 2012 at 8:24 PM, Raihan Jamal <[EMAIL PROTECTED]>wrote: > >> Yes it supports -e option, but in your query what is date? >> >> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate >> AS 'com.example.hive.udf.YesterdayDate'; >> SELECT * FROM REALTIME where dt=$(*date* -d -1day +%Y%m%d) LIMIT 10;" >> >> >> >> *Raihan Jamal* >> >> >> >> On Tue, Aug 7, 2012 at 11:18 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote: >> >>> By the way, even without hiveconf, you can run hive from shell like this >>> to achieve what you want using shell capabilities: >>> >>> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate >>> AS 'com.example.hive.udf.YesterdayDate'; >>> SELECT * FROM REALTIME where dt=$(date -d -1day +%Y%m%d) LIMIT 10;" >>> >>> At least if hive 6.0 supports -e option, I don't have where to check >>> that. >>> >>> Jan >>> >>> >>> On Tue, Aug 7, 2012 at 8:11 PM, Vijay <[EMAIL PROTECTED]> wrote: >>> >>>> Given the implementation of the UDF, I don't think hive would be able >>>> to use partition pruning. Especially the version you're using. I'd >>>> really recommend upgrading to a later version that has the hiveconf >>>> support. That can save a lot of trouble rather than trying to get >>>> things working on 0.6 >>>> >>>> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <[EMAIL PROTECTED]> >>>> wrote: >>>> > Hi Jan, >>>> > >>>> > I have date in different format also, so that is the reason I was >>>> thinking >>>> > to do by this approach. How can I make sure this will work on the >>>> selected >>>> > partition only and it will not scan the entire table. I will add your >>>> > suggestion in my UDF as deterministic thing. >>>> > >>>> > My simple question here is- How to get the Yesterdays date which I >>>> can use >>>> > on the Date Partition I cannot use hiveconf here as I am working with >>>> Hive >>>> > 0.6 >>>> > >>>> > >>>> > >>>> > >>>> > Raihan Jamal >>>> > >>>> > >>>> > >>>> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <[EMAIL PROTECTED]> >>>> wrote: >>>> >> >>>> >> I'm afraid that he query >>>> >> >>>> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >>>> >> >>>> >> will scan entire table, because the functions is evaluated at >>>> runtime, so >>>> >> Hive doesn't know what the value is when it decides which files to >>>> scan. I >>>> >> am not 100% sure though, you should try it. >>>> >> >>>> >> Also, you might want to try to add annotation to your UDF saying >>>> that the >>>> >> function is deterministic: >>>> >> @UDFType(deterministic=false) >>>> >> >>>> >> I think Hive might be clever enough to evaluate it early enough to >>>> use the >>>> >> partition pruning correctly, since it operates on constant >>>> expression. But >>>> >> again, I'm not really sure, maybe someone with deeper knowledge of >>>> Hive >>>> >> optimizations will tell us more. It is actually quite interesting >>>> question. >>>> >> >>>> >> Another way to help Hive with the optimizations might be to skip >>>> passing >>>> >> the format string argument, if you have all dates in same format, >>>> you can >>>> >> call the function just like 'yesterdaydate()' and hardcode the >>>> format in the >>>> >> function. >>>> >> >>>> >> Jan >>>> >> >>>> >> >>>> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <[EMAIL PROTECTED]> >>>> >> wrote: >>>> >>> >>>> >>> Hi Jan, >>>> >>> >>>> >>> >>>> >>> >>>> >>> I figured that out, it is working fine for me now. The only >>>> question I >>>> >>> have is, if I am doing like this- >>>> >>> >>>> >>> >>>> >>> >>>> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; +
Techy Teck 2012-08-07, 18:39
-
Re: Custom UserDefinedFunction in HiveJan Dolinár 2012-08-07, 18:42
Right, no need for the function at all. Sorry it is getting late here and
my brain refuses to work any more :) On Tue, Aug 7, 2012 at 8:39 PM, Techy Teck <[EMAIL PROTECTED]> wrote: > Then that means I don't need to create that userdefinedfunction right? > > > > On Tue, Aug 7, 2012 at 11:32 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote: > >> Hi Jamal, >> >> date is standard linux/unix tool, see the manual page: >> http://linux.die.net/man/1/date. >> >> The $(...) tells the shell to execute the command and insert it's output >> into the string. So in this case it will execute command >> date -d -1day +%Y%m%d >> which returns yesterday date in the format you need. >> >> Jan >> >> >> On Tue, Aug 7, 2012 at 8:24 PM, Raihan Jamal <[EMAIL PROTECTED]>wrote: >> >>> Yes it supports -e option, but in your query what is date? >>> >>> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate >>> AS 'com.example.hive.udf.YesterdayDate'; >>> SELECT * FROM REALTIME where dt=$(*date* -d -1day +%Y%m%d) LIMIT 10;" >>> >>> >>> >>> *Raihan Jamal* >>> >>> >>> >>> On Tue, Aug 7, 2012 at 11:18 AM, Jan Dolinár <[EMAIL PROTECTED]>wrote: >>> >>>> By the way, even without hiveconf, you can run hive from shell like >>>> this to achieve what you want using shell capabilities: >>>> >>>> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate >>>> AS 'com.example.hive.udf.YesterdayDate'; >>>> SELECT * FROM REALTIME where dt=$(date -d -1day +%Y%m%d) LIMIT 10;" >>>> >>>> At least if hive 6.0 supports -e option, I don't have where to check >>>> that. >>>> >>>> Jan >>>> >>>> >>>> On Tue, Aug 7, 2012 at 8:11 PM, Vijay <[EMAIL PROTECTED]> wrote: >>>> >>>>> Given the implementation of the UDF, I don't think hive would be able >>>>> to use partition pruning. Especially the version you're using. I'd >>>>> really recommend upgrading to a later version that has the hiveconf >>>>> support. That can save a lot of trouble rather than trying to get >>>>> things working on 0.6 >>>>> >>>>> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <[EMAIL PROTECTED]> >>>>> wrote: >>>>> > Hi Jan, >>>>> > >>>>> > I have date in different format also, so that is the reason I was >>>>> thinking >>>>> > to do by this approach. How can I make sure this will work on the >>>>> selected >>>>> > partition only and it will not scan the entire table. I will add your >>>>> > suggestion in my UDF as deterministic thing. >>>>> > >>>>> > My simple question here is- How to get the Yesterdays date which I >>>>> can use >>>>> > on the Date Partition I cannot use hiveconf here as I am working >>>>> with Hive >>>>> > 0.6 >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > Raihan Jamal >>>>> > >>>>> > >>>>> > >>>>> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <[EMAIL PROTECTED]> >>>>> wrote: >>>>> >> >>>>> >> I'm afraid that he query >>>>> >> >>>>> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >>>>> >> >>>>> >> will scan entire table, because the functions is evaluated at >>>>> runtime, so >>>>> >> Hive doesn't know what the value is when it decides which files to >>>>> scan. I >>>>> >> am not 100% sure though, you should try it. >>>>> >> >>>>> >> Also, you might want to try to add annotation to your UDF saying >>>>> that the >>>>> >> function is deterministic: >>>>> >> @UDFType(deterministic=false) >>>>> >> >>>>> >> I think Hive might be clever enough to evaluate it early enough to >>>>> use the >>>>> >> partition pruning correctly, since it operates on constant >>>>> expression. But >>>>> >> again, I'm not really sure, maybe someone with deeper knowledge of >>>>> Hive >>>>> >> optimizations will tell us more. It is actually quite interesting >>>>> question. >>>>> >> >>>>> >> Another way to help Hive with the optimizations might be to skip >>>>> passing >>>>> >> the format string argument, if you have all dates in same format, >>>>> you can >>>>> >> call the function just like 'yesterdaydate()' and hardcode the >>>>> format in the >>>>> >> function. >>>>> >> >>>>> >> Jan >>>>> >> >>>>> >> +
Jan Dolinár 2012-08-07, 18:42
-
Re: Custom UserDefinedFunction in HiveRaihan Jamal 2012-08-07, 18:41
Let me try that and I will update on this thread.
*Raihan Jamal* On Tue, Aug 7, 2012 at 11:39 AM, Techy Teck <[EMAIL PROTECTED]> wrote: > Then that means I don't need to create that userdefinedfunction right? > > > > On Tue, Aug 7, 2012 at 11:32 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote: > >> Hi Jamal, >> >> date is standard linux/unix tool, see the manual page: >> http://linux.die.net/man/1/date. >> >> The $(...) tells the shell to execute the command and insert it's output >> into the string. So in this case it will execute command >> date -d -1day +%Y%m%d >> which returns yesterday date in the format you need. >> >> Jan >> >> >> On Tue, Aug 7, 2012 at 8:24 PM, Raihan Jamal <[EMAIL PROTECTED]>wrote: >> >>> Yes it supports -e option, but in your query what is date? >>> >>> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate >>> AS 'com.example.hive.udf.YesterdayDate'; >>> SELECT * FROM REALTIME where dt=$(*date* -d -1day +%Y%m%d) LIMIT 10;" >>> >>> >>> >>> *Raihan Jamal* >>> >>> >>> >>> On Tue, Aug 7, 2012 at 11:18 AM, Jan Dolinár <[EMAIL PROTECTED]>wrote: >>> >>>> By the way, even without hiveconf, you can run hive from shell like >>>> this to achieve what you want using shell capabilities: >>>> >>>> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate >>>> AS 'com.example.hive.udf.YesterdayDate'; >>>> SELECT * FROM REALTIME where dt=$(date -d -1day +%Y%m%d) LIMIT 10;" >>>> >>>> At least if hive 6.0 supports -e option, I don't have where to check >>>> that. >>>> >>>> Jan >>>> >>>> >>>> On Tue, Aug 7, 2012 at 8:11 PM, Vijay <[EMAIL PROTECTED]> wrote: >>>> >>>>> Given the implementation of the UDF, I don't think hive would be able >>>>> to use partition pruning. Especially the version you're using. I'd >>>>> really recommend upgrading to a later version that has the hiveconf >>>>> support. That can save a lot of trouble rather than trying to get >>>>> things working on 0.6 >>>>> >>>>> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <[EMAIL PROTECTED]> >>>>> wrote: >>>>> > Hi Jan, >>>>> > >>>>> > I have date in different format also, so that is the reason I was >>>>> thinking >>>>> > to do by this approach. How can I make sure this will work on the >>>>> selected >>>>> > partition only and it will not scan the entire table. I will add your >>>>> > suggestion in my UDF as deterministic thing. >>>>> > >>>>> > My simple question here is- How to get the Yesterdays date which I >>>>> can use >>>>> > on the Date Partition I cannot use hiveconf here as I am working >>>>> with Hive >>>>> > 0.6 >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > Raihan Jamal >>>>> > >>>>> > >>>>> > >>>>> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <[EMAIL PROTECTED]> >>>>> wrote: >>>>> >> >>>>> >> I'm afraid that he query >>>>> >> >>>>> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >>>>> >> >>>>> >> will scan entire table, because the functions is evaluated at >>>>> runtime, so >>>>> >> Hive doesn't know what the value is when it decides which files to >>>>> scan. I >>>>> >> am not 100% sure though, you should try it. >>>>> >> >>>>> >> Also, you might want to try to add annotation to your UDF saying >>>>> that the >>>>> >> function is deterministic: >>>>> >> @UDFType(deterministic=false) >>>>> >> >>>>> >> I think Hive might be clever enough to evaluate it early enough to >>>>> use the >>>>> >> partition pruning correctly, since it operates on constant >>>>> expression. But >>>>> >> again, I'm not really sure, maybe someone with deeper knowledge of >>>>> Hive >>>>> >> optimizations will tell us more. It is actually quite interesting >>>>> question. >>>>> >> >>>>> >> Another way to help Hive with the optimizations might be to skip >>>>> passing >>>>> >> the format string argument, if you have all dates in same format, >>>>> you can >>>>> >> call the function just like 'yesterdaydate()' and hardcode the >>>>> format in the >>>>> >> function. >>>>> >> >>>>> >> Jan >>>>> >> >>>>> >> >>>>> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <[EMAIL PROTECTED] +
Raihan Jamal 2012-08-07, 18:41
-
Re: Custom UserDefinedFunction in HiveJan Dolinár 2012-08-07, 17:56
Oops, sorry I made a copy&paste mistake :) The annotation should read
@*UDFType(deterministic=true*) Jan On Tue, Aug 7, 2012 at 7:37 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote: > I'm afraid that he query > > SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; > > will scan entire table, because the functions is evaluated at runtime, so > Hive doesn't know what the value is when it decides which files to scan. I > am not 100% sure though, you should try it. > > Also, you might want to try to add annotation to your UDF saying that the > function is deterministic: > @*UDFType(deterministic=false*) > > I think Hive might be clever enough to evaluate it early enough to use the > partition pruning correctly, since it operates on constant expression. But > again, I'm not really sure, maybe someone with deeper knowledge of Hive > optimizations will tell us more. It is actually quite interesting question. > > Another way to help Hive with the optimizations might be to skip passing > the format string argument, if you have all dates in same format, you can > call the function just like 'yesterdaydate()' and hardcode the format in > the function. > > Jan > > > On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <[EMAIL PROTECTED]>wrote: > >> Hi Jan, >> >> >> I figured that out, it is working fine for me now. The only question I >> have is, if I am doing like this- >> >> >> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >> >> >> >> Then the above query will be evaluated as below right? >> >> >> >> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10; >> >> >> >> So that means it will look for data in the corresponding dt partition *(20120806) >> *only right as above table is partitioned on dt column ? And it will not >> scan the whole table right?** >> >> >> >> *Raihan Jamal* >> >> >> >> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote: >> >>> Hi Jamal, >>> >>> Check if the function really returns what it should and that your data >>> are really in yyyyMMdd format. You can do this by simple query like this: >>> >>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1; >>> >>> I don't see anything wrong with the function itself, it works well for >>> me (although I tested it in hive 0.7.1). The only thing I would change >>> about it would be to optimize it by calling 'new' only at the time of >>> construction and reusing the object when the function is called, but that >>> should not affect the functionality at all. >>> >>> Best regards, >>> Jan >>> >>> >>> >>> >>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <[EMAIL PROTECTED]>wrote: >>> >>>> *Problem* >>>> >>>> I created the below UserDefinedFunction to get the yesterday's day in >>>> the format I wanted as I will be passing the format into this below method >>>> from the query. >>>> >>>> >>>> >>>> *public final class YesterdayDate extends UDF {* >>>> >>>> * * >>>> >>>> * public String evaluate(final String format) { * >>>> >>>> * DateFormat dateFormat = new >>>> SimpleDateFormat(format); * >>>> >>>> * Calendar cal = Calendar.getInstance(); >>>> * >>>> >>>> * cal.add(Calendar.DATE, -1); * >>>> >>>> * return >>>> dateFormat.format(cal.getTime()).toString(); * >>>> >>>> * } * >>>> >>>> *}* >>>> >>>> >>>> >>>> >>>> >>>> So whenever I try to run the query like below by adding the jar to >>>> classpath and creating the temporary function yesterdaydate, I always get >>>> zero result back- >>>> >>>> >>>> >>>> hive> create temporary function *yesterdaydate* as >>>> 'com.example.hive.udf.YesterdayDate'; >>>> >>>> OK >>>> >>>> Time taken: 0.512 seconds >>>> >>>> >>>> >>>> Below is the query I am running- >>>> >>>> >>>> >>>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') >>>> LIMIT 10;* >>>> >>>> *OK* >>>> >>>> * * >>>> >>>> And I always get zero result back but the data is there in that table +
Jan Dolinár 2012-08-07, 17:56
-
Re: Custom UserDefinedFunction in HiveRaihan Jamal 2012-08-07, 18:03
@kulkarni,
When I did explain on my query, I got these things, I am not sure how to understand these thing. Any help will be appreciated whether my approach is right or not?- hive> EXPLAIN SELECT * FROM PDS_ATTRIBUTE_DATA_REALTIME where dt=yesterdaydate('yyyyMMdd', 2) LIMIT 5; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF PDS_ATTRIBUTE_DATA_REALTIME)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (TOK_TABLE_OR_COL dt) (TOK_FUNCTION yesterdaydate 'yyyyMMdd' 2))) (TOK_LIMIT 5))) STAGE DEPENDENCIES: Stage-0 is a root stage STAGE PLANS: Stage: Stage-0 Fetch Operator limit: 5 Time taken: 12.126 seconds *Raihan Jamal* On Tue, Aug 7, 2012 at 10:56 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote: > Oops, sorry I made a copy&paste mistake :) The annotation should read > @*UDFType(deterministic=true*) > > Jan > > On Tue, Aug 7, 2012 at 7:37 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote: > >> I'm afraid that he query >> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >> >> will scan entire table, because the functions is evaluated at runtime, so >> Hive doesn't know what the value is when it decides which files to scan. I >> am not 100% sure though, you should try it. >> >> Also, you might want to try to add annotation to your UDF saying that the >> function is deterministic: >> @*UDFType(deterministic=false*) >> >> I think Hive might be clever enough to evaluate it early enough to use >> the partition pruning correctly, since it operates on constant expression. >> But again, I'm not really sure, maybe someone with deeper knowledge of Hive >> optimizations will tell us more. It is actually quite interesting question. >> >> Another way to help Hive with the optimizations might be to skip passing >> the format string argument, if you have all dates in same format, you can >> call the function just like 'yesterdaydate()' and hardcode the format in >> the function. >> >> Jan >> >> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <[EMAIL PROTECTED]>wrote: >> >>> Hi Jan, >>> >>> >>> I figured that out, it is working fine for me now. The only question I >>> have is, if I am doing like this- >>> >>> >>> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >>> >>> >>> >>> Then the above query will be evaluated as below right? >>> >>> >>> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10; >>> >>> >>> >>> So that means it will look for data in the corresponding dt partition *(20120806) >>> *only right as above table is partitioned on dt column ? And it will >>> not scan the whole table right?** >>> >>> >>> >>> *Raihan Jamal* >>> >>> >>> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <[EMAIL PROTECTED]>wrote: >>> >>>> Hi Jamal, >>>> >>>> Check if the function really returns what it should and that your data >>>> are really in yyyyMMdd format. You can do this by simple query like this: >>>> >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1; >>>> >>>> I don't see anything wrong with the function itself, it works well for >>>> me (although I tested it in hive 0.7.1). The only thing I would change >>>> about it would be to optimize it by calling 'new' only at the time of >>>> construction and reusing the object when the function is called, but that >>>> should not affect the functionality at all. >>>> >>>> Best regards, >>>> Jan >>>> >>>> >>>> >>>> >>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <[EMAIL PROTECTED]>wrote: >>>> >>>>> *Problem* >>>>> >>>>> I created the below UserDefinedFunction to get the yesterday's day in >>>>> the format I wanted as I will be passing the format into this below method >>>>> from the query. >>>>> >>>>> >>>>> >>>>> *public final class YesterdayDate extends UDF {* >>>>> >>>>> * * >>>>> >>>>> * public String evaluate(final String format) { * >>>>> >>>>> * DateFormat dateFormat = new >>>> +
Raihan Jamal 2012-08-07, 18:03
-
Re: Custom UserDefinedFunction in Hivekulkarni.swarnim@...) 2012-08-07, 17:35
Have you tried using EXPLAIN[1] on your query? I usually like to use that
to get a better understanding of what my query is actually doing and debugging at other times. [1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain On Tue, Aug 7, 2012 at 12:20 PM, Raihan Jamal <[EMAIL PROTECTED]> wrote: > Hi Jan, > > > I figured that out, it is working fine for me now. The only question I > have is, if I am doing like this- > > > > SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; > > > > Then the above query will be evaluated as below right? > > > > SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10; > > > > So that means it will look for data in the corresponding dt partition *(20120806) > *only right as above table is partitioned on dt column ? And it will not > scan the whole table right?** > > > > *Raihan Jamal* > > > > On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote: > >> Hi Jamal, >> >> Check if the function really returns what it should and that your data >> are really in yyyyMMdd format. You can do this by simple query like this: >> >> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1; >> >> I don't see anything wrong with the function itself, it works well for me >> (although I tested it in hive 0.7.1). The only thing I would change about >> it would be to optimize it by calling 'new' only at the time of >> construction and reusing the object when the function is called, but that >> should not affect the functionality at all. >> >> Best regards, >> Jan >> >> >> >> >> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <[EMAIL PROTECTED]>wrote: >> >>> *Problem* >>> >>> I created the below UserDefinedFunction to get the yesterday's day in >>> the format I wanted as I will be passing the format into this below method >>> from the query. >>> >>> >>> >>> *public final class YesterdayDate extends UDF {* >>> >>> * * >>> >>> * public String evaluate(final String format) { * >>> >>> * DateFormat dateFormat = new >>> SimpleDateFormat(format); * >>> >>> * Calendar cal = Calendar.getInstance();* >>> >>> * cal.add(Calendar.DATE, -1); * >>> >>> * return >>> dateFormat.format(cal.getTime()).toString(); * >>> >>> * } * >>> >>> *}* >>> >>> >>> >>> >>> >>> So whenever I try to run the query like below by adding the jar to >>> classpath and creating the temporary function yesterdaydate, I always get >>> zero result back- >>> >>> >>> >>> hive> create temporary function *yesterdaydate* as >>> 'com.example.hive.udf.YesterdayDate'; >>> >>> OK >>> >>> Time taken: 0.512 seconds >>> >>> >>> >>> Below is the query I am running- >>> >>> >>> >>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT >>> 10;* >>> >>> *OK* >>> >>> * * >>> >>> And I always get zero result back but the data is there in that table >>> for Aug 5th.** >>> >>> >>> >>> What wrong I am doing? Any suggestions will be appreciated. >>> >>> >>> >>> >>> >>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable >>> substitution thing, so I cannot use hiveconf here and the above table has >>> been partitioned on dt(date) column.** >>> >> >> > -- Swarnim +
kulkarni.swarnim@...) 2012-08-07, 17:35
-
Re: Custom UserDefinedFunction in HiveBejoy Ks 2012-08-08, 10:49
Hi Raihan
UDFs are evaluated at run time when the query is executed. But it is hive parser during query parse time decides the boundary of data to be used for the query, ie data from which all partitions has to be processed.Because of this the entire table will be scanned for your query. Regards, Bejoy KS ________________________________ From: Raihan Jamal <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Tuesday, August 7, 2012 10:50 PM Subject: Re: Custom UserDefinedFunction in Hive Hi Jan, I figured that out, it is working fine for me now. The only question I have is, if I am doing like this- SELECT * FROM REALTIME where dtyesterdaydate('yyyyMMdd') LIMIT 10; Then the above query will be evaluated as below right? SELECT * FROM REALTIME where dt= ���20120806’ LIMIT 10; So that means it will look for data in the corresponding dt partition (20120806) only right as above table is partitioned on dt column ? And it will not scan the whole table right? Raihan Jamal On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote: Hi Jamal, > > >Check if the function really returns what it should and that your data are really in yyyyMMdd format. You can do this by simple query like this: > > >SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1; > > >I don't see anything wrong with the function itself, it works well for me (although I tested it in hive 0.7.1). The only thing I would change about it would be to optimize it by calling 'new' only at the time of construction and reusing the object when the function is called, but that should not affect the functionality at all. > > >Best regards, >Jan > > > > > > > >On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <[EMAIL PROTECTED]> wrote: > >Problem >>I created the below UserDefinedFunction to get the yesterday's day in the format I wanted as I will be passing the format into this below method from the query. >>�� >>public final class YesterdayDate extends UDF { >> >>�� public String evaluate(final String format) { >> �� �� DateFormat dateFormat = new SimpleDateFormat(format); >> �� Calendar cal = Calendar.getInstance(); >> �� cal.add(Calendar.DATE, -1); >> �� �� return dateFormat.format(cal.getTime()).toString(); >> �� } >>} >> >> >>So whenever I try to run the query like below by adding the jar to classpath and creating the temporary function yesterdaydate, I always get zero result back- >> >>hive> create temporary function yesterdaydate as 'com.example.hive.udf.YesterdayDate'; >>OK >>Time taken: 0.512 seconds >> >>Below is the query I am running- >> >>hive> SELECT * FROM REALTIME where dtyesterdaydate('yyyyMMdd') LIMIT 10; >>OK >> >>And I always get zero result back but the data is there in that table for Aug 5th. >> >>What wrong I am doing? Any suggestions will be appreciated. >> >> >>NOTE:- As I am working with Hive 0.6 so it doesn’t support variable substitution thing, so I cannot use hiveconf here and the above table has been partitioned on dt(date) column. > +
Bejoy Ks 2012-08-08, 10:49
-
RE: Custom UserDefinedFunction in HiveVenkatesh Kavuluri 2012-08-08, 23:41
Actually as the custom UDF "yesterday()" mentioned below is NOT marked with the annotation @UDFType(deterministic = false), partition pruning should work in practice. The PartitionPruner has a logic around this annotation to check if a generic UDF is deterministic or not and would skip partition pruning if it finds any non-deterministic function.
http://svn.apache.org/repos/asf/hive/branches/branch-0.9/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionPruner.java I guess this check is implemented to avoid inconsistencies in result set that might arise in scenarios like below where predicate is :part_col = f(time) and regular_col = f(time). The expression involving "part_col" is evaluated at compile time and the expression involving "regular_col" is evaluated at run time and the function yesterday() might return different values if the query is executed around midnight. Thanks,Venkatesh Date: Wed, 8 Aug 2012 03:49:56 -0700 From: [EMAIL PROTECTED] Subject: Re: Custom UserDefinedFunction in Hive To: [EMAIL PROTECTED] Hi Raihan UDFs are evaluated at run time when the query is executed. But it is hive parser during query parse time decides the boundary of data to be used for the query, ie data from which all partitions has to be processed. Because of this the entire table will be scanned for your query. Regards,Bejoy KS From: Raihan Jamal <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Tuesday, August 7, 2012 10:50 PM Subject: Re: Custom UserDefinedFunction in Hive Hi Jan, I figured that out, it is working fine for me now. The only question I have is, if I am doing like this- SELECT * FROM REALTIME where dtyesterdaydate('yyyyMMdd') LIMIT 10; Then the above query will be evaluated as below right? SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10; So that means it will look for data in the corresponding dt partition (20120806) only right as above table is partitioned on dt column ? And it will not scan the whole table right? Raihan Jamal On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote: Hi Jamal, Check if the function really returns what it should and that your data are really in yyyyMMdd format. You can do this by simple query like this: SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1; I don't see anything wrong with the function itself, it works well for me (although I tested it in hive 0.7.1). The only thing I would change about it would be to optimize it by calling 'new' only at the time of construction and reusing the object when the function is called, but that should not affect the functionality at all. Best regards, Jan On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <[EMAIL PROTECTED]> wrote: Problem I created the below UserDefinedFunction to get the yesterday's day in the format I wanted as I will be passing the format into this below method from the query. public final class YesterdayDate extends UDF { public String evaluate(final String format) { DateFormat dateFormat = new SimpleDateFormat(format); Calendar cal = Calendar.getInstance(); cal.add(Calendar.DATE, -1); return dateFormat.format(cal.getTime()).toString(); } } So whenever I try to run the query like below by adding the jar to classpath and creating the temporary function yesterdaydate, I always get zero result back- hive> create temporary function yesterdaydate as 'com.example.hive.udf.YesterdayDate'; OK Time taken: 0.512 seconds Below is the query I am running- hive> SELECT * FROM REALTIME where dtyesterdaydate('yyyyMMdd') LIMIT 10; OK And I always get zero result back but the data is there in that table for Aug 5th. What wrong I am doing? Any suggestions will be appreciated. NOTE:- As I am working with Hive 0.6 so it doesn’t support variable substitution thing, so I cannot use hiveconf here and the above table has been partitioned on dt(date) column. +
Venkatesh Kavuluri 2012-08-08, 23:41
|