Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Re: Custom UserDefinedFunction in Hive


Copy link to this message
-
RE: Custom UserDefinedFunction in Hive
Actually as the custom UDF "yesterday()" mentioned below is NOT marked with the annotation @UDFType(deterministic = false), partition pruning should work in practice.  The PartitionPruner has a logic around this annotation to check if a generic UDF is deterministic or not and would skip partition pruning if it finds any non-deterministic function.  
http://svn.apache.org/repos/asf/hive/branches/branch-0.9/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionPruner.java
I guess this check is implemented to avoid inconsistencies in result set that might arise in scenarios like below where predicate is :part_col = f(time) and regular_col = f(time).
The expression involving "part_col" is evaluated at compile time and the expression involving "regular_col" is evaluated at run time and the function yesterday() might return different values if the query is executed around midnight.
Thanks,Venkatesh

Date: Wed, 8 Aug 2012 03:49:56 -0700
From: [EMAIL PROTECTED]
Subject: Re: Custom UserDefinedFunction in Hive
To: [EMAIL PROTECTED]

Hi Raihan
UDFs are evaluated at run time when the query is executed. But it is hive parser during query parse time decides the boundary of data to be used for the query, ie data from which all partitions has to be processed. Because of this the entire table will be scanned for your query.
 Regards,Bejoy KS
        From: Raihan Jamal <[EMAIL PROTECTED]>
 To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
 Sent: Tuesday, August 7, 2012 10:50 PM
 Subject: Re: Custom UserDefinedFunction in Hive
  

Hi Jan,

 

I figured that out, it is working fine for me now. The only question I have is, if I am doing like this-

 

SELECT * FROM REALTIME where dtyesterdaydate('yyyyMMdd') LIMIT 10;

 

Then the above query will be evaluated as below right?

 

SELECT * FROM REALTIME where dt= ‘20120806’
LIMIT 10;

 

So that means it will look for data in the corresponding dt
partition (20120806) only right as above table is partitioned on dt column ?
And it will not scan the whole table right?
Raihan Jamal

On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote:
Hi Jamal,
Check if the function really returns what it should and that your data are really in yyyyMMdd format. You can do this by simple query like this:
SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
I don't see anything wrong with the function itself, it works well for me (although I tested it in hive 0.7.1). The only thing I would change about it would be to optimize it by calling 'new' only at the time of construction and reusing the object when the function is called, but that should not affect the functionality at all.
Best regards,

Jan

On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <[EMAIL PROTECTED]> wrote:
Problem

I created the below UserDefinedFunction to get the yesterday's
day in the format I wanted as I will be passing the format into this below
method from the query.

 

public final class YesterdayDate extends UDF {

 

              
public String evaluate(final String format) {

                              
DateFormat dateFormat = new SimpleDateFormat(format);

                              
Calendar cal = Calendar.getInstance();

                              
cal.add(Calendar.DATE, -1);    

                              
return dateFormat.format(cal.getTime()).toString();

              
}

}

 

 

So whenever I try to run the query like below by adding the jar
to classpath and creating the temporary function yesterdaydate, I always get
zero result back-

 

hive> create temporary function yesterdaydate as
'com.example.hive.udf.YesterdayDate';

OK

Time taken: 0.512 seconds

 

Below is the query I am running-

 

hive> SELECT * FROM REALTIME where dtyesterdaydate('yyyyMMdd') LIMIT 10;

OK

 

And I always get zero result back but the data is there in that
table for Aug 5th.

 

What wrong I am doing? Any suggestions will be appreciated.

 

 

NOTE:- As I am working with Hive 0.6 so it doesn’t support
variable substitution thing, so I cannot use hiveconf here and the above table
has been partitioned on dt(date) column.

        
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB