Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> Effecient partitions usage in join


+
Dima Datsenko 2012-11-22, 13:56
+
Bennie Schut 2012-11-22, 14:27
+
Dima Datsenko 2012-11-22, 15:06
+
Bennie Schut 2012-11-23, 10:50
Copy link to this message
-
Re: Effecient partitions usage in join
A quick suggestion. If the format never changes, you might as well make the
string and SimpleDateFormat object static to eliminate the overhead of
creating them in every call to evaluate.

On Fri, Nov 23, 2012 at 4:50 AM, Bennie Schut <[EMAIL PROTECTED]> wrote:

> Well this is the udf:****
>
> ** **
>
> package com.ebuddy.dwhhive.udf;****
>
> ** **
>
> import org.apache.hadoop.hive.ql.exec.Description;****
>
> import org.apache.hadoop.hive.ql.exec.UDF;****
>
> import org.apache.hadoop.hive.ql.udf.UDFType;****
>
> import org.apache.hadoop.io.Text;****
>
> ** **
>
> import java.text.SimpleDateFormat;****
>
> import java.util.Calendar;****
>
> ** **
>
> @Description(****
>
>         name = "currentisodate",****
>
>         value = "currentisodate() - Get the current date. Incorrectly made"
> ****
>
>                 + " deterministic to get partition pruning to work."****
>
> )****
>
> ** **
>
> @UDFType(deterministic = true)****
>
> public class CurrentIsoDate extends UDF {****
>
> ** **
>
>     public static Text evaluate() {****
>
>         String pattern = "yyyy-MM-dd";****
>
>         SimpleDateFormat timeFormat = new SimpleDateFormat(pattern);****
>
>         return new
> Text(timeFormat.format(Calendar.getInstance().getTime()));****
>
>     }****
>
> }****
>
> ** **
>
> ** **
>
> And this is how we use it to query the last 30days:****
>
> ADD jar /opt/hive/udf/udf-2.1.2-jar-with-dependencies.jar;****
>
> CREATE TEMPORARY FUNCTION currectisodate AS
> 'com.ebuddy.dwhhive.udf.CurrentIsoDate';****
>
> select count(*) from test where record_date_iso >> DATE_SUB(currentisodate(), 30);****
>
> ** **
>
> I’ve always had a preference for iso dates since they sort nicely:
> 2012-11-23 but you can obviously pick your own pattern.****
>
> ** **
>
> ** **
>
> *From:* Dima Datsenko [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, November 22, 2012 4:07 PM
> *To:* Bennie Schut; [EMAIL PROTECTED]
>
> *Subject:* RE: Effecient partitions usage in join****
>
> ** **
>
> Hi Benny,****
>
>  ****
>
> The udf solution sounds like a plan. Much better than generating hive
> query with hardcoded partition out of table B. Can you please provide a
> sample of what you’re doing there?****
>
>  ****
>
> Thanks,****
>
> Dima****
>
>  ****
>
> *From:* Bennie Schut [mailto:[EMAIL PROTECTED] <[EMAIL PROTECTED]>]
> *Sent:* יום ה 22 נובמבר 2012 16:28
> *To:* [EMAIL PROTECTED]
> *Cc:* Dima Datsenko
> *Subject:* RE: Effecient partitions usage in join****
>
>  ****
>
> Unfortunately at the moment partition pruning is a bit limited in hive.
> When hive creates the query plan it decides what partitions to use. So if
> you put hardcoded list of partition_id items in the where clause it will
> know what to do. In the case of a join (or a subquery) it would have to run
> the query before it can know what it can prune.  There are obvious
> solutions to this but they are simply not implemented at the moment.****
>
> Generally speaking people try to work around this by not normalizing the
> data. So if you plan on doing a clean star schema with a calendar table
> then do yourself a favor and but the actual date in the fact table and not
> a meaningless key.****
>
> It’s also good to realize you can (in some special cases) work around it
> by using udf’s. I’ve used it once by creating a udf which produced the
> current date which I flagged as deterministic (ugly I know). This causes
> the planner to run the udf during planning and use the result as if it’s a
> constant and thus partition pruning works again. It’s currently the only
> way I know to select x days of data with partition pruning working.****
>
>  ****
>
>  ****
>
> *From:* Dima Datsenko [mailto:[EMAIL PROTECTED] <[EMAIL PROTECTED]>]
> *Sent:* Thursday, November 22, 2012 2:56 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Effecient partitions usage in join****
>
>  ****
>
> Hi Guys,****
>
>  ****
>
> I wonder if you could help me.****
>
>  ****
>
> I have a huge Hive table partitioned by some field. It has thousands of

*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330
+
Dima Datsenko 2012-11-23, 13:46
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB