Hive user mailing list: Hive Queries Performance Tuning - Map side joins, Map side aggregations, Partitioning/Clustering


Ladda, Anand 2012-04-01, 18:29
Re: Hive Queries Performance Tuning - Map side joins, Map side aggregations, Partitioning/Clustering
Anand,

The best place to understand join queries in Hive is the presentation
by Namit Jain from Facebook.

Here is the PDF:
https://cwiki.apache.org/Hive/presentations.data/Hive%20Summit%202011-join.pdf

You can also search for the video on YouTube. It's very well described.
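
For readers who want something concrete before opening the slides, here is a minimal sketch of the two usual ways to request a map-side join. The tables orderdetail (large) and employee (small) and their columns are hypothetical, not from this thread, and the threshold value is what I believe the usual default to be, so check it against your Hive version. On the question of leaving map-side joins on all the time: the small table has to fit in each mapper's memory, which is the main reason a conversion can be unsuitable.

```sql
-- Option 1: let Hive convert a common join to a map-side join when one side is small.
set hive.auto.convert.join=true;
-- Size threshold in bytes for the "small" table (assumed default: about 25 MB).
set hive.mapjoin.smalltable.filesize=25000000;

-- Option 2: name the small table explicitly with a hint.
-- orderdetail and employee are hypothetical tables used only for illustration.
SELECT /*+ MAPJOIN(e) */ e.emp_name, sum(o.qty_sold)
FROM orderdetail o
JOIN employee e ON (o.emp_id = e.emp_id)
GROUP BY e.emp_name;
```

Either way the small table is loaded into a hash table on each mapper, so the join itself needs no shuffle.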

On Sun, Apr 1, 2012 at 11:59 PM, Ladda, Anand <[EMAIL PROTECTED]> wrote:

> I am trying to understand what options/settings are available to tune the
> performance of Hive queries. I have seen the benefits of map-side joins and
> partitioning/clustering. However, I have yet to see the impact map-side
> aggregation has on query performance. I tried running this query with and
> without map-side aggregation turned on and did not see much difference in
> the execution times. The raw data in this partition is about 5.5 million.
> Looking for some pointers on what type of queries benefit from map-side
> aggregation.
>
> set hive.auto.convert.join=false;
>
> set hive.map.aggr=false;
>
> Non-partitioned, non-clustered single table with where clause on date and
> no map-side aggregation:
>
> select a11.emp_id, count(1), count (distinct a11.customer_id),
> sum(a11.qty_sold) from orderdetailrcfile a11 where order_date ='01-01-2008'
> group by a11.emp_id;
>
> 400 secs
>
> set hive.map.aggr=true;
>
> Non-partitioned, non-clustered single table with where clause on date and
> map-side aggregation:
>
> select a11.emp_id, count(1), count (distinct a11.customer_id),
> sum(a11.qty_sold) from orderdetailrcfile a11 where order_date ='01-01-2008'
> group by a11.emp_id;
>
> 390 secs
>
> Also, is there any reason not to turn on map-side joins all the time? In my
> tests I have always seen the performance either stay the same or improve
> with map-side joins turned on. Are there any other parameters or Hive
> features that can help improve the performance of Hive queries?
>
> Thanks
>
> Anand

--
Nitin Pawar
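
On the map-side aggregation part of the question, a minimal sketch of how one might check whether hive.map.aggr is actually taking effect for the query quoted above. The query is Anand's; the extra settings are standard Hive properties, but the values shown are what I believe the defaults to be, so treat them as assumptions for your version.

```sql
set hive.map.aggr=true;
-- Fraction of mapper memory the aggregation hash table may use (assumed default 0.5).
set hive.map.aggr.hash.percentmemory=0.5;
-- Hive gives up on map-side aggregation if the hash table does not shrink the
-- input by at least this ratio (assumed default 0.5).
set hive.map.aggr.hash.min.reduction=0.5;

-- EXPLAIN prints the plan without running the job. With map-side aggregation
-- active, the map stage should contain a Group By Operator in hash mode.
EXPLAIN
select a11.emp_id, count(1), count(distinct a11.customer_id), sum(a11.qty_sold)
from orderdetailrcfile a11
where order_date = '01-01-2008'
group by a11.emp_id;
```

One thing to look for in the plan is the key list of the map-side Group By. As far as I understand it, a count(distinct ...) adds the distinct column to the map-side keys, so the mappers aggregate per (emp_id, customer_id) rather than per emp_id. If most such pairs are unique there is little to combine, which may explain a 400 vs. 390 second result. Queries with few distinct group-by keys and no distinct aggregates tend to benefit most.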
Bejoy Ks 2012-04-01, 21:34
Ladda, Anand 2012-04-03, 13:33