|
|
-
RE: [Marketing Mail] Re: Why BucketJoinMap consume too much memoryLadda, Anand 2012-04-10, 20:23
Hi Bejoy/Binh
Been following this thread to better understand where bucket map join would help and it’s been a great thread to follow. I have struggling with this on my end as well. I have two tables one of which is about 22GB (orderdetailpartclust2) in size and the other is 1.5GB (orderfactpartclust2) in size (all partitions combined) and I wanted to see the impact of different kind of joins on one of the partitions of these table . I created a partitioned (order_date) and bucketed (on order_id, on which I want to join these tables) version for these tables for this analysis. Data was loaded from their non-partitioned counterparts and setting the following parameters to ensure that data makes it into the right partitions and is bucketed correctly by Hive set hive.exec.dynamic.partition.mode=nonstrict; set hive.exec.dynamic.partition=true; SET hive.exec.max.dynamic.partitions=100000; SET hive.exec.max.dynamic.partitions.pernode=100000; set hive.enforce.bucketing = true; However when I try to do the following join query, I don’t get any bucketed map side join select /*+ MAPJOIN(orderfactpartclust2) */ a12.shipper_id, count(1), count (distinct a11.customer_id), sum(a11.qty_sold) from orderfactpartclust2 a12 join orderdetailpartclust2 a11 on (a11.order_id = a12.order_id) where (a11.order_date = '09-30-2008' and a12.order_date = '2008-09-30') group by a12.shipper_id; Below are the relevant pieces of information on each of these tables. Can you please help take a look to see what I might be missing to get map side joins. Is it because my tables are also partitioned that this isn’t working? 1. hive> describe formatted orderdetailpartclust2; OK # col_name data_type comment order_id int from deserializer item_id int from deserializer emp_id int from deserializer promotion_id int from deserializer customer_id int from deserializer qty_sold float from deserializer unit_price float from deserializer unit_cost float from deserializer discount float from deserializer # Partition Information # col_name data_type comment order_date string None # Detailed Table Information Database: default Owner: hdfs CreateTime: Thu Apr 05 17:01:22 EDT 2012 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2 Table Type: MANAGED_TABLE Table Parameters: SORTBUCKETCOLSPREFIX TRUE numFiles 19200 numPartitions 75 numRows 0 totalSize 22814162038 transient_lastDdlTime 1333725153 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe InputFormat: org.apache.hadoop.hive.ql.io.RCFileInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.RCFileOutputFormat Compressed: No Num Buckets: 256 Bucket Columns: [order_id] Sort Columns: [Order(col:order_id, order:1)] Storage Desc Params: escape.delim \\ field.delim \t serialization.format \t Time taken: 3.255 seconds 2. hive> describe formatted orderfactpartclust2; OK # col_name data_type comment order_id int from deserializer emp_id int from deserializer order_amt float from deserializer order_cost float from deserializer qty_sold float from deserializer freight float from deserializer gross_dollar_sales float from deserializer ship_date string from deserializer rush_order string from deserializer customer_id int from deserializer pymt_type int from deserializer shipper_id int from deserializer # Partition Information # col_name data_type comment order_date string None # Detailed Table Information Database: default Owner: hdfs CreateTime: Thu Apr 05 18:09:28 EDT 2012 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2 Table Type: MANAGED_TABLE Table Parameters: SORTBUCKETCOLSPREFIX TRUE numFiles 7680 numPartitions 30 numRows 0 totalSize 1528946078 transient_lastDdlTime 1333722539 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe InputFormat: org.apache.hadoop.hive.ql.io.RCFileInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.RCFileOutputFormat Compressed: No Num Buckets: 256 Bucket Columns: [order_id] Sort Columns: [Order(col:order_id, order:1)] Storage Desc Params: escape.delim \\ field.delim \t serialization.format \t Time taken: 1.737 seconds 3. -bash-4.1$ hadoop fs -du /user/hive/warehouse/orderdetailpartclust2; 299867901 hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_date=01-01-2008 . . . 311033139 hdfs:/ +
binhnt22 2012-04-11, 01:36
|