Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive, mail # user - RE: [Marketing Mail] Re: Why BucketJoinMap consume too much memory


Copy link to this message
-
RE: [Marketing Mail] Re: Why BucketJoinMap consume too much memory
Ladda, Anand 2012-04-10, 20:23
Hi Bejoy/Binh
Been following this thread to better understand where bucket map join would help and it’s been a great thread to follow. I have struggling with this on my end as well.

I have two tables one of which is about 22GB (orderdetailpartclust2) in size and the other is 1.5GB (orderfactpartclust2) in size (all partitions combined) and I wanted to see the impact of different kind of joins on one of the partitions of these table .

I created a partitioned (order_date) and bucketed (on order_id, on which I want to join these tables) version for these tables for this analysis. Data was loaded from their non-partitioned counterparts and setting the following parameters to ensure that data makes it into the right partitions and is bucketed correctly by Hive

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
set hive.enforce.bucketing = true;

However when I try to do the following join query, I don’t get any bucketed map side join

select /*+ MAPJOIN(orderfactpartclust2) */ a12.shipper_id, count(1), count (distinct a11.customer_id), sum(a11.qty_sold) from orderfactpartclust2 a12 join orderdetailpartclust2 a11 on (a11.order_id = a12.order_id) where (a11.order_date = '09-30-2008' and a12.order_date = '2008-09-30') group by a12.shipper_id;

Below are the relevant pieces of information on each of these tables. Can you please help take a look to see what I might be missing to get map side joins. Is it because my tables are also partitioned that this isn’t working?

1.       hive> describe formatted orderdetailpartclust2;
OK
# col_name              data_type               comment

order_id                int                     from deserializer
item_id                 int                     from deserializer
emp_id                  int                     from deserializer
promotion_id            int                     from deserializer
customer_id             int                     from deserializer
qty_sold                float                   from deserializer
unit_price              float                   from deserializer
unit_cost               float                   from deserializer
discount                float                   from deserializer

# Partition Information
# col_name              data_type               comment

order_date              string                  None

# Detailed Table Information
Database:               default
Owner:                  hdfs
CreateTime:             Thu Apr 05 17:01:22 EDT 2012
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2
Table Type:             MANAGED_TABLE
Table Parameters:
        SORTBUCKETCOLSPREFIX    TRUE
        numFiles                19200
        numPartitions           75
        numRows                 0
        totalSize               22814162038
        transient_lastDdlTime   1333725153

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.RCFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.RCFileOutputFormat
Compressed:             No
Num Buckets:            256
Bucket Columns:         [order_id]
Sort Columns:           [Order(col:order_id, order:1)]
Storage Desc Params:
        escape.delim            \\
        field.delim             \t
        serialization.format    \t
Time taken: 3.255 seconds

2.       hive> describe formatted orderfactpartclust2;
OK
# col_name              data_type               comment

order_id                int                     from deserializer
emp_id                  int                     from deserializer
order_amt               float                   from deserializer
order_cost              float                   from deserializer
qty_sold                float                   from deserializer
freight                 float                   from deserializer
gross_dollar_sales      float                   from deserializer
ship_date               string                  from deserializer
rush_order              string                  from deserializer
customer_id             int                     from deserializer
pymt_type               int                     from deserializer
shipper_id              int                     from deserializer

# Partition Information
# col_name              data_type               comment

order_date              string                  None

# Detailed Table Information
Database:               default
Owner:                  hdfs
CreateTime:             Thu Apr 05 18:09:28 EDT 2012
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2
Table Type:             MANAGED_TABLE
Table Parameters:
        SORTBUCKETCOLSPREFIX    TRUE
        numFiles                7680
        numPartitions           30
        numRows                 0
        totalSize               1528946078
        transient_lastDdlTime   1333722539

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.RCFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.RCFileOutputFormat
Compressed:             No
Num Buckets:            256
Bucket Columns:         [order_id]
Sort Columns:           [Order(col:order_id, order:1)]
Storage Desc Params:
        escape.delim            \\
        field.delim             \t
        serialization.format    \t
Time taken: 1.737 seconds
3.       -bash-4.1$ hadoop fs -du /user/hive/warehouse/orderdetailpartclust2;
299867901   hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_date=01-01-2008
.
.
.
311033139   hdfs:/
+
binhnt22 2012-04-11, 01:36