-
Condition for doing a sort merge bucket map join
Bruce Bian 2012-05-22, 15:07
Hi, I have 7 large tables (each ~10 GB) to join into one table, all on the same 2 join keys. I've read some documentation on sort merge bucket (SMB) map join, but I have not been able to trigger it. I've bucketed all 7 tables into 20 buckets, sorted by one of the join keys, and set the following parameters when running the join:
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
What else am I missing? Do I have to bucket on both of the join keys (I'm currently trying this)? And does each bucket file have to be smaller than one HDFS block? Thanks a lot.
-
Re: Condition for doing a sort merge bucket map join
Mark Grover 2012-05-22, 15:43
Hi Bruce, Instead of joining all 7 tables in one query, can you please start with 2 tables and see if that works? If it doesn't, feel free to paste your table definitions and join query, along with any properties you are setting, and folks on the mailing list can take a stab at it. Mark
-
Re: Condition for doing a sort merge bucket map join
ameet chaubal 2012-05-22, 18:25
You should have bucket columns = join keys = sort columns. When this condition held, I was able to make SMB work. If even one of the join keys is a partition column (i.e. it cannot be part of the clustering/sorting set), it did not work for me. So I'd say check that all 7 tables are joined on the same join keys and that those keys are exactly the clustered/sorted columns. Sincerely, Ameet
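For illustration, the condition above (bucket columns = join keys = sort columns) could be set up roughly as follows for two tables, following Mark's suggestion to start small. This is only a sketch: the table names (t1, t2, t1_raw, t2_raw) and columns (k1, k2, v) are hypothetical and not from the thread, and the hive.enforce.* settings and the MAPJOIN hint are assumptions about what a Hive release of this era needs. Check them against the version you actually run.

```sql
-- Hypothetical tables: both are bucketed AND sorted on both join keys,
-- with the same number of buckets (20, as in the original question).
CREATE TABLE t1 (k1 INT, k2 INT, v STRING)
CLUSTERED BY (k1, k2) SORTED BY (k1, k2) INTO 20 BUCKETS;

CREATE TABLE t2 (k1 INT, k2 INT, v STRING)
CLUSTERED BY (k1, k2) SORTED BY (k1, k2) INTO 20 BUCKETS;

-- Assumption: these make INSERT actually write bucketed, sorted files.
-- Declaring CLUSTERED BY / SORTED BY in the DDL does not reorganize data by itself.
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

-- t1_raw and t2_raw stand in for wherever the unbucketed data lives.
INSERT OVERWRITE TABLE t1 SELECT k1, k2, v FROM t1_raw;
INSERT OVERWRITE TABLE t2 SELECT k1, k2, v FROM t2_raw;

-- The three settings from the original question.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Join on exactly the bucketed/sorted columns; EXPLAIN shows the chosen plan.
EXPLAIN
SELECT /*+ MAPJOIN(b) */ a.k1, a.k2, a.v, b.v
FROM t1 a JOIN t2 b ON a.k1 = b.k1 AND a.k2 = b.k2;
```

If the EXPLAIN output shows a sorted-merge bucket join operator rather than a common (reduce-side) join, the same pattern can then be extended to the remaining tables one at a time.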