

Re: number of buckets in bucket map join
Hi Mahsa,
Just to elaborate a little further. Bucket joins are quicker than regular joins because they lessen the number of logical multiplications that need to happen between the records from two tables being joined. A regular bucket join would logically multiply each row from one table with each row of another table, leading to O(n^2) logical multiplications. A bucket map join reduces the number of multiplications by only joining the corresponding buckets. The answer to your question lies in the explanation of how Hive figures out which buckets from the tables correspond to each other. The default method by which Hive distributes rows across buckets is by hash_function(bucketing_column) mod num_buckets (reference: http://hive.apache.org/docs/r0.8.1/language_manual/working_with_bucketed_tables.html). Given that the hash function is deterministic, it gives the same result for same argument every single time it's called. Let's say table t1 has n1 buckets and t2 has n2 buckets. Therefore a record with key some_key would go in bucket number hash_function(some_key)/mod(n1) in t1 and bucket number hash_function(some_key)/mod(n2) in t2. If n1 and n2 are equal, the bucket number would be the same in both tables. Consequently, Hive only needs to join same bucket numbers with each other, because if a record with some join key is present in bucket i of table t1, all records in table t2 with the same join key must be present in bucket i of table t2. Now, let's come to a case where the number of buckets of the 2 tables is not equal. Without loss of generality, we can assume that t1 has n1 buckets and t2 has n2 buckets where n1 < n2. For a bucket join to work, Hive has to be able to deterministically figure out which bucket from table2 should bucket i from table1 be joined with. If n2 is a multiple of n1, then it can be proved that hash_function(some_key)/mod (n2) is one of the following n2/n1 values: hash_function(some_key)/mod(n1), hash_function(some_key)/mod(n1) + n1, hash_function(some_key)/mod(n1) + 2 * n1, ...., hash_function(some_key)/mod(n1) + (n2/n1)  1. In other words, if a record with a given join key exists in bucket i of table1, a record with the same join key can only exist in bucket i, i + n1, i + 2*n1, ...., or i + (n2/n1)  1. Consequently, Hive only needs to logically multiply records from bucket i of table1 with the above n2/n1 buckets from table2 to perform a join. If n2 is not a multiple of n1, no nice relationship between the mods of n1 and n2 exist so it can't be determined which bucket from table1 should be joined with which bucket from table2. So we are back to joining the entire table instead of individual buckets. Therefore, the number of buckets in one table has to be a multiple of the other in order for bucketed join to work. Hope that helps. BTW, the above is just my understanding of bucketed joins, I haven't verified that this in fact what the present code does. Mark On Wed, Oct 31, 2012 at 4:50 PM, Mahsa Mofidpoor <[EMAIL PROTECTED]>wrote: > Hi, > > Hive summit 2011 says for performing a bucket join the number of buckets of > all tables has to be a multiple of each other. > Does anybody know the reason? > > Thanks you. > Mahsa > 

