Hi Mahsa,

Just to elaborate a little further.

Bucket joins are quicker than regular joins because they lessen the number

of logical multiplications that need to happen between the records from two

tables being joined.

A regular bucket join would logically multiply each row from one table with

each row of another table, leading to O(n^2) logical multiplications. A

bucket map join reduces the number of multiplications by only joining the

corresponding buckets. The answer to your question lies in the explanation

of how Hive figures out which buckets from the tables correspond to each

other.

The default method by which Hive distributes rows across buckets is by

hash_function(bucketing_column) mod num_buckets (reference:

http://hive.apache.org/docs/r0.8.1/language_manual/working_with_bucketed_tables.html).

Given that the hash function is deterministic, it gives the same result for

same argument every single time it's called.

Let's say table t1 has n1 buckets and t2 has n2 buckets. Therefore a record

with key some_key would go in bucket number hash_function(some_key)/mod(n1)

in t1 and bucket number hash_function(some_key)/mod(n2) in t2.

If n1 and n2 are equal, the bucket number would be the same in both tables.

Consequently, Hive only needs to join same bucket numbers with each other,

because if a record with some join key is present in bucket i of table t1,

all records in table t2 with the same join key must be present in bucket i

of table t2.

Now, let's come to a case where the number of buckets of the 2 tables is

not equal. Without loss of generality, we can assume that t1 has n1 buckets

and t2 has n2 buckets where n1 < n2. For a bucket join to work, Hive has

to be able to deterministically figure out which bucket from table2 should

bucket i from table1 be joined with.

If n2 is a multiple of n1, then it can be proved that

hash_function(some_key)/mod (n2) is one of the following n2/n1 values:

hash_function(some_key)/mod(n1), hash_function(some_key)/mod(n1) + n1,

hash_function(some_key)/mod(n1) + 2 * n1, ....,

hash_function(some_key)/mod(n1) + (n2/n1) - 1. In other words, if a record

with a given join key exists in bucket i of table1, a record with the same

join key can only exist in bucket i, i + n1, i + 2*n1, ...., or i + (n2/n1)

- 1. Consequently, Hive only needs to logically multiply records from

bucket i of table1 with the above n2/n1 buckets from table2 to perform a

join.

If n2 is not a multiple of n1, no nice relationship between the mods of n1

and n2 exist so it can't be determined which bucket from table1 should be

joined with which bucket from table2. So we are back to joining the entire

table instead of individual buckets.

Therefore, the number of buckets in one table has to be a multiple of the

other in order for bucketed join to work.

Hope that helps. BTW, the above is just my understanding of bucketed joins,

I haven't verified that this in fact what the present code does.

Mark

On Wed, Oct 31, 2012 at 4:50 PM, Mahsa Mofidpoor <[EMAIL PROTECTED]>wrote:

> Hi,

>

> Hive summit 2011 says for performing a bucket join the number of buckets of

> all tables has to be a multiple of each other.

> Does anybody know the reason?

>

> Thanks you.

> Mahsa

>