Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # dev >> number of buckets in bucket map join


Copy link to this message
-
Re: number of buckets in bucket map join
Hi Mahsa,
Just to elaborate a little further.
Bucket joins are quicker than regular joins because they lessen the number
of logical multiplications that need to happen between the records from two
tables being joined.

A regular bucket join would logically multiply each row from one table with
each row of another table, leading to O(n^2) logical multiplications. A
bucket map join reduces the number of multiplications by only joining the
corresponding buckets. The answer to your question lies in the explanation
of how Hive figures out which buckets from the tables correspond to each
other.

The default method by which Hive distributes rows across buckets is by
hash_function(bucketing_column) mod num_buckets (reference:
http://hive.apache.org/docs/r0.8.1/language_manual/working_with_bucketed_tables.html).
Given that the hash function is deterministic, it gives the same result for
same argument every single time it's called.

Let's say table t1 has n1 buckets and t2 has n2 buckets. Therefore a record
with key some_key would go in bucket number hash_function(some_key)/mod(n1)
in t1 and bucket number hash_function(some_key)/mod(n2) in t2.

If n1 and n2 are equal, the bucket number would be the same in both tables.
Consequently, Hive only needs to join same bucket numbers with each other,
because if a record with some join key is present in bucket i of table t1,
all records in table t2 with the same join key must be present in bucket i
of table t2.

Now, let's come to a case where the number of buckets of the 2 tables is
not equal. Without loss of generality, we can assume that t1 has n1 buckets
and t2 has n2 buckets where  n1 < n2. For a bucket join to work, Hive has
to be able to deterministically figure out which bucket from table2 should
bucket i from table1 be joined with.

If n2 is a multiple of n1, then it can be proved that
hash_function(some_key)/mod (n2) is one of the following n2/n1 values:
hash_function(some_key)/mod(n1), hash_function(some_key)/mod(n1) + n1,
hash_function(some_key)/mod(n1) + 2 * n1, ....,
hash_function(some_key)/mod(n1) + (n2/n1) - 1. In other words, if a record
with a given join key exists in bucket i of table1, a record with the same
join key can only exist in bucket i, i + n1, i + 2*n1, ...., or i + (n2/n1)
- 1. Consequently, Hive only needs to logically multiply records from
bucket i of table1 with the above n2/n1 buckets from table2 to perform a
join.

If n2 is not a multiple of n1, no nice relationship between the mods of n1
and n2 exist so it can't be determined which bucket from table1 should be
joined with which bucket from table2. So we are back to joining the entire
table instead of individual buckets.

Therefore, the number of buckets in one table has to be a multiple of the
other in order for bucketed join to work.

Hope that helps. BTW, the above is just my understanding of bucketed joins,
I haven't verified that this in fact what the present code does.

Mark

On Wed, Oct 31, 2012 at 4:50 PM, Mahsa Mofidpoor <[EMAIL PROTECTED]>wrote:

> Hi,
>
> Hive summit 2011 says for performing a bucket join the number of buckets of
> all tables has to be a multiple of each other.
> Does anybody know the reason?
>
> Thanks you.
> Mahsa
>