Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - map join, bucket map join, and sort merge bucket map join


Copy link to this message
-
map join, bucket map join, and sort merge bucket map join
Hadi Moshayedi 2012-10-06, 10:24
Hi,
 When I want to do bucket map join between tables A and B (A has small
buckets), which should load matching buckets of A into each mapper of B,
the things I do is (1) bucket A and B on the join keys with equal number of
buckets, (2) set hive.optimize.bucketmapjoin = true; (3) run the following
query:

SELECT /*+ MAPJOIN(a) */
 count(*)
FROM a JOIN B ON  a.join_key = b.join_key;

 Question 1: Is my sequence of actions enough for triggering bucket map
join?

After running the above query, what I observe is that a local task starts
hashing A, and creating multiple hashed buckets and then loads it into
mapper of B, and starts a job which doesn't have any reducers.

 Question 2: Why is it hashing it in the local task and not in the mapper
stage? This seems to take a lot of time, which can be avoided. I mean why
it doesn't upload the row bucket instead of hashed bucket, and do the
hashing in the mapper (and parallelize the hashing, instead of doing it
sequentialy)?

 If I have tables A and B sorted and bucketed by join keys and have the
same number of buckets and setting the 3 flags specified in the
manual<http://archive.cloudera.com/cdh/3/hive/language_manual/joins.html>,
and again running the above query, I see that this time it doesn't do the
hashing, but it does the join in just 1 mapper.

 Question 3: Why is it doing it in one mapper? Joins between different
buckets seem to be independent, so why it doesn't doing it in multiple
mappers?

 Thanks