Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> map join, bucket map join, and sort merge bucket map join


Copy link to this message
-
map join, bucket map join, and sort merge bucket map join
Hi,
 When I want to do bucket map join between tables A and B (A has small
buckets), which should load matching buckets of A into each mapper of B,
the things I do is (1) bucket A and B on the join keys with equal number of
buckets, (2) set hive.optimize.bucketmapjoin = true; (3) run the following
query:

SELECT /*+ MAPJOIN(a) */
 count(*)
FROM a JOIN B ON  a.join_key = b.join_key;

 Question 1: Is my sequence of actions enough for triggering bucket map
join?

After running the above query, what I observe is that a local task starts
hashing A, and creating multiple hashed buckets and then loads it into
mapper of B, and starts a job which doesn't have any reducers.

 Question 2: Why is it hashing it in the local task and not in the mapper
stage? This seems to take a lot of time, which can be avoided. I mean why
it doesn't upload the row bucket instead of hashed bucket, and do the
hashing in the mapper (and parallelize the hashing, instead of doing it
sequentialy)?

 If I have tables A and B sorted and bucketed by join keys and have the
same number of buckets and setting the 3 flags specified in the
manual<http://archive.cloudera.com/cdh/3/hive/language_manual/joins.html>,
and again running the above query, I see that this time it doesn't do the
hashing, but it does the join in just 1 mapper.

 Question 3: Why is it doing it in one mapper? Joins between different
buckets seem to be independent, so why it doesn't doing it in multiple
mappers?

 Thanks
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB