Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # dev >> Why does SMB join generate hash table locally, even if input tables are large?


Copy link to this message
-
Why does SMB join generate hash table locally, even if input tables are large?
Hi,

I am testing SMB join for 2 large tables. The tables are bucketed and
sorted on the join column. I notice that even though the table is large,
Hive attempts to generate hash table for the 'small' table locally,
 similar to map join. Since the table is large in my case, the client runs
out of memory and the query fails.

I am using Hive 0.12 with the following settings:

set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format =
org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

My test query does a simple join and a select, no subqueries/nested queries
etc.

I understand why a (bucket) map join requires hash table generation, but
why is that included for an SMB join? Shouldn't a SMB join just spin up one
mapper for each bucket and perform a sort merge join directly on the mapper?
Thanks,
pala

 
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB