I am testing an SMB (sort-merge-bucket) join on two large tables. Both
tables are bucketed and sorted on the join column. I notice that even though
both tables are large, Hive attempts to generate a hash table for the
'small' table locally, as it would for a map join. Since that table is large
in my case, the client runs out of memory and the query fails.
I am using Hive 0.12 with the following settings:
set hive.input.format =
My test query does a simple join and a select, with no subqueries or nested
queries.
I understand why a (bucket) map join requires hash table generation, but
why is that step included for an SMB join? Shouldn't an SMB join just spin up
one mapper for each bucket and perform a sort-merge join directly in the
mapper?
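
To make the expectation concrete, here is a minimal sketch of the merge step I
would expect each mapper to run on one pair of matching buckets. It is plain
Python, not Hive code, and the name sort_merge_join and the (key, value) row
layout are my own assumptions for illustration. Both inputs are assumed to be
already sorted on the join key, as the bucketed tables are.

```python
def sort_merge_join(left, right):
    """Inner-join two iterables of (key, value) rows, both sorted by key.

    Illustrative sketch only: rows are streamed, and no hash table of
    either input is built.
    """
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            # Left key is behind: advance the left side.
            l = next(left_it, None)
        elif l[0] > r[0]:
            # Right key is behind: advance the right side.
            r = next(right_it, None)
        else:
            key = l[0]
            # Buffer only the right-side rows that share this one key.
            right_group = []
            while r is not None and r[0] == key:
                right_group.append(r[1])
                r = next(right_it, None)
            # Pair every left row with this key against the buffered group.
            while l is not None and l[0] == key:
                for right_value in right_group:
                    yield (key, l[1], right_value)
                l = next(left_it, None)


# Two small "buckets", each sorted on the join key.
bucket_a = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
bucket_b = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined = list(sort_merge_join(bucket_a, bucket_b))
```

In this sketch, memory use is bounded by the largest group of rows sharing a
single key on one side, not by the size of either table, which is why I did
not expect a local hash table build for this kind of join.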