Re: MapSide join in Hive
Viraj,

1. No, the tables do not need to be sorted on "id" for a map-side join.
2. Yes, the smaller table needs to fit in JVM memory (a small table of
more than about 1GB is typically too large).
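
As a rough sketch of the map-side join itself, assuming hypothetical
tables big_table (the 1.5TB one) and small_table (the 350MB one), each
with "id" and "val" columns, the MAPJOIN hint names the table to hold
in memory:

```sql
-- Table and column names are hypothetical; only the MAPJOIN hint is Hive syntax.
-- small_table (alias s) is loaded into memory by each mapper, and
-- big_table (alias b) is joined against it on the map side.
SELECT /*+ MAPJOIN(s) */ b.id, b.val, s.val
FROM big_table b
JOIN small_table s ON (b.id = s.id);
```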

See slide 7 and after in this preso for different join strategies that
can help in case the tables are bucketed and sorted.

http://www.slideshare.net/zshao/hive-user-meeting-march-2010-hive-team

There is also the /*+STREAMTABLE(tablealias)*/ hint, which you should
use for very large tables (or make sure the large table is the rightmost
table in the join clause).
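
A minimal sketch of the streaming hint, assuming hypothetical tables
big_table and small_table with "id" and "val" columns, for a regular
(reduce-side) join where the small table is not held in mapper memory:

```sql
-- Table and column names are hypothetical; only the STREAMTABLE hint is Hive syntax.
-- big_table (alias b) is streamed through the reducers, while the
-- rows of the other table are buffered.
SELECT /*+ STREAMTABLE(b) */ b.id, b.val, s.val
FROM big_table b
JOIN small_table s ON (b.id = s.id);
```

Without the hint, Hive streams the last table in the join clause, so
listing big_table last has the same effect.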

-- amr

On 6/24/2010 10:43 AM, Viraj Bhat wrote:
>
> Hi all,
>
>  I am joining 2 datasets, one is around 1.5TB in size and the other is
> around 350MB in size.
>
> I wanted to do a Map Side join using "id" as the join column between
> the two tables. I read about the Mapside join in Hive.
>
> http://wiki.apache.org/hadoop/Hive/LanguageManual/Joins. Are there
> some technical specs on Mapside join on a wiki/jira?
>
> Here are some questions:
>
> 1) Do the tables need to be sorted on "id"?
>
> 2) Is there a restriction on the smaller table size?
>
> Are there other join optimizations that Hive provides which I can
> apply here?
>
> Viraj
>