Hive >> mail # user >> MapSide join in Hive


Viraj Bhat 2010-06-24, 17:43
Re: MapSide join in Hive
Viraj,

1. No, the tables do not need to be sorted on "id".
2. Yes, the smaller table needs to fit in JVM memory (typically, more
than 1 GB is too large for the small table).
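
As a rough sketch of what a map-side join looks like in a query: the
table names big_table (the 1.5TB dataset) and small_table (the 350MB
dataset) and the columns payload and attr are hypothetical, only "id"
comes from the question. The MAPJOIN hint names the table to hold in
memory:

```sql
-- Hypothetical tables: big_table (~1.5TB) and small_table (~350MB).
-- The MAPJOIN hint asks Hive to load small_table (alias s) into memory
-- in each mapper and join it against big_table without a reduce phase.
SELECT /*+ MAPJOIN(s) */ b.id, b.payload, s.attr
FROM big_table b
JOIN small_table s ON (b.id = s.id);
```

Neither table has to be sorted for this to work, but the hinted table
has to fit in the mapper's heap.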

See slide 7 onward in this presentation for other join strategies that
can help if the tables are bucketed and sorted.

http://www.slideshare.net/zshao/hive-user-meeting-march-2010-hive-team

There is also the /*+STREAMTABLE(tablealias)*/ hint, which you should
use for very large tables (or make sure the large table is the rightmost
table in the join clause).
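
For a regular (reduce-side) join, the two options could look roughly
like this. The table names big_table and small_table and the columns
payload and attr are hypothetical stand-ins for your 1.5TB and 350MB
datasets:

```sql
-- Option 1: name the table to stream explicitly with the hint.
-- big_table (alias b) is streamed; small_table is buffered.
SELECT /*+ STREAMTABLE(b) */ b.id, b.payload, s.attr
FROM big_table b
JOIN small_table s ON (b.id = s.id);

-- Option 2: no hint. Put the large table last in the join clause,
-- since Hive streams the rightmost table and buffers the others.
SELECT b.id, b.payload, s.attr
FROM small_table s
JOIN big_table b ON (s.id = b.id);
```

Either way, the goal is that the small table is the one buffered in
memory and the large one is streamed through.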

-- amr

On 6/24/2010 10:43 AM, Viraj Bhat wrote:
>
> Hi all,
>
>  I am joining 2 datasets, one is around 1.5TB in size and the other is
> around 350MB in size.
>
> I wanted to do a Map Side join using "id" as the join column between
> the two tables. I read about the Mapside join in Hive.
>
> http://wiki.apache.org/hadoop/Hive/LanguageManual/Joins. Are there
> some technical specs on Mapside join on a wiki/jira?
>
> Here are some questions:
>
> 1) Do the tables need to be sorted on "id"?
>
> 2) Is there a restriction on the smaller table size?
>
> Are there other join optimizations that Hive provides which I can
> apply here?
>
> Viraj
>
Viraj Bhat 2010-06-29, 20:42