Hi Amr,
Thanks for your help. I will try the STREAMTABLE hint if one of the
datasets exceeds 1GB.
Viraj
________________________________
From: Amr Awadallah [mailto:[EMAIL PROTECTED]]
Sent: Saturday, June 26, 2010 12:58 AM
To: [EMAIL PROTECTED]
Subject: Re: MapSide join in Hive
Viraj,
1. No
2. Yes, the smaller table needs to fit in JVM memory (a small table of
more than 1GB is typically too large).
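
As a rough sketch of point 2, a map-side join can be requested with the
MAPJOIN hint, which names the table to load into memory. The table names
big_table and small_table and the column val are hypothetical; only the
"id" join column comes from the question below.

```sql
-- Hypothetical tables: big_table (~1.5TB) and small_table (~350MB).
-- The MAPJOIN hint names the alias to hold in memory on each mapper,
-- so that table must fit in the JVM heap.
SELECT /*+ MAPJOIN(s) */ b.id, b.val, s.val
FROM big_table b
JOIN small_table s ON (b.id = s.id);
```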
See slide 7 and after in this preso for different join strategies that
can help in case the tables are bucketed and sorted.
http://www.slideshare.net/zshao/hive-user-meeting-march-2010-hive-team
There is also the /*+STREAMTABLE(tablealias)*/ hint, which you should
use for very large tables (or make sure it is the rightmost table in the
join clause).
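
For illustration, the hint might be used like this in a regular
(reduce-side) join. The table names and the val column are placeholders,
not taken from the original question; the hinted alias is streamed through
the reducers while rows from the other table are buffered in memory.

```sql
-- Hypothetical tables: big_table is the very large one, so it is streamed.
-- Without the hint, Hive streams the rightmost table in the join clause.
SELECT /*+ STREAMTABLE(b) */ b.id, b.val, s.val
FROM big_table b
JOIN small_table s ON (b.id = s.id);
```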
-- amr
On 6/24/2010 10:43 AM, Viraj Bhat wrote:
Hi all,
I am joining 2 datasets, one is around 1.5TB in size and the other is
around 350MB in size.
I want to do a map-side join using "id" as the join column between the
two tables. I read about the map-side join in Hive here:
http://wiki.apache.org/hadoop/Hive/LanguageManual/Joins
Are there any technical specs on the map-side join on a wiki or in JIRA?
Here are some questions:
1. Do the tables need to be sorted on "id"?
2. Is there a restriction on the smaller table size?
3. Are there other join optimizations that Hive provides which I can apply
here?
Viraj