Hadoop, mail # user - one-to-many Map Side Join without reducer


Re: one-to-many Map Side Join without reducer
madhu phatak 2011-06-21, 11:05
I think Hive is best suited for your use case; it gives you a SQL-based
interface to Hadoop for doing this type of thing.

On Fri, Jun 10, 2011 at 2:39 AM, Shi Yu <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I have two datasets: dataset 1 has the format:
>
> MasterKey1    SubKey1    SubKey2    SubKey3
> MasterKey2    SubKey4    SubKey5    SubKey6
> ....
>
>
> dataset 2 has the format:
>
> SubKey1    Value1
> SubKey2    Value2
> ...
>
> I want to have one-to-many join based on the SubKey, and the final goal is
> to have an output like:
>
> MasterKey1    Value1    Value2    Value3
> MasterKey2    Value4    Value5    Value6
> ...
>
>
> After studying and experimenting with some example code, I understand that
> it is doable if I transform the first dataset into
>
> SubKey1    MasterKey1
> SubKey2    MasterKey1
> SubKey3    MasterKey1
> SubKey4    MasterKey2
> SubKey5    MasterKey2
> SubKey6    MasterKey2
>
> and then use an inner join with dataset 2 on SubKey. I would then probably
> need a reducer to perform a secondary sort on MasterKey to get the result.
> However, the reducer is still the bottleneck if each MasterKey has lots
> of SubKeys.
> My question is: suppose that dataset 2 contains all the SubKeys and is never
> split. Is it possible to join the keys of dataset 2 with the multiple values
> of dataset 1 on the mapper side? Any hint is highly appreciated.
>
> Shi
>
>
>
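
The map-side join asked about in the quoted message can be sketched roughly as follows. This is only an illustration in plain Python, not Hadoop code: it assumes dataset 2 is small enough to be loaded fully into memory by every mapper (in Hadoop this would typically mean shipping the file to each task, for example through the distributed cache). The function names load_lookup and map_record are hypothetical, and SubKeys with no match are simply dropped, as in an inner join.

```python
# Hypothetical sketch of a replicated (map-side) join; no reducer involved.
# Assumption: dataset 2 fits in memory on every mapper.

def load_lookup(lines):
    """Build a SubKey -> Value table from the lines of dataset 2."""
    lookup = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            lookup[parts[0]] = parts[1]
    return lookup


def map_record(line, lookup):
    """Turn one dataset 1 record into 'MasterKey<TAB>Value1<TAB>Value2...'."""
    fields = line.split()
    if not fields:
        return None  # nothing to emit for an empty line
    master_key, sub_keys = fields[0], fields[1:]
    # Inner-join semantics: SubKeys missing from dataset 2 are dropped.
    values = [lookup[k] for k in sub_keys if k in lookup]
    return "\t".join([master_key] + values)


# Sample data in the layout described in the question.
dataset2 = [
    "SubKey1    Value1", "SubKey2    Value2", "SubKey3    Value3",
    "SubKey4    Value4", "SubKey5    Value5", "SubKey6    Value6",
]
dataset1 = [
    "MasterKey1    SubKey1    SubKey2    SubKey3",
    "MasterKey2    SubKey4    SubKey5    SubKey6",
]

lookup = load_lookup(dataset2)  # done once per mapper, before any records
joined = [map_record(line, lookup) for line in dataset1]
for row in joined:
    print(row)
```

Because each dataset 1 record already carries all of its SubKeys, the mapper can emit the finished row directly, so neither the transformation into (SubKey, MasterKey) pairs nor the secondary sort is needed under this assumption. If dataset 2 is too large to hold in memory, this sketch does not apply.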