Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - An explanation of LEFT OUTER JOIN and NULL values


Copy link to this message
-
Re: An explanation of LEFT OUTER JOIN and NULL values
bejoy_ks@... 2013-01-24, 17:16
Hi David

An explain extended would give you the exact pointer.

From my understanding, this is how it could work.

 You have two tables then two different map reduce job would be processing those. Based on the join keys, combination of corresponding columns would be chosen as key from mapper1 and mapper2. So if the combination of columns having the same value those records from two set of mappers would go into the same reducer.

On the reducer if there is a corresponding value for a key from table 1 to  table 2/mapper 2 that value would be populated. If no val for mapper 2 then those columns from table 2 are made null.

If there is a key-value just from table 2/mapper 2 and no  corresponding value from mapper 1. That value is just discarded.
Regards
Bejoy KS

Sent from remote device, Please excuse typos

-----Original Message-----
From: "David Morel" <[EMAIL PROTECTED]>
Date: Thu, 24 Jan 2013 18:03:40
To: [EMAIL PROTECTED]<[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: An explanation of LEFT OUTER JOIN and NULL values

Hi!

After hitting the "curse of the last reducer" many times on LEFT OUTER
JOIN queries, and trying to think about it, I came to the conclusion
there's something I am missing regarding how keys are handled in mapred
jobs.

The problem shows when I have table A containing billions of rows with
distinctive keys, that I need to join to table B that has a much lower
number of rows.

I need to keep all the A rows, populated with NULL values from the B
side, so that's what a LEFT OUTER is for.

Now, when transforming that into a mapred job, my -naive- understanding
would be that for every key on the A table, a missing key on the B table
would be generated with a NULL value. If that were the case, I fail to
understand why all NULL valued B keys would end up on the same reducer,
since the key defines which reducer is used, not the value.

So, obviously, this is not how it works.

So my question is: how is this construct handled?

Thanks a lot!

D.Morel