Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - An explanation of LEFT OUTER JOIN and NULL values


Copy link to this message
-
An explanation of LEFT OUTER JOIN and NULL values
David Morel 2013-01-24, 17:03
Hi!

After hitting the "curse of the last reducer" many times on LEFT OUTER
JOIN queries, and trying to think about it, I came to the conclusion
there's something I am missing regarding how keys are handled in mapred
jobs.

The problem shows when I have table A containing billions of rows with
distinctive keys, that I need to join to table B that has a much lower
number of rows.

I need to keep all the A rows, populated with NULL values from the B
side, so that's what a LEFT OUTER is for.

Now, when transforming that into a mapred job, my -naive- understanding
would be that for every key on the A table, a missing key on the B table
would be generated with a NULL value. If that were the case, I fail to
understand why all NULL valued B keys would end up on the same reducer,
since the key defines which reducer is used, not the value.

So, obviously, this is not how it works.

So my question is: how is this construct handled?

Thanks a lot!

D.Morel