Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - An explanation of LEFT OUTER JOIN and NULL values


Copy link to this message
-
Re: An explanation of LEFT OUTER JOIN and NULL values
David Morel 2013-01-25, 06:34
On 24 Jan 2013, at 20:39, [EMAIL PROTECTED] wrote:

> Hi David,
>
> The default partitioner used in map reduce is the hash partitioner. So
> based on your keys they are send to a particular reducer.
>
> May be in your current data set, the keys that have no values in table
> are all falling in the same hash bucket and hence being processed by
> the same reducer.

Really not the case, no, so it doesn't make any sort of sense to me :\
At this point I am starting to think the only way to figure it out is to
set or add debugging at the reducer level. I hoped I could avoid it,
alas...

> If you are noticing a skew on a particular reducer, sometimes  a
> simple work around like increasing the no of reducers explicitly might
> help you get pass the hurdle.

Yes, that seemed to work in some cases, but is not very handy: it's
hard to tell my users 'try setting a different number of reducers when
your query is stuck' ;-)

> Also please ensure you have enabled skew join optimization.

Didn't have much success with this, but maybe my version of hive is a
bit old.

So to emulate a LEFT OUTER JOIN I had to do something really horrible:

JOIN (
    -- TABLE B is the joined table
SELECT key, value FROM TABLE B
UNION ALL
    -- TABLE A has all the unique ids
SELECT key, '' FROM TABLE A
) AS SUB

and then use a count -1 to get a count of non-null rows, etc. At least
it does work with no slowdowns, but I mean, yuk!

It's the best I could come up with so far, so if I could fully understand
the root cause of the problem, that would be much better. I guess I'll
dig in a bit deeper then.

Thanks a lot!

David

>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
>
> -----Original Message-----
> From: "David Morel" <[EMAIL PROTECTED]>
> Date: Thu, 24 Jan 2013 18:39:56
> To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> Reply-To: [EMAIL PROTECTED]
> Subject: Re: An explanation of LEFT OUTER JOIN and NULL values
>
> On 24 Jan 2013, at 18:16, [EMAIL PROTECTED] wrote:
>
>> Hi David
>>
>> An explain extended would give you the exact pointer.
>>
>> From my understanding, this is how it could work.
>>
>> You have two tables then two different map reduce job would be
>> processing those. Based on the join keys, combination of corresponding
>> columns would be chosen as key from mapper1 and mapper2. So if the
>> combination of columns having the same value those records from two
>> set of mappers would go into the same reducer.
>>
>> On the reducer if there is a corresponding value for a key from table
>> 1 to  table 2/mapper 2 that value would be populated. If no val for
>> mapper 2 then those columns from table 2 are made null.
>>
>> If there is a key-value just from table 2/mapper 2 and no
>> corresponding value from mapper 1. That value is just discarded.
>
> Hi Bejoy,
>
> Thanks! So schematically, something like this, right?
>
> mapper1 (bigger table):
> K1-A, V1A
> K2-A, V2A
> K3-A, V3A
>
> mapper2 (joined, smaller table):
> K1-B, V1B
>
> reducer1:
> K1-A, V1A
> K1-B, V1B
>
> returns:
> K1, V1A, V1B etc
>
> reducer2:
> K2-A, V2A
> *no* K2-B, V so: K2-B, NULL is created, same for next row.
> K3-A, V3A
>
> returns:
> K2, V2A, NULL etc
> K3, V3A, NULL etc
>
> I still don't understand why my reducer2 (and only this one, which
> apparently gets all the keys for which we don't have a row on table B)
> would become overloaded. Am I completely misunderstanding the whole
> thing?
>
> David
>
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>>
>> -----Original Message-----
>> From: "David Morel" <[EMAIL PROTECTED]>
>> Date: Thu, 24 Jan 2013 18:03:40
>> To: [EMAIL PROTECTED]<[EMAIL PROTECTED]>
>> Reply-To: [EMAIL PROTECTED]
>> Subject: An explanation of LEFT OUTER JOIN and NULL values
>>
>> Hi!
>>
>> After hitting the "curse of the last reducer" many times on LEFT OUTER
>> JOIN queries, and trying to think about it, I came to the conclusion
>> there's something I am missing regarding how keys are handled in