Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> map side join with group by


Copy link to this message
-
Re: map side join with group by
chen in mapside join .. there are no reducers .. its MAP ONLY job
On Thu, Dec 13, 2012 at 11:54 PM, Chen Song <[EMAIL PROTECTED]> wrote:

> Understood that fact that it is impossible in the same MR job if both join
> and group by are gonna happen in the reduce phase (because the join keys
> and group by keys are different). But for map side join, the joins would be
> complete by the end of the map phase, and outputs should be ready to be
> distributed to reducers based on group by keys.
>
> Chen
>
>
> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar <[EMAIL PROTECTED]>wrote:
>
>> Thats because for the first job the join keys are different and second
>> job group by keys are different, you just cant assume join keys and group
>> keys will be same so they are two different jobs
>>
>>
>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song <[EMAIL PROTECTED]>wrote:
>>
>>> Yeah, my abridged version of query might be a little broken but my point
>>> is that when a query has a map join and group by, even in its simplified
>>> incarnation, it will launch two jobs. I was just wondering why map join and
>>> group by cannot be accomplished in one MR job.
>>>
>>> Best,
>>> Chen
>>>
>>>
>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <[EMAIL PROTECTED]>wrote:
>>>
>>>> I think Chen wanted to know why this is two phased query if I
>>>> understood it correctly
>>>>
>>>> When you run a mapside join .. it just performs the join query .. after
>>>> that to execute the group by part it launches the second job.
>>>> I may be wrong but this is how I saw it whenever I executed group by
>>>> queries
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>>> [EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi Chen,
>>>>> I think we would need some more information.
>>>>>
>>>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>>>> there is not such table in the query. Moreover, Map joins only make
>>>>> sense when the right table is the one being "mapped" (in other words,
>>>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>>>> left table is the one being "mapped" in case of a Right Outer Join.
>>>>> Let me know if this is not clear, I'd be happy to offer a better
>>>>> explanation.
>>>>>
>>>>> In your query, the where clause on a column called "hour", at this
>>>>> point I am unsure if that's a column of table1 or table2. If it's
>>>>> column on table1, that predicate would get pushed up (if you have
>>>>> hive.optimize.ppd property set to true), so it could possibly be done
>>>>> in 1 MR job (I am not sure if that's presently the case, you will have
>>>>> to check the explain plan). If however, the where clause is on a
>>>>> column in the right table (table2 in your example), it can't be pushed
>>>>> up since a column of the right table can have different values before
>>>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>>>>> to be applied in a separate MR job.
>>>>>
>>>>> This is just my understanding, the full proof answer would lie in
>>>>> checking out the explain plans and the Semantic Analyzer code.
>>>>>
>>>>> And for completeness, there is a conditional task (starting Hive 0.7)
>>>>> that will convert your joins automatically to map joins where
>>>>> applicable. This can be enabled by enabling hive.auto.convert.join
>>>>> property.
>>>>>
>>>>> Mark
>>>>>
>>>>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>> > I have a silly question on how Hive interpretes a simple query with
>>>>> both map
>>>>> > side join and group by.
>>>>> >
>>>>> > Below query will translate into two jobs, with the 1st one as a map
>>>>> only job
>>>>> > doing the join and storing the output in a intermediary location,
>>>>> and the
>>>>> > 2nd one as a map-reduce job taking the output of the 1st job as
>>>>> input and
>>>>> > doing the group by.
>>>>> >
>>>>> > SELECT
>>>>> > /*+ MAPJOIN(d) */
>>>>> > table.a, sum(table2.b)
>>>>> > from table
Nitin Pawar