Chen Song 2012-12-12, 23:32
Mark Grover 2012-12-13, 01:41
Nitin Pawar 2012-12-13, 05:30
Chen Song 2012-12-13, 14:56
-Re: map side join with group by
Nitin Pawar 2012-12-13, 16:04
Thats because for the first job the join keys are different and second job
group by keys are different, you just cant assume join keys and group keys
will be same so they are two different jobs
On Thu, Dec 13, 2012 at 8:26 PM, Chen Song <[EMAIL PROTECTED]> wrote:
> Yeah, my abridged version of query might be a little broken but my point
> is that when a query has a map join and group by, even in its simplified
> incarnation, it will launch two jobs. I was just wondering why map join and
> group by cannot be accomplished in one MR job.
> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <[EMAIL PROTECTED]>wrote:
>> I think Chen wanted to know why this is two phased query if I understood
>> it correctly
>> When you run a mapside join .. it just performs the join query .. after
>> that to execute the group by part it launches the second job.
>> I may be wrong but this is how I saw it whenever I executed group by
>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <[EMAIL PROTECTED]
>> > wrote:
>>> Hi Chen,
>>> I think we would need some more information.
>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>> there is not such table in the query. Moreover, Map joins only make
>>> sense when the right table is the one being "mapped" (in other words,
>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>> left table is the one being "mapped" in case of a Right Outer Join.
>>> Let me know if this is not clear, I'd be happy to offer a better
>>> In your query, the where clause on a column called "hour", at this
>>> point I am unsure if that's a column of table1 or table2. If it's
>>> column on table1, that predicate would get pushed up (if you have
>>> hive.optimize.ppd property set to true), so it could possibly be done
>>> in 1 MR job (I am not sure if that's presently the case, you will have
>>> to check the explain plan). If however, the where clause is on a
>>> column in the right table (table2 in your example), it can't be pushed
>>> up since a column of the right table can have different values before
>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>>> to be applied in a separate MR job.
>>> This is just my understanding, the full proof answer would lie in
>>> checking out the explain plans and the Semantic Analyzer code.
>>> And for completeness, there is a conditional task (starting Hive 0.7)
>>> that will convert your joins automatically to map joins where
>>> applicable. This can be enabled by enabling hive.auto.convert.join
>>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <[EMAIL PROTECTED]>
>>> > I have a silly question on how Hive interpretes a simple query with
>>> both map
>>> > side join and group by.
>>> > Below query will translate into two jobs, with the 1st one as a map
>>> only job
>>> > doing the join and storing the output in a intermediary location, and
>>> > 2nd one as a map-reduce job taking the output of the 1st job as input
>>> > doing the group by.
>>> > SELECT
>>> > /*+ MAPJOIN(d) */
>>> > table.a, sum(table2.b)
>>> > from table
>>> > LEFT OUTER JOIN table2
>>> > ON table.id = table2.id
>>> > where hour = '2012-12-11 11'
>>> > group by table.a
>>> > Why can't this be done within a single map reduce job? As what I can
>>> > from the query plan is that all 2nd job mapper do is taking the 1st
>>> > mapper output.
>>> > --
>>> > Chen Song
>> Nitin Pawar
> Chen Song
Chen Song 2012-12-13, 18:24
Nitin Pawar 2012-12-13, 18:42
Chen Song 2012-12-13, 19:12
Nitin Pawar 2012-12-13, 19:30
Chen Song 2012-12-13, 19:50