Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> map side join with group by

Copy link to this message
Re: map side join with group by
Yeah, my abridged version of query might be a little broken but my point is
that when a query has a map join and group by, even in its simplified
incarnation, it will launch two jobs. I was just wondering why map join and
group by cannot be accomplished in one MR job.


On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <[EMAIL PROTECTED]>wrote:

> I think Chen wanted to know why this is two phased query if I understood
> it correctly
> When you run a mapside join .. it just performs the join query .. after
> that to execute the group by part it launches the second job.
> I may be wrong but this is how I saw it whenever I executed group by
> queries
> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <[EMAIL PROTECTED]>wrote:
>> Hi Chen,
>> I think we would need some more information.
>> The query is referring to a table called "d" in the MAPJOIN hint but
>> there is not such table in the query. Moreover, Map joins only make
>> sense when the right table is the one being "mapped" (in other words,
>> being kept in memory) in case of a Left Outer Join, similarly if the
>> left table is the one being "mapped" in case of a Right Outer Join.
>> Let me know if this is not clear, I'd be happy to offer a better
>> explanation.
>> In your query, the where clause on a column called "hour", at this
>> point I am unsure if that's a column of table1 or table2. If it's
>> column on table1, that predicate would get pushed up (if you have
>> hive.optimize.ppd property set to true), so it could possibly be done
>> in 1 MR job (I am not sure if that's presently the case, you will have
>> to check the explain plan). If however, the where clause is on a
>> column in the right table (table2 in your example), it can't be pushed
>> up since a column of the right table can have different values before
>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>> to be applied in a separate MR job.
>> This is just my understanding, the full proof answer would lie in
>> checking out the explain plans and the Semantic Analyzer code.
>> And for completeness, there is a conditional task (starting Hive 0.7)
>> that will convert your joins automatically to map joins where
>> applicable. This can be enabled by enabling hive.auto.convert.join
>> property.
>> Mark
>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <[EMAIL PROTECTED]>
>> wrote:
>> > I have a silly question on how Hive interpretes a simple query with
>> both map
>> > side join and group by.
>> >
>> > Below query will translate into two jobs, with the 1st one as a map
>> only job
>> > doing the join and storing the output in a intermediary location, and
>> the
>> > 2nd one as a map-reduce job taking the output of the 1st job as input
>> and
>> > doing the group by.
>> >
>> > /*+ MAPJOIN(d) */
>> > table.a, sum(table2.b)
>> > from table
>> > LEFT OUTER JOIN table2
>> > ON table.id = table2.id
>> > where hour = '2012-12-11 11'
>> > group by table.a
>> >
>> > Why can't this be done within a single map reduce job? As what I can see
>> > from the query plan is that all 2nd job mapper do is taking the 1st
>> job's
>> > mapper output.
>> >
>> > --
>> > Chen Song
>> >
>> >
> --
> Nitin Pawar

Chen Song