Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> map side join with group by


Copy link to this message
-
Re: map side join with group by
Hi Chen,
I think we would need some more information.

The query is referring to a table called "d" in the MAPJOIN hint but
there is not such table in the query. Moreover, Map joins only make
sense when the right table is the one being "mapped" (in other words,
being kept in memory) in case of a Left Outer Join, similarly if the
left table is the one being "mapped" in case of a Right Outer Join.
Let me know if this is not clear, I'd be happy to offer a better
explanation.

In your query, the where clause on a column called "hour", at this
point I am unsure if that's a column of table1 or table2. If it's
column on table1, that predicate would get pushed up (if you have
hive.optimize.ppd property set to true), so it could possibly be done
in 1 MR job (I am not sure if that's presently the case, you will have
to check the explain plan). If however, the where clause is on a
column in the right table (table2 in your example), it can't be pushed
up since a column of the right table can have different values before
and after the LEFT OUTER JOIN. Therefore, the where clause would need
to be applied in a separate MR job.

This is just my understanding, the full proof answer would lie in
checking out the explain plans and the Semantic Analyzer code.

And for completeness, there is a conditional task (starting Hive 0.7)
that will convert your joins automatically to map joins where
applicable. This can be enabled by enabling hive.auto.convert.join
property.

Mark

On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <[EMAIL PROTECTED]> wrote:
> I have a silly question on how Hive interpretes a simple query with both map
> side join and group by.
>
> Below query will translate into two jobs, with the 1st one as a map only job
> doing the join and storing the output in a intermediary location, and the
> 2nd one as a map-reduce job taking the output of the 1st job as input and
> doing the group by.
>
> SELECT
> /*+ MAPJOIN(d) */
> table.a, sum(table2.b)
> from table
> LEFT OUTER JOIN table2
> ON table.id = table2.id
> where hour = '2012-12-11 11'
> group by table.a
>
> Why can't this be done within a single map reduce job? As what I can see
> from the query plan is that all 2nd job mapper do is taking the 1st job's
> mapper output.
>
> --
> Chen Song
>
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB