Hive >> mail # user >> single MR stage for join and group by


Re: single MR stage for join and group by
If the join is a reduce-side join,
https://issues.apache.org/jira/browse/HIVE-2206 will optimize this query
and generate a single MR job. The optimizer introduced by HIVE-2206 is in
trunk. Currently, it only handles the case where the join and the group by
use the same column(s).

If the join is a MapJoin, Hive 0.11 can generate a single MR job (in this
case, it does not matter whether the join and the group by use the same
column(s)). To enable it, you need to ...
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.optimize.mapjoin.mapreduce=true;
and also make sure hive.auto.convert.join.noconditionaltask.size is larger
than the size of the small table.
For Hive trunk, https://issues.apache.org/jira/browse/HIVE-4827 drops the
"hive.optimize.mapjoin.mapreduce" flag. So, in a future release, you will
not need to set hive.optimize.mapjoin.mapreduce.
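
Put together, a session for the MapJoin case might look like the sketch below. It is only an illustration for Hive 0.11: the size value is an assumed example (the threshold is, as far as I know, expressed in bytes), and the join condition is assumed to be A.id = B.id, which is what the original query presumably intended. EXPLAIN is used so the number of stages in the plan can be checked before running the query.

```sql
-- Illustrative settings for Hive 0.11; adjust to your environment.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.optimize.mapjoin.mapreduce=true;
-- Assumed example value; it must be larger than the size of the small table (B here).
set hive.auto.convert.join.noconditionaltask.size=10000000;

-- Print the plan and count the MapReduce stages it contains.
EXPLAIN
SELECT max(A.value), A.id
FROM A JOIN B ON A.id = B.id
GROUP BY A.id;
```

If the plan still shows two MapReduce stages, the small table is probably larger than the configured threshold.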

Thanks,

Yin
On Thu, Aug 1, 2013 at 5:32 PM, Stephen Sprague <[EMAIL PROTECTED]> wrote:

> And what version of Hive are you running your test on? I believe, though
> I am not certain, that Hive 0.11 includes the optimization you seek.
>
>
> On Thu, Aug 1, 2013 at 10:19 AM, Chen Song <[EMAIL PROTECTED]> wrote:
>
>> Suppose we have 2 simple tables
>>
>> A
>> id int
>> value string
>>
>> B
>> id
>>
>> When hive translates the following query
>>
>> select max(A.value), A.id from A join B on A.id = B.id group by A.id;
>>
>> It launches 2 stages, one for the join and one for the group by.
>>
>> My understanding is that if the join key set is a subset of the group by
>> key set, both can be done in the same MapReduce job. If that is correct
>> in theory, could it be a feature in Hive?
>>
>> Chen
>>
>>
>