Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> single MR stage for join and group by


+
Chen Song 2013-08-01, 17:19
+
Stephen Sprague 2013-08-02, 00:32
+
Yin Huai 2013-08-02, 04:14
Copy link to this message
-
Re: single MR stage for join and group by
We are currently using cloudera's CDH4.2.0 based on Hive 0.10.0. So neither
optimizations were incorporated.

Thank you all for the valuable feedback and this is really helpful. I will
look further into the details of the JIRA.

On Fri, Aug 2, 2013 at 12:14 AM, Yin Huai <[EMAIL PROTECTED]> wrote:

> If the join is a reduce side join,
> https://issues.apache.org/jira/browse/HIVE-2206 will optimize this query
> and generate a single MR job. The optimizer introduced by HIVE-2206 is in
> trunk. Currently, it only handles the same column(s).
>
> If the join is a MapJoin, hive 0.11 can generate a single MR job (In this
> case, if join and group by use the same column(s) does not matter). To
> enable it, you need to ...
> set hive.auto.convert.join=true
> set hive.auto.convert.join.noconditionaltask=true;
> set hive.optimize.mapjoin.mapreduce=true;
> and also make sure hive.auto.convert.join.noconditionaltask.size is larger
> than the size of the small table.
> For hive trunk, https://issues.apache.org/jira/browse/HIVE-4827 drops the
> flag of "hive.optimize.mapjoin.mapreduce". So, in future release, you will
> not need to set hive.optimize.mapjoin.mapreduce.
>
> Thanks,
>
> Yin
>
>
> On Thu, Aug 1, 2013 at 5:32 PM, Stephen Sprague <[EMAIL PROTECTED]>wrote:
>
>> and what version of hive are you running your test on?  i do believe -
>> not certain - that hive 0.11 includes the optimization you seek.
>>
>>
>> On Thu, Aug 1, 2013 at 10:19 AM, Chen Song <[EMAIL PROTECTED]>wrote:
>>
>>> Suppose we have 2 simple tables
>>>
>>> A
>>> id int
>>> value string
>>>
>>> B
>>> id
>>>
>>> When hive translates the following query
>>>
>>> select max(A.value), A.id from A join B on A.id = A.id group by A.id;
>>>
>>> It launches 2 stages, one for the join and one for the group by.
>>>
>>> My understanding is that if the join key set is a sub set of the group
>>> by key set, it can be achieved in the same map reduce job. If that is
>>> correct in theory, could it be a feature in hive?
>>>
>>> Chen
>>>
>>>
>>
>
--
Chen Song
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB