-Re: single MR stage for join and group by
Chen Song 2013-08-02, 17:32
We are currently using cloudera's CDH4.2.0 based on Hive 0.10.0. So neither
optimizations were incorporated.
Thank you all for the valuable feedback and this is really helpful. I will
look further into the details of the JIRA.
On Fri, Aug 2, 2013 at 12:14 AM, Yin Huai <[EMAIL PROTECTED]> wrote:
> If the join is a reduce side join,
> https://issues.apache.org/jira/browse/HIVE-2206 will optimize this query
> and generate a single MR job. The optimizer introduced by HIVE-2206 is in
> trunk. Currently, it only handles the same column(s).
> If the join is a MapJoin, hive 0.11 can generate a single MR job (In this
> case, if join and group by use the same column(s) does not matter). To
> enable it, you need to ...
> set hive.auto.convert.join=true
> set hive.auto.convert.join.noconditionaltask=true;
> set hive.optimize.mapjoin.mapreduce=true;
> and also make sure hive.auto.convert.join.noconditionaltask.size is larger
> than the size of the small table.
> For hive trunk, https://issues.apache.org/jira/browse/HIVE-4827 drops the
> flag of "hive.optimize.mapjoin.mapreduce". So, in future release, you will
> not need to set hive.optimize.mapjoin.mapreduce.
> On Thu, Aug 1, 2013 at 5:32 PM, Stephen Sprague <[EMAIL PROTECTED]>wrote:
>> and what version of hive are you running your test on? i do believe -
>> not certain - that hive 0.11 includes the optimization you seek.
>> On Thu, Aug 1, 2013 at 10:19 AM, Chen Song <[EMAIL PROTECTED]>wrote:
>>> Suppose we have 2 simple tables
>>> id int
>>> value string
>>> When hive translates the following query
>>> select max(A.value), A.id from A join B on A.id = A.id group by A.id;
>>> It launches 2 stages, one for the join and one for the group by.
>>> My understanding is that if the join key set is a sub set of the group
>>> by key set, it can be achieved in the same map reduce job. If that is
>>> correct in theory, could it be a feature in hive?