If the join is a reduce side join,
https://issues.apache.org/jira/browse/HIVE-2206 will optimize this query
and generate a single MR job. The optimizer introduced by HIVE-2206 is in
trunk. Currently, it only handles the same column(s).
If the join is a MapJoin, hive 0.11 can generate a single MR job (In this
case, if join and group by use the same column(s) does not matter). To
enable it, you need to ...
and also make sure hive.auto.convert.join.noconditionaltask.size is larger
than the size of the small table.
For hive trunk, https://issues.apache.org/jira/browse/HIVE-4827 drops the
flag of "hive.optimize.mapjoin.mapreduce". So, in future release, you will
not need to set hive.optimize.mapjoin.mapreduce.
On Thu, Aug 1, 2013 at 5:32 PM, Stephen Sprague <[EMAIL PROTECTED]> wrote:
> and what version of hive are you running your test on? i do believe - not
> certain - that hive 0.11 includes the optimization you seek.
> On Thu, Aug 1, 2013 at 10:19 AM, Chen Song <[EMAIL PROTECTED]> wrote:
>> Suppose we have 2 simple tables
>> id int
>> value string
>> When hive translates the following query
>> select max(A.value), A.id from A join B on A.id = A.id group by A.id;
>> It launches 2 stages, one for the join and one for the group by.
>> My understanding is that if the join key set is a sub set of the group by
>> key set, it can be achieved in the same map reduce job. If that is correct
>> in theory, could it be a feature in hive?