-
How many map-reduce jobs are needed in a Hive query
Richard 2013-01-23, 03:45
I am wondering how to determine the number of map-reduce jobs for a Hive query. For example, take the following query:

    select
      sum(c1),
      sum(c2),
      k1
    from
    (
      select transform(*) using 'mymapper' as c1, c2, k1
      from t1
    ) a
    group by k1;

When I run this query, it takes two map-reduce jobs, but I expect it to take only one: in the map stage, run 'mymapper' as the mapper, then shuffle the mapper output by k1 and perform the sum in the reducer.

So why does Hive take two map-reduce jobs?
-
Re: How many map-reduce jobs are needed in a Hive query
Nitin Pawar 2013-01-23, 04:07
You can run explain extended (your query) to get more details.

-- Nitin Pawar
-
Re: How many map-reduce jobs are needed in a Hive query
Richard 2013-01-23, 05:54
Thanks. I used the explain command and got the plan, but I am still confused. Below is the description of the two map-reduce stages. It seems that in Stage-1 the aggregation has already been done, so why does Stage-2 aggregate again?

=========================
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:t1
          TableScan
            alias: t1
            Select Operator
              expressions:
                    expr: f
                    type: string
              outputColumnNames: _col0
              Transform Operator
                command: mymapper
                output info:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                Select Operator
                  expressions:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: string
                        expr: _col2
                        type: string
                  outputColumnNames: _col0, _col1, _col2
                  Group By Operator
                    aggregations:
                          expr: sum(_col0)
                          expr: sum(_col1)
                    bucketGroup: false
                    keys:
                          expr: _col2
                          type: string
                    mode: hash
                    outputColumnNames: _col0, _col1, _col2
                    Reduce Output Operator
                      key expressions:
                            expr: _col0
                            type: string
                      sort order: +
                      Map-reduce partition columns:
                            expr: rand()
                            type: double
                      tag: -1
                      value expressions:
                            expr: _col1
                            type: double
                            expr: _col2
                            type: double
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE._col0)
                expr: sum(VALUE._col1)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: partials
          outputColumnNames: _col0, _col1, _col2
          File Output Operator
            compressed: false
            GlobalTableId: 0
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://hdpnn:9000/mydata/hive/hive_2013-01-23_13-46-09_628_5487089660360786955/10002
            Reduce Output Operator
              key expressions:
                    expr: _col0
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: _col0
                    type: string
              tag: -1
              value expressions:
                    expr: _col1
                    type: double
                    expr: _col2
                    type: double
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE._col0)
                expr: sum(VALUE._col1)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: final
          outputColumnNames: _col0, _col1, _col2
          Select Operator
            expressions:
                  expr: _col1
                  type: double
                  expr: _col2
                  type: double
                  expr: _col0
                  type: string
            outputColumnNames: _col0, _col1, _col2
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
===========================
-
Re: How many map-reduce jobs are needed in a Hive query
Nitin Pawar 2013-01-23, 06:05
If you look closely, the first phase executes your transform and the second does your sum operation.

Nitin Pawar
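For readers following the plan in this thread, here is a rough Python sketch of what the two stages appear to do. It is not Hive code. It only mimics the data flow the plan describes: Stage-1 partitions the mapper output by rand() and produces partial sums (mode: partials), and Stage-2 partitions those partials by the group key and produces the final sums (mode: final). The function names, the sample rows, and the reducer count are all made up for illustration. As far as I know, a plan of this shape is what Hive generates when hive.groupby.skewindata is set to true, but the thread does not say whether that setting was enabled here.

```python
import random
from collections import defaultdict


def sum_by_key(rows):
    """Sum c1 and c2 per key k1 over one batch of (c1, c2, k1) rows."""
    acc = defaultdict(lambda: [0.0, 0.0])
    for c1, c2, k1 in rows:
        acc[k1][0] += c1
        acc[k1][1] += c2
    return [(sums[0], sums[1], key) for key, sums in acc.items()]


def shuffle(rows, num_reducers, partition):
    """Distribute rows to reducers according to a partition function."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[partition(row) % num_reducers].append(row)
    return buckets


def one_job(rows, num_reducers=3):
    """What the original poster expected: shuffle by k1, sum once."""
    buckets = shuffle(rows, num_reducers, lambda r: hash(r[2]))
    out = [t for bucket in buckets for t in sum_by_key(bucket)]
    return sorted(out, key=lambda t: t[2])


def two_jobs(rows, num_reducers=3, seed=0):
    """What the plan shows: random shuffle + partials, then key shuffle + final."""
    rng = random.Random(seed)
    # Stage-1: partition by a random number, so rows for one key can land on
    # several reducers; each reducer can only emit partial sums.
    buckets = shuffle(rows, num_reducers, lambda r: rng.randrange(num_reducers))
    partials = [t for bucket in buckets for t in sum_by_key(bucket)]
    # Stage-2: partition the partial sums by key and combine them.
    buckets = shuffle(partials, num_reducers, lambda r: hash(r[2]))
    out = [t for bucket in buckets for t in sum_by_key(bucket)]
    return sorted(out, key=lambda t: t[2])


# Hypothetical output of 'mymapper': (c1, c2, k1)
rows = [
    (1.0, 10.0, "a"),
    (2.0, 20.0, "b"),
    (3.0, 30.0, "a"),
    (4.0, 40.0, "a"),
    (5.0, 50.0, "b"),
]

print(one_job(rows))
print(two_jobs(rows))
```

Both functions return the same totals. The two-job variant spreads a heavily repeated key across several reducers in the first stage, which is the usual reason for paying for a second job.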