-
Hive Query Unable to distribute load evenly in reducers
Saurabh Mishra 2012-10-15, 12:09
Hi,
I am running some Hive queries that join tables containing up to 30 million records each. Since the load on the reducers is very significant in these cases, I set the following parameters before executing the queries:

set mapred.reduce.tasks=100;
set hive.exec.reducers.bytes.per.reducer=500000000;
set hive.optimize.cp=true;

The job now spawns 160 reducers, but despite the high number, most of the load remains on 1 or 2 of them. In the final statistics, 158 reducers completed within 2-3 minutes of starting, while 2 reducers took 2 hours to run.
Is there any way to overcome this disparity in load distribution? Any help in this regard will be highly appreciated.

Sincerely,
Saurabh Mishra
-
Re: Hive Query Unable to distribute load evenly in reducers
MiaoMiao 2012-10-15, 13:10
And your queries were?
-
RE: Hive Query Unable to distribute load evenly in reducers
Saurabh Mishra 2012-10-15, 14:23
The queries are simple joins, something along the lines of:

select a, b, c, count(D) from tableA join tableB on a.x=b.y join.... group by a, b,c;
-
Re: Hive Query Unable to distribute load evenly in reducers
Philip Tromans 2012-10-15, 15:29
Is your data heavily skewed towards certain values of a.x etc?
-
RE: Hive Query Unable to distribute load evenly in reducers
Saurabh Mishra 2012-10-15, 20:45
No, there is apparently no heavy skew. Another statistic I wanted to point out: the approximate table sizes in this 4 table join query are as follows:

tableA: 170 million (actual number; I am also exploding these records, so the number could be much higher)
tableB: 15
tableC: 45
tableD: 45
tableE: 45
tableF: 14000

Also, I cannot put any filter condition on tableA; the situation does not permit it. :(
Kindly suggest an alternative solution or some Hive configuration to distribute the load better across the reducers.
-
Re: Hive Query Unable to distribute load evenly in reducers
Navis류승우 2012-10-16, 05:17
How about using MapJoin?
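For reference, an explicit map join can be requested with a query hint. The following is only a sketch: the aliases a and b and the join columns x and y follow the query quoted earlier in the thread, while col1 and col2 are hypothetical column names standing in for the real select list.

```sql
-- Sketch only: col1 and col2 are hypothetical columns; tableA/tableB and
-- the join columns x/y come from the query posted earlier in the thread.
-- The hint asks Hive to load the small table (alias b) into memory on
-- each mapper, so the join itself needs no shuffle on the join key.
SELECT /*+ MAPJOIN(b) */
       a.col1,
       b.col2,
       COUNT(*) AS cnt
FROM tableA a
JOIN tableB b ON (a.x = b.y)
GROUP BY a.col1, b.col2;
```

Even with the join done map-side, the GROUP BY still needs a reduce phase, so an uneven distribution of the grouping keys can still overload a few reducers. Newer Hive releases may ignore the hint unless hive.ignore.mapjoin.hint is set to false.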
-
RE: Hive Query Unable to distribute load evenly in reducers
Saurabh Mishra 2012-10-16, 05:53
If by using MapJoin you mean setting

set hive.auto.convert.join=true;

then I am already using this configuration, but to no avail... :(
-
RE: Hive Query Unable to distribute load evenly in reducers
Saurabh Mishra 2012-10-18, 08:56
Any views on the problem?
-
Re: Hive Query Unable to distribute load evenly in reducers
Philip Tromans 2012-10-18, 09:03
I'm really not convinced that there's no skew in your data. Look at the counters from the Hadoop TaskTracker pages, and thoroughly check that the numbers of reducer input records / groups and output records are all similar.

Phil.
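Besides the task counters, skew on a join key can also be checked directly in Hive by counting rows per key value. This is a minimal sketch, assuming x is the join column of tableA as in the query quoted earlier in the thread; the alias row_count and the limit of 20 are arbitrary choices for illustration.

```sql
-- Sketch only: counts how many rows of tableA share each value of the
-- join column x and lists the heaviest keys first. A few keys with
-- counts far above the rest would point to skew on the join key.
SELECT x,
       COUNT(*) AS row_count
FROM tableA
GROUP BY x
ORDER BY row_count DESC
LIMIT 20;
```

The same check can be repeated for the GROUP BY columns of the real query. If the counts turn out to be lopsided, the Hive settings hive.optimize.skewjoin and hive.groupby.skewindata are the usual options to try; whether they help in this particular case is not something the thread establishes.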