-
Local executes faster compared to mapreduce mode
Michael Lok 2012-01-06, 07:51
Hi folks,
I've a simple script which does CROSS join (thanks to Dimitry for the tip :D) and calls a UDF to perform simple matching between 2 values from the joined result.
The script was initially executed via local mode and the average execution time is around 1 minute.
However, when the script is executed via mapreduce mode, it averages 2+ minutes. The cluster I've set up consists of 4 datanodes.
I've tried setting the "default_parallel" setting to 5 and 10, but it doesn't affect the performance.
Is there anything I should look at? BTW, the data size is pretty small; around 4 million records generated from the CROSS operation.
Here's the script which I'm referring to:
set debug 'on';
set job.name 'vacancy cross';
set default_parallel 5;
register pig/*.jar;
define DIST com.pig.udf.Distance();
js = load 'jobseeker.csv' using PigStorage(',') as (ic:chararray, jsstate:chararray);
vac = load 'vacancy.csv' using PigStorage(',') as (id:chararray, vacstate:chararray);
cx = cross js, vac;
d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate, vacstate);
store d into 'out' using PigStorage(',');
Thanks!
-
Re: Local executes faster compared to mapreduce mode
Prashant Kommireddi 2012-01-06, 08:04
What are the file sizes of the 2 data sets? If the datasets are really small, running the job distributed might not give any advantage over local mode.
Also, the benefits of parallelism depend on how much data is being sent to the reducers.
-Prashant
-
Re: Local executes faster compared to mapreduce mode
Michael Lok 2012-01-06, 08:10
Hi Prashant,
1000 and 4600 records respectively :) Hence the output from the cross join is about 4.6 million records.
I suppose I should increase the number of records to take advantage of the parallel features? :) Thanks.
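As a quick check of the numbers in this message, here is a minimal Python sketch of the arithmetic. The record counts come from the thread; the even split across reducers is an assumption made only for illustration, since the thread does not say how Pig actually distributes the crossed records.

```python
# Record counts quoted in the thread.
JOBSEEKERS = 1000
VACANCIES = 4600


def cross_size(left_rows, right_rows):
    """CROSS pairs every row on the left with every row on the right."""
    return left_rows * right_rows


def rows_per_reducer(total_rows, reducers):
    """Share per reducer, assuming a perfectly even split (an assumption)."""
    return total_rows // reducers


total = cross_size(JOBSEEKERS, VACANCIES)
print("crossed records:", total)

# The two default_parallel values that were tried in the original post.
for reducers in (5, 10):
    print(reducers, "reducers ->", rows_per_reducer(total, reducers), "records each")
```

The product is 4.6 million records, a little more than the "around 4 million" quoted earlier. Under the even-split assumption, each reducer still handles well under a million small records at either setting.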
-
Re: Local executes faster compared to mapreduce mode
Prashant Kommireddi 2012-01-06, 08:29
Hi Michael,
That does not seem large enough for benchmarking/comparison. Please try increasing the file size to make it a fair comparison :) It is possible that the cost of spawning multiple tasks across the nodes is higher than the cost of running the job locally on so little data.
Thanks, Prashant
-
Re: Local executes faster compared to mapreduce mode
Michael Lok 2012-01-06, 08:44
Hi Prashant,
Thanks for the input. Any idea what would be a good size to perform benchmark on? Thanks.
-
Re: Local executes faster compared to mapreduce mode
Prashant Kommireddi 2012-01-06, 09:16
I would recommend trying it with a few GBs.
I'm curious as to why you are benchmarking local vs mapreduce?
Thanks, Prashant
-
Re: Local executes faster compared to mapreduce mode
Prashant Kommireddi 2012-01-06, 09:22
FYI, local mode is ideally suited for debugging (easier since it's a single process). It is not suited for large datasets; that is what mapreduce mode is for.
It might never be an apples-to-apples comparison between the 2, since the variables differ. For large datasets you might notice the job choking in local mode.
-Prashant
-
Re: Local executes faster compared to mapreduce mode
Gianmarco De Francisci Mo... 2012-01-07, 14:38
Hi, there is a fixed overhead for scheduling and starting the job in MR mode. The minimum job time I have seen in my (limited) experience is around 1 minute for a piece of code that did basically nothing on a small dataset. If your job takes 1 minute locally, it's not a good candidate for parallelization :)
As suggested by others, try bigger numbers. A 10x increase should already give you something more meaningful in my opinion.
Cheers, -- Gianmarco
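The fixed-overhead argument above can be sketched as a toy cost model in Python. Everything here is an assumption for illustration: the 60-second overhead and 60-second local baseline are round figures taken loosely from the thread, and the linear scaling and perfect split across 4 nodes are idealizations, not measured behavior.

```python
def local_seconds(scale, base=60.0):
    """Local mode: assume run time grows linearly with the data size."""
    return base * scale


def mapreduce_seconds(scale, nodes=4, overhead=60.0, base=60.0):
    """MR mode: a fixed start-up overhead plus the work split evenly over nodes."""
    return overhead + base * scale / nodes


# scale = 1 is the current workload; 10 and 100 are hypothetical larger runs.
for scale in (1, 10, 100):
    print(scale, local_seconds(scale), mapreduce_seconds(scale))
```

In this idealized model the overhead dominates at the current size and matters less and less as the data grows, which is the point of trying bigger numbers. The model is optimistic: the 2+ minutes actually observed in mapreduce mode is already higher than it predicts for the current workload.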
-
Re: Local executes faster compared to mapreduce mode
Michael Lok 2012-01-07, 16:17
Hi,
Thanks for the additional info. I will try a substantial increase in the number of records and see how that performs.
The benchmark is to see what kind of performance gain I can get from mapreduce mode, depending on the number of nodes. Thanks.
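One way to express such a benchmark result is as speedup and per-node efficiency. The Python sketch below uses the approximate timings from the first message (about 1 minute local, 2+ minutes on 4 datanodes, rounded here to 60 and 120 seconds as an assumption); the function names are hypothetical helpers, not part of Pig or Hadoop.

```python
def speedup(local_secs, cluster_secs):
    """How many times faster the cluster run is than the local run."""
    return local_secs / cluster_secs


def efficiency(local_secs, cluster_secs, nodes):
    """Speedup divided by node count; 1.0 would mean perfect scaling."""
    return speedup(local_secs, cluster_secs) / nodes


# Rounded timings from the thread: ~60 s local, ~120 s on 4 datanodes.
print("speedup:", speedup(60, 120))
print("efficiency:", efficiency(60, 120, 4))
```

A speedup below 1 means the cluster run is slower than the local one, which matches what was reported for the small dataset. Repeating the measurement with larger inputs and different node counts would show whether the figure climbs above 1.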