Pig, mail # user - Local executes faster compared to mapreduce mode


Re: Local executes faster compared to mapreduce mode
Gianmarco De Francisci Mo... 2012-01-07, 14:38
Hi,
there is a fixed overhead for scheduling and starting the job in MR mode.
The minimum job time I have seen in my (limited) experience is around 1
minute for a piece of code that did basically nothing on a small dataset.
If your job takes 1 minute locally, it's not a good candidate for
parallelization :)

As suggested by others, try bigger numbers.
A 10x increase should already give you something more meaningful in my
opinion.
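The trade-off described above can be sketched with a toy model. The overhead and parallelism numbers below are illustrative assumptions, not measurements from this cluster:

```python
# Toy model of why small jobs favor local mode: MapReduce pays a
# fixed scheduling/startup cost before doing any useful work.
# All numbers are made-up assumptions for illustration only.

def local_time(work_seconds):
    # Local mode: a single process, negligible startup overhead.
    return work_seconds

def mapreduce_time(work_seconds, overhead_seconds=60, parallelism=4):
    # MR mode: pay a fixed overhead, then split the work across nodes.
    return overhead_seconds + work_seconds / parallelism

# A ~1 minute job: local mode wins.
small = 60
assert local_time(small) < mapreduce_time(small)

# 10x the work: the fixed overhead is amortized and MR mode wins.
big = 600
assert mapreduce_time(big) < local_time(big)
```

For what it's worth, Pig selects the execution mode on the command line with `pig -x local script.pig` versus `pig -x mapreduce script.pig`, so the same script can be timed both ways.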

Cheers,
--
Gianmarco

On Fri, Jan 6, 2012 at 10:22, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:

> FYI, local mode is ideally suited for debugging (easier since it's a
> single process). It is not suited for large datasets; handling those is
> the goal of mapreduce.
>
> It might never be apples to apples if you are comparing the two, since
> the variables differ. For large datasets you might notice the job
> choking in local mode.
>
> -Prashant
>
> On Jan 6, 2012, at 1:16 AM, Prashant Kommireddi <[EMAIL PROTECTED]>
> wrote:
>
> > I would recommend trying it with a few GBs.
> >
> > I'm curious as to why you are benchmarking local vs mapreduce?
> >
> > Thanks,
> > Prashant
> >
> > On Jan 6, 2012, at 12:46 AM, Michael Lok <[EMAIL PROTECTED]> wrote:
> >
> >> Hi Prashant,
> >>
> >> Thanks for the input.  Any idea what would be a good size to perform
> >> benchmark on?
> >>
> >>
> >> Thanks.
> >>
> >> On Fri, Jan 6, 2012 at 4:29 PM, Prashant Kommireddi <
> [EMAIL PROTECTED]> wrote:
> >>> Hi Michael,
> >>>
> >>> That does not seem large enough for benchmarking/comparison. Please try
> >>> increasing the file size to make it a fair comparison :)
> >>> It is possible that the cost of spawning multiple tasks across the
> >>> nodes is greater than the cost of running the job locally with so
> >>> little data.
> >>>
> >>> Thanks,
> >>> Prashant
> >>>
> >>> On Fri, Jan 6, 2012 at 12:10 AM, Michael Lok <[EMAIL PROTECTED]>
> wrote:
> >>>
> >>>> Hi Prashant,
> >>>>
> >>>> 1000 and 4600 records respectively :)  Hence the output from the cross
> >>>> join is 4 million records.
> >>>>
> >>>> I suppose I should increase the number of records to take advantage of
> >>>> the parallel features? :)
> >>>>
> >>>>
> >>>> Thanks.
> >>>>
> >>>> On Fri, Jan 6, 2012 at 4:04 PM, Prashant Kommireddi <
> [EMAIL PROTECTED]>
> >>>> wrote:
> >>>>> What is the file size of the 2 data sets? If the datasets are really
> >>>>> small, making it run distributed might not really give any advantage
> >>>>> over local mode.
> >>>>>
> >>>>> Also, the benefits of parallelism depend on how much data is being
> >>>>> sent to the reducers.
> >>>>>
> >>>>> -Prashant
> >>>>>
> >>>>> On Jan 5, 2012, at 11:52 PM, Michael Lok <[EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>>> Hi folks,
> >>>>>>
> >>>>>> I've a simple script which does a CROSS join (thanks to Dimitry
> >>>>>> for the tip :D) and calls a UDF to perform simple matching between
> >>>>>> 2 values from the joined result.
> >>>>>>
> >>>>>> The script was initially executed via local mode and the average
> >>>>>> execution time is around 1 minute.
> >>>>>>
> >>>>>> However, when the script is executed via mapreduce mode, it averages
> >>>>>> 2+ minutes.  The cluster I've set up consists of 4 datanodes.
> >>>>>>
> >>>>>> I've tried setting "default_parallel" to 5 and 10, but it doesn't
> >>>>>> affect the performance.
> >>>>>>
> >>>>>> Is there anything I should look at?  BTW, the data size is pretty
> >>>>>> small; around 4 million records generated from the CROSS operation.
> >>>>>>
> >>>>>> Here's the script which I'm referring to:
> >>>>>>
> >>>>>> set debug 'on';
> >>>>>> set job.name 'vacancy cross';
> >>>>>> set default_parallel 5;
> >>>>>>
> >>>>>> register pig/*.jar;
> >>>>>>
> >>>>>> define DIST com.pig.udf.Distance();
> >>>>>>
> >>>>>> js = load 'jobseeker.csv' using PigStorage(',') as (ic:chararray,
> >>>>>> jsstate:chararray);
> >>>>>>
> >>>>>> vac = load 'vacancy.csv' using PigStorage(',') as (id:chararray,
> >>>>>> vacstate:chararray);
> >>>>>>
> >>>>>> cx = cross js, vac;
> >>>>>>
> >>>>>> d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate,