Pig >> mail # dev >> Slow Group By operator


Re: Slow Group By operator
I guess you mean "combiner + mapPartAgg set to true" not "no combiner +
mapPartAgg set to true".
On Sun, Aug 25, 2013 at 10:10 AM, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:

> Hi Cheolsoo,
>
> Just ran the benchmarks: no luck.
>
> No combiner + mapPartAgg set to true is slower than without the combiner:
> real 752.85
> real 757.41
> real 749.03
>
>
>
> On 25 August 2013 17:11, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:
>
> > Hi Cheolsoo,
> >
> > Thanks - let's see, I'll give it a try now.
> >
> > Best Regards,
> > Ben
> >
> >
> > On 25 August 2013 02:27, Cheolsoo Park <[EMAIL PROTECTED]> wrote:
> >
> >> Hi Benjamin,
> >>
> >> Thanks for letting us know. That means my original assumption was wrong.
> >> The size of bags is not small. In fact, you can compute the avg size of
> >> bags as follows: total number of input records / ( reduce input groups x
> >> number of reducers ).
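[Editor's note: the formula above can be sanity-checked with a quick calculation. This is a minimal sketch; the counter values are purely hypothetical, chosen only to illustrate the arithmetic, not taken from the job discussed in this thread.]

```python
# Hypothetical Hadoop counter values (not from this thread's job):
total_input_records = 10_000_000
reduce_input_groups = 250_000   # per-reducer group count (assumed meaning)
num_reducers = 4

# Cheolsoo's formula:
#   avg bag size = total input records / (reduce input groups x reducers)
avg_bag_size = total_input_records / (reduce_input_groups * num_reducers)
print(avg_bag_size)  # 10.0
```

A bag averaging only a handful of tuples suggests the combiner would not pay for its extra disk I/O.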
> >>
> >> One more thing you can try is turning on "pig.exec.mapPartAgg". That may
> >> help mappers run faster. If this doesn't work, I'm out of ideas. :-)
> >>
> >> Thanks,
> >> Cheolsoo
> >>
> >>
> >>
> >> On Sat, Aug 24, 2013 at 3:27 AM, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:
> >>
> >> > Hi Alan, Cheolsoo,
> >> >
> >> > I re-ran the benchmarks with and without the combiner. Enabling the
> >> > combiner is faster:
> >> >
> >> > With combiner:
> >> > real 668.44
> >> > real 663.10
> >> > real 665.05
> >> >
> >> > Without combiner:
> >> > real 795.97
> >> > real 810.51
> >> > real 810.16
> >> >
> >> > Best Regards,
> >> > Ben
> >> >
> >> >
> >> > On 22 August 2013 16:33, Cheolsoo Park <[EMAIL PROTECTED]> wrote:
> >> >
> >> > > Hi Benjamin,
> >> > >
> >> > > To answer your question, how the Hadoop combiner works is that 1)
> >> mappers
> >> > > write outputs to disk and 2) combiners read them, combine and write
> >> them
> >> > > again. So you're paying extra disk I/O as well as
> >> > > serialization/deserialization.
> >> > >
> >> > > This will pay off if combiners significantly reduce the intermediate
> >> > > outputs that reducers need to fetch from mappers. But if reduction is
> >> > > not significant, it will only slow down mappers. You can identify
> >> > > whether this is really a problem by comparing the time spent by map
> >> > > and combine functions in the task logs.
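[Editor's note: the trade-off described above can be sketched with a small simulation. The key counts are made up to illustrate the two regimes, and `combine` is a stand-in for partial aggregation in general, not Hadoop's or Pig's actual implementation.]

```python
from collections import defaultdict

def combine(pairs):
    """Combiner-style partial aggregation: sum values per key within one
    mapper's output before it is shuffled to the reducers."""
    acc = defaultdict(int)
    for key, value in pairs:
        acc[key] += value
    return list(acc.items())

# Large bags: many records per key -> the combiner shrinks map output a lot.
large_bags = [("a", 1)] * 1000 + [("b", 1)] * 1000
# Small bags: mostly unique keys -> the combiner barely shrinks anything,
# so its extra disk I/O and (de)serialization is wasted work.
small_bags = [(f"k{i}", 1) for i in range(2000)]

print(len(combine(large_bags)))  # 2 records instead of 2000
print(len(combine(small_bags)))  # still 2000 records
```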
> >> > >
> >> > > What I usually do is:
> >> > > 1) If there are many small bags, disable combiners.
> >> > > 2) If there are many large bags, enable combiners. Furthermore,
> >> > > turning on "pig.exec.mapPartAgg" helps. (See the Pig blog at
> >> > > https://blogs.apache.org/pig/entry/apache_pig_it_goes_to for
> >> > > details.)
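[Editor's note: the knobs discussed in this thread can be set at the top of a Pig script. A minimal sketch of both cases; the property names are taken from this thread and the linked blog post, and the threshold value is an illustrative assumption, not a recommendation.]

```pig
-- Case 1: many small bags -> skip the combiner entirely.
SET pig.exec.nocombiner true;

-- Case 2: many large bags -> keep the combiner and also do partial
-- aggregation inside the mapper itself.
SET pig.exec.mapPartAgg true;
-- Keep in-map aggregation only if it shrinks map output by at least
-- this factor (illustrative value).
SET pig.exec.mapPartAgg.minReduction 10;
```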
> >> > >
> >> > > Thanks,
> >> > > Cheolsoo
> >> > >
> >> > >
> >> > > On Thu, Aug 22, 2013 at 4:01 AM, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:
> >> > >
> >> > > > Hi Cheolsoo,
> >> > > >
> >> > > > Thanks - I will try this now and get back to you.
> >> > > >
> >> > > > Out of interest; could you explain (or point me towards resources
> >> > > > that would) why the combiner would be a problem?
> >> > > >
> >> > > > Also, could the fact that Pig builds an intermediary data structure
> >> > > > (?) whilst Hive just performs a sort and then the arithmetic
> >> > > > operation explain the slowdown?
> >> > > >
> >> > > > (Apologies, I'm quite new to Pig/Hive - just my guesses).
> >> > > >
> >> > > > Regards,
> >> > > > Benjamin
> >> > > >
> >> > > >
> >> > > > On 22 August 2013 01:07, Cheolsoo Park <[EMAIL PROTECTED]> wrote:
> >> > > >
> >> > > > > Hi Benjamin,
> >> > > > >
> >> > > > > Thank you very much for sharing detailed information!
> >> > > > >
> >> > > > > 1) From the runtime numbers that you provided, the mappers are
> >> > > > > very slow.
> >> > > > >
> >> > > > > CPU time spent (ms) per run:
> >> > > > >
> >> > > > >   Run   Map CPU     Reduce CPU   Total CPU
> >> > > > >   1     5,081,610   168,740      5,250,350
> >> > > > >   2     5,052,700   178,220      5,230,920
> >> > > > >   3     5,084,430   193,480      5,277,910