Pig >> mail # dev >> Slow Group By operator


Re: Slow Group By operator
I have no more suggestions. If you find anything, please share with us. I
would be interested in understanding what you're seeing.

On Sun, Aug 25, 2013 at 11:14 AM, Benjamin Jakobus
<[EMAIL PROTECTED]> wrote:

> "combiner + mapPartAgg set to true" - yup!
>
>
> On 25 August 2013 18:57, Cheolsoo Park <[EMAIL PROTECTED]> wrote:
>
> > I guess you mean "combiner + mapPartAgg set to true" not "no combiner +
> > mapPartAgg set to true".
> >
> >
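
A minimal sketch of the configuration discussed above, i.e. the combiner left
on (Pig's default) plus in-map partial aggregation. The property names and the
minReduction default are assumptions to check against the Pig version in use:

    -- enable in-map partial aggregation; the combiner itself stays on by default
    SET pig.exec.mapPartAgg true;
    -- optional knob: only keep map-side aggregation when it reduces the map
    -- output by at least this factor (assumed default is 10)
    SET pig.exec.mapPartAgg.minReduction 10;
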
> > On Sun, Aug 25, 2013 at 10:10 AM, Benjamin Jakobus
> > <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Cheolsoo,
> > >
> > > Just ran the benchmarks: no luck.
> > >
> > > No combiner + mapPartAgg set to true is slower than without the combiner:
> > > real 752.85
> > > real 757.41
> > > real 749.03
> > >
> > >
> > >
> > > On 25 August 2013 17:11, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi Cheolsoo,
> > > >
> > > > Thanks - let's see, I'll give it a try now.
> > > >
> > > > Best Regards,
> > > > Ben
> > > >
> > > >
> > > > On 25 August 2013 02:27, Cheolsoo Park <[EMAIL PROTECTED]> wrote:
> > > >
> > > >> Hi Benjamin,
> > > >>
> > > >> Thanks for letting us know. That means my original assumption was wrong.
> > > >> The size of bags is not small. In fact, you can compute the avg size of
> > > >> bags as follows: total number of input records / (reduce input groups x
> > > >> number of reducers).
> > > >>
> > > >> One more thing you can try is turning on "pig.exec.mapPartAgg". That may
> > > >> help mappers run faster. If this doesn't work, I'm out of ideas. :-)
> > > >>
> > > >> Thanks,
> > > >> Cheolsoo
> > > >>
> > > >>
> > > >>
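
To make the estimate above concrete, a quick worked example with invented
counter values (the real figures would come from the job's MapReduce counters):

    avg bag size = total input records / (reduce input groups x number of reducers)
                 = 600,000,000 / (1,000,000 x 30)
                 = 20 records per bag

Here 600 million input records, 1 million reduce input groups per reducer, and
30 reducers are all made-up numbers, used only to show the arithmetic.
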
> > > >> On Sat, Aug 24, 2013 at 3:27 AM, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:
> > > >>
> > > >> > Hi Alan, Cheolsoo,
> > > >> >
> > > >> > I re-ran the benchmarks with and without the combiner. Enabling the
> > > >> > combiner is faster:
> > > >> >
> > > >> > With combiner:
> > > >> > real 668.44
> > > >> > real 663.10
> > > >> > real 665.05
> > > >> >
> > > >> > Without combiner:
> > > >> > real 795.97
> > > >> > real 810.51
> > > >> > real 810.16
> > > >> >
> > > >> > Best Regards,
> > > >> > Ben
> > > >> >
> > > >> >
> > > >> > On 22 August 2013 16:33, Cheolsoo Park <[EMAIL PROTECTED]> wrote:
> > > >> >
> > > >> > > Hi Benjamin,
> > > >> > >
> > > >> > > To answer your question, how the Hadoop combiner works is that 1) mappers
> > > >> > > write outputs to disk and 2) combiners read them, combine and write them
> > > >> > > again. So you're paying extra disk I/O as well as
> > > >> > > serialization/deserialization.
> > > >> > >
> > > >> > > This will pay off if combiners significantly reduce the intermediate
> > > >> > > outputs that reducers need to fetch from mappers. But if reduction is not
> > > >> > > significant, it will only slow down mappers. You can identify whether this
> > > >> > > is really a problem by comparing the time spent by map and combine
> > > >> > > functions in the task logs.
> > > >> > >
> > > >> > > What I usually do is:
> > > >> > > 1) If there are many small bags, disable combiners.
> > > >> > > 2) If there are many large bags, enable combiners. Furthermore, turning on
> > > >> > > "pig.exec.mapPartAgg" helps. (See the Pig blog at
> > > >> > > <https://blogs.apache.org/pig/entry/apache_pig_it_goes_to> for details.)
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Cheolsoo
> > > >> > >
> > > >> > >
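
To illustrate the two cases above, a sketch of a combiner-eligible GROUP BY in
Pig Latin. The relation and field names are invented, and the
pig.exec.nocombiner property name is an assumption to verify against the Pig
version in use:

    -- case 1 (many small bags): skip the combiner entirely
    -- SET pig.exec.nocombiner true;

    -- case 2 (many large bags): keep the combiner and add in-map partial aggregation
    SET pig.exec.mapPartAgg true;

    -- hypothetical input, for illustration only
    A = LOAD 'input' AS (k:chararray, v:long);
    B = GROUP A BY k;
    -- COUNT and SUM are algebraic, so Pig can push them into a combiner
    C = FOREACH B GENERATE group, COUNT(A) AS cnt, SUM(A.v) AS total;
    STORE C INTO 'output';

Besides comparing map and combine times in the task logs, the job's "Combine
input records" and "Combine output records" counters give a direct view of how
much reduction the combiner is actually achieving.
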
> > > >> > > On Thu, Aug 22, 2013 at 4:01 AM, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:
> > > >> > >
> > > >> > > > Hi Cheolsoo,
> > > >> > > >
> > > >> > > > Thanks - I will try this now and get back to you.
> > > >> > > >
> > > >> > > > Out of interest, could you explain (or point me towards resources that
> > > >> > > > would) why the combiner would be a problem?
> > > >> > > >
> > > >> > > > Also, could the fact that Pig builds an intermediary data