Pig >> mail # dev >> Slow Group By operator


Re: Slow Group By operator
I have no more suggestions. If you find anything, please share it with us. I
would be interested in understanding what you're seeing.

On Sun, Aug 25, 2013 at 11:14 AM, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:

> "combiner + mapPartAgg set to true" - yup!
>
>
> On 25 August 2013 18:57, Cheolsoo Park <[EMAIL PROTECTED]> wrote:
>
> > I guess you mean "combiner + mapPartAgg set to true" not "no combiner +
> > mapPartAgg set to true".
> >
> >
> > On Sun, Aug 25, 2013 at 10:10 AM, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Cheolsoo,
> > >
> > > Just ran the benchmarks: no luck.
> > >
> > > No combiner + mapPartAgg set to true is slower than without the combiner:
> > > real 752.85
> > > real 757.41
> > > real 749.03
> > >
> > >
> > >
> > > On 25 August 2013 17:11, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi Cheolsoo,
> > > >
> > > > Thanks - let's see, I'll give it a try now.
> > > >
> > > > Best Regards,
> > > > Ben
> > > >
> > > >
> > > > On 25 August 2013 02:27, Cheolsoo Park <[EMAIL PROTECTED]> wrote:
> > > >
> > > >> Hi Benjamin,
> > > >>
> > > > >> Thanks for letting us know. That means my original assumption was
> > > > >> wrong. The size of bags is not small. In fact, you can compute the
> > > > >> avg size of bags as follows: total number of input records /
> > > > >> ( reduce input groups x number of reducers ).
> > > >>
> > > > >> One more thing you can try is turning on "pig.exec.mapPartAgg".
> > > > >> That may help mappers run faster. If this doesn't work, I'm out of
> > > > >> ideas. :-)
> > > >>
> > > >> Thanks,
> > > >> Cheolsoo
> > > >>
> > > >>
> > > >>
> > > > >> On Sat, Aug 24, 2013 at 3:27 AM, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:
> > > >>
> > > >> > Hi Alan, Cheolsoo,
> > > >> >
> > > > >> > I re-ran the benchmarks with and without the combiner. Enabling
> > > > >> > the combiner is faster:
> > > >> >
> > > >> > With combiner:
> > > >> > real 668.44
> > > >> > real 663.10
> > > >> > real 665.05
> > > >> >
> > > >> > Without combiner:
> > > >> > real 795.97
> > > >> > real 810.51
> > > >> > real 810.16
> > > >> >
> > > >> > Best Regards,
> > > >> > Ben
> > > >> >
> > > >> >
> > > > >> > On 22 August 2013 16:33, Cheolsoo Park <[EMAIL PROTECTED]> wrote:
> > > >> >
> > > >> > > Hi Benjamin,
> > > >> > >
> > > > >> > > To answer your question, the Hadoop combiner works as follows:
> > > > >> > > 1) mappers write their outputs to disk and 2) combiners read
> > > > >> > > them, combine them, and write them again. So you're paying for
> > > > >> > > extra disk I/O as well as serialization/deserialization.
> > > >> > >
> > > > >> > > This will pay off if combiners significantly reduce the
> > > > >> > > intermediate outputs that reducers need to fetch from mappers.
> > > > >> > > But if the reduction is not significant, it will only slow down
> > > > >> > > mappers. You can identify whether this is really a problem by
> > > > >> > > comparing the time spent by map and combine functions in the
> > > > >> > > task logs.
> > > >> > >
> > > > >> > > What I usually do is:
> > > > >> > > 1) If there are many small bags, disable combiners.
> > > > >> > > 2) If there are many large bags, enable combiners. Furthermore,
> > > > >> > > turning on "pig.exec.mapPartAgg" helps. (See the Pig blog
> > > > >> > > <https://blogs.apache.org/pig/entry/apache_pig_it_goes_to> for
> > > > >> > > details.)
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Cheolsoo
> > > >> > >
> > > >> > >
> > > > >> > > On Thu, Aug 22, 2013 at 4:01 AM, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:
> > > >> > >
> > > >> > > > Hi Cheolsoo,
> > > >> > > >
> > > >> > > > Thanks - I will try this now and get back to you.
> > > >> > > >
> > > > >> > > > Out of interest, could you explain (or point me towards
> > > > >> > > > resources that would) why the combiner would be a problem?
> > > >> > > >
> > > >> > > > Also, could the fact that Pig builds an intermediary data