Pig >> mail # dev >> Slow Group By operator


Re: Slow Group By operator
I guess you mean "combiner + mapPartAgg set to true" not "no combiner +
mapPartAgg set to true".
On Sun, Aug 25, 2013 at 10:10 AM, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:

> Hi Cheolsoo,
>
> Just ran the benchmarks: no luck.
>
> No combiner + mapPartAgg set to true is slower than without the combiner:
> real 752.85
> real 757.41
> real 749.03
>
>
>
> On 25 August 2013 17:11, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:
>
> > Hi Cheolsoo,
> >
> > Thanks - let's see, I'll give it a try now.
> >
> > Best Regards,
> > Ben
> >
> >
> > On 25 August 2013 02:27, Cheolsoo Park <[EMAIL PROTECTED]> wrote:
> >
> >> Hi Benjamin,
> >>
> >> Thanks for letting us know. That means my original assumption was wrong.
> >> The size of bags is not small. In fact, you can compute the avg size of
> >> bags as follows: total number of input records / ( reduce input groups x
> >> number of reducers ).
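A minimal sketch of the bag-size arithmetic quoted above, written in Python for illustration. It follows the formula exactly as stated in the message, which assumes that "reduce input groups" is a per-reducer figure; the function name and the counter values are hypothetical and not taken from the thread.

```python
def avg_bag_size(total_input_records: int,
                 reduce_input_groups: int,
                 num_reducers: int = 1) -> float:
    """Average number of records per group (bag).

    Implements the formula from the message:
        total input records / (reduce input groups x number of reducers)
    Assumption: reduce_input_groups is a per-reducer value.
    """
    total_groups = reduce_input_groups * num_reducers
    if total_groups <= 0:
        raise ValueError("need at least one reduce input group")
    return total_input_records / total_groups


# Hypothetical counter values, for illustration only.
example = avg_bag_size(total_input_records=1_000_000,
                       reduce_input_groups=50,
                       num_reducers=20)
print(f"average bag size: {example:.1f} records")
```

If the counter you read is already a job-wide total of groups, leave `num_reducers` at its default of 1 so the groups are not counted twice.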
> >>
> >> One more thing you can try is turning on "pig.exec.mapPartAgg". That may
> >> help mappers run faster. If this doesn't work, I've run out of ideas. :-)
> >>
> >> Thanks,
> >> Cheolsoo
> >>
> >>
> >>
> >> On Sat, Aug 24, 2013 at 3:27 AM, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:
> >>
> >> > Hi Alan, Cheolsoo,
> >> >
> >> > I re-ran the benchmarks with and without the combiner. Enabling the
> >> > combiner is faster:
> >> >
> >> > With combiner:
> >> > real 668.44
> >> > real 663.10
> >> > real 665.05
> >> >
> >> > Without combiner:
> >> > real 795.97
> >> > real 810.51
> >> > real 810.16
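To put the two sets of timings above side by side, here is a small Python sketch that averages each set and computes the relative saving. The numbers are copied from the message; treating the `real` values as wall-clock seconds is an assumption, and the helper name is hypothetical.

```python
from statistics import mean

# "real" timings copied from the message (assumed to be seconds).
with_combiner = [668.44, 663.10, 665.05]
without_combiner = [795.97, 810.51, 810.16]


def relative_saving(baseline, candidate):
    """Return (baseline mean, candidate mean, fraction of time saved)."""
    base_mean = mean(baseline)
    cand_mean = mean(candidate)
    return base_mean, cand_mean, 1 - cand_mean / base_mean


base_mean, cand_mean, saving = relative_saving(without_combiner, with_combiner)
print(f"without combiner: {base_mean:.2f}")
print(f"with combiner:    {cand_mean:.2f}")
print(f"time saved:       {saving:.1%}")
```

Three runs per configuration is a small sample, so the saving is best read as a rough indication rather than a precise figure.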
> >> >
> >> > Best Regards,
> >> > Ben
> >> >
> >> >
> >> > On 22 August 2013 16:33, Cheolsoo Park <[EMAIL PROTECTED]> wrote:
> >> >
> >> > > Hi Benjamin,
> >> > >
> >> > > To answer your question, the Hadoop combiner works like this: 1) mappers
> >> > > write their outputs to disk and 2) combiners read them, combine them,
> >> > > and write them back again. So you're paying for extra disk I/O as well
> >> > > as serialization/deserialization.
> >> > >
> >> > > This will pay off if combiners significantly reduce the intermediate
> >> > > outputs that reducers need to fetch from mappers. But if the reduction
> >> > > is not significant, it will only slow down mappers. You can identify
> >> > > whether this is really a problem by comparing the time spent by map and
> >> > > combine functions in the task logs.
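The trade-off described above can be illustrated with a toy Python model of a single mapper's output. This is only a sketch under stated assumptions: the aggregate is a simple per-key sum, and `combine` and `reduction_ratio` are hypothetical helpers, not Hadoop or Pig APIs.

```python
from collections import defaultdict


def combine(mapper_output):
    """Sum values per key within one mapper's output, as a combiner might."""
    partial = defaultdict(int)
    for key, value in mapper_output:
        partial[key] += value
    return list(partial.items())


def reduction_ratio(mapper_output):
    """Records left after combining, as a fraction of records before."""
    if not mapper_output:
        return 1.0
    return len(combine(mapper_output)) / len(mapper_output)


# Few large groups: 1000 records spread over 3 keys.
few_large_groups = [(i % 3, 1) for i in range(1000)]
# Many small groups: 1000 records spread over 500 keys (2 records each).
many_small_groups = [(i // 2, 1) for i in range(1000)]

print(reduction_ratio(few_large_groups))
print(reduction_ratio(many_small_groups))
```

A ratio close to 0 means the combiner removes most of the data reducers would otherwise fetch; a ratio close to 1 means the extra read, combine and write pass buys very little.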
> >> > >
> >> > > What I usually do is:
> >> > > 1) If there are many small bags, disable combiners.
> >> > > 2) If there are many large bags, enable combiners. Furthermore, turning
> >> > > on "pig.exec.mapPartAgg" helps. (See the Pig blog
> >> > > <https://blogs.apache.org/pig/entry/apache_pig_it_goes_to> for
> >> > > details.)
> >> > >
> >> > > Thanks,
> >> > > Cheolsoo
> >> > >
> >> > >
> >> > > On Thu, Aug 22, 2013 at 4:01 AM, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:
> >> > >
> >> > > > Hi Cheolsoo,
> >> > > >
> >> > > > Thanks - I will try this now and get back to you.
> >> > > >
> >> > > > Out of interest, could you explain (or point me towards resources that
> >> > > > would) why the combiner would be a problem?
> >> > > >
> >> > > > Also, could the fact that Pig builds an intermediary data structure (?)
> >> > > > whilst Hive just performs a sort then the arithmetic operation explain
> >> > > > the slowdown?
> >> > > >
> >> > > > (Apologies, I'm quite new to Pig/Hive - just my guesses).
> >> > > >
> >> > > > Regards,
> >> > > > Benjamin
> >> > > >
> >> > > >
> >> > > > On 22 August 2013 01:07, Cheolsoo Park <[EMAIL PROTECTED]> wrote:
> >> > > >
> >> > > > > Hi Benjamin,
> >> > > > >
> >> > > > > Thank you very much for sharing detailed information!
> >> > > > >
> >> > > > > 1) From the runtime numbers that you provided, the mappers are very
> >> > > > > slow.
> >> > > > >
> >> > > > > Counter              Map        Reduce   Total
> >> > > > > CPU time spent (ms)  5,081,610  168,740  5,250,350
> >> > > > > CPU time spent (ms)  5,052,700  178,220  5,230,920
> >> > > > > CPU time spent (ms)  5,084,430  193,480  5,277,910