Pig dev mailing list >> Slow Group By operator


Thread:
- Benjamin Jakobus 2013-08-20, 09:27
- Cheolsoo Park 2013-08-20, 18:56
- Benjamin Jakobus 2013-08-21, 10:52
- Cheolsoo Park 2013-08-22, 00:07
- Benjamin Jakobus 2013-08-22, 11:01
- Alan Gates 2013-08-22, 15:38
- Benjamin Jakobus 2013-08-24, 10:11
Re: Slow Group By operator
Hi Benjamin,

To answer your question: the way the Hadoop combiner works is that 1) mappers
write their outputs to disk and 2) combiners read them back, combine them, and
write them out again. So you're paying extra disk I/O as well as
serialization/deserialization.

This will pay off if combiners significantly reduce the intermediate
outputs that reducers need to fetch from mappers. But if reduction is not
significant, it will only slow down mappers. You can identify whether this
is really a problem by comparing the time spent by map and combine
functions in the task logs.

What I usually do is:
1) If there are many small bags, disable combiners.
2) If there are many large bags, enable combiners. Furthermore, turning on
"pig.exec.mapPartAgg" helps (see the Pig blog at
https://blogs.apache.org/pig/entry/apache_pig_it_goes_to for details).
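The two settings mentioned above go at the top of a Pig script; a minimal sketch (which case to enable is the script author's choice, per the advice above):

```pig
-- Case 1: many small bags -- skip the combiner entirely
set pig.exec.nocombiner true;

-- Case 2: many large bags -- keep the combiner and additionally
-- enable in-map partial aggregation:
-- set pig.exec.mapPartAgg true;
```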

Thanks,
Cheolsoo
On Thu, Aug 22, 2013 at 4:01 AM, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:

> Hi Cheolsoo,
>
> Thanks - I will try this now and get back to you.
>
> Out of interest; could you explain (or point me towards resources that
> would) why the combiner would be a problem?
>
> Also, could the fact that Pig builds an intermediary data structure (?)
> whilst Hive just performs a sort then the arithmetic operation explain the
> slowdown?
>
> (Apologies, I'm quite new to Pig/Hive - just my guesses).
>
> Regards,
> Benjamin
>
>
> On 22 August 2013 01:07, Cheolsoo Park <[EMAIL PROTECTED]> wrote:
>
> > Hi Benjamin,
> >
> > Thank you very much for sharing detailed information!
> >
> > 1) From the runtime numbers that you provided, the mappers are very slow.
> >
> > CPU time spent (ms): 5,081,610 (map), 168,740 (reduce), 5,250,350 (total)
> > CPU time spent (ms): 5,052,700 (map), 178,220 (reduce), 5,230,920 (total)
> > CPU time spent (ms): 5,084,430 (map), 193,480 (reduce), 5,277,910 (total)
> >
> > 2) In your GROUP BY query, you have an algebraic UDF "COUNT".
> >
> > I am wondering whether disabling the combiner will help here. I have seen
> > many cases where the combiner actually hurt performance significantly
> > because it didn't reduce the mapper outputs by much. Briefly looking at
> > generate_data.pl in PIG-200, it looks like a lot of random keys are
> > generated. So I guess you will end up with a large number of small bags
> > rather than a small number of large bags. If that's the case, the combiner
> > will only add overhead to mappers.
> >
> > Can you try to include this "set pig.exec.nocombiner true;" and see
> whether
> > it helps?
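The suggested setting goes at the top of the script, before the query it should affect. A sketch assuming a simple GROUP BY plus COUNT of the kind discussed in this thread (the aliases and paths are hypothetical; the field schema is taken from Benjamin's description below):

```pig
-- Disable the combiner for this run, as suggested above
set pig.exec.nocombiner true;

data    = LOAD 'dataset' AS (name:chararray, marks:int, gpa:float);
grouped = GROUP data BY name;
counts  = FOREACH grouped GENERATE group, COUNT(data);
STORE counts INTO 'counts_out';
```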
> >
> > Thanks,
> > Cheolsoo
> >
> >
> >
> >
> >
> >
> > On Wed, Aug 21, 2013 at 3:52 AM, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Cheolsoo,
> > >
> > > >> What's your query like? Can you share it? Do you call any algebraic UDF
> > > >> after group by? I am wondering whether combiner matters in your test.
> > > I have been running 3 different types of queries.
> > >
> > > The first was performed on datasets of 6 different sizes:
> > >
> > >
> > >    - Dataset size 1: 30,000 records (772KB)
> > >    - Dataset size 2: 300,000 records (6.4MB)
> > >    - Dataset size 3: 3,000,000 records (63MB)
> > >    - Dataset size 4: 30 million records (628MB)
> > >    - Dataset size 5: 300 million records (6.2GB)
> > >    - Dataset size 6: 3 billion records (62GB)
> > >
> > > The datasets scale linearly, whereby the size of dataset n equates to 3,000 * 10^n.
> > > A seventh dataset consisting of 1,000 records (23KB) was produced to
> > > perform join operations on. Its schema is as follows:
> > > name - string
> > > marks - integer
> > > gpa - float
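A join against this small table, of the kind the seventh dataset was built for, might be sketched as follows (the aliases and input paths are hypothetical; only the field schema comes from the thread):

```pig
-- 1,000-record side table with the schema described above
small  = LOAD 'join_table' AS (name:chararray, marks:int, gpa:float);
-- One of the six benchmark datasets
big    = LOAD 'dataset'    AS (name:chararray, marks:int, gpa:float);
joined = JOIN big BY name, small BY name;
```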
> > > The datasets were generated using the generate_data.pl Perl script
> > > available for download from https://issues.apache.org/jira/browse/PIG-200.
> > > The results are as follows:
> > >
> > >
> > >               Set 1    Set 2    Set 3    Set 4    Set 5     Set 6
> > > Arithmetic    32.82    36.21    49.49    83.25    423.63    3900.78
> > > Filter 10%    32.94    34.32    44.56    66.68    [...]
Later messages in this thread:
- Benjamin Jakobus 2013-08-24, 10:27
- Cheolsoo Park 2013-08-25, 01:27
- Benjamin Jakobus 2013-08-25, 16:11
- Benjamin Jakobus 2013-08-25, 17:10
- Cheolsoo Park 2013-08-25, 17:57
- Benjamin Jakobus 2013-08-25, 18:14
- Cheolsoo Park 2013-08-25, 19:31
- Benjamin Jakobus 2013-08-25, 20:01