Pig >> mail # dev >> Slow Group By operator


+
Benjamin Jakobus 2013-08-20, 09:27
+
Cheolsoo Park 2013-08-20, 18:56
+
Benjamin Jakobus 2013-08-21, 10:52
Re: Slow Group By operator
Hi Benjamin,

Thank you very much for sharing detailed information!

1) From the runtime numbers that you provided, the mappers are very slow.

CPU time spent (ms)      Map          Reduce     Total
                         5,081,610    168,740    5,250,350
                         5,052,700    178,220    5,230,920
                         5,084,430    193,480    5,277,910
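
As a quick sanity check on those counters, here is a small sketch of the arithmetic. It assumes each "CPU time spent (ms)" row lists the map value, the reduce value, and their total, in that order; that column reading is an assumption, although the sums are consistent with it. The helper name map_share is only illustrative.

```python
# (map, reduce, total) CPU milliseconds, copied from the counters quoted above.
# Assumption: the three figures per row are map, reduce and total, in that order.
rows = [
    (5_081_610, 168_740, 5_250_350),
    (5_052_700, 178_220, 5_230_920),
    (5_084_430, 193_480, 5_277_910),
]


def map_share(map_ms, reduce_ms):
    """Fraction of the job's CPU time that was spent in map tasks."""
    return map_ms / (map_ms + reduce_ms)


for map_ms, reduce_ms, total_ms in rows:
    # The third figure should be the sum of the first two if the reading is right.
    assert map_ms + reduce_ms == total_ms
    print(f"map share of CPU time: {map_share(map_ms, reduce_ms):.1%}")
```

Under that reading, the map side accounts for well over nine tenths of the CPU time in every row, which is why the mappers are the first place to look.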

2) In your GROUP BY query, you have an algebraic UDF "COUNT".

I am wondering whether disabling the combiner will help here. I have seen a
lot of cases where the combiner actually hurts performance when it doesn't
reduce the mapper output by much. Briefly looking at generate_data.pl in
PIG-200, it looks like a lot of random keys are generated, so I guess you
will end up with a large number of small bags rather than a small number of
large bags. If that's the case, the combiner will only add overhead to the
mappers.
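
To illustrate the reasoning, here is a rough Python sketch (not Pig code, and not how Hadoop implements combining internally). It treats each fixed-size slice of the key list as one mapper's output and counts how many partial COUNT records a combiner would emit for it, namely one per distinct key. The function name, the slice size, and the two synthetic key distributions are all assumptions made for the illustration.

```python
import random


def combiner_reduction(keys, split_size):
    """Return (records_in, records_out) for a per-split COUNT combiner.

    Simplification: each slice of `split_size` keys stands for one mapper's
    output, and the combiner emits one partial count per distinct key.
    """
    records_in = len(keys)
    records_out = 0
    for start in range(0, len(keys), split_size):
        split = keys[start:start + split_size]
        records_out += len(set(split))  # one combined record per distinct key
    return records_in, records_out


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000
random_keys = [rng.randrange(10**9) for _ in range(n)]  # almost all keys unique
repeated_keys = [rng.randrange(100) for _ in range(n)]  # only 100 distinct keys

for label, keys in [("random keys", random_keys), ("repeated keys", repeated_keys)]:
    rec_in, rec_out = combiner_reduction(keys, split_size=10_000)
    print(f"{label}: {rec_in} records in, {rec_out} out of the combiner")
```

With mostly unique keys the combiner passes through nearly as many records as it receives, so the extra work buys nothing; with heavily repeated keys it shrinks the map output dramatically.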

Can you try including "set pig.exec.nocombiner true;" in your script and see
whether it helps?

Thanks,
Cheolsoo
On Wed, Aug 21, 2013 at 3:52 AM, Benjamin Jakobus <[EMAIL PROTECTED]> wrote:

> Hi Cheolsoo,
>
> >>What's your query like? Can you share it? Do you call any algebraic UDF
> >> after group by? I am wondering whether combiner matters in your test.
> I have been running 3 different types of queries.
>
> The first was performed on datasets of 6 different sizes:
>
>
>    - Dataset size 1: 30,000 records (772KB)
>    - Dataset size 2: 300,000 records (6.4MB)
>    - Dataset size 3: 3,000,000 records (63MB)
>    - Dataset size 4: 30 million records (628MB)
>    - Dataset size 5: 300 million records (6.2GB)
>    - Dataset size 6: 3 billion records (62GB)
>
> The datasets scale by a factor of ten, whereby the number of records in
> dataset n equates to 3000 * 10^n.
> A seventh dataset consisting of 1,000 records (23KB) was produced to
> perform join operations on. Its schema is as follows:
>
>    - name - string
>    - marks - integer
>    - gpa - float
>
> The datasets were generated using the generate_data.pl Perl script
> available for download from https://issues.apache.org/jira/browse/PIG-200.
> The results are as follows:
>
>
>               Set 1    Set 2    Set 3    Set 4     Set 5      Set 6
> Arithmetic    32.82    36.21    49.49    83.25     423.63     3900.78
> Filter 10%    32.94    34.32    44.56    66.68     295.59     2640.52
> Filter 90%    33.93    32.55    37.86    53.22     197.36     1657.37
> Group         49.43    53.34    69.84    105.12    497.61     4394.21
> Join          49.89    50.08    78.55    150.39    1045.34    10258.19
>
> *Averaged performance of arithmetic, join, group, order, distinct select
> and filter operations on six datasets using Pig. Scripts were configured
> to use 8 reduce and 11 map tasks.*
>
>
>
>               Set 1    Set 2    Set 3    Set 4     Set 5      Set 6
> Arithmetic    32.84    37.33    72.55    300.08    2633.72    27821.19
> Filter 10%    32.36    53.28    59.22    209.5     1672.3     18222.19
> Filter 90%    31.23    32.68    36.8     69.55     331.88     3320.59
> Group         48.27    47.68    46.87    53.66     141.36     1233.4
> Join          48.54    56.86    104.6    517.5     4388.34    -
> Distinct      48.73    53.28    72.54    109.77    -          -
>
> *Averaged performance of arithmetic, join, group, distinct select and
> filter operations on six datasets using Hive. Scripts were configured to
> use 8 reduce and 11 map tasks.*
>
> (If you want to see the standard deviation, let me know).
>
> So, to summarize the results: Pig outperforms Hive, with the exception of
> using *Group By*.
>
> The Pig scripts used for this benchmark are as follows:
> *Arithmetic*