Re: Slow Group By operator
When data comes out of a map task, Hadoop serializes it so that it knows its exact size as it writes it into the output buffer. To run it through the combiner it needs to deserialize the data again, and then re-serialize it when it comes out. So each pass through the combiner costs a deserialization/re-serialization cycle, which is expensive and not worth it unless the data reduction is significant.

In other words, the combiner can be slow because Java lacks a sizeof operator.

Alan.
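A minimal sketch of the round trip described above, using Hadoop's Writable buffer classes; the key/value types here are arbitrary stand-ins chosen for illustration, not Pig's actual map-output types:

import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class CombinerRoundTrip {
    public static void main(String[] args) throws Exception {
        Text key = new Text("some-group-key");
        IntWritable value = new IntWritable(1);

        // Map output: serialize into a byte buffer so the framework
        // knows the record's exact size in the output buffer.
        DataOutputBuffer out = new DataOutputBuffer();
        key.write(out);
        value.write(out);
        System.out.println("serialized record: " + out.getLength() + " bytes");

        // Combiner input: the same bytes must be deserialized back into objects...
        DataInputBuffer in = new DataInputBuffer();
        in.reset(out.getData(), out.getLength());
        Text keyCopy = new Text();
        IntWritable valueCopy = new IntWritable();
        keyCopy.readFields(in);
        valueCopy.readFields(in);

        // ...and the combiner's output is serialized all over again.
        DataOutputBuffer out2 = new DataOutputBuffer();
        keyCopy.write(out2);
        valueCopy.write(out2);
    }
}

This per-record cost is only worth paying when the combiner actually collapses many records per key.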

On Aug 22, 2013, at 4:01 AM, Benjamin Jakobus wrote:

> Hi Cheolsoo,
>
> Thanks - I will try this now and get back to you.
>
> Out of interest; could you explain (or point me towards resources that
> would) why the combiner would be a problem?
>
> Also, could the fact that Pig builds an intermediary data structure (?)
> whilst Hive just performs a sort then the arithmetic operation explain the
> slowdown?
>
> (Apologies, I'm quite new to Pig/Hive - just my guesses).
>
> Regards,
> Benjamin
>
>
> On 22 August 2013 01:07, Cheolsoo Park <[EMAIL PROTECTED]> wrote:
>
>> Hi Benjamin,
>>
>> Thank you very much for sharing detailed information!
>>
>> 1) From the runtime numbers that you provided, the mappers are very slow.
>>
>>                       Map          Reduce     Total
>> CPU time spent (ms)   5,081,610    168,740    5,250,350
>> CPU time spent (ms)   5,052,700    178,220    5,230,920
>> CPU time spent (ms)   5,084,430    193,480    5,277,910
>>
>> 2) In your GROUP BY query, you have an algebraic UDF "COUNT".
>>
>> I am wondering whether disabling the combiner will help here. I have seen a
>> lot of cases where the combiner actually hurts performance significantly
>> when it doesn't reduce the mapper output by much. Briefly looking at
>> generate_data.pl in PIG-200, it looks like a lot of random keys are
>> generated. So I guess you will end up with a large number of small bags
>> rather than a small number of large bags. If that's the case, the combiner
>> will only add overhead to the mappers.
>>
>> Can you try to include this "set pig.exec.nocombiner true;" and see whether
>> it helps?
>>
>> Thanks,
>> Cheolsoo
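
A minimal sketch of trying this suggestion from Java via PigServer; only the pig.exec.nocombiner property comes from the message above, while the relation names, input path, and query are made up for illustration:

import java.util.Properties;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class NoCombinerGroupBy {
    public static void main(String[] args) throws Exception {
        // Disable the combiner for this run.
        Properties props = new Properties();
        props.setProperty("pig.exec.nocombiner", "true");

        PigServer pig = new PigServer(ExecType.MAPREDUCE, props);

        // Hypothetical input and relations; the schema mirrors the dataset
        // described later in the thread.
        pig.registerQuery("A = LOAD 'dataset' AS (name:chararray, marks:int, gpa:float);");
        pig.registerQuery("B = GROUP A BY name;");
        // COUNT is the algebraic UDF in question; with the combiner disabled
        // it runs only on the reduce side.
        pig.registerQuery("C = FOREACH B GENERATE group, COUNT(A);");
        pig.store("C", "group_by_output");
    }
}

Passing the property this way should have the same effect as putting "set pig.exec.nocombiner true;" at the top of the Pig script, as suggested above.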
>>
>>
>>
>>
>>
>>
>> On Wed, Aug 21, 2013 at 3:52 AM, Benjamin Jakobus <[EMAIL PROTECTED]
>>> wrote:
>>
>>> Hi Cheolsoo,
>>>
>>>>> What's your query like? Can you share it? Do you call any algebraic UDF
>>>>> after group by? I am wondering whether combiner matters in your test.
>>> I have been running 3 different types of queries.
>>>
>>> The first was performed on datasets of 6 different sizes:
>>>
>>>
>>>   - Dataset size 1: 30,000 records (772KB)
>>>   - Dataset size 2: 300,000 records (6.4MB)
>>>   - Dataset size 3: 3,000,000 records (63MB)
>>>   - Dataset size 4: 30 million records (628MB)
>>>   - Dataset size 5: 300 million records (6.2GB)
>>>   - Dataset size 6: 3 billion records (62GB)
>>>
>>> The datasets scale linearly, with dataset n containing 3,000 * 10^n records.
>>> A seventh dataset consisting of 1,000 records (23KB) was produced to
>>> perform join
>>> operations on. Its schema is as follows:
>>> name - string
>>> marks - integer
>>> gpa - float
>>> The datasets were generated using the generate_data.pl Perl script available
>>> for download from https://issues.apache.org/jira/browse/PIG-200. The results
>>> are as follows:
>>>
>>>
>>>               Set 1    Set 2    Set 3    Set 4     Set 5      Set 6
>>> Arithmetic    32.82    36.21    49.49    83.25     423.63     3900.78
>>> Filter 10%    32.94    34.32    44.56    66.68     295.59     2640.52
>>> Filter 90%    33.93    32.55    37.86    53.22     197.36     1657.37
>>> Group         49.43    53.34    69.84    105.12    497.61     4394.21
>>> Join          49.89    50.08    78.55    150.39    1045.34    10258.19
>>>
>>> Averaged performance of arithmetic, join, group, order, distinct select
>>> and filter operations on six datasets using Pig. Scripts were configured