Hive user mailing list: Performance difference between tuning reducer num and partition table


Re: Performance difference between tuning reducer num and partition table
What happens if you don't set the number of reducers in the 1st run? How
many reducers are executed? If it's a much smaller number, the extra
overhead could matter. Another clue is the size of the files the first run
produced, i.e., do you have 30 small (much smaller than a block size) files?
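
For context, a minimal sketch of the knobs being discussed, in HiveQL with
the classic MR-era property names (the values shown and the warehouse path
are placeholders, not from the thread):

-- Let Hive estimate the reducer count from the input size:
SET mapred.reduce.tasks=-1;                            -- -1 means "let Hive decide"
SET hive.exec.reducers.bytes.per.reducer=1000000000;   -- target input bytes per reducer
SET hive.exec.reducers.max=999;                        -- upper bound on the estimate

-- versus pinning the count manually, as in the first run described below:
SET mapred.reduce.tasks=30;

-- Afterwards, check from the Hive CLI whether the job left many small files
-- (the output path is a placeholder):
dfs -ls /user/hive/warehouse/c;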

On Sat, Jun 29, 2013 at 12:27 AM, Felix.徐 <[EMAIL PROTECTED]> wrote:

> Hi Stephen,
>
> My query is actually more complex; Hive generates two MapReduce jobs.
> In the first solution it runs 17 mappers / 30 reducers and 10 mappers /
> 30 reducers (the reducer num is set manually).
> In the second solution it runs 6 mappers / 1 reducer and 4 mappers / 1
> reducer for each partition.
>
> I do not know whether they could achieve the same performance if the
> reducer num were set properly.
>
>
> 2013/6/29 Stephen Sprague <[EMAIL PROTECTED]>
>
>> Great question. Your parallelization seems to trump Hadoop's. I guess
>> I'd ask: what are the _total_ numbers of mappers and reducers that run on
>> your cluster for these two scenarios? I'd be curious whether they are the
>> same.
>>
>>
>>
>>
>> On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 <[EMAIL PROTECTED]> wrote:
>>
>>> Hi all,
>>>
>>> Here is the scenario: suppose I have two tables, A and B, and I would
>>> like to perform a simple join on them.
>>>
>>> We can do it like this:
>>>
>>> INSERT OVERWRITE TABLE C
>>> SELECT .... FROM A JOIN B on A.id=B.id
>>>
>>> Since tables A and B have lots of data, another approach to speed up
>>> this query is:
>>>
>>> Say I partition tables A and B into 10 partitions each, and write the
>>> query like this:
>>>
>>> INSERT OVERWRITE TABLE C PARTITION(pid=1)
>>> SELECT .... FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1
>>>
>>> Then I run this query 10 times concurrently (pid ranges from 1 to 10).
>>>
>>> My question is that, in my observation of some more complex queries,
>>> the second solution is about 15% faster than the first solution. Is it
>>> simply because the reducer num is not set optimally?
>>> If resources are not a limit and it is possible to set the proper
>>> reducer num in the first solution, can they achieve the same performance?
>>> Is there any other factor that can cause a performance difference between
>>> them (non-partition vs. partition + concurrent) besides the job parameter
>>> issues?
>>>
>>> Thanks!
>>>
>>
>>
>
--
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com
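
A rough sketch of how the second approach quoted above might be set up in
HiveQL (the column names and types other than id and pid are placeholders,
not from the thread):

-- Tables partitioned on the key used in the per-partition WHERE clauses:
CREATE TABLE A (id BIGINT, val STRING)   PARTITIONED BY (pid INT);
CREATE TABLE B (id BIGINT, other STRING) PARTITIONED BY (pid INT);
CREATE TABLE C (id BIGINT, val STRING, other STRING) PARTITIONED BY (pid INT);

-- One INSERT per partition; in the thread, ten of these (pid = 1 .. 10)
-- are launched concurrently, e.g. from ten separate Hive sessions:
INSERT OVERWRITE TABLE C PARTITION (pid=1)
SELECT A.id, A.val, B.other
FROM A JOIN B ON (A.id = B.id)
WHERE A.pid = 1 AND B.pid = 1;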