Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> Performance difference between tuning reducer num and partition table


+
Felix.徐 2013-06-28, 15:40
+
Stephen Sprague 2013-06-28, 16:41
Copy link to this message
-
Re: Performance difference between tuning reducer num and partition table
Hi Stephen,

My query is actually more complex , hive will generate 2 mapreduces,
in the first solution , it runs 17 mappers / 30 reducers and 10 mappers /
30 reducers (reducer num is set manually)
in the second solution , it runs 6 mappers / 1 reducer and 4 mappers / 1
reducers for each partition

I do not know whether they could achieve the same performance if the
reducers num is set properly.
2013/6/29 Stephen Sprague <[EMAIL PROTECTED]>

> great question.  your parallelization seems to trump hadoop's.    I guess
> i'd ask what are the _total_ number of Mappers and Reducers that run on
> your cluster for these two scenarios?   I'd be curious if there are the
> same.
>
>
>
>
> On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 <[EMAIL PROTECTED]> wrote:
>
>> Hi all,
>>
>> Here is the scenario, suppose I have 2 tables A and B, I would like to
>> perform a simple join on them,
>>
>> We can do it like this:
>>
>> INSERT OVERWRITE TABLE C
>> SELECT .... FROM A JOIN B on A.id=B.id
>>
>> In order to speed up this query since table A and B have lots of data,
>> another approach is :
>>
>> Say I partition table A and B into 10 partitions respectively, and write
>> the query like this
>>
>> INSERT OVERWRITE TABLE C PARTITION(pid=1)
>> SELECT .... FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1
>>
>> then I run this query 10 times concurrently (pid ranges from 1 to 10)
>>
>> And my question is that , in my observation of some more complex queries,
>> the second solution is about 15% faster than the first solution,
>> is it simply because the setting of reducer num is not optimal?
>> If the resource is not a limit and it is possible to set the proper
>> reducer nums in the first solution , can they achieve the same performance?
>> Is there any other fact that can cause performance difference between
>> them(non-partition VS partition+concurrent) besides the job parameter
>> issues?
>>
>> Thanks!
>>
>
>
+
Dean Wampler 2013-06-29, 14:01
+
Felix.徐 2013-07-01, 02:12
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB