Hive, mail # user - Performance difference between tuning reducer num and partition table


Re: Performance difference between tuning reducer num and partition table
Felix.徐 2013-06-29, 05:27
Hi Stephen,

My query is actually more complex; Hive generates 2 MapReduce jobs for it.
In the first solution, they run with 17 mappers / 30 reducers and 10 mappers /
30 reducers (the reducer number is set manually).
In the second solution, they run with 6 mappers / 1 reducer and 4 mappers /
1 reducer for each partition.

I do not know whether the two solutions could achieve the same performance
if the number of reducers were set properly.
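
To put the totals side by side, here is a quick tally. It is only a sketch: it assumes each of the 10 partition runs launches exactly the task counts quoted above, and the helper name total_tasks is made up for illustration.

```python
# Partition count from the original question (pid ranges from 1 to 10).
PARTITIONS = 10

# (mappers, reducers) for each of the two MapReduce jobs Hive generates.
single_query_jobs = [(17, 30), (10, 30)]   # first solution, one big query
per_partition_jobs = [(6, 1), (4, 1)]      # second solution, per partition


def total_tasks(jobs, runs=1):
    """Return (total mappers, total reducers) across all jobs and runs."""
    mappers = sum(m for m, _ in jobs) * runs
    reducers = sum(r for _, r in jobs) * runs
    return mappers, reducers


single_total = total_tasks(single_query_jobs)
partitioned_total = total_tasks(per_partition_jobs, runs=PARTITIONS)

print("single query:      %d mappers, %d reducers" % single_total)
print("10 partition runs: %d mappers, %d reducers" % partitioned_total)
```

Under that assumption the partitioned runs use more mappers but fewer reducers in total, so the two solutions do not run the same number of tasks.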
2013/6/29 Stephen Sprague <[EMAIL PROTECTED]>

> Great question. Your parallelization seems to trump Hadoop's. I guess
> I'd ask what the _total_ numbers of mappers and reducers that run on
> your cluster are for these two scenarios. I'd be curious whether they are
> the same.
>
> On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 <[EMAIL PROTECTED]> wrote:
>
>> Hi all,
>>
>> Here is the scenario: suppose I have 2 tables, A and B, and I would like
>> to perform a simple join on them.
>>
>> We can do it like this:
>>
>> INSERT OVERWRITE TABLE C
>> SELECT .... FROM A JOIN B on A.id=B.id
>>
>> Since tables A and B hold lots of data, another approach to speed up
>> this query is the following.
>>
>> Say I partition tables A and B into 10 partitions each and write
>> the query like this:
>>
>> INSERT OVERWRITE TABLE C PARTITION(pid=1)
>> SELECT .... FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1
>>
>> Then I run this query 10 times concurrently (pid ranges from 1 to 10).
>>
>> My question is this: in my observation of some more complex queries,
>> the second solution is about 15% faster than the first one.
>> Is that simply because the reducer number setting is not optimal?
>> If resources are not a limit and it is possible to set a proper
>> reducer number in the first solution, can the two achieve the same
>> performance? Are there any other factors that can cause a performance
>> difference between them (non-partitioned vs. partitioned + concurrent)
>> besides the job parameter settings?
>>
>> Thanks!
>>
>
>
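
For reference, the "run this query 10 times concurrently" step from the original question could be driven by a small script like the one below. This is a sketch under assumptions, not the setup actually used in the thread: it assumes the hive command-line client is on the PATH and accepts a statement through its -e option, and the function names are hypothetical. The "...." column list is elided in the original and is left as-is, so it has to be filled in before the statements can really run.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Query template from the original question. "...." is the elided column
# list and must be replaced before the statement can actually run.
QUERY_TEMPLATE = (
    "INSERT OVERWRITE TABLE C PARTITION(pid={pid}) "
    "SELECT .... FROM A JOIN B ON A.id=B.id "
    "WHERE A.pid={pid} AND B.pid={pid}"
)


def build_queries(num_partitions=10):
    """Return one query string per partition id, 1..num_partitions."""
    return [QUERY_TEMPLATE.format(pid=pid)
            for pid in range(1, num_partitions + 1)]


def run_query(query):
    """Run one statement through the Hive CLI and return its exit code."""
    return subprocess.run(["hive", "-e", query]).returncode


def run_all(num_partitions=10):
    """Submit all partition queries at once; return exit codes in pid order."""
    queries = build_queries(num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(run_query, queries))
```

Calling run_all() starts one Hive client per partition, so the cluster sees 10 independent sets of jobs competing for slots rather than one large job.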