Re: Performance difference between tuning reducer num and partition table
Dean Wampler 2013-06-29, 14:01
What happens if you don't set the number of reducers in the first run? How
many reducers are executed? If it's a much smaller number, the overhead of
the extra reducers could matter. Another clue is the size of the files the
first run produced, i.e., do you have 30 small files (much smaller than the
block size)?
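
For reference, here is a minimal sketch of how both checks could be run from the Hive CLI. The three property names are standard Hive settings; the warehouse path for table C is only an assumption (the default warehouse location) and will likely differ on your cluster.

```sql
-- Sketch only: run from the Hive CLI before the single-job version of the query.
-- -1 is the default and tells Hive to estimate the reducer count itself.
SET mapred.reduce.tasks=-1;

-- Hive's estimate is roughly (input bytes / bytes.per.reducer), capped by reducers.max.
-- Issuing SET with a property name and no value prints the current setting.
SET hive.exec.reducers.bytes.per.reducer;
SET hive.exec.reducers.max;

-- List the output files of the first run and their sizes.
-- ASSUMPTION: table C lives under the default warehouse directory; adjust the path.
dfs -du /user/hive/warehouse/c;
```

If the job log then reports far fewer than 30 reducers, or the listing shows 30 files that are each well under the block size, that would point to the manual reducer setting as the source of the overhead.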
On Sat, Jun 29, 2013 at 12:27 AM, Felix.徐 <[EMAIL PROTECTED]> wrote:
> Hi Stephen,
> My query is actually more complex; Hive generates 2 MapReduce jobs for it.
> In the first solution, they run 17 mappers / 30 reducers and 10 mappers /
> 30 reducers (the reducer number is set manually).
> In the second solution, they run 6 mappers / 1 reducer and 4 mappers / 1
> reducer for each partition.
> I do not know whether the two could achieve the same performance if the
> reducer number were set properly.
> 2013/6/29 Stephen Sprague <[EMAIL PROTECTED]>
>> Great question. Your parallelization seems to trump Hadoop's. I guess
>> I'd ask what is the _total_ number of mappers and reducers that run on
>> your cluster for these two scenarios? I'd be curious if there are the
>> On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 <[EMAIL PROTECTED]> wrote:
>>> Hi all,
>>> Here is the scenario: suppose I have 2 tables, A and B, and I would like to
>>> perform a simple join on them.
>>> We can do it like this:
>>> INSERT OVERWRITE TABLE C
>>> SELECT .... FROM A JOIN B on A.id=B.id
>>> Since tables A and B hold lots of data, another approach to speed up
>>> this query is the following.
>>> Say I partition tables A and B into 10 partitions each, and write
>>> the query like this:
>>> INSERT OVERWRITE TABLE C PARTITION(pid=1)
>>> SELECT .... FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1
>>> Then I run this query 10 times concurrently (pid ranges from 1 to 10).
>>> My question: in my observation of some more complex
>>> queries, the second solution is about 15% faster than the first.
>>> Is that simply because the reducer number is not set optimally?
>>> If resources are not a limit and it is possible to set the proper
>>> reducer number in the first solution, can the two achieve the same performance?
>>> Is there any other factor that can cause a performance difference between
>>> them (non-partitioned vs. partitioned + concurrent) besides the job parameter
Dean Wampler, Ph.D.