Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> Performance difference between tuning reducer num and partition table


+
Felix.徐 2013-06-28, 15:40
+
Stephen Sprague 2013-06-28, 16:41
+
Felix.徐 2013-06-29, 05:27
+
Dean Wampler 2013-06-29, 14:01
Copy link to this message
-
Re: Performance difference between tuning reducer num and partition table
Hi Dean,

Thanks for your reply. If I don't set the number of reducers in the 1st run
, the number of reducers will be much smaller and the performance will be
worse. The total output file size is about 200MB, I see that many reduce
output files are empty, only 10 of them have data.

Another question is that , is there any documentation about the job
specific parameters of MapReduce and Hive?
2013/6/29 Dean Wampler <[EMAIL PROTECTED]>

> What happens if you don't set the number of reducers in the 1st run? How
> many reducers are executed. If it's a much smaller number, the extra
> overhead could matter. Another clue is the size of the files the first run
> produced, i.e., do you have 30 small (much less than a block size) files?
>
> On Sat, Jun 29, 2013 at 12:27 AM, Felix.徐 <[EMAIL PROTECTED]> wrote:
>
>> Hi Stephen,
>>
>> My query is actually more complex , hive will generate 2 mapreduces,
>> in the first solution , it runs 17 mappers / 30 reducers and 10 mappers /
>> 30 reducers (reducer num is set manually)
>> in the second solution , it runs 6 mappers / 1 reducer and 4 mappers / 1
>> reducers for each partition
>>
>> I do not know whether they could achieve the same performance if the
>> reducers num is set properly.
>>
>>
>> 2013/6/29 Stephen Sprague <[EMAIL PROTECTED]>
>>
>>> great question.  your parallelization seems to trump hadoop's.    I
>>> guess i'd ask what are the _total_ number of Mappers and Reducers that run
>>> on your cluster for these two scenarios?   I'd be curious if there are the
>>> same.
>>>
>>>
>>>
>>>
>>> On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Here is the scenario, suppose I have 2 tables A and B, I would like to
>>>> perform a simple join on them,
>>>>
>>>> We can do it like this:
>>>>
>>>> INSERT OVERWRITE TABLE C
>>>> SELECT .... FROM A JOIN B on A.id=B.id
>>>>
>>>> In order to speed up this query since table A and B have lots of data,
>>>> another approach is :
>>>>
>>>> Say I partition table A and B into 10 partitions respectively, and
>>>> write the query like this
>>>>
>>>> INSERT OVERWRITE TABLE C PARTITION(pid=1)
>>>> SELECT .... FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1
>>>>
>>>> then I run this query 10 times concurrently (pid ranges from 1 to 10)
>>>>
>>>> And my question is that , in my observation of some more complex
>>>> queries, the second solution is about 15% faster than the first solution,
>>>> is it simply because the setting of reducer num is not optimal?
>>>> If the resource is not a limit and it is possible to set the proper
>>>> reducer nums in the first solution , can they achieve the same performance?
>>>> Is there any other fact that can cause performance difference between
>>>> them(non-partition VS partition+concurrent) besides the job parameter
>>>> issues?
>>>>
>>>> Thanks!
>>>>
>>>
>>>
>>
>
>
> --
> Dean Wampler, Ph.D.
> @deanwampler
> http://polyglotprogramming.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB