Hive, mail # user - Performance difference between tuning reducer num and partition table


Felix.徐 2013-06-28, 15:40
Re: Performance difference between tuning reducer num and partition table
Stephen Sprague 2013-06-28, 16:41
Great question. Your parallelization seems to trump Hadoop's. I guess
I'd ask: what is the _total_ number of mappers and reducers that run on
your cluster in each of these two scenarios? I'd be curious whether they
are the same.
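
For comparing the two scenarios, here is a minimal sketch of the session
settings that influence how many reducers Hive launches for the single-query
version. The numeric values are arbitrary assumptions for illustration, not
recommendations, and the SELECT list is left elided as in the original query.

```sql
-- Option 1: let Hive estimate the reducer count from the input size.
-- Lowering bytes-per-reducer yields more reducers; the max caps the estimate.
-- (Both values below are assumed, illustrative numbers.)
SET hive.exec.reducers.bytes.per.reducer=256000000;
SET hive.exec.reducers.max=200;

-- Option 2: force a constant reducer count instead of the estimate.
-- (100 is an assumed value; uncomment to use it.)
-- SET mapred.reduce.tasks=100;

INSERT OVERWRITE TABLE C
SELECT .... FROM A JOIN B ON A.id = B.id;
```

The mapper and reducer counts that Hive reports for each launched job can then
be summed across the 10 concurrent partition queries and compared with the
counts for this single query.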
On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> Here is the scenario: suppose I have 2 tables, A and B, and I would like to
> perform a simple join on them.
>
> We can do it like this:
>
> INSERT OVERWRITE TABLE C
> SELECT .... FROM A JOIN B on A.id=B.id
>
> Since tables A and B have lots of data, another approach to speed up this
> query is:
>
> Say I partition tables A and B into 10 partitions each, and write
> the query like this:
>
> INSERT OVERWRITE TABLE C PARTITION(pid=1)
> SELECT .... FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1
>
> Then I run this query 10 times concurrently (pid ranges from 1 to 10).
>
> My question: in my observation of some more complex queries,
> the second solution is about 15% faster than the first.
> Is that simply because the reducer count in the first solution is not optimal?
> If resources are not a limit and it is possible to set a proper
> reducer count in the first solution, can the two achieve the same performance?
> Is there any other factor that can cause a performance difference between
> them (non-partitioned vs. partitioned + concurrent) besides job parameter
> settings?
>
> Thanks!
>
Felix.徐 2013-06-29, 05:27
Dean Wampler 2013-06-29, 14:01
Felix.徐 2013-07-01, 02:12