Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - Performance difference between tuning reducer num and partition table


Copy link to this message
-
Performance difference between tuning reducer num and partition table
Felix.徐 2013-06-28, 15:40
Hi all,

Here is the scenario, suppose I have 2 tables A and B, I would like to
perform a simple join on them,

We can do it like this:

INSERT OVERWRITE TABLE C
SELECT .... FROM A JOIN B on A.id=B.id

In order to speed up this query since table A and B have lots of data,
another approach is :

Say I partition table A and B into 10 partitions respectively, and write
the query like this

INSERT OVERWRITE TABLE C PARTITION(pid=1)
SELECT .... FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1

then I run this query 10 times concurrently (pid ranges from 1 to 10)

And my question is that , in my observation of some more complex queries,
the second solution is about 15% faster than the first solution,
is it simply because the setting of reducer num is not optimal?
If the resource is not a limit and it is possible to set the proper reducer
nums in the first solution , can they achieve the same performance? Is
there any other fact that can cause performance difference between
them(non-partition VS partition+concurrent) besides the job parameter
issues?

Thanks!