Hive >> mail # user >> Want query to use more reducers


Want query to use more reducers
I have a query that doesn't use reducers as efficiently as I would hope.  If I run it on a large table, it uses many reducers, even saturating the cluster, as I desire.  However, on smaller tables it uses as few as a single reducer.  While I understand the logic in this (not using multiple reducers until the data size is larger), it is nevertheless inefficient to run a query for thirty minutes, leaving the rest of the cluster idle, when the query could distribute the work evenly and finish in a fraction of the time.  The query is shown below (abstracted to its basic form).  As you can see, it is a little atypical: it is a nested query, which implies two map-reduce jobs, and it uses a script for the reducer stage, which is the stage I am trying to speed up.  I thought the "distribute by" clause would spread the work across the reducers more evenly, but as I said, that is not the behavior I am seeing.

Any ideas how I could improve this situation?

Thanks.

CREATE TABLE output_table ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' AS
SELECT * FROM (
    FROM (
        SELECT * FROM input_table
        DISTRIBUTE BY input_column_1
        SORT BY input_column_1 ASC, input_column_2 ASC, input_column_etc ASC
    ) q
    SELECT TRANSFORM(*)
    USING 'python my_reducer_script.py' AS (
        output_column_1,
        output_column_2,
        output_column_etc
    )
) s
ORDER BY output_column_1;
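
For reference, the size-based behavior described above (many reducers on a large table, one on a small table) can be modeled roughly as follows. This is only a sketch of the general idea, not Hive's actual code: the function names are hypothetical, and the default values are assumptions standing in for whatever hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max are set to on a given cluster.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Rough model of a size-based reducer estimate.

    Assumption: the count is ceil(input size / bytes per reducer),
    clamped to the range [1, max_reducers]. The defaults are
    placeholders, not values read from any real cluster.
    """
    if input_bytes < 0 or bytes_per_reducer <= 0 or max_reducers < 1:
        raise ValueError("sizes must be non-negative and limits positive")
    wanted = -(-input_bytes // bytes_per_reducer)  # integer ceiling division
    return max(1, min(max_reducers, wanted))


def bytes_per_reducer_for(input_bytes, target_reducers):
    """Bytes-per-reducer value that would yield roughly target_reducers."""
    if target_reducers < 1:
        raise ValueError("target_reducers must be at least 1")
    return max(1, -(-input_bytes // target_reducers))


if __name__ == "__main__":
    small_table = 200_000_000  # hypothetical 200 MB input
    print(estimate_reducers(small_table))
    print(bytes_per_reducer_for(small_table, 20))
```

Under this model a small input always falls below the per-reducer threshold, so the estimate stays at one reducer regardless of how expensive the script is per row. Lowering the bytes-per-reducer setting, or fixing the count outright with mapred.reduce.tasks, would be the knobs to experiment with.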

________________________________________________________________________________
Keith Wiley     [EMAIL PROTECTED]     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________