Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Want query to use more reducers


Copy link to this message
-
Want query to use more reducers
I have a query that doesn't use reducers as efficiently as I would hope.  If I run it on a large table, it uses more reducers, even saturating the cluster, as I desire.  However, on smaller tables it uses as low as a single reducer.  While I understand there is a logic in this (not using multiple reducers until the data size is larger), it is nevertheless inefficient to run a query for thirty minutes leaving the entire cluster vacant when the query could distribute the work evenly and wrap things up in a fraction of the time.  The query is shown below (abstracted to its basic form).  As you can see, it is a little atypical: it is a nested query which obviously implies two map-reduce jobs and it uses a script for the reducer stage that I am trying to speed up.  I thought the "distribute by" clause should make it use the reducers more evenly, but as I said, that is not the behavior I am seeing.

Any ideas how I could improve this situation?

Thanks.

CREATE TABLE output_table ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' as
SELECT * FROM (
FROM (
SELECT * FROM input_table
DISTRIBUTE BY input_column_1 SORT BY input_column_1 ASC, input_column_2 ASC, input_column_etc ASC) q
SELECT TRANSFORM(*)
USING 'python my_reducer_script.py' AS(
output_column_1,
output_column_2,
output_column_etc,
)
) s
ORDER BY output_column_1;

________________________________________________________________________________
Keith Wiley     [EMAIL PROTECTED]     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________