I have a query that doesn't use reducers as efficiently as I would hope. When I run it on a large table, it uses many reducers, even saturating the cluster, which is what I want. On smaller tables, however, it uses as few as a single reducer. I understand the logic behind this (not using multiple reducers until the data size is larger), but it is nevertheless inefficient to run a query for thirty minutes while the rest of the cluster sits idle, when the work could be distributed evenly and finished in a fraction of the time.

The query is shown below (abstracted to its basic form). As you can see, it is a little atypical: it is a nested query, which obviously implies two map-reduce jobs, and it uses a script for the reducer stage, which is the stage I am trying to speed up. I thought the "distribute by" clause would make it use the reducers more evenly, but as I said, that is not the behavior I am seeing.
Any ideas how I could improve this situation?
CREATE TABLE output_table ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' as
SELECT * FROM (
SELECT * FROM input_table
DISTRIBUTE BY input_column_1 SORT BY input_column_1 ASC, input_column_2 ASC, input_column_etc ASC) q
USING 'python my_reducer_script.py' AS(
ORDER BY output_column_1;
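For context, here is my understanding of why small tables end up with so few reducers, written as a rough Python sketch rather than anything taken from the Hive source. As far as I can tell, when the reducer count is not set explicitly, Hive estimates it from the job's input size using hive.exec.reducers.bytes.per.reducer and caps it at hive.exec.reducers.max. The function name estimate_reducers is hypothetical, and the default values below (1 GB per reducer, 999 max) are my assumption for the release I am running; they differ between Hive versions.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Rough sketch of a size-based reducer estimate.

    bytes_per_reducer stands in for hive.exec.reducers.bytes.per.reducer and
    max_reducers for hive.exec.reducers.max; both defaults are assumptions.
    """
    if bytes_per_reducer <= 0 or max_reducers <= 0:
        raise ValueError("bytes_per_reducer and max_reducers must be positive")
    # Ceiling division: one reducer per bytes_per_reducer of input.
    estimate = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    # Never fewer than one reducer, never more than the configured cap.
    return max(1, min(max_reducers, estimate))


if __name__ == "__main__":
    # A small table lands on a single reducer no matter how slow the script is.
    print(estimate_reducers(200_000_000))
    # A large table spreads across many reducers.
    print(estimate_reducers(50_000_000_000))
    # Shrinking the per-reducer threshold raises the count for the small table.
    print(estimate_reducers(200_000_000, bytes_per_reducer=10_000_000))
```

If this picture is right, the estimate depends only on data size and knows nothing about how expensive my reducer script is per row. That would suggest either lowering hive.exec.reducers.bytes.per.reducer or setting mapred.reduce.tasks explicitly before the query, but I have not confirmed that either setting is respected for the scripted reduce stage.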
Keith Wiley [EMAIL PROTECTED] keithwiley.com music.keithwiley.com
"Luminous beings are we, not this crude matter."
Sean Busbey 2013-09-30, 19:28
Keith Wiley 2013-09-30, 20:40