Hive, mail # user - Want query to use more reducers


Re: Want query to use more reducers
Keith Wiley 2013-09-30, 20:40
Thanks.  mapred.reduce.tasks and hive.exec.reducers.max seem to have fixed the problem.  It is now saturating the cluster and running the query super fast.  Excellent!
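
For the archives, the fix was just setting those two properties before running the query, along these lines (the values below are only placeholders; size them to your cluster's reduce capacity):

  -- example values only; tune to the number of reduce slots on your cluster
  set hive.exec.reducers.max=32;
  set mapred.reduce.tasks=32;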

On Sep 30, 2013, at 12:28, Sean Busbey wrote:

> Hey Keith,
>
> It sounds like you should tweak the settings for how Hive handles query execution[1]:
>
> 1) Tune the guessed number of reducers based on input size
>
> = hive.exec.reducers.bytes.per.reducer
>
> Defaults to 1G. Based on your description, it sounds like this is probably still at default.
>
> In this case, you should also set a max # of reducers based on your cluster size.
>
> = hive.exec.reducers.max
>
> I usually set this to the # of reduce slots, if there's a decent chance I'll get to saturate the cluster. If not, don't worry about it.
>
> 2) Hard code a number of reducers
>
> = mapred.reduce.tasks
>
> Setting this will cause Hive to always use that number. It defaults to -1, which tells Hive to use the heuristic about input size to guess.
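>
> Roughly, that guess works out to something like this (just a sketch of the estimate, not Hive's exact code):
>
>   -- with mapred.reduce.tasks = -1:
>   --   num_reducers = min(hive.exec.reducers.max,
>   --                      ceil(total_input_bytes / hive.exec.reducers.bytes.per.reducer))
>   -- e.g. ~500MB of input against the 1G default comes out to a single reducer,
>   -- which is consistent with what you're seeing on small tables.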
>
> In either of the above cases, you should look at the options to merge small files (search for "merge"  in the configuration property list) to avoid getting lots of little outputs.
>
> HTH
>
> [1]: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryExecution
>
> -Sean
>
> On Mon, Sep 30, 2013 at 11:31 AM, Keith Wiley <[EMAIL PROTECTED]> wrote:
> I have a query that doesn't use reducers as efficiently as I would hope.  If I run it on a large table, it uses more reducers, even saturating the cluster, as I desire.  However, on smaller tables it uses as few as a single reducer.  While I understand there is logic in this (not using multiple reducers until the data size is larger), it is nevertheless inefficient to run a query for thirty minutes, leaving the rest of the cluster idle, when the work could be distributed evenly and wrapped up in a fraction of the time.  The query is shown below (abstracted to its basic form).  As you can see, it is a little atypical: it is a nested query, which implies two map-reduce jobs, and it uses a script for the reducer stage, which is the part I am trying to speed up.  I thought the "distribute by" clause would make it spread work across the reducers more evenly, but as I said, that is not the behavior I am seeing.
>
> Any ideas how I could improve this situation?
>
> Thanks.
>
> CREATE TABLE output_table ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' as
> SELECT * FROM (
>         FROM (
>                 SELECT * FROM input_table
>                 DISTRIBUTE BY input_column_1 SORT BY input_column_1 ASC, input_column_2 ASC, input_column_etc ASC) q
>         SELECT TRANSFORM(*)
>         USING 'python my_reducer_script.py' AS(
>         output_column_1,
>         output_column_2,
>         output_column_etc
>         )
> ) s
> ORDER BY output_column_1;
>
> ________________________________________________________________________________
> Keith Wiley     [EMAIL PROTECTED]     keithwiley.com    music.keithwiley.com
>
> "Luminous beings are we, not this crude matter."
>                                            --  Yoda
> ________________________________________________________________________________
>
>
>
>
> --
> Sean
________________________________________________________________________________
Keith Wiley     [EMAIL PROTECTED]     keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
                                           --  Galileo Galilei
________________________________________________________________________________