Re: Obvious and not so obvious query optimizations in Hive
If you are optimizing for latency (running time) as opposed to throughput,
it's best to have a single "wave" of reducers. So if your cluster is set up
with a limit of, say, 2 reducers per node, then using 2*N reduce tasks (where
N is the number of nodes) would work best for large queries. You have to
specify that in your script using
SET mapred.reduce.tasks = ...;
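
As a rough sketch of the single-wave setting, assume a cluster of 10 worker
nodes with 2 reduce slots each, so 20 reduce tasks fit in one wave. The node
count, the table page_views and the column user_id are made up for
illustration; substitute your own values.

```sql
-- Assumption: 10 worker nodes x 2 reduce slots per node = 20 reduce slots,
-- so 20 reduce tasks can all run at once in a single wave.
SET mapred.reduce.tasks=20;

-- Hypothetical table and column, for illustration only.
SELECT user_id, COUNT(*) AS views
FROM page_views
GROUP BY user_id;
```

With more reduce tasks than slots, the extra tasks wait for a second wave,
which is what hurts latency.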

GROUP BY doesn't limit the number of reducers, but ORDER BY does use a single
reducer, so it is slow. I never use ORDER BY though (Unix's sort is
probably faster). For analytics queries I need DISTRIBUTE BY / SORT BY (with
UDFs), which can use multiple reducers.
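
A minimal sketch of the DISTRIBUTE BY / SORT BY pattern, again with a
hypothetical table page_views and hypothetical columns. Unlike ORDER BY, this
sorts within each reducer only, so the output is not globally ordered.

```sql
-- Hypothetical table and columns, for illustration only.
-- DISTRIBUTE BY sends all rows with the same user_id to the same reducer.
-- SORT BY orders the rows within each reducer (no total order across reducers).
SELECT user_id, event_time, url
FROM page_views
DISTRIBUTE BY user_id
SORT BY user_id, event_time;
```

Each reducer then sees one user's rows together and in time order, which is
usually what a per-key analytics step needs.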

Hope this helps.

On Wed, Jun 27, 2012 at 8:47 AM, <[EMAIL PROTECTED]> wrote:

> 5. How does the number of reducers get set for a Hive query (the way
> group by and order by set the number of reducers to 1)? If I am not
> changing it explicitly, does it pick it up from the underlying Hadoop cluster?
> I am trying to understand the bottleneck between query and cluster size.