Hive >> mail # user >> Obvious and not so obvious query optimizations in Hive


Re: Obvious and not so obvious query optimizations in Hive
If you are optimizing for latency (running time) as opposed to throughput,
it's best to have a single "wave" of reducers, so that every reduce task
starts at once and none waits for a free slot. So if your cluster is set up
with a limit of, say, 2 reducers per node, then on N nodes using 2*N reduce
tasks would work best (for large queries). You have to specify that in your
script using
SET mapred.reduce.tasks = ...;
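
To make the arithmetic concrete, here is a small Python sketch (not part of the original message). It assumes N is the number of worker nodes and that every node has the same fixed number of reduce slots. The function names and the 10-node cluster are made up for illustration.

```python
import math

def single_wave_reducers(nodes, reducers_per_node=2):
    """Number of reduce tasks that fills every reduce slot exactly once."""
    return nodes * reducers_per_node

def reduce_waves(reduce_tasks, nodes, reducers_per_node=2):
    """How many rounds ("waves") the cluster needs to run all reduce tasks."""
    slots = nodes * reducers_per_node
    return math.ceil(reduce_tasks / slots)

# Hypothetical 10-node cluster with 2 reduce slots per node.
tasks = single_wave_reducers(10)
print(f"SET mapred.reduce.tasks = {tasks};")
print(reduce_waves(tasks, 10))      # all tasks fit in the available slots
print(reduce_waves(tasks + 1, 10))  # one extra task needs another wave
```

Under these assumptions, going even one task over the slot count means the last task waits for a slot to free up, which is what hurts running time.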

GROUP BY doesn't limit the number of reducers, but ORDER BY does use a single
reducer, so that's slow. I never use ORDER BY though (Unix's sort is
probably faster). For analytics queries I need DISTRIBUTE BY / SORT BY (with
UDFs), which can use multiple reducers.
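
As a rough illustration of why DISTRIBUTE BY / SORT BY can spread across reducers, here is a Python sketch (not part of the original message) that mimics the behavior: rows are partitioned by a key, and each partition is sorted on its own. The function name, the sample rows, and the use of Python's built-in hash are assumptions for illustration; Hive uses its own partitioning function.

```python
from collections import defaultdict

def distribute_sort_by(rows, distribute_key, sort_key, num_reducers):
    """Mimic DISTRIBUTE BY + SORT BY: partition rows by key, sort each partition."""
    partitions = defaultdict(list)
    for row in rows:
        # Rows with the same distribute key always land on the same reducer.
        partitions[hash(distribute_key(row)) % num_reducers].append(row)
    # Each reducer sorts only its own rows; there is no global order.
    return {r: sorted(part, key=sort_key) for r, part in partitions.items()}

# Hypothetical (user_id, event) rows.
rows = [(3, "c"), (1, "b"), (2, "a"), (1, "a"), (3, "a")]
out = distribute_sort_by(rows, lambda r: r[0], lambda r: r, num_reducers=2)
for reducer, part in sorted(out.items()):
    print(reducer, part)
```

Each reducer's output is sorted, but concatenating the outputs does not give a total order. ORDER BY does guarantee a total order, which is why it falls back to a single reducer.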

Hope this helps.
igor
decide.com

On Wed, Jun 27, 2012 at 8:47 AM, <[EMAIL PROTECTED]> wrote:

> 5. How does the number of reducers get set for a Hive query (the way
> group by and order by set the number of reducers to 1)? If I am not
> changing it explicitly, does it pick it up from the underlying Hadoop cluster?
> I am trying to understand the bottleneck between query and cluster size.