Hive >> mail # user >> Obvious and not so obvious query optimizations in Hive


Re: Obvious and not so obvious query optimizations in Hive
Richin,

Even if you set the number of reducers to be launched, that does not
guarantee that the query will generate that many files.

Depending on your query and data, only the reducers that receive keys to
process will generate files. So when you have a Hive query with a large
number of keys but a small split size, it will need a large number of maps,
but the reducers will always depend on the keys emitted by the mappers, and
all the extra reducers will just be a burden on the system.
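
To make that concrete, here is a minimal sketch. The table page_views, its country column, and the output path are hypothetical names used only for illustration, and the count of three distinct keys is an assumption.

```sql
-- Hypothetical table and output path, for illustration only.
-- Ask for 8 reduce tasks explicitly.
SET mapred.reduce.tasks = 8;

INSERT OVERWRITE DIRECTORY '/tmp/country_counts'
SELECT country, COUNT(*)
FROM page_views
GROUP BY country;

-- Assumption: page_views holds only three distinct country values.
-- Rows are routed to reducers by key, so at most three of the eight
-- reducers receive any rows; the other five have nothing to process.
```

In a case like this, asking for more reducers than there are keys adds scheduling overhead without adding parallelism.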

On Fri, Jun 29, 2012 at 12:40 AM, <[EMAIL PROTECTED]> wrote:

> Igor,Bejoy - thanks a lot, that helps.
>
> He, I am running the query on an Amazon EMR cluster, and the default number
> of mappers and reducers is set based on the type of instances I pick. Now I
> would expect Hive to generate as many output files as there are
> reducers (since I am not using an ORDER BY clause or setting it
> explicitly). If Hive sets a lower number of reducers for itself, then
> there is no point in using a high-end EMR cluster and paying for it.
> Also, I can only set the number of reduce tasks explicitly through SET
> mapred.reduce.tasks = ... ; how do I set the number of reducers itself? I am
> confused between the number of reduce tasks and reducers. Can you please
> explain?
>
> Thanks,
> Richin
>
> ===========
> If you are optimizing for latency (running time) as opposed to throughput,
> it's best to have a single "wave" of reducers. So if your cluster is set up
> with a limit of, say, 2 reducers per node, using 2*N reduce tasks (where N
> is the number of nodes) would work best (for large queries). You have to
> specify that in your script using
> SET mapred.reduce.tasks = ...;
>
> GroupBy doesn't limit the number of reducers but OrderBy does use a single
> reducer - so that's slow. I never use OrderBy though (Unix's sort is
> probably faster). For analytics queries I need Distribute/Sort By (with
> UDFs), which can use multiple reducers.
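
A rough sketch of the two points above, a single wave of reducers plus Distribute/Sort By. The cluster size (10 nodes with 2 reduce slots each) is an assumption, and the table page_views and its columns are hypothetical.

```sql
-- Assumption: 10 worker nodes with 2 reduce slots each, so 2 * 10 = 20
-- reduce tasks fit in a single wave.
SET mapred.reduce.tasks = 20;

-- Hypothetical table. DISTRIBUTE BY sends all rows with the same user_id
-- to the same reducer; SORT BY orders rows within each reducer only,
-- so the work is spread over all 20 reducers.
SELECT user_id, event_time, url
FROM page_views
DISTRIBUTE BY user_id
SORT BY user_id, event_time;
```

Unlike ORDER BY, this does not produce one globally sorted result; each reducer's output is sorted on its own.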
>
>
> Hope this helps.
> igor
> decide.com
>
> On Wed, Jun 27, 2012 at 8:47 AM, <[EMAIL PROTECTED]> wrote:
> 5.       How does the number of reducers get set for a Hive query (the way
> GROUP BY and ORDER BY set the number of reducers to 1)? If I am not
> changing it explicitly does it pick it from the underlying Hadoop cluster?
> I am trying to understand the bottleneck between query and cluster size.
>
>
> -----Original Message-----
> From: ext yongqiang he [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, June 27, 2012 6:32 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Obvious and not so obvious query optimizations in Hive
>
> 1.       Having my external table data gzipped and reading it in the
> table v/s no compression at all.
>
> You may want to GZip your data since it is offline. But if space is not a
> concern and you want to optimize CPU, use Snappy.
>
> With snappy, there is no reason to go with no compression.
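
For reference, a sketch of how Snappy might be switched on for data that Hive itself writes. It assumes the Snappy codec is installed on the cluster, and it covers query output and intermediate data only, not files that are already sitting in an external table.

```sql
-- Assumption: the Snappy codec is available on every node.
-- Compress data passed between the stages of a multi-stage query.
SET hive.exec.compress.intermediate = true;

-- Compress the final output of queries and name the codec to use.
SET hive.exec.compress.output = true;
SET mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
```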
>
>
> 3.       Creating intermediate external tables v/s non external tables
> v/s creating views?
>
> First go with normal tables. External tables are hard to manage.
> Views are there for complex things which are hard to do with 'managed
> table'.
>
> 4.       Storing the external table as Textfile v/s Sequence file. I
> know sequence file compresses the data, but in what format? I read about
> RC files and how efficient they are, how to use them?
>
> Use RCFile if you query your data with Hive: 'create table xxx(xxx) stored
> as rcfile'
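
Spelled out a little more, this could look like the sketch below. The table names page_views_text and page_views_rc and their columns are hypothetical; the idea is to create an RCFile table and fill it from an existing text table.

```sql
-- Hypothetical target table stored in RCFile format.
CREATE TABLE page_views_rc (
  user_id    STRING,
  url        STRING,
  event_time STRING
)
STORED AS RCFILE;

-- Hive rewrites the rows into RCFile as part of the insert;
-- page_views_text is assumed to be an existing text-format table.
INSERT OVERWRITE TABLE page_views_rc
SELECT user_id, url, event_time
FROM page_views_text;
```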
>
> 5.       How are number of reducers get set for a Hive query (The way
> group by and order by sets the number of reducers to 1) ? If I am not
> changing it explicitly does it pick it from the underlying Hadoop cluster?
> I am trying to understand the bottleneck between query and cluster size.
>
> Can you say more about your concern about "query and cluster size"?
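
On the question of how the number gets chosen when it is not set explicitly: in MapReduce a "reducer" and a "reduce task" are the same thing, and when mapred.reduce.tasks is left at -1, Hive estimates the count from the input size. A sketch of the relevant settings follows; the values are examples, not recommendations.

```sql
-- Let Hive estimate the number of reducers instead of fixing it.
SET mapred.reduce.tasks = -1;

-- Hive aims for roughly one reducer per this many bytes of input...
SET hive.exec.reducers.bytes.per.reducer = 1000000000;

-- ...and never goes above this cap.
SET hive.exec.reducers.max = 999;

-- With these example values, a query over about 50 GB of input is
-- estimated at around 50 reducers, however large the cluster is.
```

So a bigger cluster does not raise the reducer count by itself. Either lower the bytes-per-reducer value or set mapred.reduce.tasks directly.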
>
> On Wed, Jun 27, 2012 at 11:46 AM, Bejoy KS <[EMAIL PROTECTED]> wrote:
> > Hi Richin
> >
> > I'm not an AWS guy but still lemme try answering a few questions in
> > general (not wrt AWS EMR)
> >
> >
> > 1.       Having my external table data gzipped and reading it in the

Nitin Pawar