Hive >> mail # user >> Obvious and not so obvious query optimizations in Hive


richin.jain@... 2012-06-27, 15:47
Bejoy KS 2012-06-27, 18:46
yongqiang he 2012-06-27, 22:31
richin.jain@... 2012-06-28, 19:10
Bejoy KS 2012-06-28, 19:16
Nitin Pawar 2012-06-28, 19:15
richin.jain@... 2012-06-28, 19:23
Bejoy KS 2012-06-28, 19:47
richin.jain@... 2012-06-28, 20:08
Re: Obvious and not so obvious query optimizations in Hive
Hi Richin

The keys vary based on your queries on the same table. You can always see how your query will be parsed into map-reduce jobs using the EXPLAIN statement. It will show you which columns are chosen as keys and values in each MR job for a query.
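As an illustrative sketch, you simply prefix a query with EXPLAIN (here applied to a GROUP BY similar to the one later in this thread; table names as defined there):

```sql
-- Sketch: EXPLAIN prints the map-reduce plan instead of running the query.
-- In the output, look for the "Group By Operator" and its "keys:" entry
-- to see which columns Hive chose as the shuffle keys.
EXPLAIN
SELECT ts, count(DISTINCT id)
FROM timeline
GROUP BY ts;
```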

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: <[EMAIL PROTECTED]>
Date: Thu, 28 Jun 2012 20:08:14
To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Subject: RE: Obvious and not so obvious query optimizations in Hive
Bejoy, thanks again. This might be the silliest question, but what are the keys in a Hive query? Are they the fields we pick in the SELECT clause, or the ones we define with the GROUP BY clause?
Can you tell me what the keys will be for the reducers in my query below?

CREATE EXTERNAL TABLE extlog (
    id string,
    ts string,
    metric string,
    min double,
    max double,
    avg double,
    sum double,
    sample double,
    unit string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3n://xxx/xxx/';

CREATE TABLE range (
    ts string,
    id string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

CREATE EXTERNAL TABLE timeline (
    ts string,
    instancecount int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE
LOCATION 's3n://xxx/xx/xx/';

INSERT OVERWRITE TABLE range
SELECT from_unixtime(
        unix_timestamp(ts, "MM/dd/yyyy HH:mm")
        - unix_timestamp(ts, "MM/dd/yyyy HH:mm") % 600,
        "MM/dd/yyyy HH:mm"),
    id
FROM extlog;

INSERT OVERWRITE TABLE timeline
SELECT ts, count(DISTINCT id)
FROM timelinerange
GROUP BY ts;

Thanks,
Richin

From: ext Bejoy KS [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 28, 2012 3:47 PM
To: [EMAIL PROTECTED]
Subject: Re: Obvious and not so obvious query optimizations in Hive

Hi Richin

The keys are chosen by Hive based on the input, unless you specify clauses such as DISTRIBUTE BY in your query, which give the user the flexibility to decide the key columns by which the data is distributed across reducers.
You are right: in MR, if a reducer writes out no data, you should get an empty file. I'd need to check the source code to see whether Hive deletes those files.
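For instance, a sketch of DISTRIBUTE BY using the range table from this thread, where ts is explicitly chosen as the key by which rows are routed to reducers:

```sql
-- All rows with the same ts value go to the same reducer;
-- SORT BY then orders rows within each reducer (not globally).
SELECT ts, id
FROM range
DISTRIBUTE BY ts
SORT BY ts;
```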
Regards
Bejoy KS

Sent from handheld, please excuse typos.
________________________________
From: <[EMAIL PROTECTED]>
Date: Thu, 28 Jun 2012 19:23:46 +0000
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: RE: Obvious and not so obvious query optimizations in Hive

Thanks Nitin.
Depending on how I design my keys they might go to one or more reducers, but shouldn't I be seeing empty files for the reducers which did not get any data to reduce (because of the design of the keys)?
Or does Hive clean up all the empty files at the end of the query?

Richin

From: ext Nitin Pawar [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 28, 2012 3:15 PM
To: [EMAIL PROTECTED]
Subject: Re: Obvious and not so obvious query optimizations in Hive

Richin,

Even if you set the number of reducers to be launched, there is no guarantee that that many files will be generated.

Based on your query and data, only the reducers that receive keys to process will generate files.
So when you have a Hive query with a large number of keys but a small split size, it will need many map tasks; the reducers that do real work, however, always depend on the keys emitted by the mappers, and all the extra reducers are just a burden on the system.
On Fri, Jun 29, 2012 at 12:40 AM, <[EMAIL PROTECTED]> wrote:
Igor, Bejoy - thanks a lot, that helps.

He, I am running the query on an Amazon EMR cluster, and based on the type of instances I pick, a default number of mappers and reducers is set. Now I would expect Hive to generate as many output files as there are reducers (since I am not using an ORDER BY clause or setting the count explicitly). If Hive sets a lower number of reducers for itself, then there is no point using a high-end EMR cluster and paying for it.
Also, I can only set the number of reduce tasks explicitly through SET mapred.reduce.tasks = ... ; how do I set the number of reducers itself? I am confused about the difference between reduce tasks and reducers - can you please explain?

Thanks,
Richin

If you are optimizing for latency (running time) as opposed to throughput, it's best to have a single "wave" of reducers. So if your cluster is set up with a limit of, say, 2 reducers per node, using 2*N reduce tasks would work best (for large queries). You have to specify that in your script using
SET mapred.reduce.tasks = ...;

GROUP BY doesn't limit the number of reducers, but ORDER BY does use a single reducer - so that's slow. I never use ORDER BY though (Unix's sort is probably faster). For analytics queries I need DISTRIBUTE BY/SORT BY (with UDFs), which can use multiple reducers.
Hope this helps.
igor
decide.com
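Igor's single-wave rule of thumb could look like this in a script (the cluster numbers are hypothetical: 10 nodes with 2 reducer slots each):

```sql
-- One "wave": 10 nodes x 2 reducer slots per node = 20 reduce tasks,
-- so every reduce task runs in parallel in a single pass.
SET mapred.reduce.tasks = 20;

SELECT ts, count(DISTINCT id)
FROM range
GROUP BY ts;
```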

On Wed, Jun 27, 2012 at 8:47 AM, <[EMAIL PROTECTED]> wrote:
5.       How is the number of reducers set for a Hive query (the way GROUP BY and ORDER BY set the number of reducers to 1)? If I am not changing it explicitly, does Hive pick it up from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.
From: ext yongqiang he [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 27, 2012 6:32 PM
To: [EMAIL PROTECTED]
Subject: Re: Obvious and not so obvious query optimizations in Hive

1.       Having my external table data gzipped and reading it in the table v/s no compression at all.

You may want to gzip your data since it is offline. But if space is not a concern and you want to optimize CPU, use Snappy.
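As a sketch of the Snappy suggestion, the settings would look something like this (property names from the Hadoop/Hive versions of that era; verify against your version):

```sql
-- Compress intermediate map output and final job output with Snappy
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```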
richin.jain@... 2012-06-29, 19:35
Igor Tatarinov 2012-06-27, 22:44