Re: Want to improve the performance for execution of Hive Jobs.
Hi Bhavesh
     Properties that you set or override at the job level are not written to mapred-site.xml. Check the corresponding job.xml to see the values that were actually used. Also confirm from the task logs that there are no warnings about overriding those properties. If both checks pass, you can be confident that the properties you supplied are actually being used for the job.
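
The job.xml check described above could be scripted roughly as follows. This is only a sketch: it assumes job.xml uses the usual Hadoop configuration layout (property elements with name and value children), and the function names and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def read_job_conf(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text)
    conf = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            conf[name.strip()] = (prop.findtext("value") or "").strip()
    return conf


def find_mismatches(conf, expected):
    """Return {name: (expected, actual)} for properties that differ or are missing."""
    return {
        name: (want, conf.get(name))
        for name, want in expected.items()
        if conf.get(name) != want
    }


# Hypothetical excerpt standing in for a real job.xml.
SAMPLE_JOB_XML = """<?xml version="1.0"?>
<configuration>
  <property><name>mapred.compress.map.output</name><value>true</value></property>
  <property><name>io.sort.mb</name><value>100</value></property>
</configuration>"""

# The values you intended to set for the job (example values only).
expected = {"mapred.compress.map.output": "true", "io.sort.mb": "400"}
print(find_mismatches(read_job_conf(SAMPLE_JOB_XML), expected))
```

An empty result would mean every property you supplied shows up in job.xml with the value you asked for. Anything listed is either missing or still has a different value.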

Disclaimer: I'm not an AWS guy, so I can't comment on some of the specifics there. My responses relate to generic Hadoop behavior. :)
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Bhavesh Shah <[EMAIL PROTECTED]>
Date: Tue, 8 May 2012 17:15:44
Subject: Re: Want to improve the performance for execution of Hive Jobs.

Hello Bejoy KS,
I did it the same way, by executing "hive -f <filename>" on Amazon EMR.
But when I looked at mapred-site.xml, all the variables that I had set in
that file still showed their default values. I didn't see the values I set.

The performance is slow too.
I tried this on my local cluster by setting these values and I saw some
boost in performance.
On Tue, May 8, 2012 at 4:23 PM, Bejoy Ks <[EMAIL PROTECTED]> wrote:

> Hi Bhavesh
>       I'm not sure about AWS, but from a quick reading, cluster-wide settings
> like the HDFS block size can be set in hdfs-site.xml through bootstrap actions.
> Since you are changing the HDFS block size, set the min and max split size
> across the cluster using bootstrap actions as well. The rest of the properties
> can be set at the per-job level.
> Doesn't AWS provide an option to use "hive -f"? If so, just put all
> the properties required for tuning the query, followed by the queries (in
> order), in a file and execute it using "hive -f <file name>".
> Regards
> Bejoy KS
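
The "properties first, then queries" file suggested above could be put together with a small helper like the one below. This is a rough sketch, not anything Hive or EMR provides: render_hive_script and my_table are hypothetical names, and the property values are examples only. Note that dfs.block.size is given in bytes in Hadoop, so it is best handled as a cluster-wide setting rather than here.

```python
def render_hive_script(properties, queries):
    """Build the text of a file for `hive -f`: SET lines first, then the queries in order."""
    lines = [f"SET {name}={value};" for name, value in properties.items()]
    # Make sure each query ends with exactly one semicolon.
    lines += [query.strip().rstrip(";") + ";" for query in queries]
    return "\n".join(lines) + "\n"


# Example per-job tuning values only; tune them for your own cluster and data.
properties = {
    "mapred.compress.map.output": "true",
    "io.sort.mb": "400",
}
queries = ["SELECT COUNT(*) FROM my_table"]  # my_table is a hypothetical table

script = render_hive_script(properties, queries)
print(script)
```

The printed text can be saved to a file and run with "hive -f <file name>", so that the settings apply to the queries that follow them in the same session.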
>   ------------------------------
> *From:* Bhavesh Shah <[EMAIL PROTECTED]>
> *Sent:* Tuesday, May 8, 2012 3:33 PM
> *Subject:* Re: Want to improve the performance for execution of Hive Jobs.
> Thanks Bejoy KS for your reply.
> I want to ask one thing: if I want to set these parameters on Amazon
> Elastic MapReduce, how can I set variables like these?
> e.g. SET mapred.min.split.size=m;
>       SET mapred.max.split.size=m+n;
>       set dfs.block.size=128
>       set mapred.compress.map.output=true
>       set io.sort.mb=400  etc....
> For all of this, do I need to write a shell script that sets these variables
> using /home/hadoop/hive/bin/hive -e 'set .....',
> or should I pass all these steps as bootstrap actions?
> I found this link on passing bootstrap actions:
> http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/Bootstrap.html#BootstrapPredefined
> What should I do in this case?
> On Tue, May 8, 2012 at 2:55 PM, Bejoy Ks <[EMAIL PROTECTED]> wrote:
> Hi Bhavesh
>      In Sqoop you can optimize performance by using --direct mode for
> import and by increasing the number of mappers used for the import. When you
> increase the number of mappers, you need to ensure that the RDBMS connection
> pool can handle that number of connections gracefully. Also use an evenly
> distributed column as --split-by; that will ensure that all mappers are
> roughly equally loaded.
>    Min split size and max split size can be set at the job level. But there
> is a chance of a slight loss in data locality if you increase these values.
> By increasing these values you increase the data volume processed per
> mapper, and so use fewer mappers. You then need to see whether this
> gets you substantial performance gains. I haven't seen much gain
> when I tried this on some of my workflows in the past. A better
> approach would be to increase the HDFS block size itself if your
> cluster deals with relatively large files. If you change the HDFS block
> size, then change the min split and max split values accordingly.
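
To make the trade-off between split size and mapper count concrete, here is a back-of-the-envelope sketch. It assumes the split size is computed as max(min split, min(max split, block size)), which is how Hadoop's FileInputFormat in the newer mapreduce API does it as far as I know. Hive's own input formats may combine splits differently, so treat the numbers as estimates. The file size and function names are made up for illustration.

```python
import math

MB = 1024 * 1024


def split_size(block_size, min_split, max_split):
    """Assumed rule: max(min_split, min(max_split, block_size)), all in bytes."""
    return max(min_split, min(max_split, block_size))


def estimated_mappers(file_size, block_size, min_split, max_split):
    """Rough mapper count for one splittable file: one mapper per split."""
    return math.ceil(file_size / split_size(block_size, min_split, max_split))


file_size = 10 * 1024 * MB  # hypothetical 10 GB input file

for block in (64 * MB, 128 * MB):
    count = estimated_mappers(file_size, block, min_split=1, max_split=block)
    print(f"block size {block // MB} MB: about {count} mappers")
```

Doubling the block size (and the max split size with it) roughly halves the number of mappers for the same input. Whether that helps depends on how the per-task overhead compares with the actual work each mapper does.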
Bhavesh Shah