Re: Performance problems with Hive script
To compute moving averages, you should implement a custom reducer instead
of doing a big join. That will work *much* faster.
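
In Hive that means a streaming script plugged in through TRANSFORM. A rough
sketch (untested; moving_avg.py is a hypothetical script you would write
yourself):

ADD FILE moving_avg.py;

-- DISTRIBUTE BY sends all rows for one ticker to the same reducer, and
-- SORT BY delivers them in date order, so the script can keep a sliding
-- window per symbol and emit one average per row.
FROM (
  SELECT symbol, dt, close
  FROM raw_symbols
  DISTRIBUTE BY symbol
  SORT BY symbol, dt
) ordered
SELECT TRANSFORM (symbol, dt, close)
  USING 'python moving_avg.py'
  AS (symbol STRING, dt STRING, avg_close DOUBLE);

The script itself just reads tab-separated rows from stdin and writes
tab-separated rows to stdout.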

Also, Hive already has date_add(). Why did you have to implement your own?
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
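
For reference, date_add() takes a 'yyyy-MM-dd' string and a day count, and
datediff() goes the other way:

SELECT date_add('2012-01-23', 90);            -- '2012-04-22'
SELECT datediff('2012-04-22', '2012-01-23');  -- 90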

igor
decide.com
On Mon, Jan 23, 2012 at 6:48 PM, Benjamin Poserow <[EMAIL PROTECTED]> wrote:

> I wrote, separately, a Hadoop job to calculate running averages of about
> 2000 stock tickers over a 180 day period as well as a Hive script which
> performs equivalent functionality.  I have been using Amazon Elastic
> MapReduce as my platform for running these jobs.   I have been trying for a
> while to get my Hive script to perform well when spread over many nodes,
> but cannot seem to get the Hive script to perform nearly as well as the
> Hadoop job.  (The Hadoop job takes about an hour to run through all of my
> tickers, whereas the Hive job takes over an hour just to run about 1/8 of
> them and I cannot even seem to get it to finish when I run it for a larger
> number of tickers.)  I also have not seen large gains when running my Hive
> job on a larger number of hosts.  I've tried tinkering with settings,
> examining the query plans, and making many modifications to my queries,
> but have not seen great gains in performance.
>
> Here is my code.  Can you help me identify potential problem points and
> ways I can improve these queries, especially so that they distribute well
> when run on multiple hosts?  I tried to add comments where appropriate to make
> it clear what I was doing in each step.  Please note there are about 2000 *
> 180 = 360,000 rows in the raw symbol table.
>
> Please help, I am quite stuck on this!  Feel free to ask any questions for
> which you would like clarification.
>
> Here is my script:
>
> ADD JAR ${INPUT}/market-data.jar;
> ADD JAR ${INPUT}/HiveJars/derby.jar;
>
> set hive.stats.autogather=false;
>
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.dynamic.partition=true;
> set hive.exec.reducers.bytes.per.reducer=1000000000;
> set hive.exec.max.dynamic.partitions.pernode=200000;
> set hive.exec.max.dynamic.partitions=200000;
> set hive.exec.max.created.files=1000000;
>
> -- Note: ${INPUT} is the S3 URL where my scripts and input files are
> -- stored.  ${INPUT}/hiveinput/output contains separate folders labeled
> -- symbol=[ticker symbol] so that they can be imported into a partitioned
> -- table.  The files in these folders contain the ticker prices of each
> -- stock over a 180-day period, obtained from Yahoo Finance.
> CREATE EXTERNAL TABLE raw_symbols
> (dt STRING, open STRING, high STRING, low STRING, close STRING,
>  volume STRING, adj_close STRING)
> PARTITIONED BY (symbol STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> LOCATION '${INPUT}/hiveinput/output' ;
>
> -- Elastic MapReduce requires you to execute this command to create all
> -- of the dynamic partitions corresponding to the stock tickers.
> ALTER TABLE raw_symbols RECOVER PARTITIONS;
>
> -- This simply loads a table with the integer sequence 1 through 90.  I
> -- couldn't find anything in Hive to generate a simple integer sequence,
> -- so this table is loaded from a file (see the note after the script
> -- below).
> CREATE EXTERNAL TABLE day_seq_orig
> (day INT)
> ROW FORMAT DELIMITED LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> LOCATION '${SEQFILE}';
>
> -- A temporary table to hold the distinct list of dates for which we have
> -- stock prices; there should be 180 dates for the 180 days we are
> -- covering.
> CREATE TABLE distinct_dt
> (dt STRING)
> STORED AS SEQUENCEFILE;
>
> -- ${SAMPLE_SYMBOL} is just one of my symbols.  Since the same sequence
> -- of dates applies to all tickers, this gives me a quick and easy way to
> -- get the range of dates.
> INSERT OVERWRITE TABLE distinct_dt
> SELECT distinct dt
> FROM raw_symbols
> WHERE raw_symbols.symbol = '${SAMPLE_SYMBOL}';
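
A footnote on the day_seq_orig table above: later Hive releases (0.13+,
which added posexplode() and FROM-less SELECT) can generate the 1..90
sequence inline, with no external ${SEQFILE}. A sketch (untested):

-- split(space(89), ' ') yields an array of 90 empty strings;
-- posexplode() numbers them 0..89, so pos + 1 gives 1 through 90.
SELECT pos + 1 AS day
FROM (SELECT split(space(89), ' ') AS xs) t
LATERAL VIEW posexplode(xs) s AS pos, x;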