Hive, mail # user - Performance problems with Hive script


Re: Performance problems with Hive script
Igor Tatarinov 2012-01-24, 07:26
To compute moving averages, you should implement a custom reducer instead
of doing a big join. That will work *much* faster.
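If you go that route, a minimal sketch of such a reducer, written as a Hive TRANSFORM script (the three-column layout and the 90-day window are assumptions; adjust to your schema):

```python
import sys
from collections import deque

def moving_averages(rows, window=90):
    """Emit (symbol, dt, avg) with a trailing moving average of close.

    Assumes rows arrive sorted by symbol, then date -- which Hive provides
    when the script is fed via DISTRIBUTE BY symbol SORT BY symbol, dt.
    """
    out = []
    prices = deque()
    current = None
    for symbol, dt, close in rows:
        if symbol != current:          # new ticker: reset the window
            prices.clear()
            current = symbol
        prices.append(float(close))
        if len(prices) > window:       # keep only the trailing window
            prices.popleft()
        out.append((symbol, dt, sum(prices) / len(prices)))
    return out

if __name__ == "__main__":
    rows = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for symbol, dt, avg in moving_averages(rows):
        print("%s\t%s\t%.6f" % (symbol, dt, avg))
```

You would invoke it with something like SELECT TRANSFORM(symbol, dt, close) USING 'moving_avg.py' AS (symbol, dt, avg) over a subquery that does DISTRIBUTE BY symbol SORT BY symbol, dt, so each reducer sees one ticker's rows in date order: one pass over the 360,000 rows, no join.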

Also, Hive already has date_add(). Why did you have to implement your own?
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions

igor
decide.com
On Mon, Jan 23, 2012 at 6:48 PM, Benjamin Poserow <[EMAIL PROTECTED]> wrote:

> I wrote, separately, a Hadoop job to calculate running averages of about
> 2000 stock tickers over a 180 day period as well as a Hive script which
> performs equivalent functionality.  I have been using Amazon Elastic
> MapReduce as my platform for running these jobs.   I have been trying for a
> while to get my Hive script to perform well when spread over many nodes,
> but cannot seem to get the Hive script to perform nearly as well as the
> Hadoop job.  (The Hadoop job takes about an hour to run through all of my
> tickers, whereas the Hive job takes over an hour just to run about 1/8 of
> them and I cannot even seem to get it to finish when I run it for a larger
> number of tickers.)  I also have not seen large gains when running my Hive
> job using a larger number of hosts.   I've been trying to tinker with
> settings, examine the query plans of my queries, attempt many modifications
> of my queries, but have not seen great gains in performance.
>
> Here is my code.  Can you help me identify potential problem points and
> ways I can improve these queries, especially so that they distribute well
> when run on multiple hosts?  I tried to add comments where appropriate to
> make it clear what I was doing in each step.  Please note there are about
> 2000 * 180 = 360,000 rows in the raw symbol table.
>
> Please help, I am quite stuck on this!  Feel free to ask any questions for
> which you would like clarification.
>
> Here is my script:
>
> ADD JAR ${INPUT}/market-data.jar ;
> ADD JAR ${INPUT}/HiveJars/derby.jar;
>
> set hive.stats.autogather=false;
>
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.dynamic.partition=true;
> set hive.exec.reducers.bytes.per.reducer=1000000000;
> set hive.exec.max.dynamic.partitions.pernode=200000;
> set hive.exec.max.dynamic.partitions=200000;
> set hive.exec.max.created.files=1000000;
>
> -- Note ${INPUT} is the S3 URL where my scripts and input files are
> -- stored.  ${INPUT}/hiveinput/output contains separate folders labeled
> -- symbol=[ticker symbol] so that they can be imported into a partitioned
> -- table.  The files in these folders contain the ticker prices of each of
> -- the stocks over a 180 day period, obtained from Yahoo Finance.
> CREATE EXTERNAL TABLE raw_symbols
> (dt STRING, open STRING, high STRING, low STRING, close STRING,
>  volume STRING, adj_close STRING)
> PARTITIONED BY (symbol STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> LOCATION '${INPUT}/hiveinput/output' ;
>
> -- Elastic MapReduce requires you to execute this command to create all
> -- of the dynamic partitions corresponding to the stock tickers.
> ALTER TABLE raw_symbols RECOVER PARTITIONS;
>
> -- This table simply holds the integer sequence 1 through 90.  I couldn't
> -- find anything in Hive to generate a simple integer sequence, so the
> -- table is loaded from a file containing that sequence.
> CREATE EXTERNAL TABLE day_seq_orig
> (day INT)
> ROW FORMAT DELIMITED LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> LOCATION '${SEQFILE}';
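For what it's worth, the 1-through-90 sequence file doesn't need Hive at all; a minimal sketch that writes it (the local file name is a placeholder for whatever ${SEQFILE} points at):

```python
def day_sequence(n=90):
    """Return the contents of the day_seq_orig input file: 1..n, one per line."""
    return "".join("%d\n" % day for day in range(1, n + 1))

if __name__ == "__main__":
    # Placeholder path; upload the result to the S3 location in ${SEQFILE}.
    with open("day_seq.txt", "w") as f:
        f.write(day_sequence())
```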
>
> -- A temporary table to contain the distinct list of dates for which we
> -- have stock prices; there should be 180 dates for the 180 days for
> -- which we are getting info.
> CREATE TABLE distinct_dt
> (dt STRING)
> STORED AS SEQUENCEFILE;
>
> -- ${SAMPLE_SYMBOL} is just one of my symbols.  Since the same sequence
> -- of dates applies to all tickers, this gives me an easy and quick way
> -- to get the range of dates.
> INSERT OVERWRITE TABLE distinct_dt
> SELECT distinct dt
> FROM raw_symbols
> WHERE raw_symbols.symbol = '${SAMPLE_SYMBOL}';