I'm running on an Amazon EMR cluster with Hive 0.8.1. We have a lot of
other Hadoop jobs, but only started experimenting with Hive recently.
I've been seeing a long pause between submitting a Hive query and the
actual start of the Hadoop job: 10 minutes or more in some cases. I'm
wondering what's happening during this time. A high-level answer would
help, or maybe there is some logging I can turn on?
Here's some more detail. I submit the query on the master node using the
Hive CLI, and start to see some output right away...
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
In order to limit the maximum number of reducers:
In order to set a constant number of reducers:
[then a long delay here: 10 minutes or more, with no activity in the
Hadoop JobTracker UI]
… and then it continues normally ...
Starting Job = job_201301160029_0082, Tracking URL http://ip-xxxxxxxx.ec2.internal:9100/jobdetails.jsp?jobid=job_201301160029_0082
Kill Command = /home/hadoop/bin/hadoop job -Dmapred.job.tracker=xxxxxx:9001 -kill job_201301160029_0082
Hadoop job information for Stage-1: number of mappers: 2; number of
2013-01-30 20:45:30,526 Stage-1 map = 0%, reduce = 0%
This query is processing in the neighborhood of 500GB of data from S3. A
couple of possibilities I thought of… perhaps someone can confirm or deny:
a) Is the data copied from S3 to HDFS during this time?
b) I have a fairly large set of libs in HIVE_AUX_JAR_PATH (around 175
MB). Does Hive have to copy these out to the tasks at this time?
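In case it helps narrow things down, here is roughly how I would look for the pause once I have a log with timestamps in it. This is only a sketch: it assumes each log line of interest starts with a timestamp in the same style as the `2013-01-30 20:45:30,526` progress line above, and the names `parse_ts` and `largest_gaps` are made up for illustration, not part of Hive or Hadoop.

```python
import re
from datetime import datetime

# Assumed timestamp prefix, in the style of "2013-01-30 20:45:30,526".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    m = TS_RE.match(line)
    if not m:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base.replace(microsecond=int(m.group(2)) * 1000)


def largest_gaps(lines, top=3):
    """Return up to `top` (seconds, line_before, line_after) tuples,
    ordered from the longest silence to the shortest."""
    stamped = [(parse_ts(line), line.rstrip("\n")) for line in lines]
    # Lines without a timestamp (stack traces, wrapped output) are skipped.
    stamped = [(ts, text) for ts, text in stamped if ts is not None]
    gaps = [
        ((t2 - t1).total_seconds(), before, after)
        for (t1, before), (t2, after) in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


# Hypothetical usage, with the path to whatever log file is available:
#   with open("/path/to/hive.log") as f:
#       for seconds, before, after in largest_gaps(f):
#           print(f"{seconds:8.1f}s between:\n  {before}\n  {after}")
```

The two lines on either side of the longest gap should at least show which step the CLI was in when it went quiet.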
Any insights appreciated.