Hive >> mail # user >> delay before query starts processing


delay before query starts processing
Hi,

I'm running in Amazon on an EMR cluster with hive 0.8.1.  We have a lot of
other Hadoop jobs, but only started experimenting with Hive recently.

I've been seeing a long pause between submitting a Hive query and the
actual start of the Hadoop job: 10 minutes or more in some cases.  I'm
wondering what's happening during this time.  Either a high-level answer
would help, or maybe there is some logging I can turn on?

Here's some more detail.  I submit the query on the master using the Hive
CLI, and start to see some output right away:

Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
[then a long delay here: 10 minutes or more, with no activity in the Hadoop JobTracker UI]
… and then it continues normally ...
Starting Job = job_201301160029_0082, Tracking URL http://ip-xxxxxxxx.ec2.internal:9100/jobdetails.jsp?jobid=job_201301160029_0082
Kill Command = /home/hadoop/bin/hadoop job -Dmapred.job.tracker=xxxxxx:9001 -kill job_201301160029_0082
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2013-01-30 20:45:30,526 Stage-1 map = 0%,  reduce = 0%


This query processes roughly 500 GB of data from S3.  I thought of a
couple of possibilities that perhaps someone can confirm or deny:
a) Is the data copied from S3 to HDFS during this time?
b) I have a fairly large set of libs in HIVE_AUX_JARS_PATH (around 175
MB). Does Hive have to copy these out to the tasks during this time?
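
To pin down the size mentioned in (b), something like the following rough Python sketch could be used to total the jars on the aux jar path. The function name total_jar_bytes is made up for illustration, and it assumes the path value is a comma- or colon-separated list of plain local jar files and/or directories (no file:// URIs); that format is an assumption, not something confirmed here.

```python
import os
import re


def total_jar_bytes(path_value):
    """Sum the sizes (in bytes) of all .jar files named by a path string.

    Assumption: entries are separated by commas or colons, and each entry
    is either a local jar file or a local directory containing jars.
    """
    total = 0
    for entry in re.split(r"[,:]", path_value):
        entry = entry.strip()
        if not entry:
            continue
        if os.path.isdir(entry):
            # Walk the directory tree and count every jar found in it.
            for root, _dirs, files in os.walk(entry):
                for name in files:
                    if name.endswith(".jar"):
                        total += os.path.getsize(os.path.join(root, name))
        elif os.path.isfile(entry) and entry.endswith(".jar"):
            total += os.path.getsize(entry)
    return total
```

Calling total_jar_bytes with the value of the aux jar path setting and dividing the result by 1024 * 1024 gives the total in MB.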

Any insights appreciated.

Marc