Inconsistent state in JobTracker (cdh)
Hi all,

we are from time to time seeing somewhat odd behavior of the JobTracker
(we use the CDH release, currently cdh3u3, but I suppose this affects at
least all CDH3 releases so far). What we see is an M/R job getting stuck
between the map and reduce phases: the job reports 100% of maps
completed, yet the web UI still shows 1 running map task, and since we
have mapred.reduce.slowstart.completed.maps set to 1.0 (for better
overall job throughput) the reduce phase never starts and the job has to
be killed. I have investigated this a bit and I think I have found the
reason; the relevant excerpt from the JobTracker log is below.
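
For reference, this is roughly how we set that property on the job
(only the property name and value are the real ones, the rest is just
illustrative):

import org.apache.hadoop.mapred.JobConf;

// illustrative only: delay reducer scheduling until all maps are done
JobConf conf = new JobConf();
conf.setFloat("mapred.reduce.slowstart.completed.maps", 1.0f);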

12/11/20 01:05:10 INFO mapred.JobInProgress: Task
'attempt_201211011002_1852_m_007638_0' has completed
task_201211011002_1852_m_007638 successfully.
12/11/20 01:05:10 WARN hdfs.DFSClient: DataStreamer Exception:
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
on <some output path> File does not exist. [Lease. Holder:
DFSClient_408514838, pendingcreates: 1]
....
12/11/20 01:05:10 WARN hdfs.DFSClient: Error Recovery for block
blk_-1434919284750099885_670717751 bad datanode[0] nodes == null
12/11/20 01:05:10 WARN hdfs.DFSClient: Could not get block locations.
Source file "<some output path>" - Aborting...
12/11/20 01:05:10 INFO mapred.JobHistory: Logging failed for job
job_201211011002_1852removing PrintWriter from FileManager
12/11/20 01:05:10 ERROR security.UserGroupInformation:
PriviledgedActionException as:mapred (auth:SIMPLE)
cause:java.io.IOException: java.util.ConcurrentModificationException
12/11/20 01:05:10 INFO ipc.Server: IPC Server handler 7 on 9001, call
heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1256e5f6, false,
false, true, -17988) from 10.2.73.35:44969: error: java.io.IOException:
java.util.ConcurrentModificationException
When I look at the source code of JobInProgress.completedTask(), I see
the log message about the successful completion of the task and, right
after it, the history logging to HDFS (JobHistory.Task.logFinished()). I
suppose that if this call throws an exception (as in the case above),
completedTask() is aborted *before* the counters runningMapTasks and
finishedMapTasks are updated accordingly. I took a heap dump of the
JobTracker and indeed found runningMapTasks set to 1 and
finishedMapTasks equal to numMapTasks - 1.
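
Schematically (paraphrasing from memory, not the exact cdh3u3 source),
the ordering inside completedTask() looks like this:

// rough paraphrase of JobInProgress.completedTask(), not the real code
public synchronized boolean completedTask(TaskInProgress tip,
                                          TaskStatus status) {
  // 1) the "has completed ... successfully" log line seen above
  LOG.info("Task '" + status.getTaskID() + "' has completed "
      + tip.getTIPId() + " successfully.");

  // 2) history logging that writes to HDFS; if this throws (as with the
  //    LeaseExpiredException above), the method is left before step 3
  JobHistory.Task.logFinished(...);

  // 3) only now are the counters adjusted, so on an exception in 2) they
  //    stay at runningMapTasks == 1, finishedMapTasks == numMapTasks - 1
  if (tip.isMapTask()) {
    runningMapTasks -= 1;
    finishedMapTasks += 1;
  }
  ...
}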

Now, the question is: should this be handled in the JobTracker (say by
moving the history logging after the counter manipulation, see the
sketch below)? Or should the TaskTracker re-report the completed task
when the JobTracker hits such an error? What can cause the
LeaseExpiredException in the first place? Should a JIRA be filed? :)
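
If the first option is the way to go, I mean something along these lines
(only a sketch of the idea, not a tested patch):

// sketch only: adjust the counters before (or independently of) the
// history call, so a failure in history logging cannot leave them stale
if (tip.isMapTask()) {
  runningMapTasks -= 1;
  finishedMapTasks += 1;
}
try {
  JobHistory.Task.logFinished(...);
} catch (Exception e) {
  LOG.warn("History logging failed for " + tip.getTIPId()
      + ", counters already updated", e);
}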

Thanks for comments,
  Jan