Inconsistent state in JobTracker (cdh)
Hi all,

we are experiencing, from time to time, some odd behavior of the
JobTracker (using the cdh release, currently cdh3u3, but I suppose this
affects at least all cdh3 releases so far). What we are seeing is an M/R
job being stuck between the map and reduce phases: 100% of the maps are
reported completed, yet the web UI shows 1 running map task. Since we
have mapred.reduce.slowstart.completed.maps set to 1.0 (for better job
throughput), the reduce phase will never start and the job has to be
killed. I have investigated this a bit and I think I have found the
reason.
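
For reference, this is how we pin that setting through the old mapred
API; the property name is the real one, the wrapper class is just
illustrative:

    import org.apache.hadoop.mapred.JobConf;

    public class SlowstartConfig {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // Hold back all reducers until 100% of the maps have finished.
            // If the completed-maps counter never reaches numMapTasks,
            // the reduce phase never starts.
            conf.setFloat("mapred.reduce.slowstart.completed.maps", 1.0f);
            // ... normal job setup and submission ...
        }
    }

Here is the relevant part of the JobTracker log from when a job got
stuck: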

12/11/20 01:05:10 INFO mapred.JobInProgress: Task 'attempt_201211011002_1852_m_007638_0' has completed task_201211011002_1852_m_007638 successfully.
12/11/20 01:05:10 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on <some output path> File does not exist. [Lease. Holder: DFSClient_408514838, pendingcreates: 1]
12/11/20 01:05:10 WARN hdfs.DFSClient: Error Recovery for block blk_-1434919284750099885_670717751 bad datanode[0] nodes == null
12/11/20 01:05:10 WARN hdfs.DFSClient: Could not get block locations. Source file "<some output path>" - Aborting...
12/11/20 01:05:10 INFO mapred.JobHistory: Logging failed for job job_201211011002_1852removing PrintWriter from FileManager
12/11/20 01:05:10 ERROR security.UserGroupInformation: PriviledgedActionException as:mapred (auth:SIMPLE) cause:java.io.IOException: java.util.ConcurrentModificationException
12/11/20 01:05:10 INFO ipc.Server: IPC Server handler 7 on 9001, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1256e5f6, false, false, true, -17988) from error: java.io.IOException:

When I look at the source code of JobInProgress.completedTask(), I see
the log statement about the successful completion of the task, followed
by the history logging to HDFS (JobHistory.Task.logFinished()). I
suppose that if this call throws an exception (as in the case above),
completedTask() is aborted *before* the counters runningMapTasks and
finishedMapTasks are updated accordingly. I took a heap dump of the
JobTracker and indeed found runningMapTasks set to 1 and
finishedMapTasks equal to numMapTasks - 1.
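
To make the suspected control flow concrete, here is a toy model of it
(this is NOT the CDH source, just a self-contained sketch; the field and
method names merely mirror JobInProgress):

    import java.io.IOException;

    public class CompletedTaskSketch {
        private int runningMapTasks = 1;
        private int finishedMapTasks = 0;

        // Stand-in for JobHistory.Task.logFinished(), which writes the
        // job history to HDFS and can fail, e.g. when the lease on the
        // history file has expired.
        private void logFinishedToHdfs() throws IOException {
            throw new IOException("No lease on <some output path>");
        }

        public synchronized void completedTask() throws IOException {
            // 1. The "has completed ... successfully" line is logged here.
            System.out.println("Task has completed successfully.");
            // 2. History logging; if it throws, step 3 is never reached.
            logFinishedToHdfs();
            // 3. Counter bookkeeping, skipped on exception, which leaves
            //    the job stuck with one "running" map forever.
            runningMapTasks -= 1;
            finishedMapTasks += 1;
        }

        public static void main(String[] args) {
            CompletedTaskSketch jip = new CompletedTaskSketch();
            try {
                jip.completedTask();
            } catch (IOException e) {
                System.out.println("completedTask aborted: " + e.getMessage());
            }
            // Prints runningMapTasks=1, matching what the heap dump showed.
            System.out.println("runningMapTasks=" + jip.runningMapTasks);
        }
    }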

Now, the question is: should this be handled in the JobTracker (say, by
moving the logging code after the counter manipulation)? Or should the
TaskTracker re-report the completed task when the JobTracker hits an
error? What can cause the LeaseExpiredException in the first place?
Should a JIRA be filed? :)
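
For the first option, the change I have in mind would look roughly like
this (again only a sketch, reusing the toy names from above, not a patch
against the real JobInProgress):

    public synchronized void completedTask() {
        // Update the bookkeeping first so an HDFS hiccup cannot leave
        // the counters inconsistent.
        runningMapTasks -= 1;
        finishedMapTasks += 1;
        try {
            logFinishedToHdfs();
        } catch (IOException e) {
            // History logging becomes best-effort here.
            System.out.println("history logging failed: " + e.getMessage());
        }
    }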

Thanks for any comments,