Re: Inconsistent state in JobTracker (cdh)
Harsh J 2012-11-20, 15:06
Hey Jan,
Your problem may be CDH-specific, so I am moving this to [EMAIL PROTECTED] (https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!forum/cdh-user).

The specific issue you are running into (mostly environmentally triggered), the ConcurrentModificationException failure at heartbeats, was a known issue up to CDH3u3 and has been fixed from CDH3u4 onwards. If you upgrade to the latest CDH3 release (CDH3u5), your problem should simply go away.

On Tue, Nov 20, 2012 at 7:14 PM, Jan Lukavský <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> from time to time we see somewhat odd behavior of the JobTracker (using the
> cdh release, currently cdh3u3, but I suppose this affects at least all
> cdh3 releases so far). What we are seeing is an M/R job being stuck between
> the map and reduce phases: 100% of maps are completed, but the web UI
> reports 1 running map task. Since we have
> mapred.reduce.slowstart.completed.maps set to 1.0 (for better job
> throughput), the reduce phase never starts and the job has to be killed.
> I have investigated this a bit and I think I have found the reason.
>
> 12/11/20 01:05:10 INFO mapred.JobInProgress: Task 'attempt_201211011002_1852_m_007638_0' has completed task_201211011002_1852_m_007638 successfully.
> 12/11/20 01:05:10 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on <some output path> File does not exist. [Lease. Holder: DFSClient_408514838, pendingcreates: 1]
> ....
> 12/11/20 01:05:10 WARN hdfs.DFSClient: Error Recovery for block blk_-1434919284750099885_670717751 bad datanode[0] nodes == null
> 12/11/20 01:05:10 WARN hdfs.DFSClient: Could not get block locations. Source file "<some output path>" - Aborting...
> 12/11/20 01:05:10 INFO mapred.JobHistory: Logging failed for job job_201211011002_1852removing PrintWriter from FileManager
> 12/11/20 01:05:10 ERROR security.UserGroupInformation: PriviledgedActionException as:mapred (auth:SIMPLE) cause:java.io.IOException: java.util.ConcurrentModificationException
> 12/11/20 01:05:10 INFO ipc.Server: IPC Server handler 7 on 9001, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1256e5f6, false, false, true, -17988) from 10.2.73.35:44969: error: java.io.IOException: java.util.ConcurrentModificationException
>
> When I look at the source code of JobInProgress.completedTask(), I see the
> log message about successful completion of the task and, after that, the
> logging to HDFS (JobHistory.Task.logFinished()). I suppose that if this call
> throws an exception (as in the case above), the call to completedTask() is
> aborted *before* the counters runningMapTasks and finishedMapTasks are
> updated accordingly. I created a heap dump of the JobTracker and indeed
> found the counter runningMapTasks set to 1 and finishedMapTasks equal to
> numMapTasks - 1.
>
> Now, the question is, should this be handled in the JobTracker (say, by
> moving the logging code after the counter manipulation)? Or should the
> TaskTracker re-report the completed task on an error in the JobTracker?
> What can cause the LeaseExpiredException? Should a JIRA be filed? :)
>
> Thanks for comments,
> Jan
>
>

--
Harsh J
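
For readers following the thread, the ordering problem Jan describes can be modelled in a few lines. This is only a minimal sketch under stated assumptions, not the actual Hadoop JobInProgress code: the class CompletedTaskOrderingSketch, the HistoryLogger interface, the JobState class and both method names are hypothetical, and the failure is simulated with a plain IOException. Only the counter names runningMapTasks and finishedMapTasks come from the report. It compares "log first, then update counters" with the alternative Jan suggests, "update counters first, then log".

```java
import java.io.IOException;

/** Simplified model of the ordering problem; NOT the actual Hadoop code. */
public class CompletedTaskOrderingSketch {

    /** Hypothetical stand-in for the job history logging that writes to HDFS. */
    interface HistoryLogger {
        void logFinished(String taskId) throws IOException;
    }

    /** Hypothetical holder for the two counters named in the report. */
    static class JobState {
        int runningMapTasks;
        int finishedMapTasks;

        JobState(int runningMapTasks) {
            this.runningMapTasks = runningMapTasks;
        }

        /** Ordering as described in the report: logging first, counters second. */
        void completedTaskLoggingFirst(String taskId, HistoryLogger logger) throws IOException {
            logger.logFinished(taskId); // if this throws, the two lines below never run
            runningMapTasks--;
            finishedMapTasks++;
        }

        /** Suggested alternative: counters first, logging failure contained. */
        boolean completedTaskCountersFirst(String taskId, HistoryLogger logger) {
            runningMapTasks--;
            finishedMapTasks++;
            try {
                logger.logFinished(taskId);
                return true;
            } catch (IOException e) {
                return false; // history entry is lost, but the counters stay consistent
            }
        }
    }

    public static void main(String[] args) {
        // Logger that always fails, simulating the failed history write.
        HistoryLogger failing = taskId -> {
            throw new IOException("simulated logging failure for " + taskId);
        };

        JobState loggingFirst = new JobState(1);
        try {
            loggingFirst.completedTaskLoggingFirst("task_m_000001", failing);
        } catch (IOException e) {
            // In the report, this is where the heartbeat call returns an error.
        }
        System.out.println("logging first:  running=" + loggingFirst.runningMapTasks
                + " finished=" + loggingFirst.finishedMapTasks);

        JobState countersFirst = new JobState(1);
        boolean logged = countersFirst.completedTaskCountersFirst("task_m_000001", failing);
        System.out.println("counters first: running=" + countersFirst.runningMapTasks
                + " finished=" + countersFirst.finishedMapTasks + " logged=" + logged);
    }
}
```

In the first variant the last map task stays counted as running, which matches the heap dump Jan describes. The second variant trades a lost history entry for consistent counters. Whether that trade-off is acceptable in the real JobTracker is a separate question that this sketch does not answer.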