Re: Inconsistent state in JobTracker (cdh)
Hey Jan,

Your problem may be CDH-specific, so I am moving this to
[EMAIL PROTECTED]
(https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!forum/cdh-user).

The specific issue you are running into (mostly environmentally
triggered), the ConcurrentModificationException failure at heartbeats,
was a known issue through CDH3u3 and has been fixed in CDH3u4 onwards.
If you upgrade (to the latest CDH3, CDH3u5), your problem should simply
go away.

On Tue, Nov 20, 2012 at 7:14 PM, Jan Lukavský
<[EMAIL PROTECTED]> wrote:
> Hi all,
>
> we are from time to time experiencing somewhat odd behavior of the
> JobTracker (using the CDH release, currently cdh3u3, but I suppose this
> affects at least all CDH3 releases so far). What we are seeing is an M/R
> job being stuck between the map and reduce phases: 100% of maps are
> completed, but the web UI reports 1 running map task, and since we have
> mapred.reduce.slowstart.completed.maps set to 1.0 (for better job
> throughput), the reduce phase will never start and the job has to be
> killed. I have investigated this a bit and I think I have found the
> reason for this.
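>
> For reference, this is roughly how we apply the setting (a minimal
> sketch using the classic JobConf API; the class name is just for
> illustration):
>
>   import org.apache.hadoop.mapred.JobConf;
>
>   public class SlowstartSetting {
>       public static void main(String[] args) {
>           JobConf conf = new JobConf();
>           // 1.0 means: do not schedule any reducer until 100% of the
>           // maps have completed.
>           conf.setFloat("mapred.reduce.slowstart.completed.maps", 1.0f);
>       }
>   }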
>
> 12/11/20 01:05:10 INFO mapred.JobInProgress: Task
> 'attempt_201211011002_1852_m_007638_0' has completed
> task_201211011002_1852_m_007638 successfully.
> 12/11/20 01:05:10 WARN hdfs.DFSClient: DataStreamer Exception:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
> <some output path> File does not exist. [Lease.  Holder:
> DFSClient_408514838, pendingcreates: 1]
> ....
> 12/11/20 01:05:10 WARN hdfs.DFSClient: Error Recovery for block
> blk_-1434919284750099885_670717751 bad datanode[0] nodes == null
> 12/11/20 01:05:10 WARN hdfs.DFSClient: Could not get block locations. Source
> file "<some output path>" - Aborting...
> 12/11/20 01:05:10 INFO mapred.JobHistory: Logging failed for job
> job_201211011002_1852removing PrintWriter from FileManager
> 12/11/20 01:05:10 ERROR security.UserGroupInformation:
> PriviledgedActionException as:mapred (auth:SIMPLE)
> cause:java.io.IOException: java.util.ConcurrentModificationException
> 12/11/20 01:05:10 INFO ipc.Server: IPC Server handler 7 on 9001, call
> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1256e5f6, false, false,
> true, -17988) from 10.2.73.35:44969: error: java.io.IOException:
> java.util.ConcurrentModificationException
>
>
> When I look at the source code for JobInProgress.completedTask(), I see
> the log about successful completion of the task, followed by the logging
> to HDFS (JobHistory.Task.logFinished()). I suppose that if this call
> throws an exception (as in the case above), the call to completedTask()
> is aborted *before* the counters runningMapTasks and finishedMapTasks
> are updated accordingly. I created a heap dump of the JobTracker and
> indeed found the counter runningMapTasks set to 1, while
> finishedMapTasks was equal to numMapTasks - 1.
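>
> In simplified form, the flow I suspect looks like this (a
> self-contained sketch only, not the actual CDH source; the field and
> method names just mirror the real ones):
>
>   import java.io.IOException;
>
>   public class CompletedTaskSketch {
>       static int runningMapTasks = 1;
>       static int finishedMapTasks = 0;
>
>       // Stand-in for JobHistory.Task.logFinished(), which writes to
>       // the HDFS history file; assume HDFS fails here.
>       static void logFinishedToHdfs() throws IOException {
>           throw new IOException("LeaseExpiredException: no lease");
>       }
>
>       static void completedTask() throws IOException {
>           System.out.println("Task has completed successfully."); // the INFO line we see
>           logFinishedToHdfs();   // throws, so the updates below ...
>           runningMapTasks -= 1;  // ... are never reached and the job
>           finishedMapTasks += 1; // keeps showing one running map.
>       }
>
>       public static void main(String[] args) {
>           try {
>               completedTask();
>           } catch (IOException e) {
>               System.out.println("counters stuck: running="
>                       + runningMapTasks + " finished=" + finishedMapTasks);
>           }
>       }
>   }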
>
> Now, the question is, should this be handled in the JobTracker (say, by
> moving the logging code after the counter manipulation)? Or should the
> TaskTracker re-report the completed task on an error in the JobTracker?
> What can cause the LeaseExpiredException? Should a JIRA be filed? :)
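>
> Continuing the sketch above, moving the counter updates ahead of (and
> guarding) the history logging would avoid the stuck state; again, this
> is just a sketch of the idea, not a tested patch:
>
>   static void completedTaskReordered() {
>       System.out.println("Task has completed successfully.");
>       runningMapTasks -= 1;      // update bookkeeping first, so a
>       finishedMapTasks += 1;     // history-logging failure ...
>       try {
>           logFinishedToHdfs();
>       } catch (IOException e) {
>           // ... cannot leave the job stuck; log and carry on.
>           System.err.println("History logging failed: " + e);
>       }
>   }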
>
> Thanks for comments,
>  Jan
>
>

--
Harsh J