Re: after 2 weeks TaskTracker gets hung with 100% CPU consumption
What version of Hadoop are you running?

On Apr 21, 2012, at 12:20 AM, Vladimir Egorov wrote:

> Hi,
> After around 2 weeks a TaskTracker (TT) in our MR cluster hangs with 100% CPU consumption. Most of the time no new tasks are sent to the node, and we start seeing more job failures in the cluster when this happens. Once we restart the TT, the node is fine for about another two weeks.
> We also noticed that after a restart, some other TT in the cluster starts showing the same behavior. This continues until all the TTs have been restarted. Another solution is to restart the whole MR cluster.
> A thread dump is posted below. It looks like the TT is busy with some log cleanup. We also noticed that when we restart, the TT sometimes fails to start because the tobedeleted directory cannot be deleted. We have to delete it manually, and then the TT starts normally.
> Has anyone seen this, and is there a resolution or workaround?
> Thank you,
> Vladimir
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (19.0-b09 mixed mode):
> "Thread-97182" daemon prio=10 tid=0x00002aaab8a7f000 nid=0x1c7d runnable [0x0000000040508000]
>    java.lang.Thread.State: RUNNABLE
>     at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
>     at java.lang.StringCoding.encode(StringCoding.java:272)
>     at java.lang.String.getBytes(String.java:946)
>     at java.io.UnixFileSystem.list(Native Method)
>     at java.io.File.list(File.java:973)
>     at java.io.File.listFiles(File.java:1051)
>     at org.apache.hadoop.fs.FileUtil.fullyDeleteContents(FileUtil.java:96)
>     at org.apache.hadoop.fs.FileUtil.fullyDelete(FileUtil.java:84)
>     at org.apache.hadoop.fs.FileUtil.fullyDeleteContents(FileUtil.java:115)
>     at org.apache.hadoop.fs.FileUtil.fullyDelete(FileUtil.java:84)
>     at org.apache.hadoop.fs.RawLocalFileSystem.delete(RawLocalFileSystem.java:293)
>     at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:466)
>     at org.apache.hadoop.mapreduce.util.MRAsyncDiskService$DeleteTask.run(MRAsyncDiskService.java:199)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
> "Thread-97171" daemon prio=10 tid=0x00002aaab8a81000 nid=0x1bde waiting for monitor entry [0x000000004030a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at org.apache.hadoop.mapred.TaskTracker.getTaskTrackerReportAddress(TaskTracker.java:1351)
>     - waiting to lock <0x00000000c185f690> (a org.apache.hadoop.mapred.TaskTracker)
>     at org.apache.hadoop.mapred.TaskRunner.getVMArgs(TaskRunner.java:477)
>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:210)
> "Thread-6" daemon prio=10 tid=0x00002aaab443e800 nid=0x2a98 runnable [0x0000000043047000]
>    java.lang.Thread.State: RUNNABLE
>     at java.lang.String.substring(String.java:1939)
>     at java.lang.String.substring(String.java:1904)
>     at java.io.File.getName(File.java:401)
>     at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:229)
>     at java.io.File.exists(File.java:733)
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:420)
>     at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:964)
>     at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:430)
>     at org.apache.hadoop.mapreduce.util.MRAsyncDiskService.moveAndDeleteRelativePath(MRAsyncDiskService.java:244)
>     at org.apache.hadoop.mapreduce.util.MRAsyncDiskService.moveAndDeleteAbsolutePath(MRAsyncDiskService.java:361)
>     at org.apache.hadoop.mapred.UserLogCleaner.deleteLogPath(UserLogCleaner.java:200)
>     at org.apache.hadoop.mapred.UserLogCleaner.processCompletedJobs(UserLogCleaner.java:103)
>     - locked <0x00000000c18b0200> (a java.util.Collections$SynchronizedMap)
>     at org.apache.hadoop.mapred.UserLogCleaner.run(UserLogCleaner.java:83)

Arun C. Murthy
Hortonworks Inc.
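
For anyone reading a dump like the one quoted in this thread, a small script can group the threads by state and list the frames each one is executing. What follows is only a minimal sketch, not part of Hadoop or the JDK. The function names summarize_dump and threads_by_state are made up for illustration, and the parsing assumes the HotSpot text layout seen above: a quoted thread name, a java.lang.Thread.State line, then "at" frames. It also strips the "> " mail quoting.

```python
import re
from collections import defaultdict

# Thread header, e.g.: "Thread-6" daemon prio=10 tid=0x... nid=0x2a98 runnable [...]
HEADER = re.compile(r'^"([^"]+)"')
# State line, e.g.: java.lang.Thread.State: BLOCKED (on object monitor)
STATE = re.compile(r'java\.lang\.Thread\.State:\s+([A-Z_]+)')
# Stack frame, e.g.: at java.io.File.list(File.java:973)
FRAME = re.compile(r'^at\s+([^\s(]+)\(')


def summarize_dump(text):
    """Parse a HotSpot-style thread dump into a list of per-thread dicts.

    Each dict has the keys "name", "state" and "frames" (top frame first).
    Lines that appear before the first thread header are ignored.
    """
    threads = []
    current = None
    for raw in text.splitlines():
        # Drop mail quoting ("> ") and leading indentation.
        line = raw.lstrip("> \t").rstrip()
        header = HEADER.match(line)
        if header:
            current = {"name": header.group(1), "state": None, "frames": []}
            threads.append(current)
            continue
        if current is None:
            continue
        state = STATE.search(line)
        if state:
            current["state"] = state.group(1)
            continue
        frame = FRAME.match(line)
        if frame:
            current["frames"].append(frame.group(1))
    return threads


def threads_by_state(threads):
    """Group thread names by their reported state."""
    grouped = defaultdict(list)
    for thread in threads:
        grouped[thread["state"]].append(thread["name"])
    return dict(grouped)
```

Run on the dump quoted above, this groups Thread-97182 and Thread-6 under RUNNABLE and Thread-97171 under BLOCKED. Comparing the frames of the RUNNABLE threads across several dumps taken a few seconds apart is one way to check whether the delete task is making progress or walking the same directories over and over.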