Hadoop user mailing list: how is userlogs supposed to be cleaned up?


Chris Curtin 2012-03-06, 18:22
Re: how is userlogs supposed to be cleaned up?
Aside from cleanup, it seems like you are running into the maximum number of subdirectories per directory on ext3.

Joep

Sent from my iPhone
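
A quick way to check that on a node is to count the entries directly under
userlogs (a rough sketch, reusing the /disk1/userlogs path from the error quoted
below; ext3's link-count limit caps a directory at roughly 32,000 subdirectories,
which lines up with the 30,000+ counts reported below):

    # Count job log directories directly under userlogs; a failing node should
    # be sitting close to the ~32,000 ceiling.
    find /disk1/userlogs -mindepth 1 -maxdepth 1 -type d | wc -l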

On Mar 6, 2012, at 10:22 AM, Chris Curtin <[EMAIL PROTECTED]> wrote:

> Hi,
>
> We had a fun morning trying to figure out why our cluster was failing jobs,
> removing nodes from the cluster, etc. The majority of the errors looked
> something like:
>
> Error initializing attempt_201203061035_0047_m_000002_0:
> org.apache.hadoop.util.Shell$ExitCodeException: chmod: cannot access
> `/disk1/userlogs/job_201203061035_0047': No such file or directory
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
>         at org.apache.hadoop.util.Shell.run(Shell.java:182)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
>         at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:533)
>         at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:524)
>         at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
>         at org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:240)
>         at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:216)
>         at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1352)
>
> Finally we shut down the entire cluster and found that the 'userlogs' directory
> on the failed nodes had 30,000+ directories and the 'live' nodes 25,000+.
> Looking at creation timestamps, it looks like the node falls over around the
> time the 30,000th directory is added.
>
> Many of the directories are weeks old and a few were months old.
>
> Deleting ALL the directories on all the nodes allowed us to bring the cluster
> up and things to run again. (Some users are claiming it is running faster now?)
>
> Our question: what is supposed to be cleaning up these directories? How often
> does that cleanup run?
>
> We are running CDH3u3.
>
> Thanks,
>
> Chris
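
For the cleanup question: in CDH3 (Hadoop 0.20) the TaskTracker itself is
supposed to prune userlogs, with the retention window controlled by
mapred.userlog.retain.hours (24 hours by default, if memory serves for the 0.20
defaults). A hedged sketch for inspecting a node follows; /etc/hadoop/conf is
the usual CDH config location and /disk1/userlogs is taken from the error above,
so adjust both for your layout:

    # See whether the retention window has been overridden (the default of 24
    # hours applies when the property is not set at all).
    grep -A 2 'mapred.userlog.retain.hours' /etc/hadoop/conf/mapred-site.xml

    # Count job log directories and list any older than a week; anything that
    # old suggests the TaskTracker's cleanup is not keeping up.
    find /disk1/userlogs -mindepth 1 -maxdepth 1 -type d -name 'job_*' | wc -l
    find /disk1/userlogs -mindepth 1 -maxdepth 1 -type d -name 'job_*' -mtime +7 -ls

If stale directories keep piling up even with a sane retention setting, an
age-based sweep on idle nodes (the same find with -mtime +7 and -exec rm -rf {} +)
is a gentler stopgap than wiping everything, though it is a workaround rather
than an explanation of why the built-in cleanup fell behind.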
Arun C Murthy 2012-03-06, 18:55