Hadoop, mail # user - how is userlogs supposed to be cleaned up?


Re: how is userlogs supposed to be cleaned up?
Joep Rottinghuis 2012-03-07, 07:12
Aside from cleanup, it seems like you are running into the maximum number of subdirectories per directory on ext3 (roughly 32,000); a quick way to check is sketched below.

Joep

Sent from my iPhone
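
One way to sanity-check this on a node is to count the entries under each userlogs directory and compare against ext3's per-directory cap of roughly 32,000 links. The Java sketch below is not from the thread; it assumes the /disk1/userlogs path taken from the error in the quoted message, which will differ per node and disk.

    import java.io.File;

    // Count entries under a userlogs directory to see how close it is to
    // ext3's ~32,000 subdirectory cap. The default path comes from the
    // error in the quoted message; pass a different path as the first argument.
    public class UserlogCount {
        public static void main(String[] args) {
            String dir = (args.length > 0) ? args[0] : "/disk1/userlogs";
            String[] entries = new File(dir).list();
            int count = (entries == null) ? 0 : entries.length;
            System.out.println(dir + ": " + count + " entries");
        }
    }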

On Mar 6, 2012, at 10:22 AM, Chris Curtin <[EMAIL PROTECTED]> wrote:

> Hi,
>
> We had a fun morning trying to figure out why our cluster was failing jobs,
> removing nodes from the cluster, and so on. The majority of the errors looked
> something like this:
>
>
> Error initializing attempt_201203061035_0047_m_000002_0:
> org.apache.hadoop.util.Shell$ExitCodeException: chmod: cannot access
> `/disk1/userlogs/job_201203061035_0047': No such file or directory
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
>         at org.apache.hadoop.util.Shell.run(Shell.java:182)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
>         at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:533)
>         at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:524)
>         at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
>         at org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:240)
>         at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:216)
>         at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1352)
>
>
>
> Finally we shut down the entire cluster and found that the 'userlogs'
> directory on the failed nodes had 30,000+ subdirectories, and the 'live'
> nodes 25,000+. Looking at creation timestamps, it appears a node falls over
> around the time its 30,000th directory is added.
>
>
>
> Many of the directories were weeks old and a few were months old.
>
>
>
> Deleting ALL the directories on all the nodes allowed us to bring the
> cluster back up and run jobs again. (Some users are claiming it is running
> faster now?)
>
>
>
> Our question: what is supposed to be cleaning up these directories, and how
> often does that cleanup run?
>
>
>
> We are running CDH3u3.
>
>
>
> Thanks,
>
>
>
> Chris
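
On the cleanup question itself: in MRv1 the TaskTracker is expected to prune job directories under userlogs that are older than mapred.userlog.retain.hours, which defaults to 24 (hours). The sketch below is not from the thread and only shows how to read the effective value with the stock JobConf API; whether that cleanup actually keeps up on CDH3u3 is exactly what is being asked here.

    import org.apache.hadoop.mapred.JobConf;

    // Read the userlog retention setting the MRv1 TaskTracker uses when
    // pruning old job directories under userlogs. JobConf picks up
    // mapred-default.xml and mapred-site.xml from the classpath.
    public class UserlogRetention {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            int retainHours = conf.getInt("mapred.userlog.retain.hours", 24);
            System.out.println("mapred.userlog.retain.hours = " + retainHours);
        }
    }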