In case anyone is wondering, I tracked this down to a race condition in JobInProgress, or a failure to clean up FileSystems in CleanupQueue, depending on how you look at it.
FileSystem.closeAllForUGI is what keeps the cache from leaking, but the close and the corresponding FileSystem.get don't happen on a single thread. JobInProgress calls closeAllForUGI on a UGI that is also handed to the CleanupQueue thread. If JobInProgress calls closeAllForUGI before CleanupQueue calls FileSystem.get with that UGI, the FileSystem created by CleanupQueue goes into the cache after the close has already run, and since CleanupQueue never calls closeAllForUGI itself, that FileSystem stays cached forever.
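Roughly, the losing interleaving looks like the sketch below. This is not the actual JobInProgress/CleanupQueue code; the class, thread and path names are placeholders, it just shows the ordering that leaks a cached FileSystem:

import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class StagingCleanupRaceSketch {
    public static void main(String[] args) throws Exception {
        final Configuration conf = new Configuration();
        final UserGroupInformation ugi = UserGroupInformation.getCurrentUser();

        // Stand-in for the CleanupQueue thread: deletes the staging dir as
        // the job's UGI. FileSystem.get here puts a new entry into
        // FileSystem$Cache keyed on this exact UGI instance.
        Thread cleanup = new Thread(new Runnable() {
            public void run() {
                try {
                    ugi.doAs(new PrivilegedExceptionAction<Void>() {
                        public Void run() throws Exception {
                            FileSystem fs = FileSystem.get(conf);
                            fs.delete(new Path("/tmp/staging-placeholder"), true);
                            return null; // note: no closeAllForUGI on this path
                        }
                    });
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, "cleanup");

        // Stand-in for JobInProgress: evicts every cached FileSystem for the
        // UGI. If this runs before the cleanup thread's FileSystem.get, the
        // FileSystem created by the cleanup thread is cached afterwards and
        // nothing ever removes it, so the JobTracker heap grows per job.
        FileSystem.closeAllForUGI(ugi);

        cleanup.start();
        cleanup.join();
    }
}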
Setting, for example, keep.failed.task.files=true or keep.task.files.pattern=<dummy text> prevents CleanupQueue from being invoked, which seems to solve my issue. You're left with junk in .staging, but that can be dealt with.
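For reference, the per-job form of that workaround looks roughly like this (a sketch; the class name is made up, and the same properties can of course be set cluster-wide in mapred-site.xml instead):

import org.apache.hadoop.mapred.JobConf;

public class KeepTaskFilesWorkaround {
    public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Equivalent to keep.failed.task.files=true in the job config.
        conf.setKeepFailedTaskFiles(true);

        // Or the pattern form (keep.task.files.pattern); any value works here
        // since the point is only to suppress the cleanup pass.
        // conf.setKeepTaskFilesPattern("dummy");

        System.out.println("keep.failed.task.files = "
                + conf.get("keep.failed.task.files"));
    }
}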
From: Marcin Mejran [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 16, 2013 1:47 PM
To: [EMAIL PROTECTED]
Subject: Jobtracker memory issues due to FileSystem$Cache
We've recently run into JobTracker memory issues on our new Hadoop cluster. A heap dump shows thousands of copies of DistributedFileSystem kept in FileSystem$Cache, a bit over one per job run on the cluster, and their JobConf objects support this view. I believe these are created when the .staging directories get cleaned up, but I may be wrong about that.
From what I can tell in the dump, the username (probably not the UGI, hard to tell), scheme and authority parts of the Cache$Key are the same across multiple objects in FileSystem$Cache. I can only assume that the UserGroupInformation piece somehow differs each time it's created.
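If that assumption is right, the growth pattern would look roughly like the sketch below (the class and user name are made up): a fresh UGI per job means a new Cache$Key even though username, scheme and authority all match, and nothing removes the entry without a closeAllForUGI call.

import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class CacheGrowthSketch {
    public static void main(String[] args) throws Exception {
        final Configuration conf = new Configuration();

        // Same user, scheme and authority every time, but a brand new UGI
        // per iteration -- roughly one per job.
        for (int i = 0; i < 5; i++) {
            UserGroupInformation ugi =
                    UserGroupInformation.createRemoteUser("jobuser");
            FileSystem fs = ugi.doAs(new PrivilegedExceptionAction<FileSystem>() {
                public FileSystem run() throws Exception {
                    // A distinct UGI means a distinct cache entry, so this
                    // creates and caches a new FileSystem each time.
                    return FileSystem.get(conf);
                }
            });
            System.out.println("cached instance: " + System.identityHashCode(fs));
            // Without FileSystem.closeAllForUGI(ugi) these entries are never
            // evicted, and the cache keeps growing.
        }
    }
}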
We're using CDH4.2, MR1, CentOS 6.3 and Java 1.6_31. Kerberos, LDAP and so on are not enabled.
Is there any known reason for this type of behavior?