HDFS, mail # user - Re: Auto clean DistCache?


Re: Auto clean DistCache?
Jean-Marc Spaggiari 2013-03-28, 16:02
Thanks Harsh. My issue was not related to the number of files/folders
but to the total size of the DistributedCache. The directory where
it's stored only has 7GB available... So I will set the limit to 5GB
with local.cache.size, or move it to the drives where I have the dfs
files stored.

Thanks,

JM
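
For reference, capping the cache at 5GB as described above would look roughly like the following in mapred-site.xml. This is a sketch based on the property named in the thread, with the value expressed in bytes (5368709120 = 5 GiB); it is not taken from the thread itself:

```xml
<!-- mapred-site.xml: cap the TaskTracker's distributed cache size -->
<property>
  <name>local.cache.size</name>
  <value>5368709120</value> <!-- bytes; 5 GiB instead of the 10 GB default -->
</property>
```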

2013/3/28 Harsh J <[EMAIL PROTECTED]>:
> The DistributedCache is cleaned automatically and no user intervention
> (aside of size limitation changes, which may be an administrative
> requirement) is generally required to delete the older distributed
> cache files.
>
> This is observable in code and is also noted in TDG, 2ed.:
>
> Tom White:
> """
> The tasktracker also maintains a reference count for the number of
> tasks using each file in the cache. Before the task has run, the
> file’s reference count is incremented by one; then after the task has
> run, the count is decreased by one. Only when the count reaches zero
> is it eligible for deletion, since no tasks are using it. Files are
> deleted to make room for a new file when the cache exceeds a certain
> size—10 GB by default. The cache size may be changed by setting the
> configuration property local.cache.size, which is measured in bytes.
> """
>
> Also, the maximum number of allowed directories is checked
> automatically today, so as not to violate the OS's limits.
>
> On Wed, Mar 27, 2013 at 7:07 PM, Jean-Marc Spaggiari
> <[EMAIL PROTECTED]> wrote:
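
The reference-counting and size-threshold behavior described in the quoted TDG passage can be sketched as follows. This is a simplified illustration of the policy, not the actual TaskTracker code; all names are invented:

```python
# Sketch of the cleanup policy: each cached file carries a reference
# count of tasks using it; a file becomes eligible for deletion only
# at refcount zero, and deletion happens only when the total cache
# size exceeds a threshold (10 GB by default, cf. local.cache.size).

class CacheSketch:
    def __init__(self, max_bytes=10 * 1024**3):
        self.max_bytes = max_bytes
        self.files = {}  # name -> [size_bytes, refcount]

    def task_start(self, name, size):
        entry = self.files.setdefault(name, [size, 0])
        entry[1] += 1  # before the task runs, increment the refcount

    def task_end(self, name):
        self.files[name][1] -= 1  # after the task runs, decrement it
        self._evict_if_needed()

    def _evict_if_needed(self):
        # Evict only zero-reference files, and only while over the limit.
        while sum(s for s, _ in self.files.values()) > self.max_bytes:
            idle = [n for n, (_, rc) in self.files.items() if rc == 0]
            if not idle:
                break  # every file is still in use; nothing is eligible
            del self.files[idle[0]]
```

For example, with a 100-byte limit and two 60-byte files both in use, nothing is evicted even though the cache is over the limit; once one file's last task finishes, that file becomes eligible and is removed.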
>> Oh! Good to know! It keeps track even of month-old entries??? Is there no TTL?
>>
>> I was not able to find the documentation for local.cache.size or
>> mapreduce.tasktracker.cache.local.size in the 1.0.x branch. Do you
>> know where I can find it?
>>
>> Thanks,
>>
>> JM
>>
>> 2013/3/27 Koji Noguchi <[EMAIL PROTECTED]>:
>>>> Else, I will go for a custom script to delete all directories (and their content) older than 2 or 3 days…
>>>>
>>> The TaskTracker (or NodeManager in 2.*) keeps the list of dist cache entries in memory.
>>> So if an external process (like your script) starts deleting dist cache files, there will be an inconsistency and you'll start seeing task initialization failures due to file-not-found errors.
>>>
>>> Koji
>>>
>>>
>>> On Mar 26, 2013, at 9:00 PM, Jean-Marc Spaggiari wrote:
>>>
>>>> In the situation I faced it was really a disk space issue, not related
>>>> to the number of files. It was writing to a small partition.
>>>>
>>>> I will try local.cache.size or
>>>> mapreduce.tasktracker.cache.local.size to see if I can keep the
>>>> final total size under 5GB... Else, I will go for a custom script to
>>>> delete all directories (and their content) older than 2 or 3 days...
>>>>
>>>> Thanks,
>>>>
>>>> JM
>>>>
>>>> 2013/3/26 Abdelrahman Shettia <[EMAIL PROTECTED]>:
>>>>> Let me clarify: if there are lots of files or directories, up to 32K
>>>>> (depending on the OS/filesystem configuration), in those distributed cache
>>>>> dirs, the OS will not be able to create any more files/dirs, and thus M-R jobs
>>>>> won't get initiated on those TaskTracker machines. Hope this helps.
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> On Tue, Mar 26, 2013 at 1:44 PM, Vinod Kumar Vavilapalli
>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>
>>>>>> All the files are never opened at the same time, so you shouldn't see
>>>>>> any "# of open files exceeded" error.
>>>>>>
>>>>>> Thanks,
>>>>>> +Vinod Kumar Vavilapalli
>>>>>> Hortonworks Inc.
>>>>>> http://hortonworks.com/
>>>>>>
>>>>>> On Mar 26, 2013, at 12:53 PM, Abdelrhman Shettia wrote:
>>>>>>
>>>>>> Hi JM ,
>>>>>>
>>>>>> Actually, these dirs need to be purged by a script that keeps the last 2
>>>>>> days' worth of files; otherwise you may run into a "# of open files
>>>>>> exceeded" error.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>> On Mar 25, 2013, at 5:16 PM, Jean-Marc Spaggiari <[EMAIL PROTECTED]>
>>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> Each time my MR job is run, a directory is created on the TaskTracker
>>>>>>
>>>>>> under mapred/local/taskTracker/hadoop/distcache (based on my