Hadoop, mail # user - How is hadoop going to handle the next generation disks?

Re: How is hadoop going to handle the next generation disks?
Edward Capriolo 2011-04-08, 17:59
On Fri, Apr 8, 2011 at 12:24 PM, sridhar basam <[EMAIL PROTECTED]> wrote:
>
> BTW this is on systems which have a lot of RAM and aren't under high load.
> If you find that your system is evicting dentries/inodes from its cache, you
> might want to experiment with dropping vm.vfs_cache_pressure from its default
> so that they are preferred over the pagecache. At the extreme, setting it to
> 0 means they are never evicted.
>  Sridhar
>
> On Fri, Apr 8, 2011 at 11:37 AM, sridhar basam <[EMAIL PROTECTED]> wrote:
>>
>> How many files do you have per node? What I find is that most of my
>> inodes/dentries are almost always cached, so even on a host with hundreds of
>> thousands of files a 'du -sk' generally only uses high i/o for a couple of
>> seconds. I am using 2TB disks too.
>>  Sridhar
>>
>>
>> On Fri, Apr 8, 2011 at 12:15 AM, Edward Capriolo <[EMAIL PROTECTED]>
>> wrote:
>>>
>>> I have a 0.20.2 cluster. I notice that our nodes with 2 TB disks waste
>>> tons of disk io doing a 'du -sk' of each data directory. Instead of
>>> 'du -sk', why not just do this with java.io.File? How is this going to
>>> work with 4TB, 8TB disks and up? It seems like calculating used and
>>> free disk space could be done in a better way.
>>>
>>> Edward
>>
>
>

Right. Most inodes stay cached when:

1) the disks are small
2) the load is light.

But that is not the case with Hadoop.
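
For reference, the vm.vfs_cache_pressure knob Sridhar mentions is an ordinary
Linux sysctl. Below is a minimal, illustrative-only sketch (in Java, since that
is what the datanode runs) that just reads the current value; the class name is
made up here, and only the /proc path and the semantics noted in the comments
come from the kernel docs and the quote above.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Sketch: print the current value of the sysctl discussed above.
    // The default is 100; lower values make the kernel prefer keeping
    // dentries/inodes over pagecache, and 0 means they are never reclaimed.
    // Changing it is still done as root, e.g. sysctl -w vm.vfs_cache_pressure=50
    public class VfsCachePressure {
      public static void main(String[] args) throws IOException {
        BufferedReader r = new BufferedReader(
            new FileReader("/proc/sys/vm/vfs_cache_pressure"));
        try {
          System.out.println("vm.vfs_cache_pressure = " + r.readLine());
        } finally {
          r.close();
        }
      }
    }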

Making the problem worse: Hadoop seems to issue 'du -sk' for all disks at
the same time, which pulverises the cache.
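
For illustration, here is roughly what that per-directory total costs, whether
it is computed by shelling out to 'du -sk' or by walking the tree with
java.io.File: every file under the data directory has to be stat'ed, so each
one needs its inode/dentry brought into (or kept in) cache, and doing that for
every disk at once is what churns the cache. This is a rough sketch, not
Hadoop's actual code, and the path is just an example.

    import java.io.File;

    public class DuStyleUsage {
      // Recursively sum file sizes, the way 'du' does: one stat per entry.
      static long usedBytes(File dir) {
        long total = 0;
        File[] entries = dir.listFiles();
        if (entries == null) {
          return 0; // unreadable, or not a directory
        }
        for (File f : entries) {
          total += f.isDirectory() ? usedBytes(f) : f.length();
        }
        return total;
      }

      public static void main(String[] args) {
        // Example path only; substitute one of your dfs.data.dir volumes.
        System.out.println(usedBytes(new File("/data/1/dfs/dn")) / 1024 + " KB");
      }
    }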

All this to calculate a size that is typically within .01% of what a
df estimate would tell us.
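
For comparison, the java.io.File route suggested in the original mail already
exists in Java 6: getTotalSpace()/getFreeSpace()/getUsableSpace() are
statfs-style queries like 'df', so no per-file scan is needed. The trade-off is
that they report whole-partition numbers rather than a per-directory sum, so
anything else stored on the same partition gets counted too. A minimal sketch,
with an example path:

    import java.io.File;

    public class DfStyleUsage {
      public static void main(String[] args) {
        // Example path only; substitute one of your dfs.data.dir volumes.
        File volume = new File(args.length > 0 ? args[0] : "/data/1/dfs/dn");
        long total  = volume.getTotalSpace();   // size of the partition
        long free   = volume.getFreeSpace();    // unallocated bytes on it
        long usable = volume.getUsableSpace();  // bytes available to this JVM
        System.out.println("total=" + total + " used=" + (total - free)
            + " usable=" + usable);
      }
    }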