1) But I thought that this sort of thing (yes, even on Linux) becomes
important when you have large amounts of data, because the way files are
written can cause issues on highly packed drives.
2) Probably this is the key point: HDFS I/O is affected most by file
size, which matters far more than the occasional minor disk
inhomogeneity. So the focus is on distributing and replicating files
rather than micro-optimizing the layout of individual files.
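For what it's worth, here is a rough, untested sketch of what that looks
like from the client side via the FileSystem API (the path is made up):
a big file resolves to just a handful of large block locations, so there
is little left to micro-optimize per file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.util.Arrays;

public class BlockReport {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical path; substitute any file on your cluster.
    FileStatus status = fs.getFileStatus(new Path("/data/example.seq"));
    // Each BlockLocation is one big (64-128 MB by default) chunk,
    // so even a multi-GB file is only a few dozen I/O units.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          b.getOffset(), b.getLength(), Arrays.toString(b.getHosts()));
    }
  }
}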
On Tue, Nov 13, 2012 at 4:10 PM, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
> People are welcome to complement this, but I guess the answer is:
> 1) Hadoop is not running on Windows. (I am not sure whether Microsoft made
> any statement about the OS used for Hadoop on Azure.)
> 2) Files are written in one go with big blocks. (And actually, file
> fragmentation is not the only issue. The many-small-files 'issue' is, in
> the end, a data fragmentation issue too and has an impact on read
> performance.)
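> A minimal, untested sketch of point 2: a single create() call streams the
> whole file, and the namenode hands out fresh blocks as each one fills, so
> there are no in-place rewrites to fragment. The 128 MB figure is just an
> example of a per-file override:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class OneShotWrite {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     FSDataOutputStream out = fs.create(new Path("/tmp/big-file"),
>         true,                 // overwrite
>         4096,                 // client-side buffer size
>         (short) 3,            // replication factor
>         128L * 1024 * 1024);  // per-file block size override
>     // The payload is written sequentially, block by block.
>     byte[] payload = new byte[8 * 1024 * 1024];
>     out.write(payload);
>     out.close();
>   }
> }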
> Bertrand Dechoux
> On Tue, Nov 13, 2012 at 9:30 PM, Jay Vyas <[EMAIL PROTECTED]> wrote:
>> How does HDFS deal with optimization of file streaming? Do data nodes
>> have any optimizations at the disk level for dealing with fragmented files?
>> I assume not, but I'm just curious whether this is at all in the works, or if
>> there are java-y ways of dealing with a long-running set of files in an HDFS
>> cluster. Maybe, for example, data nodes could log the amount of time spent
>> on I/O for certain files as a way of reporting whether or not
>> defragmentation needed to be run on a particular node in a cluster.
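>> Something like this on the client side could be a crude stand-in for what
>> a datanode might record internally (purely hypothetical, just to
>> illustrate the idea: a file that consistently reads slower than its peers
>> might flag a problem disk):
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FSDataInputStream;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>>
>> public class ReadTimer {
>>   public static void main(String[] args) throws Exception {
>>     FileSystem fs = FileSystem.get(new Configuration());
>>     Path file = new Path("/data/example.seq"); // hypothetical path
>>     byte[] buf = new byte[64 * 1024];
>>     long bytes = 0;
>>     long start = System.nanoTime();
>>     FSDataInputStream in = fs.open(file);
>>     int n;
>>     while ((n = in.read(buf)) > 0) {
>>       bytes += n;
>>     }
>>     in.close();
>>     double secs = (System.nanoTime() - start) / 1e9;
>>     // Throughput per file; persistently slow files could hint at
>>     // disk trouble on the nodes holding their blocks.
>>     System.out.printf("%s: %.1f MB/s%n",
>>         file, bytes / (1024.0 * 1024.0) / secs);
>>   }
>> }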
>> Jay Vyas