Re: Question about disk space allocation in hadoop
Yu Li 2010-07-01, 07:51
Hi Chris,
Thanks a lot for sharing your knowledge. I'll investigate further and try
these options on my cluster; hopefully one of them gives a good solution :)

Best Regards,
Carp

2010/6/30 Chris Smith <csmithx+[EMAIL PROTECTED]>:
> Some thoughts on how to restrict the temporary data, but I have only
> tried (a) in anger:
>
> a) Partition your disks into HDFS and intermediate temp partitions
> of the relevant size. This gives a fixed separation but is
> difficult/impossible to modify on a busy cluster, especially as there
> may be no way of unloading/recovering the data stored in HDFS if you
> make a mistake resizing partitions;
>
> b) Implement disk quotas and set relevant hard and soft limits on
> the relevant root directories for intermediate space. This gives you
> the flexibility to change the limits when required, but as the limits
> are per user/group some thought may be required as to which user/group
> the limits apply to. There may also be a performance impact?
>
> You could combine this with setting the “dfs.datanode.du.reserved” value
> in $HADOOP_HOME/conf/hdfs-site.xml for limiting HDFS disk usage.
>
> c) Implement intermediate data space as a loopback file, see:
> http://wiki.cita.utoronto.ca/mediawiki/index.php/Fake_Fast_Local_Disk
> This example implements a temporary loopback filesystem on an iSCSI
> mounted Lustre filesystem but the principles are the same. There are
> some performance benchmarks linked to in section 3. The intermediate
> temp data space is limited by the size of the loopback file created.
>
> Chris
>
> -----Original Message-----
> From: Yu Li [mailto:[EMAIL PROTECTED]]
> Sent: 30 June 2010 04:11
> To: [EMAIL PROTECTED]
> Subject: Re: Question about disk space allocation in hadoop
>
> Hi all,
>
> Does anybody have experience with this? Any comments/suggestions would be
> highly appreciated. Thanks.
>
> Best Regards,
> Carp
>
> 2010/6/29 Yu Li <[EMAIL PROTECTED]>:
>> Hi all,
>>
>> As we all know, machines in a hadoop cluster may be both datanode and
>> tasktracker, so one machine may store both MR job intermediate data
>> and HDFS data. My question is: if we have more than one disk per node,
>> say 4 disks, and would like both job intermediate data and HDFS data
>> to be stored across all disks to reduce the IO time of each single disk,
>> can we draw a line between the space of local FS and HDFS? For example,
>> restrict the intermediate temp data to occupy no more than 25% of the
>> space on each disk?
>> Thanks in advance.
>>
>> Best Regards,
>> Carp
>>
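
For anyone trying the "dfs.datanode.du.reserved" suggestion from option (b),
here is a minimal sketch of how the value for the 25% example in the original
question might be worked out. It assumes the property takes a byte count per
volume that the datanode leaves free for non-HDFS use. The helper names
(reserved_bytes, du_reserved_property) and the "/" mount point are made up for
illustration. Note that this only keeps HDFS out of the reserved space; it does
not by itself stop intermediate data from growing beyond it.

```python
import shutil


def reserved_bytes(total_bytes: int, non_dfs_fraction: float) -> int:
    """Bytes to keep free for non-HDFS data on one volume (hypothetical helper)."""
    if not 0.0 <= non_dfs_fraction <= 1.0:
        raise ValueError("non_dfs_fraction must be between 0 and 1")
    return int(total_bytes * non_dfs_fraction)


def du_reserved_property(reserved: int) -> str:
    """Render a <property> fragment to paste into hdfs-site.xml (hypothetical helper)."""
    return (
        "<property>\n"
        "  <name>dfs.datanode.du.reserved</name>\n"
        f"  <value>{reserved}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    # Assumption: "/" stands in for one of the data disks; use the real mount point.
    total = shutil.disk_usage("/").total
    print(du_reserved_property(reserved_bytes(total, 0.25)))
```

The printed fragment goes inside the <configuration> element of
$HADOOP_HOME/conf/hdfs-site.xml. If the disks differ in size, the smallest one
is probably the safest basis, since the sketch assumes a single value is applied
to every volume.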