Re: Question about disk space allocation in hadoop
Yu Li 2010-07-01, 07:51
Thanks a lot for sharing your knowledge. I'll investigate further and
try these on my cluster; hopefully one of them gives me a good
solution :)
2010/6/30 Chris Smith <csmithx+[EMAIL PROTECTED]>:
> Some thoughts on how to restrict the temporary data, but I have only
> tried (a) in anger:
> a) Partition your disks into HDFS and intermediate temp partitions
> of the relevant size. This gives a fixed separation but is
> difficult/impossible to modify on a busy cluster especially as there
> may be no way of unloading/recovering the data stored in HDFS if you
> make a mistake resizing partitions;
> b) Implement disk quotas and set relevant hard and soft limits on
> the relevant root directories for intermediate space. This gives you
> the flexibility to change the limits when required but as the limits
> are per user/group some thought may be required as to which user/group
> the limits apply to. There may also be a performance impact?
> You could combine this with setting the "dfs.datanode.du.reserved" value
> in $HADOOP_HOME/conf/hdfs-site.xml to limit HDFS disk usage.
> c) Implement intermediate data space as a loopback file, see:
> This example implements a temporary loopback filesystem on an iSCSI
> mounted Lustre filesystem but the principles are the same. There are
> some performance benchmarks linked to in section 3. The intermediate
> temp data space is limited by the size of the loopback file created.
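
For what it's worth, the arithmetic behind the dfs.datanode.du.reserved
suggestion in (b) can be sketched roughly as below. This is only a sketch
under assumptions: the 1 TiB disk size is a made-up example, the 25% share
comes from the original question, and split_disk and property_xml are
hypothetical helper names, not part of Hadoop. As far as I know the
property is a per-volume byte count that the datanode leaves free for
non-HDFS use, so it caps HDFS rather than the temp data itself; capping
the temp data still needs one of (a), (b) or (c).

```python
def split_disk(capacity_bytes, temp_fraction=0.25):
    """Split one disk's capacity into (hdfs_bytes, temp_bytes).

    temp_fraction is the share set aside for intermediate temp data;
    0.25 matches the 25% example in the original question.
    """
    if not 0.0 <= temp_fraction <= 1.0:
        raise ValueError("temp_fraction must be between 0 and 1")
    temp_bytes = int(capacity_bytes * temp_fraction)
    hdfs_bytes = capacity_bytes - temp_bytes
    return hdfs_bytes, temp_bytes


def property_xml(name, value):
    """Format a Hadoop-style <property> element for a *-site.xml file."""
    return (
        "<property>\n"
        f"  <name>{name}</name>\n"
        f"  <value>{value}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    capacity = 2**40  # assumed example: one 1 TiB data disk
    hdfs_bytes, temp_bytes = split_disk(capacity, 0.25)
    # Reserve the temp share on each volume so HDFS cannot grow into it.
    print(property_xml("dfs.datanode.du.reserved", temp_bytes))
```

Anything else living on the same disk (OS files, logs) also has to fit
inside the reserved share, so in practice the value probably wants some
headroom on top of the temp budget.
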
> -----Original Message-----
> From: Yu Li [mailto:[EMAIL PROTECTED]]
> Sent: 30 June 2010 04:11
> To: [EMAIL PROTECTED]
> Subject: Re: Question about disk space allocation in hadoop
> Hi all,
> Does anybody have experience with this? Any comments/suggestions would be
> highly appreciated. Thanks.
> Best Regards,
> 2010/6/29 Yu Li <[EMAIL PROTECTED]>:
>> Hi all,
>> As we all know, machines in a Hadoop cluster may be both datanode and
>> tasktracker, so one machine may store both MR job intermediate data
>> and HDFS data. My question is: if we have more than one disk per node,
>> say 4 disks, and would like both job intermediate data and HDFS data
>> to be stored across all disks to reduce the I/O load on each single
>> disk, can we draw a line between the space used by the local FS and by
>> HDFS? For example, can we restrict the intermediate temp data to
>> occupy no more than 25% of the space on each disk?
>> Thanks in advance.
>> Best Regards,