Re: Multiple dfs.data.dir vs RAID0
Chris Embree 2013-02-11, 02:22
Interesting question. You'd probably need to benchmark to prove it out.
I'm not sure of the exact details of how HDFS distributes data across disks, but it should compare
pretty well to hardware RAID.
Conceptually, HDFS should be able to outperform a RAID solution, since
HDFS "knows" more about the data being written. One of the benefits of
HDFS is being able to buy cheaper hardware and still get good
performance.
We bought cheap DL165s for our datanodes: 4x 2TB drives with no RAID.
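
To make the comparison a bit more concrete, here is a toy Python sketch of
what listing several directories in dfs.data.dir (a comma-separated list)
amounts to, as far as I understand it: the datanode hands out whole blocks
to the directories in turn. This is only an illustration, not HDFS's actual
code. The function name place_blocks and the mount points are made up for
the example, and it assumes a plain round-robin with no free-space checks.

```python
from itertools import cycle


def place_blocks(num_blocks, data_dirs):
    """Assign whole blocks to directories in round-robin order (toy model)."""
    placement = {d: [] for d in data_dirs}
    next_dir = cycle(data_dirs)
    for block_id in range(num_blocks):
        # Each block lands entirely in one directory, i.e. on one disk.
        placement[next(next_dir)].append(block_id)
    return placement


# Hypothetical mount points, one per physical drive.
data_dirs = ["/data/1/dfs", "/data/2/dfs"]
layout = place_blocks(6, data_dirs)
for directory, blocks in layout.items():
    print(directory, blocks)
```

The point of the sketch is the granularity. With separate directories each
block sits on a single disk, so the parallelism comes from several blocks
being read or written at once. With RAID0 every block is split across both
disks, so a single stream can use both spindles, but one failed drive takes
the whole array with it.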
On Sun, Feb 10, 2013 at 8:57 PM, Jean-Marc Spaggiari <
[EMAIL PROTECTED]> wrote:
> I have a quick question regarding RAID0 performance vs multiple
> dfs.data.dir entries.
> Let's say I have 2 x 2TB drives.
> I can configure them as 2 separate drives mounted on 2 folders and
> assign them to Hadoop using dfs.data.dir. Or I can combine the 2 drives
> with RAID0 and assign them as a single folder to dfs.data.dir.
> With RAID0, the reads and writes are going to be spread over the 2
> disks. This significantly increases the speed. But if I put 2
> entries in dfs.data.dir, Hadoop is going to spread data over those 2
> directories too, and in the end, the results should be the same, no?
> Any experience/advice/results to share?