On 29/09/11 13:28, Brian Bockelman wrote:
> On Sep 29, 2011, at 1:50 AM, praveenesh kumar wrote:
>> I want to know whether we can use SAN storage for a Hadoop cluster setup.
>> If yes, what are the best practices?
>> Is it a good approach, considering that "the underlying power of Hadoop
>> is co-locating the processing power (CPU) with the data storage and thus it
>> must be local storage to be effective"?
>> *But also, is it better to say "local is better" in the situation where I
>> have a single local 5400 RPM IDE drive, which would be dramatically slower
>> than SAN storage striped across many drives spinning at 10k RPM and
>> accessed via fiber channel ?*
> Hi Praveenesh,
> Two things:
> 1) If the option is a single 5400 RPM IDE drive (you can still buy those?) versus a high-end SAN, the high-end SAN is going to win. That's often a false comparison: the real question is usually "What can I buy for $50k?". In that case (setting aside organizational politics), you can buy more spindles in the "traditional" Hadoop setup than with the SAN.
> - Also, if you're latency limited, you're likely working against yourself. The best thing I ever did for my organization was make our software work just as well with 100ms latency as with 1ms latency.
> 2) As Paul pointed out, you have to ask yourself whether the SAN is shared or dedicated. Many SANs don't have the ability to strongly partition workloads between users.
One more: SAN is a SPOF. [Gray05] includes the impact of a SAN outage on
MS TerraServer, while [Jiang08] provides evidence that entry level
FibreChannel storage is less reliable than SATA due to interconnects.
Anyone who criticises the NameNode for being a SPOF and relies on a SAN
instead is missing something obvious.
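
To make Brian's "what can I buy for $50k?" point concrete, here is a rough back-of-the-envelope sketch. Only the $50k budget comes from the thread; the per-spindle costs and throughput figures are made-up placeholders, not measurements or vendor prices, so substitute your own quotes before drawing any conclusion.

```python
def spindles_and_throughput(budget_usd, cost_per_spindle_usd, mb_s_per_spindle):
    """Return (spindle count, aggregate sequential MB/s) for a fixed budget.

    Deliberately crude: ignores controllers, network, replication overhead
    and contention, all of which matter in a real comparison.
    """
    spindles = budget_usd // cost_per_spindle_usd
    return spindles, spindles * mb_s_per_spindle


BUDGET_USD = 50_000  # the figure from the thread

# HYPOTHETICAL inputs: all-in cost per spindle (server share or SAN share)
# and sustained MB/s per spindle. Replace with real numbers.
local = spindles_and_throughput(BUDGET_USD, cost_per_spindle_usd=500, mb_s_per_spindle=100)
san = spindles_and_throughput(BUDGET_USD, cost_per_spindle_usd=2500, mb_s_per_spindle=150)

print("local disks:", local)
print("SAN:        ", san)
```

With these placeholder inputs the faster-per-spindle option still loses on aggregate throughput, simply because the budget buys fewer spindles. Whether that holds for you depends entirely on the prices you are actually quoted.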
[Gray05] Empirical Measurements of Disk Failure Rates and Error Rates
[Jiang08] Are disks the dominant contributor for storage failures?