Hadoop, mail # user - Is SAN storage is a good option for Hadoop ?


Re: Is SAN storage is a good option for Hadoop ?
Steve Loughran 2011-09-29, 13:06
On 29/09/11 13:28, Brian Bockelman wrote:
>
> On Sep 29, 2011, at 1:50 AM, praveenesh kumar wrote:
>
>> Hi,
>>
>> I want to know whether we can use SAN storage for a Hadoop cluster setup.
>> If yes, what are the best practices?
>>
>> Is it a good approach, considering that "the underlying power of Hadoop
>> is co-locating the processing power (CPU) with the data storage, and thus it
>> must be local storage to be effective"?
>> *But is it still right to say "local is better" when I
>> have a single local 5400 RPM IDE drive, which would be dramatically slower
>> than SAN storage striped across many drives spinning at 10k RPM and
>> accessed via Fibre Channel?*
>
> Hi Praveenesh,
>
> Two things:
> 1) If the option is a single 5400 RPM IDE drive (you can still buy those?) versus a high-end SAN, the high-end SAN is going to win.  That's often a false comparison: the question is usually "What can I buy for $50k?".  In that case (setting aside organizational politics), you can buy more spindles in the "traditional" Hadoop setup than with the SAN.
>    - Also, if you're latency-limited, you're likely working against yourself.  The best thing I ever did for my organization was make our software work just as well with 100ms latency as with 1ms latency.
> 2) As Paul pointed out, you have to ask yourself whether the SAN is shared or dedicated.  Many SANs don't have the ability to strongly partition workloads between users.
>
> Brian
>

One more: a SAN is a SPOF. [Gray05] includes the impact of a SAN outage on
MS TerraServer, while [Jiang08] provides evidence that entry-level
Fibre Channel storage is less reliable than SATA due to interconnects.

Anyone who criticises the NameNode for being a SPOF and relies on a SAN
instead is missing something obvious.
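
A rough sketch of the arithmetic behind the SPOF point, in Python. The
probabilities are hypothetical placeholders, not measurements from [Gray05]
or [Jiang08], and the function name is made up for illustration. It also
assumes DataNode failures are independent, which real clusters only
approximate (shared racks, switches, and power break that assumption).

```python
def all_replicas_down(p_node_down: float, replicas: int) -> float:
    """Probability that every replica of a block is unavailable at once,
    assuming node failures are independent (a simplification)."""
    return p_node_down ** replicas


# Hypothetical figures, chosen only to illustrate the shape of the argument.
P_SAN_DOWN = 0.001   # assumed chance the single shared SAN is unreachable
P_NODE_DOWN = 0.01   # assumed chance any one DataNode is down
REPLICAS = 3         # HDFS default replication factor

# With one SAN behind every node, a SAN outage takes out all the data.
san_unavailable = P_SAN_DOWN

# With local disks, a block is lost to readers only if all replicas are down.
hdfs_block_unavailable = all_replicas_down(P_NODE_DOWN, REPLICAS)

print(f"data unavailable behind one SAN:    {san_unavailable:.6f}")
print(f"block unavailable with {REPLICAS} replicas: {hdfs_block_unavailable:.6f}")
```

Under these assumed numbers, each individual node is ten times less reliable
than the SAN, yet a replicated block is still far less likely to be
unreachable, because no single component sits in front of all the copies.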

[Gray05] Empirical Measurements of Disk Failure Rates and Error Rates
[Jiang08] Are disks the dominant contributor for storage failures?