Hadoop >> mail # user >> Is SAN storage is a good option for Hadoop ?


Re: Is SAN storage is a good option for Hadoop ?
On 29/09/11 13:28, Brian Bockelman wrote:
>
> On Sep 29, 2011, at 1:50 AM, praveenesh kumar wrote:
>
>> Hi,
>>
>> Can we use SAN storage for a Hadoop cluster setup?
>> If yes, what are the best practices?
>>
>> Is it a good approach, considering that "the underlying power of Hadoop
>> is co-locating the processing power (CPU) with the data storage, and thus it
>> must be local storage to be effective"?
>> *But is it still right to say "local is better" when I
>> have a single local 5400 RPM IDE drive, which would be dramatically slower
>> than SAN storage striped across many drives spinning at 10k RPM and
>> accessed via Fibre Channel?*
>
> Hi Praveenesh,
>
> Two things:
> 1) If the option is a single 5400 RPM IDE drive (you can still buy those?) versus a high-end SAN, the high-end SAN is going to win.  That's often a false comparison: the real question is often "What can I buy for $50k?".  In that case (setting aside organizational politics), you can buy more spindles in the "traditional" Hadoop setup than with the SAN.
>    - Also, if you're latency limited, you're likely working against yourself.  The best thing I ever did for my organization was make our software work just as well with 100ms latency as with 1ms latency.
> 2) As Paul pointed out, you have to ask yourself whether the SAN is shared or dedicated.  Many SANs don't have the ability to strongly partition workloads between users.
>
> Brian
>
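
Brian's "What can I buy for $50k?" point can be turned into a quick back-of-envelope calculation. The sketch below is only illustrative: the per-spindle costs and streaming rates are made-up assumptions, not quotes or benchmarks, and the helper name aggregate_throughput_mb_s is hypothetical. Plug in your own prices before drawing conclusions.

```python
# Back-of-envelope comparison. Every number here is an illustrative
# assumption, not a vendor quote or a measured benchmark.

def aggregate_throughput_mb_s(budget_usd, cost_per_spindle_usd, mb_s_per_spindle):
    """Return (spindle count, combined sequential throughput in MB/s)."""
    spindles = int(budget_usd // cost_per_spindle_usd)
    return spindles, spindles * mb_s_per_spindle

BUDGET = 50_000  # the "$50k" from the thread

# Assumed all-in cost per spindle: the drive plus its share of the server
# (local case) or of the controller, fabric and HBAs (SAN case).
local = aggregate_throughput_mb_s(BUDGET, cost_per_spindle_usd=500, mb_s_per_spindle=80)
san = aggregate_throughput_mb_s(BUDGET, cost_per_spindle_usd=2_500, mb_s_per_spindle=120)

print("local disks: %d spindles, ~%d MB/s aggregate" % local)
print("SAN:         %d spindles, ~%d MB/s aggregate" % san)
```

With these assumed figures the individual SAN drive is faster, but the local-disk setup buys five times as many spindles and so more aggregate bandwidth. The SAN figure also ignores any ceiling imposed by the controller or the fabric.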

One more: SAN is a SPOF. [Gray05] includes the impact of a SAN outage on
MS TerraServer, while [Jiang08] provides evidence that entry level
FibreChannel storage is less reliable than SATA due to interconnects.

Anyone who criticises the NameNode for being a SPOF and relies on a SAN
instead is missing something obvious.
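
A rough way to see the SPOF argument in numbers is sketched below. It assumes failures are independent and uses a made-up downtime probability; both function names are hypothetical. It covers only data availability and says nothing about the NameNode itself.

```python
# Rough availability sketch. Assumes independent failures and an
# illustrative downtime probability; not a measurement.

def unavailable_on_shared_array(p_down):
    """All data lives behind one array: if it is down, everything is."""
    return p_down

def unavailable_with_replicas(p_down, replicas=3):
    """A block is unreachable only if every node holding a replica is down."""
    return p_down ** replicas

P_DOWN = 0.01  # assumed chance that a given array or node is down

print("single shared array:", unavailable_on_shared_array(P_DOWN))
print("3 replicas on separate nodes:", unavailable_with_replicas(P_DOWN))
```

Three replicas is the HDFS default replication factor. Real failures are correlated (shared racks, switches, power), so treat the replicated figure as an optimistic bound.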

[Gray05] Empirical Measurements of Disk Failure Rates and Error Rates
[Jiang08] Are disks the dominant contributor for storage failures?