<Not sure if my previous message made it through, as I just subscribed.>
I have read scattered documentation across the net, most of which says HDFS does not work well when a SAN is used to store data, while some say it is an emerging trend. I would love to know whether any tests have been performed that indicate in which respects direct storage excels or falls behind a SAN.
We are investigating whether direct storage is a better option than SAN storage for a modest cluster with data in the 100 TB range in steady state. The SAN can, of course, support an order of magnitude more IOPS than we need for now, but given that it is shared infrastructure and that our data size may grow, this may not remain an advantage in the future.
Another thing I am interested in: for MR jobs, where data locality is the key driver, how does that pan out when using a SAN instead of direct storage?
And of course, on the more subjective topics of availability and reliability when using a SAN for data storage in HDFS, I would love to hear your views.