Re: want to try HBase on a large cluster running Lustre - any advice?
How would you handle a node failure? Do you have shared storage which
exports LUNs to the datanodes? The beauty of HBase + HDFS is that you can
afford to have nodes go down (depending on your replication policy).
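
For what it's worth, the replication policy here is the standard HDFS
block replication factor, set in hdfs-site.xml. A minimal sketch (3 is
the usual default; the value is only illustrative):

  <property>
    <!-- Number of copies HDFS keeps of each block; with 3 replicas,
         any two datanodes can fail without losing data. -->
    <name>dfs.replication</name>
    <value>3</value>
  </property>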

Lustre is a great high-performance scratch filesystem, but using it as a
backend for HBase isn't, I think, a good idea.

I suggest you buy a disk for each node and then run HDFS and HBase on top
of it. That's what I do.

On Mon, Dec 5, 2011 at 5:04 PM, Taylor, Ronald C <[EMAIL PROTECTED]> wrote:

> Hello Lars,
>
> Thanks for your previous help. Got a new question for you.  I now have the
> opportunity to try using Hadoop and HBase on a newly installed cluster
> here, at a nominal cost. A lot of compute power (480+ nodes, 16 cores per
> node going up to 32 by the end of FY12, 64 GB RAM per node, with a few fat
> nodes with 256 GB). One local drive of 1 TB per node, and a four-petabyte
> Lustre file system. Hadoop jobs are already running on this new cluster, on
> terabyte size data sets.
>
> Here's the drawback: I cannot permanently store HBase tables on local
> disk. After a job finishes, the disks are reclaimed. So - if I want to
> build a continuously available data warehouse (basically for analytics
> runs, not for real-time web access by a large community at present - just
> me and other internal bioinformatics folk here at PNNL), I need to put the
> HBase tables on the Lustre file system.
>
> Now, all the nodes in this cluster have a very fast InfiniBand QDR network
> interconnect. I think it's something like 40 gigabits/sec, as compared to
> the 1 gigabit/sec that you might see in a run-of-the-mill Hadoop cluster.
> And I just read a couple of white papers that say that if the network
> interconnect is good enough, the loss of data locality when you use Lustre
> with Hadoop is not such a bad thing. That is, I Googled and found several
> papers on HDFS vs Lustre. The latest one I found (2011) is a white paper
> from a company called Xyratex. Here's a quote from it:
>
> The use of clustered file systems as a backend for Hadoop storage has been
> studied previously. The performance of distributed file systems such as
> Lustre, Ceph, PVFS, and GPFS with Hadoop has been compared to that of
> HDFS. Most of these investigations have shown that non-HDFS file systems
> perform more poorly than HDFS, although with various optimizations and
> tuning efforts, a clustered file system can reach parity with HDFS.
> However, a consistent limitation in the studies of HDFS and non-HDFS
> performance with Hadoop is that they used the network infrastructure to
> which Hadoop is limited, TCP/IP, typically over 1 GigE. In HPC
> environments, where much faster network interconnects are available,
> significantly better clustered file system performance with Hadoop is
> possible.
>
> Anyway, I am not principally worried about speed or efficiency right now -
> this cluster is big enough that even if I do not use it most efficiently,
> I'll still be doing better than with my very small current cluster, which
> has very limited RAM and antique processors.
>
> My question is: will HBase work at all on Lustre? That is, on pp. 52-54 of
> your O'Reilly HBase book, you say that
>
> "... you are not locked into HDFS because the "FileSystem" used by HBase
> has a pluggable architecture and can be used to replace HDFS with any other
> supported system. The possibilities are endless and waiting for the brave
> at heart."  ... "You can select a different filesystem implementation by
> using a URI pattern, where the scheme (the part before the first ":",
> i.e., the colon) part of the URI identifies the driver to be used."
>
> We use HDFS by setting the URI to
>
>  hdfs://<namenode>:port/<path>
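>
> For example, in hbase-site.xml it looks something like this (a sketch;
> "namenode" and port 9000 stand in for our actual namenode address):
>
>  <property>
>    <!-- Root directory for HBase data; here it lives in HDFS. -->
>    <name>hbase.rootdir</name>
>    <value>hdfs://namenode:9000/hbase</value>
>  </property>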
>
> And you say that to simply use the local file system on a desktop Linux box
> (which would not replicate data or maintain copies of the files - no fault
> tolerance) one uses
>
>  file:///<path>
>
> So - can I simply change this one param, and point HBase to a location on
> the Lustre file system?
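>
> Something like the following, I imagine (a sketch; /lustre/hbase is a
> hypothetical path on our Lustre mount):
>
>  <property>
>    <!-- Point HBase at a directory on the Lustre mount instead of HDFS;
>         fault tolerance would then be Lustre's job, not HDFS's. -->
>    <name>hbase.rootdir</name>
>    <value>file:///lustre/hbase</value>
>  </property>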