HBase, mail # user - Cluster Size/Node Density


RE: Cluster Size/Node Density
Jonathan Gray 2010-12-17, 22:46
You meant 15TB/45TB right?

Your numbers seem in the realm of possibility.  You should try it out on your 10-node cluster if you can.  I've done applications like this in the past, with a large dataset and purely random reads, and HBase performed well.  I also took advantage of HFileOutputFormat to write the data quickly.  But it was not 5,000 qps; that app was only in the hundreds.
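Roughly what I mean by the HFileOutputFormat approach, as an untested sketch: it assumes the current MapReduce client API, a hypothetical table "mytable" with a single column family "d", and tab-separated rowkey/value input.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadJob {

  // Hypothetical mapper: parses "rowkey<TAB>value" lines into Puts.
  static class TsvMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(fields[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(fields[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-mytable");
    job.setJarByClass(BulkLoadJob.class);
    job.setMapperClass(TsvMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // raw input files
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HFile output dir

    // Sets up the partitioner, reducer and output format so the generated
    // HFiles line up with the table's existing region boundaries.
    HTable table = new HTable(conf, "mytable");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The generated HFiles still have to be moved into the table afterwards; if I remember right, the completebulkload tool (LoadIncrementalHFiles) handles that step.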

Ensure that your reads are Get operations in HBase, as those will use HDFS pread instead of seek/read.  For this application, you absolutely must be using pread.
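A point read with a Get looks something like this (sketch only, same hypothetical table and family names as above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomReadCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // hypothetical table name

    // A point Get for a single row/column.  On the region server this turns
    // into positional reads (pread) against HDFS, rather than the shared
    // seek/read path, so concurrent readers don't step on each other.
    Get get = new Get(Bytes.toBytes(args[0]));    // row key from the command line
    get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("v"));
    System.out.println(value == null ? "miss" : Bytes.toString(value));

    table.close();
  }
}

As a sanity check on your numbers: if every read really goes to disk, each reader thread caps out around 1/latency reads per second, so 120 readers at 25ms is about 4,800 qps (the ~5,000 you mention), or roughly 160 reads/sec/node across 30 nodes.  That per-node figure is what you want to compare against what your disks can actually deliver.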

Good luck.  I'm interested in seeing how you can get HBase to perform; we are here to help if you have any issues.

JG

> -----Original Message-----
> From: Wayne [mailto:[EMAIL PROTECTED]]
> Sent: Friday, December 17, 2010 2:28 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Cluster Size/Node Density
>
> What can we expect from HDFS in terms of random reads? It is our own load,
> so we can "shape" it to a degree to better match how HBase/HDFS prefers to
> function. We have a 10-node cluster we have been testing another NoSQL
> solution on, and we can try to test with that, but I am trying to do a gut
> check on what we are attempting before moving to a different NoSQL solution
> (and wasting more R&D time). Concurrent reads, and read latency that
> degrades as the total data stored per node grows and more reads hit disk,
> are the wall we have hit with the other NoSQL solution. We totally
> understand the limitations of disks and disk I/O; that has always been the
> enemy of large databases. SSDs and memory are currently too expensive to
> solve our problem. We want our limit to be what the physical disks can
> handle, and everything else to be a thin layer on top. We are looking for a
> solution where we know what each node can handle in terms of concurrent
> read/write load, and then we decide on the number of nodes based on the
> required Gets/Puts per second.
>
> Can we store 15GB of data (before replication - 45GB+ after) on 30 nodes,
> and sustain 120 disk-based readers returning data consistently in under
> 25ms? That is 40 reads/sec/thread, or around 5,000 qps. Is this specific
> scenario in the realm of the possible, making all kinds of assumptions? If
> 25ms is too fast, is 50ms more likely? Is 100ms? If we assume 100ms, can it
> handle 240 readers at that rate? Concurrency will go down once disk
> utilization is saturated and latency is fundamentally bound by random disk
> I/O latency, but we are looking for what HBase can handle.
>
> I am sorry for such general questions, but I am trying to do a gut check before
> diving into a long testing scenario.
>
> Thanks.
>
>
> On Fri, Dec 17, 2010 at 4:30 PM, Jonathan Gray <[EMAIL PROTECTED]> wrote:
>
> > You absolutely need to do some testing and benchmarking.
> >
> > This sounds like the kind of application that will require lots of
> > tuning to get right.  It also sounds like the kind of thing HDFS is
> > typically not very good at.
> >
> > There is an increasing amount of activity in this area (optimizing
> > HDFS for random reads) and lots of good ideas.  HDFS-347 would
> > probably help tremendously for this kind of high random read rate,
> > bypassing the DN completely.
> >
> > JG
> >
> > > -----Original Message-----
> > > From: Wayne [mailto:[EMAIL PROTECTED]]
> > > Sent: Friday, December 17, 2010 12:29 PM
> > > To: [EMAIL PROTECTED]
> > > Subject: Re: Cluster Size/Node Density
> > >
> > > Sorry, I am sure my questions were far too broad to answer.
> > >
> > > Let me *try* to ask more specific questions. Assuming all data requests
> > > are cold (random reading pattern) and everything comes from the disks
> > > (no block cache), what level of concurrency can HDFS handle? Almost all
> > > of the load is controlled data processing, but we have to do a lot of
> > > work at night during the batch window, so something in the 15-20,000 QPS
> > > range would meet current worst-case requirements. How many nodes would