RE: Cluster Size/Node Density
You absolutely need to do some testing and benchmarking.

This sounds like the kind of application that will require lots of tuning to get right.  It also sounds like the kind of workload HDFS is typically not very good at.

There is an increasing amount of activity in this area (optimizing HDFS for random reads) and lots of good ideas.  HDFS-347 would probably help tremendously for this kind of high random read rate, bypassing the DN completely.
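
(For reference: at the time of this thread HDFS-347 was still an open issue. The short-circuit local read support that eventually came out of that work is switched on with roughly the client-side settings below; the property names are from later Hadoop releases and the socket path is only an example.)

    Configuration conf = new Configuration();
    conf.setBoolean("dfs.client.read.shortcircuit", true);                // let the DFS client read local blocks directly
    conf.set("dfs.domain.socket.path", "/var/run/hadoop-hdfs/dn_socket"); // example path; must match the DataNode's setting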

JG

> -----Original Message-----
> From: Wayne [mailto:[EMAIL PROTECTED]]
> Sent: Friday, December 17, 2010 12:29 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Cluster Size/Node Density
>
> Sorry, I am sure my questions were far too broad to answer.
>
> Let me *try* to ask more specific questions. Assuming all data requests are
> cold (random reading pattern) and everything comes from the disks (no
> block cache), what level of concurrency can HDFS handle? Almost all of the
> load is controlled data processing, but we have to do a lot of work at night
> during the batch window, so something in the 15-20,000 QPS range would
> meet current worst-case requirements. How many nodes would be required
> to effectively return data against a 50TB aggregate data store? Disk I/O
> presumably starts to break down at a certain point in terms of concurrent
> readers/node/disk.
> We control how many total concurrent readers there are, so if we
> can get 10ms response times with 100 readers, that might be better than
> 100ms responses from 1000 concurrent readers.
>
> Thanks.
>
>
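(A quick back-of-the-envelope check of the numbers above; the disks-per-node and random-reads-per-disk figures are assumptions for illustration, not anything measured in this thread, so substitute your own benchmark results.)

    // Aggregate throughput if every read is cold and goes to disk (no block cache).
    int targetQps       = 20000;                           // worst-case batch-window load quoted above
    int disksPerNode    = 12;                              // assumption
    int randomReadsDisk = 100;                             // assumption: on the order of 100 random reads/s per SATA disk
    int coldQpsPerNode  = disksPerNode * randomReadsDisk;  // ~1,200 cold reads/s per node
    int nodesNeeded     = (targetQps + coldQpsPerNode - 1) / coldQpsPerNode;  // ~17 nodes

    // Latency vs. concurrency trade-off: both cases below sustain the same aggregate rate.
    double qpsAt10ms  = 100 / 0.010;    // 100 readers at 10 ms each    -> 10,000 reads/s
    double qpsAt100ms = 1000 / 0.100;   // 1,000 readers at 100 ms each -> 10,000 reads/s
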
> On Fri, Dec 17, 2010 at 2:46 PM, Jean-Daniel Cryans
> <[EMAIL PROTECTED]> wrote:
>
> > Hi Wayne,
> >
> > This question has such a large scope, but is applicable to such a tiny
> > subset of workloads (e.g. yours), that fielding all the questions in
> > detail would probably end up just wasting everyone's cycles. So first,
> > I'd like to clear up some confusion.
> >
> > > We would like some help with cluster sizing estimates. We have 15TB
> > > of currently relational data we want to store in hbase.
> >
> > There's the 3x replication factor, but you also have to account for the
> > fact that each value is stored with its row key, family name, qualifier
> > and timestamp. That could be a lot more data to store, but at the same
> > time you can use LZO compression to bring that down ~4x.
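
(A minimal sketch of enabling LZO on a column family with the 0.90-era Java client; the table and family names are made up, and the LZO codec itself has to be installed on the cluster separately.)

    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor table = new HTableDescriptor("metrics");   // hypothetical table name
    HColumnDescriptor family = new HColumnDescriptor("d");      // a short family name keeps per-KeyValue overhead down
    family.setCompressionType(Compression.Algorithm.LZO);       // org.apache.hadoop.hbase.io.hfile.Compression
    table.addFamily(family);
    admin.createTable(table);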
> >
> > > How many nodes, regions, etc. are we going to need?
> >
> > You don't really have control over regions; they are created for
> > you as your data grows.
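
(Rough region arithmetic, as a sketch: splits are driven by hbase.hregion.max.filesize, which defaulted to 256 MB in this era, so for a dataset of this size you would almost certainly raise it.)

    long maxStoreFileSize = 256L << 20;                 // 256 MB default split size (hbase.hregion.max.filesize)
    long rawData          = 15L << 40;                  // ~15 TB of source data, before replication and key overhead
    long roughRegionCount = rawData / maxStoreFileSize; // ~61,000 regions at the default, which argues for a much larger split size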
> >
> > > What will our read latency be for 30 vs. 100? Sure we can pack 20
> > > nodes
> > with 3TB
> > > of data each but will it take 1+s for every get?
> >
> > I'm not sure what kind of back-of-the-envelope calculations took you
> > to 1sec, but latency will be strictly determined by concurrency and
> > actual machine load. Even if you were able to pack 20TB in one node
> > but only used a tiny portion of it, you would still get sub-100ms
> > latencies. Or if you have only 10GB on that node but it's getting
> > hammered by 10000 clients, then you should expect much higher
> > latencies.
> >
> > > Will compaction run for 3 days?
> >
> > Which compactions? Major ones? If you don't insert new data in a
> > region, it won't be major compacted. Also, if you have that much data,
> > I would set the time between major compactions to be longer than 1
> > day. Heck, since you are doing time series, this means you'll never
> > delete anything, right? So you might as well disable them.
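
(Sketch, assuming 0.90-era settings: the interval is hbase.hregion.majorcompaction in the region servers' hbase-site.xml, in milliseconds, with 0 turning the periodic run off; you can still trigger one by hand during a quiet window.)

    // With hbase.hregion.majorcompaction = 0 on the region servers, kick one off manually:
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    admin.majorCompact("metrics");   // hypothetical table name; the compaction runs asynchronously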
> >
> > And now for the meaty part...
> >
> > The size of your dataset is only one part of the equation, the other
> > being the traffic you would be pushing to the cluster, which I think
> > wasn't covered at all in your email. Like I said previously, you can
> > pack a lot of data in a single node and can retrieve it really fast as
> > long as concurrency is low. Another thing is how random your reading
> > pattern is... can you even leverage the block cache at all? If yes,
> > then you can accept more concurrency; if not, then hitting HDFS is a