-
High throughput input, low latency output?
Anthony Urso 2011-10-07, 19:43
We have a use case that will require a ten to twenty EC2 node HBase cluster to take several hundred million rows of input from a larger number of EMR instances in daily bursts, and then serve those rows via low latency random reads, say on the order of 300 or so rows per second. Before we start coding, I thought it best to ask the experts for their advice.
1) Is this something that HBase will be able to handle gracefully?
2) Does anyone have any pointers on how to tune HBase for performance and stability under this load?
3) Would HBase perform better under this sort of load on twelve large EC2 instances, six xlarge or three xxlarge?
Thanks, Anthony
-
Re: High throughput input, low latency output?
Matt Corgan 2011-10-07, 20:15
We found that 2 cores is not enough to run hbase. 1 core can easily get tied up with a compaction while the other is doing garbage collection. That doesn't leave any headroom for gets/scans, especially on compressed data and/or when multiple are happening at the same time. Try to do all of that at the same time and some of the other background tasks start choking, like memstore flushes.
We run the c1.xlarge instances (8 cores, 8gb mem) and everything works well, though not much room for block cache.
Matt
On Fri, Oct 7, 2011 at 12:43 PM, Anthony Urso <[EMAIL PROTECTED]> wrote:
> We have a use case that will require a ten to twenty EC2 node HBase
> cluster to take several hundred million rows of input from a larger
> number of EMR instances in daily bursts, and then serve those rows via
> low latency random reads, say on the order of 300 or so rows per
> second. Before we start coding, I thought it best to ask the experts
> for their advice.
>
> 1) Is this something that HBase will be able to handle gracefully?
> 2) Does anyone have any pointers on how to tune HBase for performance
> and stability under this load?
> 3) Would HBase perform better under this sort of load on twelve large
> EC2 instances, six xlarge or three xxlarge?
>
> Thanks,
> Anthony
-
Re: High throughput input, low latency output?
Stack 2011-10-08, 03:58
On Fri, Oct 7, 2011 at 12:43 PM, Anthony Urso <[EMAIL PROTECTED]> wrote:
> We have a use case that will require a ten to twenty EC2 node HBase
> cluster to take several hundred million rows of input from a larger
> number of EMR instances in daily bursts, and then serve those rows via
> low latency random reads, say on the order of 300 or so rows per
> second. Before we start coding, I thought it best to ask the experts
> for their advice.
>
> 1) Is this something that HBase will be able to handle gracefully?
You might have some chance if you were not on EC2.
Any chance of caching working? Are the reads totally random or will there be 'hot' areas? If so, you might have some hope.
> 2) Does anyone have any pointers on how to tune HBase for performance
> and stability under this load?
See performance section on book up on hbase.org (though there should probably be EC2 caveats...)
> 3) Would HBase perform better under this sort of load on twelve large
> EC2 instances, six xlarge or three xxlarge?
The more nodes the better. And if those nodes are not virtualized, better still. But then there is the network, and whether it's saturated....
Can you run some tests before you start coding?
St.Ack
-
Re: High throughput input, low latency output?
Anthony Urso 2011-10-08, 19:18
On Fri, Oct 7, 2011 at 8:58 PM, Stack <[EMAIL PROTECTED]> wrote:
> On Fri, Oct 7, 2011 at 12:43 PM, Anthony Urso <[EMAIL PROTECTED]> wrote:
>> We have a use case that will require a ten to twenty EC2 node HBase
>> cluster to take several hundred million rows of input from a larger
>> number of EMR instances in daily bursts, and then serve those rows via
>> low latency random reads, say on the order of 300 or so rows per
>> second. Before we start coding, I thought it best to ask the experts
>> for their advice.
>>
>> 1) Is this something that HBase will be able to handle gracefully?
>
> You might have some chance if you were not on EC2.
Is that because of the slow disk I/O?
> Any chance of caching working? Are the reads totally random or will
> there be 'hot' areas? If so, you might have some hope.
Hopefully. Do you mean external caching like memcache or OS-level disk caching?
>> 2) Does anyone have any pointers on how to tune HBase for performance
>> and stability under this load?
>
> See performance section on book up on hbase.org (though there should
> probably be EC2 caveats...)
TY.
>> 3) Would HBase perform better under this sort of load on twelve large
>> EC2 instances, six xlarge or three xxlarge?
>
> The more nodes the better. And if those nodes are not virtualized,
> better still. But then there is the network and if its saturated....
>
> Can you run some tests before you start coding?
Good idea.
> St.Ack
-
Re: High throughput input, low latency output?
Stack 2011-10-08, 21:33
On Sat, Oct 8, 2011 at 12:18 PM, Anthony Urso <[EMAIL PROTECTED]> wrote:
> Is that because of the slow disk I/O?
If you are sharing the box, your cotenant could be thrashing the I/O on you.
You for sure are sharing a network -- as best as I understand AWS -- and this can be oversubscribed from time to time (look back on this list for others' input on HBase on EC2 for the gist of what you are in for running on EC2).
>> Any chance of caching working? Are the reads totally random or will
>> there be 'hot' areas? If so, you might have some hope.
>
> Hopefully. Do you mean external caching like memcache or OS-level disk caching?
I was talking more about the HBase block cache; if you are reading the same values over and over, it will have an effect, because reads served from the cache are low latency.
St.Ack
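
The point about 'hot' areas and the block cache can be illustrated with a small simulation. This is only a sketch: it uses a plain LRU cache as a stand-in (HBase's real block cache is more elaborate), and the block counts, cache size, and read mix below are made-up numbers rather than figures from this thread.

```python
import random
from collections import OrderedDict


def simulate_hit_rate(num_blocks, cache_blocks, reads,
                      hot_fraction, hot_share, seed=0):
    """Return the hit rate of a plain LRU cache for a synthetic read mix.

    A share `hot_share` of the reads goes to the first `hot_fraction`
    of the blocks; the remaining reads are uniform over all blocks.
    All parameters are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    hot_blocks = max(1, int(num_blocks * hot_fraction))
    cache = OrderedDict()  # block id -> True, ordered oldest to newest
    hits = 0
    for _ in range(reads):
        if rng.random() < hot_share:
            block = rng.randrange(hot_blocks)
        else:
            block = rng.randrange(num_blocks)
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict least recently used
    return hits / reads


# The cache holds 5% of the blocks in both runs.
uniform = simulate_hit_rate(100_000, 5_000, 200_000,
                            hot_fraction=0.01, hot_share=0.0)
skewed = simulate_hit_rate(100_000, 5_000, 200_000,
                           hot_fraction=0.01, hot_share=0.9)
print(f"uniform reads: {uniform:.1%} hits, skewed reads: {skewed:.1%} hits")
```

With totally random reads the hit rate stays close to the fraction of blocks that fit in the cache, so most gets go to disk. When most reads land on a small hot set that fits in the cache, most gets are served from memory, which is the case where low latency reads are plausible.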