Re: Should I use HBASE?
Ian Varley 2011-09-14, 22:33
Point well taken, Mike. :) It's a bad idea to assume we know the original poster's requirements well enough to suggest a direction, based on such a brief sketch.
Original poster, let me be clear: a data set of your size may (or may not) be a good fit for HBase; relational databases regularly handle that volume of transactions happily, and offer advanced features and ACID guarantees that HBase does not. If you'd like more targeted advice from the community, perhaps answer the following questions:
1. Are the 500M rows you refer to a max target, or just the initial volume? Is there some other multiplier you didn't mention? (You said, "if it works good, this could be used widely")
2. What kind of read access patterns will you have? I.e., do you always get the data by a specific key, or scan across ordered rows? Or would you need to gather data in real time based on filters on other attributes (the way you'd use an index in a relational DB)?
3. How big is the content of a typical row, in bytes?
4. Is using a more "industrial strength" DB like Oracle an option? Or would you be doing it on a free offering like MySQL or Postgres? Would you have a DBA to help administer the solution?
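To make question 2 concrete, here is a minimal sketch of the three access patterns in plain Python (no HBase client involved). The row keys, the `status` attribute, and the helper names `get`, `scan`, and `filter_by` are all made up for illustration. The only thing carried over from HBase is that rows are kept sorted by key, which is why the first two patterns are cheap and the third is not without extra work.

```python
import bisect

# Hypothetical rows keyed by "sensor#sequence"; the key layout is an assumption.
rows = {
    "s1#0001": {"status": "ok"},
    "s1#0002": {"status": "error"},
    "s2#0001": {"status": "ok"},
    "s2#0002": {"status": "error"},
}
keys = sorted(rows)  # rows are stored in row-key order


def get(key):
    """Pattern 1: fetch a single row by its exact key."""
    return rows.get(key)


def scan(start, stop):
    """Pattern 2: walk a contiguous key range [start, stop) in sorted order."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return keys[lo:hi]


def filter_by(attr, value):
    """Pattern 3: filter on a non-key attribute; this has to touch every row."""
    return [k for k in keys if rows[k].get(attr) == value]


print(get("s1#0002"))
print(scan("s1#", "s1$"))           # every key that starts with "s1#"
print(filter_by("status", "error"))
```

If most of your reads look like pattern 3, you would have to maintain your own secondary index, which is the part a relational DB gives you for free.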
On Sep 14, 2011, at 4:23 PM, Michael Segel wrote:
> I think you misunderstood my point.
> The initial author asks a question about using HBase, yet doesn't really provide enough detailed information as to what he wants to achieve and why he is failing.
> My point was that, based on the information he presented, he didn't show how or why his RDBMS solution was failing (or what he meant when he used the term "fail").
> There are so many reasons why the RDBMS could fail, and it could depend on which RDBMS is being used.
> I've seen 50K ticks a second being ingested into Informix's Financial Foundation offering 10 years ago, though that relied on a specific setup of the servers and configuration of IDS.
> But that's 50K records inserted in a second, not 5K every 5 minutes.
> Is it trivial? Probably not trivial, but still not really rocket science.
> But I digress. Again the point is that we have a person coming here and asking us 'is this a good fit' and it would be better to say 'it depends' or 'you haven't provided enough information...'
> To your point, yes, there are other databases out there like Informix and Oracle that scale better than MySQL. If the issue is that his RDBMS can't keep up, then one question I have to ask is whether he's thought about changing to a different RDBMS platform. What happens if you say sure we can do this in HBase, and then he pulls out his 'must be ACID compliant' card?
>> From: [EMAIL PROTECTED]
>> To: [EMAIL PROTECTED]
>> Date: Wed, 14 Sep 2011 12:01:46 -0700
>> Subject: Re: Should I use HBASE?
>> That's an important point to make, Michael. Jumping to HBase (or any NoSQL store) from an RDBMS has pros and cons; the pros are generally that you can scale linearly on cheap(er) hardware as your data and usage grow, but the cons are that many things you take for granted in an RDBMS (like transactions, joins, indexes) aren't built in. You shouldn't assume that an RDBMS won't handle it well just because it's "a lot" of data. Benchmarking is key.
>> In this case, 6 months' worth of data at a rate of 10K inserts per 5 minutes comes out to a steady state of about 500M rows (is that what you mean, @stable29?). Even with skinny rows, that's not "trivial" for a relational database, especially if that database is MySQL. It can work, but you'll need someone who really understands the DB at a low level and can administer it, troubleshoot, deal with physical deletion after the 6 months is up, etc. If you ever need to change your schema while keeping the system online, that could also be a challenge. These things are all TOTALLY doable on a relational DB, but you are at least edging towards the territory where there's a reasonable case to be made for HBase.
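
For anyone who wants to check the "about 500M rows" figure quoted above, the arithmetic can be sketched as follows. The insert rate and interval are the numbers from the thread; treating "6 months" as 180 days is an assumption, and the function name is only illustrative.

```python
# Back-of-the-envelope estimate of the steady-state row count.
INSERTS_PER_BATCH = 10_000   # from the thread: 10K inserts...
BATCH_INTERVAL_MIN = 5       # ...every 5 minutes
RETENTION_DAYS = 180         # assumption: "6 months" taken as 180 days


def steady_state_rows(inserts_per_batch, interval_min, retention_days):
    """Rows held once inserts and expiry after the retention window balance out."""
    batches_per_day = (24 * 60) // interval_min
    return inserts_per_batch * batches_per_day * retention_days


total_rows = steady_state_rows(INSERTS_PER_BATCH, BATCH_INTERVAL_MIN, RETENTION_DAYS)
print(f"{total_rows:,} rows")  # 518,400,000 rows
```

That is 288 batches a day, or 2.88M rows a day, which over 180 days lands a little above the 500M quoted.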