HBase, mail # user - Should I use HBASE?


Re: Should I use HBASE?
Ian Varley 2011-09-14, 19:01
That's an important point to make, Michael. Jumping to HBase (or any NoSQL store) from an RDBMS has pros and cons: the pros are generally that you can scale linearly on cheap(er) hardware as your data and usage grow, but the cons are that many things you take for granted in an RDBMS (like transactions, joins, and secondary indexes) aren't built in. You shouldn't assume that an RDBMS won't handle your data well just because there's "a lot" of it. Benchmarking is key.
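
To make "benchmarking" concrete, here is a minimal sketch of the kind of test meant: insert batches of 10K rows (the per-interval volume discussed in this thread) and time each batch, so you can see whether write times drift as the table grows. It uses in-memory SQLite from the Python standard library purely as a stand-in; the table name, columns, and batch count are assumptions, and a real test should target your actual RDBMS, schema, and hardware.

```python
import sqlite3
import time

BATCH_ROWS = 10_000   # rows per 5-minute collection interval, per the thread
NUM_BATCHES = 5       # assumption: tiny for illustration; use far more for a real test


def run_benchmark(conn, num_batches=NUM_BATCHES, batch_rows=BATCH_ROWS):
    """Insert fixed-size batches and return the elapsed seconds for each one."""
    # Hypothetical schema: one row per (server, timestamp) sample.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "server_id INTEGER, ts INTEGER, value REAL, "
        "PRIMARY KEY (server_id, ts))"
    )
    timings = []
    for batch in range(num_batches):
        ts = batch * 300  # one timestamp per simulated 5-minute interval
        rows = [(server_id, ts, 0.5) for server_id in range(batch_rows)]
        started = time.perf_counter()
        with conn:  # commit the whole batch as a single transaction
            conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", rows)
        timings.append(time.perf_counter() - started)
    return timings


conn = sqlite3.connect(":memory:")
timings = run_benchmark(conn)
row_count = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]

for i, seconds in enumerate(timings):
    print(f"batch {i}: {seconds:.3f}s")
print(f"{row_count:,} rows total")
```

With a realistic number of batches, the interesting signal is the trend in per-batch time, followed by timing the reads you actually care about once the table is full.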

In this case, six months' worth of data at a rate of 10K inserts per 5 minutes comes out to a steady state of about 500M rows (is that what you mean, @stable29?). Even with skinny rows, that's not "trivial" for a relational database, especially if that database is MySQL. It can work, but you'll need someone who really understands the DB at a low level and can administer it, troubleshoot it, deal with physical deletion once data ages past the 6 months, etc. If you ever need to change your schema while keeping the system online, that could also be a challenge. These things are all TOTALLY doable on a relational DB, but you are at least edging towards the territory where there's a reasonable case to be made for HBase.
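
For anyone who wants to check the "about 500M rows" figure, the arithmetic can be sketched as follows. The only assumption beyond the numbers in the thread is a 30-day month (so six months is 180 days).

```python
# Back-of-the-envelope check of the steady-state row count.
# Assumption (not from the thread): a 30-day month, so 6 months = 180 days.
ROWS_PER_INTERVAL = 10_000
INTERVAL_MINUTES = 5
RETENTION_DAYS = 6 * 30

intervals_per_day = 24 * 60 // INTERVAL_MINUTES        # 288 intervals
rows_per_day = ROWS_PER_INTERVAL * intervals_per_day    # 2,880,000 rows
steady_state_rows = rows_per_day * RETENTION_DAYS       # 518,400,000 rows
avg_inserts_per_second = ROWS_PER_INTERVAL / (INTERVAL_MINUTES * 60)

print(f"{rows_per_day:,} rows/day")
print(f"{steady_state_rows:,} rows retained")
print(f"{avg_inserts_per_second:.1f} inserts/second on average")
```

So the average write rate is modest (roughly 33 rows per second); it is the retained volume and the ongoing deletion of expired rows that make this non-trivial.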

Also, since you probably don't have much to worry about in terms of complex transactions, joins, etc., it does sound like a situation where a small HBase cluster might do a nice job of storing this data for you. If you can design in terms of one (or a small number of) access (read & write) patterns that will always be used, you can really optimize it to the point that you pretty much know exactly how every write goes onto the disk and how it gets read back from the disk.
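
As an illustration of designing around a single access pattern: HBase keeps rows sorted by the bytes of the row key, so a key that puts the server first and the timestamp second keeps each server's samples contiguous and in time order. The sketch below is plain Python with no HBase client involved; the key layout, function names, and server names are all hypothetical, and a sorted list stands in for the sorted key space.

```python
import struct


def make_row_key(server_id: str, epoch_seconds: int) -> bytes:
    """Hypothetical layout: <server_id> 0x00 <8-byte big-endian timestamp>.

    Big-endian unsigned integers sort the same way as bytes and as numbers,
    so keys for one server end up in time order.
    """
    return server_id.encode("utf-8") + b"\x00" + struct.pack(">Q", epoch_seconds)


def scan_range(server_id: str, start: int, stop: int):
    """Return (start_key, stop_key) covering [start, stop) for one server."""
    return make_row_key(server_id, start), make_row_key(server_id, stop)


# Simulate the sorted key space with a sorted list of keys.
servers = ("web-01", "web-02", "db-01")
timestamps = (1316000000, 1316000300, 1316000600)  # 5 minutes apart
keys = sorted(make_row_key(s, t) for s in servers for t in timestamps)

# "All samples for web-01 in the first 10 minutes" is one contiguous slice.
lo, hi = scan_range("web-01", 1316000000, 1316000600)
hits = [k for k in keys if lo <= k < hi]
print(len(hits), "rows in range")
```

The trade-off of a layout like this is that it serves "one server over a time range" well and "all servers at one instant" poorly, which is why settling on the access pattern first matters.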

Even with HBase, though, you'll still need someone who really understands the architecture, etc. The difference might just be that HBase is fundamentally simpler than a relational DB; if that simplicity provides what you need without complex workarounds, cool. HBase puts you closer to the metal than a relational database; sometimes that's good (at scale) and sometimes it's not (say, if you didn't really need that power and a higher-level, more abstract tool set like a relational database would suffice).

Ian

On Sep 14, 2011, at 1:17 PM, Michael Segel wrote:

>
> I realize that this is an HBase group; however, nothing in the stated problem suggests that an RDBMS couldn't handle it.
> Inserting 10K rows every 5 minutes poses a challenge to the database?
>
> I guess it could be a challenge depending on the size and type of data, along with the database, schema, hardware, etc. Essentially, YMMV.
>
> I'm not sure that switching to HBase would solve their problem.
>
>
>> Date: Wed, 14 Sep 2011 08:09:13 -0700
>> From: [EMAIL PROTECTED]
>> Subject: Re: Should I use HBASE?
>> To: [EMAIL PROTECTED]
>>
>> Hi,
>>
>> I'd guess that you could relatively easily write something that writes that much data into your RDBMS, then see how writes behave over time and how fast reads are once all the writes are done.
>> Over at Sematext we have a service called Scalable Performance Monitoring [1], and we chose HBase to store all performance metrics, but we keep a LOT of data (points).
>>
>> [1] http://sematext.com/spm/index.html
>>
>>
>> Not coincidentally, we also have HBase-specific monitoring and reports there.
>>
>> Otis
>>
>>
>>>
>>> From: stable29 <[EMAIL PROTECTED]>
>>> To: [EMAIL PROTECTED]
>>> Sent: Wednesday, September 14, 2011 6:02 AM
>>> Subject: Should I use HBASE?
>>>
>>>
>>> Currently I am using an RDBMS in my project. My project basically monitors
>>> servers. It has to collect information from all the servers (the number of
>>> servers could be very large) every 5 minutes and store it in the database.
>>> Storing all the servers' information (around 10000 rows will be inserted
>>> with logical comparison) within 5 minutes is itself challenging for an
>>> RDBMS. We have to maintain around 6 months of data in the database.
>>> So that's why the amount of data becomes very large. This is the primary