HBase, mail # user - Multiple tables vs big fat table


Re: Multiple tables vs big fat table
Ian Varley 2011-11-21, 16:21
One clarification. Michael, when you say:

"If I do a scan(), I'm actually going to go through all of the rows in the table."

That's if you're doing a *full* table scan, which you'd have to do if you wanted selectivity based on some attribute that isn't part of the key. This is to be avoided in anything other than a map/reduce scenario; you definitely don't want to scan an entire 100TB table every time you want to return 10 rows to your user in real time.

By contrast, HBase is perfectly capable of doing *limited* range scans over a set of sorted rows that are contiguous with respect to their row keys. The cost of such a scan is linear in the size of the scanned range, *not* the size of the whole table. In fact, the get() operation is actually built on top of this same scan() operation, but simply restricts itself to one row. (This presupposes that you're not manually using a hash for your row keys, of course.)

So if you're scanning a fixed range of your row key space, the cost of that scan stays constant with respect to the size of the whole table.
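
To make the point above concrete, here is a toy sketch in plain Python. It is not the HBase client API: SortedTable, make_table, and the rows_read counter are hypothetical names invented for illustration, and the model ignores regions, store files, and caching. It only assumes what is described above, namely that rows are kept sorted by row key and that a get is a scan restricted to one row.

```python
import bisect


class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self, rows):
        # rows: dict mapping row key (str) -> value
        self._keys = sorted(rows)
        self._rows = dict(rows)
        self.rows_read = 0  # rows touched so far, used as a rough cost measure

    def scan(self, start_key=None, stop_key=None):
        """Return rows with start_key <= key < stop_key, in key order.

        With no bounds this is a full table scan.
        """
        lo = 0 if start_key is None else bisect.bisect_left(self._keys, start_key)
        hi = len(self._keys) if stop_key is None else bisect.bisect_left(self._keys, stop_key)
        result = []
        for key in self._keys[lo:hi]:
            self.rows_read += 1
            result.append((key, self._rows[key]))
        return result

    def get(self, key):
        """Single-row lookup expressed as a scan over a one-key range."""
        hits = self.scan(key, key + "\x00")
        return hits[0][1] if hits else None


def make_table(n):
    """Build a table of n rows with zero-padded, sortable row keys."""
    return SortedTable({f"row{i:08d}": i for i in range(n)})


small = make_table(1_000)
large = make_table(100_000)

# The same bounded range scan touches the same number of rows in both tables.
for table in (small, large):
    table.scan("row00000100", "row00000110")
print(small.rows_read, large.rows_read)

# An unbounded scan touches every row, so its cost grows with the table.
print(len(small.scan()), len(large.scan()))
```

In this model the bounded scan reads the same number of rows no matter how large the table is, while the unbounded scan reads all of them. Finding the start of the range still costs a lookup (a binary search here), which is the part a hashed row key would not help with for ranges, since hashing destroys the key ordering the scan relies on.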

Ian

On Nov 21, 2011, at 10:13 AM, Michael Segel wrote:

>
> Mark,
> I sometimes answer these things while on my iPad. It's not the best way to type in long answers.  :-)
>
> Yes, you are correct, I'm saying exactly that.  
>
> So imagine you have an HBase Table on a cluster with 10 nodes and 10TB of data.
> If I do a get() I'm asking for a specific row and it will take some time, depending on the row size. For the sake of the example, let's say 5ms.
> If I do a scan(), I'm actually going to go through all of the rows in the table.
>
> Now the table and the cluster grow to 100 nodes and 100TB of data.
> If I do the get(), it should still take roughly 5ms.
> However, if I do the scan() it's going to take longer because you're now going through much more data.
>
> Note: I'm talking about a single threaded scan() from a non M/R app or from HBase shell.
>
> This is why getting the right row key, understanding how your data is going to be used, and your schema are all important when it comes to performance.
> (Even flipping the order of the elements that make up your key can have an impact.)
>
> IMHO you need to do a lot more thinking and planning when you work with a NoSQL database than you would with an RDBMS.
>
>
>> Date: Mon, 21 Nov 2011 07:43:09 -0800
>> From: [EMAIL PROTECTED]
>> To: [EMAIL PROTECTED]
>> Subject: Re: Multiple tables vs big fat table
>>
>> Thanks for the detailed explanation. Can you just elaborate on your last
>> comment:
>>
>> In terms of get() performance HBase scales linearly. In terms of scans, it doesn't.
>>
>> Are you saying as my tables get larger and larger that the performance
>> of my scan operations will decline over time but gets will remain constant?
>>
>>
>> On 11/21/11 1:40 AM, Michel Segel wrote:
>>> Mark,
>>>
>>> I think you've gotten a bit more of an explanation...
>>>
>>> The reason I say 'It depends...' is that there are arguments for either design.
>>> If your log events are going to be accessed independently by type... Meaning that you're going to process only a single type of an event at a time, then it makes sense to separate the data.  Note I'm talking about your primary access path.
>>>
>>> At the same time, it was pointed out that if you're not going to be accessing the log events one at a time, you may actually want a hybrid approach where you keep your index in HBase but store your event logs in a sequence file.
>>>
>>> And again, it all depends on what you want to do with the data. That's why you can't always say ... 'if y then do x...'
>>>
>>> There are other issues too. How will the data end up sitting in the table? Sure, this is more of an issue of schema/key design, but it will also have an impact on your system's performance.
>>>
>>> In terms of get() performance HBase scales linearly. In terms of scans, it doesn't.
>>>
>>> So there's a lot to think about...
>>>
>>>