|
|
-
Multiple tables vs big fat table
Mark 2011-11-18, 21:29
Is it better to have many smaller tables are one larger table? For example if we wanted to store user action logs we could do either of the following:
Multiple tables: - SearchLog - PageViewLog - LoginLog
or
One table: - ActionLog where the key could be a concatenation of the action type ie (search, pageview, login)
Any ideas? Are there any performance considerations on having multiple smaller tables?
Thanks
-
Re: Multiple tables vs big fat table
Michel Segel 2011-11-20, 08:57
Mark, Simple answer ... it depends... ;-)
Longer answer... What's your use case? What's your access pattern? Is the type of data, in this case evenly distributed in terms of size?
Sent from a remote device. Please excuse any typos...
Mike Segel
On Nov 18, 2011, at 3:29 PM, Mark <[EMAIL PROTECTED]> wrote:
> Is it better to have many smaller tables are one larger table? For example if we wanted to store user action logs we could do either of the following: > > Multiple tables: > - SearchLog > - PageViewLog > - LoginLog > > or > > One table: > - ActionLog where the key could be a concatenation of the action type ie (search, pageview, login) > > Any ideas? Are there any performance considerations on having multiple smaller tables? > > Thanks > >
-
Re: Multiple tables vs big fat table
Mark 2011-11-20, 17:54
I'm more interested in how and why it would depend rather than the actual answer.
In evenly distributed systems you should do x/y because ..... If your data is not evenly distributed then you should...
Thanks On 11/20/11 12:57 AM, Michel Segel wrote: > Mark, > Simple answer ... it depends... ;-) > > Longer answer... > What's your use case? What's your access pattern? Is the type of data, in this case evenly distributed in terms of size? > > > > Sent from a remote device. Please excuse any typos... > > Mike Segel > > On Nov 18, 2011, at 3:29 PM, Mark<[EMAIL PROTECTED]> wrote: > >> Is it better to have many smaller tables are one larger table? For example if we wanted to store user action logs we could do either of the following: >> >> Multiple tables: >> - SearchLog >> - PageViewLog >> - LoginLog >> >> or >> >> One table: >> - ActionLog where the key could be a concatenation of the action type ie (search, pageview, login) >> >> Any ideas? Are there any performance considerations on having multiple smaller tables? >> >> Thanks >> >>
-
Re: Multiple tables vs big fat table
lars hofhansl 2011-11-20, 19:30
There are many considerations here, but one is that separate tables provide a completely separate namespace. If you use one table design of the key space is more involved as you need to separate the namespace with key prefixes. So if you never have to access data from separate "key space" in a single scan, then go for multiple tables.
On the other hand, one big table will probably distribute better over the regionserver and lead to fewer regions over all.
So it depends on how many tables you envision. 10 or 20 or even 100 or so it probably OK. 1000 tables or more will lead to very many regions and hence overhead at the regionservers.
________________________________ From: Mark <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Sunday, November 20, 2011 9:54 AM Subject: Re: Multiple tables vs big fat table I'm more interested in how and why it would depend rather than the actual answer.
In evenly distributed systems you should do x/y because ..... If your data is not evenly distributed then you should...
Thanks On 11/20/11 12:57 AM, Michel Segel wrote: > Mark, > Simple answer ... it depends... ;-) > > Longer answer... > What's your use case? What's your access pattern? Is the type of data, in this case evenly distributed in terms of size? > > > > Sent from a remote device. Please excuse any typos... > > Mike Segel > > On Nov 18, 2011, at 3:29 PM, Mark<[EMAIL PROTECTED]> wrote: > >> Is it better to have many smaller tables are one larger table? For example if we wanted to store user action logs we could do either of the following: >> >> Multiple tables: >> - SearchLog >> - PageViewLog >> - LoginLog >> >> or >> >> One table: >> - ActionLog where the key could be a concatenation of the action type ie (search, pageview, login) >> >> Any ideas? Are there any performance considerations on having multiple smaller tables? >> >> Thanks >> >>
-
Re: Multiple tables vs big fat table
Mark 2011-11-21, 01:18
Thanks for the info.
On 11/20/11 11:30 AM, lars hofhansl wrote: > There are many considerations here, but one is that separate tables provide a completely separate namespace. > If you use one table design of the key space is more involved as you need to separate the namespace with key prefixes. > > > So if you never have to access data from separate "key space" in a single scan, then go for multiple tables. > > On the other hand, one big table will probably distribute better over the regionserver and lead to fewer regions over all. > > So it depends on how many tables you envision. 10 or 20 or even 100 or so it probably OK. 1000 tables or more will lead to very > many regions and hence overhead at the regionservers. > > > > ________________________________ > From: Mark<[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Sent: Sunday, November 20, 2011 9:54 AM > Subject: Re: Multiple tables vs big fat table > > I'm more interested in how and why it would depend rather than the > actual answer. > > In evenly distributed systems you should do x/y because ..... If your > data is not evenly distributed then you should... > > Thanks > > > On 11/20/11 12:57 AM, Michel Segel wrote: >> Mark, >> Simple answer ... it depends... ;-) >> >> Longer answer... >> What's your use case? What's your access pattern? Is the type of data, in this case evenly distributed in terms of size? >> >> >> >> Sent from a remote device. Please excuse any typos... >> >> Mike Segel >> >> On Nov 18, 2011, at 3:29 PM, Mark<[EMAIL PROTECTED]> wrote: >> >>> Is it better to have many smaller tables are one larger table? For example if we wanted to store user action logs we could do either of the following: >>> >>> Multiple tables: >>> - SearchLog >>> - PageViewLog >>> - LoginLog >>> >>> or >>> >>> One table: >>> - ActionLog where the key could be a concatenation of the action type ie (search, pageview, login) >>> >>> Any ideas? Are there any performance considerations on having multiple smaller tables? >>> >>> Thanks >>> >>>
-
Re: Multiple tables vs big fat table
Amandeep Khurana 2011-11-21, 01:36
Mark,
This is an interesting discussion and like Michel said - the answer to your question depends on what you are trying to achieve. However, here are the points that I would think about:
What are the access patters of the various buckets of data that you want to put in HBase? For instance, would the SearchLog and PageViewLog tables be access together all the time? Would they be primarily scanned or just random look ups. What are the cache requirements? Are both going to be equally read and written? Ideally, you want to store data with separate access patterns in separate tables.
Then, what kind of schema are you looking at. When I say schema, I mean keys and column families. Now, if you concatenate the three tables you mentioned and let's say your keys are prefixed with the type of data:
S<id> P<id> L<id>
you will be using some servers more than others for different parts of the data. In theory, that should not happen but in most practical scenarios when splitting happens, regions tend to stick together. There are ways to work around that as well.
Like Lars said, it's okay to have multiple tables. But you don't want to end up 100s of tables. You ideally want to optimize for the number of tables depending on the access patterns.
Again, this discussion will be kind of abstract without a specific example. :)
-ak On Fri, Nov 18, 2011 at 1:29 PM, Mark <[EMAIL PROTECTED]> wrote:
> Is it better to have many smaller tables are one larger table? For example > if we wanted to store user action logs we could do either of the following: > > Multiple tables: > - SearchLog > - PageViewLog > - LoginLog > > or > > One table: > - ActionLog where the key could be a concatenation of the action type ie > (search, pageview, login) > > Any ideas? Are there any performance considerations on having multiple > smaller tables? > > Thanks > >
-
Re: Multiple tables vs big fat table
Michel Segel 2011-11-21, 09:40
Mark,
I think you've gotten a bit more of an explanation...
The reason I say 'It depends...' is that there are arguments for either design. If your log events are going to be accessed independently by type... Meaning that you're going to process only a single type of an event at a time, then it makes sense to separate the data. Note I'm talking about your primary access path.
At the same time, it was pointed out that if you're not going to be accessing the log events one at a time, you may actually want a hybrid approach where you keep your index in HBase but store your event logs in a sequence file.
And again, it all depends on what you want to do with the data. That's why you can't always say ... 'if y then do x...'
There are other issues too. How will the data end up sitting in the table? Sure his is more of an issue of schema/key design, but it will also have an impact on your systems performance.
In terms of get() performance HBase scales linearly. In terms of scans, it doesn't.
So there's a lot to think about...
Sent from a remote device. Please excuse any typos...
Mike Segel
On Nov 20, 2011, at 7:36 PM, Amandeep Khurana <[EMAIL PROTECTED]> wrote:
> Mark, > > This is an interesting discussion and like Michel said - the answer to your > question depends on what you are trying to achieve. However, here are the > points that I would think about: > > What are the access patters of the various buckets of data that you want to > put in HBase? For instance, would the SearchLog and PageViewLog tables be > access together all the time? Would they be primarily scanned or just > random look ups. What are the cache requirements? Are both going to be > equally read and written? Ideally, you want to store data with separate > access patterns in separate tables. > > Then, what kind of schema are you looking at. When I say schema, I mean > keys and column families. Now, if you concatenate the three tables you > mentioned and let's say your keys are prefixed with the type of data: > > S<id> > P<id> > L<id> > > you will be using some servers more than others for different parts of the > data. In theory, that should not happen but in most practical scenarios > when splitting happens, regions tend to stick together. There are ways to > work around that as well. > > Like Lars said, it's okay to have multiple tables. But you don't want to > end up 100s of tables. You ideally want to optimize for the number of > tables depending on the access patterns. > > Again, this discussion will be kind of abstract without a specific example. > :) > > -ak > > > On Fri, Nov 18, 2011 at 1:29 PM, Mark <[EMAIL PROTECTED]> wrote: > >> Is it better to have many smaller tables are one larger table? For example >> if we wanted to store user action logs we could do either of the following: >> >> Multiple tables: >> - SearchLog >> - PageViewLog >> - LoginLog >> >> or >> >> One table: >> - ActionLog where the key could be a concatenation of the action type ie >> (search, pageview, login) >> >> Any ideas? Are there any performance considerations on having multiple >> smaller tables? >> >> Thanks >> >>
-
Re: Multiple tables vs big fat table
Mark 2011-11-21, 15:43
Thanks for the detailed explanation. Can you just elaborate on your last comment:
In terms of get() performance HBase scales linearly. In terms of scans, it doesn't.
Are you saying as my tables get larger and larger that the performance of my scan operations will decline over time but gets will remain constant? On 11/21/11 1:40 AM, Michel Segel wrote: > Mark, > > I think you've gotten a bit more of an explanation... > > The reason I say 'It depends...' is that there are arguments for either design. > If your log events are going to be accessed independently by type... Meaning that you're going to process only a single type of an event at a time, then it makes sense to separate the data. Note I'm talking about your primary access path. > > At the same time, it was pointed out that if you're not going to be accessing the log events one at a time, you may actually want a hybrid approach where you keep your index in HBase but store your event logs in a sequence file. > > And again, it all depends on what you want to do with the data. That's why you can't always say ... 'if y then do x...' > > There are other issues too. How will the data end up sitting in the table? Sure his is more of an issue of schema/key design, but it will also have an impact on your systems performance. > > In terms of get() performance HBase scales linearly. In terms of scans, it doesn't. > > So there's a lot to think about... > > > > Sent from a remote device. Please excuse any typos... > > Mike Segel > > On Nov 20, 2011, at 7:36 PM, Amandeep Khurana<[EMAIL PROTECTED]> wrote: > >> Mark, >> >> This is an interesting discussion and like Michel said - the answer to your >> question depends on what you are trying to achieve. However, here are the >> points that I would think about: >> >> What are the access patters of the various buckets of data that you want to >> put in HBase? For instance, would the SearchLog and PageViewLog tables be >> access together all the time? Would they be primarily scanned or just >> random look ups. What are the cache requirements? Are both going to be >> equally read and written? Ideally, you want to store data with separate >> access patterns in separate tables. >> >> Then, what kind of schema are you looking at. When I say schema, I mean >> keys and column families. Now, if you concatenate the three tables you >> mentioned and let's say your keys are prefixed with the type of data: >> >> S<id> >> P<id> >> L<id> >> >> you will be using some servers more than others for different parts of the >> data. In theory, that should not happen but in most practical scenarios >> when splitting happens, regions tend to stick together. There are ways to >> work around that as well. >> >> Like Lars said, it's okay to have multiple tables. But you don't want to >> end up 100s of tables. You ideally want to optimize for the number of >> tables depending on the access patterns. >> >> Again, this discussion will be kind of abstract without a specific example. >> :) >> >> -ak >> >> >> On Fri, Nov 18, 2011 at 1:29 PM, Mark<[EMAIL PROTECTED]> wrote: >> >>> Is it better to have many smaller tables are one larger table? For example >>> if we wanted to store user action logs we could do either of the following: >>> >>> Multiple tables: >>> - SearchLog >>> - PageViewLog >>> - LoginLog >>> >>> or >>> >>> One table: >>> - ActionLog where the key could be a concatenation of the action type ie >>> (search, pageview, login) >>> >>> Any ideas? Are there any performance considerations on having multiple >>> smaller tables? >>> >>> Thanks >>> >>>
-
RE: Multiple tables vs big fat table
Michael Segel 2011-11-21, 16:13
Mark, I sometimes answer these things while on my iPad. Its not the best way to type in long answers. :-)
Yes, you are correct, I'm saying exactly that.
So imagine you have an HBase Table on a cluster with 10 nodes and 10TB of data. If I do a get() I'm asking for a specific row and it will take some time, depending on the row size. For the sake of the example, lets say 5ms. If I do a scan(), I'm actually going to go through all of the rows in the table.
Now the Table and the cluster grows to 100 nodes and 100TB of data. If I do the get(), it should still take roughly 5ms. However if I do the scan() its going to take longer because you're now going through much more data.
Note: I'm talking about a single threaded scan() from a non M/R app or from HBase shell.
This is kind of why getting the right row key, understanding how your data is going to be used, and your schema are all kind of important when it comes to performance. (Even flipping the order of the elements that make up your key can have an impact.)
IMHO I think you need to do a lot more thinking and planning when you work with a NoSQL database than you would w an RDBMs. > Date: Mon, 21 Nov 2011 07:43:09 -0800 > From: [EMAIL PROTECTED] > To: [EMAIL PROTECTED] > Subject: Re: Multiple tables vs big fat table > > Thanks for the detailed explanation. Can you just elaborate on your last > comment: > > In terms of get() performance HBase scales linearly. In terms of scans, it doesn't. > > Are you saying as my tables get larger and larger that the performance > of my scan operations will decline over time but gets will remain constant? > > > On 11/21/11 1:40 AM, Michel Segel wrote: > > Mark, > > > > I think you've gotten a bit more of an explanation... > > > > The reason I say 'It depends...' is that there are arguments for either design. > > If your log events are going to be accessed independently by type... Meaning that you're going to process only a single type of an event at a time, then it makes sense to separate the data. Note I'm talking about your primary access path. > > > > At the same time, it was pointed out that if you're not going to be accessing the log events one at a time, you may actually want a hybrid approach where you keep your index in HBase but store your event logs in a sequence file. > > > > And again, it all depends on what you want to do with the data. That's why you can't always say ... 'if y then do x...' > > > > There are other issues too. How will the data end up sitting in the table? Sure his is more of an issue of schema/key design, but it will also have an impact on your systems performance. > > > > In terms of get() performance HBase scales linearly. In terms of scans, it doesn't. > > > > So there's a lot to think about... > > > > > > > > Sent from a remote device. Please excuse any typos... > > > > Mike Segel > > > > On Nov 20, 2011, at 7:36 PM, Amandeep Khurana<[EMAIL PROTECTED]> wrote: > > > >> Mark, > >> > >> This is an interesting discussion and like Michel said - the answer to your > >> question depends on what you are trying to achieve. However, here are the > >> points that I would think about: > >> > >> What are the access patters of the various buckets of data that you want to > >> put in HBase? For instance, would the SearchLog and PageViewLog tables be > >> access together all the time? Would they be primarily scanned or just > >> random look ups. What are the cache requirements? Are both going to be > >> equally read and written? Ideally, you want to store data with separate > >> access patterns in separate tables. > >> > >> Then, what kind of schema are you looking at. When I say schema, I mean > >> keys and column families. Now, if you concatenate the three tables you > >> mentioned and let's say your keys are prefixed with the type of data: > >> > >> S<id> > >> P<id> > >> L<id> > >> > >> you will be using some servers more than others for different parts of the > >> data. In theory, that should not happen but in most practical scenarios
-
Re: Multiple tables vs big fat table
Ian Varley 2011-11-21, 16:21
One clarification; Michael, when you say:
"If I do a scan(), I'm actually going to go through all of the rows in the table."
That's if you're doing a *full* table scan, which you'd have to do if you wanted selectivity based on some attribute that isn't part of the key. This is to be avoided in anything other than a map/reduce scenario; you definitely don't want to scan an entire 100TB table every time you want to return 10 rows to your user in real time.
By contrast, however, HBase is perfectly capable of doing *limited* range scans, over some set of sorted rows that are contiguous with respect to their row keys. This continues to be linear in the size of the scanned range, *not* the size of the whole table. In fact, the get() operation is actually built on top of this same scan() operation, but simply restricts itself to one row. (This pre-supposes that you're not manually using a hash for your row keys, of course).
So if you're scanning by a fixed range of your row key space, that continues to be constant with respect to the size of the whole table.
Ian
On Nov 21, 2011, at 10:13 AM, Michael Segel wrote:
> > Mark, > I sometimes answer these things while on my iPad. Its not the best way to type in long answers. :-) > > Yes, you are correct, I'm saying exactly that. > > So imagine you have an HBase Table on a cluster with 10 nodes and 10TB of data. > If I do a get() I'm asking for a specific row and it will take some time, depending on the row size. For the sake of the example, lets say 5ms. > If I do a scan(), I'm actually going to go through all of the rows in the table. > > Now the Table and the cluster grows to 100 nodes and 100TB of data. > If I do the get(), it should still take roughly 5ms. > However if I do the scan() its going to take longer because you're now going through much more data. > > Note: I'm talking about a single threaded scan() from a non M/R app or from HBase shell. > > This is kind of why getting the right row key, understanding how your data is going to be used, and your schema are all kind of important when it comes to performance. > (Even flipping the order of the elements that make up your key can have an impact.) > > IMHO I think you need to do a lot more thinking and planning when you work with a NoSQL database than you would w an RDBMs. > > >> Date: Mon, 21 Nov 2011 07:43:09 -0800 >> From: [EMAIL PROTECTED] >> To: [EMAIL PROTECTED] >> Subject: Re: Multiple tables vs big fat table >> >> Thanks for the detailed explanation. Can you just elaborate on your last >> comment: >> >> In terms of get() performance HBase scales linearly. In terms of scans, it doesn't. >> >> Are you saying as my tables get larger and larger that the performance >> of my scan operations will decline over time but gets will remain constant? >> >> >> On 11/21/11 1:40 AM, Michel Segel wrote: >>> Mark, >>> >>> I think you've gotten a bit more of an explanation... >>> >>> The reason I say 'It depends...' is that there are arguments for either design. >>> If your log events are going to be accessed independently by type... Meaning that you're going to process only a single type of an event at a time, then it makes sense to separate the data. Note I'm talking about your primary access path. >>> >>> At the same time, it was pointed out that if you're not going to be accessing the log events one at a time, you may actually want a hybrid approach where you keep your index in HBase but store your event logs in a sequence file. >>> >>> And again, it all depends on what you want to do with the data. That's why you can't always say ... 'if y then do x...' >>> >>> There are other issues too. How will the data end up sitting in the table? Sure his is more of an issue of schema/key design, but it will also have an impact on your systems performance. >>> >>> In terms of get() performance HBase scales linearly. In terms of scans, it doesn't. >>> >>> So there's a lot to think about... >>> >>>
-
RE: Multiple tables vs big fat table
Michael Segel 2011-11-21, 17:04
Ian,
The long and short...
I was using the example to show scan() vs get() and how HBase scales in a linear fashion.
To your point... if you hash your row key, you can't use start/stop row key values in your scan.
We have an application where you would have to do a complete scan in order to find a subset of rows which required some work. To get around from having to do a full scan, you could use a secondary index, or table to store the row keys that you want to work with. The trick is then to query the subset and then split the resulting list. (You can do this by overloading the inputFormat class to take a java list object as your input in to a map/reduce job and then create n evenly splits... (Ok n even splits + 1 split holding the remainder...) )
There are always going to be design tradeoffs. The app was designed to give the best performance on reads and then sacrifice m/r performance... (reads from outside of a M/R ) > From: [EMAIL PROTECTED] > To: [EMAIL PROTECTED] > Date: Mon, 21 Nov 2011 08:21:56 -0800 > Subject: Re: Multiple tables vs big fat table > > One clarification; Michael, when you say: > > "If I do a scan(), I'm actually going to go through all of the rows in the table." > > That's if you're doing a *full* table scan, which you'd have to do if you wanted selectivity based on some attribute that isn't part of the key. This is to be avoided in anything other than a map/reduce scenario; you definitely don't want to scan an entire 100TB table every time you want to return 10 rows to your user in real time. > > By contrast, however, HBase is perfectly capable of doing *limited* range scans, over some set of sorted rows that are contiguous with respect to their row keys. This continues to be linear in the size of the scanned range, *not* the size of the whole table. In fact, the get() operation is actually built on top of this same scan() operation, but simply restricts itself to one row. (This pre-supposes that you're not manually using a hash for your row keys, of course). > > So if you're scanning by a fixed range of your row key space, that continues to be constant with respect to the size of the whole table. > > Ian > > On Nov 21, 2011, at 10:13 AM, Michael Segel wrote: > > > > > Mark, > > I sometimes answer these things while on my iPad. Its not the best way to type in long answers. :-) > > > > Yes, you are correct, I'm saying exactly that. > > > > So imagine you have an HBase Table on a cluster with 10 nodes and 10TB of data. > > If I do a get() I'm asking for a specific row and it will take some time, depending on the row size. For the sake of the example, lets say 5ms. > > If I do a scan(), I'm actually going to go through all of the rows in the table. > > > > Now the Table and the cluster grows to 100 nodes and 100TB of data. > > If I do the get(), it should still take roughly 5ms. > > However if I do the scan() its going to take longer because you're now going through much more data. > > > > Note: I'm talking about a single threaded scan() from a non M/R app or from HBase shell. > > > > This is kind of why getting the right row key, understanding how your data is going to be used, and your schema are all kind of important when it comes to performance. > > (Even flipping the order of the elements that make up your key can have an impact.) > > > > IMHO I think you need to do a lot more thinking and planning when you work with a NoSQL database than you would w an RDBMs. > > > > > >> Date: Mon, 21 Nov 2011 07:43:09 -0800 > >> From: [EMAIL PROTECTED] > >> To: [EMAIL PROTECTED] > >> Subject: Re: Multiple tables vs big fat table > >> > >> Thanks for the detailed explanation. Can you just elaborate on your last > >> comment: > >> > >> In terms of get() performance HBase scales linearly. In terms of scans, it doesn't. > >> > >> Are you saying as my tables get larger and larger that the performance > >> of my scan operations will decline over time but gets will remain constant?
-
Re: Multiple tables vs big fat table
Ian Varley 2011-11-21, 17:11
Certainly, and that's all valid. I just wanted to make it clear to Mark (and others reading) that scans aren't inherently "bad" in HBase, and they don't need to scan the entire table (and usually shouldn't). Short, local scans are very efficient, provided your row keys are sorted in a way that's meaningful to your app. You don't have to do it that way (you can use a hash in the key) but it's a valid design pattern and makes great use of HBase's architecture (the fact that row keys are sorted on disk and in memory).
Ian
On Nov 21, 2011, at 11:04 AM, Michael Segel wrote:
> > Ian, > > The long and short... > > I was using the example to show scan() vs get() and how HBase scales in a linear fashion. > > To your point... if you hash your row key, you can't use start/stop row key values in your scan. > > We have an application where you would have to do a complete scan in order to find a subset of rows which required some work. To get around from having to do a full scan, you could use a secondary index, or table to store the row keys that you want to work with. > The trick is then to query the subset and then split the resulting list. (You can do this by overloading the inputFormat class to take a java list object as your input in to a map/reduce job and then create n evenly splits... (Ok n even splits + 1 split holding the remainder...) ) > > There are always going to be design tradeoffs. The app was designed to give the best performance on reads and then sacrifice m/r performance... (reads from outside of a M/R ) > > >> From: [EMAIL PROTECTED] >> To: [EMAIL PROTECTED] >> Date: Mon, 21 Nov 2011 08:21:56 -0800 >> Subject: Re: Multiple tables vs big fat table >> >> One clarification; Michael, when you say: >> >> "If I do a scan(), I'm actually going to go through all of the rows in the table." >> >> That's if you're doing a *full* table scan, which you'd have to do if you wanted selectivity based on some attribute that isn't part of the key. This is to be avoided in anything other than a map/reduce scenario; you definitely don't want to scan an entire 100TB table every time you want to return 10 rows to your user in real time. >> >> By contrast, however, HBase is perfectly capable of doing *limited* range scans, over some set of sorted rows that are contiguous with respect to their row keys. This continues to be linear in the size of the scanned range, *not* the size of the whole table. In fact, the get() operation is actually built on top of this same scan() operation, but simply restricts itself to one row. (This pre-supposes that you're not manually using a hash for your row keys, of course). >> >> So if you're scanning by a fixed range of your row key space, that continues to be constant with respect to the size of the whole table. >> >> Ian >> >> On Nov 21, 2011, at 10:13 AM, Michael Segel wrote: >> >>> >>> Mark, >>> I sometimes answer these things while on my iPad. Its not the best way to type in long answers. :-) >>> >>> Yes, you are correct, I'm saying exactly that. >>> >>> So imagine you have an HBase Table on a cluster with 10 nodes and 10TB of data. >>> If I do a get() I'm asking for a specific row and it will take some time, depending on the row size. For the sake of the example, lets say 5ms. >>> If I do a scan(), I'm actually going to go through all of the rows in the table. >>> >>> Now the Table and the cluster grows to 100 nodes and 100TB of data. >>> If I do the get(), it should still take roughly 5ms. >>> However if I do the scan() its going to take longer because you're now going through much more data. >>> >>> Note: I'm talking about a single threaded scan() from a non M/R app or from HBase shell. >>> >>> This is kind of why getting the right row key, understanding how your data is going to be used, and your schema are all kind of important when it comes to performance. >>> (Even flipping the order of the elements that make up your key can have an impact.) >>> >>> IMHO I think you need to do a lot more thinking and planning when you work with a NoSQL database than you would w an RDBMs.
|
|