-
Parent/child relation - go vertical, horizontal, or many tables?
Jason 2011-02-11, 00:55
Hi all,
Let's say I have two entities, Parent and Child. A parent can have many children (from hundreds to tens of millions). A child can belong to only one parent.
Typical queries are:
- Fetch all children from a single parent
- Find a few children by their keys or values from a single parent
- Update a single child by child key and its parent key
And there are no cross-parent queries.
I am trying to figure out which schema approach is better from a performance/maintenance perspective:
1. One table with one Parent per row. Row key is a parent id. Children are stored in a single family each under separate qualifier (child id). Would it even work assuming all children may not fit in memory?
2. One table. Compound row key parent id/child id. One child per row.
3. Many tables - one per parent. Row key is a child id.
Thanks!
-
Re: Parent/child relation - go vertical, horizontal, or many tables?
Ryan Rawson 2011-02-11, 00:57
You want to choose the schema that minimizes the # of RPCs you are doing.
-ryan
-
Re: Parent/child relation - go vertical, horizontal, or many tables?
Andrey Stepachev 2011-02-11, 09:46
In such a case I think you can use tall tables with parent:child keys, plus filters or range scans, to get the children.
Your queries will be:
- Fetch all children from a single parent
scan [parent:0, parent+1:0)
- Find a few children by their keys or values from a single parent
scan [parent:min_of_child_keys, parent:max_of_child_key + 1] + filterset (or custom hash filter). If there are too many keys, you can use HTable.getRegionLocation to split the children into parallel scans on different regions.
- Update a single child by child key and its parent key
Easy (in all cases a simple put, or get+put if it is a true update rather than an overwrite).
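The three tall-table queries above can be sketched against a toy sorted key-value store. This is only an illustration of the key layout and the range arithmetic, not HBase client code: `TallTable`, `row_key`, the ":" separator, and the fixed-width zero padding are all assumptions made for the example (padding is one way to make lexicographic key order match numeric order).

```python
import bisect

SEP = ":"  # assumed separator between parent id and child id


def row_key(parent_id: int, child_id: int) -> str:
    # Zero-pad both parts so that string order matches numeric order.
    return f"{parent_id:010d}{SEP}{child_id:010d}"


class TallTable:
    """Toy stand-in for a table sorted by row key (not an HBase client)."""

    def __init__(self):
        self._keys = []     # row keys, kept sorted
        self._values = {}   # row key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


def all_children(table, parent_id):
    # scan [parent:0, parent+1:0)
    return list(table.scan(row_key(parent_id, 0), row_key(parent_id + 1, 0)))


def some_children(table, parent_id, wanted):
    # Narrow the scan to [min wanted, max wanted + 1), then filter by key set.
    wanted = set(wanted)
    start = row_key(parent_id, min(wanted))
    stop = row_key(parent_id, max(wanted) + 1)
    keys = {row_key(parent_id, c) for c in wanted}
    return [(k, v) for k, v in table.scan(start, stop) if k in keys]
```

An update of a single child is then just `table.put(row_key(parent_id, child_id), new_value)`.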
-
RE: Parent/child relation - go vertical, horizontal, or many tables?
Michael Segel 2011-02-11, 13:51
Jason,
You have the following constraint: for each child there is one parent. A parent can have more than one child.
While you don't specify size of the child, when a parent can have tens of millions, that could become an issue. Assuming that the child is relatively small...
You have 3 use cases: (Scan patterns)
> -Fetch all children from a single parent
> -Find a few children by their keys or values from a single parent
> -Update a single child by child key and its parent key
Your options...
> 1. One table with one Parent per row. Row key is a parent id. Children are stored in a single family each under separate qualifier (child id). Would it even work assuming all children may not fit in memory?
While you raise an interesting point, let's look at the schema as a solution. This works well because you can fetch the entire row based on the parent key. So all queries are get()s and not scan()s.
You can then pull all of the existing columns, where each column represents a child.
You can also do a get() of only those columns you want, using child_id as the column name.
You can also do a get() or a put() of a specific column (child_id) for a given parent (row key).
With respect to your issue about a row being too large to fit into memory... This would imply that the row would be too large to fit into a single region. Wouldn't that cause your HBase to die a horrible death?
If this really is a potential situation, then you should consider the parent_key, child_id compound row key...
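For comparison with the compound-key layout, here is a minimal sketch of the wide layout described above, where one row holds a parent and each column qualifier is a child id. `WideTable` is a hypothetical in-memory model used only to show the access pattern; it is not the HBase API and says nothing about real-world row size limits.

```python
from collections import defaultdict


class WideTable:
    """Toy model: row key (parent id) -> {qualifier (child id): value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, parent_id, child_id, value):
        # Update or insert a single child column in the parent's row.
        self._rows[parent_id][child_id] = value

    def get(self, parent_id, child_ids=None):
        # One lookup by row key; optionally restricted to some columns.
        row = self._rows.get(parent_id, {})
        if child_ids is None:
            return dict(row)
        return {c: row[c] for c in child_ids if c in row}
```

All three use cases map to a single keyed operation here: `get(parent)` for all children, `get(parent, [ids])` for a few, and `put(parent, child, value)` for an update.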
> 2. One table. Compound row key parent id/child id. One child per row.
Based on your use cases, I wouldn't recommend this. While it is a valid schema, it is only 'optimal' for your 'Update a single child by child key and its parent key'.
> 3. Many tables - one per parent. Row key is a child id.
If you have a scenario where a parent has billions+ of children, this could be a valid choice. However, based on what you said (up to tens of millions), and given that the data set is unique and non-intersecting, you would be better off with a single table. (Too many tables is not a good thing in HBase.)
HTH
-Mike
-
RE: Parent/child relation - go vertical, horizontal, or many tables?
Buttler, David 2011-02-11, 18:45
Michael, Thanks for the analysis. The thought process you put into this seems useful. However, following along at home I came to a different conclusion than you did. I would prefer (sol. 2) over (sol. 3) for the reason you mention, but I would also strongly prefer (sol. 2) over (sol. 1), also for the reason you mention.
So, I don't see how you can not recommend (sol. 2). It seems like (sol. 1) would be very wasteful for use cases (u2) and (u3). The only time it would help is in (u1). And then it doesn't seem obvious to me that a single row is better except in cases where there are very few children per parent.
Perhaps if the data is expected to have a power law distribution (fat tail, zipfian), a hybrid approach would be better: go with (sol. 1) for any parent that has fewer than (say 10) children. But, after a parent fills up its first 10 children, start populating rows like (sol. 2).
This would definitely make the client code more complex, so it would only make sense if there were huge savings to be had.
Maybe a slightly better implementation of the hybrid would be to divide the child key space up into buckets so that you can directly address any child, but still have fewer calls in retrieving all children. Then you can adjust your bucket size based on your actual use case (with a bucket size of 1 being the special case of (sol. 2)).
But the more I think about it, the more I suspect that the added complexity will not be worth it, and he should just go with (sol. 2).
Dave
-
RE: Parent/child relation - go vertical, horizontal, or many tables?
Michael Segel 2011-02-11, 20:22
David,
First a caveat... You need to have a realistic notion of the data and its sizes when considering your options...
With respect to the response, here's what I said:
-=-
"With respect to your issue about a row being too large to fit in to memory... This would imply that the row would be too large to fit in to a single region. Wouldn't that cause your HBase to die a horrible death?
If this really is a potential situation, then you should consider the parent_key, child_id compound row key..."
-=-
Now a correction. If you insert a row that is larger than a region, the region will grow to fit the row and will not split. So until your row exceeds the size of available disk... you can do it. So yeah, you could fill up memory...
And that's the only reason why I would recommend option 2 over option 1. So how real is this scenario?
Looking at the 3 stated use cases... Doing a get() on the parent ID will give you the entire set of children for the parent in a single fetch. If you limit the columns to either a single column or a set of columns, you are still doing a single get().
This is going to be faster than doing a scan() on a series of rows starting with parent_id and stopping at parent_id+1. (At least in theory. I haven't mocked this out and tried it.)
Again the only advantage of option 2 is if you really are worried about your data size blowing you out of the water. If you do find yourself using a lot of memory to fetch your edge cases, then you'd be better off with the second option.
With option 2 you have the following:
1) Fetching all of the children (scan() with a start and stop key)
2) Fetching some of the rows (scan() with a start and stop key and some sort of filter)
3) Fetching a single child (get() using a combination of parent_id, child_id for the key)
So while you don't have to worry about the size of a row, you do not get the same performance that you could with option 1.
Does that make sense?
-Mike
-
RE: Parent/child relation - go vertical, horizontal, or many tables?
Jonathan Gray 2011-02-11, 20:48
Just to chime in with my usual take on this (seems like the tall vs. wide discussion happens every few weeks...)
For "get all children of a parent", doing a get() on the wide table vs. doing a scan() on the tall table (as long as you set scanner caching appropriately) will be almost identical. I wouldn't expect any difference in performance if you are properly tuning parameters *EXCEPT* that today a Scan will always require more than one RPC because the API is such that you need to open the scanner first, and then do next() on it, and then close() it. This is a current API limitation but we could implement an optimization to allow for single-RPC scans if the query can be fulfilled in a single response (start row, stop row, and scanner caching set appropriately). A Get, on the server-side, does this exact same thing but in a single RPC (it opens a scanner, next() on it, and then close() it).
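The RPC difference described above can be put into rough numbers. This is a deliberately simplified model of that description (one call to open the scanner, one next() call per batch of `caching` rows, one call to close); the function names are made up for the example, and a real client may issue additional calls that are not counted here.

```python
import math


def get_rpcs() -> int:
    # A Get is served in a single round trip.
    return 1


def scan_rpcs(num_rows: int, caching: int) -> int:
    # Simplified model: open + next() per batch of `caching` rows + close.
    if caching < 1:
        raise ValueError("caching must be at least 1")
    next_calls = max(1, math.ceil(num_rows / caching))
    return 1 + next_calls + 1
```

Under this model, 500 child rows with caching set to 1000 cost 3 round trips against 1 for the wide-row Get, while 10,000 rows with caching set to 100 cost 102. That is why the scanner caching value matters far more than the tall-versus-wide choice itself.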
The fact that a row cannot cross a region boundary is a consideration, but unless your rows will be many gigabytes each, I don't think this is that important. Having to cross a region boundary to fulfill the "get all children" query would be my primary worry.
Now besides those considerations above, the other two queries you want (parent-child point lookups and parent-child additions) are virtually identical in performance on the server-side starting with HBase 0.90 and beyond. We have the same block-seeking optimizations in both schemas for the read case, and the write case is identical in both.
The only other thing to consider is what if all the children of one parent can't fit in memory at the same time. This is not at all related to a region getting too big (there is no requirement of fitting a region into memory) but is a consideration for reading it in a single RPC (both on the server-side and also receiving it in your client). However, you would deal with this the same way in the tall or wide case. In the tall case, you would appropriately set the scanner caching number. In the wide case, you would set the intra-row scan limit. In this case, you will be forced to use the Scan API regardless because if you need multiple RPCs for a single row, you need the Scanner next() semantics.
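The idea of reading a very wide row in several pieces can be sketched as a generator that hands back a bounded number of columns per step. This is a toy model only: `read_in_batches` and its `batch` parameter are illustrative names, standing in for whatever intra-row limit the client exposes (in the HBase Java client, scanner caching and the intra-row limit are normally set via `Scan.setCaching` and `Scan.setBatch`).

```python
from typing import Dict, Iterator, List, Tuple


def read_in_batches(
    row: Dict[str, bytes], batch: int
) -> Iterator[List[Tuple[str, bytes]]]:
    """Yield the row's columns in qualifier order, at most `batch` at a time."""
    if batch < 1:
        raise ValueError("batch must be at least 1")
    qualifiers = sorted(row)
    for i in range(0, len(qualifiers), batch):
        # Each yielded chunk models one bounded response to the client.
        yield [(q, row[q]) for q in qualifiers[i:i + batch]]
```

The caller only ever holds one chunk in memory, which is the same property that scanner caching gives in the tall case, where each chunk is a group of rows instead of a group of columns.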
Many times, this decision comes down to a matter of personal preference. I lean towards wide tables these days unless I expect extremely high numbers of children (so I want to split across regions and RPC requests) and I expect to frequently run the get-all-children query with high numbers of children.
JG
-
RE: Parent/child relation - go vertical, horizontal, or many tables?
Michael Segel 2011-02-11, 20:59
Jonathan,
Thanks for the response.
> The fact that a row cannot cross a region boundary is a consideration, but unless your rows will be many gigabytes each, I don't think this is that important. Having to cross a region boundary to fulfill the "get all children" query would be my primary worry.
That would be an issue if you have a tall table with many rows. Assuming you had enough children to break the wide row and the children were relatively big...
> Now besides those considerations above, the other two queries you want (parent-child point lookups and parent-child additions) are virtually identical in performance on the server-side starting with HBase 0.90 and beyond. We have the same block-seeking optimizations in both schemas for the read case, and the write case is identical in both.
This is interesting.
So essentially the pat response these days is either "... it depends..." or "YMMV".
Because the OP didn't really say how wide the rows would be or how frequently he would have wide rows... I'd still lean to wide rows... But it is good to know about the improvements in 0.90.
Thx
-Mike
-
Re: Parent/child relation - go vertical, horizontal, or many tables?
Ryan Rawson 2011-02-12, 07:00
If you don't think a row would get beyond thousands of columns, I'd go with wide columns. Once you get to 10k, 100k, or millions, things might get a little weird. Performance on huge rows is difficult because we have to materialize the entire row at a time. There are options in scan to return partial rows though. Also, a region will eventually become a single-row region unable to split.
But wide columns aren't in general to be avoided, only if you can't predict the ultimate width.
-ryan
-
Re: Parent/child relation - go vertical, horizontal, or many tables?
Jason 2011-02-12, 20:36
Thank you all for the great insight. Based on your thoughts I am going to try a hybrid approach: split children into buckets based on id range and store one bucket per row. The row key would then be parent-id:bucket-id, where bucket-id = child-id/n and n is the bucket size, chosen specifically to prevent rows from becoming too wide.
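A minimal sketch of that bucketed key, assuming numeric child ids, integer division for child-id/n, and fixed-width zero padding so that keys sort numerically. The function names and the padding width are assumptions for the example, not something specified in the thread.

```python
def bucket_row_key(parent_id: int, child_id: int, n: int) -> str:
    """Row key parent-id:bucket-id, with bucket-id = child-id // n."""
    if n < 1:
        raise ValueError("bucket size n must be at least 1")
    return f"{parent_id:010d}:{child_id // n:010d}"


def locate(parent_id: int, child_id: int, n: int):
    """Return (row key, column qualifier) that addresses one child."""
    return bucket_row_key(parent_id, child_id, n), f"{child_id:010d}"
```

With n = 1000, child 2500 of parent 7 lands in row `0000000007:0000000002` under qualifier `0000002500`, so any child is still directly addressable, while fetching all children scans roughly one row per 1000 child ids. With n = 1 this reduces to one child per row, i.e. the compound-key schema.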