HBase, mail # user - Parent/child relation - go vertical, horizontal, or many tables?

RE: Parent/child relation - go vertical, horizontal, or many tables?
Jonathan Gray 2011-02-11, 20:48
Just to chime in with my usual take on this (seems like the tall vs. wide discussion happens every few weeks...)

For "get all children of a parent", doing a get() on the wide table vs. doing a scan() on the tall table (as long as you set scanner caching appropriately) will be almost identical.  I wouldn't expect any difference in performance if you are properly tuning parameters *EXCEPT* that today a Scan will always require more than one RPC because the API is such that you need to open the scanner first, then call next() on it, and then close() it.  This is a current API limitation, but we could implement an optimization to allow for single-RPC scans if the query can be fulfilled in a single response (start row, stop row, and scanner caching set appropriately).  A Get, on the server side, does exactly the same thing but in a single RPC (it opens a scanner, calls next() on it, and then closes it).
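
To make the two access paths concrete, here is a toy in-memory sketch in Python. It is not the HBase client API: the "/" separator, the sample keys, and the helper names wide_get and tall_scan are all made up for illustration. It only shows that a single-row lookup on the wide layout and a start-row/stop-row range scan on the tall layout return the same set of children.

```python
from bisect import bisect_left

SEP = "/"  # hypothetical separator between parent and child in the tall row key


def wide_get(table, parent):
    """Wide layout: one row per parent, one column per child."""
    return dict(table.get(parent, {}))


def tall_scan(rows, parent):
    """Tall layout: range scan over sorted row keys in [start, stop)."""
    start = parent + SEP
    # Sorts after every key that begins with parent + SEP.
    stop = parent + chr(ord(SEP) + 1)
    keys = [k for k, _ in rows]
    lo = bisect_left(keys, start)
    hi = bisect_left(keys, stop)
    # Strip the "parent/" prefix so the result is keyed by child id.
    return {k[len(start):]: v for k, v in rows[lo:hi]}


# Sample data (assumed): same relations stored both ways.
wide = {"p1": {"c1": "a", "c2": "b"}, "p2": {"c1": "c"}, "p10": {"c1": "d"}}
tall = sorted([("p1/c1", "a"), ("p1/c2", "b"), ("p2/c1", "c"), ("p10/c1", "d")])

print(wide_get(wide, "p1"))
print(tall_scan(tall, "p1"))
```

Note that the stop key keeps the scan for "p1" from running into the rows of "p10", which is the kind of detail the compound row key design has to get right.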

The fact that a row cannot cross a region boundary is a consideration, but unless your rows will be many gigabytes each, I don't think this is that important.  Having to cross a region boundary to fulfill the "get all children" query would be my primary worry.

Now besides those considerations above, the other two queries you want (parent-child point lookups and parent-child additions) are virtually identical in performance on the server side as of HBase 0.90.  We have the same block-seeking optimizations in both schemas for the read case, and the write case is identical in both.

The only other thing to consider is what happens if all the children of one parent can't fit in memory at the same time.  This is not at all related to a region getting too big (there is no requirement of fitting a region into memory) but is a consideration for reading it in a single RPC (both on the server side and also when receiving it in your client).  However, you would deal with this the same way in the tall or wide case.  In the tall case, you would appropriately set the scanner caching number.  In the wide case, you would set the intra-row scan limit.  In this case, you will be forced to use the Scan API regardless, because if you need multiple RPCs for a single row, you need the Scanner next() semantics.
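
The batching idea above can be sketched the same way for both layouts. This is a toy model in Python, not HBase code: fetch_in_batches and the sample child list are hypothetical, and real RPC counts on a cluster are not modeled. The limit plays the role of scanner caching (rows per fetch) in the tall case and of the intra-row limit (columns per fetch) in the wide case; in the Java client those knobs are, as far as I know, Scan.setCaching and Scan.setBatch.

```python
from itertools import islice


def fetch_in_batches(items, limit):
    """Yield chunks of at most `limit` items, one chunk per simulated fetch."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, limit))
        if not chunk:
            return
        yield chunk


# Sample data (assumed): ten children of a single parent.
children = [f"c{i}" for i in range(10)]

# Only `limit` children are held in memory per fetch, whichever layout
# (rows in a tall table, columns in a wide row) they come from.
batches = list(fetch_in_batches(children, 4))
print([len(b) for b in batches])
```

The client-side loop looks the same either way, which is why the memory concern does not favor one schema over the other.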

Many times, this decision comes down to a matter of personal preference.  I lean towards wide tables these days unless I expect extremely high numbers of children (so I want to split across regions and RPC requests) and I expect to frequently run the get-all-children query with high numbers of children.


> -----Original Message-----
> From: Michael Segel [mailto:[EMAIL PROTECTED]]
> Sent: Friday, February 11, 2011 12:23 PM
> Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
> David,
> First a caveat... You need to have a realistic notion of the data and its sizes
> when considering your options...
> With respect to the response, Here's what I said:
> -=-
> "With respect to your issue about a row being too large to fit in to memory...
>  This would imply that the row would be too large to fit in to a single region.
> Wouldn't that cause your HBase to die a horrible death?
>  If this really is a potential situation, then you should consider the
> parent_key, child_id compound row key..."
> -=-
> Now a correction. If you insert a row that is larger than a region, the region
> will grow to fit the row and will not split. So until your row exceeds the size of
> available disk... you can do it. So yeah you could fill up memory...
> And that's the only reason why I would recommend option 2 over option 1.
> So how real is this scenario?
> Looking at the 3 stated use cases...  Doing a get() on the parent ID will give
> you the entire set of children for the parent in a single fetch.
> If you limit the columns to either a single column or a set of columns, you are
> still going to be a single get().
> This is going to be faster than doing a scan() on a series of row starting with