Great summary! The one other thing I would also note is that your
consistency requirements need to be considered. If you want to make atomic
list changes you'll need to go wide.
On Apr 3, 2012 1:00 PM, "Ian Varley" <[EMAIL PROTECTED]> wrote:
> Hi Derek,
> If I understand you correctly, you're ultimately trying to store triples
> in the form "user, valueid, value", right? E.g., something like:
> "user123, firstname, Paul",
> "user234, lastname, Smith"
> (But the usernames are fixed width, and the valueids are fixed width).
> And, your access pattern is along the lines of: "for user X, list the next
> 30 values, starting with valueid Y". Is that right? And these values should
> be returned sorted by valueid?
> The tl;dr version is that you should probably go with one row per
> user+value, and not build a complicated intra-row pagination scheme on your
> own unless you're really sure it is needed.
> Your two options mirror a common question people have when designing HBase
> schemas: should I go "tall" or "wide"? Your first schema is "tall": each
> row represents one value for one user, and so there are many rows in the
> table for each user; the row key is user + valueid, and there would be
> (presumably) a single column qualifier that means "the value". This is
> great if you want to scan over rows in sorted order by row key (thus my
> question above, about whether these ids are sorted correctly). You can
> start a scan at any user+valueid, read the next 30, and be done. What
> you're giving up is the ability to have transactional guarantees around all
> the rows for one user, but it doesn't sound like you need that. Doing it
> this way is generally recommended (see here<
> Your second option is "wide": you store a bunch of values in one row,
> using different qualifiers (where the qualifier is the valueid). The simple
> way to do that would be to just store ALL values for one user in a single
> row. I'm guessing you jumped to the "paginated" version because you're
> assuming that storing millions of columns in a single row would be bad for
> performance, which may or may not be true; as long as you're not trying to
> do too much in a single request, or do things like scanning over and
> returning all of the cells in the row, it shouldn't be fundamentally worse.
> The client has methods that allow you to get specific slices of columns.
> Note that neither case fundamentally uses more disk space than the other;
> you're just "shifting" part of the identifying information for a value
> either to the left (into the row key, in option one) or to the right (into
> the column qualifiers in option 2). Under the covers, every key/value still
> stores the whole row key, and column family name. (If this is a bit
> confusing, take an hour and watch Lars George's excellent video about
> understanding HBase schema design:
> A manually paginated version has lots more complexities, as you note, like
> having to keep track of how many things are in each page, re-shuffling if
> new values are inserted, etc. That seems significantly more complex. It
> might have some slight speed advantages (or disadvantages!) at extremely
> high throughput, and the only way to really know that would be to try it
> out. If you don't have time to build it both ways and compare, my advice
> would be to start with the simplest option (one row per user+value). Start
> simple & iterate! :)
> (Let me know if I've misunderstood your situation.)
> On Apr 2, 2012, at 3:10 PM, Derek Wollenstein wrote:
> We're looking at how to store a large amount of (per-user) list data in
> hbase, and we were trying to figure out what kind of access pattern made
> the most sense. One option is store the majority of the data in a key, so
> we could have something like
> <FixedWidthUserName><FixedWidthValueId1>:"" (no value)