Derek Wollenstein 2012-04-02, 20:10
Ian Varley 2012-04-03, 19:59
Re: Key Design Question for list data
Jacques 2012-04-04, 16:46
Great summary! The one other thing I would also note is that your
consistency requirements need to be considered. If you want to make atomic
list changes you'll need to go wide.

On Apr 3, 2012 1:00 PM, "Ian Varley" <[EMAIL PROTECTED]> wrote:

> Hi Derek,
>
> If I understand you correctly, you're ultimately trying to store triples
> in the form "user, valueid, value", right? E.g., something like:
>
> "user123, firstname, Paul",
> "user234, lastname, Smith"
>
> (But the usernames are fixed width, and the valueids are fixed width).
>
> And, your access pattern is along the lines of: "for user X, list the next
> 30 values, starting with valueid Y". Is that right? And these values should
> be returned sorted by valueid?
>
> The tl;dr version is that you should probably go with one row per
> user+value, and not build a complicated intra-row pagination scheme on your
> own unless you're really sure it is needed.
>
> Your two options mirror a common question people have when designing HBase
> schemas: should I go "tall" or "wide"? Your first schema is "tall": each
> row represents one value for one user, and so there are many rows in the
> table for each user; the row key is user + valueid, and there would be
> (presumably) a single column qualifier that means "the value". This is
> great if you want to scan over rows in sorted order by row key (thus my
> question above, about whether these ids are sorted correctly). You can
> start a scan at any user+valueid, read the next 30, and be done. What
> you're giving up is the ability to have transactional guarantees around all
> the rows for one user, but it doesn't sound like you need that. Doing it
> this way is generally recommended (see here
> <http://hbase.apache.org/book.html#schema.smackdown>).
>
> Your second option is "wide": you store a bunch of values in one row,
> using different qualifiers (where the qualifier is the valueid). The simple
> way to do that would be to just store ALL values for one user in a single
> row. I'm guessing you jumped to the "paginated" version because you're
> assuming that storing millions of columns in a single row would be bad for
> performance, which may or may not be true; as long as you're not trying to
> do too much in a single request, or do things like scanning over and
> returning all of the cells in the row, it shouldn't be fundamentally worse.
> The client has methods that allow you to get specific slices of columns.
>
> Note that neither case fundamentally uses more disk space than the other;
> you're just "shifting" part of the identifying information for a value
> either to the left (into the row key, in option one) or to the right (into
> the column qualifiers in option 2). Under the covers, every key/value still
> stores the whole row key, and column family name. (If this is a bit
> confusing, take an hour and watch Lars George's excellent video about
> understanding HBase schema design:
> http://www.youtube.com/watch?v=_HLoH_PgrLk).
>
> A manually paginated version has lots more complexities, as you note, like
> having to keep track of how many things are in each page, re-shuffling if
> new values are inserted, etc. That seems significantly more complex. It
> might have some slight speed advantages (or disadvantages!) at extremely
> high throughput, and the only way to really know that would be to try it
> out. If you don't have time to build it both ways and compare, my advice
> would be to start with the simplest option (one row per user+value). Start
> simple & iterate! :)
>
> (Let me know if I've misunderstood your situation.)
>
> Ian
>
> On Apr 2, 2012, at 3:10 PM, Derek Wollenstein wrote:
>
> We're looking at how to store a large amount of (per-user) list data in
> hbase, and we were trying to figure out what kind of access pattern made
> the most sense. One option is store the majority of the data in a key, so
> we could have something like
> <FixedWidthUserName><FixedWidthValueId1>:"" (no value)
> <FixedWidthUserName
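
To make the "tall" access pattern from the quoted message concrete, here is a rough in-memory sketch of "start a scan at any user+valueid, read the next 30". It is plain Python, not the HBase client API: the `TallTable` class, the `row_key` helper, and the widths 7 and 6 are all made up for illustration, and the sorted list only stands in for HBase keeping rows ordered by row key.

```python
import bisect

USER_WIDTH = 7   # assumed fixed width for user names, e.g. "user123"
ID_WIDTH = 6     # assumed fixed width for value ids, e.g. "000042"


def row_key(user, value_id):
    """Tall schema: row key = fixed-width user + zero-padded value id."""
    assert len(user) == USER_WIDTH
    return user + str(value_id).zfill(ID_WIDTH)


class TallTable:
    """Hypothetical stand-in for a table whose rows stay sorted by row key."""

    def __init__(self):
        self._keys = []     # row keys, kept in sorted order
        self._values = {}   # row key -> value

    def put(self, user, value_id, value):
        key = row_key(user, value_id)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, user, start_id, limit=30):
        """Return up to `limit` (value_id, value) pairs for one user."""
        start = bisect.bisect_left(self._keys, row_key(user, start_id))
        page = []
        for key in self._keys[start:start + limit]:
            if not key.startswith(user):   # ran past this user's rows
                break
            page.append((int(key[USER_WIDTH:]), self._values[key]))
        return page


table = TallTable()
for i in range(100):
    table.put("user123", i, f"value-{i}")
table.put("user234", 0, "another user's value")

first_page = table.scan("user123", start_id=0)    # value ids 0..29
last_page = table.scan("user123", start_id=90)    # value ids 90..99 only
```

The zero-padding is what makes byte-wise key order agree with numeric valueid order, which is the point behind Ian's question about whether the ids sort correctly. The scan also has to stop at the user boundary itself, since the next key in sorted order belongs to a different user.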
Derek Wollenstein 2012-04-04, 18:01
Ian Varley 2012-04-04, 18:15
Derek Wollenstein 2012-04-04, 18:20