Key Design Question for list data (HBase user mailing list)


Re: Key Design Question for list data
Great summary! The one other thing I would note is that your
consistency requirements need to be considered. If you want to make atomic
list changes, you'll need to go wide.
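
For example, in HBase a single Put that touches several qualifiers in one
row is applied atomically, which you can't get across rows. A minimal
sketch (the "user_values" table and "v" family are made-up names; this
uses the 0.94-era client API):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  Configuration conf = HBaseConfiguration.create();
  HTable table = new HTable(conf, "user_values");
  // One row per user: both list entries live in the same row...
  Put put = new Put(Bytes.toBytes("user123"));
  put.add(Bytes.toBytes("v"), Bytes.toBytes("valueid1"), Bytes.toBytes("foo"));
  put.add(Bytes.toBytes("v"), Bytes.toBytes("valueid2"), Bytes.toBytes("bar"));
  // ...so both cells become visible together, or not at all.
  table.put(put);
  table.close();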
On Apr 3, 2012 1:00 PM, "Ian Varley" <[EMAIL PROTECTED]> wrote:

> Hi Derek,
>
> If I understand you correctly, you're ultimately trying to store triples
> in the form "user, valueid, value", right? E.g., something like:
>
> "user123, firstname, Paul",
> "user234, lastname, Smith"
>
> (But the usernames are fixed width, and the valueids are fixed width).
>
> And, your access pattern is along the lines of: "for user X, list the next
> 30 values, starting with valueid Y". Is that right? And these values should
> be returned sorted by valueid?
>
> The tl;dr version is that you should probably go with one row per
> user+value, and not build a complicated intra-row pagination scheme on your
> own unless you're really sure it is needed.
>
> Your two options mirror a common question people have when designing HBase
> schemas: should I go "tall" or "wide"? Your first schema is "tall": each
> row represents one value for one user, and so there are many rows in the
> table for each user; the row key is user + valueid, and there would be
> (presumably) a single column qualifier that means "the value". This is
> great if you want to scan over rows in sorted order by row key (thus my
> question above, about whether these ids are sorted correctly). You can
> start a scan at any user+valueid, read the next 30, and be done. What
> you're giving up is the ability to have transactional guarantees around all
> the rows for one user, but it doesn't sound like you need that. Doing it
> this way is generally recommended (see
> http://hbase.apache.org/book.html#schema.smackdown).
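>
> As a concrete sketch (table, family, and qualifier names here are
> invented, and this assumes the 0.94-era client API), "the next 30 values
> for user X, starting at valueid Y" becomes one short scan over the tall
> table:
>
>   import org.apache.hadoop.hbase.client.*;  // HTable, Scan, Result, ...
>   import org.apache.hadoop.hbase.util.Bytes;
>
>   // Row key = fixed-width user + fixed-width valueid, so concatenating
>   // the two gives the exact start position for the scan.
>   Scan scan = new Scan(Bytes.add(userBytes, startValueIdBytes));
>   scan.setCaching(30);                 // fetch the whole page in one RPC
>   ResultScanner scanner = table.getScanner(scan);
>   int seen = 0;
>   for (Result r : scanner) {
>     if (!Bytes.startsWith(r.getRow(), userBytes)) break; // past this user
>     byte[] value = r.getValue(Bytes.toBytes("v"), Bytes.toBytes("value"));
>     // ... use value ...
>     if (++seen == 30) break;
>   }
>   scanner.close();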
>
> Your second option is "wide": you store a bunch of values in one row,
> using different qualifiers (where the qualifier is the valueid). The simple
> way to do that would be to just store ALL values for one user in a single
> row. I'm guessing you jumped to the "paginated" version because you're
> assuming that storing millions of columns in a single row would be bad for
> performance, which may or may not be true; as long as you're not trying to
> do too much in a single request, or do things like scanning over and
> returning all of the cells in the row, it shouldn't be fundamentally worse.
> The client has methods that allow you to get specific slices of columns.
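>
> For example (again just a sketch with invented names), ColumnRangeFilter
> can pull the same "next 30 starting at valueid Y" page out of one wide
> row:
>
>   import org.apache.hadoop.hbase.KeyValue;
>   import org.apache.hadoop.hbase.client.Get;
>   import org.apache.hadoop.hbase.client.Result;
>   import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
>   import org.apache.hadoop.hbase.util.Bytes;
>
>   Get get = new Get(Bytes.toBytes("user123"));   // one row = one user
>   // Qualifiers >= startValueIdBytes, with no upper bound:
>   get.setFilter(new ColumnRangeFilter(startValueIdBytes, true, null, false));
>   get.setMaxResultsPerColumnFamily(30);          // cap the page at 30 cells
>   Result r = table.get(get);
>   for (KeyValue kv : r.raw()) {
>     byte[] valueId = kv.getQualifier();  // the valueid is the qualifier
>     byte[] value = kv.getValue();
>     // ... use valueId / value ...
>   }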
>
> Note that neither case fundamentally uses more disk space than the other;
> you're just "shifting" part of the identifying information for a value
> either to the left (into the row key, in option 1) or to the right (into
> the column qualifiers, in option 2). Under the covers, every key/value still
> stores the whole row key and column family name. (If this is a bit
> confusing, take an hour and watch Lars George's excellent video about
> understanding HBase schema design:
> http://www.youtube.com/watch?v=_HLoH_PgrLk).
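>
> For instance, a single cell for user "user123", valueid "0007", value
> "foo" would be laid out roughly like this under each schema (assuming a
> column family named "v"):
>
>   Option 1 (tall): row="user123"+"0007", family="v", qualifier="value", value="foo"
>   Option 2 (wide): row="user123",        family="v", qualifier="0007",  value="foo"
>
> The same identifying bytes are stored either way; they just move between
> the row key and the column qualifier.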
>
> A manually paginated version has lots more complexities, as you note, like
> having to keep track of how many things are in each page, re-shuffling if
> new values are inserted, etc. That seems significantly more complex. It
> might have some slight speed advantages (or disadvantages!) at extremely
> high throughput, and the only way to really know that would be to try it
> out. If you don't have time to build it both ways and compare, my advice
> would be to start with the simplest option (one row per user+value). Start
> simple & iterate! :)
>
> (Let me know if I've misunderstood your situation.)
>
> Ian
>
> On Apr 2, 2012, at 3:10 PM, Derek Wollenstein wrote:
>
> We're looking at how to store a large amount of (per-user) list data in
> HBase, and we were trying to figure out what kind of access pattern made
> the most sense. One option is to store the majority of the data in a key, so
> we could have something like
> <FixedWidthUserName><FixedWidthValueId1>:"" (no value)
> <FixedWidthUserName