-Re: HBase (BigTable) many to many with students and courses
Ian Varley 2012-05-29, 17:49
A few more responses:
On May 29, 2012, at 10:54 AM, Em wrote:
> In fact, everything you model with a Key-Value-storage like HBase,
> Cassandra etc. can be modeled as an RDMBS-scheme.
> Since a lot of people, like me, are coming from that edge, we must
> re-learn several basic things.
> It starts with understanding that you model a K-V-storage the way you
> want to access the data, not as the data relates to eachother (in
> general terms) and ends with translating the connections of data into a
> K-V-schema as good as possible.
Yes, that's a good way of putting it. I did a talk at HBaseCon this year that deals with some of these questions. The video isn't up yet, but the slides are here:
>> 3. You could also let a higher client layer worry about this. For
>> example, your data layer query just returns a student with a list of
>> their course IDs, and then another process in your client code looks
>> up each course by ID to get the name. You can then put an external
>> caching layer (like memcached) in the middle and make things a lot
>> faster (though that does put the burden on you to have the code path
>> for changing course info also flush the relevant cache entries).
> Hm, in what way does this give me an advantage over using HBase -
> assuming that the number of courses is small enough to fit in RAM - ?
> I know that Memcached is optimized for this purpose and might have much
> faster response times - no doubts.
> However, from a conceptual point of view: Why does Memcached handles the
> K-V-distribution more efficiently than a HBase with warmed caches?
> Hopefully this question isn't that hard :).
The only architectural advantage I can think of is that reads in HBase still have to check the memstore and all relevant file blocks. So even if that's warm in the HBase cache, it's not quite as lightweight. That said, I guess I was also sort of assuming that table you're looking up into would be the smaller one. The details of which is better here would depend on many things; YMMV. I'd personally go with #2 as a default and them optimize based on real workloads, rather than over engineering it.
> Without ever doing it, you never get a real feeling of when to use the
> right tool.
> Using a good tool for the wrong problem can be an interesting
> experience, since you learn some of the do's and don'ts of the software
> you use.
Very good points. Definitely not trying to discourage learning! :) I just always feel compelled to make that caveat early on, while people are making technology evaluations. You can learn to crochet with a rail gun, but I wouldn't recommend it, unless what you really want to learn is about rail guns, not crocheting. :) Sounds like you really want to learn about rail guns.
> Since I am a reader of the MEAP-edition of HBase in Action, I am aware
> of the TwitBase-example application presented in that book.
> I am very interested in seeing the author presenting a solution for
> efficiently accessing the Tweets of the persons I follow.
> This is an n:m-relation.
> You got n users with m tweets and each user is seeing his own tweets as
> well as the tweets of followed persons in descending order by timestamp.
> This must be done with a join within an RDMBs (and maybe in HBase also),
> since I can not think of another scalable way of doing so.
Don't forget about denormalization! You can put copies of the tweets, or at least copies the unique ids of the tweets, into each follower's stream. Yes, that means when Wil Wheaton tweets for the 1000th time about comic con, you get a million copies of the tweet (or ID). But you're trading time & space at write time for extremely fast speeds at write time. Whether this makes sense depends on a zillion other factors, there's no hard & fast rule.
> However, if you do this by a Join, this means that a person with 40.000
> followers needs a batch-request consisting of 40.000 GET-objects. That's
> huge and I bet that this is everything but not fast nor scalable. It
I guess you could say it's "broken by design" (though I'd argue for "unavailable by design" ;). Full join use cases (like you'd do for, say, analytics) don't work well with big data, taking a naive approach.
But at least for the twitter app, that's not actually what you're doing. Instead, you've typically got a pattern like *pagination*: you'd never be requesting 40K objects, at most you'd be requesting 20 or 40 objects (a page worth), along some dimension like time. You're optimizing for getting those really fast, with the knowledge that the stream itself is so big, you'd never offer the user a way to act on it in any kind of bulk way. If you want to do processing like that, you do it asynchronously (e.g. via map/reduce).
That's actually a really interesting way to see the difference between relational DBs and big data non-relational ones: relational databases promise you easy whole-set operations, and big data nosql databases don't, because they assume the "whole set" will be too big for that.
Let's say your product manager for this twitter-like thing said "You know what would be awesome? A widget that shows the average lengths of all tweets in the system, in real time!". And if this product manager was really slick, they might even say "It's easy, look, I even wrote the SQL for it: 'SELECT avg(length(body)) FROM tweets'. I'm a genius."
In a SQL database, this query is going to do a full table scan to get this average, every time you ask for it. It would be in HBase, too; the difference is just that HBase makes you be more explicit about it: no simple declarative "SELECT avg ..." syntax, you have to whip out your iterators and see that you're scanning a billion rows to get your average.
Would it be nice for HBase to ALSO offer declarative and simple ways to do things like joins, averages, etc? Maybe; but as soon as you dip a toe into this, you kind of have to jump in with both feet. Would you