Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase, mail # user - HBase (BigTable) many to many with students and courses

Copy link to this message
Re: HBase (BigTable) many to many with students and courses
Ian Varley 2012-05-29, 19:26

On May 29, 2012, at 1:24 PM, Em wrote:

>> But you're trading time & space at write time for extremely fast
>> speeds at write time.
> You ment "extremely fast speeds at read time", don't you?

Ha, yes, thanks. That's what I meant.

> However this means that Sheldon has to do at least two requests to fetch
> his latest tweets.
> First: Get the latest columns (aka tweets) of his row and second do a
> multiget to fetch their content. Okay, that's more than one request but
> you get what I mean. Am I right?

Yup, unless you denormalize the tweet bodies as well--then you just read the current user's record and you have everything you need (with the downside of massive data duplication).

> Although I got this little overhead here, does read perform at low
> latency in a large scale environment? I think that a RDMBS has to do
> something equal, doesn't it?

Yes, but the relational DB has it all on one node, whereas in a distributed database, it's as many RPC calls as you have nodes, or something on the order of that (see Nicolas's explanation, which is better than mine).

The key difference is that that read access time for HBase is still constant when you have PB of data (whereas with a PB of data, the relational DB has long since fallen over). The underlying reason for that is that HBase basically took the top level of the seek operation (which is a fast memory access into the top node of a B+ tree, if you're talking about an RDBMS on one machine) and made it a cross machine lookup ("which server would have this data?"). So it's fundamentally a lot slower than an RDBMS, but still constant w/r/t your overall data size. If you remember your algorithms class, constant factors fall out when N gets larger, and you only care about the big O. :)

> Maybe it makes sense to have a desing of a key so that all the  relevant
> tweets for a user are placed at the same region?

Again, that works in a denormalized sense, where you lead each row key with the current user (the tweet recipient) and copy the tweets there. If you are saying that you'd somehow find a magic formula so that people who follow each other happen to be on the same region server, you can forget it. Facebook and Twitter have tried partitioning schemes like that for their social graph data, and found there's no magic boundaries, even countries. (I know, citation needed, and I can't find it just now, but I'm sure I read that somewhere. :) But it makes sense; I can follow anybody, anybody can follow me, so on average my followers and followees would be randomly sprayed across all region servers. I either denormalize stuff (so my row contains everything I need) or I commit to doing GETs to all other region servers when I need that data. And if I really need it to be low latency, it's not a winning bet to have every read spray across a lot of servers like that. (That's just my opinion, perhaps I'm wrong about that.)

> Maybe one better does a tall-schema, since row-locking can impact
> performance a lot for people who follow a lot of other persons but
> that's not the topic here.

Yes, very true. We also haven't broached the subject of how long writes would take if you denormalize, and whether they should be asynchronous (for everyone, or just for the Wil Wheatons). If you put all of my incoming tweets into a single row for me, that's potentially lock-wait city. Twitter has a follow limit, maybe that's why. :)