On May 29, 2012, at 3:25 PM, Em wrote:
Yup, unless you denormalize the tweet bodies as well--then you just read the current user's record and you have everything you need (with the downside of massive data duplication).
Well, I think this would be bad practice for editable stuff like tweets.
They can be deleted, updated etc.. Furthermore at some point the data
duplication will come at its expense - big data or not.
Do you aggree?
You're probably right, yes. For the real Twitter, denormalizing every tweet to every follower would likely be ruinously expensive. For something like a message-based app (like, say, chat), it's much more plausible because there's a fixed and limited number of participants who'd see any given message.
The important point here, though, is the "YMMV" part ("Your Mileage May Vary", for those not familiar with the idiom). This kind of stuff is exceedingly difficult to give easy, general solutions to because every case is different. How many people does the average user follow? How big are the message bodies? What's the ratio of CPUs to disk spindles? Etc.
Yes, but the relational DB has it all on one node, whereas in a distributed database, it's as many RPC calls as you have nodes, or something on the order of that (see Nicolas's explanation, which is better than mine).
Ah, sorry, I mean "N Keywal" who replied upthread. His name is Nicolas. :) His response was about under what situations the GET requests would be batched or sent separately.
The key difference is that that read access time for HBase is still constant when you have PB of data (whereas with a PB of data, the relational DB has long since fallen over). The underlying reason for that is that HBase basically took the top level of the seek operation (which is a fast memory access into the top node of a B+ tree, if you're talking about an RDBMS on one machine) and made it a cross machine lookup ("which server would have this data?"). So it's fundamentally a lot slower than an RDBMS, but still constant w/r/t your overall data size. If you remember your algorithms class, constant factors fall out when N gets larger, and you only care about the big O. :)
Yes :) I am interested in the underlying algorithms of HBase's
hash-algorithms. I think it's an interesting approach.
It has no hash-join algorithms built in, if that's what you mean. But you can read the client access code path, and see the anatomy of a scan--how it finds out which regions to hit, creates requests to all of them, combines the results, etc.
If you mean the overall approach of HBase, as contrasted with the B+ trees used by relational databases: this is based on the ideas of "Log Structured Merge Trees"; there original paper on it is here:
If we go down the line of data duplication, why not generate keys in a
way that all of the user's tweet-stream's tweets end up at the same
This way, doing a multiget you only call one or two region servers for
all your tweets.
What would be an example of this? Say we have tweets from 4 users. The normalized "tweet" table might have data like:
TweetID Time User Body
--------- ---- ---- -----------------
123 T1 A Hello world
234 T2 B OMG BIEBER!
345 T3 C RT @B OMG BIEBER!
456 T4 A I hate twitter
567 T5 B ZOMG MORE BIEBER!
If we denorm just the IDs by follower (let's say everyone follows everyone here, and you don't see your own tweets), we'd get:
Follower Time Followee TweetID
-------- ---- --------- -----------
A T2 B 234
A T3 C 345
A T5 B 567
B T1 A 123
B T3 C 345
B T4 A 456
C T1 A 123
C T2 B 234
C T4 A 456
C T5 B 567
Here, if I know who I am (the follower) I can easily scan a time range of all the tweets I should see, and it'll almost always be in one region (with the rare exception where it happens to fall over the region server break, which would be handled transparently by the client of course). So if I'm user C, and I scan from T1 to T5, I'd get (A,123), (B,234), (A,456), and (B,567) back.
Of course, I then want to do a GET for the body of each tweet. You're asking, why not organize the tweet table in such a way as to make it more likely that these 4 tweets will be co-located?
The answer to that in the generic case is that you can't; the IDs of the tweets are completely orthogonal to the followers & followees, so you can't make any expectations about that kind of colocation. It looks easy with 3 users, but imagine there are hundreds of millions of users; what are the chances I follow people whose tweets happen to be co-located?
In this case, we could do a little better by making the tweet ID into a UUID with a time component, such that tweets around the same time would be grouped together in the table. The problem with this, of course, is that now all insert activity is going to "hot spot" one area of the table (i.e. one region server). So what you really want is sort of the opposite; you want all the incoming traffic to be nicely distributed so the write load is spread out. Thus, you actually probably don't want the tweets you'd see in a given page to be co-located on the same server, unless it's specifically YOUR copy of the tweets (which gets back to the denormalization question above).
As you can see, there's no "easy" answer here either. This is why you get paid the big bucks. :)
Yes, very true. We also haven't broached the subject of how long writes would take if you denormalize, and whether they should be asynchronous (for everyone, or just for the Wil Wheatons). If you put all of my incoming tweets into a single row for me, that's potentially lock-wait city. Twitter has a follow limit, maybe that's why. :)
Oh, do they?
However, as f