Re: How to Rank in HBase?
Ian Varley 2012-01-30, 06:36
HBase uses an approach to structuring its storage known as "Log Structured Merge Trees", which you can learn more about here:
As well as in Lars George's great book, here:
It applies all of these "frequent updates" in memory, which is very fast; at the same time, it writes a simple forward-only log of all edits (known as the Write Ahead Log, or WAL) to disk in order to provide durability in the event of machine failure. It periodically writes the in-memory data to disk in big immutable ordered chunks, called "store files", which is very efficient because the writes are sequential. Future reads of the data then "merge" the on-disk store file data with the current state in memory, to get the full picture of the state of any row. Over time, the many small store files get "compacted" into bigger files, so that individual reads don't have too many files to read from. Each "get" or "scan" operation only needs to read small blocks of the store files; when you ask for one record, it doesn't have to read gigabytes of data from the disk. As such, random small reads and writes on a very big data set can be done efficiently.
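To make the paragraph above concrete, here is a toy sketch of that write and read path in Python. It is not HBase's code: the class name ToyLSMStore, its methods, and the flush_threshold parameter are all made up for illustration, and the "disk" is just Python lists.

```python
import bisect


class ToyLSMStore:
    """Toy model of the log-structured merge idea; NOT HBase's implementation."""

    def __init__(self, flush_threshold=3):
        self.wal = []          # append-only log of (key, value) edits
        self.memstore = {}     # in-memory state: latest value per key
        self.store_files = []  # immutable sorted lists of (key, value), oldest first
        self.flush_threshold = flush_threshold  # hypothetical knob for this sketch

    def put(self, key, value):
        self.wal.append((key, value))  # sequential append for durability
        self.memstore[key] = value     # the actual update happens in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memstore out as one immutable, key-ordered store file."""
        if self.memstore:
            self.store_files.append(sorted(self.memstore.items()))
            self.memstore = {}
            self.wal = []  # flushed edits no longer need the log

    def get(self, key):
        """Merge view: memory first, then store files from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):
            # Rebuilding the key list keeps the sketch short; a real store
            # file would keep an index so only a small block is read.
            keys = [k for k, _ in store_file]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return store_file[i][1]
        return None

    def compact(self):
        """Merge all store files into one; newer values win."""
        merged = {}
        for store_file in self.store_files:  # oldest first
            merged.update(store_file)
        self.store_files = [sorted(merged.items())] if merged else []
```

A put touches only the log and the in-memory dict; a get checks memory and then the store files, newest first; compact reduces how many files a get has to look at.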
Furthermore, it's fine to update the data store frequently. For any given record, you can make as many updates as you want to the in-memory structures, and these aren't written to the store files on disk until the memory store is flushed (each edit is also appended to the WAL, but that's efficient too b/c the log is ordered by update time, not record key). It all happens in memory, which is very fast (but, again, it's safe b/c of the WAL). There are even some recent JIRAs that make that process more efficient, for example HBASE-4241<https://issues.apache.org/jira/browse/HBASE-4241>.
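As a rough illustration of why frequent updates to one record are cheap, the sketch below updates the same row 1000 times. The helper names apply_edit and replay are hypothetical, and plain Python objects stand in for the real WAL and memory store; this only models the idea, not HBase internals.

```python
def apply_edit(memstore, wal, key, value):
    """Record one update: append to the log, then overwrite in memory."""
    wal.append((key, value))  # ordered by update time, so a cheap sequential write
    memstore[key] = value     # only the latest value per key is kept in this toy


def replay(wal):
    """Rebuild the in-memory state from the log, e.g. after a crash."""
    memstore = {}
    for key, value in wal:
        memstore[key] = value
    return memstore


wal, memstore = [], {}
for n in range(1000):
    apply_edit(memstore, wal, "counter-row", n)

# The log has one entry per update, but the in-memory state (which is what
# a later flush would write out) holds a single entry for the row.
recovered = replay(wal)
```

If the in-memory state were lost before a flush, replaying the log in order would rebuild the same state, which is the sense in which the WAL makes in-memory updates safe.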
One way to think about it is that HBase is *precisely* a layer that adds these efficient random read/write capabilities on top of the Hadoop distributed file system (HDFS), and takes care of doing that in a way that parallelizes nicely across a large cluster of machines, deals with machine failures, etc.
On Jan 29, 2012, at 10:16 PM, Bing Li wrote:
Thanks so much for your reply!
As I understand it, a large-scale distributed system prefers a
write-once-read-many workload. Frequent updates must put a heavy load on
consistency handling, and performance must suffer. So HBase must not
be suitable for frequent updates, right?
On Mon, Jan 30, 2012 at 1:51 PM, Stack <[EMAIL PROTECTED]> wrote:
On Sun, Jan 29, 2012 at 12:02 PM, Bing Li <[EMAIL PROTECTED]> wrote:
Another question is whether it is proper to update data in HBase
This is 'normal', yes.