HBase, mail # user - How to Rank in HBase?


Re: How to Rank in HBase?
Bing Li 2012-01-30, 06:52
Dear Ian,

Thank you so much for your detailed reply! I will read the book on
HBase.

Best regards,
Bing

On Mon, Jan 30, 2012 at 2:36 PM, Ian Varley <[EMAIL PROTECTED]> wrote:

> Bing,
>
> HBase uses an approach to structuring its storage known as "Log Structured
> Merge Trees", which you can learn more about here:
>
>
> http://scholar.google.com/scholar?q=log+structured+merge+tree&hl=en&as_sdt=0&as_vis=1&oi=scholart
>
> As well as in Lars George's great book, here:
>
> http://shop.oreilly.com/product/0636920014348.do
>
> It does all of these "frequent updates" just in memory, which is very
> fast; at the same time, it writes a simple forward-only log of all edits
> (known as the Write Ahead Log, or WAL) to disk in order to provide
> durability in the event of machine failure. It periodically writes the
> in-memory data to disk in big immutable ordered chunks, called "store
> files", which is very efficient. Future reads of the data then "merge" the
> on-disk store file data with the current state in memory, to get the full
> picture of the state of any row. Over time, the many small store files get
> "compacted" into bigger files, so that individual reads don't have too many
> files to read from. Each "get" or "scan" operation can just read small
> blocks of the store files; when you ask for one record, it doesn't have to
> read gigabytes of data from the disk, it can just read a small block. As
> such, random small reads and writes on a very big data set can be done
> efficiently.
>
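
A minimal sketch of the flush / merged-read / compaction cycle described above, in Python rather than HBase's own Java. This is a toy, not HBase's implementation: the class name, method names, and `flush_threshold` parameter are all hypothetical, store files are plain sorted lists held in memory instead of files on disk, and the WAL is left out for brevity.

```python
import bisect


class ToyLSMStore:
    """Toy log-structured merge store. All names here are hypothetical."""

    def __init__(self, flush_threshold=3):
        self.memstore = {}      # in-memory edits: newest value per key
        self.store_files = []   # immutable sorted (key, value) lists, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        # Writes only touch memory; repeated updates to a key overwrite in place.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memstore out as one immutable, key-ordered chunk.
        if self.memstore:
            self.store_files.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        # A read merges memory with the store files, newest data first.
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):
            # Binary search: only a small part of the sorted file is examined.
            i = bisect.bisect_left(store_file, (key,))
            if i < len(store_file) and store_file[i][0] == key:
                return store_file[i][1]
        return None

    def compact(self):
        # Merge all store files into one; newer files overwrite older values.
        merged = {}
        for store_file in self.store_files:
            merged.update(store_file)
        self.store_files = [sorted(merged.items())] if merged else []
```

With `flush_threshold=2`, two puts produce one store file, a later put to an existing key stays in memory, and `get` returns the in-memory value ahead of the older on-disk one. Calling `compact()` then leaves a single store file holding the newest value for each key.
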
> Furthermore, it's fine to update the data store frequently. For any given
> record, you can make as many updates as you want to the in-memory
> structures, and these aren't written to the store files on disk until the
> memory store is flushed (each edit is also appended to the WAL, but that's
> efficient too b/c the log is ordered by update time, not record key). It
> all happens in memory, which is very fast (but, again, it's safe b/c of
> the WAL). There are even some recent JIRAs that make that process more
> efficient; for example, HBASE-4241:
> https://issues.apache.org/jira/browse/HBASE-4241
>
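
A rough illustration of why frequent updates stay cheap yet safe: every edit is appended to a log in arrival order, while the in-memory structure keeps only the latest value per key. Everything here is an assumption for illustration (the names `ToyWalStore`, `crash`, and `recover` are hypothetical, and a Python list stands in for the on-disk log); it is not how HBase implements its WAL.

```python
class ToyWalStore:
    """Toy write-ahead log plus memstore. All names here are hypothetical."""

    def __init__(self):
        self.wal = []        # append-only (sequence_id, key, value), in update order
        self.memstore = {}   # latest value per key, held only in memory
        self.next_seq = 0

    def put(self, key, value):
        # Append to the log first (a sequential write, not ordered by key),
        # then apply the edit in memory.
        self.wal.append((self.next_seq, key, value))
        self.next_seq += 1
        self.memstore[key] = value

    def crash(self):
        # Simulate a machine failure: the in-memory state is lost, the log survives.
        self.memstore = {}

    def recover(self):
        # Replay the log in order; later edits overwrite earlier ones.
        for _seq, key, value in self.wal:
            self.memstore[key] = value
```

Three updates to the same row leave three log entries but a single in-memory entry, and replaying the log after `crash()` restores the latest value for every key.
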
> One way to think about it is that HBase is *precisely* a layer that adds
> these efficient random read/write capabilities on top of the Hadoop
> distributed file system (HDFS), and takes care of doing that in a way that
> parallelizes nicely across a large cluster of machines, deals with machine
> failures, etc.
>
> Ian
>
> On Jan 29, 2012, at 10:16 PM, Bing Li wrote:
>
> Dear Stack,
>
> Thanks so much for your reply!
>
> As I understand it, large-scale distributed systems generally prefer a
> write-once-read-many pattern. Frequent updates must put a heavy load on
> keeping the data consistent, and performance must suffer. So HBase is not
> well suited to frequent updates, right?
>
> Best regards,
> Bing
>
> On Mon, Jan 30, 2012 at 1:51 PM, Stack <[EMAIL PROTECTED]> wrote:
>
> On Sun, Jan 29, 2012 at 12:02 PM, Bing Li <[EMAIL PROTECTED]> wrote:
> Another question is whether it is proper to update data in HBase
> frequently?
>
>
> This is 'normal', yes.
> St.Ack