Re: how does hbase get the latest version with immutable hfiles?

Hi there, I think you probably want to look at this...

HBase catalog metadata...

http://hbase.apache.org/book.html#arch.catalog

How data is stored internally...

http://hbase.apache.org/book.html#regions.arch

Lots of versioning description here...

http://hbase.apache.org/book.html#datamodel

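Long story short: the client talks directly to the RegionServers, and on a
read HBase looks at multiple StoreFiles (plus the in-memory MemStore) and
returns the version with the newest timestamp.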
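
To make the "looks at multiple StoreFiles" part concrete, here is a rough
sketch in Python. It is not HBase code: Cell, sort_key, merged_scan and
get_latest are names made up for illustration, and a tombstone is modeled
simply as a cell whose value is None. It assumes every source (the MemStore
and each StoreFile) is already sorted by row, column, and newest timestamp
first, so a merge across the sources yields the newest version of a column
before any older one.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple, Optional


class Cell(NamedTuple):
    """Hypothetical stand-in for one stored key/value."""
    row: str
    column: str
    timestamp: int
    value: Optional[str]  # None models a delete marker (tombstone)


def sort_key(cell: Cell):
    # Order by row, then column, then newest timestamp first.
    return (cell.row, cell.column, -cell.timestamp)


def merged_scan(sources: Iterable[Iterable[Cell]]) -> Iterator[Cell]:
    # Each source must already be sorted by sort_key; the files are never
    # rewritten here, only read and merged.
    return heapq.merge(*sources, key=sort_key)


def get_latest(sources, row: str, column: str) -> Optional[str]:
    # The first matching cell in merged order is the newest version.
    # If that cell is a tombstone, the column counts as deleted.
    for cell in merged_scan(sources):
        if cell.row == row and cell.column == column:
            return cell.value
    return None


# Example: one older file, one newer file, and the in-memory store.
old_file = [Cell("r1", "cf:a", 10, "v1"), Cell("r1", "cf:b", 10, "x")]
newer_file = [Cell("r1", "cf:a", 20, "v2"), Cell("r1", "cf:b", 25, None)]
memstore = [Cell("r1", "cf:a", 30, "v3")]

latest_a = get_latest([memstore, old_file, newer_file], "r1", "cf:a")
latest_b = get_latest([memstore, old_file, newer_file], "r1", "cf:b")
```

The sketch scans linearly and treats a delete far more simply than HBase
does; the point is only that a mutation adds a newer cell somewhere else and
the read path picks the winner by timestamp.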

On 6/1/12 4:27 PM, "S Ahmed" <[EMAIL PROTECTED]> wrote:

>(reference:
>http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
>
>A row consists of a key, and column families, along with a timestamp.
>
>So for example:
>
>key = com.example.com/some/path
>
>cf: outboundlinks {
>      com.example.com/link1,
>      com.example.com/link2,
>     ..
>}
>
>Data is stored like this:
>
>Region Server -> Store -> StoreFile -> HFile
>
>Now when a client requests a particular key, the hmaster figures out which
>region server holds the data, this information is returned to the client
>(which saves it locally), and then it makes a request to the region
>server.
>
>Now since the actual data files are immutable, if you modify a particular
>value in a CF, it is tombstoned (not sure how that works but understand
>it at a high level).
>
>So if I make a request for a given key, going with the example above, a
>particular url on the website example.com, and I want all the
>outboundlinks, I reference the column family "outboundlinks" which can
>store millions of urls.
>
>What process/service/class is in charge of assembling the various files to
>get all the correct data?
>
>Summary of my question:
>What I am trying to understand is, if a particular CF has millions of
>values, and if a single value is mutated, a new file has to be created.
>So this means, if I query for that value, i.e. it is included in my
>result set, how does hbase know where to look for the latest data?
>
>So basically from what I understand, making a get request for a particular
>key, cf will have to potentially look at more than one StoreFile (or
>HFile?) correct?