HBase >> mail # user >> how does hbase get the latest version with immutable hfiles?


how does hbase get the latest version with immutable hfiles?
(reference:
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)

A row consists of a key, and column families, along with a timestamp.

So for example:

key = com.example.com/some/path

cf: outboundlinks {
      com.example.com/link1,
      com.example.com/link2,
      ..
}
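
To make the example concrete, here is a rough Python sketch of the logical layout as I picture it: row key, then column family, then qualifier, then a set of timestamped versions. Storing each link as its own qualifier is my assumption, and the timestamps and values are made up for illustration.

```python
# Logical view only: row key -> column family -> qualifier -> {timestamp: value}.
# One qualifier per link is an assumption; timestamps and values are invented.
row = {
    "com.example.com/some/path": {
        "outboundlinks": {
            "com.example.com/link1": {1001: b""},
            "com.example.com/link2": {1001: b"", 1005: b"updated"},
        }
    }
}

def latest(versions):
    """Return the value stored under the highest timestamp."""
    return versions[max(versions)]

links = row["com.example.com/some/path"]["outboundlinks"]
print(latest(links["com.example.com/link2"]))
```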

Data is stored like this:

Region Server -> Store -> StoreFile -> HFile

Now when a client requests a particular key, the HMaster figures out which
region server holds the data. This information is returned to the client
(which caches it locally), and the client then makes a request to that
region server.

Now since the actual data files are immutable, if you modify a particular
value in a CF, the old value is tombstoned (I'm not sure exactly how that
works, but I understand it at a high level).

So if I make a request for a given key (going with the example above, a
particular URL on the website example.com) and I want all the outbound
links, I reference the column family "outboundlinks", which can store
millions of URLs.

What process/service/class is in charge of assembling the various files to
get all the correct data?

Summary of my question:
What I am trying to understand is this: if a particular CF has millions of
values and a single value is mutated, a new file has to be created. So if I
query for that value (i.e. it is included in my result set), how does HBase
know where to look for the latest data?

So basically, from what I understand, a get request for a particular key
and CF will potentially have to look at more than one StoreFile (or
HFile?), correct?
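
To illustrate what I mean, here is a minimal Python sketch of the read path as I imagine it. This is not HBase's actual code: `Cell`, `sort_key`, and `read_row` are names I made up, and a `None` value standing in for a tombstone is a simplification. The idea is that each file is sorted and never rewritten, and a read merges all of them and keeps the newest cell per column.

```python
import heapq
from typing import NamedTuple, Optional

class Cell(NamedTuple):
    row: str
    qualifier: str
    timestamp: int
    value: Optional[bytes]  # None stands in for a delete marker (simplified)

def sort_key(cell):
    # Cells for the same row/qualifier sort together, newest timestamp first.
    return (cell.row, cell.qualifier, -cell.timestamp)

def read_row(row, sources):
    """Merge sorted, immutable cell lists; keep the newest live cell per qualifier."""
    result = {}
    seen = set()
    for cell in heapq.merge(*sources, key=sort_key):
        if cell.row != row or cell.qualifier in seen:
            continue
        seen.add(cell.qualifier)    # first hit per qualifier is the newest version
        if cell.value is not None:  # newest version is a tombstone: hide the column
            result[cell.qualifier] = cell.value
    return result

# Two immutable "files" plus an in-memory buffer, each already sorted by sort_key.
older_file = [Cell("r1", "link1", 1, b"a"), Cell("r1", "link2", 1, b"b"),
              Cell("r1", "link3", 1, b"c")]
newer_file = [Cell("r1", "link2", 5, b"B"), Cell("r1", "link3", 6, None)]
in_memory = [Cell("r1", "link4", 9, b"d")]

print(read_row("r1", [older_file, newer_file, in_memory]))
```

In this sketch the mutated value for link2 lives only in the newer file, yet the read still returns it because the merge sees the higher timestamp first. Is that roughly what happens?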