-
how does hbase get the latest version with immutable hfiles?
S Ahmed 2012-06-01, 20:27
(reference: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)

A row consists of a key and column families, along with a timestamp.

So for example:

key = com.example.com/some/path

cf: outboundlinks {
  com.example.com/link1,
  com.example.com/link2,
  ..
}

Data is stored like this:

Region Server -> Store -> StoreFile -> HFile

Now when a client requests a particular key, the HMaster figures out which region server holds the data, this information is returned to the client (which saves it locally), and then the client makes a request to the region server.

Now since the actual data files are immutable, if you modify a particular value in a CF, the old value is tombstoned (I'm not sure how that works, but I understand it at a high level).

So if I make a request for a given key, going with the example above, a particular URL on the website example.com, and I want all the outbound links, I reference the column family "outboundlinks", which can store millions of URLs.

What process/service/class is in charge of assembling the various files to get all the correct data?

Summary of my question:
What I am trying to understand is this: if a particular CF has millions of values and a single value is mutated, a new file has to be created. So if I query for that value, i.e. it is included in my result set, how does HBase know where to look for the latest data?

So basically, from what I understand, a get request for a particular key and CF will potentially have to look at more than one StoreFile (or HFile?), correct?
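To make the layout in the question concrete, here is a minimal Python sketch of cells addressed by row key, column family, qualifier, and timestamp. It is an illustration only, not HBase code: the `Cell` type, the `sort_key` and `latest` helpers, the timestamps, and the "link1-v2" value are all made up for this example. It assumes cells are ordered by row, family, and qualifier, with newer timestamps first, so a later write adds a new cell rather than changing an old one.

```python
from typing import List, NamedTuple, Optional


class Cell(NamedTuple):
    """One versioned value (hypothetical type for illustration)."""
    row: str
    family: str
    qualifier: str
    timestamp: int
    value: str


def sort_key(cell: Cell):
    # Ascending by row, family, qualifier; newest timestamp first.
    return (cell.row, cell.family, cell.qualifier, -cell.timestamp)


ROW = "com.example.com/some/path"

cells = [
    Cell(ROW, "outboundlinks", "link1", 100, "com.example.com/link1"),
    Cell(ROW, "outboundlinks", "link2", 100, "com.example.com/link2"),
    # A later write to link1 adds a new cell; the older one is left untouched.
    Cell(ROW, "outboundlinks", "link1", 200, "com.example.com/link1-v2"),
]

ordered = sorted(cells, key=sort_key)


def latest(ordered_cells: List[Cell], row: str, family: str,
           qualifier: str) -> Optional[Cell]:
    """Return the newest cell for one column, or None if there is none."""
    for cell in ordered_cells:
        if (cell.row, cell.family, cell.qualifier) == (row, family, qualifier):
            return cell  # the first match is the newest version
    return None


newest = latest(ordered, ROW, "outboundlinks", "link1")
```

With this ordering, "the latest version" is simply the first cell found for a given row, family, and qualifier; nothing already written has to be rewritten.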
-
Re: how does hbase get the latest version with immutable hfiles?
Doug Meil 2012-06-02, 13:13
Hi there, I think you probably want to look at this...

HBase catalog metadata...
http://hbase.apache.org/book.html#arch.catalog

How data is stored internally...
http://hbase.apache.org/book.html#regions.arch

Lots of versioning description here...
http://hbase.apache.org/book.html#datamodel

Long story short, the client talks directly to RegionServers, and HBase looks at multiple StoreFiles.
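As a rough illustration of "looks at multiple StoreFiles", the sketch below merges several immutable, already-sorted sources (an in-memory buffer plus two files) and keeps the newest cell per column. It is a simplified model, not the HBase implementation: the names `merged_get`, `sort_key`, `memstore`, `newer_file`, and `older_file` are hypothetical, a value of `None` stands in for a delete marker, and real delete markers have more detailed semantics than shown here.

```python
import heapq
from typing import Dict, List, Optional, Tuple

# (row, qualifier, timestamp, value); a value of None marks a delete.
Cell = Tuple[str, str, int, Optional[str]]


def sort_key(cell: Cell):
    row, qualifier, timestamp, _ = cell
    # Newest timestamp first within each (row, qualifier).
    return (row, qualifier, -timestamp)


def merged_get(sources: List[List[Cell]], row: str) -> Dict[str, str]:
    """Return the newest live value per qualifier for one row."""
    # Every source is already sorted by sort_key, so a k-way merge is enough.
    merged = heapq.merge(*sources, key=sort_key)
    result: Dict[str, str] = {}
    seen = set()
    for cell_row, qualifier, _, value in merged:
        if cell_row != row or qualifier in seen:
            continue
        seen.add(qualifier)  # the first cell per qualifier is the newest
        if value is not None:  # a delete marker hides older versions
            result[qualifier] = value
    return result


older_file = sorted([
    ("row1", "link1", 100, "a"),
    ("row1", "link2", 100, "b"),
    ("row1", "link3", 100, "c"),
], key=sort_key)

# Written later: one updated value and one delete marker.
newer_file = sorted([
    ("row1", "link1", 200, "a2"),
    ("row1", "link3", 150, None),
], key=sort_key)

memstore = sorted([("row1", "link4", 300, "d")], key=sort_key)

view = merged_get([memstore, newer_file, older_file], "row1")
```

In this model no file is ever modified. The update to `link1` and the delete of `link3` live in the newer file, and the read path reconciles them with the older file at query time.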
-
Re: how does hbase get the latest version with immutable hfiles?
Elliott Clark 2012-06-02, 18:18
If you want to get into the really nitty-gritty, I found Lars' presentation really insightful.

http://www.hbasecon.com/sessions/learning-hbase-internals/
-
Re: how does hbase get the latest version with immutable hfiles?
S Ahmed 2012-06-03, 19:21
Elliott,

Is there a video or slides? I guess I have to register to view it?
-
Re: how does hbase get the latest version with immutable hfiles?
Elliott Clark 2012-06-04, 01:37
There are slides. I think you have to register with an email address and first/last name to download the PPT.
-
Re: how does hbase get the latest version with immutable hfiles?
S Ahmed 2012-06-04, 18:36
Once HBase has identified the file that contains the row key, what algorithm is used? I understand that keys are ordered lexically. And are files ordered using quicksort?
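One way to picture the lookup asked about here is the sketch below. It assumes that writes are kept in sorted order in memory and flushed to a file in that same order, so the file itself never needs a separate sort, and that a small index of each block's first key lets a reader binary-search for the right block. The `SortedFile` class, the block size, and the row keys are hypothetical and chosen only for illustration; this is not the HFile format.

```python
import bisect
from typing import List, Optional, Tuple


class SortedFile:
    """Toy immutable file: sorted (key, value) pairs split into blocks."""

    def __init__(self, sorted_items: List[Tuple[str, str]], block_size: int = 2):
        self.blocks = [
            sorted_items[i:i + block_size]
            for i in range(0, len(sorted_items), block_size)
        ]
        # Index holds the first key of every block.
        self.index = [block[0][0] for block in self.blocks]

    def get(self, key: str) -> Optional[str]:
        # Binary search for the last block whose first key is <= key.
        pos = bisect.bisect_right(self.index, key) - 1
        if pos < 0:
            return None  # key sorts before everything in the file
        for k, v in self.blocks[pos]:  # short scan inside one block
            if k == key:
                return v
        return None


# Writes arrive in arbitrary order; the buffer is kept sorted on insert.
buffer: List[Tuple[str, str]] = []
for key, value in [("row-c", "3"), ("row-a", "1"), ("row-e", "5"),
                   ("row-b", "2"), ("row-d", "4")]:
    bisect.insort(buffer, (key, value))

# Flushing writes the buffer out as-is, already in key order.
flushed = SortedFile(buffer)
found = flushed.get("row-d")
```

Under these assumptions the cost of a point lookup is a binary search over the index plus a scan of one block, regardless of how many keys the file holds.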