|
Lin Ma
2012-08-18, 07:12
Drew Dahlke
2012-08-20, 13:26
Lin Ma
2012-08-20, 16:09
Asif Ali
2012-08-20, 19:26
J Mohamed Zahoor
2012-08-21, 10:55
Lin Ma
2012-08-21, 13:28
Lin Ma
2012-08-21, 13:32
jmozah
2012-08-21, 14:45
Lin Ma
2012-08-21, 15:42
jmozah
2012-08-21, 15:56
Lin Ma
2012-08-21, 16:30
J Mohamed Zahoor
2012-08-22, 04:51
Lin Ma
2012-08-22, 12:11
Anoop Sam John
2012-08-22, 12:34
Lin Ma
2012-08-22, 13:28
Stack
2012-08-22, 15:28
anil gupta
2012-08-22, 16:57
Pamecha, Abhishek
2012-08-22, 17:28
Anoop Sam John
2012-08-23, 03:50
J Mohamed Zahoor
2012-08-23, 04:04
Pamecha, Abhishek
2012-08-23, 05:05
|
-
Using HBase serving to replace memcached
Lin Ma 2012-08-18, 07:12
Hello guys,

In your experience, is it practical to use HBase directly for serving? That is, to handle user traffic directly (on the scale of tens of thousands of QPS) behind Apache, taking over the role of memcached. Are there any known pitfalls in replacing memcached with HBase?

One issue I can think of is that, for a specific row range, only one active region server can handle the request. With memcached, we can set up several instances with duplicate content (all of them active) behind a VIP to serve the same purpose, which could give better performance and scalability.

Any advice or reference documents are appreciated. Thanks.

regards,
Lin
-
Re: Using HBase serving to replace memcached
Drew Dahlke 2012-08-20, 13:26
I'd say if the memcached model is working for you, stick with it.

HBase (currently) caches whole blocks. With cache blocks enabled you can achieve 10s of thousands of reqs/sec with a pretty small cluster. However, there's a catch. Once you reach the point where your tables are so large they can't all sit in memory at the same time, you'll see a behavior change. User traffic tends to be very random access which, with block caching, can cause a lot of thrashing with frequent cache evictions. We've seen this bring our cluster to its knees.

IMHO a better model is to persist things in HBase and then cache things with memcached, just as you would with any other data store. If you're looking for a spiffy memcached replacement I'd recommend checking out Redis.
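
The thrashing effect described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the LruBlockCache class, the block size of 100 rows, and the uniform random access pattern are all made up for illustration and are not HBase code or HBase defaults.

```python
import random
from collections import OrderedDict


class LruBlockCache:
    """Toy LRU cache keyed by block id (hypothetical, not the HBase block cache)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read_row(self, row, rows_per_block):
        # A row read always pulls in the whole block that contains the row.
        block_id = row // rows_per_block
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1  # simulated disk read
            self.blocks[block_id] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used


def hit_rate(total_rows, cache_blocks, rows_per_block=100, reads=20000, seed=1):
    """Fraction of uniformly random row reads served from the cache."""
    rng = random.Random(seed)
    cache = LruBlockCache(cache_blocks)
    for _ in range(reads):
        cache.read_row(rng.randrange(total_rows), rows_per_block)
    return cache.hits / reads


# 500 blocks of data against a 1,000-block cache: everything fits.
fits_in_cache = hit_rate(total_rows=50_000, cache_blocks=1_000)
# 50,000 blocks of data against the same cache: 50x more data than cache.
too_large = hit_rate(total_rows=5_000_000, cache_blocks=1_000)

print(f"hit rate, table fits in cache:     {fits_in_cache:.2f}")
print(f"hit rate, table 50x cache capacity: {too_large:.2f}")
```

Real user traffic is usually skewed rather than uniform, so actual hit rates would likely be better than the second case here; the sketch only shows why the behavior changes once the working set outgrows memory.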
-
Re: Using HBase serving to replace memcached
Lin Ma 2012-08-20, 16:09
Thank you Drew. I like your reply, especially the point about the block caching behavior of HBase. A quick question: with traditional memcached, all of the items are in memory and no disk is used, correct?

regards,
Lin
-
Re: Using HBase serving to replace memcached
Asif Ali 2012-08-20, 19:26
I've used memcached heavily in such scenarios and all such data is always in memory.

Memcached definitely is a great solution for this and scales very well. But keep in mind - it is not consistent. Which means there are some requests which will be handled incorrectly.

Memcached is great, but also look at Guava cache for similar use cases.

Asif Ali
-
Re: Using HBase serving to replace memcached
J Mohamed Zahoor 2012-08-21, 10:55
Again, if your data is so huge that it is much larger than the available RAM, you might want to rethink.
There are some configs in HBase that will help you in random read scenarios... like Bloom filters etc.
Also more client connections is one more issue that might infest you... where connection pooling or asynchbase will help you.

./Zahoor
-
Re: Using HBase serving to replace memcached
Lin Ma 2012-08-21, 13:28
Thanks Asif,

Regarding your comment, "Which means there are some requests which will be handled incorrectly.", could you show me an example of what you mean by "handled incorrectly"?

regards,
Lin
-
Re: Using HBase serving to replace memcached
Lin Ma 2012-08-21, 13:32
Thanks for the reply, Zahoor.

Some more comments,

1. I know only the basics of Bloom filters, which are used to detect whether an item is in a set. How are Bloom filters used in HBase to improve random read performance? Could you show me an example? Thanks.
2. "Also more client connections is one more issue that might infest you" -- supposing I am doing random reads from a Hadoop job to access HBase, do you mean that using multiple client connections from the Hadoop job is good or not good? Sorry, I am a bit lost. :-)
3. "asynchbase will help you" -- does HBase support an asynchronous API? Sorry, I cannot find it. I would appreciate it if you could point me to the APIs you are referring to.

regards,
Lin
-
Re: Using HBase serving to replace memcached
jmozah 2012-08-21, 14:45
> 1. I know only the basics of Bloom filters, which are used to detect whether an item is in a set. How are Bloom filters used in HBase to improve random read performance? Could you show me an example? Thanks.

This will help omit loading the blocks (thereby saving IO and cache churn) which do not have the given row.
For more on bloom, see
1 - https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf
2 - http://www.quora.com/How-are-bloom-filters-used-in-HBase

> 2. "Also more client connections is one more issue that might infest you" -- supposing I am doing random reads from a Hadoop job to access HBase, do you mean that using multiple client connections from the Hadoop job is good or not good? Sorry, I am a bit lost. :-)

One Hadoop job doing random reads is perfectly fine. But since you said "Handling directly user traffic"... I assumed you wanted to expose HBase independently to every client request, thereby having as many connections as the number of simultaneous req..

> 3. "asynchbase will help you" -- does HBase support an asynchronous API? Sorry, I cannot find it. I would appreciate it if you could point me to the APIs you are referring to.

Not the default HTable API. asynchbase is another client for HBase. Read more about asynchbase here (https://github.com/stumbleupon/asynchbase)
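
For readers who, like Lin, know Bloom filters only in outline, here is a minimal sketch of the data structure itself. It is a generic textbook Bloom filter, not HBase's implementation; the BloomFilter class, the bit-array size, the number of hashes, and the row-key format are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter sketch (hypothetical; not HBase's own class)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for i in range(100):
    bloom.add(f"row-{i:03d}")

# Keys that were added are always reported as possibly present.
present_ok = all(bloom.might_contain(f"row-{i:03d}") for i in range(100))

# Keys that were never added are usually, but not always, rejected.
false_positives = sum(bloom.might_contain(f"other-{i}") for i in range(1000))

print("all added keys found:", present_ok)
print("false positives out of 1000 absent keys:", false_positives)
```

The useful property for a store like HBase is the one-sided error: a "no" answer is always correct, so a file can be skipped without any disk read, while a "maybe" answer still requires loading a block to check.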
-
Re: Using HBase serving to replace memcached
Lin Ma 2012-08-21, 15:42
Thank you Zahoor,

Two more comments,

1. After reading the materials you sent me, I am confused about how a Bloom filter could save I/O during random reads. Suppose I am not using a Bloom filter. To find whether a row (or row key) exists, we need to search the index block, which is at the end of an HFile. The search is in memory (I think the index block is always in memory, please feel free to correct me if I am wrong) and uses binary search, so it should be pretty fast. With a Bloom filter, we could be a bit faster by looking up the Bloom filter bit vector in memory. Since both the index block binary search and the Bloom filter bit vector lookup are done in memory (no I/O is involved), what kind of I/O is saved? :-)

2.

> One Hadoop job doing random reads is perfectly fine. But since you said "Handling directly user traffic"... I assumed you wanted to expose HBase independently to every client request, thereby having as many connections as the number of simultaneous req..

Sorry, I need to confirm this point again. I think you mean that establishing a new connection for each request is not good, and that using a connection pool or asynchronous I/O is preferred?

regards,
Lin
-
Re: Using HBase serving to replace memcached
jmozah 2012-08-21, 15:56
> 1. After reading the materials you sent me, I am confused about how a Bloom filter could save I/O during random reads. [...] Since both the index block binary search and the Bloom filter bit vector lookup are done in memory (no I/O is involved), what kind of I/O is saved? :-)

If bloom says the row *may* be present, the block is loaded, otherwise not.
If there is no bloom... you have to load every block and scan to find if the row exists..

This may incur more IO.

> 2.
> Sorry, I need to confirm this point again. I think you mean that establishing a new connection for each request is not good, and that using a connection pool or asynchronous I/O is preferred?

Yes.
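
The connection-pooling point confirmed above can be sketched as follows. This is an illustration only: FakeConnection and ConnectionPool are hypothetical stand-ins written for this example, not HBase or asynchbase classes, and the pool size of 4 is arbitrary.

```python
import queue
import threading


class FakeConnection:
    """Stand-in for an expensive-to-create client connection (hypothetical)."""

    created = 0
    _lock = threading.Lock()

    def __init__(self):
        with FakeConnection._lock:
            FakeConnection.created += 1

    def get(self, row_key):
        return f"value-for-{row_key}"


class ConnectionPool:
    """Fixed-size pool: requests borrow a connection and give it back."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(FakeConnection())

    def get(self, row_key):
        conn = self._idle.get()  # blocks until a connection is free
        try:
            return conn.get(row_key)
        finally:
            self._idle.put(conn)  # always return the connection to the pool


pool = ConnectionPool(size=4)
results = []
results_lock = threading.Lock()


def handle_request(i):
    # One simulated user request; it reuses a pooled connection.
    value = pool.get(f"row-{i}")
    with results_lock:
        results.append(value)


threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("requests served:", len(results))
print("connections created:", FakeConnection.created)
```

A hundred simulated requests share four connections here, instead of opening one connection per request.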
-
Re: Using HBase serving to replace memcached
Lin Ma 2012-08-21, 16:30
Thanks Zahoor,

> If there is no bloom... you have to load every block and scan to find if the row exists..

I could be wrong. I think the HFile index block (which is located at the end of the HFile) is a binary search tree containing all row-key values of the HFile. Searching for a specific row key in the binary search tree could easily determine whether the row key exists (some node in the tree has the same row-key value) or not. Why do we need to load every block to find if the row exists?

regards,
Lin
-
Re: Using HBase serving to replace memcached
J Mohamed Zahoor 2012-08-22, 04:51
> I could be wrong. I think the HFile index block (which is located at the end of the HFile) is a binary search tree containing all row-key values of the HFile. Searching for a specific row key in the binary search tree could easily determine whether the row key exists (some node in the tree has the same row-key value) or not. Why do we need to load every block to find if the row exists?

Hmm...
It is a multilevel index. Only the root indexes (Data, Meta etc.) are loaded when a region is opened. The rest of the tree (intermediate and leaf indexes) are present in each block level.
I am assuming HFile v2 here for the discussion.
Read this for more clarity: http://hbase.apache.org/book/apes03.html

Nice discussion. You made me read a lot of things. :-)
Now I will dig into the code and check this out.

./Zahoor
-
Re: Using HBase serving to replace memcached
Lin Ma 2012-08-22, 12:11
Thanks Zahoor,

I read through the document you referred to, and I am confused about what leaf-level index, intermediate-level index and root-level index mean. I would appreciate it if you could give more details about what they are, or point me to related documents.

BTW: the document you pointed me to is very good; however, I am missing some basic background on the 3 terms I mentioned above. :-)

regards,
Lin
-
RE: Using HBase serving to replace memcached
Anoop Sam John 2012-08-22, 12:34
> I could be wrong. I think the HFile index block (which is located at the end of the HFile) is a binary search tree containing all row-key values of the HFile. Searching for a specific row key in the binary search tree could easily determine whether the row key exists (some node in the tree has the same row-key value) or not. Why do we need to load every block to find if the row exists?

I think there is some confusion here regarding the blooms and the block index. I will try to clarify this point.

The block index is there with every HFile. Within an HFile the data is written as multiple blocks, and HBase reads data from the HDFS layer only block by block. The block index contains information about the blocks within that HFile: the start and end row keys that reside in each block, and block information such as the block's offset and length. When a request comes in to get row key 'x', all the HFiles within that region need to be checked. [The KV can be present in any of the HFiles.] To know which block within an HFile may hold this row, the block index is used. The block index is always in memory. This lookup tells us only the possible block in which the row may be present. HBase will load that block and read through it to get the row we are interested in.

The bloom has information about each and every row added to that HFile. [The block index won't have info about each and every row.] The bloom information is also always in memory. So when a read request for row 'x' comes to an HFile, the bloom is checked first to see whether this row is in the file or not. If it is not there, as per the bloom, no block at all will be fetched. But if bloom is not enabled, we might find one block with a row range such that 'x' falls within it, and HBase will load that block. So using blooms can avoid this IO. Hope this is clear for you now.

-Anoop-
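
The read path described above (bloom check first, then block index lookup, then one block load) can be sketched in a few lines. Everything here is a simplification for illustration: ToyHFile is a hypothetical class, the index keeps only each block's start key, and an exact Python set stands in for the Bloom filter, so unlike a real Bloom filter it never gives a false positive.

```python
import bisect


class ToyHFile:
    """Toy model of one HFile: sorted rows split into fixed-size blocks."""

    def __init__(self, sorted_rows, rows_per_block=4, use_bloom=True):
        self.blocks = [
            sorted_rows[i:i + rows_per_block]
            for i in range(0, len(sorted_rows), rows_per_block)
        ]
        # In-memory block index: one entry per block, not one per row.
        self.block_start_keys = [block[0] for block in self.blocks]
        # Assumption: an exact set stands in for the per-row Bloom filter.
        self.bloom = set(sorted_rows) if use_bloom else None
        self.blocks_loaded = 0

    def contains(self, row):
        # Step 1: the bloom can rule the whole file out without any I/O.
        if self.bloom is not None and row not in self.bloom:
            return False
        # Step 2: the block index names the only block that could hold the row.
        idx = bisect.bisect_right(self.block_start_keys, row) - 1
        if idx < 0:
            return False  # row sorts before the first block of this file
        # Step 3: load that block (simulated disk read) and search it.
        self.blocks_loaded += 1
        return row in self.blocks[idx]


rows = [f"row-{i:03d}" for i in range(0, 100, 2)]      # even rows exist
absent = [f"row-{i:03d}" for i in range(1, 100, 2)]    # odd rows do not

with_bloom = ToyHFile(rows, use_bloom=True)
without_bloom = ToyHFile(rows, use_bloom=False)

for row in absent:
    with_bloom.contains(row)
    without_bloom.contains(row)

print("blocks loaded with bloom:   ", with_bloom.blocks_loaded)
print("blocks loaded without bloom:", without_bloom.blocks_loaded)
```

Each absent row falls inside some block's key range, so without the bloom the index still points at a candidate block that has to be loaded just to find out the row is not there.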
-
Re: Using HBase serving to replace memcachedLin Ma 2012-08-22, 13:28
Thanks Anoop,
My question is answered. Are you writing the related part of the code in HBase? From your detailed and knowledgeable description, you seem to be the author. :-)

regards,
Lin

On Wed, Aug 22, 2012 at 8:34 PM, Anoop Sam John <[EMAIL PROTECTED]> wrote:

> > I could be wrong. I think HFile index block (which is located at the end
> > of HFile) is a binary search tree containing all row-key values (of the
> > HFile) in the binary search tree. Searching a specific row-key in the
> > binary search tree could easily find whether a row-key exists (some node in
> > the tree has the same row-key value) or not. Why we need load every block
> > to find if the row exists?
>
> I think there is some confusion with you people regarding the blooms and
> the block index. I will try to clarify this point.
>
> Block index will be there with every HFile. Within an HFile the data will
> be written as multiple blocks. While reading data block by block only HBase
> read data from the HDFS layer. The block index contains the information
> regarding the blocks within that HFile. The information include the start
> and end rowkeys which resides in that particular block and the block
> information like offset of that block and its length etc. Now when a
> request comes for getting a rowkey 'x' all the HFiles within that region
> need to be checked. [KV can be present in any of the HFile] Now in order to
> know this row will be present in which block within an HFile, this block
> index will be used. Well this block index will be there in memory always.
> This lookup will tell only the possible block in which the row is present.
> HBase will load that block and will read through it to get the row which we
> are interested in now.
>
> Bloom is like it will have information about each and every row added into
> that HFile [Block index won't have info about each and every row]. This bloom
> information will be there in memory always. So when a read request to get
> row 'x' in an HFile comes, 1st the bloom is checked whether this row is
> there in this file or not. If this is not there, as per the bloom, no block
> at all will be fetched. But if bloom is not enabled, we might find one
> block which is having a row range such that 'x' comes in between and HBase
> will load that block. So usage of blooms can avoid this IO. Hope this is
> clear for you now.
>
> -Anoop-
> ________________________________________
> From: Lin Ma [[EMAIL PROTECTED]]
> Sent: Wednesday, August 22, 2012 5:41 PM
> To: J Mohamed Zahoor; [EMAIL PROTECTED]
> Subject: Re: Using HBase serving to replace memcached
>
> Thanks Zahoor,
>
> I read through the document you referred to, I am confused about what means
> leaf-level index, intermediate-level index and root-level index. It is
> appreciate if you could give more details what they are, or point me to the
> related documents.
>
> BTW: the document you pointed me is very good, however I miss some basic
> background of 3 terms I mentioned above. :-)
>
> regards,
> Lin
>
> On Wed, Aug 22, 2012 at 12:51 PM, J Mohamed Zahoor <[EMAIL PROTECTED]>
> wrote:
>
> > Hmm...
> > It is a multilevel index. Only the root Index's (Data, Meta etc) are
> > loaded when a region is opened. The rest of the tree (intermediate and
> > leaf index's) are present in each block level.
> > I am assuming a HFile v2 here for the discussion.
> > Read this for more clarity http://hbase.apache.org/book/apes03.html
> >
> > Nice discussion. You made me read lot of things. :-)
> > Now i will dig in to the code and check this out.
> >
> > ./Zahoor
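
For readers following along, the lookup path in the quoted explanation can be sketched as a toy model. This is only an illustration, not HBase code: `ToyHFile`, `ToyBloom`, `rows_per_block` and the SHA-256 based hashing are all made-up names and choices, and the index here keeps just the first row key of each block, which is a simplification of the start/end keys mentioned above.

```python
import bisect
import hashlib


class ToyBloom:
    """Tiny bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class ToyHFile:
    """Sorted row keys split into blocks, plus an in-memory block index."""

    def __init__(self, row_keys, rows_per_block=3, use_bloom=True):
        keys = sorted(row_keys)
        self.blocks = [keys[i:i + rows_per_block]
                       for i in range(0, len(keys), rows_per_block)]
        # Block index (assumption): first row key of each block, in memory.
        self.first_keys = [block[0] for block in self.blocks]
        self.bloom = None
        if use_bloom:
            self.bloom = ToyBloom()
            for key in keys:
                self.bloom.add(key)
        self.blocks_loaded = 0  # counts simulated block reads

    def get(self, row_key):
        if self.bloom is not None and not self.bloom.might_contain(row_key):
            return False  # bloom says "definitely absent": no block is read
        # The index only names the one block that could hold the key.
        idx = bisect.bisect_right(self.first_keys, row_key) - 1
        if idx < 0:
            return False  # key sorts before the first key in the file
        self.blocks_loaded += 1  # simulated block read
        return row_key in self.blocks[idx]


keys = [f"row{i:02d}" for i in range(0, 20, 2)]  # row00, row02, ..., row18
with_bloom = ToyHFile(keys, use_bloom=True)
without_bloom = ToyHFile(keys, use_bloom=False)
for hfile in (with_bloom, without_bloom):
    hfile.get("row05")  # absent, but inside the first block's key range
print("block reads with bloom:", with_bloom.blocks_loaded)
print("block reads without bloom:", without_bloom.blocks_loaded)
```

The point of the comparison is the miss case: the index alone cannot tell that `row05` is absent, so one block gets read, while the bloom can usually rule the file out without touching any block.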
-
Re: Using HBase serving to replace memcached Stack 2012-08-22, 15:28
On Wed, Aug 22, 2012 at 6:28 AM, Lin Ma <[EMAIL PROTECTED]> wrote:
> Thanks Anoop,
>
> My question is answered. Are you writing the related part of the code in HBase?
> From your detailed and knowledgeable description, you seem to be the
> author. :-)
>

Anoop did not write that particular piece of code. He has, though, made many other high-calibre contributions to the HBase code base.

St.Ack
-
Re: Using HBase serving to replace memcached anil gupta 2012-08-22, 16:57
Nice explanation, Anoop. This deserves to be part of the HBase wiki.
--
Thanks & Regards,
Anil Gupta
-
RE: Using HBase serving to replace memcached Pamecha, Abhishek 2012-08-22, 17:28
Great explanation. This may be diverging from the thread's original question, but could you also explain the difference, if any, between searching for a rowkey [as you described below] and searching for a specific column qualifier? Are there any optimizations for column qualifier search too, or does one just need to load all blocks that match the rowkey criteria and then scan each of them from start to end?
Thanks,
Abhishek
-
RE: Using HBase serving to replace memcached Anoop Sam John 2012-08-23, 03:50
> Are there any optimizations for column qualifier search too or that one just needs to load all blocks that match the rowkey criteria and then scan each one of them from start to end?
With blooms there is an optimization for searching for a particular column qualifier as well. A bloom can be of type ROW or ROWCOL. With the ROWCOL type, what is added to the bloom is the presence of a particular column qualifier in a row, rather than just the row key.

-Anoop-
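
As a rough illustration of the ROW versus ROWCOL distinction, the sketch below shows which keys each bloom type would record and how that changes the answer for a get on a qualifier that does not exist. It is not HBase code: `bloom_keys` and `must_read_file` are hypothetical helpers, and a plain Python set stands in for the bloom filter, so there are no false positives here.

```python
def bloom_keys(cells, bloom_type):
    """Return the keys a bloom of the given type would record for one file.

    cells is an iterable of (row, qualifier) pairs. A plain set stands in
    for the bloom filter, so this sketch never gives a false positive.
    """
    if bloom_type == "ROW":
        return {row for row, _ in cells}
    if bloom_type == "ROWCOL":
        return {(row, qualifier) for row, qualifier in cells}
    raise ValueError(f"unknown bloom type: {bloom_type}")


def must_read_file(cells, bloom_type, row, qualifier):
    """True if the file cannot be ruled out for a get of (row, qualifier)."""
    keys = bloom_keys(cells, bloom_type)
    # A ROW bloom can only be probed with the row key; ROWCOL uses both parts.
    probe = row if bloom_type == "ROW" else (row, qualifier)
    return probe in keys


cells = [("r1", "a"), ("r1", "b"), ("r2", "a")]
print(must_read_file(cells, "ROW", "r1", "c"))
print(must_read_file(cells, "ROWCOL", "r1", "c"))
```

With a ROW bloom the file still has to be read for `("r1", "c")`, because row `r1` is present; with a ROWCOL bloom the same lookup can be skipped, since that exact row and qualifier pair was never added.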
-
Re: Using HBase serving to replace memcached J Mohamed Zahoor 2012-08-23, 04:04
If you need to search by row and column qualifier, you can pick a row+col bloom to help you skip blocks.
./Zahoor@iPad
-
Re: Using HBase serving to replace memcached Pamecha, Abhishek 2012-08-23, 05:05
Thanks all..
Sent from my iPad with iMstakes