Murali Krishna. P
2010-09-02, 17:43
Andrey Stepachev
2010-09-02, 18:44
Murali Krishna. P
2010-09-03, 12:30
Samuru Jackson
2010-09-03, 12:54
Michael Segel
2010-09-03, 14:57
Murali Krishna. P
2010-09-04, 13:55
Samuru Jackson
2010-09-04, 19:03
Todd Lipcon
2010-09-04, 19:16
Andrey Stepachev
2010-09-04, 22:23
Samuru Jackson
2010-09-05, 00:57
Murali Krishna. P
2010-09-05, 05:17
Andrey Stepachev
2010-09-05, 18:12
Andrey Stepachev
2010-09-05, 18:24
Murali Krishna. P
2010-09-06, 05:02
Ted Yu
2010-09-06, 13:53
Murali Krishna. P
2010-09-06, 17:13
Andrey Stepachev
2010-09-06, 18:46
-
HBase secondary index performance
Murali Krishna. P 2010-09-02, 17:43
Hi,
I have an IndexedTable with indexes on around 20 columns. Write performance on the original table is around 60 rows per second. This is a single-node setup, and even with multiple parallel clients I get only 60 writes/second. Does that mean a total of 60 * 20 = 1200 writes/second because of the 20 index tables? That is not good enough for our application. Does 1200 look right? I was expecting around 15k.

I am using HBase 0.20.6 on Hadoop 0.20.2. Hardware: 8 GB RAM, 2 cores, 7.2k rpm disk. Will adding nodes increase write throughput linearly?

Thanks,
Murali Krishna
-
Re: HBase secondary index performance
Andrey Stepachev 2010-09-02, 18:44
First, check that your connection is not in autoflush mode.

Second, you could consider custom indexing instead of IndexedTable. In my experience custom indexing (especially if the data is not modified) is much more performant. The problem with IndexedTable is that on every insert HBase performs one (random) get, to check whether previously indexed data exists and to delete it if it does. Random gets run at around 100-400 requests per second per node, so the 60 you are getting looks reasonable (for IndexedTable).

You can read about how to build custom indexes here:
http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/
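
A minimal sketch of the read-before-write pattern described above, assuming two in-memory sorted maps as stand-ins for the main table and one index table. The class `ReadBeforeWriteIndex` and its fields are hypothetical names for illustration; this is not the IndexedTable source and does not use the HBase API.

```java
import java.util.TreeMap;

/** In-memory stand-in for a main table plus one index table (not the HBase API). */
class ReadBeforeWriteIndex {
    final TreeMap<String, String> mainTable = new TreeMap<>();   // rowKey -> value
    final TreeMap<String, String> indexTable = new TreeMap<>();  // "value:rowKey" -> rowKey
    int reads = 0;  // counts the lookups done only to maintain the index

    void put(String rowKey, String value) {
        // Extra read on every write: fetch the previous value so that
        // its index entry can be deleted before the new one is added.
        reads++;
        String old = mainTable.get(rowKey);
        if (old != null) {
            indexTable.remove(old + ":" + rowKey);
        }
        mainTable.put(rowKey, value);
        indexTable.put(value + ":" + rowKey, rowKey);
    }
}
```

Every `put` pays for one lookup even when the row is new, so in a real cluster the write rate would be bounded by the random-read rate rather than by the raw write rate.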
-
Re: HBase secondary index performance
Murali Krishna. P 2010-09-03, 12:30
Thanks Andrey,

* Setting autoflush to false and increasing the write buffer size to 12 MB improved writes to 100/s.
* Custom indexing is good, but our data keeps changing every day, so IndexedTable is probably the best option for us.
* I just added one more region server and it did not help. Throughput actually went back to 60/s for some strange reason (with one client). The request counts in the HBase UI are not uniform across the 2 region servers: one server is handling around 2000 and the other 500. Will writes improve once the region splits and we have lots of data? (Right now everything is written to one region for the main table.)
* Is there some way to bulk load the IndexedTable? I have used the bulk loader tool before (a MapReduce job that creates the regions offline), but I am not sure whether it works with an indexed table.

Thanks,
Murali Krishna
-
Re: HBase secondary index performance
Samuru Jackson 2010-09-03, 12:54
Hi,

I wrote my own indexer and I get pretty good performance. There are still known places where I could gain even more (I just don't have the time right now).

What is important is to load in bulk (in batches) when you are indexing something. I posted this before, but maybe you missed it:

I create a list of Puts out of those records:

    List<Put> pList = new ArrayList<Put>();

where each Put has WriteToWAL set to false:

    p.setWriteToWAL(false);
    pList.add(p);

Then I set autoflush to false and use a larger write buffer:

    hTable.setAutoFlush(false);
    hTable.setWriteBufferSize(1024*1024*12);
    hTable.put(pList);
    hTable.setAutoFlush(true);

The following settings boosted my load performance 5 times. Without any bigger performance tuning and with no special hardware configuration I achieve 8000-9000 records per second:

    p.setWriteToWAL(false);
    hTable.setAutoFlush(false);
    hTable.setWriteBufferSize(1024*1024*12);

/SJ
http://uncinuscloud.blogspot.com/
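
To illustrate why a larger client-side write buffer reduces the number of round trips, here is a toy sketch in plain Java. `ToyWriteBuffer` is a hypothetical class written for this note; it only counts simulated flushes and is not how HTable is implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy client-side write buffer; each flush stands in for one round trip to a server. */
class ToyWriteBuffer {
    private final long maxBytes;
    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;
    int flushes = 0;  // number of simulated round trips

    ToyWriteBuffer(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    void put(byte[] record) {
        pending.add(record);
        pendingBytes += record.length;
        // Send the batch only once the buffer has filled up.
        if (pendingBytes >= maxBytes) {
            flush();
        }
    }

    void flush() {
        if (pending.isEmpty()) {
            return;  // nothing to send
        }
        flushes++;
        pending.clear();
        pendingBytes = 0;
    }
}
```

With 100 records of 100 bytes each, a 1000-byte buffer flushes 10 times, while a 1-byte buffer (which behaves like autoflush) flushes 100 times. Note also that with the WAL disabled, edits that have not yet been flushed to disk on the region server are lost if that server crashes, so the speedup trades away durability.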
-
RE: HBase secondary index performance
Michael Segel 2010-09-03, 14:57
> * Just added one more regionserver and it did not help. Actually it went back to 60/s for some strange reason (with one client). The requests in the hbase ui is not uniform across 2 region servers. One server is doing around 2000 and the other 500.

Just a small suggestion...

If you have a table that is already populated and you add a new region server, your data isn't going to balance itself out. If you want to balance your existing data, you'll need to bring down HBase and then run Hadoop's balancer app. When it has completed, you'll see that your data is spread more evenly across the cloud. Please remember that HBase needs to be down when you run the balancer app.
-
Re: HBase secondary index performance
Murali Krishna. P 2010-09-04, 13:55
Thanks Samuru,

I was reading about custom indexing in HBase and wanted to know how updates are handled with custom indexing. If the original data doesn't change, it is probably a good solution. But say one of the column values changes in the original table: we need to query the index table for the original column value, delete it, and then add an entry for the new value. I think this will run into consistency issues since we are doing it in a non-transactional manner.

Are we always doing full indexing and not worrying about incremental updates? Maybe I am missing something here since I am new to this.

My requirements are such that daily updates are around 10 million records, most of which are updates to existing rows, and we want indexing to be real time (or near real time). Any suggestions are appreciated.

Thanks,
Murali Krishna
-
Re: HBase secondary index performance
Samuru Jackson 2010-09-04, 19:03
Hi,

I'm not sure I understand your problem completely, but regarding your update issue: HBase keeps versions of your columns. If you have an index on something that needs to be updated, you just overwrite the value in the index. There is no need to remove things.

I also organize my indexes in separate tables. There is one table for each indexed column of a table, and I keep separate tables for composite indexes. For fast retrieval I created an indexmanager table, which I use to look up the corresponding indexes for attributes and to keep statistics about them, for instance for query planning.

Cheers!
/SJ
http://uncinuscloud.blogspot.com/
-
Re: HBase secondary index performance
Todd Lipcon 2010-09-04, 19:16
On Fri, Sep 3, 2010 at 7:57 AM, Michael Segel <[EMAIL PROTECTED]> wrote:

> Just a small suggestion...
>
> If you have a table that is populated and you add a new region server, your data isn't going to balance itself out.
> If you want to balance your existing data, you'll need to bring down hbase, then run hadoop's balancer app. When its completed, you'll see that your data is now spread more evenly across the cloud. Please remember that you need to have HBase down when you run the balancer app.

The above is all incorrect. The data *will* balance itself out on HDFS after major compactions have taken place, and even before that, the regions *will* balance themselves across region servers.

Running the balancer while HBase is running is also perfectly safe, though it is not necessary for performance reasons.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
-
Re: HBase secondary index performance
Andrey Stepachev 2010-09-04, 22:23
2010/9/3 Murali Krishna. P <[EMAIL PROTECTED]>:
> * custom indexing is good, but our data keeps changing every day. So, probably indextable is the best option for us

With custom indexing you can use timestamps to check that an index record is still valid (or even simply recheck that the value still exists). You also need regular index cleanup (an MR job or some custom application).

To index a row identified by 'key' and holding 'value', we create an index table whose row key is [value:key], and insert an index row every time we insert a value. We get 30k rows/s/node. To find all rows with 'value', we scan [value:0000, value:9999] and collect all the keys, which point to the rows containing that value. So we scan the index, do random gets on the rows, recheck that each index entry is still valid (check the value or the timestamp; the index timestamp should be >= the value timestamp) and return only the valid rows. (Maybe we can even delete index entries on the fly when the check fails, to clean up stale data automatically.)

> * Just added one more regionserver and it did not help. Actually it went back to 60/s for some strange reason (with one client). The requests in the hbase ui is not uniform across 2 region servers. One server is doing around 2000 and the other 500. Probably once the region gets split and when we have lots of data, writes will improve ? (Now it is just writing to one region for the main table)

It looks like all the data goes to one region server. Try to make the writes more random (maybe use a random UUID as the key, or some other key randomization technique).

> * Is there some way to do bulk load the indexedtable? Earlier I have used the bulk loader tool (mapreduce job which creates the regions offline) but not sure whether it works with indexed table.

Not sure, but you can look at the source code and try to emulate the indexing operations in your own code after a regular bulk load.

Andrey.
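
The scan-then-recheck scheme described in this message can be sketched as follows, assuming in-memory sorted maps in place of HBase tables and assuming that values never contain the ':' separator. `TimestampedIndex`, `Cell` and `find` are hypothetical names for illustration, not code from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of a [value:key] index that is validated at read time. */
class TimestampedIndex {
    static final class Cell {
        final String value;
        final long ts;
        Cell(String value, long ts) { this.value = value; this.ts = ts; }
    }

    final TreeMap<String, Cell> mainTable = new TreeMap<>();   // rowKey -> (value, timestamp)
    final TreeMap<String, Long> indexTable = new TreeMap<>();  // "value:rowKey" -> timestamp

    /** Write path: two plain puts, no read and no delete of older index entries. */
    void put(String rowKey, String value, long ts) {
        mainTable.put(rowKey, new Cell(value, ts));
        indexTable.put(value + ":" + rowKey, ts);
    }

    /** Read path: range-scan the index, then recheck each hit against the main table. */
    List<String> find(String value) {
        List<String> hits = new ArrayList<>();
        String prefix = value + ":";
        for (Map.Entry<String, Long> e : indexTable.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;  // left the key range for this value
            }
            String rowKey = e.getKey().substring(prefix.length());
            Cell current = mainTable.get(rowKey);
            // Valid only if the row still holds this value and the index
            // entry is not older than the row's current cell.
            if (current != null && current.value.equals(value) && e.getValue() >= current.ts) {
                hits.add(rowKey);
            }
        }
        return hits;
    }
}
```

Stale entries stay in the index until a separate cleanup pass removes them, so the write path stays cheap while reads do the extra validation work.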
-
Re: HBase secondary index performance
Samuru Jackson 2010-09-05, 00:57
Hi,

> where key will be [value:key] and insert rows every time, when we insert our values. We will got 30k rows/s/node.

Could you specify what kind of hardware you did this on?

How did you design your indexer? Is it multithreaded?

/SJ
http://uncinuscloud.blogspot.com/
-
Re: HBase secondary index performance
Murali Krishna. P 2010-09-05, 05:17
Hi,

Thanks for the detailed explanation. I like the idea of the timestamp check; it will be good enough for us, and I can add a periodic MR cleaner. However, I need some help understanding the 30K number that was claimed. With the IndexedTable approach I got only 1200 rows/s (60 rows/s x 20 index columns). I understand that IndexedTable does additional reads, but the 25x improvement you got is very impressive. Can you please help me understand this gain? (My hardware is 8 GB RAM / 7.2k rpm disk / 2 cores at 2 GHz.)

Thanks,
Murali Krishna
-
Re: HBase secondary index performance
Andrey Stepachev 2010-09-05, 18:12
2010/9/5 Samuru Jackson <[EMAIL PROTECTED]>:
> Could you specify on what kind of hardware you did this?

A 3-node "cluster": Core 2 Duo, 16 GB RAM, SAS RAID 10.

> How did you design your indexer? Is it multithreaded?

It is not an indexer. It is an abstraction around HTable that does the put plus additional puts (as described before) into the index tables. I will release this code later (I don't have an actual date yet), but it is not rocket science.

30k is the peak requests per second, not a constant rate. For effective rows (JSON objects of 100-500 bytes with 1-2 indexes on them) I get 1-3k objects per second per node.
-
Re: HBase secondary index performance
Andrey Stepachev 2010-09-05, 18:24
2010/9/5 Murali Krishna. P <[EMAIL PROTECTED]>:
> However I need some help in understanding the 30K number that was claimed.

The real insert rate depends on the row size, the write buffer size, etc. With simple rows holding one long each, I got 30k requests/second (as shown in HBase). With JSON-serialized objects of 100-700 bytes each, with validation, I can insert 2-6k objects per second.

> With the IndexedTable approach, I got only 1200rows/s (60rows/s X 20 index columns). I understood that there are additional reads that indextable does but 25X improvement that you got is very impressive. Can you please help me to understand this gain ? (My hardware is 8GB/7.2rpm/2core-2GHz)

Did you try inserting data into a non-indexed region (with the indexedtables extension disabled)? What numbers did you get?
-
Re: HBase secondary index performance
Murali Krishna. P 2010-09-06, 05:02
Hi,

My row size is around 300 bytes with 20 columns in total. I tried custom indexing without writing to the WAL. Currently I have only 2 tables: the main table and one table for all 20 indexes. My key for the index table is columnValue+columnName+rowKey. I am getting around 500 inserts/second now (i.e., a total of ~10K puts). This is probably comparable with your numbers given the data size.

I have some doubts about the HBase write implementation.

* Is this the maximum we can achieve with any number of region servers? Why does adding region servers not improve write performance? Is it because, when the table is empty, all writes go to one region?
* Writing to an existing, well-distributed table would probably give better performance, since the writes would be spread across machines? In that case, if we have multiple tables (one per index), will that be better even during the initial write (since each table has a different region)?

Thanks,
Murali Krishna
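
One way the columnValue+columnName+rowKey key could be assembled is sketched below. The zero-byte separator is an assumption added here so that adjacent parts cannot run together (the message does not say how the parts are joined), and `CompositeIndexKey` is a hypothetical helper. The sketch also assumes that none of the parts contains a zero byte.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Builds an index row key of the form columnValue \0 columnName \0 rowKey \0. */
class CompositeIndexKey {
    static byte[] of(String columnValue, String columnName, String rowKey) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String part : new String[] {columnValue, columnName, rowKey}) {
            byte[] bytes = part.getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
            out.write(0);  // separator byte (assumed, not from the original message)
        }
        return out.toByteArray();
    }
}
```

Without a separator, the parts ("ab", "c") and ("a", "bc") would produce the same leading bytes, which makes prefix scans on a value ambiguous.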
-
Re: HBase secondary index performance Ted Yu 2010-09-06, 13:53
> My key to the index table is columnValue+columnName+rowKey.
You need to consider the distribution of this key so that writes to the index table don't become the bottleneck in the write path.
Please clarify how this index table serves 20 columns. In the above schema, columnValue would be different for each of the 20 indexed columns, I assume.
-
Re: HBase secondary index performance Murali Krishna. P 2010-09-06, 17:13
> Please clarify how this index table serves 20 columns - in the above schema,
> columnValue would be different for the 20 columns indexed, I assume.
My query to the index table will be on columnValue + columnName. This works for exact matches; if you need to scan on a partial value, the key generation has to be reversed to cName + cValue + rowKey. I went for this schema to reduce the number of tables involved.
Thanks,
Murali Krishna
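The trade-off between the two key orders can be sketched with a sorted list standing in for HBase's byte-ordered row keys. This is an illustration under assumptions, not HBase code: the NUL separator, the `prefix_scan` helper, and the sample rows are all hypothetical, and values are assumed never to contain the separator.

```python
from bisect import bisect_left

SEP = b"\x00"  # assumed field separator


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix, the way a row-key prefix scan would."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


# Hypothetical (rowKey, columnName, columnValue) triples.
rows = [
    (b"r1", b"city", b"paris"),
    (b"r2", b"city", b"parma"),
    (b"r3", b"name", b"paris"),
    (b"r4", b"city", b"oslo"),
]

# columnValue + columnName + rowKey: the schema used in the thread.
value_first = sorted(SEP.join((v, c, r)) for r, c, v in rows)
# columnName + columnValue + rowKey: the reversed order for partial-value scans.
name_first = sorted(SEP.join((c, v, r)) for r, c, v in rows)

# Exact match (city == "paris") works with the value-first key.
exact = prefix_scan(value_first, b"paris" + SEP + b"city" + SEP)

# A partial value ("par") on the value-first key mixes columns together.
mixed = prefix_scan(value_first, b"par")

# The same partial value on the name-first key stays within one column.
partial = prefix_scan(name_first, b"city" + SEP + b"par")

print([k.split(SEP)[-1] for k in exact])    # [b'r1']
print(len(mixed))                           # 3, includes the "name" column
print([k.split(SEP)[-1] for k in partial])  # [b'r1', b'r2']
```

With the value-first order, a partial-value scan has to filter out other columns on the client; with the name-first order, each column occupies its own contiguous key range.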
-
Re: HBase secondary index performance Andrey Stepachev 2010-09-06, 18:46
2010/9/6 Murali Krishna. P <[EMAIL PROTECTED]>:
> Hi,
> My row size is around 300 bytes with total 20 columns. I tried the custom
> indexing without the write to WAL. Currently having only 2 tables, one for the
> main table and another for all 20 indexes. My key to the index table is
> columnValue+columnName+rowKey.

As mentioned before, you can randomize your index insertions. If you don't need ordered scans or range scans on columnValue, you can prefix the key with a hash, e.g. sha(columnValue) + columnValue + columnName + rowKey. This removes the hotspot on one of your region servers.

> I am getting around 500 inserts/second now. (ie, total of ~10K puts). This is
> probably comparable with your numbers based on the data size.

Do all region servers get equal load, or are some servers busier than others?

> I have some doubts with the hbase write implementation.
> * Is this the max that we can achieve with any number of region servers? Why
> adding region servers not improving the write performance? Is it because when
> the data doesn't exist in the table, it always writes to one region ?

In general, yes. Until the table splits, all writes go to a single region server.

> * Probably writing to an existing, well distributed table might give better
> performance since the writes will be across machines ? In that case, if we have
> multiple tables (one per index), will it be better during the initial write
> itself (since each table has different region) ??

The more servers that take part in the writes, the better.

Andrey.
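A rough sketch of the hash-prefixed key suggested in this reply, sha(columnValue) + columnValue + columnName + rowKey. It uses SHA-1 from Python's standard library; the 4-byte truncation of the digest, the NUL separator, the helper names, and the timestamp-like sample values are assumptions for illustration and do not come from the thread.

```python
import hashlib

SEP = b"\x00"   # assumed field separator
PREFIX_LEN = 4  # assumed number of hash bytes to keep; the thread does not give one


def hash_prefix(column_value: bytes) -> bytes:
    """Fixed-length hash of the value, used only to scatter keys."""
    return hashlib.sha1(column_value).digest()[:PREFIX_LEN]


def plain_index_key(column_value: bytes, column_name: bytes, row_key: bytes) -> bytes:
    return SEP.join((column_value, column_name, row_key))


def hashed_index_key(column_value: bytes, column_name: bytes, row_key: bytes) -> bytes:
    return hash_prefix(column_value) + plain_index_key(column_value, column_name, row_key)


def exact_match_prefix(column_value: bytes, column_name: bytes) -> bytes:
    """Scan prefix for an exact-match lookup; the hash is recomputed from the value."""
    return hash_prefix(column_value) + column_value + SEP + column_name + SEP


# Hypothetical monotonically increasing values, e.g. timestamps.
values = [f"2010-09-06 18:46:{s:02d}".encode() for s in range(60)]
plain = [plain_index_key(v, b"created", b"row-%d" % i) for i, v in enumerate(values)]
hashed = [hashed_index_key(v, b"created", b"row-%d" % i) for i, v in enumerate(values)]

# Plain keys all start with the same byte, so they land next to each other;
# hashed keys start with many different bytes and spread over the key space.
plain_first_bytes = {k[0] for k in plain}
hashed_first_bytes = {k[0] for k in hashed}
print(len(plain_first_bytes), len(hashed_first_bytes))
```

Exact-match lookups still work because the prefix can be recomputed from the value being searched for, but neighbouring values are no longer adjacent, which is why this only fits when ordered or range scans on columnValue are not needed.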