-
How to query by rowKey-infix
Christian Schäfer 2012-07-31, 15:27
Hello there,

I designed a row key for queries that need best performance (~100 ms), which looks like this:

userId-date-sessionId

These queries (scans) are always based on a userId and sometimes additionally on a date, too.
That's no problem with the key above.

However, another kind of query needs to be based on a given time range, where the leftmost part of the key, the userId, is not given or known.
In this case I need to get all rows covering the given time range, together with their date, to create a daily report.

As I can't set wildcards at the beginning of a left-based index for the scan, I only see the possibility of scanning the index of the whole table to collect the row keys that are inside the time range I'm interested in.

Is there a more elegant way to collect rows within time range X?
(Unfortunately, the date attribute is not equal to the timestamp that is stored by HBase automatically.)

Could/should one maybe leverage some kind of row key caching to accelerate the collection process?
Is that covered by the block cache?

Thanks in advance for any advice.

regards
Chris
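
To make the problem concrete, here is a minimal sketch that models the sorted key space with a plain `TreeSet` instead of a real HBase table. The class name, the `yyyyMMdd` date format and the sample keys are assumptions made for illustration. It shows why a userId prefix maps to one contiguous range, while a date in the middle of the key forces a look at every row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Toy model of a sorted key space. Key layout and sample values are illustrative only.
class RowKeyOrderDemo {

    // Hypothetical sample rows: userId-date-sessionId, with dates written as yyyyMMdd.
    static TreeSet<String> sampleKeys() {
        return new TreeSet<>(Arrays.asList(
                "u0001-20120730-s1",
                "u0001-20120801-s2",
                "u0002-20120731-s3",
                "u0002-20120802-s4",
                "u0003-20120801-s5"));
    }

    // Keys sharing a prefix are neighbours in sort order, so the scan can stop early.
    static List<String> prefixScan(TreeSet<String> keys, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String key : keys.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break; // left the contiguous range for this prefix
            }
            hits.add(key);
        }
        return hits;
    }

    // Keys matching a date range are scattered across users, so every key is inspected.
    static List<String> dateScan(TreeSet<String> keys, String fromDate, String toDate) {
        List<String> hits = new ArrayList<>();
        for (String key : keys) {
            String date = key.split("-")[1];
            if (date.compareTo(fromDate) >= 0 && date.compareTo(toDate) <= 0) {
                hits.add(key);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> keys = sampleKeys();
        System.out.println(prefixScan(keys, "u0002-"));
        System.out.println(dateScan(keys, "20120801", "20120801"));
    }
}
```

The date comparison only works as a string comparison because the dates have a fixed width; a date stored as a variable-length number would need zero padding to sort correctly.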
-
Re: How to query by rowKey-infix
Jerry Lam 2012-07-31, 17:10
Hi Chris:

I'm thinking about building a secondary index for primary key lookup, then querying by the primary keys in parallel.

I'm interested to see if there are other options too.

Best Regards,

Jerry
-
Re: How to query by rowKey-infix
Matt Corgan 2012-07-31, 17:41
When deciding between a table scan and a secondary index, you should try to estimate what percentage of the underlying data blocks will be used in the query. By default, each block is 64KB.

If each user's data is small and you are fitting multiple users per block, then you're going to need all the blocks, so a table scan is better because it's simpler. If each user has 1MB+ of data, then you will want to pick out the individual blocks relevant to each date. The secondary index will help you go directly to those sparse blocks, but at a cost in complexity, consistency, and extra denormalized data that knocks primary data out of your block cache.

If latency is not a concern, I would start with the table scan. If that's too slow, add the secondary index, and if you still need it faster, do the primary key lookups in parallel as Jerry mentions.

Matt
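
The estimate described above can be turned into a back-of-the-envelope calculation. The sketch below is only a rough model, not something taken from HBase itself: it assumes every user has rows in the requested date range, that a user's matching rows sit next to each other, and that they spill into one extra block at the boundary. The class and method names are made up for illustration.

```java
// Rough estimate of how many data blocks a date-range query has to read.
// The model is an assumption for illustration, not an HBase formula.
class BlockEstimate {

    static final long BLOCK_SIZE = 64 * 1024; // default block size mentioned in the thread

    // bytesPerUser: average amount of data stored per user
    // fractionInRange: share of a user's rows that fall into the queried dates (0..1)
    static double fractionOfBlocksTouched(long bytesPerUser, double fractionInRange) {
        double blocksPerUser = (double) bytesPerUser / BLOCK_SIZE;
        if (blocksPerUser <= 1.0) {
            // Several users share one block, so every block holds some matching rows.
            return 1.0;
        }
        // Matching rows are contiguous per user; add one block for the boundary.
        double touched = fractionInRange * blocksPerUser + 1.0;
        return Math.min(1.0, touched / blocksPerUser);
    }

    public static void main(String[] args) {
        // A few hundred bytes per user: all blocks are needed, a table scan is fine.
        System.out.println(fractionOfBlocksTouched(500, 0.1));
        // 16 MB per user, 5% of the rows in range: only a small share of blocks matters.
        System.out.println(fractionOfBlocksTouched(16L * 1024 * 1024, 0.05));
    }
}
```

When the result is close to 1.0 the index cannot save much I/O; when it is small, jumping straight to the relevant blocks starts to pay off.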
-
Re: How to query by rowKey-infix
Christian Schäfer 2012-08-01, 08:18
Thanks Matt & Jerry for your replies.

The data for each row is small (a few hundred bytes).
So I will try the parallel table scan first, as you suggested...
Before organizing that myself, wouldn't it be a better idea to create a MapReduce job for that?

I'm not so keen on implementing secondary indices, especially due to the mentioned consistency concerns.
Unfortunately, projects like ithbase and ihbase no longer support current HBase, and secondary indexes via coprocessors don't seem to be there yet.

If I'm wrong, feel free to correct me :)

regards,
Chris
-
Re: How to query by rowKey-infix
Michael Segel 2012-08-01, 11:52
Actually, with coprocessors you can create a secondary index in short order.
Then your cost is going to be 2 fetches. Trying to do a partial table scan will be more expensive.
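
The "2 fetches" can be illustrated without any HBase API. The sketch below uses two in-memory `TreeMap`s as stand-ins for the primary table and an index table; the index key layout `date-userId-sessionId` and all names are assumptions for illustration, and a real coprocessor-maintained index would look different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// In-memory stand-in for a primary table plus a date-first index table.
class SecondaryIndexSketch {

    final TreeMap<String, String> primary = new TreeMap<>(); // userId-date-sessionId -> value
    final TreeMap<String, String> index = new TreeMap<>();   // date-userId-sessionId -> primary key

    void put(String userId, String date, String sessionId, String value) {
        String primaryKey = userId + "-" + date + "-" + sessionId;
        primary.put(primaryKey, value);
        // Second write: in a real cluster this is where the consistency concern comes from,
        // because the two writes are not one atomic operation.
        index.put(date + "-" + userId + "-" + sessionId, primaryKey);
    }

    // fromDate is inclusive, toDate is exclusive; both use the fixed-width yyyyMMdd format.
    List<String> valuesBetween(String fromDate, String toDate) {
        List<String> values = new ArrayList<>();
        // Fetch 1: one contiguous range scan over the index.
        for (String primaryKey : index.subMap(fromDate, true, toDate, false).values()) {
            // Fetch 2: point lookup in the primary table.
            values.add(primary.get(primaryKey));
        }
        return values;
    }

    public static void main(String[] args) {
        SecondaryIndexSketch table = new SecondaryIndexSketch();
        table.put("u0001", "20120730", "s1", "a");
        table.put("u0001", "20120801", "s2", "b");
        table.put("u0002", "20120731", "s3", "c");
        table.put("u0003", "20120801", "s5", "d");
        System.out.println(table.valuesBetween("20120801", "20120802"));
    }
}
```

The point lookups in the second step are independent of each other, which is what makes the parallel variant Jerry mentions possible.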
-
Re: How to query by rowKey-infix
Christian Schäfer 2012-08-02, 12:23
OK,

at first I will try the scans.

If that's too slow I will have to upgrade HBase (currently 0.90.4-cdh3u2) to be able to use coprocessors.

Currently I'm stuck at the scans because it requires two steps (and therefore some kind of filter chaining).

The key: userId-dateInMillis-sessionId

First, I need to extract dateInMillis with a regex or substring (using special delimiters for the date).

Second, the extracted value must be parsed to Long and set to a RowFilter Comparator like this:
-
Re: How to query by rowKey-infix
Alex Baranau 2012-08-02, 22:57
Hi Christian!

If you put secondary indexes aside and assume you are going with "heavy scans", you can try the following two things to make it much faster. If this is appropriate to your situation, of course.

1.

> Is there a more elegant way to collect rows within time range X?
> (Unfortunately, the date attribute is not equal to the timestamp that is stored by hbase automatically.)

Can you set the timestamp of the Puts to the one you have in the row key, instead of relying on the one that HBase sets automatically (current ts)? If you can, this will improve reading speed a lot by setting a time range on the scanner. It depends on how you are writing your data, of course, but I assume that you mostly write data in a "time-increasing" manner.

2.

If your userId has a fixed length, or you can change it so that it has a fixed length, then you can actually use something like a "wildcard" in the row key. There's a way in a Filter implementation to fast-forward to the record with a specific row key and thereby skip many records. This might be used as follows:
* suppose your userId is 5 characters in length
* suppose you are scanning for records with time between 2012-08-01 and 2012-08-08
* when you are scanning records and encounter e.g. key "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can tell the scanner from your filter to fast-forward to key "aaaab_2012-08-01", because you know that all remaining records of user "aaaaa" don't fall into the interval you need (as the time for its records will be >= 2012-08-09).

As of now, I believe you will have to implement your own custom filter to do that.
Pointer: org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
I believe I implemented a similar thing some time ago. If this idea works for you I could look for the implementation and share it if it helps. Or maybe even simply add it to the HBase codebase.

Hope this helps,

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <[EMAIL PROTECTED]> wrote:

> Excuse my double posting.
> Here is the complete mail:
>
> OK,
>
> at first I will try the scans.
>
> If that's too slow I will have to upgrade hbase (currently 0.90.4-cdh3u2) to be able to use coprocessors.
>
> Currently I'm stuck at the scans because it requires two steps (therefore maybe some kind of filter chaining is required)
>
> The key: userId-dateInMillis-sessionId
>
> At first I need to extract dateInMillis with regex or substring (using special delimiters for date)
>
> Second, the extracted value must be parsed to Long and set to a RowFilter Comparator like this:
>
> scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
>
> How to chain that?
> Do I have to write a custom filter?
> (Would like to avoid that due to deployment)
>
> regards
> Chris
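
The fast-forward idea from point 2 boils down to computing the next interesting row key. The sketch below shows only that computation on plain strings, using the 5-character userId, the `_` delimiter and the dates from the example above. It is not an HBase Filter: the class and method names are hypothetical, and a real filter would work on bytes and hand the hint to the scanner after returning SEEK_NEXT_USING_HINT.

```java
// Computes where a scan could jump to next, given the current row key.
// Plain-string sketch of the idea; key layout follows the example in the message.
class SeekHintSketch {

    static final int USER_ID_LENGTH = 5; // assumption: fixed-length userId

    // Returns null when the row lies inside [fromDate, toDate] and should be kept,
    // otherwise the row key the scanner could fast-forward to.
    static String nextHint(String rowKey, String fromDate, String toDate) {
        String userId = rowKey.substring(0, USER_ID_LENGTH);
        int dateStart = USER_ID_LENGTH + 1; // skip the '_' delimiter
        String date = rowKey.substring(dateStart, dateStart + fromDate.length());
        if (date.compareTo(fromDate) < 0) {
            // Too early: jump forward to the start of the range for the same user.
            return userId + "_" + fromDate;
        }
        if (date.compareTo(toDate) > 0) {
            // Too late: nothing more for this user, jump to the next possible user.
            return nextUserId(userId) + "_" + fromDate;
        }
        return null;
    }

    // Smallest same-length id after userId, assuming the last character can be incremented.
    // A byte-based implementation would also need to handle carry on overflow.
    static String nextUserId(String userId) {
        char[] chars = userId.toCharArray();
        chars[chars.length - 1]++;
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(nextHint("aaaaa_2012-08-09_3jh345j345kjh", "2012-08-01", "2012-08-08"));
    }
}
```

The hint does not have to name an existing row: the scanner simply continues at the first key that is greater than or equal to it, so a hint for a user id that does not exist is harmless.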
-
Re: How to query by rowKey-infix
Matt Corgan 2012-08-02, 23:09
Also Christian, don't forget you can read all the rows back to the client and do the filtering there using whatever logic you like. HBase Filters can be thought of as an optimization (predicate push-down) over client-side filtering. Pulling all the rows over the network will be slower, but I don't think we know enough about your data or speed requirements to rule it out.
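
On the client side, the two steps Christian describes (extract the date, then compare it as a number) need no filter chaining at all. The sketch below assumes the key layout userId-dateInMillis-sessionId with `-` as delimiter and a userId that contains no `-` itself; the class and method names are made up for illustration. As far as I can tell, a RowFilter with a BinaryComparator compares the whole row key, so it cannot do this infix comparison on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Client-side filtering on the date infix of row keys shaped userId-dateInMillis-sessionId.
class ClientSideInfixFilter {

    // Step 1: cut the date out of the key. Step 2: parse it to a long.
    static long dateInMillis(String rowKey) {
        int first = rowKey.indexOf('-');
        int second = rowKey.indexOf('-', first + 1);
        return Long.parseLong(rowKey.substring(first + 1, second));
    }

    // Keeps the keys whose date lies in [fromInclusive, toExclusive).
    static List<String> inRange(List<String> rowKeys, long fromInclusive, long toExclusive) {
        List<String> hits = new ArrayList<>();
        for (String rowKey : rowKeys) {
            long date = dateInMillis(rowKey);
            if (date >= fromInclusive && date < toExclusive) {
                hits.add(rowKey);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical keys with small numbers standing in for real millisecond values.
        List<String> keys = Arrays.asList("u1-1000-s1", "u2-2000-s2", "u3-3000-s3");
        System.out.println(inRange(keys, 1500L, 3000L));
    }
}
```

The same two helper steps would also be the core of a custom server-side filter, so trying the logic on the client first costs little.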
-
Re: How to query by rowKey-infix
Alex Baranau 2012-08-03, 01:15
I think this is exactly what Christian is trying to (and should be trying to) avoid ;).

I can't imagine a use case where you need to filter something, could do it with (at least) a server-side filter, and yet would want to do it on the client side... Filtering on the client side when you can do it on the server side just feels wrong, especially given that there's a lot of data in HBase (otherwise why would you use it?).

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
-
Re: How to query by rowKey-infix
Matt Corgan 2012-08-03, 01:29
Yeah - just thought I'd point it out since people often have small tables in their cluster alongside the big ones, and when generating reports, sometimes you don't care if it finishes in 10 minutes vs an hour.
-
Re: How to query by rowKey-infixChristian Schäfer 2012-08-03, 09:23
Hi Alex,
thanks a lot for the hint about setting the timestamp of the Put. I didn't know this was possible, but it solves the problem (a first test was successful). So I'm really glad that I don't need to apply a filter that extracts the time for every row.
Nevertheless, I would like to see your custom filter implementation. It would be nice if you could share it to help me get into the topic.
And yes, that helped :)
regards
Chris
On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <[EMAIL PROTECTED]> wrote:
> Excuse my double posting.
> Here is the complete mail:
>
> OK,
>
> at first I will try the scans.
>
> If that's too slow I will have to upgrade HBase (currently 0.90.4-cdh3u2) to be able to use coprocessors.
>
> Currently I'm stuck at the scans because they require two steps (so maybe some kind of filter chaining is required).
>
> The key: userId-dateInMillis-sessionId
>
> First I need to extract dateInMillis with a regex or substring (using special delimiters for the date).
>
> Second, the extracted value must be parsed to a Long and passed to a RowFilter comparator like this:
>
> scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
>
> How to chain that?
> Do I have to write a custom filter?
> (Would like to avoid that due to deployment)
>
> regards
> Chris
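The fast-forward idea Alex described earlier in the thread (once a row is past the wanted date range, jump straight to the next user's start date) can be modeled in plain Java. This is only a sketch of the decision logic, not Alex's filter and not HBase code: the class name `SeekHintSketch`, the key layout `<5-char userId>_<yyyy-MM-dd>_<sessionId>`, and the ASCII-only increment are assumptions made for illustration. In a real custom filter this decision would presumably sit behind `ReturnCode.SEEK_NEXT_USING_HINT`.

```java
import java.nio.charset.StandardCharsets;

/**
 * Pure-Java model of the "fast-forward" decision; no HBase classes are used.
 * Assumed (hypothetical) key layout: <5-char userId>_<yyyy-MM-dd>_<sessionId>.
 */
final class SeekHintSketch {
    static final int USER_ID_LEN = 5;  // assumption: fixed-length userId
    static final int DATE_LEN = 10;    // "yyyy-MM-dd" sorts lexicographically by date

    /** Returns null if the row lies inside [startDate, endDate], else the key to seek to. */
    static String nextHint(String rowKey, String startDate, String endDate) {
        String userId = rowKey.substring(0, USER_ID_LEN);
        String date = rowKey.substring(USER_ID_LEN + 1, USER_ID_LEN + 1 + DATE_LEN);
        if (date.compareTo(startDate) < 0) {
            // Too early: jump to this user's first possible row inside the range.
            return userId + "_" + startDate;
        }
        if (date.compareTo(endDate) > 0) {
            // Too late: this user has nothing more in range, so jump to the next possible userId.
            return increment(userId) + "_" + startDate;
        }
        return null; // in range: the row would be included
    }

    /** Smallest string of the same length that sorts after s (ASCII bytes only). */
    static String increment(String s) {
        byte[] b = s.getBytes(StandardCharsets.US_ASCII);
        for (int i = b.length - 1; i >= 0; i--) {
            if (b[i] != 0x7F) {
                b[i]++;
                return new String(b, StandardCharsets.US_ASCII);
            }
            b[i] = 0; // carry into the next position to the left
        }
        throw new IllegalArgumentException("no larger key of this length");
    }
}
```

With the example from the thread, a row `aaaaa_2012-08-09_3jh345j345kjh` and the range 2012-08-01 to 2012-08-08 yield the hint `aaaab_2012-08-01`, so all remaining rows of user `aaaaa` are skipped without being read one by one.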
-
Re: How to query by rowKey-infixChristian Schäfer 2012-08-03, 09:34
Hi Matt,
sure, I have this in mind as a last option (at least on a limited subset of the data). Since we estimate several billion rows a week, selective filtering needs to take place on the server side. But I agree that one could do the fine-grained filtering on the client side, on a manageable subset of the data, to keep the HBase schema and indexing (by coprocessors) from getting too complicated.
regards
Chris
-
Re: How to query by rowKey-infixMichael Segel 2012-08-03, 12:21
Hi,
What does your schema look like? Would it make sense to change the key to user_id '|' timestamp and then use the session_id in the column name?
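The key layout Michael suggests could look roughly like the following sketch. It is an illustration only: the class name `RowKeySketch` and the zero-padding to 13 digits are assumptions not stated in the thread. The padding is there so that the byte order of the keys matches the chronological order of the timestamps within one user, which is what a range scan relies on.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper for a "user_id | timestamp" row key with the session id as column name. */
final class RowKeySketch {
    /**
     * Row key: userId, '|' separator, then epoch millis zero-padded to 13 digits.
     * Assumes a non-negative timestamp; without padding, "999" would sort after "1000".
     */
    static byte[] rowKey(String userId, long epochMillis) {
        String key = userId + "|" + String.format(Locale.ROOT, "%013d", epochMillis);
        return key.getBytes(StandardCharsets.UTF_8);
    }

    /** Column qualifier that carries the session id instead of putting it in the row key. */
    static byte[] qualifier(String sessionId) {
        return sessionId.getBytes(StandardCharsets.UTF_8);
    }
}
```

Note that this layout still starts with the user id, so on its own it does not help with the time-range query across all users that started the thread.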
-
Re: How to query by rowKey-infixAlex Baranau 2012-08-03, 22:14
Good!
I submitted an initial patch of the fuzzy row key filter at https://issues.apache.org/jira/browse/HBASE-6509. You can just copy the filter class, include it in your code, and use it in your setup like any other custom filter (no need to patch HBase).
Please let me know if you try it out (or post your comments at HBASE-6509).
Alex Baranau
------
Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
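To give a feel for what a fuzzy row key filter checks, here is a small model of the matching step in plain Java. It is not the code from HBASE-6509; the class and method names are made up for illustration. The mask convention used here (0 means the byte is fixed and must match, 1 means any byte is accepted) reflects my understanding of the patch and should be verified against the filter's Javadoc. The real filter additionally gives the scanner a hint for the next candidate key, which this sketch leaves out.

```java
import java.util.Arrays;

/** Pure-Java model of fuzzy row key matching; no HBase classes are used. */
final class FuzzyMatchSketch {
    /**
     * mask[i] == 0: rowKey[i] must equal pattern[i]; mask[i] == 1: any byte is accepted.
     * Only the first pattern.length bytes of the row key are checked.
     */
    static boolean matches(byte[] rowKey, byte[] pattern, byte[] mask) {
        if (rowKey.length < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && rowKey[i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    /** Builds a mask whose first wildcardLen positions are wildcards and the rest are fixed. */
    static byte[] maskWithWildcardPrefix(int patternLen, int wildcardLen) {
        byte[] mask = new byte[patternLen];          // all 0: every position fixed
        Arrays.fill(mask, 0, wildcardLen, (byte) 1); // leading positions accept any byte
        return mask;
    }
}
```

For a key such as `aaaaa_2012-08-03_s1`, a pattern `?????_2012-08-03` with the first five positions masked as wildcards would match every user's rows for that day, which is the "wildcard at the start of the key" that the original question asked for.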
-
Re: How to query by rowKey-infixChristian Schäfer 2012-08-06, 12:54
The point is that I / we want to make reports for each session that could be present on many rows distributed over all regions.
Since I expect scanning columns to be slower than scanning row keys, I chose the latter. I'm afraid I may not share the schema (yet). The userId and session details I mentioned are only there to illustrate a comparable situation.
Thanks,
Chris
-
Re: How to query by rowKey-infixChristian Schäfer 2012-08-06, 13:00
Hi,
shouldn't everyone who uses HBase for time series data have exactly the same problem when trying to fetch a time-related subset of their data? So is setting the Put timestamp "manually" THE way to go? It works for me, but I would also be interested in alternative approaches for efficient time-based scans (other than coprocessors and full table scans).
-
Re: How to query by rowKey-infixAlex Baranau 2012-08-09, 20:18
jfyi: I documented FuzzyRowFilter usage here: http://bit.ly/OXVdbg. I will add documentation to the HBase book very soon [1].
Alex Baranau
------
Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
[1] https://issues.apache.org/jira/browse/HBASE-6526
-
Re: How to query by rowKey-infixChristian Schäfer 2012-08-09, 20:55
Nice. Thanks Alex for sharing your experiences with that custom filter implementation.
Currently I'm still using a row key filter with a substring comparator. As soon as I have a good amount of test data, I will measure the performance of that naive substring filter against your fuzzy row filter.
regards,
Christian
-
Re: How to query by rowKey-infixanil gupta 2012-08-22, 18:42
Hi Christian,
I had requirements similar to yours. So far I have used timestamps for filtering the data, and I would say the performance is satisfactory. Here are the results of timestamp-based filtering:

The table has 34 million records (average row size is 1.21 KB); the query returns its entire result of 225 rows in 136 seconds.

I am running HBase 0.92 on an 8-node cluster on a VMware hypervisor. Each node has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my setup hosts 2 slave instances (2 VMs running DataNode, NodeManager, RegionServer). I have allocated only 1200 MB to the RegionServers. I haven't modified the block size of HDFS or HBase.

Considering the below-par hardware configuration of the cluster, I feel the performance is OK, and IMO it will be better than a substring comparator on column values, since with a substring comparator filter you are essentially doing a FULL TABLE scan, whereas in a time-range-based scan you can *skip store files*.

On a side note, Alex created a JIRA for enhancing the current FuzzyRowFilter to also do range-based filtering. Here is the link: https://issues.apache.org/jira/browse/HBASE-6618 . You are more than welcome to chime in.

HTH,
Anil Gupta
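To illustrate the store-file skipping mentioned above, here is a minimal plain-Java sketch. It is a model, not HBase code: it assumes each store file records the minimum and maximum cell timestamp it contains, and that the scan's time range is half-open, [minStamp, maxStamp). The class name TimeRangeSkipSketch, the StoreFileInfo type, and the filesToRead method are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TimeRangeSkipSketch {

    /** Hypothetical stand-in for the timestamp metadata of one store file. */
    static final class StoreFileInfo {
        final String name;
        final long minTimestamp; // oldest cell timestamp in the file
        final long maxTimestamp; // newest cell timestamp in the file

        StoreFileInfo(String name, long minTimestamp, long maxTimestamp) {
            this.name = name;
            this.minTimestamp = minTimestamp;
            this.maxTimestamp = maxTimestamp;
        }
    }

    /**
     * Returns the names of the files that may hold cells inside the
     * half-open range [minStamp, maxStamp); all other files can be skipped.
     */
    static List<String> filesToRead(List<StoreFileInfo> files, long minStamp, long maxStamp) {
        List<String> selected = new ArrayList<String>();
        for (StoreFileInfo file : files) {
            boolean overlaps = file.maxTimestamp >= minStamp && file.minTimestamp < maxStamp;
            if (overlaps) {
                selected.add(file.name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<StoreFileInfo> files = Arrays.asList(
                new StoreFileInfo("file-a", 100L, 199L),
                new StoreFileInfo("file-b", 200L, 299L),
                new StoreFileInfo("file-c", 300L, 399L));
        // Only file-b overlaps [210, 250); file-a and file-c are never opened.
        System.out.println(filesToRead(files, 210L, 250L));
    }
}
```

A filter that inspects the content of every row has no such per-file shortcut, which is the contrast drawn above between a substring comparator and a time-range scan.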
-
Re: How to query by rowKey-infixChristian Schäfer 2012-08-23, 08:41
Hi Anil,
To restrict the data to a certain time window, I also set a time range on the scan.

I'm slightly shocked by the processing time of more than 2 minutes to return 225 rows. I would actually need a response in 5-10 seconds.

In your timestamp-based filtering, do you check the timestamp as part of the row key, or do you use the put timestamp (as I do)?

How many rows are scanned/touched by your timestamp-based filtering? Is it a full table scan where each row's key is checked against a given timestamp/time range?

My use case of obtaining data with a substring comparator operates on the row key. It really can't be replaced by setting the time range in my case.

By the way, the scan is additionally restricted to a certain time range to increase the skipping of irrelevant files and thus improve performance.

regards,
Christian
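A minimal plain-Java sketch of what a row-key infix check amounts to for the key layout userId-date-sessionId from the start of this thread. It is not HBase filter code, and it rests on assumptions that the thread does not state: the date is written as yyyyMMdd, and neither userId nor date contains a hyphen. The names RowKeyInfixSketch, dateOf, and keysInDateRange are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowKeyInfixSketch {

    /** Extracts the middle (date) field from a key shaped userId-date-sessionId. */
    static String dateOf(String rowKey) {
        int first = rowKey.indexOf('-');
        int second = rowKey.indexOf('-', first + 1);
        if (first < 0 || second < 0) {
            throw new IllegalArgumentException("unexpected row key: " + rowKey);
        }
        return rowKey.substring(first + 1, second);
    }

    /**
     * Keeps the keys whose date field lies in [fromDate, toDate], both inclusive.
     * With yyyyMMdd strings, lexicographic order equals chronological order.
     * Every key is examined, which is why this behaves like a full scan.
     */
    static List<String> keysInDateRange(List<String> allKeys, String fromDate, String toDate) {
        List<String> hits = new ArrayList<String>();
        for (String key : allKeys) {
            String date = dateOf(key);
            if (date.compareTo(fromDate) >= 0 && date.compareTo(toDate) <= 0) {
                hits.add(key);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
                "alice-20120801-s1", "bob-20120715-s2", "carol-20120803-s3");
        System.out.println(keysInDateRange(keys, "20120801", "20120802"));
    }
}
```

Because the userId prefix is unknown, the check cannot narrow the scan to a key range; it can only accept or reject each row as it is read, so combining it with a time range on the scan is what limits the amount of data touched.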
-
Re: How to query by rowKey-infixanil gupta 2012-08-24, 07:53
Christian: I'm slightly shocked about the processing time of more than 2 mins to return 225 rows. I would actually need a response in 5-10 sec.

Anil: I started getting responses within 1-2 seconds of firing the query, but it took 2 minutes to get all 225 results. My table has 34 million rows, and each row has 25 columns on average. The average size of each row is around 1.21 KB. The size of one replica is ~40 GB in HDFS. I haven't compared timestamp-based filtering with column-value-based filtering. However, I strongly believe that timestamp-based filtering will be the winner because it can skip blocks. Regarding the concern that my query took 2 minutes, one reason is that the hardware configuration is way below par, so I don't really expect blazing-fast performance on this cluster. If you have a really well-tuned HBase, your performance can easily improve by 3-4x (the query will be done in 20-30 seconds). But I don't think you can get blazing-fast results like the ones we get when scanning based on the row key.

Christian: In your timestamp based filtering, do you check the timestamp as part of the row key or do you use the put timestamp (as I do)?

Anil: I use the timestamp via Scan.setTimeRange(long, long). In my use case I am not using the row key at all. So it is roughly a full table scan, but the timestamp does all the magic. It's a definite advantage if you can use the row key in your query.

Christian: Is it a full table scan where each row's key is checked against a given timestamp/timerange?

Anil: Essentially it's a full table scan, since I am not using any row key or other filters.

Christian: How many rows are scanned/touched at your timestamp based filtering?

Anil: I don't know how to get these stats. Can anyone enlighten me? I am also curious to know this stat.

I'll also try running the column-value-based filter so that we get some more insight into the best option available. Let me know your thoughts on my reply.

Thanks,
Anil Gupta
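As a quick consistency check on the figures quoted in this message (34 million rows, about 1.21 KB per row, ~40 GB per replica), here is a small Java sketch of the arithmetic. It assumes 1 KB = 1024 bytes and 1 GB = 1024^3 bytes and ignores storage overhead and compression; the class and method names are hypothetical.

```java
public class TableSizeCheck {

    /** Estimated raw data size in GiB for a row count and an average row size in KiB. */
    static double estimatedGiB(long rows, double avgRowKiB) {
        return rows * avgRowKiB / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        double gib = estimatedGiB(34_000_000L, 1.21);
        // The estimate lands a little above 39 GiB, in line with the "~40 GB" quoted.
        System.out.printf("estimated size: %.1f GiB%n", gib);
    }
}
```

With decimal units instead (1 KB = 1000 bytes), the same product is about 41 GB, so the quoted replica size is plausible under either convention.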