Can I specify the range inside of fuzzy rule in FuzzyRowFilter?
Alex Baranau 2012-08-17, 20:42
There was a question [1] in a JIRA comment on https://issues.apache.org/jira/browse/HBASE-6509; it makes more sense to answer it here.

With the current FuzzyRowFilter, I believe the only way to approach the problem is to add one fuzzy rule per actionid to the filter, about 150 rules in total: ??????00200, ??????00201, ..., ??????00350.

As for the performance of this approach, I can say the following:
* There are two "checks" happening for each processed row key (i.e. those row keys we don't skip).
* The first one performs a simple check of whether the given row key satisfies the fuzzy rule, and also determines whether there is a next row key to advance to (if this one doesn't satisfy the rule). The check takes at most O(n), where n is the length of the fuzzy rule, i.e. it is done in one simple loop, which can be exited before all bytes are checked. For m rules this is O(m*n).
* The second one calculates the next row key to provide as a hint for fast-forwarding. We again check all rules and find the smallest hint. This is also done in one loop, i.e. O(m*n) here as well.

With 150 fuzzy rules of length 11, applying the filter is equivalent to a loop of simple checks through 150*11*2 = 3300 elements. This might look like a lot, but it can work quite fast. So I'd just try it.

As for an extension which would be more efficient, it makes sense to consider implementing it. Let me think more about it and get back to you with a JIRA issue :). But I'd suggest you try the existing FuzzyRowFilter first. The resulting performance numbers would give us some food for thought, or may even turn out to be acceptable for you (hopefully).

> Can i run this kind of filter on HBase0.92 without doing any significant update to the cluster

Until the next release, you'll have to use FuzzyRowFilter like any other custom filter. Just grab the patch from HBASE-6509 and copy the filter. No need to patch and rebuild HBase.

Alex Baranau
------
Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

[1] Anil Gupta added a comment - 18/Aug/12 04:37

Hi Alex,
I have a question related to this filter. I have a similar filtering requirement which would be an extension to FuzzyRowFilter. Suppose I have the following structure of row keys: userid_actionid, where userid is 6 digits and actionid is 5 digits. I would like to get all the rows with an actionid between 00200 and 00350. With the current FuzzyRowFilter I can search for all the rows with a particular actionid. Instead of searching for a particular actionid, I would like to search for a range of actionids. Does this use case sound like an extension to the current FuzzyRowFilter? Can I run this kind of filter on HBase 0.92 without doing any significant update to the cluster? If I develop this kind of filter, what is needed to run it on all the RSs?
Thanks,
Anil
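The per-row matching cost described in this message can be illustrated with a small toy model. This is only a sketch in plain Python, assuming string row keys and `?` as the "any character" marker used in the thread; the function names `satisfies` and `matches_any` are made up for illustration and this is not the FuzzyRowFilter implementation. Note that enumerating 00200 through 00350 inclusive yields 151 rules, which the thread rounds to 150.

```python
# Toy model of the per-row matching cost; not the HBase FuzzyRowFilter code.

def satisfies(row_key: str, rule: str) -> bool:
    """Return True if row_key matches rule, where '?' means "any character"."""
    if len(row_key) < len(rule):
        return False
    for key_char, rule_char in zip(row_key, rule):
        if rule_char != "?" and key_char != rule_char:
            return False  # exit early: at most n comparisons per rule
    return True


def matches_any(row_key: str, rules: list[str]) -> bool:
    """Check one row key against m rules: O(m*n) comparisons in the worst case."""
    return any(satisfies(row_key, rule) for rule in rules)


# One rule per actionid from 00200 to 00350 inclusive.
rules = ["?" * 6 + f"{action_id:05d}" for action_id in range(200, 351)]

print(len(rules))
print(matches_any("12345600275", rules))
print(matches_any("12345600351", rules))
```

The sketch covers only the first of the two checks (does the key satisfy a rule); computing the fast-forward hint is a separate pass over the same rules.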
Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?
anil gupta 2012-08-17, 21:34
Hi Alex,

Thanks for the answer. I have successfully compiled the FuzzyRowFilter class with HBase 0.92. To try out FuzzyRowFilter, I'll need to make some changes to my row key, so I'll get back to you with performance numbers after loading the data and trying out FuzzyRowFilter for a particular value.

The range example I gave in my original post is very small. In my real use case the range can run from 0 to 31536000. So, in my opinion, using the current FuzzyRowFilter might not be a good idea. I agree with you that an extension is the right way to solve this.

Here is my real use case: I have a table in which I store events from customers, using customerid+timestamp as the row key.

Sample query: I want to get all the events which happened in the last month.

Current possible solutions:
1. I can do this filtering by using a filter that checks the value of a "timestamp" column. I think this will be highly inefficient.
2. The other possible way, I think, is to use RegexStringComparator with RowFilter to get all the rows within a certain numeric range of timestamps. In this case, too, every row key of the table will be checked.

So the best way is to use something like FuzzyRowFilter with a range. Also, my range will always be numerical, and this could be really handy for others who store a timestamp in the row key and want to do time-based queries using the row key.

Thanks & Regards,
Anil Gupta
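To make the "FuzzyRowFilter with range" idea concrete: instead of enumerating one rule per value, a single check could compare the numeric part of the key against a lower and an upper bound. The sketch below is purely hypothetical, assuming a fixed-length prefix followed by a zero-padded decimal number; the function name `in_suffix_range` and the key layout are assumptions for illustration, not an existing HBase filter.

```python
# Hypothetical range-aware check; not the HBASE-6509 filter and not an HBase API.

def in_suffix_range(row_key: str, prefix_len: int, low: int, high: int) -> bool:
    """Treat everything after the first prefix_len characters as a zero-padded
    decimal number and test low <= number <= high."""
    suffix = row_key[prefix_len:]
    if not (suffix.isascii() and suffix.isdigit()):
        return False  # empty or non-numeric suffix never matches
    return low <= int(suffix) <= high


# Assumed layout: 6-character customer id followed by an 8-digit offset in seconds.
last_30_days = (31_536_000 - 30 * 24 * 3600, 31_536_000)
print(in_suffix_range("cust01" + "30000000", 6, *last_30_days))
print(in_suffix_range("cust01" + "00000200", 6, *last_30_days))
```

One comparison per row replaces millions of enumerated rules, but a check like this still visits every row key; to be efficient, a real filter would also need to produce a next-row hint so the scan can skip ahead.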
Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?
Michael Segel 2012-08-18, 10:56
What row keys are you skipping?

Using your example...
You have a start row of 00000000200, and an end key of xFFxFFxFFxFFxFFxFF00350.
Note that you could also write that end key as xFF(1..6) 01, since it looks like you're trying to match the 00 in positions 7 and 8 of your numeric string.

I'm assuming that when you say ? you mean that you expect to have a character in that spot, and that your row key is exactly 11 characters in length.

While you may not return all the rows in that range, you do still have to check the row key, unless I am missing something.

So what am I missing?
Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?
Alex Baranau 2012-08-18, 19:13
@Michael,

This is not a simple partial key scan. Take this example of rows:

aaaaa_100001_20120801
aaaaa_100001_20120802
aaaaa_100001_20120802
aaaaa_100001_20120803
aaaaa_100001_20120804
aaaaa_100001_20120805
aaaaa_100002_20120801
aaaaa_100002_20120802
aaaaa_100002_20120802
aaaaa_100002_20120803
aaaaa_100002_20120804
aaaaa_100002_20120805

where aaaaa is the userId, 10000x is the actionId and 201208xx is a timestamp. If the query is to select actions in the range 20120803-20120805 (in this case the last 3 days), then when the scan encounters the row

aaaaa_100001_20120801

it "knows" it can fast-forward to "aaaaa_100001_20120803" and skip some records (in practice, this may mean skipping really a LOT of records).

@Anil,

> Sample Query: I want to get all the event which happened in last month.

1. What other queries do you do? Just trying to understand why this row key format was chosen.

2. Can you set the timestamp on Puts to be the same as the timestamp "assigned" to your record by app logic? If you can, then this is the first thing to try: perform the scan with the help of scan.setTimeRange(startTs, stopTs). Depending on how you write the data, this may help a lot with the speed of reading by ts, because that way whole HFiles may be skipped from reading based on ts. I don't know enough about your data to judge, but:
* if you have relatively few users, most of whom have a long history of interaction with your system (i.e. there are a lot of records for a specific "userX_actionY"), and
* if you write data with monotonically increasing timestamps, and
* if your regions are not too big,
then this might help you, as it increases the chance that some of the HFiles contain data *all of which* falls outside the time interval you select by. Otherwise, if written data items with different timestamps are well spread across the HFiles, the chance that some HFiles are skipped from reading is very small. I believe Lars George has illustrated this in one of his presentations, but I couldn't find it quickly.

> something like FuzzyRowFilter with range

Yes, something like this looks like it would be very valuable. It would be interesting to implement too. Let's see if I can find the time for it in my work plan. If you want to try it yourself, go for it! Let me know if you need help in that case ;)

Alex Baranau
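The fast-forwarding described for the example rows can be simulated over a sorted list of keys. This is a toy sketch, assuming the userId_actionId_yyyymmdd layout from the example and using Python's `bisect` to stand in for a seek; the function name `scan_with_hints` is invented for illustration and none of this is HBase code.

```python
import bisect

# Toy model of a sorted scan with seek hints; not HBase code.

def scan_with_hints(rows: list[str], start_day: str, end_day: str) -> tuple[list[str], int]:
    """Return (matching rows, number of row keys actually examined)."""
    rows = sorted(rows)
    matched, examined, i = [], 0, 0
    while i < len(rows):
        row = rows[i]
        examined += 1
        prefix, day = row.rsplit("_", 1)
        if day < start_day:
            # Hint: seek to the first possible match under the same prefix.
            i = bisect.bisect_left(rows, f"{prefix}_{start_day}", i + 1)
        elif day > end_day:
            # Nothing more under this prefix; "`" sorts right after "_",
            # so this seeks past every remaining key of the prefix.
            i = bisect.bisect_left(rows, prefix + "`", i + 1)
        else:
            matched.append(row)
            i += 1
    return matched, examined


rows = [f"aaaaa_{action}_201208{day:02d}"
        for action in ("100001", "100002") for day in range(1, 6)]
matched, examined = scan_with_hints(rows, "20120803", "20120805")
print(len(matched), examined, len(rows))
```

With only five days per action the saving is small, but the number of examined keys grows with the number of userId_actionId prefixes rather than with the number of rows, which is where the "LOT of records" skipped comes from.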
Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?
anil gupta 2012-08-18, 21:02
Hi Alex,
Apart from the query which I mentioned in my last email, I have so far implemented the following queries using filters and coprocessors:

1. Getting all the records for a customer.
2. Performing min, max, avg and sum aggregation for a customer using coprocessors. I am also storing some of the data as BigDecimal to do accurate floating-point calculations.
3. Performing min, max, avg and sum aggregation for a customer within a given time range using coprocessors.
4. Filtering the data for a customer within a given time range on the basis of column values. The filtering on column values can be matching a string value, or it can be a range-based numerical comparison.
Basically, as per our current requirement all the queries have customerid and most of the queries have timerange also. We are not in prod yet. All of this effort is part of a POC.
> 2. Can you set timestamp on Puts the same as timestamp "assigned" to your record by app logic?

Anil: Wow! This sounds like an awesome idea. Actually, my data is immutable, so at present I was putting 0 as the timestamp for all the data. I will definitely try this. Currently I run the bulk loader to load the data, so I think it's going to be a small change.

Yes, I would love to try developing a range-based FuzzyRowFilter myself. However, first I am going to try putting in the timestamp.
Thanks for a very helpful discussion. Let me know when you create the JIRA for range-based FuzzyRowFilter.
Thanks, Anil Gupta
Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?
Alex Baranau 2012-08-20, 20:07
Created: https://issues.apache.org/jira/browse/HBASE-6618

Alex Baranau
Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?
anil gupta 2012-08-22, 06:18
Hi Alex,

Thanks for creating the JIRA.

On Monday I completed testing the time range filtering using timestamps, and IMO the results seem satisfactory (if not great). The table has 34 million records (average row size is 1.21 KB), and I get the entire result of the query, 225 rows, in 136 seconds.

I am running HBase 0.92 on an 8-node cluster on a VMware hypervisor. Each node has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my setup hosts two slave instances (two VMs running DataNode, NodeManager and RegionServer). I have only allocated 1200 MB to the RSs. I haven't modified the block size of HDFS or HBase. Considering the below-par hardware configuration of the cluster, does the performance sound OK for timestamp filtering?

Thanks & Regards,
Anil Gupta
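For a rough sense of scale, the reported figures can be turned into an upper bound on scan throughput. This is only back-of-the-envelope arithmetic, assuming decimal units (1 GB = 1,000,000 KB) and assuming, pessimistically, that every row was read; the thread does not say how many HFiles were actually skipped, so the real rate may well have been lower.

```python
# Back-of-the-envelope figures from the numbers reported in the thread.
rows_total = 34_000_000   # records in the table
avg_row_kb = 1.21         # average row size in KB
elapsed_s = 136           # query wall-clock time in seconds

table_gb = rows_total * avg_row_kb / 1_000_000          # KB -> GB, decimal units
rows_per_s = rows_total / elapsed_s                     # if every row were examined
mb_per_s = rows_total * avg_row_kb / 1_000 / elapsed_s  # KB -> MB, per second

print(f"table size  ~ {table_gb:.1f} GB")
print(f"upper bound ~ {rows_per_s:,.0f} rows/s, {mb_per_s:.1f} MB/s")
```

That works out to roughly 41 GB of row data, and at most about 250,000 rows/s or about 300 MB/s across the 8 nodes if nothing was skipped.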
Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?
Alex Baranau 2012-08-22, 22:41
Anil,

It really depends on how many HFiles can be skipped entirely. In general, given that this is like a full-table scan with a filter, your time is good. Especially if it is satisfactory to you :).

Glad that the idea of setting the ts manually helped. This trick is overlooked too often :(

Alex Baranau
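Why the result "depends on how many HFiles can be skipped" can be shown with a small model. The sketch assumes each store file carries the minimum and maximum cell timestamp it contains, and that the scan's time range is half-open, [start, stop); the `StoreFile` class and its field names are invented for illustration and are not HBase's own types.

```python
from dataclasses import dataclass

# Toy model of time-range pruning; class and field names are illustrative only.

@dataclass
class StoreFile:
    name: str
    min_ts: int
    max_ts: int


def files_to_read(files: list[StoreFile], start_ts: int, stop_ts: int) -> list[str]:
    """Keep files whose [min_ts, max_ts] overlaps the half-open range [start_ts, stop_ts)."""
    return [f.name for f in files if f.max_ts >= start_ts and f.min_ts < stop_ts]


# Data written in timestamp order: each file covers a narrow slice of time.
ordered = [StoreFile("f1", 0, 99), StoreFile("f2", 100, 199), StoreFile("f3", 200, 299)]
# Data with timestamps spread across all files: every file covers everything.
spread = [StoreFile("g1", 0, 299), StoreFile("g2", 0, 299), StoreFile("g3", 0, 299)]

print(files_to_read(ordered, 200, 250))
print(files_to_read(spread, 200, 250))
```

In the ordered case only one of three files overlaps the query range; in the spread case all three must be read, which matches the earlier remark that well-mixed timestamps leave little chance of skipping files.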