HBase >> mail # user >> Can I specify the range inside of fuzzy rule in FuzzyRowFilter?


Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?
Hi Alex,

Apart from the query I mentioned in my last email, so far I have
implemented the following queries using filters and coprocessors:

1. Getting all the records for a customer.
2. Perform min, max, avg, and sum aggregations for a customer using
coprocessors. I also store some of the data as BigDecimal so that the
decimal calculations stay accurate.
3. Perform min, max, avg, and sum aggregations for a customer within a given
time range using coprocessors.
4. Filter the data for a customer within a given time range on the basis
of column values. The column-value filter can be a string match or a
range-based numerical comparison.

Basically, as per our current requirements, all the queries have a customerid
and most of them have a time range as well. We are not in prod yet; all of
this effort is part of a POC.

2. Can you set timestamp on Puts the same as timestamp "assigned" to your
record by app logic?
Anil: Wow! This sounds like an awesome idea. Actually, my data is
immutable, so at present I was putting 0 as the timestamp for all the
data. I will definitely try this. Currently, I run the bulk loader to load
the data, so I think it's going to be a small change.
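
To make the time-range idea concrete, here is a minimal sketch in plain Java
(no HBase dependency). It only models the reasoning from the quoted message
below: if each store file knows the minimum and maximum timestamp of its
cells, a scan limited to a time range can skip any file whose bounds fall
entirely outside that range. The class and method names (TimeRangeSkipSketch,
FileModel, filesToRead) are hypothetical, the timestamps are written as
yyyyMMdd numbers purely for readability, and the half-open range
[startTs, stopTs) is assumed to match scan.setTimeRange. This is not how
HBase itself is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TimeRangeSkipSketch {

    /** Hypothetical stand-in for an HFile: only the timestamp bounds of its cells. */
    public static final class FileModel {
        final String name;
        final long minTs;
        final long maxTs;

        public FileModel(String name, long minTs, long maxTs) {
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }

        /** True if some cell in this file could fall inside [startTs, stopTs). */
        boolean overlaps(long startTs, long stopTs) {
            return maxTs >= startTs && minTs < stopTs;
        }
    }

    /** Returns the names of the files a time-range scan would still have to open. */
    public static List<String> filesToRead(List<FileModel> files, long startTs, long stopTs) {
        List<String> toRead = new ArrayList<>();
        for (FileModel f : files) {
            if (f.overlaps(startTs, stopTs)) {
                toRead.add(f.name);
            }
        }
        return toRead;
    }

    public static void main(String[] args) {
        // Data written in timestamp order: each file covers its own slice of time.
        List<FileModel> ordered = Arrays.asList(
            new FileModel("f1", 20120701L, 20120715L),
            new FileModel("f2", 20120716L, 20120731L),
            new FileModel("f3", 20120801L, 20120815L));

        // Timestamps spread across all files: every file spans the whole period.
        List<FileModel> spread = Arrays.asList(
            new FileModel("f1", 20120701L, 20120815L),
            new FileModel("f2", 20120701L, 20120815L),
            new FileModel("f3", 20120701L, 20120815L));

        // Scan for the first half of August 2012.
        System.out.println("ordered: " + filesToRead(ordered, 20120801L, 20120816L));
        System.out.println("spread:  " + filesToRead(spread, 20120801L, 20120816L));
    }
}
```

In the ordered case only one of the three files overlaps the requested range;
in the spread case none can be skipped. With 0 as the timestamp on every
record, all files would look the same to such a check, which is why setting
the real record timestamp is a precondition for this kind of skipping.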

Yes, I would love to try developing a range-based FuzzyRowFilter from my
side. However, first I am going to try setting the timestamp.
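
As a rough illustration of what a range-aware fuzzy filter would buy, the
sketch below models the fast-forward behaviour from the quoted example in
plain Java. A sorted TreeSet stands in for the table, and ceiling()/higher()
stand in for a seek to the next candidate row. The class name
FastForwardScanSketch, the rowsExamined counter, and the assumption that keys
look like userId_actionId_yyyyMMdd with an 8-digit date are all mine, not
part of HBase or of any existing filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class FastForwardScanSketch {

    /** Number of rows the last scan actually looked at (for illustration only). */
    static int rowsExamined;

    /** Returns rows whose trailing date lies in [fromDate, toDate], seeking past the rest. */
    public static List<String> scan(TreeSet<String> rows, String fromDate, String toDate) {
        List<String> matches = new ArrayList<>();
        rowsExamined = 0;
        String current = rows.isEmpty() ? null : rows.first();
        while (current != null) {
            rowsExamined++;
            int split = current.lastIndexOf('_');
            String prefix = current.substring(0, split);  // userId_actionId
            String date = current.substring(split + 1);   // yyyyMMdd
            if (date.compareTo(fromDate) < 0) {
                // Before the range: jump straight to the first row at or after prefix_fromDate.
                current = rows.ceiling(prefix + "_" + fromDate);
            } else if (date.compareTo(toDate) > 0) {
                // Past the range: jump to the first row after this prefix
                // (assumes 8-digit dates, so 99999999 sorts after every real date).
                current = rows.higher(prefix + "_99999999");
            } else {
                matches.add(current);
                current = rows.higher(current);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeSet<String> rows = new TreeSet<>(Arrays.asList(
            "aaaaa_100001_20120801", "aaaaa_100001_20120802", "aaaaa_100001_20120803",
            "aaaaa_100001_20120804", "aaaaa_100001_20120805",
            "aaaaa_100002_20120801", "aaaaa_100002_20120802", "aaaaa_100002_20120803",
            "aaaaa_100002_20120804", "aaaaa_100002_20120805"));

        List<String> matches = scan(rows, "20120803", "20120805");
        System.out.println("matches: " + matches);
        System.out.println("rows examined: " + rowsExamined + " of " + rows.size());
    }
}
```

With only five days per action the saving is small, but the number of rows
skipped per seek grows with the amount of history stored before the requested
range for each userId_actionId prefix.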

Thanks for a very helpful discussion. Let me know when you create the JIRA
for range-based FuzzyRowFilter.

Thanks,
Anil Gupta

On Sat, Aug 18, 2012 at 12:13 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> @Michael,
>
> This is not a simple partial key scan. Take this example of rows:
>
> aaaaa_100001_20120801
> aaaaa_100001_20120802
> aaaaa_100001_20120802
> aaaaa_100001_20120803
> aaaaa_100001_20120804
> aaaaa_100001_20120805
> aaaaa_100002_20120801
> aaaaa_100002_20120802
> aaaaa_100002_20120802
> aaaaa_100002_20120803
> aaaaa_100002_20120804
> aaaaa_100002_20120805
>
> where aaaaa is userId, 10000x is actionId and 201208xx is a timestamp. If
> the query is to select actions in the range 20120803-20120805 (in this case
> the last 3 days), then when the scan encounters the row:
>
> aaaaa_100001_20120801
>
> it "knows" it can fast-forward scanning to "aaaaa_100001_20120803" and
> skip some records (in practice, this may mean skipping really a LOT of
> records).
>
>
> @Anil,
>
> > Sample Query: I want to get all the events which happened in the last month.
>
> 1. What other queries do you do? Just trying to understand why this row key
> format was chosen.
>
> 2. Can you set timestamp on Puts the same as timestamp "assigned" to your
> record by app logic? If you can, then this is the first thing to try:
> perform the scan with scan.setTimeRange(startTs, stopTs). Depending on how
> you write the data, this may help a lot with the speed of reading by ts,
> because whole HFiles can then be skipped based on ts. I don't know enough
> about your data to judge, but:
>   * if you don't have a lot of users, and most of them have a long history
> of interaction with your system (i.e. there are a lot of records for a
> specific "userX_actionY"), and
>   * if you write data with monotonically increasing timestamps, and
>   * if your regions are not too big,
> then this might help you, as it will increase the chance that some of the
> HFiles contain only data that doesn't fall into the time interval you
> select by. Otherwise, if data items with different timestamps are spread
> evenly across the HFiles, the chance that some HFiles are skipped is very
> small. I believe Lars George has illustrated this in one of his
> presentations, but I couldn't find it quickly.
>
> > something like FuzzyRowFilter with range
>
> Yes, something like this looks like it would be very valuable. It would be
> interesting to implement too. Let's see if I find the time for that in my
> work plan. If you want to try it yourself, go for it! Let me know if you
> need help in that case ;)
>
> Alex Baranau
> -
Thanks & Regards,
Anil Gupta