HBase >> mail # user >> Can I specify the range inside of fuzzy rule in FuzzyRowFilter?


+
Alex Baranau 2012-08-17, 20:42
+
anil gupta 2012-08-17, 21:34
+
Michael Segel 2012-08-18, 10:56
+
Alex Baranau 2012-08-18, 19:13
+
anil gupta 2012-08-18, 21:02
+
Alex Baranau 2012-08-20, 20:07
Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?
Hi Alex,

Thanks for creating the JIRA.
On Monday, I completed testing the time range filtering using timestamps,
and IMO the results seem satisfactory (if not great). The table has 34
million records (average row size is 1.21 KB), and in 136 seconds I get the
entire result of a query that returned 225 rows.
I am running HBase 0.92 on an 8-node cluster on a VMware hypervisor. Each node
has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my setup
hosts 2 slave instances (2 VMs running DataNode,
NodeManager, and RegionServer). I have allocated only 1200 MB for the RSs. I
haven't modified the block size of HDFS or HBase. Considering the
below-par hardware configuration of the cluster, does the performance sound OK
for timestamp filtering?
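
For readers following the thread, here is a minimal sketch of the bounds arithmetic behind this kind of timestamp filtering. It assumes the record time is a yyyyMMdd day (as in Alex's row key example quoted below) interpreted in UTC, and it mirrors the half-open [min, max) convention that HBase's Scan.setTimeRange uses. The class and method names are hypothetical, and nothing here calls HBase itself.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: turns yyyyMMdd record days into epoch-millisecond
// timestamps and into a half-open [min, max) range for a time range scan.
class RecordTimestamps {
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

    // Epoch millis at 00:00 UTC of the given yyyyMMdd day (UTC is an assumption).
    static long startOfDayMillis(String yyyymmdd) {
        return LocalDate.parse(yyyymmdd, DAY)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    // Range covering firstDay..lastDay inclusive: min is inclusive, max is exclusive.
    static long[] timeRange(String firstDay, String lastDay) {
        long min = startOfDayMillis(firstDay);
        long max = LocalDate.parse(lastDay, DAY)
                .plusDays(1)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
        return new long[] {min, max};
    }

    // True when a cell timestamp falls inside the half-open range.
    static boolean inRange(long cellTimestamp, long[] range) {
        return cellTimestamp >= range[0] && cellTimestamp < range[1];
    }

    public static void main(String[] args) {
        long[] range = timeRange("20120803", "20120805");
        System.out.println("min=" + range[0] + " max=" + range[1]);
        // A cell written with the record's own day as its timestamp matches.
        System.out.println(inRange(startOfDayMillis("20120804"), range));
        // A cell written with timestamp 0 can never match a recent range.
        System.out.println(inRange(0L, range));
    }
}
```

The last check illustrates why writing every cell with timestamp 0 rules out this approach: the time range can only select rows when the cell timestamp carries the record's time.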

Thanks,
Anil

On Mon, Aug 20, 2012 at 1:07 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> Created: https://issues.apache.org/jira/browse/HBASE-6618
>
> Alex Baranau
> ------
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> On Sat, Aug 18, 2012 at 5:02 PM, anil gupta <[EMAIL PROTECTED]> wrote:
>
> > Hi Alex,
> >
> > Apart from the query which I mentioned in my last email, I have so far
> > implemented the following queries using filters and coprocessors:
> >
> > 1. Getting all the records for a customer.
> > 2. Perform min, max, avg, and sum aggregation for a customer using
> > coprocessors. I am also storing some of the data as BigDecimal to do
> > accurate floating point calculations.
> > 3. Perform min, max, avg, and sum aggregation for a customer within a given
> > time-range using coprocessors.
> > 4. Filter the data for a customer within a given time-range on the basis
> > of column values. The filtering on column values can match a string
> > value or do a range-based numerical comparison.
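
As an illustration of the BigDecimal point in item 2, the following is a small, self-contained sketch of min, max, sum, and avg over BigDecimal values. It runs as plain client-side Java, not as an HBase coprocessor, and the class name, the scale of 4, and the HALF_UP rounding are assumptions chosen for the example.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Arrays;
import java.util.List;

// Hypothetical aggregate holder; not part of HBase or of the poster's code.
class BigDecimalAggregates {
    final BigDecimal min;
    final BigDecimal max;
    final BigDecimal sum;
    final BigDecimal avg;

    BigDecimalAggregates(List<BigDecimal> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("no values to aggregate");
        }
        BigDecimal lo = values.get(0);
        BigDecimal hi = values.get(0);
        BigDecimal total = BigDecimal.ZERO;
        for (BigDecimal v : values) {
            if (v.compareTo(lo) < 0) lo = v;
            if (v.compareTo(hi) > 0) hi = v;
            total = total.add(v);
        }
        min = lo;
        max = hi;
        sum = total;
        // Division needs an explicit scale and rounding mode, since the exact
        // quotient may not have a finite decimal expansion.
        avg = total.divide(BigDecimal.valueOf(values.size()), 4, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        List<BigDecimal> values = Arrays.asList(
                new BigDecimal("0.10"), new BigDecimal("0.20"), new BigDecimal("0.30"));
        BigDecimalAggregates agg = new BigDecimalAggregates(values);
        System.out.println("BigDecimal sum: " + agg.sum + ", avg: " + agg.avg);
        // The same sum with binary doubles picks up a rounding error.
        System.out.println("double sum: " + (0.1 + 0.2 + 0.3));
    }
}
```

Constructing the values from strings rather than from doubles is what keeps the decimal digits exact.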
> >
> > Basically, as per our current requirements, all the queries have a
> > customerid and most of the queries also have a timerange. We are not in
> > prod yet. All of this effort is part of a POC.
> >
> > 2. Can you set timestamp on Puts the same as timestamp "assigned" to your
> > record by app logic?
> > Anil: Wow! This sounds like an awesome idea. Actually, my data is
> > immutable, so at present I was putting 0 as the timestamp for all the
> > data. I will definitely try this. Currently, I run the bulk loader to load
> > the data, so I think it's going to be a small change.
> >
> > Yes, I would love to try developing a range-based FuzzyRowFilter from my
> > side. However, first I am going to try putting in the timestamp.
> >
> > Thanks for a very helpful discussion. Let me know when you create the JIRA
> > for the range-based FuzzyRowFilter.
> >
> > Thanks,
> > Anil Gupta
> >
> > On Sat, Aug 18, 2012 at 12:13 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> >
> > > @Michael,
> > >
> > > This is not a simple partial key scan. Take this example of rows:
> > >
> > > aaaaa_100001_20120801
> > > aaaaa_100001_20120802
> > > aaaaa_100001_20120802
> > > aaaaa_100001_20120803
> > > aaaaa_100001_20120804
> > > aaaaa_100001_20120805
> > > aaaaa_100002_20120801
> > > aaaaa_100002_20120802
> > > aaaaa_100002_20120802
> > > aaaaa_100002_20120803
> > > aaaaa_100002_20120804
> > > aaaaa_100002_20120805
> > >
> > > where aaaaa is userId, 10000x is actionId and 201208xx is a timestamp. If
> > > the query is to select actions in the range 20120803-20120805 (in this
> > > case the last 3 days), then when the scan encounters the row:
> > >
> > > aaaaa_100001_20120801
> > >
> > > it "knows" it can fast forward scanning to "aaaaa_100001_20120803" and
> > > skip some records (in practice, this may mean skipping really a LOT of
> > > records).
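
The fast-forward idea described above can be sketched with a sorted in-memory set standing in for the table. This is only a simulation under stated assumptions: row keys have the form userId_actionId_yyyyMMdd, the class and method names are hypothetical, and the code is not the FuzzyRowFilter implementation or the change proposed in the JIRA. In HBase itself, a filter would express the same jump by returning a seek hint to the scanner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical simulation of a scan that seeks past rows outside a date range.
class SkipScanSketch {
    int rowsExamined = 0;

    // Sample rows from the example above (a sorted set holds each key once).
    static NavigableSet<String> sampleRows() {
        return new TreeSet<>(Arrays.asList(
                "aaaaa_100001_20120801", "aaaaa_100001_20120802", "aaaaa_100001_20120803",
                "aaaaa_100001_20120804", "aaaaa_100001_20120805",
                "aaaaa_100002_20120801", "aaaaa_100002_20120802", "aaaaa_100002_20120803",
                "aaaaa_100002_20120804", "aaaaa_100002_20120805"));
    }

    List<String> scan(NavigableSet<String> rows, String startDay, String endDay) {
        List<String> matches = new ArrayList<>();
        String row = rows.isEmpty() ? null : rows.first();
        while (row != null) {
            rowsExamined++;
            int sep = row.lastIndexOf('_');
            String prefix = row.substring(0, sep + 1); // "userId_actionId_"
            String day = row.substring(sep + 1);       // "yyyyMMdd" sorts as text
            if (day.compareTo(startDay) < 0) {
                // Too early: jump straight to the first possible match for this prefix.
                row = rows.ceiling(prefix + startDay);
            } else if (day.compareTo(endDay) > 0) {
                // Too late: jump past every remaining row with this prefix.
                row = rows.higher(prefix + '\uffff');
            } else {
                matches.add(row);
                row = rows.higher(row);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        NavigableSet<String> rows = sampleRows();
        SkipScanSketch sketch = new SkipScanSketch();
        List<String> matches = sketch.scan(rows, "20120803", "20120805");
        System.out.println(matches.size() + " matches, examined "
                + sketch.rowsExamined + " of " + rows.size() + " rows");
    }
}
```

On this tiny sample the scan examines 8 of the 10 distinct rows; the saving grows with the number of out-of-range rows stored under each userId and actionId pair.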
> > >
> > >
> > > @Anil,
> > >
> > > > Sample Query: I want to get all the events which happened in the last
> > > > month.
> > >
> > > 1. What other queries do you do? Just trying to understand why this row
> > > key format was chosen.
> > >
> > > 2. Can you set timestamp on Puts the same as timestamp "assigned" to your
> > > record by app logic? If you can, then this is the first thing to try

Thanks & Regards,
Anil Gupta
+
Alex Baranau 2012-08-22, 22:41