HBase, mail # user - How to query by rowKey-infix


Re: How to query by rowKey-infix
anil gupta 2012-08-24, 07:53
Christian: I'm slightly shocked about the processing time of more than 2
mins to return 225 rows. I would actually need a response in 5-10 sec.
Anil: I started getting the response within 1-2 sec of firing the query, but
I only had all 225 results after 2 mins. My table had 34 million rows,
and every row had 25 columns on average. The average size of each
row is around 1.21 KB. The size of one replica is ~40 GB in HDFS.
I haven't compared timestamp-based filtering with column-value-based
filtering. However, I strongly believe that timestamp-based filtering
will be the winner because it can skip blocks.
Regarding the concern that my query took 2 min, one of the reasons is that
the hardware configuration is way below par, so I don't really look for
blazing fast performance on this cluster. If you have a really well-tuned
HBase, your performance can easily improve by 3-4x (the query will be done
in 20-30 seconds). But I don't think you can get blazing fast results like
the ones we get when we scan based on the row key.
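
As a rough sanity check on the figures above, here is a minimal sketch that only redoes the arithmetic from this thread (34 million rows, 1.21 KB per row, 136 seconds, 8 nodes). It assumes 1 KB = 1024 bytes, and it treats the query as if every row were read, which is an upper bound because a time-range scan may skip whole store files. The function names are made up for illustration.

```python
# Figures quoted in this thread.
ROWS = 34_000_000
AVG_ROW_KB = 1.21
TOTAL_SECONDS = 136
NODES = 8


def table_size_gib(rows, avg_row_kb):
    """Estimated size of one replica in GiB, assuming 1 KB = 1024 bytes."""
    return rows * avg_row_kb / (1024 * 1024)


def full_scan_rates(rows, avg_row_kb, seconds, nodes):
    """Rates implied if every row were read during the query (upper bound).

    Returns (rows per second, MiB per second, MiB per second per node).
    """
    rows_per_s = rows / seconds
    mib_per_s = rows * avg_row_kb / 1024 / seconds
    return rows_per_s, mib_per_s, mib_per_s / nodes


size_gib = table_size_gib(ROWS, AVG_ROW_KB)
rows_per_s, mib_per_s, mib_per_s_node = full_scan_rates(
    ROWS, AVG_ROW_KB, TOTAL_SECONDS, NODES
)
print(f"one replica: {size_gib:.1f} GiB")
print(f"full-scan upper bound: {rows_per_s:,.0f} rows/s, "
      f"{mib_per_s:.0f} MiB/s total, {mib_per_s_node:.0f} MiB/s per node")
```

The estimated replica size comes out at about 39 GiB, which agrees with the ~40 GB reported for HDFS, so the row count and average row size are at least consistent with each other.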

Christian: In your timestamp-based filtering, do you check the timestamp
as part of the row key or do you use the put timestamp (as I do)?
Anil: I use the put timestamp via Scan.setTimeRange(long, long). In my use
case I am not using the row key at all. So it is roughly a full table scan,
but the timestamp is doing all the magic. It's a definite advantage if you
can use the row key in your query.
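
To illustrate why a time range can help even without a row key, here is a toy model in Python. It is not HBase code: StoreFile and time_range_scan are hypothetical names, and the model only assumes that each store file knows the minimum and maximum timestamp of its cells, so a file whose range does not overlap the requested one can be skipped without reading it. The range is treated as half-open, [min_ts, max_ts), as in Scan.setTimeRange.

```python
from dataclasses import dataclass


@dataclass
class StoreFile:
    """Toy stand-in for a store file: a non-empty list of (row_key, timestamp)."""
    cells: list

    @property
    def min_ts(self):
        return min(ts for _, ts in self.cells)

    @property
    def max_ts(self):
        return max(ts for _, ts in self.cells)


def time_range_scan(store_files, min_ts, max_ts):
    """Scan all files for cells with min_ts <= timestamp < max_ts.

    Returns (matching row keys, files skipped, cells examined).
    """
    matches, skipped, examined = [], 0, 0
    for f in store_files:
        # The file's timestamp range does not overlap the query: skip it whole.
        if f.max_ts < min_ts or f.min_ts >= max_ts:
            skipped += 1
            continue
        # Overlapping file: every cell still has to be checked individually.
        for row, ts in f.cells:
            examined += 1
            if min_ts <= ts < max_ts:
                matches.append(row)
    return matches, skipped, examined


files = [
    StoreFile([("a", 10), ("b", 20)]),     # old data only
    StoreFile([("c", 100), ("d", 110)]),   # entirely inside the range
    StoreFile([("e", 105), ("f", 300)]),   # partially overlapping
]
matches, skipped, examined = time_range_scan(files, 100, 120)
print(matches, skipped, examined)
```

In this model the benefit depends entirely on how timestamps are distributed across files: if every file contains both old and new cells, nothing is skipped and the scan degenerates to a full read.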

Christian: Is it a full table scan where each row's key is checked against a
given timestamp/timerange?
Anil: Essentially it's a full table scan, since I am not using any row key or
other filters.

Christian: How many rows are scanned/touched by your timestamp-based
filtering?
Anil: I don't know how to get these stats. Can anyone enlighten me? I am
also curious to know this stat.

I'll try to run the column value based filter also so that we get some more
insights into the best option available. Let me know your thoughts on my
reply.

Thanks,
Anil Gupta
On Thu, Aug 23, 2012 at 1:41 AM, Christian Schäfer <[EMAIL PROTECTED]> wrote:

> Hi Anil,
>
> to restrict data to a certain time window I also set timerange for the
> scan.
>
>
>
> How many rows are scanned/touched by your timestamp-based filtering?
>
>
>
> My use case of obtaining data by substring comparator operates on the row
> key.
> It can't be replaced by setting the time range in my case, really.
>
> Btw. the scan is additionally restricted to a certain timerange to
> increase skipping of irrelevant files and thus improve performance.
>
>
> regards,
> Christian
>
>
>
> ----- Original Message -----
> From: anil gupta <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; Christian Schäfer <[EMAIL PROTECTED]>
> CC:
> Sent: 20:42 Wednesday, 22 August 2012
> Subject: Re: How to query by rowKey-infix
>
> Hi Christian,
>
> I had requirements similar to yours. So far I have used
> timestamps for filtering the data, and I would say the performance is
> satisfactory. Here are the results of timestamp-based filtering:
> the table has 34 million records (average row size is 1.21 KB), and in 136
> seconds I get the entire result of a query that returned 225 rows.
> I am running HBase 0.92 on an 8-node cluster on a VMware hypervisor. Each
> node has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my
> setup hosts 2 slave instances (2 VMs running DataNode,
> NodeManager, RegionServer). I have allocated only 1200 MB for the
> RegionServers. I haven't modified the block size of HDFS or HBase.
> Considering the below-par hardware configuration of the cluster, I feel the
> performance is OK, and IMO it'll be better than a substring comparator on
> column values, since with a substring comparator filter you are essentially
> doing a FULL TABLE scan, whereas in a time-range-based scan you can
> *skip store files*.
>
> On a side note, Alex created a JIRA for enhancing the current
> FuzzyRowFilter to do range based filtering also. Here is the link:
> https://issues.apache.org/jira/browse/HBASE-6618 . You are more than
Thanks & Regards,
Anil Gupta