HBase >> mail # user >> Understanding scan behaviour


Re: Understanding scan behaviour
Mohit,
Are you trying to reduce the amount of data you scan and bring down
your query time when:
- you have a multi-part row key made up of a string and a time value, and
- you know the prefix of the string and a range for the time value?
That's possible (but not easy) to do with HBase using the filter's
ability to return a seek hint to jump to the next set of contiguous
rows. If the cardinality of your string value isn't too large, this
approach can make a pretty dramatic performance improvement.
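
To make the seek-hint idea concrete, here is a minimal simulation in plain Python. It is not HBase or Phoenix code: a sorted list of (id, timestamp) tuples stands in for the table, each bisect call stands in for one seek hint, and the function name skip_scan and the integer timestamps are assumptions made for illustration only.

```python
from bisect import bisect_left


def skip_scan(sorted_keys, id_prefix, t_start, t_end):
    """Return (matches, seeks) for rows whose id starts with id_prefix
    and whose timestamp satisfies t_start < timestamp < t_end.

    sorted_keys is a sorted list of (id, timestamp) tuples with integer
    timestamps. Every bisect call below plays the role of a seek hint.
    """
    matches, seeks = [], 0

    # Seek to the first row whose id is >= the prefix.
    pos = bisect_left(sorted_keys, (id_prefix,))
    seeks += 1

    while pos < len(sorted_keys):
        row_id = sorted_keys[pos][0]
        if not row_id.startswith(id_prefix):
            break  # ids sharing a prefix are contiguous, so we are done

        # Seek straight to the first row of this id inside the time range.
        pos = bisect_left(sorted_keys, (row_id, t_start + 1), pos)
        seeks += 1
        while (pos < len(sorted_keys)
               and sorted_keys[pos][0] == row_id
               and sorted_keys[pos][1] < t_end):
            matches.append(sorted_keys[pos])
            pos += 1

        # Seek past the remaining rows of this id to the next distinct id.
        pos = bisect_left(sorted_keys, (row_id + "\x00",), pos)
        seeks += 1

    return matches, seeks


# Hypothetical data: three ids with 100 timestamps each, plus one stray row.
keys = sorted(
    [("abc1", t) for t in range(100)]
    + [("abc2", t) for t in range(100)]
    + [("abd", t) for t in range(100)]
    + [("abb", 5)]
)
matches, seeks = skip_scan(keys, "abc", 10, 13)
print(matches, seeks)
```

The number of seeks grows with the number of distinct ids that match the prefix, not with the number of rows, which is why the cardinality of the string part matters.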

You should take a look at Phoenix
(https://github.com/forcedotcom/phoenix), a SQL skin on top of HBase -
we just introduced the above optimization. You'd create your table like
this:

CREATE TABLE t1 (id VARCHAR not null, timestamp DATE not null CONSTRAINT
pk PRIMARY KEY (id, timestamp));

Then your query would look like this:

SELECT id, timestamp FROM t1 WHERE id LIKE 'abc%' AND timestamp > ? AND
timestamp < ?;

and you'd bind the ? using the regular JDBC PreparedStatement APIs.

Regards,
James
@JamesPlusPlus

On 03/28/2013 11:20 PM, ramkrishna vasudevan wrote:
> Mohit,
>
> It is always better to use a start row and an end row if you know
> what they are.
> Append one more byte to the actual (inclusive) end row to form the
> end key.  This will narrow down the search.
>
> Remember that HBase orders and scans rows by byte comparison.
> Regards
> Ram
>
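
A small sketch of the "one byte more" trick described above, assuming the end key of a scan is exclusive. It uses plain Python bytes, which compare byte by byte, and a sorted list in place of a table; exclusive_stop_row and range_scan are hypothetical helper names, not HBase API calls.

```python
from bisect import bisect_left


def exclusive_stop_row(inclusive_end_row: bytes) -> bytes:
    # Appending a zero byte gives the smallest key that sorts strictly
    # after inclusive_end_row, so the end row itself stays in range.
    return inclusive_end_row + b"\x00"


def range_scan(sorted_rows, start_row: bytes, stop_row: bytes):
    # Rows with start_row <= row < stop_row, found by binary search.
    lo = bisect_left(sorted_rows, start_row)
    hi = bisect_left(sorted_rows, stop_row)
    return sorted_rows[lo:hi]


rows = sorted([b"row1", b"row2", b"row3", b"row30"])

# Using the end row as-is drops it, because the stop key is exclusive.
without_extra_byte = range_scan(rows, b"row1", b"row3")

# With the extra byte, row3 is included and row30 is still excluded.
with_extra_byte = range_scan(rows, b"row1", exclusive_stop_row(b"row3"))
```
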
> On Fri, Mar 29, 2013 at 11:18 AM, Li, Min <[EMAIL PROTECTED]> wrote:
>
>> Hi, Mohit,
>>
>> Try using ENDROW. STARTROW together with ENDROW is much faster than PrefixFilter.
>>
>> "+" ascii code is 43
>> "," ascii code is 44
>>
>> scan 'SESSIONID_TIMELINE', {LIMIT => 1,STARTROW => '++++', ENDROW=>'+++,'}
>>
>> Min
>>
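
The ENDROW above is the prefix with its last byte bumped from "+" (43) to "," (44). A minimal sketch of that calculation for an arbitrary prefix, in plain Python; prefix_stop_row is a hypothetical helper, and returning an empty value to mean "no upper bound" is a convention of this sketch.

```python
def prefix_stop_row(prefix: bytes) -> bytes:
    """Smallest key that sorts after every key starting with prefix."""
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""  # empty or all-0xFF prefix: no upper bound exists
    # Increment the last remaining byte.
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def in_range(row: bytes, start: bytes, stop: bytes) -> bool:
    # Start is inclusive, stop is exclusive; bytes compare byte by byte.
    return start <= row < stop


start = b"++++"
stop = prefix_stop_row(start)
```

Every key that begins with the prefix falls inside [start, stop), and no other key does, so the range does the same job as the prefix filter without reading rows outside it.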
>> -----Original Message-----
>> From: Mohit Anchlia [mailto:[EMAIL PROTECTED]]
>> Sent: Friday, March 29, 2013 1:18 AM
>> To: [EMAIL PROTECTED]
>> Subject: Re: Understanding scan behaviour
>>
>> Could the prefix filter lead to a full table scan? In other words, is
>> PrefixFilter applied after fetching the rows?
>>
>> Another question: say I have row keys abc and abd and I search for
>> row "abc". Is it always guaranteed to be the first key returned in the
>> scan results? If so, I can always put a condition in the client app.
>>
>> On Thu, Mar 28, 2013 at 9:15 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
>>
>>> Take a look at the following in
>>> hbase-server/src/main/ruby/shell/commands/scan.rb
>>> (trunk)
>>>
>>>    hbase> scan 't1', {FILTER => "(PrefixFilter ('row2') AND
>>>      (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter ( 123,
>>> 456))"}
>>>
>>> Cheers
>>>
>>> On Thu, Mar 28, 2013 at 9:02 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>>>> I see, then I misunderstood the behaviour. My keys are id + timestamp so
>>>> that I can do a range type search. So what I really want is to return a
>>>> row where id matches the prefix. Is there a way to do this without
>>>> having to scan large amounts of data?
>>>>
>>>>
>>>>
>>>> On Thu, Mar 28, 2013 at 8:26 AM, Jean-Marc Spaggiari <
>>>> [EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi Mohit,
>>>>>
>>>>> "+" ascii code is 43
>>>>> "9" ascii code is 57.
>>>>>
>>>>> So "+9" sorts after "++". If you don't have any row with the exact
>>>>> key "+++++", HBase will look for the first one after it. And in
>>>>> your case, it's +9hC\xFC\x82s\xABL3\xB3B\xC0\xF9\x87\x03\x7F\xFF\xF.
>>>>>
>>>>> JM
>>>>>
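
The behaviour described above can be reproduced with plain Python bytes, which compare byte by byte. In this sketch a sorted list stands in for the table, first_row_from is a hypothetical helper, and the first row is a shortened stand-in for the key quoted above.

```python
from bisect import bisect_left


def first_row_from(sorted_rows, start_row: bytes):
    # A scan with only a start row returns the first row that sorts at
    # or after start_row, whether or not it shares the prefix.
    i = bisect_left(sorted_rows, start_row)
    return sorted_rows[i] if i < len(sorted_rows) else None


# Hypothetical table contents; no row starts with "+++++".
rows = sorted([b"+9hC\xfc\x82s", b"-abc", b"zzz"])

start = b"+++++"
hit = first_row_from(rows, start)

# "+" is 43 and "9" is 57, so "+9..." sorts after "+++++".
plus, nine = ord("+"), ord("9")

# The caller has to check the prefix itself (or set an end row).
is_real_match = hit is not None and hit.startswith(start)
```
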
>>>>> 2013/3/28 Mohit Anchlia <[EMAIL PROTECTED]>:
>>>>>> My understanding is that the row key would start with +++++, for
>>>>>> instance.
>>>>>> On Thu, Mar 28, 2013 at 7:53 AM, Jean-Marc Spaggiari <
>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> Hi Mohit,
>>>>>>>
>>>>>>> I see nothing wrong with the results below. What would you have
>>>>>>> expected?
>>>>>>> JM
>>>>>>>
>>>>>>> 2013/3/28 Mohit Anchlia <[EMAIL PROTECTED]>:
>>>>>>>> I am running version 92.1 and this is what happens.
>>>>>>>>
>>>>>>>> hbase(main):003:0> scan 'SESSIONID_TIMELINE', {LIMIT => 1,