HBase >> mail # user >> How to query by rowKey-infix


Christian Schäfer 2012-08-02, 12:23
Re: How to query by rowKey-infix
Hi,

What does your schema look like?

Would it make sense to change the key to user_id '|' timestamp and then use the session_id in the column name?
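
To make the proposed layout concrete, here is a minimal sketch of how such a key could be built. It is an illustration only and does not use the HBase client API. The function names, the 13-digit zero padding, and the UTF-8 encoding are assumptions that do not come from this thread. The padding matters because row keys sort as bytes, so a fixed-width timestamp keeps one user's rows in time order.

```python
TS_WIDTH = 13  # assumption: epoch milliseconds, 13 digits is enough until the year 2286


def make_row_key(user_id: str, ts_millis: int) -> bytes:
    """Build 'user_id|timestamp' with a zero-padded, fixed-width timestamp."""
    if "|" in user_id:
        raise ValueError("user_id must not contain the '|' delimiter")
    if not 0 <= ts_millis < 10 ** TS_WIDTH:
        raise ValueError("timestamp does not fit the fixed width")
    return f"{user_id}|{ts_millis:0{TS_WIDTH}d}".encode("utf-8")


def make_cell(session_id: str, value: bytes):
    """Return (column qualifier, value): the session_id becomes the column name."""
    return session_id.encode("utf-8"), value


def user_scan_range(user_id: str, start_millis: int, stop_millis: int):
    """Start row (inclusive) and stop row (exclusive) for one user and a time range."""
    return make_row_key(user_id, start_millis), make_row_key(user_id, stop_millis)
```

With this layout a scan for one user and a time range is a contiguous key range. A query by time range without a user_id still has no usable key prefix.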

On Aug 2, 2012, at 7:23 AM, Christian Schäfer <[EMAIL PROTECTED]> wrote:

> OK,
>
> at first I will try the scans.
>
> If that's too slow I will have to upgrade HBase (currently 0.90.4-cdh3u2) to be able to use coprocessors.
>
> Currently I'm stuck at the scans because they require two steps (and therefore some kind of filter chaining).
>
> The key:  userId-dateInMillis-sessionId
>
> First, I need to extract dateInMillis with a regex or substring (using special delimiters for the date).
>
> Second, the extracted value must be parsed to a Long and set on a RowFilter comparator like this:
>
> ----- Original Message -----
> From: Michael Segel <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> CC:
> Sent: 13:52 Wednesday, 1 August 2012
> Subject: Re: How to query by rowKey-infix
>
> Actually, with coprocessors you can create a secondary index in short order.
> Then your cost is going to be 2 fetches. Trying to do a partial table scan will be more expensive.
>
> On Jul 31, 2012, at 12:41 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:
>
>> When deciding between a table scan vs secondary index, you should try to
>> estimate what percent of the underlying data blocks will be used in the
>> query.  By default, each block is 64KB.
>>
>> If each user's data is small and you are fitting multiple users per block,
>> then you're going to need all the blocks, so a tablescan is better because
>> it's simpler.  If each user has 1MB+ data then you will want to pick out
>> the individual blocks relevant to each date.  The secondary index will help
>> you go directly to those sparse blocks, but with a cost in complexity,
>> consistency, and extra denormalized data that knocks primary data out of
>> your block cache.
>>
>> If latency is not a concern, I would start with the table scan.  If that's
>> too slow you add the secondary index, and if you still need it faster you
>> do the primary key lookups in parallel as Jerry mentions.
>>
>> Matt
>>
>> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Chris:
>>>
>>> I'm thinking about building a secondary index for primary key lookup, then
>>> query using the primary keys in parallel.
>>>
>>> I'm interested to see if there are other options too.
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hello there,
>>>>
>>>> I designed a row key for queries that need best performance (~100 ms)
>>>> which looks like this:
>>>>
>>>> userId-date-sessionId
>>>>
>>>> These queries (scans) are always based on a userId and sometimes
>>>> additionally on a date, too.
>>>> That's no problem with the key above.
>>>>
>>>> However, another kind of query is based on a given time range
>>>> where the leftmost userId is not given or known.
>>>> In this case I need to get all rows covering the given time range with
>>>> their date to create a daily report.
>>>>
>>>> As I can't set wildcards at the beginning of a left-based index for the
>>>> scan, I only see the possibility to scan the index of the whole table to
>>>> collect the rowKeys that are inside the time range I'm interested in.
>>>>
>>>> Is there a more elegant way to collect rows within time range X?
>>>> (Unfortunately, the date attribute is not equal to the timestamp that is
>>>> stored by hbase automatically.)
>>>>
>>>> Could/should one maybe leverage some kind of row key caching to
>>>> accelerate the collection process?
>>>> Is that covered by the block cache?
>>>>
>>>> Thanks in advance for any advice.
>>>>
>>>> regards
>>>> Chris
>>>>
>>>
>
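
The two-step filtering described in the quoted message (extract the date infix from the row key, then compare it as a number against the time range) can be sketched as follows. This is not the HBase filter API. It is a plain in-memory illustration over a list of keys, assuming the userId-date-sessionId layout with the date stored as milliseconds, a userId that contains no '-', and hypothetical function names.

```python
import re

# Assumed layout: userId-dateInMillis-sessionId, with no '-' inside userId.
KEY_PATTERN = re.compile(rb"^([^-]+)-(\d+)-(.+)$")


def extract_millis(row_key: bytes):
    """Step 1: pull the date infix out of the key; None if the key does not match."""
    match = KEY_PATTERN.match(row_key)
    return int(match.group(2)) if match else None


def rows_in_time_range(row_keys, start_millis: int, stop_millis: int):
    """Step 2: numeric comparison, keeping keys with start <= date < stop.

    Every key is inspected, which mirrors a full table scan with a row filter.
    """
    for key in row_keys:
        millis = extract_millis(key)
        if millis is not None and start_millis <= millis < stop_millis:
            yield key
```

Because the userId prefix is unknown, nothing in the key narrows the range, so every key has to be read. That cost is what the secondary-index suggestions in the thread try to avoid.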