HBase, mail # user - Using separator/delimiter in HBase rowkey?


Re: Using separator/delimiter in HBase rowkey?
Michael Segel 2013-07-08, 15:29
Is Murmur part of the standard Java libraries?

If not, you end up having to do a bit more maintenance of your cluster and that's going to be part of your tradeoff.

On Jul 8, 2013, at 10:14 AM, Mike Axiak <[EMAIL PROTECTED]> wrote:

> Hello Jason,
>
> Have you considered the following rowkey?
>
>  murmur_128(userId) + timestamp + userId ?
>
> This handles both of your cases as (1) murmur 128 is much faster than
> MD5 so will have very low overhead and (2) the userId at the end of
> the key will ensure that no murmur collisions will cause issues. This
> key also handles incrementing userIds well because close userIds will
> likely be in separate regions.
>
> Cheers,
> Mike
>
> On Mon, Jul 8, 2013 at 10:19 AM, Jason Huang <[EMAIL PROTECTED]> wrote:
>> Hello,
>>
>> I am trying to get some advice on pros/cons of using separator/delimiter as
>> part of HBase row key.
>>
>> Currently one of our user activity tables has a rowkey design of
>> "UserID^TimeStamp" with a separator of "^". (UserID is a string that won't
>> include '^').
>>
>> This is designed for the two common use cases in our system:
>> (1) If we come from a context where the UserID is known, we can do a scan
>> easily for all the user activities with a startRowKey and stopRowKey.
>> (2) If we come from an external networked table where the row key of this
>> user activity table is stored and can be retrieved as activityRowKey, then
>> we can use the following code to parse out the UserID and do the same scan
>> as in (1):
>>
>>    String activityRowKeyStr = Bytes.toString(activityRowKey);
>>    String userId = activityRowKeyStr.substring(0, activityRowKeyStr.indexOf("^"));
>>
>> Then I can set startRowKey and stopRowKey for the scan based on userId.
>> Here we get the benefit of having the User ID as part of the row key with the
>> separator (compared to another solution that stores the userID as one of
>> the columns in the user activity table).
>>
>> The reason I picked a separator after UserID is that sometimes we may not get
>> a fixed length string of the UserID value. At one point I actually thought
>> of using MD5 to hash the UserID and make it a fixed length; however, the
>> possibility of collisions and the possible overhead of applying the hash
>> function made me pick the separator "^".
>>
>> My question:
>> (1) My argument is that using a separator is better than using an MD5
>> hash value. Does that seem reasonable? Could you comment on other pros
>> and cons that I might have missed (as the basis for my argument)?
>>
>> (2) On using a separator/delimiter, besides the requirement that this
>> separator/delimiter shouldn't appear elsewhere in the rowkey, are there any
>> other requirements? Are there any special separators/delimiters that are
>> better/worse than the average ones?
>>
>> thanks!
>>
>> Jason
>
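
For reference, the two rowkey layouts discussed in this thread could be sketched roughly as below. This is only an illustration, not code from any poster: the class RowKeySketch and all of its method names are hypothetical, the timestamp is assumed to be an 8-byte long, and MD5 from java.security.MessageDigest stands in for the Murmur hash purely because it ships with the JDK and also produces 16 bytes. The sketch uses only the standard library and does not call any HBase API; the byte arrays it returns are what one would presumably hand to a scan as start and stop rows.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, written only to illustrate the two designs in the thread.
final class RowKeySketch {
    static final char SEP = '^';
    static final int HASH_LEN = 16; // MD5 (the stand-in hash) yields 16 bytes

    // Separator design "UserID^TimeStamp": the user ID is the text before the first '^'.
    static String userIdOf(String rowKey) {
        int i = rowKey.indexOf(SEP);
        if (i < 0) {
            throw new IllegalArgumentException("no separator in row key: " + rowKey);
        }
        return rowKey.substring(0, i);
    }

    // Inclusive start row for one user's activity: the bytes of "userId^".
    static byte[] startRow(String userId) {
        return (userId + SEP).getBytes(StandardCharsets.UTF_8);
    }

    // Exclusive stop row: the same prefix with its last byte incremented ('^' becomes '_').
    static byte[] stopRow(String userId) {
        byte[] stop = startRow(userId);
        stop[stop.length - 1]++;
        return stop;
    }

    // Hash-prefixed design: 16-byte hash(userId) + 8-byte timestamp + userId bytes.
    static byte[] hashedKey(String userId, long timestampMillis) {
        byte[] id = userId.getBytes(StandardCharsets.UTF_8);
        byte[] hash = md5(id);
        return ByteBuffer.allocate(hash.length + Long.BYTES + id.length)
                .put(hash)
                .putLong(timestampMillis)
                .put(id)
                .array();
    }

    // The two leading fields have fixed width, so the user ID is whatever follows them.
    static String userIdOfHashed(byte[] rowKey) {
        int offset = HASH_LEN + Long.BYTES;
        return new String(rowKey, offset, rowKey.length - offset, StandardCharsets.UTF_8);
    }

    // Stand-in for murmur_128; MD5 is used here only because it is available in the JDK.
    static byte[] md5(byte[] input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(userIdOf("alice^1373293140000"));
        System.out.println(new String(stopRow("alice"), StandardCharsets.UTF_8));
        System.out.println(userIdOfHashed(hashedKey("alice", 1373293140000L)));
    }
}
```

In the separator layout the user ID is recovered by searching for '^', so the only requirement is that '^' never appears in a user ID. In the hash-prefixed layout no separator is needed, because the hash and timestamp have fixed widths and the variable-length user ID sits at the end of the key.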