Re: Using separator/delimiter in HBase rowkey?
Michael Segel 2013-07-08, 15:29
Is murmur part of the standard Java libraries?
If not, you end up having to do a bit more maintenance of your cluster and that's going to be part of your tradeoff.
On Jul 8, 2013, at 10:14 AM, Mike Axiak <[EMAIL PROTECTED]> wrote:
> Hello Jason,
> Have you considered the following rowkey?
> murmur_128(userId) + timestamp + userId ?
> This handles both of your cases as (1) murmur 128 is much faster than
> md5 so will have very low overhead and (2) the userid at the end of
> the key will ensure that no murmur collisions will cause issues. This
> key also handles incrementing userIds well because close userIds will
> likely be in separate regions.
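
A minimal sketch of the key layout proposed above (128-bit hash of the userId, then the timestamp, then the raw userId). Murmur is not part of the JDK, so the hash is passed in as a parameter and MD5 from `java.security` is used here purely as a 16-byte stand-in; the class and method names (`HashedRowKey`, `buildRowKey`, `userIdOf`) are hypothetical and not from the thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Function;

class HashedRowKey {
    static final int HASH_LEN = 16; // 128 bits

    // Stand-in 128-bit hash (assumption): MD5 from the JDK.
    // A murmur 128 implementation from a third-party library could be passed instead.
    static byte[] md5(byte[] input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Layout: [16-byte hash(userId)][8-byte big-endian timestamp][userId bytes]
    static byte[] buildRowKey(String userId, long timestamp,
                              Function<byte[], byte[]> hash128) {
        byte[] id = userId.getBytes(StandardCharsets.UTF_8);
        byte[] prefix = hash128.apply(id);
        if (prefix.length != HASH_LEN) {
            throw new IllegalArgumentException("hash must be 16 bytes");
        }
        return ByteBuffer.allocate(HASH_LEN + Long.BYTES + id.length)
                .put(prefix)
                .putLong(timestamp)
                .put(id)
                .array();
    }

    // The hash and timestamp are fixed width, so the userId is simply
    // everything after the first 24 bytes; no separator is needed.
    static String userIdOf(byte[] rowKey) {
        int offset = HASH_LEN + Long.BYTES;
        return new String(rowKey, offset, rowKey.length - offset,
                StandardCharsets.UTF_8);
    }
}
```

Because the first two fields are fixed width, all rows for one user share the same 16-byte prefix and sort by timestamp within it, and the trailing userId lets a reader tell apart two users whose hashes happen to collide.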
> On Mon, Jul 8, 2013 at 10:19 AM, Jason Huang <[EMAIL PROTECTED]> wrote:
>> I am trying to get some advice on pros/cons of using separator/delimiter as
>> part of HBase row key.
>> Currently one of our user activity tables has a rowkey design of
>> "UserID^TimeStamp" with a separator of "^". (UserID is a string that won't
>> include '^').
>> This is designed for the two common use cases in our system:
>> (1) If we come from a context where the UserID is known, we can do a scan
>> easily for all the user activities with a startRowKey and stopRowKey.
>> (2) If we come from an external networked table where the row key of this
>> user activity table is stored and can be retrieved as activityRowKey, then
>> we can use the following code to parse out the UserID and do the same scan
>> as in (1):
>> String activityRowKeyStr = Bytes.toString(activityRowKey);
>> // UserID is everything before the "^" separator
>> String userId = activityRowKeyStr.substring(0, activityRowKeyStr.indexOf("^"));
>> Then I can set startRowKey and stopRowKey for the scan based on userId.
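
A rough sketch of how the scan bounds could be derived, given the `UserID^TimeStamp` layout described above. It uses only plain byte arrays, so it runs without HBase on the classpath; the resulting arrays are what would be handed to the HBase scan as its start row and its exclusive stop row. The names `UserScanRange`, `userIdOf`, `startRow` and `stopRow` are hypothetical, and UTF-8 keys are an assumption.

```java
import java.nio.charset.StandardCharsets;

class UserScanRange {
    static final char SEPARATOR = '^';

    // Recover the UserID from a stored activity row key: it is the part
    // before the first separator.
    static String userIdOf(byte[] activityRowKey) {
        String key = new String(activityRowKey, StandardCharsets.UTF_8);
        int sep = key.indexOf(SEPARATOR);
        if (sep < 0) {
            throw new IllegalArgumentException("no separator in row key");
        }
        return key.substring(0, sep);
    }

    // Inclusive start row: "<userId>^". Every key for this user begins with it.
    static byte[] startRow(String userId) {
        return (userId + SEPARATOR).getBytes(StandardCharsets.UTF_8);
    }

    // Exclusive stop row: the start row with its last byte incremented,
    // so '^' (0x5E) becomes '_' (0x5F). Keys sort as raw bytes, so this is
    // the first possible key that no longer belongs to the user.
    static byte[] stopRow(String userId) {
        byte[] stop = startRow(userId);
        stop[stop.length - 1]++;
        return stop;
    }
}
```

Including the separator in the start row matters: scanning from the bare userId would also pick up rows of any other user whose ID merely begins with the same characters.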
>> Here we get the benefit of having the UserID as part of the row key with the
>> separator (compared to another solution that stores the UserID as one of
>> the columns in the user activity table).
>> The reason I picked a separator after UserID is that we may not always get
>> a fixed-length string for the UserID value. At one point I actually thought
>> of using MD5 to hash the UserID and make it a fixed length; however, the
>> possibility of collisions and the possible overhead of applying the hash
>> function made me pick the separator "^".
>> My question:
>> (1) I would argue that using a separator is better than using an MD5
>> hash value. Does that seem reasonable? Could you comment on other pros
>> and cons that I might have missed (as the basis for my argument)?
>> (2) On using a separator/delimiter, besides the requirement that the
>> separator/delimiter shouldn't appear elsewhere in the rowkey, are there any
>> other requirements? Are there any special separator/delimiters that are
>> better/worse than the average ones?