HBase >> mail # user >> Using separator/delimiter in HBase rowkey?


+
Jason Huang 2013-07-08, 14:19
+
Shahab Yunus 2013-07-08, 15:17
+
Mike Axiak 2013-07-08, 15:14
Re: Using separator/delimiter in HBase rowkey?
Is Murmur part of the standard Java libraries?

If not, you end up having to do a bit more maintenance of your cluster, and that's going to be part of your tradeoff.

On Jul 8, 2013, at 10:14 AM, Mike Axiak <[EMAIL PROTECTED]> wrote:

> Hello Jason,
>
> Have you considered the following rowkey?
>
>  murmur_128(userId) + timestamp + userId ?
>
> This handles both of your cases: (1) murmur 128 is much faster than
> md5, so it will have very low overhead, and (2) the userId at the end
> of the key ensures that murmur collisions will not cause issues. This
> key also handles incrementing userIds well, because close userIds
> will likely be in separate regions.
>
> Cheers,
> Mike
>
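To make the suggested layout concrete, here is a minimal sketch of building and parsing a key of the form hash(userId) + timestamp + userId. Murmur is not in the JDK, so this sketch swaps in MD5 from java.security purely as a stand-in 128-bit hash; a murmur 128 implementation from a third-party library would take its place. The class name HashedRowKey, the method names, and the 8-byte big-endian timestamp encoding are all assumptions made for illustration, not something from the thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: rowkey = 16-byte hash + 8-byte timestamp + userId bytes.
final class HashedRowKey {
    static final int HASH_LEN = 16;       // MD5 and a 128-bit murmur hash are both 16 bytes
    static final int TS_LEN = Long.BYTES; // timestamp stored as a big-endian long

    // Stand-in hash: MD5 is used here only because it ships with the JDK.
    static byte[] hash128(String userId) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(userId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Concatenate the fixed-length hash, the fixed-length timestamp, and the userId.
    static byte[] build(String userId, long timestamp) {
        byte[] id = userId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(HASH_LEN + TS_LEN + id.length)
                .put(hash128(userId))
                .putLong(timestamp)
                .put(id)
                .array();
    }

    // The userId is everything after the two fixed-length fields, so no separator is needed.
    static String userIdOf(byte[] rowKey) {
        int offset = HASH_LEN + TS_LEN;
        return new String(rowKey, offset, rowKey.length - offset, StandardCharsets.UTF_8);
    }

    static long timestampOf(byte[] rowKey) {
        return ByteBuffer.wrap(rowKey, HASH_LEN, TS_LEN).getLong();
    }
}
```

Because the hash and timestamp have fixed lengths, a variable-length userId can be recovered from the tail of the key without any delimiter. If two users ever hash to the same prefix, a scan on that prefix returns rows for both, so the caller would still compare userIdOf(rowKey) against the expected userId.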
> On Mon, Jul 8, 2013 at 10:19 AM, Jason Huang <[EMAIL PROTECTED]> wrote:
>> Hello,
>>
>> I am trying to get some advice on pros/cons of using separator/delimiter as
>> part of HBase row key.
>>
>> Currently one of our user activity tables has a rowkey design of
>> "UserID^TimeStamp" with a separator of "^". (UserID is a string that won't
>> include '^').
>>
>> This is designed for the two common use cases in our system:
>> (1) If we come from a context where the UserID is known, we can do a scan
>> easily for all the user activities with a startRowKey and stopRowKey.
>> (2) If we come from an external networked table where the row key of this
>> user activity table is stored and can be retrieved as activityRowKey, then
>> we can use the following code to parse out the UserID and do the same scan
>> as in (1):
>>
>>    String activityRowKeyStr = Bytes.toString(activityRowKey);
>>    String userId = activityRowKeyStr.substring(0, activityRowKeyStr.indexOf("^"));
>>
>> Then I can set startRowKey and stopRowKey for the scan based on userId.
>> Here we get the benefit of having the User ID as part of the row key with
>> the separator (compared to another solution that stores the userID as one
>> of the columns in the user activity table).
>>
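The two use cases above can be sketched without any HBase dependency as follows. This is only an illustration under stated assumptions: the class name SeparatorRowKey and its methods are hypothetical, the userID is assumed to be text that never contains '^', and the resulting byte arrays are assumed to be handed to a scan whose start row is inclusive and whose stop row is exclusive.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for rowkeys of the form "UserID^TimeStamp".
final class SeparatorRowKey {
    static final char SEP = '^';

    // Use case (2): recover the userID, i.e. the part before the first separator.
    static String userIdOf(byte[] rowKey) {
        String s = new String(rowKey, StandardCharsets.UTF_8);
        int i = s.indexOf(SEP);
        if (i < 0) {
            throw new IllegalArgumentException("no separator in row key");
        }
        return s.substring(0, i);
    }

    // Use case (1): the first possible key for this user is "userId^".
    static byte[] startRow(String userId) {
        return (userId + SEP).getBytes(StandardCharsets.UTF_8);
    }

    // Exclusive upper bound: the same prefix with the separator replaced by the
    // next character ('_' follows '^' in ASCII), so every "userId^..." key sorts below it.
    static byte[] stopRow(String userId) {
        return (userId + (char) (SEP + 1)).getBytes(StandardCharsets.UTF_8);
    }
}
```

Including the separator in the start row matters when userIDs have different lengths: a scan starting at "ab^" and stopping at "ab_" does not pick up rows for a user named "abc".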
>> The reason I picked a separator after UserID is that sometimes we may not
>> get a fixed-length string for the UserID value. At one point I actually
>> thought of using MD5 to hash the UserID and make it a fixed length;
>> however, the possibility of collisions and the possible overhead of
>> applying the hash function made me pick the separator "^".
>>
>> My question:
>> (1) My argument is that using a separator is better than using an MD5
>> hash value. Does that seem reasonable? Could you comment on other pros
>> and cons that I might have missed (as the basis for my argument)?
>>
>> (2) On using a separator/delimiter, besides the requirement that the
>> separator/delimiter shouldn't appear elsewhere in the rowkey, are there any
>> other requirements? Are there any particular separators/delimiters that are
>> better or worse than the average ones?
>>
>> thanks!
>>
>> Jason
>
+
Ted Yu 2013-07-08, 15:40
+
Mike Axiak 2013-07-08, 15:36
+
Michael Segel 2013-07-08, 15:54
+
Mike Axiak 2013-07-08, 16:00
+
Michael Segel 2013-07-08, 16:25
+
Jason Huang 2013-07-09, 01:09
+
Ted Yu 2013-07-08, 15:58