HBase >> mail # user >> HBase table row key design question.


Jason Huang 2012-10-02, 14:28
Mohammad Tariq 2012-10-02, 19:30
Jason Huang 2012-10-02, 21:38
RE: HBase table row key design question.
For 1. I wouldn't worry about that problem until it really happens. Just my opinion. If you really want to solve it you will need to generate a unique id per row-key 'put' outside of hbase [ say some hash of serverip + timestamp etc ] and append it to the end of your row key.
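
A rough sketch of the suffix idea, in Python only for brevity. The names `unique_suffix` and `make_row_key` and the per-process counter are my assumptions for illustration, not anything HBase provides. Note that a truncated hash makes collisions very unlikely rather than impossible; appending the raw server IP, timestamp and counter instead would make the key strictly unique at the cost of a longer key.

```python
import hashlib
import itertools

# Hypothetical per-process counter: it separates two puts issued by the
# same server within the same millisecond.
_sequence = itertools.count()


def unique_suffix(server_ip: str, timestamp_ms: int) -> str:
    """Short hash of server IP + timestamp + a local sequence number."""
    raw = f"{server_ip}|{timestamp_ms}|{next(_sequence)}".encode("utf-8")
    # Keep 8 hex characters (32 bits) so the row key stays short.
    return hashlib.md5(raw).hexdigest()[:8]


def make_row_key(base_key: str, server_ip: str, timestamp_ms: int) -> str:
    """Append the suffix to an already-built row key, outside of HBase."""
    return f"{base_key}_{unique_suffix(server_ip, timestamp_ms)}"
```

Calling `make_row_key` twice with the same base key, IP and timestamp should give two different keys, because the counter advances on every call.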

For 2. You can investigate bloom filters, which can help you filter out invalid rows faster. Also, there are ways to organize names based on phonetics. You could, maybe, build a secondary table in the background with phonetic keys as row keys.
http://en.wikipedia.org/wiki/Soundex
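
To make the phonetic key idea concrete, here is a small sketch of American Soundex as described on the page above, again in Python. The `phonetic_key` helper and its `{last}_{first}` layout are assumptions of mine; the secondary table would map such a key back to the row key(s) in the main user table.

```python
# Letter groups from the American Soundex rules.
_CODES = {}
for _digit, _letters in (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                         ("4", "L"), ("5", "MN"), ("6", "R")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or "" if there are no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return "".join(out)[:4].ljust(4, "0")


def phonetic_key(first_name: str, last_name: str) -> str:
    """Hypothetical row key for the secondary (phonetic) table."""
    return f"{soundex(last_name)}_{soundex(first_name)}"
```

With this, "Robert" and "Rupert" both map to R163, so a sounds-like lookup becomes a get or a short scan on the secondary table instead of a full scan of the user table.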
hth,
Abhishek
-----Original Message-----
From: Jason Huang [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, October 02, 2012 2:38 PM
To: [EMAIL PROTECTED]
Subject: Re: HBase table row key design question.

Thanks Mohammad.

The issue with phone numbers is that they tend to change over time, and we think name and DOB are more reliable. SSN is closer to unique, but we can't force the user to provide it. Basically, we have limited information to work with.

thanks,

Jason

On Tue, Oct 2, 2012 at 3:30 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
> Hello Sir,
>
>      Although we should always try to keep the rowkey as short as
> possible, a short key that doesn't help with faster data access is
> of no use either. So it depends entirely on the particular use case.
> In your case, however, how about using "phone number" as the rowkey?
> Since it is always unique, you will always get the correct result with
> a much shorter rowkey. It's just that in this case you will have to ask
> for the user's phone number instead of name and DOB.
>
> Regards,
>     Mohammad Tariq
>
>
>
> On Tue, Oct 2, 2012 at 7:58 PM, Jason Huang <[EMAIL PROTECTED]> wrote:
>
>> Hello,
>>
>> I am designing a HBase table for users and hope to get some
>> suggestions for my row key design. Thanks...
>>
>> This user table will have columns for user information such as
>> name, birthday, gender, address, phone number, etc. The first time
>> a user comes to us we will ask for all this information and generate
>> a new row in the table with a unique row key. The next time the same
>> user comes in we will ask for his/her name and birthday, and our
>> application should quickly get the row(s) in the table that match
>> the name and birthday provided.
>>
>> Here is what I am thinking as row key:
>>
>> {first 6 characters of user's first name}_{first 6 characters of
>> user's last name}_{birthday in MMDDYYYY}_{timestamp when user comes
>> in for the first time}
>>
>> However, I see a few problems with this row key:
>>
>> (1) Although it is unlikely, there is a small chance that two users
>> with the same name and birthday come in on the same day, and that
>> the two requests to create a new user arrive at the same time (the
>> timestamps were defined in the HTable API and happened to have the
>> same value before the put method was called). This means the row
>> key design above won't guarantee a unique row key. Any suggestions
>> on how to modify it to ensure a unique ID?
>>
>> (2) Sometimes we will only have part of the user's first and/or last
>> name. In that case, we will need to perform a scan and return
>> multiple matches to the client. To avoid scanning the whole table, if
>> we have the user's first name, we can set the start/stop row
>> accordingly. But if we only have the user's last name, we can't set
>> up a good start/stop row. What's even worse, if the user provides a
>> "sounds-like" first or last name, our scan won't be able to return
>> good possible matches. Has anyone used names as part of the row key
>> and encountered this type of issue?
>>
>> (3) The row key seems long (30+ chars). Will this affect our
>> read/write performance? Maybe it will increase the storage a bit (say
>> we have 3 million rows per month)? In other words, does the length of
>> the row key matter a lot?
>>
>> thanks!
>>
>> Jason
>>
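
For reference, the row key format from the original question and the start/stop rows for a first-name prefix scan could be sketched as below. This is plain Python with no HBase client calls; the function names and the lowercasing of names are assumptions for illustration. It relies only on two HBase behaviours: row keys sort as raw bytes, and a scan's stop row is exclusive (an empty stop row means "scan to the end of the table").

```python
from typing import Tuple


def base_row_key(first: str, last: str, birthday_mmddyyyy: str,
                 first_seen_ms: int) -> str:
    """{first 6 of first name}_{first 6 of last name}_{MMDDYYYY}_{timestamp}."""
    # Lowercasing is an assumption, so that "Jason" and "jason" sort together.
    return "_".join([first[:6].lower(), last[:6].lower(),
                     birthday_mmddyyyy, str(first_seen_ms)])


def prefix_scan_range(prefix: bytes) -> Tuple[bytes, bytes]:
    """Return (start_row, stop_row) covering every key that begins with prefix."""
    stop = bytearray(prefix)
    # Drop trailing 0xFF bytes, then increment the last remaining byte.
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if not stop:
        return prefix, b""  # empty stop row: scan to the end of the table
    stop[-1] += 1
    return prefix, bytes(stop)
```

A partial first name such as "jas" gives the range [b"jas", b"jat"), which covers "jason_huang_..." and nothing outside the prefix. The same trick does not work for a last-name-only lookup, because the last name is not the leading part of the key; that case needs a second table or index keyed by last name.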
Jason Huang 2012-10-03, 12:31
Doug Meil 2012-10-02, 20:02