HBase >> mail # user >> Is it necessary to set MD5 on rowkey?


RE: Is it necessary to set MD5 on rowkey?
Thanks to all of you!
Actually, I want to build reports on device access, both daily and for a selected range of days. I designed a table like this:
row key:  date_deviceid
This rowkey helps me count the devices that log in each day. I can add a prefix (such as a 2-byte prefix taken from MD5(date)) and compute a specific day quickly, but that does not suit a calculation over a date range. It's hard to balance, because I expect about 50% daily reports and 50% range reports.
Another question: I also have a report on the daily count of new device IDs (devices that never accessed the system before), which means I have to search by deviceid across all dates. I've met several problems like this: a rowkey that serves one query is no use for another, so I would need a different rowkey format for the other query. But I cannot create two original tables with different rowkeys!
Any suggestions, or better solutions for my questions? Thanks
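
As an illustration of the trade-off described above, here is a minimal sketch of the prefixed date_deviceid key. It only builds key bytes and scan boundaries and makes no HBase client calls. The function names, the "_" separator, the YYYYMMDD date format, and the reading of the prefix as the first two hex characters of MD5(date) are all assumptions made for the example, not something stated in the thread.

```python
import hashlib
from datetime import date, timedelta
from typing import List, Tuple


def day_prefix(day: str) -> str:
    """First two hex characters of MD5(day); every row of one day shares them."""
    return hashlib.md5(day.encode("ascii")).hexdigest()[:2]


def row_key(day: str, device_id: str) -> bytes:
    """Hypothetical key layout: <prefix>_<day>_<deviceid>."""
    return f"{day_prefix(day)}_{day}_{device_id}".encode("ascii")


def day_scan_range(day: str) -> Tuple[bytes, bytes]:
    """[start, stop) boundaries that cover every row of a single day."""
    start = f"{day_prefix(day)}_{day}_".encode("ascii")
    # Incrementing the last byte gives the smallest key that sorts
    # after every key beginning with `start`.
    stop = start[:-1] + bytes([start[-1] + 1])
    return start, stop


def range_scan_ranges(first: date, last: date) -> List[Tuple[bytes, bytes]]:
    """One [start, stop) pair per day, because prefixed days are not contiguous."""
    days = (last - first).days + 1
    return [
        day_scan_range((first + timedelta(days=d)).strftime("%Y%m%d"))
        for d in range(days)
    ]
```

With this layout a single-day report needs one scan, while a report over N days needs N scans (one per day), which is the imbalance between daily and range reports mentioned above.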

> Subject: Re: Is it necessary to set MD5 on rowkey?
> From: [EMAIL PROTECTED]
> Date: Tue, 18 Dec 2012 07:52:53 -0600
> To: [EMAIL PROTECTED]
>
>
> Hi,
>
> First, the use of a 'Salt' is a very, very bad idea and I would really hope that the author of that blog takes it down.
> While it may solve an initial problem in terms of region hot spotting, it creates another problem when it comes to fetching data. Fetching data takes more effort.
>
> With respect to using a hash (MD5 or SHA-1), you are creating a more random key that is unique to the record.  Some would argue that with MD5 or SHA-1 you could mathematically have a collision; however, you could then append the key to the hash to guarantee uniqueness. You could also do things like take the hash, truncate it to the first byte, and then append the record key. This should give you enough randomness to avoid hot spotting after the initial region completion, and you could pre-split out any number of regions. (First byte 0-255 for values, so you can program the split...)
>
>
> Having said that... yes, you lose the ability to perform a sequential scan of the data.  At least to a point.  It depends on your schema.
>
> Note that you need to think about how you are primarily going to access the data.  You can then determine the best way to store the data to gain the best performance. For some applications... the region hot spotting isn't an important issue.
>
> Note YMMV
>
> HTH
>
> -Mike
>
> On Dec 18, 2012, at 3:33 AM, Damien Hardy <[EMAIL PROTECTED]> wrote:
>
> > Hello,
> >
> > There is a middle ground between sequential keys (hot-spotting risk) and md5
> > (heavy scan):
> >  * you can use composed keys with a field that can segregate data
> > (hostname, productname, metric name) like OpenTSDB
> >  * or use Salt with a limited number of values (example
> > substr(md5(rowid),0,1) = 16 values)
> >    so that a scan is a combination of 16 filters, one on each salt value
> >    you can base your code on HBaseWD by sematext
> >
> > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> >       https://github.com/sematext/HBaseWD
> >
> > Cheers,
> >
> >
> > 2012/12/18 bigdata <[EMAIL PROTECTED]>
> >
> >> Many articles tell me that an MD5 rowkey, or an MD5 part of it, is a good method to
> >> balance the records stored in different parts. But if I want to search
> >> sequential rowkeys, such as rows keyed wholly or partly by date, I cannot
> >> use a rowkey filter to scan a range of date values in one pass once the date is hashed by
> >> MD5. How to balance this issue?
> >> Thanks.
> >>
> >>
> >
> >
> >
> >
> > --
> > Damien HARDY
>
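
The two prefixing schemes in the quoted replies can be sketched roughly as follows: the one-byte hash prefix with programmatic pre-split points, and the 16-value salt where one logical scan becomes 16 range scans. This is only an illustration of the key arithmetic. All function names are hypothetical, MD5 is picked because the thread mentions it, and nothing here calls HBase or HBaseWD.

```python
import hashlib
from typing import List, Tuple

HEX_SALTS = b"0123456789abcdef"  # the 16 possible values of one hex character


def prefixed_key(record_key: bytes) -> bytes:
    """First byte of MD5(record_key), followed by the record key itself."""
    return hashlib.md5(record_key).digest()[:1] + record_key


def split_points(num_regions: int) -> List[bytes]:
    """Evenly spaced one-byte split keys over the 0-255 prefix space."""
    if not 1 <= num_regions <= 256:
        raise ValueError("num_regions must be between 1 and 256")
    return [bytes([(i * 256) // num_regions]) for i in range(1, num_regions)]


def salted_key(row_id: bytes) -> bytes:
    """First hex character of MD5(row_id) as salt, followed by the row id."""
    salt = hashlib.md5(row_id).hexdigest()[0].encode("ascii")
    return salt + row_id


def fan_out_ranges(start: bytes, stop: bytes) -> List[Tuple[bytes, bytes]]:
    """Turn one logical [start, stop) scan into one range per salt value."""
    return [(bytes([s]) + start, bytes([s]) + stop) for s in HEX_SALTS]
```

For example, `split_points(4)` yields the three boundaries 0x40, 0x80 and 0xc0, and `fan_out_ranges(b"20121218", b"20121219")` yields 16 ranges whose results the client would have to merge, which is the extra fetch effort the reply objects to.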
     