HBase >> mail # user >> Use of MD5 as row keys - is this safe?


Re: Use of MD5 as row keys - is this safe?
I don't believe there have been any reports of collisions, but if you are concerned you could use SHA-1 to generate the hash. Relatively speaking, SHA-1 is slower, but still fast enough for most applications.
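
As a rough sanity check on how unlikely an accidental collision is, here is a minimal sketch of the birthday (union) bound. It assumes the hash behaves like a uniformly random function and that nobody is deliberately crafting colliding ids (MD5 is known to be weak against deliberately constructed collisions). The function name and the 10**12 row count are illustrative assumptions, not figures from this thread.

```python
from fractions import Fraction


def collision_probability_bound(num_keys, hash_bits):
    """Upper bound on P(at least one collision) among num_keys random hashes.

    Union bound: each of the n*(n-1)/2 pairs collides with probability
    1 / 2**hash_bits, so the total is at most pairs / 2**hash_bits.
    """
    pairs = num_keys * (num_keys - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


# Hypothetical table size: one trillion distinct ids.
md5_bound = collision_probability_bound(10**12, 128)   # MD5 is 128 bits
sha1_bound = collision_probability_bound(10**12, 160)  # SHA-1 is 160 bits

print("MD5 bound:  ", float(md5_bound))
print("SHA-1 bound:", float(sha1_bound))
```

Under these assumptions the bound for MD5 stays far below one in a trillion even at that scale, and the extra 32 bits of SHA-1 shrink it by a further factor of 2**32.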

I don't know how its speed compares to an MD5 plus a string concatenation, but it should yield a smaller key.
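
To make the key-size comparison concrete, here is a small sketch. It assumes the row key stores the raw binary digest (16 bytes for MD5, 20 bytes for SHA-1) rather than a hex string, and the sample id is hypothetical.

```python
import hashlib


def md5_plus_id_key(identifier):
    """Row key of the form MD5(id) + id, using the raw 16-byte digest."""
    return hashlib.md5(identifier).digest() + identifier


def sha1_key(identifier):
    """Row key made of the raw 20-byte SHA-1 digest alone."""
    return hashlib.sha1(identifier).digest()


example_id = b"user-0000000042"  # hypothetical 15-byte identifier

print(len(md5_plus_id_key(example_id)))  # 16 + 15 = 31 bytes
print(len(sha1_key(example_id)))         # 20 bytes
```

With raw digests, the SHA-1 key is the smaller one whenever the id is longer than 4 bytes. The trade-off is that the SHA-1 key no longer contains the original id, so the id would have to be stored in a column if it is needed on read.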

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jul 20, 2012, at 11:31 AM, Damien Hardy <[EMAIL PROTECTED]> wrote:

> On 20/07/2012 18:22, Jonathan Bishop wrote:
>> Hi,
>>
>> I know it is commonly suggested to use an MD5 checksum to create a row
>> key from some other identifier, such as a string or long. This is usually
>> done to guard against hot-spotting and seems to work well.
>>
>> My concern is that there is no guard against collision when this is done - two
>> different strings or longs could produce the same row-key. Although this is
>> very unlikely, it is bothersome to consider this possibility for large
>> systems.
>>
>> So what I usually do is concatenate the MD5 with the original identifier...
>>
>> MD5(id) + id
>>
>> which assures that the rowkey is both randomly distributed and unique.
>>
>> Is this necessary, or is it the common practice to just use the MD5
>> checksum itself?
>>
>> Thanks,
>>
>> Jon
>
> Hello Jonathan,
>
> md5(id)+id is a good way to avoid hotspotting and ensure uniqueness.
>
> md5(id)[0]+id could be another way, limiting the randomness of the row ID
> to 16 values.
> You can then combine (with OR logic) 16 filters in a scanner (one for each
> hex character that can appear in an MD5 digest).
> It also limits the balancing to at most 16 potential regions.
>
> Cheers,
>
> --
> Damien
>
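
The two key layouts discussed in this thread can be sketched as follows. This is only an illustration under stated assumptions: ids are strings, the hex form of the digest is used, and the helper names (`full_hash_key`, `one_char_salt_key`, `scan_prefixes`) are made up for the example. In HBase itself the 16 prefixes would presumably be expressed as prefix filters OR-ed together in a filter list; plain string matching stands in for that here.

```python
import hashlib

HEX_DIGITS = "0123456789abcdef"


def full_hash_key(identifier):
    """md5(id) + id: the 32-character hex digest followed by the original id."""
    return hashlib.md5(identifier.encode("utf-8")).hexdigest() + identifier


def one_char_salt_key(identifier):
    """md5(id)[0] + id: a single hex character (16 possible values) as a salt."""
    return hashlib.md5(identifier.encode("utf-8")).hexdigest()[0] + identifier


def scan_prefixes(id_prefix):
    """The 16 row-key prefixes to OR together to find ids starting with id_prefix."""
    return [digit + id_prefix for digit in HEX_DIGITS]


# Hypothetical ids, keyed with the one-character salt.
keys = [one_char_salt_key("order-%d" % n) for n in range(1000)]

# Emulate a scan with 16 OR-ed prefix filters for ids starting with "order-99".
prefixes = scan_prefixes("order-99")
matching = [k for k in keys if any(k.startswith(p) for p in prefixes)]
print(len(matching))
```

With the one-character salt, a range of ids can still be retrieved with 16 prefix scans, whereas with the full md5(id) + id layout the id prefix is buried behind 32 effectively random characters and only exact lookups by id remain cheap.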