

Re: Primary Key Design
Two things go wrong:

1. The current version of Lily treats "dot" as a special/reserved character
2. We shouldn't overload a single slave (Slave-A) just because everything
starting with "http://www.ABCD******" is stored on that Slave-A

We should randomize primary keys...

-Fuad
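
A minimal sketch of the hot-spotting argument above, in Java. It compares raw URL keys with their SHA-256 hex digests and counts how many keys share each leading character. The example URLs, the class name `KeySpreadDemo`, and the use of the first key character as a stand-in for "which slave holds the row" are assumptions for illustration only; real region placement in HBase depends on the table's actual split points.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class KeySpreadDemo {

    // Hex-encoded SHA-256 of a string's UTF-8 bytes.
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Count keys per leading character: a crude proxy for key-range placement.
    static TreeMap<Character, Integer> countByFirstChar(List<String> keys) {
        TreeMap<Character, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(key.charAt(0), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> urlKeys = new ArrayList<>();
        List<String> hashedKeys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            String url = "http://www.example.com/page/" + i; // hypothetical URLs
            urlKeys.add(url);
            hashedKeys.add(sha256Hex(url));
        }
        System.out.println("URL keys by first char:    " + countByFirstChar(urlKeys));
        System.out.println("hashed keys by first char: " + countByFirstChar(hashedKeys));
    }
}
```

All URL keys fall into the single bucket `h`, because they share the same prefix and therefore sort next to each other. The hashed keys should spread over the hex characters `0` to `f`, which is the "randomization" effect being argued for.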
On 11-08-03 11:50 AM, "Michael Buckley" <[EMAIL PROTECTED]> wrote:

>What is wrong with using the url as PK?  Is it space?  Or query
>performance?
>
>Michael
>
>On 2011-08-03, at 11:32 AM, Fuad Efendi wrote:
>
>>
>> Such a design is enforced for RAW: we need to keep the history of HTML
>> pages under the same ID value. That's why the first candidate for the ID
>> is the URL, and in the end we use SHA(URL).
>>
>> For OIQ, it must be carefully planned. SHA(JSON) has the benefit of an
>> implicit "equals" implementation (JSON objects are not the same if their
>> ID, SHA(JSON), is different)
>>
>> -Fuad
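
A minimal sketch of the implicit "equals" idea, assuming the ID is the hex SHA-256 of the JSON text's UTF-8 bytes. The class and method names (`ContentIdDemo`, `contentId`, `sameObject`) are hypothetical. One caveat worth stating as an assumption of this sketch: the hash is taken over the serialized bytes, so two JSON texts that describe the same object but differ in key order or whitespace get different IDs unless they are serialized in one canonical form first.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ContentIdDemo {

    // ID derived from content: hex SHA-256 of the JSON text's UTF-8 bytes.
    static String contentId(String json) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(json.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Two serialized objects are treated as equal when their IDs match.
    static boolean sameObject(String jsonA, String jsonB) {
        return contentId(jsonA).equals(contentId(jsonB));
    }

    public static void main(String[] args) {
        String a = "{\"name\":\"x\",\"rank\":1}";
        String b = "{\"name\":\"x\",\"rank\":1}";
        String reordered = "{\"rank\":1,\"name\":\"x\"}";
        System.out.println("identical text: " + sameObject(a, b));
        System.out.println("reordered keys: " + sameObject(a, reordered));
    }
}
```

Identical text compares as equal; the reordered variant does not, which is why the serialization would need to be planned carefully before relying on SHA(JSON) as an identity.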
>>
>>
>>
>>
>>
>>
>> On 11-08-03 10:25 AM, "Fuad Efendi" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>>
>>> I am starting to use the following scheme for primary keys:
>>> SHA256(URL) + "-RAW" Primary Key Schema
>>> <https://outsideiq.jira.com/browse/CA-107>
>>>
>>>
>>>
>>> RATIONALE:
>>> * User-defined PKs in Lily will be prefixed with "USER.", and I can't use
>>> a URI, for instance (it contains dots, and the dot is a special character
>>> in the current version)
>>> * In addition to the SHA-256-generated PK, Lily will still use a UUID
>>> (which is really unique) for versioning?
>>> * IMPORTANT: we need to randomize PKs; it is best practice with HBase
>>> (data will be randomly distributed across the cluster)
>>>
>>> and I suggest using a similar SHA256(JSON-Object-in-UTF8) + "-OIQ" (it is
>>> a postfix so that we will have good "randomization"; in HBase, all data is
>>> physically sorted by PK)
>>> - since all OIQ objects will be stored denormalized as JSON (Lily string
>>> type) (note: it will be UTF-8 encoded; I believe that is also part of the
>>> ECMA specs)
>>>
>>>
>>>
>>>
>>> import java.io.UnsupportedEncodingException;
>>> import java.security.MessageDigest;
>>> import java.security.NoSuchAlgorithmException;
>>>
>>> /**
>>>  * {@link http://stackoverflow.com/questions/221165/pros-and-cons-of-using-md5-hash-of-uri-as-the-primary-key-in-a-database}
>>>  *
>>>  * @author Fuad
>>>  */
>>> public class SHA256 {
>>>
>>>     /** Returns the SHA-256 digest of the given bytes as lowercase hex. */
>>>     public static final String SHA256(byte[] bytes)
>>>             throws NoSuchAlgorithmException {
>>>         MessageDigest md = MessageDigest.getInstance("SHA-256");
>>>         md.update(bytes);
>>>         byte[] mdbytes = md.digest();
>>>
>>>         // convert each byte to two hex digits, zero-padded
>>>         StringBuilder hexString = new StringBuilder(mdbytes.length * 2);
>>>         for (int i = 0; i < mdbytes.length; i++) {
>>>             String hex = Integer.toHexString(0xff & mdbytes[i]);
>>>             if (hex.length() == 1) {
>>>                 hexString.append('0');
>>>             }
>>>             hexString.append(hex);
>>>         }
>>>         return hexString.toString();
>>>     }
>>>
>>>     /** Returns the SHA-256 digest of the text's UTF-8 bytes as lowercase hex. */
>>>     public static final String SHA256(String text)
>>>             throws NoSuchAlgorithmException, UnsupportedEncodingException {
>>>         return SHA256(text.getBytes("UTF-8"));
>>>     }
>>> }
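
A minimal sketch of how the proposed keys could be composed, and of why the type marker goes at the end rather than the front. Only the `SHA256(...) + "-RAW"` and `SHA256(...) + "-OIQ"` shapes come from the message above; the class name `SuffixKeyDemo`, the helper names, the sample URLs, and the "RAW-" prefixed variant shown for contrast are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.TreeSet;

final class SuffixKeyDemo {

    // Hex-encoded SHA-256 of a string's UTF-8 bytes.
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Key shapes from the proposal: hash first, type marker as a postfix.
    static String rawKey(String url) {
        return sha256Hex(url) + "-RAW";
    }

    static String oiqKey(String json) {
        return sha256Hex(json) + "-OIQ";
    }

    public static void main(String[] args) {
        // TreeSet sorts lexicographically, loosely mimicking sorted row keys.
        TreeSet<String> suffixed = new TreeSet<>();
        TreeSet<String> prefixed = new TreeSet<>();
        for (int i = 0; i < 5; i++) {
            String url = "http://www.example.com/page/" + i; // hypothetical URLs
            suffixed.add(rawKey(url));
            prefixed.add("RAW-" + sha256Hex(url)); // contrast case, not proposed
        }
        System.out.println("suffixed keys (hash decides the order): " + suffixed);
        System.out.println("prefixed keys (all share 'RAW-'):       " + prefixed);
    }
}
```

With the marker as a postfix, the leading characters of every key come from the hash, so the sort order is driven by the digest. With a marker in front, every key of one type would start with the same characters and sort into one contiguous range.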
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Fuad Efendi
>>>
>>>
>>>
>>>
>>>
>>
>>
>