HBase >> mail # dev >> Primary Key Design


Re: Primary Key Design
Two things go wrong:

1. The current version of Lily treats the dot as a special/reserved character
2. We shouldn't overload the same Slave-A just because everything starting
with "http://www.ABCD******" is stored on this Slave-A

We should randomize primary keys...

-Fuad
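
A minimal sketch of the randomized-key idea, assuming the SHA256(URL) + "-RAW" scheme quoted further down. The class and method names (RowKeySketch, sha256Hex, rawKey) are hypothetical and not part of Lily or HBase. URLs that share a long common prefix hash to unrelated hex strings, so their rows are unlikely to sort next to each other, and the resulting key contains no dots.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not a Lily or HBase API.
class RowKeySketch {

    // Returns the lowercase hex SHA-256 digest of the UTF-8 bytes of the input.
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    // Builds a key of the form SHA256(URL) + "-RAW", as proposed in the thread.
    static String rawKey(String url) {
        return sha256Hex(url) + "-RAW";
    }
}
```

Because the hash is deterministic, every fetch of the same URL maps to the same key, which is what keeping the HTML history under one ID requires.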
On 11-08-03 11:50 AM, "Michael Buckley" <[EMAIL PROTECTED]> wrote:

>What is wrong with using the url as PK?  Is it space?  Or query
>performance?
>
>Michael
>
>On 2011-08-03, at 11:32 AM, Fuad Efendi wrote:
>
>>
>> This design is forced on us for RAW: we need to keep the history of HTML
>> pages under the same ID value, which is why the first candidate for the ID
>> is the URL, and in the end we use SHA(URL).
>>
>> For OIQ, it must be carefully planned. SHA(JSON) has the benefit of an
>> implicit "equals" implementation (two JSON objects are not the same if
>> their IDs, SHA(JSON), differ).
>>
>> -Fuad
>>
>> On 11-08-03 10:25 AM, "Fuad Efendi" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>>
>>> I am starting to use the following scheme for primary keys:
>>> SHA256(URL) + "-RAW" Primary Key Schema
>>> <https://outsideiq.jira.com/browse/CA-107>
>>>
>>>
>>>
>>> RATIONALE:
>>> * User-defined PKs in Lily are prefixed with "USER.", and I can't use a
>>> URI, for instance (it contains dots, and the dot is a special character
>>> in the current version)
>>> * In addition to the SHA-256-generated PK, Lily will still use a UUID
>>> (which is really unique) for versioning?
>>> * IMPORTANT: we need to randomize PKs; it is best practice with HBase
>>> (data will be randomly distributed in a cluster)
>>>
>>> and I suggest using a similar SHA256(JSON-Object-in-UTF8) + "-OIQ" (it
>>> is a postfix so that we get good "randomization"; in HBase, all data is
>>> physically sorted by PK)
>>> - since all OIQ objects will be stored denormalized as JSON (string type
>>> in Lily) (note: it will be UTF-8 encoded; I believe that is also part of
>>> the ECMA specs)
>>>
>>>
>>>
>>>
>>> import java.io.UnsupportedEncodingException;
>>> import java.security.MessageDigest;
>>> import java.security.NoSuchAlgorithmException;
>>>
>>> /**
>>>  * Hex-encoded SHA-256 digests, used to build primary keys.
>>>  *
>>>  * @see <a href="http://stackoverflow.com/questions/221165/pros-and-cons-of-using-md5-hash-of-uri-as-the-primary-key-in-a-database">Pros and cons of using MD5 hash of URI as the primary key</a>
>>>  * @author Fuad
>>>  */
>>> public class SHA256 {
>>>
>>>     /** Returns the SHA-256 digest of the given bytes as lowercase hex. */
>>>     public static final String SHA256(byte[] bytes)
>>>             throws NoSuchAlgorithmException {
>>>         MessageDigest md = MessageDigest.getInstance("SHA-256");
>>>         md.update(bytes);
>>>         byte[] mdbytes = md.digest();
>>>
>>>         // Convert each byte to two hex digits, padding with a leading zero.
>>>         StringBuilder hexString = new StringBuilder(mdbytes.length * 2);
>>>         for (int i = 0; i < mdbytes.length; i++) {
>>>             String hex = Integer.toHexString(0xff & mdbytes[i]);
>>>             if (hex.length() == 1) {
>>>                 hexString.append('0');
>>>             }
>>>             hexString.append(hex);
>>>         }
>>>         return hexString.toString();
>>>     }
>>>
>>>     /** Returns the SHA-256 digest of the text's UTF-8 bytes as lowercase hex. */
>>>     public static final String SHA256(String text)
>>>             throws NoSuchAlgorithmException, UnsupportedEncodingException {
>>>         return SHA256(text.getBytes("UTF-8"));
>>>     }
>>> }
>>>
>>> --
>>> Fuad Efendi
>>>
>>>
>>>
>>>
>>>
>>
>>
>
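
The implicit "equals" property of SHA(JSON) mentioned in the thread can be sketched as follows, assuming the SHA256(JSON-Object-in-UTF8) + "-OIQ" key format from the original proposal. ContentKeySketch, oiqKey, and sameObject are hypothetical names introduced only for illustration. One caveat that follows from hashing bytes rather than parsed objects: two documents with the same fields in a different order, or with different whitespace, get different keys unless the JSON is serialized in a canonical form first.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not a Lily or HBase API.
class ContentKeySketch {

    // Builds SHA256(JSON as UTF-8) + "-OIQ" from the serialized JSON text.
    static String oiqKey(String json) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(json.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // High nibble first, then low nibble, both as lowercase hex.
                hex.append(Character.forDigit((b >> 4) & 0xf, 16));
                hex.append(Character.forDigit(b & 0xf, 16));
            }
            return hex + "-OIQ";
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    // Two serialized objects are treated as equal when their keys match.
    static boolean sameObject(String jsonA, String jsonB) {
        return oiqKey(jsonA).equals(oiqKey(jsonB));
    }
}
```

Byte-identical JSON always yields the same key, so storing the same object twice lands on the same row; differing keys mean the serialized texts differ.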