HBase >> mail # user >> Hbase RowKey design schema


Wasim Karani 2013-08-29, 09:21
Shahab Yunus 2013-08-29, 14:18
Re: Hbase RowKey design schema

Hi there,

One thing to mention about the BigTable paper is they reverse the URL so
that scans work with subdomains.

www.subdomain1.cnn.com -> com.cnn.subdomain1.www
www.subdomain2.cnn.com -> com.cnn.subdomain2.www

If you don't reverse the URL there isn't an easy scan (short of creating
another table to act as an index) for all the URLs under a domain.
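
To make the reversal concrete, here is a minimal Python sketch. The helper name reversed_host_rowkey and the sample URLs are made up for illustration, and it only builds and sorts key strings in memory; it does not talk to HBase. Plain string sorting stands in for HBase's byte-ordered rows, which matches for ASCII keys like these.

```python
from urllib.parse import urlsplit

def reversed_host_rowkey(url):
    """Build a row key with the hostname labels reversed; the path is kept as-is."""
    # urlsplit only recognises the host when a scheme or '//' is present.
    parts = urlsplit(url if "//" in url else "//" + url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + parts.path

urls = [
    "www.subdomain1.cnn.com/index.html",
    "www.bbc.co.uk/news",
    "www.subdomain2.cnn.com/index.html",
    "www.cnn.com/index.php",
]

# Rows are kept sorted by key, so sort the keys to mimic the table layout.
rowkeys = sorted(reversed_host_rowkey(u) for u in urls)

# Everything under cnn.com now shares one prefix and sits in one contiguous range.
prefix = "com.cnn."
cnn_rows = [k for k in rowkeys if k.startswith(prefix)]

for key in rowkeys:
    print(key)
```

With the keys laid out this way, "all URLs under cnn.com" becomes a single range scan that starts at the prefix, rather than a lookup through a second index table.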

Regarding the good question below about use-cases, the RefGuide says in
6.3.2.3 "Keep them as short as is reasonable such that they can still be
useful for required data access".

Shorter rowkeys are usually a good thing, but shorter isn't better if it
doesn't work for what you are trying to do. :-)
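
As a rough illustration of that trade-off, the sketch below shortens keys with a truncated SHA-256 digest. This is only a hypothetical stand-in for "some algorithm"; it is not the scheme proposed in the original question, and short_hashed_key is an invented name.

```python
import hashlib

def short_hashed_key(rowkey, length=8):
    """Hypothetical shortener: the first `length` hex characters of a SHA-256 digest."""
    return hashlib.sha256(rowkey.encode("utf-8")).hexdigest()[:length]

readable = [
    "com.cnn.www/index.php",
    "com.cnn.www/news/business/index.html",
    "com.cnn.www/news/sports/index.html",
    "uk.co.bbc.www/news",
]
hashed = {key: short_hashed_key(key) for key in readable}

# With readable keys, "everything under cnn.com" is a single prefix.
under_cnn = [key for key in readable if key.startswith("com.cnn.")]

# The hashed keys are shorter, but the domain can no longer be read off the
# key, so the same question needs a full scan or a separate index table.
for original, short in hashed.items():
    print(short, "<-", original)
```

The keys shrink to eight characters each, but the access pattern the reversed URL was chosen for is gone, which is the case where shorter is not better.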

On 8/29/13 10:18 AM, "Shahab Yunus" <[EMAIL PROTECTED]> wrote:

>What advantage will you gain by compressing? Less space? But then it
>will add compression/decompression performance overhead. It is a trade-off,
>but not an especially worthwhile one, as space is cheap and redundancy is
>OK with such data stores.
>
>Having said that, more importantly, what are your read use-cases or access
>patterns? That should drive your decision about row key design.
>
>Regards,
>Shahab
>
>
>On Thu, Aug 29, 2013 at 5:21 AM, Wasim Karani
><[EMAIL PROTECTED]> wrote:
>
>> I am using HBase to store webtable content, the way Google uses
>> Bigtable.
>> For reference, see the Google Bigtable paper.
>> My question is about the RowKey and how we should form it.
>> Google saves the URL in reverse order, as you can see in the PDF
>> document ("com.cnn.www"), so that all the links associated with cnn.com
>> will be managed in the same block of GFS, which will be a lot easier to
>> scan.
>> I can use the same thing Google is using, but wouldn't it be cool if I
>> used some algorithm to compress the URL?
>>
>> For example:
>>
>> RowKey                               | Google Bigtable                      | Algorithm output
>> www.cnn.com/index.php                | com.cnn.www/index.php                | 12as/435
>> www.cnn.com/news/business/index.html | com.cnn.www/news/business/index.html | 12as/2as/dcx/asd
>> www.cnn.com/news/sports/index.html   | com.cnn.www/news/sports/index.html   | 12as/2as/eds/scf
>>
>> The reason for doing this is that the rowkey will be shorter, as
>> recommended in the HBase schema design guidance (topic 6.3.2.3, Rowkey
>> Length).
>>
>> So what I need from you guys is to know whether I am correct here.
>> Also, if I am correct, which algorithm should I be using? I am using
>> Python over Thrift as a programming language, so code will be
>> overwhelming for me...
>>
>>