HBase, mail # user - Hbase RowKey design schema


Re: Hbase RowKey design schema
Doug Meil 2013-08-29, 17:45

Hi there,

One thing to mention about the Bigtable paper is that they reverse the
hostname in the URL so that scans work with subdomains.

www.subdomain1.cnn.com -> com.cnn.subdomain1.www
www.subdomain2.cnn.com -> com.cnn.subdomain2.www

If you don't reverse the URL there isn't an easy scan (short of creating
another table to act as an index) for all the URLs under a domain.
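
To make the scan point concrete, here is a minimal Python sketch. It is an illustration only, not HBase client code: `reversed_url_key` and `prefix_scan` are hypothetical helper names, the example.org URL is made up, and a sorted Python list stands in for a table whose rows are ordered by rowkey (for ASCII keys the ordering matches byte order).

```python
from bisect import bisect_left
from urllib.parse import urlsplit


def reversed_url_key(url):
    """Build a rowkey with the hostname labels reversed, e.g.
    www.cnn.com/index.php -> com.cnn.www/index.php."""
    # urlsplit only recognises the host when a scheme or "//" is present.
    parts = urlsplit(url if "//" in url else "//" + url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + parts.path


def prefix_scan(sorted_keys, prefix):
    """Return the keys that start with prefix, mimicking a prefix scan
    over lexicographically sorted rowkeys."""
    start = bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order: no later key can match
        result.append(key)
    return result


urls = [
    "www.subdomain1.cnn.com",
    "www.example.org/a",  # made-up URL outside the domain
    "www.subdomain2.cnn.com",
    "www.cnn.com/index.php",
]
keys = sorted(reversed_url_key(u) for u in urls)
cnn_keys = prefix_scan(keys, "com.cnn.")
```

With the reversed keys, everything under cnn.com sits in one contiguous range, so a single prefix scan on "com.cnn." finds it. With the unreversed URLs the same rows would be spread out by their leading "www.subdomainN" labels.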

Regarding the good question below about use-cases, the RefGuide says in
6.3.2.3 "Keep them as short as is reasonable such that they can still be
useful for required data access".

Shorter rowkeys are usually a good thing, but shorter isn't better if it
doesn't work for what you are trying to do.   :-)
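
As a rough illustration of that trade-off, the sketch below compares two readable reversed-URL keys with fixed-length hashed versions of the same keys. MD5 is used here only as an assumed stand-in for "some scheme that shortens the key without preserving its prefix"; it is not the algorithm proposed in the question, and `hashed_key` is a hypothetical helper name.

```python
import hashlib
from os.path import commonprefix

readable = [
    "com.cnn.www/news/business/index.html",
    "com.cnn.www/news/sports/index.html",
]


def hashed_key(key):
    """Fixed-length key (32 hex characters, i.e. 16 raw bytes) that
    keeps no trace of the original prefix."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()


hashed = [hashed_key(k) for k in readable]

# The readable keys share a long prefix, so one prefix scan covers both.
shared_readable = commonprefix(readable)
# The hashed keys share leading characters only by chance, so rows from
# the same domain end up scattered across the table.
shared_hashed = commonprefix(hashed)
```

The hashed keys are a predictable length, but a scan for "everything under com.cnn.www/news/" is no longer possible without a separate index table.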

On 8/29/13 10:18 AM, "Shahab Yunus" <[EMAIL PROTECTED]> wrote:

>What advantage will you gain by compressing? Less space? But then it
>will add compression/decompression performance overhead. A trade-off, but
>an especially significant one, as space is cheap and redundancy is OK with
>such data stores.
>
>Having said that, more importantly, what are your read use-cases or access
>patterns? That should drive your decision about row key design.
>
>Regards,
>Shahab
>
>
>On Thu, Aug 29, 2013 at 5:21 AM, Wasim Karani
><[EMAIL PROTECTED]> wrote:
>
>> I am using HBase to store webtable content, the way Google uses
>> Bigtable.
>> For reference, see the Google Bigtable paper.
>> My question is about the RowKey and how we should form it.
>> Google saves the URL in reverse order, as you can see in the PDF
>> document ("com.cnn.www"), so that all the links associated with cnn.com
>> are managed in the same block of GFS, which is a lot easier to scan.
>> I could do the same thing Google does, but wouldn't it be cool if I
>> used some algorithm to compress the URL?
>>
>> For example:
>>
>> RowKey                               | Google Bigtable                      | Algorithm output
>> www.cnn.com/index.php                | com.cnn.www/index.php                | 12as/435
>> www.cnn.com/news/business/index.html | com.cnn.www/news/business/index.html | 12as/2as/dcx/asd
>> www.cnn.com/news/sports/index.html   | com.cnn.www/news/sports/index.html   | 12as/2as/eds/scf
>> The reason for doing this is that the rowkey will be shorter, as
>> recommended by the HBase schema design guidance (topic 6.3.2.3,
>> Rowkey Length).
>>
>> So what I need from you guys is to know whether I am correct here.
>> Also, if I am correct, which algorithm should I be using? I am using
>> Python over Thrift as a programming language, so code will be
>> overwhelming for me...
>>
>>