HBase >> mail # user >> Read access pattern

Re: Read access pattern
bq. The downside that I see is the bucket_number that we have to
maintain both at the time of reading/writing and update in case of
cluster restructuring.

I agree that this maintenance can be painful. However, Phoenix
(https://github.com/forcedotcom/phoenix) now supports salting,
automating this maintenance.  If you want to salt your table, just add a
SALT_BUCKETS = <n> property at the end of your DDL statement, where <n>
is the total number of buckets (up to a max of 256).  For example:

CREATE TABLE t (date_time DATE NOT NULL, event_id CHAR(15) NOT NULL
     CONSTRAINT pk PRIMARY KEY (date_time, event_id))
     SALT_BUCKETS = 10

This will add one byte at the beginning of your row key, whose value is
formed by hashing the row key and mod-ing with the number of buckets (10
here). The salt byte is added automatically on every upsert, and queries
are automatically distributed across the buckets, with the results
combined as expected.
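
To make the mechanics concrete, here is a minimal sketch of what a one-byte salt prefix amounts to. It is not Phoenix's code: the class and method names are hypothetical, and Arrays.hashCode is only a stand-in, since this thread does not say which hash function Phoenix uses. Note that the prefix is derived from the row key itself, so it is deterministic rather than random.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not taken from Phoenix.
final class SaltSketch {

    /** Bucket number for a row key: hash(rowKey) mod buckets, stored in one byte. */
    static byte saltByte(byte[] rowKey, int buckets) {
        if (buckets < 1 || buckets > 256) {
            throw new IllegalArgumentException("buckets must be in 1..256");
        }
        // Arrays.hashCode is an assumed stand-in hash; floorMod keeps the result non-negative.
        return (byte) Math.floorMod(Arrays.hashCode(rowKey), buckets);
    }

    /** Row key as it would be stored: one salt byte followed by the original key. */
    static byte[] saltedKey(byte[] rowKey, int buckets) {
        byte[] out = new byte[rowKey.length + 1];
        out[0] = saltByte(rowKey, buckets);
        System.arraycopy(rowKey, 0, out, 1, rowKey.length);
        return out;
    }

    /**
     * A range scan over unsalted keys has to be repeated once per bucket.
     * This returns the given boundary key prefixed with every possible salt byte.
     */
    static byte[][] perBucketKeys(byte[] boundaryKey, int buckets) {
        byte[][] keys = new byte[buckets][];
        for (int b = 0; b < buckets; b++) {
            keys[b] = new byte[boundaryKey.length + 1];
            keys[b][0] = (byte) b;
            System.arraycopy(boundaryKey, 0, keys[b], 1, boundaryKey.length);
        }
        return keys;
    }

    public static void main(String[] args) {
        byte[] key = "2013-04-30|event0000001".getBytes(StandardCharsets.UTF_8);
        byte[] salted = saltedKey(key, 10);
        System.out.println("bucket = " + (salted[0] & 0xFF));
        System.out.println("scans needed = " + perBucketKeys(key, 10).length);
    }
}
```

The write path only needs saltedKey; the read path is where the cost shows up, because a single logical range becomes one scan per bucket whose results then have to be merged. That per-bucket fan-out and merge is the bookkeeping the SALT_BUCKETS property takes off your hands.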



On 04/30/2013 09:17 AM, Shahab Yunus wrote:
> Well those are *some* words :) Anyway, can you explain in a bit more detail
> why you feel so strongly about this design/approach? Salting is not the
> only option mentioned here, and static hashing can be used as well. Plus,
> even in the case of salting, wouldn't the distributed scan take care of it?
> The downside that I see is the bucket_number that we have to maintain both
> at the time of reading/writing and update in case of cluster restructuring.
> Thanks,
> Shahab
> On Tue, Apr 30, 2013 at 11:57 AM, Michael Segel wrote:
>> Geez that's a bad article.
>> Never salt.
>> And yes there's a difference between using a salt and using the first 2-4
>> bytes from your MD5 hash.
>> (Hint: Salts are random. Your hash isn't. )
>> Sorry to be-itch but it's a bad idea and it shouldn't be propagated.
>> On Apr 29, 2013, at 10:17 AM, Shahab Yunus <[EMAIL PROTECTED]> wrote:
>>> I think you cannot simply use the scanner to do a range scan here, as your
>>> keys are not monotonically increasing. You need to apply logic to
>>> decode/reverse the mechanism that you used to hash your keys at the
>>> time of writing. You might want to check out the SemaText library, which
>>> does distributed scans and seems to handle the scenarios that you want to
>>> implement.
>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>> On Mon, Apr 29, 2013 at 11:03 AM, <[EMAIL PROTECTED]> wrote:
>>>> Hi,
>>>> I have a rowkey defined by :
>>>>         getMD5AsHex(Bytes.toBytes(myObjectId)) + String.format("%19d\n",
>>>> (Long.MAX_VALUE - changeDate.getTime()));
>>>> How could I get the previous and next row for a given rowkey ?
>>>> For instance, I have the following ordered keys :
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370673172227807
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674468022807
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674468862807
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674984237807
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674987271807
>>>> If I choose the rowkey :
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674468862807, what would be the
>>>> correct scan to get the previous and next key ?
>>>> Result would be :
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674468022807
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674984237807
>>>> Thank you !
>>>> R.