<< ...mod the hash with the number of machines I have... >>
This means that the key-to-region mapping will change with the number of
machines - so all your data will map to different regions if you add a new
machine to your cluster.
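A quick illustrative sketch of that problem (plain Python, not HBase code; md5 stands in for whatever hash you use, and the names are made up): when the prefix is hash mod machine-count, changing the machine count reassigns most keys to a new prefix.

```python
# Illustrative sketch (not HBase API): prefix = hash(key) % machine_count.
# Growing the cluster from 4 to 5 machines changes the prefix -- and thus
# the target region -- for most existing keys.
import hashlib

def prefix(key: str, machines: int) -> int:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % machines

keys = [f"row{i}" for i in range(1000)]
before = [prefix(k, 4) for k in keys]
after = [prefix(k, 5) for k in keys]  # one machine added

moved = sum(b != a for b, a in zip(before, after))
print(f"{moved / len(keys):.0%} of keys changed prefix")  # roughly 80% here
```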
<< What I do not understand is the advantages/disadvantages of having
regions that are too big vs regions that are too thin. >>
The disadvantage is that some regions (and consequently nodes) will have a
lot of data which will adversely affect things like storage (if dfs is
local to that node), block cache hit ratio, etc.
In general - per our experience using HBase, it's much more desirable to
disperse data up-front. If you are building indexes using MR, then you
probably don't need range-scan ability on your keys.
On 10/24/11 4:48 PM, "Sam Seigal" <[EMAIL PROTECTED]> wrote:
>According to my understanding, the way that HBase works is that on a
>brand new system, all keys will start going to a single region i.e. a
>single region server. Once that region
>reaches a max region size, it will split and then move to another
>region server, and so on and so forth.
>Initially hooking up HBase to a prod system, I am concerned about this
>behaviour, since a clean HBase cluster is going to experience a surge
>of traffic all going into one region server initially.
>This is the motivation behind pre-defining the regions, so the initial
>surge of traffic is distributed evenly.
>My strategy is to take the incoming data, calculate the hash and then
>mod the hash with the number of machines I have. I will split the
>regions according to the prefix # .
>This should, I think, provide for better data distribution when the
>cluster first starts up with one region / region server.
>These regions should then grow fairly uniformly. Once they reach a
>size like ~ 5G, I can do a rolling split.
>Also, I want to make sure my regions do not grow too much in size that
>when I end up adding more machines, it does not take a very long time
>to perform a rolling split.
>What I do not understand is the advantages/disadvantages of having
>regions that are too big vs regions that are too thin. What does this
>impact? Compaction time? Split time? What is the concern when it comes
>to how the architecture works? I think if I understand this better, I
>can manage my regions more efficiently.
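One way to get the even startup distribution Sam is after without tying prefixes to the machine count is to hash into a fixed number of buckets and pre-split on the bucket boundaries. A hypothetical sketch (the bucket count and helper names are made up; the actual pre-split would be supplied as split keys at table-creation time):

```python
# Hypothetical sketch: salt row keys with a fixed bucket count that is
# independent of (and larger than) the number of servers. Adding machines
# later moves whole regions around instead of re-mapping keys.
import hashlib

NUM_BUCKETS = 16  # fixed for the table's lifetime

def salted_key(row_key: str) -> bytes:
    h = int(hashlib.md5(row_key.encode()).hexdigest(), 16)
    bucket = h % NUM_BUCKETS
    return bytes([bucket]) + row_key.encode()  # one-byte salt prefix

# One split boundary per bucket; these would be the split keys handed to
# table creation so each bucket starts out in its own region.
split_points = [bytes([b]) for b in range(1, NUM_BUCKETS)]
```

With 16 buckets on 4 servers, each server starts with 4 regions; going to 5 servers only rebalances existing regions, and no key ever changes its bucket.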
>On Mon, Oct 24, 2011 at 3:23 PM, Nicolas Spiegelberg
><[EMAIL PROTECTED]> wrote:
>> Isn't a better strategy to create the HBase keys as
>> Key = hash(MySQL_key) + MySQL_key
>> That way you'll know your key distribution and can add new machines
>> seamlessly. I'm assuming that your rows don't overlap between any 2
>> machines. If so, you could append the MACHINE_ID to the key (not
>> prepend). I don't think you want the machine # as the first dimension of
>> your rows, because you want the data from new machines to be evenly spread
>> out across the existing regions.
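A minimal sketch of the key scheme Nicolas describes (md5 is an assumed hash choice; any uniform hash works). The hash prefix spreads writes evenly regardless of cluster size, and keeping the original key as a suffix keeps each row self-describing; the trade-off is that range scans in original-key order are no longer possible.

```python
# Sketch of: Key = hash(MySQL_key) + MySQL_key. The 16-byte md5 prefix
# (an assumed hash choice) spreads rows uniformly across regions; the
# original key is kept as a suffix so each row remains self-describing.
import hashlib

def hbase_key(mysql_key: str) -> bytes:
    digest = hashlib.md5(mysql_key.encode()).digest()  # 16 bytes
    return digest + mysql_key.encode()
```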
>> On 10/24/11 9:07 AM, "Stack" <[EMAIL PROTECTED]> wrote:
>>>On Mon, Oct 24, 2011 at 1:27 AM, Sam Seigal <[EMAIL PROTECTED]> wrote:
>>>> According to the HBase book , pre splitting tables and doing manual
>>>> splits is a better long term strategy than letting HBase handle it.
>>>It's good for getting a table off the ground, yes.
>>>> Since I do not know what the keys from the prod system are going to
>>>> look like, I am adding a machine number prefix to the row keys
>>>> and pre splitting the tables based on the prefix (prefix 0 goes to
>>>> machine A, prefix 1 goes to machine b etc).
>>>You don't need to do an in-order scan of the data? What does the rest of
>>>your row key look like?
>>>> Once I decide to add more machines, I can always do a rolling split
>>>> and add more prefixes.
>>>> Is this a good strategy for pre splitting the tables ?
>>>So, you'll start out with one region per server?
>>>What do you think the rate of splitting will be like? Are you using
>>>default region size or have you bumped this up?