HBase >> mail # user >> pre splitting tables


Re: pre splitting tables


<< ...mod the hash with the number of machines I have... >>
This means that the mapping will change with the number of machines - so all
your data will map to different regions if you add a new machine to your
cluster.
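A quick sketch (not from the thread) of why this matters: hashing mod the
machine count reassigns most keys whenever that count changes. The function
and key names below are illustrative only.

```python
import hashlib

def bucket(key: str, n: int) -> int:
    # Use a stable hash; Python's built-in hash() is randomized per process.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"row-{i}" for i in range(10000)]
moved = sum(1 for k in keys if bucket(k, 4) != bucket(k, 5))
print(f"{moved / len(keys):.0%} of keys change prefix going from 4 to 5 machines")
```

With a uniform hash, roughly four out of five keys land in a different
bucket after growing from 4 to 5 machines, which is the remapping being
warned about here.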
<< What I do not understand is the advantages/disadvantages of having
regions that are too big vs regions that are too thin. >>
The disadvantage is that some regions (and consequently nodes) will have a
lot of data which will adversely affect things like storage (if dfs is
local to that node), block cache hit ratio, etc.

In general - per our experience using HBase, it's much more desirable to
disperse data up-front. If you are building indexes using MR, then you
probably don't need range scan ability on your keys.

Thanks
Karthik

On 10/24/11 4:48 PM, "Sam Seigal" <[EMAIL PROTECTED]> wrote:

>According to my understanding, the way that HBase works is that on a
>brand new system, all keys will start going to a single region i.e. a
>single region server. Once that region
>reaches a max region size, it will split and then move to another
>region server, and so on and so forth.
>
>Initially hooking up HBase to a prod system, I am concerned about this
>behaviour, since a clean HBase cluster is going to experience a surge
>of traffic all going into one region server initially.
>This is the motivation behind pre-defining the regions, so the initial
>surge of traffic is distributed evenly.
>
>My strategy is to take the incoming data, calculate the hash and then
>mod the hash with the number of machines I have. I will split the
>regions according to the prefix number.
>This should, I think, provide for better data distribution when the
>cluster first starts up with one region / region server.
>
>These regions should then grow fairly uniformly. Once they reach a
>size like ~ 5G, I can do a rolling split.
>
>Also, I want to make sure my regions do not grow so much in size that,
>when I end up adding more machines, it takes a very long time
>to perform a rolling split.
>
>What I do not understand is the advantages/disadvantages of having
>regions that are too big vs regions that are too thin. What does this
>impact? Compaction time? Split time? What is the
>concern when it comes to how the architecture works? I think if I
>understand this better, I can manage my regions more efficiently.
>
>
>
>On Mon, Oct 24, 2011 at 3:23 PM, Nicolas Spiegelberg
><[EMAIL PROTECTED]> wrote:
>> Isn't a better strategy to create the HBase keys as
>>
>> Key = hash(MySQL_key) + MySQL_key
>>
>> That way you'll know your key distribution and can add new machines
>> seamlessly.  I'm assuming that your rows don't overlap between any 2
>> machines.  If so, you could append the MACHINE_ID to the key (not
>> prepend).  I don't think you want the machine # as the first dimension
>>on
>> your rows, because you want the data from new machines to be evenly
>>spread
>> out across the existing regions.
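The `hash(MySQL_key) + MySQL_key` scheme suggested above could look
something like this (an illustrative sketch; the fixed-width hex prefix and
the names are my assumptions, chosen so the original key stays recoverable):

```python
import hashlib

PREFIX_LEN = 8  # assumption: fixed-width hex prefix, so decoding is trivial

def hbase_key(mysql_key: str) -> str:
    # Key = hash(MySQL_key) + MySQL_key, per the suggestion above.
    return hashlib.md5(mysql_key.encode()).hexdigest()[:PREFIX_LEN] + mysql_key

def original_key(row_key: str) -> str:
    # The fixed-width prefix makes the MySQL key trivially recoverable.
    return row_key[PREFIX_LEN:]
```

Because the prefix depends only on the key itself and not on the machine
count, the key-to-region mapping stays stable as machines are added, which
is the "add new machines seamlessly" point being made here.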
>>
>>
>> On 10/24/11 9:07 AM, "Stack" <[EMAIL PROTECTED]> wrote:
>>
>>>On Mon, Oct 24, 2011 at 1:27 AM, Sam Seigal <[EMAIL PROTECTED]> wrote:
>>>> According to the HBase book, pre splitting tables and doing manual
>>>> splits is a better long term strategy than letting HBase handle it.
>>>>
>>>
>>>It's good for getting a table off the ground, yes.
>>>
>>>
>>>> Since I do not know what the keys from the prod system are going to
>>>> look like, I am adding a machine number prefix to the row keys
>>>> and pre splitting the tables based on the prefix (prefix 0 goes to
>>>> machine A, prefix 1 goes to machine B etc).
>>>>
>>>
>>>You don't need to do an in-order scan of the data?  What does the rest of
>>>your row key look like?
>>>
>>>
>>>> Once I decide to add more machines, I can always do a rolling split
>>>> and add more prefixes.
>>>>
>>>
>>>Yes.
>>>
>>>> Is this a good strategy for pre splitting the tables ?
>>>>
>>>
>>>So, you'll start out with one region per server?
>>>
>>>What do you think the rate of splitting will be like?  Are you using
>>>default region size or have you bumped this up?