HBase dev mailing list: Adding a new region server or splitting an old region in a Hash-partitioned HBase Data Store

Re: Adding a new region server or splitting an old region in a Hash-partitioned HBase Data Store
Dear Nick,

Thanks a lot for the nice explanation. I'm just left with one small
confusion:

As you said: "Your logical partitioning remains the same whether it's being
served by N, 2N, or 3.5N+2 RegionServers."

When a region is split into two, I guess the logical partitioning changes,
right? Could you kindly clarify a bit more?
And my question is: without making any change to my initial
row-key-generation function, how will new writes go to the new regions?
I assume it is hard to predict the number of RS initially, as well as to
create pre-split regions, in very large-scale production systems. I am not
worried about the default load-balancing behaviour of HBase; St.Ack and
you have already explained that clearly.

For example: as indicated in Lars George's book, where he used <# of
RS> when computing the prefix, I guess <# of Regions> could also be used
(if regions are pre-split):
----------------------------------------------------------------------------------------------------------------------------
*Salting*
You can use a salting prefix to the key that guarantees a spread of all
rows across all region servers. For example:

byte prefix = (byte) (Long.hashCode(timestamp) % <number of region servers>);
byte[] rowkey = Bytes.add(Bytes.toBytes(prefix), Bytes.toBytes(timestamp));

This formula will generate enough prefix numbers to ensure that rows are
sent to all region servers. Of course, the formula assumes a *specific*
number of servers, and if you are planning to *grow your cluster* you
should set this number to a *multiple* instead.
----------------------------------------------------------------------------------------------------------------------------
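For reference, a minimal compilable sketch of the book's "use a multiple
instead" advice, with the salt decoupled from the live RS/region count (the
32-bucket constant and the class/method names are my own assumptions, not
from the book):

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedRowKeys {

    // Assumption: 32 buckets, chosen as a generous multiple of the expected
    // RegionServer count, so this constant never changes as the cluster grows.
    static final int NUM_BUCKETS = 32;

    // Builds a rowkey of the form [1 salt byte][8-byte big-endian timestamp].
    static byte[] saltedRowKey(long timestamp) {
        int hash = (int) (timestamp ^ (timestamp >>> 32)); // same as Long.hashCode()
        // Normalize into [0, NUM_BUCKETS) even when the hash is negative.
        byte salt = (byte) ((hash % NUM_BUCKETS + NUM_BUCKETS) % NUM_BUCKETS);
        return Bytes.add(new byte[] { salt }, Bytes.toBytes(timestamp));
    }
}

With a fixed bucket count like this, writes keep landing in the same 32
buckets no matter how many RegionServers or regions currently exist; only
the physical placement of those buckets changes.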

Thanks again ...


Regards,
Joarder Kamal
On 3 July 2013 03:37, Nick Dimiduk <[EMAIL PROTECTED]> wrote:

> Hi Joarder,
>
> I think you're slightly confused about the impact of using a hashed
> (sometimes called "salted") prefix for your rowkeys. This strategy for
> rowkey design has an impact on the logical ordering of your data, not
> necessarily the physical distribution of your data. In HBase, these are
> orthogonal concerns. It means that to execute a bucket-agnostic query, the
> client must initiate N scans. However, there's no guarantee that all
> regions starting with the same hash land on the same RegionServer. Region
> assignment is a complex beast; as I understand, it's based on a randomish,
> load-based assignment.
>
> Take a look at your existing table distributed on a size-N cluster. Do all
> regions that fall within the first bucket sit on the same RegionServer?
> Likely not. However, look at the number of regions assigned to each
> RegionServer. This should be close to even. Adding a new RegionServer to
> the cluster will result in some of those regions migrating from the other
> servers to the new one. The impact will be a decrease in the average number
> of regions served per RegionServer.
>
> Your logical partitioning remains the same whether it's being served by N,
> 2N, or 3.5N+2 RegionServers. Your client always needs to execute that
> bucket-agnostic query as N scans, touching each of the N buckets. Precisely
> which RegionServers are touched by any given scan depends entirely on how
> the balancer has distributed load on your cluster.
>
> Thanks,
> Nick
>
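To make the N-scan fan-out Nick describes concrete, here is a minimal
sketch of a bucket-agnostic query, one Scan per salt bucket (the bucket
count, the class and method names, and the use of the newer Table client
API are my assumptions; the scans could equally run in parallel):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BucketFanout {

    static final int NUM_BUCKETS = 32; // must match the write-side salt

    // Scans the time range [startTs, stopTs) in every salt bucket,
    // issuing one Scan per bucket.
    static void scanAllBuckets(Table table, long startTs, long stopTs)
            throws IOException {
        for (int b = 0; b < NUM_BUCKETS; b++) {
            byte[] salt = new byte[] { (byte) b };
            Scan scan = new Scan(
                    Bytes.add(salt, Bytes.toBytes(startTs)),  // inclusive start row
                    Bytes.add(salt, Bytes.toBytes(stopTs)));  // exclusive stop row
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    // Rows come back ordered within a bucket only; merge
                    // across buckets client-side if global order matters.
                    process(result);
                }
            }
        }
    }

    static void process(Result result) { /* application-specific */ }
}

Because the salt byte is the leading key component, each Scan touches only
the regions of one bucket, wherever the balancer happens to have placed
them.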
> On Thu, Jun 27, 2013 at 5:02 PM, Joarder KAMAL <[EMAIL PROTECTED]> wrote:
>
> > Thanks St.Ack for mentioning the load-balancer.
> >
> > But my question was two-fold:
> > Case-1. If a new RS is added, then the load-balancer will do its job,
> > assuming no new region has been created in the meantime. // As you've
> > already answered.
> >
> > Case-2. Whether or not a new RS is added, if an existing region is split
> > into two, then how will the new writes go to the new region? Because,
> > let's say, initially the hash function was calculated with *N* Regions
> > and now there are *N+1* Regions in the cluster.
> >
> > In that case, do I need to change the Hash function and reshuffle all