HBase >> mail # user >> Row distribution


Re: Row distribution
On Thu, Jul 26, 2012 at 10:34 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> > Is there any specific best practice on how many regions one
> > should split a table into?
>
> As always, "it depends". Usually you don't want your RegionServers to serve
> more than 50 regions or so. The fewer the better. But at the same time you
> usually want your regions to be distributed over the whole cluster (so that
> you use all of its capacity). So, it might make sense to start with one
> region per RS (if your writes are more or less evenly distributed across the
> pre-split regions) if you don't know your data size. If you know that you'll
> need more regions because of how big your data is, then you might
> create more regions at the start (with pre-splitting), so that you avoid
> region split operations (you really want to avoid them if you can).
> Of course, you need to take into account the other tables in your cluster as
> well, i.e. "usually not more than 50 regions" in total per RegionServer.
>
>
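The rule of thumb above can be turned into a small calculation. This is only a sketch: the method name `initialRegions`, the maximum region size parameter, and the example numbers are assumptions made for illustration, not values from this thread (apart from the "about 50 regions per RegionServer" guideline).

```java
final class RegionCountSketch {
    /**
     * Suggests how many regions to pre-split a table into.
     * All inputs are estimates supplied by the caller (hypothetical helper).
     */
    static int initialRegions(int regionServers, long expectedDataBytes,
                              long maxRegionBytes, int maxRegionsPerServer) {
        // Start with one region per RegionServer so writes use the whole cluster.
        long regions = regionServers;
        // If the expected data will not fit into that many regions, pre-split further
        // (ceiling division) so that later region splits are avoided.
        long neededForData = (expectedDataBytes + maxRegionBytes - 1) / maxRegionBytes;
        regions = Math.max(regions, neededForData);
        // Stay under the per-server guideline. This cap ignores other tables,
        // which also count against the same per-server budget.
        long cap = (long) regionServers * maxRegionsPerServer;
        return (int) Math.min(regions, cap);
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        // Assumed example: 10 RegionServers, ~100 GiB of data, 4 GiB per region.
        System.out.println(initialRegions(10, 100 * gib, 4 * gib, 50));
    }
}
```

If other tables already live on the cluster, the cap should be lowered by however many regions they occupy per server.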
> Thanks for the detailed explanation. I understand regions per
> RegionServer: each region is essentially a range of rows, and the regions
> of a given table are distributed across the cluster. But who decides how
> many RegionServers to have in the cluster?
>
> > Just one more question, in the split keys that you described below, is it
> > based on the first byte value of the Key?
>
> Yes. And the first byte contains a readable character, because of
> Bytes.toBytes(String.valueOf(i)). If you want to prefix with (byte) 0, ...,
> (byte) 9 (i.e. with 0x00, 0x01, ..., 0x09), then there is no need to convert
> to a String.
>
>
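To make the "first byte" point concrete, here is a minimal sketch of how a row key maps to one of the ten pre-split regions. It assumes only that row keys are compared as unsigned bytes in lexicographic order, as HBase does; the class and method names are hypothetical and no HBase classes are used.

```java
import java.nio.charset.StandardCharsets;

final class RegionLookupSketch {
    /** Start keys of the ten pre-split regions: "", "1", "2", ..., "9". */
    static byte[][] startKeys() {
        byte[][] keys = new byte[10][];
        keys[0] = new byte[0]; // the first region starts at the empty key
        for (int i = 1; i <= 9; i++) {
            keys[i] = String.valueOf(i).getBytes(StandardCharsets.UTF_8);
        }
        return keys;
    }

    /** Lexicographic comparison that treats each byte as unsigned. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    /** Index of the last region whose start key is <= the row key. */
    static int regionIndex(byte[][] startKeys, byte[] rowKey) {
        int index = 0;
        for (int i = 1; i < startKeys.length; i++) {
            if (compareUnsigned(startKeys[i], rowKey) <= 0) index = i;
        }
        return index;
    }

    public static void main(String[] args) {
        byte[][] keys = startKeys();
        // Readable prefix '3' (0x33) falls into the region starting at "3".
        System.out.println(regionIndex(keys, "3abc".getBytes(StandardCharsets.UTF_8)));
        // Raw prefix 0x03 sorts before "1" (0x31), so it falls into the first region.
        System.out.println(regionIndex(keys, new byte[] {0x03, 'a', 'b', 'c'}));
    }
}
```

The practical consequence is that the salt and the split keys must use the same encoding: keys prefixed with raw bytes 0x00..0x09 would all land in the first region of a table split on "1".."9".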
How does this mechanism differ from RegionSplitter, which uses the default
string MD5 split? I'm just trying to understand how the resulting key ranges
differ.
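For comparison, a string-based split over hashed keys divides a hexadecimal key space into equal slices, instead of splitting on the first character "1".."9". The sketch below only illustrates that idea and is not the RegionSplitter implementation; the 8-hex-digit range and the name `hexSplits` are assumptions.

```java
final class HexSplitSketch {
    /**
     * Returns numRegions - 1 split points that cut the assumed key space
     * "00000000".."ffffffff" into equally sized ranges.
     */
    static String[] hexSplits(int numRegions) {
        if (numRegions < 2) {
            return new String[0]; // a single region needs no split keys
        }
        long range = 0x100000000L; // 2^32 possible 8-digit hex prefixes
        String[] splits = new String[numRegions - 1];
        for (int i = 1; i < numRegions; i++) {
            long point = range * i / numRegions;
            splits[i - 1] = String.format("%08x", point);
        }
        return splits;
    }

    public static void main(String[] args) {
        for (String split : hexSplits(4)) {
            System.out.println(split);
        }
    }
}
```

Such split points only balance load if the row keys really start with uniformly distributed hex characters (for example, an MD5 hex digest). With the "1".."9" split keys from this thread, the keys need a decimal digit prefix instead.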

> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
> On Thu, Jul 26, 2012 at 11:43 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>
> > On Thu, Jul 26, 2012 at 7:16 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> >
> > > Looks like you have only one region in your table. Right?
> > >
> > > If you want your writes to be distributed from the start (without waiting
> > > for HBase to fill the table enough to split it into many regions), you
> > > should pre-split your table. In your case you can pre-split the table into
> > > 10 regions (just an example, you can define more), with start keys: "",
> > > "1", "2", ..., "9" [1].
> > >
> > Just one more question, in the split keys that you described below, is it
> > based on the first byte value of the key?
> >
> >
> > > Btw, since you are salting your keys to achieve distribution, you might
> > > also find this small lib helpful, which implements most of the stuff for
> > > you [2].
> > >
> > > Hope this helps.
> > >
> > > Alex Baranau
> > > ------
> > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> > >
> > > [1]
> > >
> > >     // 9 split keys ("1".."9") give 10 regions; the first region,
> > >     // starting with the empty key, is created automatically
> > >     byte[][] splitKeys = new byte[9][];
> > >     for (int i = 0; i < splitKeys.length; i++) {
> > >       splitKeys[i] = Bytes.toBytes(String.valueOf(i + 1));
> > >     }
> > >
> > >     HBaseAdmin admin = new HBaseAdmin(conf);
> > >     admin.createTable(tableDescriptor, splitKeys);
> > >
> > > [2]
> > > https://github.com/sematext/HBaseWD
> > >
> > > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > >
> > > On Wed, Jul 25, 2012 at 7:54 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Wed, Jul 25, 2012 at 6:53 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hi Mohit,
> > > > >
> > > > > 1. When talking about particular table:
> > > > >
> > > > > To view the row distribution you can check how regions are
> > > > > distributed. Each region is defined by its start/stop keys, so