HBase, mail # user - Row distribution


Re: Row distribution
Mohit Anchlia 2012-07-26, 19:50
On Thu, Jul 26, 2012 at 10:34 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> > Is there any specific best practice on how many regions one
> > should split a table into?
>
> As always, "it depends". Usually you don't want your RegionServers to serve
> more than 50 regions or so. The fewer the better. But at the same time you
> usually want your regions to be distributed over the whole cluster (so that
> you use all of its power). So, if you don't know your data size, it might
> make sense to start with one region per RS (provided your writes are more or
> less evenly distributed across the pre-split regions). If you know that
> you'll need more regions because of how big your data is, then you might
> create more regions at the start (with pre-splitting), so that you avoid
> region split operations (you really want to avoid them if you can).
> Of course, you need to take into account other tables in your cluster as
> well. I.e. "usually not more than 50 regions" total per RegionServer.
>
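The rule of thumb above can be sketched in plain Java. This is only an illustration: the helper name `suggestRegions`, its parameters, and the way the 50-region figure is applied as a per-server budget are assumptions, not anything HBase provides.

```java
class InitialRegionCount {
    // Rule of thumb quoted in the thread; an assumption here, not a hard HBase limit.
    static final int MAX_REGIONS_PER_SERVER = 50;

    // Hypothetical helper: suggests how many regions to pre-split a new table into.
    static int suggestRegions(int regionServers, int existingRegionsPerServer,
                              long expectedDataBytes, long maxRegionBytes) {
        // Start with one region per RegionServer so writes reach every node.
        long byCluster = regionServers;
        // Regions needed so that none has to split as data grows (ceiling division).
        long bySize = (expectedDataBytes + maxRegionBytes - 1) / maxRegionBytes;
        long wanted = Math.max(byCluster, bySize);
        // Leave room for the regions of other tables already hosted on each server.
        long budget = (long) regionServers
                * Math.max(0, MAX_REGIONS_PER_SERVER - existingRegionsPerServer);
        return (int) Math.max(1, Math.min(wanted, budget));
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        // Unknown data size: one region per RegionServer.
        System.out.println(suggestRegions(10, 5, 0, 10 * gb));
        // Known large data size: enough regions to avoid later splits.
        System.out.println(suggestRegions(10, 5, 400 * gb, 10 * gb));
    }
}
```

With 10 RegionServers and no size estimate this suggests 10 regions; with 400 GB expected and a 10 GB region size (both made-up figures) it suggests 40.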
>
> Thanks for the detailed explanation. I understand the regions per
> RegionServer, which are essentially ranges of rows distributed across the
> cluster for a given table. But who decides how many RegionServers to
> have in the cluster?
>
> > Just one more question, in the split keys that you described below, is it
> > based on the first byte value of the Key?
>
> Yes. And the first byte contains a readable char, because of
> Bytes.toBytes(String.valueOf(i)). If you want to prefix with (byte) 0, ...,
> (byte) 9 (i.e. with 0x00, 0x01, ..., 0x09), then there is no need to convert
> to String.
>
>
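To make the difference between the two prefix styles concrete, here is a small sketch in plain Java. It assumes that HBase's `Bytes.toBytes(String)` encodes the string as UTF-8 and uses `String.getBytes` as a stand-in, so that it runs without HBase on the classpath. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class SplitKeyBytes {
    // Stand-in for Bytes.toBytes(String.valueOf(i)): the printable digit, UTF-8 encoded.
    static byte[] stringKey(int i) {
        return String.valueOf(i).getBytes(StandardCharsets.UTF_8);
    }

    // Raw one-byte prefix: the number itself, not its printable digit.
    static byte[] rawKey(int i) {
        return new byte[] { (byte) i };
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 9; i++) {
            // The string key for 1 is the character '1' (0x31); the raw key is 0x01.
            System.out.printf("i=%d  string key=0x%02X  raw key=0x%02X%n",
                    i, stringKey(i)[0], rawKey(i)[0]);
        }
    }
}
```

Either style works as a split key, as long as the row keys written to the table use the same kind of prefix.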
How does this mechanism differ from RegionSplitter, which uses the default
string MD5 split? I'm just trying to understand how the resulting key ranges
differ.

> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
> On Thu, Jul 26, 2012 at 11:43 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>
> > On Thu, Jul 26, 2012 at 7:16 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> >
> > > Looks like you have only one region in your table. Right?
> > >
> > > If you want your writes to be distributed from the start (without waiting
> > > for HBase to fill the table enough to split it into many regions), you
> > > should pre-split your table. In your case you can pre-split the table into
> > > 10 regions (just an example, you can define more), with start keys: "",
> > > "1", "2", ..., "9" [1].
> > >
> > Just one more question, in the split keys that you described below, is it
> > based on the first byte value of the Key?
> >
> >
> > > Btw, since you are salting your keys to achieve distribution, you might
> > > also find this small lib helpful, which implements most of the stuff for
> > > you [2].
> > >
> > > Hope this helps.
> > >
> > > Alex Baranau
> > > ------
> > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> > > Solr
> > >
> > > [1]
> > >
> > >     byte[][] splitKeys = new byte[9][];
> > >     // the first region, starting with the empty key, is created automatically,
> > >     // so only the nine split keys "1" .. "9" are needed
> > >     for (int i = 0; i < splitKeys.length; i++) {
> > >       splitKeys[i] = Bytes.toBytes(String.valueOf(i + 1));
> > >     }
> > >
> > >     HBaseAdmin admin = new HBaseAdmin(conf);
> > >     admin.createTable(tableDescriptor, splitKeys);
> > >
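As a rough illustration of how salted keys land in the ten pre-split regions, here is a self-contained sketch that needs no HBase dependency. The salting scheme (one decimal digit derived from `hashCode`) and all class and method names are assumptions for illustration; this is not the HBaseWD API, and the byte comparison only mimics the unsigned lexicographic ordering HBase uses for row keys.

```java
import java.nio.charset.StandardCharsets;

class SaltedKeyRegions {
    static final int BUCKETS = 10;

    // Nine split keys "1".."9"; the region before "1" (empty start key) is implicit.
    static byte[][] splitKeys() {
        byte[][] keys = new byte[BUCKETS - 1][];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = String.valueOf(i + 1).getBytes(StandardCharsets.UTF_8);
        }
        return keys;
    }

    // Hypothetical salting scheme: prefix the key with one digit derived from its hash.
    static String salt(String rowKey) {
        int bucket = Math.floorMod(rowKey.hashCode(), BUCKETS);
        return bucket + rowKey;
    }

    // Unsigned lexicographic byte comparison.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    // Region index of a row: the number of split keys that are <= the row key.
    static int regionIndex(byte[] rowKey, byte[][] splits) {
        int index = 0;
        for (byte[] split : splits) {
            if (compare(rowKey, split) >= 0) index++;
        }
        return index;
    }

    public static void main(String[] args) {
        byte[][] splits = splitKeys();
        for (String key : new String[] { "event-0001", "event-0002", "event-0003" }) {
            String salted = salt(key);
            int region = regionIndex(salted.getBytes(StandardCharsets.UTF_8), splits);
            System.out.println(salted + " -> region " + region);
        }
    }
}
```

Because each salted key starts with a digit "0" to "9", its region index equals that digit, so sequential keys spread across all ten regions. Reads then have to account for the prefix, for example by scanning every bucket.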
> > > [2]
> > > https://github.com/sematext/HBaseWD
> > >
> > > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > >
> > > On Wed, Jul 25, 2012 at 7:54 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Wed, Jul 25, 2012 at 6:53 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hi Mohit,
> > > > >
> > > > > 1. When talking about particular table:
> > > > >
> > > > > To view the row distribution you can check how the regions are
> > > > > distributed. Each region is defined by its start/stop key, so