HBase >> mail # user >> Row distribution


Mohit Anchlia 2012-07-25, 05:32
Adrien Mogenet 2012-07-25, 05:59
Alex Baranau 2012-07-25, 13:53
Mohit Anchlia 2012-07-25, 23:54
Alex Baranau 2012-07-26, 14:16
Mohit Anchlia 2012-07-26, 15:43
Alex Baranau 2012-07-26, 17:34
Mohit Anchlia 2012-07-26, 19:50
Alex Baranau 2012-07-26, 20:29
Re: Row distribution
On Thu, Jul 26, 2012 at 1:29 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> > But who decides on how many regionservers to
> > have in the cluster?
>
> A RegionServer is a process started on each slave in your cluster, so the
> number of RSs is the same as the number of slaves. You might want to take a
> look at one of the Intro to HBase presentations (which have pictures!) [1]
>
> > How different is this mechanism compared to RegionSplitter, which uses
> > the default MD5 string split? Just trying to understand how the key
> > ranges differ.
>
> You can use any of the splitter algorithms, but note that they probably will
> not take into account the row keys you are going to use. E.g.:
> * if your row keys have format <country><state><company><...> and
> * you know that you will have most of the data about US companies (if e.g.
> this is your target audience) then
> * based on the example I gave, you can create regions defined by these
> start keys:
> ""
> "US"
> "US_FL"
> "US_KN"
> "US_MS"
> "US_NC"
> "US_VM"
> "V"
> so that data is more or less evenly distributed (note: there's no need to
> split other countries into regions as they will have a small amount of
> data).
>
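A minimal sketch of how those start keys carve up the key space, using plain Java only. The class and method names (`SplitKeySketch`, `regionIndexFor`, `splitKeys`) are made up for illustration, and the sample row keys assume an underscore-separated `<country>_<state>_<company>` layout. Keys are compared as unsigned bytes, lexicographically, which is how HBase orders rows.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: shows which pre-split region a row key would fall into.
class SplitKeySketch {
    // Start keys from the example above; "" marks the start of the first region.
    static final String[] START_KEYS =
        {"", "US", "US_FL", "US_KN", "US_MS", "US_NC", "US_VM", "V"};

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Lexicographic comparison on unsigned byte values.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Index of the last region whose start key is <= rowKey.
    static int regionIndexFor(String rowKey) {
        byte[] key = bytes(rowKey);
        int idx = 0;
        for (int i = 0; i < START_KEYS.length; i++) {
            if (compare(bytes(START_KEYS[i]), key) <= 0) {
                idx = i;
            }
        }
        return idx;
    }

    // Split points: every start key except the leading empty one.
    static byte[][] splitKeys() {
        byte[][] keys = new byte[START_KEYS.length - 1][];
        for (int i = 1; i < START_KEYS.length; i++) {
            keys[i - 1] = bytes(START_KEYS[i]);
        }
        return keys;
    }
}
```

With these keys, everything sorting before "US" shares region 0, US rows spread over regions 1 to 6, and everything from "V" onward lands in region 7. As far as I know, a `byte[][]` like the one from `splitKeys()` is the shape that `HBaseAdmin.createTable(HTableDescriptor, byte[][])` expects for pre-splitting.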

Thanks for the great explanation!!
>
> No standard splitter will know what your data is (at the time of creation
> of the table).
>
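For contrast, here is a rough sketch of what a data-agnostic uniform splitter does: it cuts the key space into equal slices without looking at the data. This is not the RegionSplitter code itself; the 8-hex-digit key space (00000000 to ffffffff) and the names `UniformHexSplitSketch` and `splitPoints` are assumptions made for illustration.

```java
// Illustrative only: evenly spaced split points over an assumed
// 8-hex-digit key space, independent of the actual row keys.
class UniformHexSplitSketch {
    static String[] splitPoints(int regions) {
        long space = 1L << 32; // number of distinct 8-hex-digit values
        String[] points = new String[regions - 1];
        for (int i = 1; i < regions; i++) {
            // i-th boundary at i/regions of the way through the key space
            points[i - 1] = String.format("%08x", space * i / regions);
        }
        return points;
    }
}
```

Evenly spaced boundaries like these only balance load if the row keys themselves are spread evenly over that space (for example, hashed keys). Keys such as "US_FL..." would all pile into whichever slice covers their prefix.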
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
> [1]
> http://blog.sematext.com/2012/07/09/introduction-to-hbase/
>
> http://blog.sematext.com/2012/07/09/intro-to-hbase-internals-and-schema-desig/
> or any other "intro to hbase" presentations over the web.
>
> On Thu, Jul 26, 2012 at 3:50 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>
> > On Thu, Jul 26, 2012 at 10:34 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> >
> > > > Is there any specific best practice on how many regions one
> > > > should split a table into?
> > >
> > > As always, "it depends". Usually you don't want your RegionServers to
> > > serve more than 50 regions or so. The fewer the better. But at the same
> > > time you usually want your regions to be distributed over the whole
> > > cluster (so that you use all of its power). So, if you don't know your
> > > data size, it might make sense to start with one region per RS (provided
> > > your writes are more or less evenly distributed across the pre-split
> > > regions). If you know that you'll need more regions because of how big
> > > your data is, then you might create more regions at the start (with
> > > pre-splitting), so that you avoid region split operations (you really
> > > want to avoid them if you can).
> > > Of course, you need to take into account other tables in your cluster as
> > > well. I.e. "usually not more than 50 regions" total per regionserver.
> > >
> > >
> >
> >
> > > > Thanks for the detailed explanation. I understand the regions per
> > > > regionserver, which is essentially a range of rows distributed across
> > > > the cluster for a given table. But who decides on how many
> > > > regionservers to have in the cluster?
> > >
> >
> >
> > > > > Just one more question: are the split keys that you described below
> > > > > based on the first byte value of the key?
> > >
> > > > Yes. And the first byte contains a readable char, because of
> > > > Bytes.toBytes(String.valueOf(i)). If you want to prefix with (byte) 0,
> > > > ..., (byte) 9 (i.e. with 0x00, 0x01, ..., 0x09) then there is no need
> > > > to convert to String.
> > >
> > >
> > > How different is this mechanism compared to RegionSplitter, which uses
> > > the default MD5 string split? Just trying to understand how the key
> > > ranges differ.
> >
> > > Alex Baranau
> > > ------
> > > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> > > > Solr
> > >
> > > > On Thu, Jul 26, 2012 at 11:43 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Thu, Jul 26, 2012 at 7:16 AM, Alex Baranau <
Mohit Anchlia 2012-07-26, 15:41