HBase >> mail # user >> Row distribution


Mohit Anchlia 2012-07-25, 05:32
Adrien Mogenet 2012-07-25, 05:59
Alex Baranau 2012-07-25, 13:53
Mohit Anchlia 2012-07-25, 23:54
Alex Baranau 2012-07-26, 14:16
Mohit Anchlia 2012-07-26, 15:43
Alex Baranau 2012-07-26, 17:34
Mohit Anchlia 2012-07-26, 19:50
Alex Baranau 2012-07-26, 20:29

Re: Row distribution
On Thu, Jul 26, 2012 at 1:29 PM, Alex Baranau <[EMAIL PROTECTED]>wrote:

> > But who decides on how many regionservers to
> > have in the cluster?
>
> A RegionServer is a process started on each slave in your cluster. So the
> number of RSs is the same as the number of slaves. You might want to take a
> look at one of the Intro to HBase presentations (which have pictures!) [1]
>
> > How different is this mechanism as compared to regionsplitter that uses
> > default string md5 split. Just trying to understand the difference in how
> > different the key range is.
>
> You can use any of the splitter algorithms, but note that it probably will
> not take into account the row keys you are going to use. E.g.:
> * if your row keys have format <country><state><company><...> and
> * you know that you will have most of the data about US companies (if e.g.
> this is your target audience) then
> * based on the example I gave, you can create regions defined by these
> start keys:
> ""
> "US"
> "US_FL"
> "US_KN"
> "US_MS"
> "US_NC"
> "US_VM"
> "V"
> so that data is more or less evenly distributed (note: there's no need to
> split other countries into regions, as they will have a small amount of
> data).
>
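To make the start-key list above concrete, here is a minimal sketch in plain Java (no HBase dependency) of how those start keys carve up the key space. The class name, the helper names, and the sample row keys are hypothetical, and the sketch assumes row keys are UTF-8 strings compared byte by byte as unsigned values, which is how HBase orders rows. In a real cluster the array returned by `splitKeys()` is the kind of value one would pass as the `byte[][] splitKeys` argument of `HBaseAdmin.createTable(desc, splitKeys)`; that call is not exercised here.

```java
import java.nio.charset.StandardCharsets;

class SplitKeyDemo {
    // Region start keys from the example; "" marks the start of the first region.
    static final String[] START_KEYS = {
        "", "US", "US_FL", "US_KN", "US_MS", "US_NC", "US_VM", "V"
    };

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Unsigned lexicographic byte comparison (the ordering used for row keys).
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Index of the region whose [startKey, nextStartKey) range contains rowKey.
    static int regionFor(String rowKey) {
        byte[] row = utf8(rowKey);
        int region = 0;
        for (int i = 0; i < START_KEYS.length; i++) {
            if (compare(utf8(START_KEYS[i]), row) <= 0) {
                region = i;
            }
        }
        return region;
    }

    // Split points for table creation: every start key except the empty first one.
    static byte[][] splitKeys() {
        byte[][] keys = new byte[START_KEYS.length - 1][];
        for (int i = 1; i < START_KEYS.length; i++) {
            keys[i - 1] = utf8(START_KEYS[i]);
        }
        return keys;
    }

    public static void main(String[] args) {
        // Hypothetical row keys in <country>_<state>_<company> form.
        String[] samples = {"DE_BY_acme", "US_CA_acme", "US_GA_acme", "US_TX_acme", "ZA_WC_acme"};
        for (String key : samples) {
            System.out.println(key + " -> region " + regionFor(key));
        }
    }
}
```

All non-US countries that sort before "US" share region 0 and those from "V" onward share the last region, while US keys are spread over the six regions in between.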

Thanks for the great explanation!
>
> No standard splitter will know what your data is (at the time of creation
> of the table).
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
> [1]
> http://blog.sematext.com/2012/07/09/introduction-to-hbase/
>
> http://blog.sematext.com/2012/07/09/intro-to-hbase-internals-and-schema-desig/
> or any other "intro to hbase" presentations over the web.
>
> On Thu, Jul 26, 2012 at 3:50 PM, Mohit Anchlia <[EMAIL PROTECTED]>wrote:
>
> > On Thu, Jul 26, 2012 at 10:34 AM, Alex Baranau <[EMAIL PROTECTED]>wrote:
> >
> > > > Is there any specific best practice on how many regions one
> > > > should split a table into?
> > >
> > > As always, "it depends". Usually you don't want your RegionServers to
> > > serve more than 50 regions or so. The fewer the better. But at the same
> > > time you usually want your regions to be distributed over the whole
> > > cluster (so that you use all of its power). So, if you don't know your
> > > data size, it might make sense to start with one region per RS (provided
> > > your writes are more or less evenly distributed across the pre-split
> > > regions). If you know that you'll need more regions because of how big
> > > your data is, then you might create more regions at the start (with
> > > pre-splitting), so that you avoid region split operations (you really
> > > want to avoid them if you can).
> > > Of course, you need to take into account other tables in your cluster as
> > > well. I.e. "usually not more than 50 regions" total per regionserver.
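The rule of thumb above boils down to simple arithmetic. The following is only a sketch under stated assumptions: the class and method names are hypothetical, and the figure of 50 regions per RegionServer is the "or so" guideline from this thread, not a hard limit enforced by HBase.

```java
class RegionBudget {
    // Guideline from the thread: roughly 50 regions per RegionServer, all tables included.
    static final int MAX_REGIONS_PER_RS = 50;

    // Starting point when the data size is unknown: one region per RegionServer.
    static int initialRegions(int regionServers) {
        return regionServers;
    }

    // Regions still available cluster-wide once the other tables' regions are counted.
    static int remainingBudget(int regionServers, int existingRegions) {
        return Math.max(0, regionServers * MAX_REGIONS_PER_RS - existingRegions);
    }

    public static void main(String[] args) {
        int regionServers = 10;    // hypothetical cluster size
        int existingRegions = 120; // hypothetical regions already used by other tables
        System.out.println("start with " + initialRegions(regionServers) + " regions");
        System.out.println("budget left: " + remainingBudget(regionServers, existingRegions));
    }
}
```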
> > >
> > >
> >
> >
> > > Thanks for the detailed explanation. I understand the regions per
> > > regionserver, which is essentially a range of rows distributed across the
> > > cluster for a given table. But who decides on how many regionservers to
> > > have in the cluster?
> > >
> >
> >
> > > > Just one more question, in the split keys that you described below, is
> > > > it based on the first byte value of the Key?
> > >
> > > Yes. And the first byte contains a readable char, because of
> > > Bytes.toBytes(String.valueOf(i)). If you want to prefix with (byte) 0,
> > > ..., (byte) 9 (i.e. with 0x00, 0x01, ..., 0x09) then there is no need to
> > > convert to String.
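The difference between the two prefix styles can be seen without HBase on the classpath. This is a small sketch with hypothetical class and method names; it assumes that `Bytes.toBytes(String)` encodes the string as UTF-8, so `String.getBytes(StandardCharsets.UTF_8)` is used here as a stand-in for it.

```java
import java.nio.charset.StandardCharsets;

class PrefixByteDemo {
    // Stand-in for Bytes.toBytes(String.valueOf(i)): the digit as a readable character.
    static byte[] readablePrefix(int i) {
        return String.valueOf(i).getBytes(StandardCharsets.UTF_8);
    }

    // Raw numeric prefix: the value itself as a single byte, no String conversion.
    static byte[] rawPrefix(int i) {
        return new byte[] {(byte) i};
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            System.out.printf("i=%d readable=0x%02x raw=0x%02x%n",
                    i, readablePrefix(i)[0], rawPrefix(i)[0]);
        }
    }
}
```

The readable form yields the character codes for '0' through '9' (0x30 to 0x39), while the raw form yields 0x00 to 0x09, so the two schemes produce different split points and must not be mixed between table creation and row-key generation.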
> > >
> > >
> > How different is this mechanism as compared to regionsplitter that uses
> > default string md5 split. Just trying to understand the difference in how
> > different the key range is.
> >
> > > Alex Baranau
> > > ------
> > > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> > > > Solr
> > >
> > > > On Thu, Jul 26, 2012 at 11:43 AM, Mohit Anchlia <[EMAIL PROTECTED]>wrote:
> > >
> > > > On Thu, Jul 26, 2012 at 7:16 AM, Alex Baranau <
Mohit Anchlia 2012-07-26, 15:41