-
Row distribution (Mohit Anchlia, 2012-07-25, 05:32)
Is there an easy way to tell how my nodes are balanced and how the rows are
distributed in the cluster?
-
Re: Row distribution (Adrien Mogenet, 2012-07-25, 05:59)
From the web interface, you can get such statistics when viewing the
details of a table.
You can also develop your own "balance viewer" through the HBase API (list
of RS, regions, storeFiles, their size, etc.).

--
Adrien Mogenet
http://www.mogenet.me
-
Re: Row distribution (Alex Baranau, 2012-07-25, 13:53)
Hi Mohit,

1. For a particular table:

To see how rows are distributed, check how the regions are distributed.
Each region is defined by its start/stop keys, so depending on your key
format you can see which records go into each region. You can see the
region distribution in the web UI, as Adrien mentioned. It may also be
handy to query the .META. table [1], which holds region info.

If you use random keys, or you are just not sure how data is distributed
across key buckets (which are regions), you may also want to look at the
HBase data on HDFS [2]. Since data is stored separately for each region,
you can see how much space each one occupies on HDFS.

2. For the whole cluster:

It makes sense to use a cluster monitoring tool [3] to find out more about
overall load distribution, how the regions of multiple tables are
distributed, request counts, and much more.

And of course, you can use the HBase Java API to fetch data about the
cluster state as well. I guess you should start by looking at the
HBaseAdmin class.

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

[1]

hbase(main):001:0> scan '.META.', {LIMIT=>1, STARTROW=>"mytable,,"}
ROW
  COLUMN+CELL

mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845.
  column=info:regioninfo, timestamp=1341279432625, value=REGION => {NAME =>
  'mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845.', STARTKEY =>
  'chicago', ENDKEY => 'new_york', ENCODED => fd61cd7ef426d2f233a4cd7e8b73845,
  TABLE => {{NAME => 'mytable', FAMILIES => [{NAME => 'job', BLOOMFILTER =>
  'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '1',
  TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
  BLOCKCACHE => 'true'}]}}

mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845.
  column=info:server, timestamp=1341279432673, value=myserver:60020

mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845.
  column=info:serverstartcode, timestamp=1341279432673, value=1341267474257

1 row(s) in 0.1980 seconds

[2]

ubuntu@ip-10-80-47-73:~$ sudo -u hdfs hadoop fs -du /hbase/mytable
Found 130 items
3397        hdfs://hbase.master/hbase/mytable/02925d3c335bff7e273f392324f16dca
2682163424  hdfs://hbase.master/hbase/mytable/03231b8ae2b73317c4858b1a85c09ad2
1038862956  hdfs://hbase.master/hbase/mytable/04f911571593e931a9a3d9e2a6616236
1039181555  hdfs://hbase.master/hbase/mytable/0a177633196cae7b158836181d69dc0f
1076888812  hdfs://hbase.master/hbase/mytable/0d52fc477c41a9a236803234d44c7c06

[3]
You can get data from JMX directly using any tool you like, or use:
* Ganglia
* SPM monitoring (http://sematext.com/spm/hbase-performance-monitoring/index.html)
* others
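The per-region sizes from a listing like [2] can be ranked with a few lines of code to spot skew. What follows is only a minimal sketch in plain Java: the class RegionSizes and its Entry type are hypothetical names, not part of Hadoop or HBase, and it assumes the two-column "<bytes> <path>" output shown in this thread (other Hadoop versions may print additional columns).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop or HBase.
final class RegionSizes {

    // One region directory and the number of bytes it occupies on HDFS.
    static final class Entry {
        final String region;
        final long bytes;

        Entry(String region, long bytes) {
            this.region = region;
            this.bytes = bytes;
        }
    }

    // Parses lines of the assumed form "<bytes> <path>" and returns the
    // entries sorted by size, largest first. Lines that do not match
    // (such as "Found 130 items") are skipped.
    static List<Entry> parse(List<String> duLines) {
        List<Entry> out = new ArrayList<>();
        for (String line : duLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue;
            }
            long bytes;
            try {
                bytes = Long.parseLong(parts[0]);
            } catch (NumberFormatException e) {
                continue;
            }
            String path = parts[1];
            // The last path component is the encoded region name.
            String region = path.substring(path.lastIndexOf('/') + 1);
            out.add(new Entry(region, bytes));
        }
        out.sort(Comparator.comparingLong((Entry e) -> e.bytes).reversed());
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Found 130 items",
            "3397 hdfs://hbase.master/hbase/mytable/02925d3c335bff7e273f392324f16dca",
            "2682163424 hdfs://hbase.master/hbase/mytable/03231b8ae2b73317c4858b1a85c09ad2",
            "1038862956 hdfs://hbase.master/hbase/mytable/04f911571593e931a9a3d9e2a6616236");
        for (Entry e : parse(lines)) {
            System.out.println(e.bytes + "\t" + e.region);
        }
    }
}
```

A region that is far larger than the rest, or many near-empty ones, suggests the key ranges are not matching the data.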
-
Re: Row distribution (Mohit Anchlia, 2012-07-25, 23:54)
On Wed, Jul 25, 2012 at 6:53 AM, Alex Baranau wrote:

> You can see the regions distribution in web ui as Adrien mentioned. It may
> also be handy for you to query .META. table [1] which holds regions info.

I did a scan, and the data looks as pasted below. It appears all my writes
are going to just one server. My keys are of the form
[0-9]:[current timestamp], where the number between 0 and 9 is generated
randomly. I thought this random prefix would place my keys on multiple
nodes. How should I approach this so that I can use the other nodes as
well?

SESSION_TIMELINE1,,1343074465420.5831bbac53e591c609918c0e2d7da7bf.
  column=info:regioninfo, timestamp=1343170773523, value=REGION => {NAME =>
  'SESSION_TIMELINE1,,1343074465420.5831bbac53e591c609918c0e2d7da7bf.',
  STARTKEY => '', ENDKEY => '', ENCODED => 5831bbac53e591c609918c0e2d7da7bf,
  TABLE => {{NAME => 'SESSION_TIMELINE1', FAMILIES => [{NAME => 'S_T_MTX',
  BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'GZ',
  VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY =>
  'false', BLOCKCACHE => 'true'}]}}

SESSION_TIMELINE1,,1343074465420.5831bbac53e591c609918c0e2d7da7bf.
  column=info:server, timestamp=1343178912655, value=dsdb3.:60020
-
Re: Row distribution (Alex Baranau, 2012-07-26, 14:16)
Looks like you have only one region in your table. Right?

If you want your writes to be distributed from the start (without waiting
for HBase to fill the table enough to split it into many regions), you
should pre-split your table. In your case you can pre-split the table into
10 regions (just an example, you can define more), with start keys
"", "1", "2", ..., "9" [1].

Btw, since you are salting your keys to achieve distribution, you might
also find this small lib helpful; it implements most of the stuff for you
[2].

Hope this helps.

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

[1]

  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  // Nine split keys "1".."9" give ten regions; the first region, starting
  // with the empty key, is created automatically.
  byte[][] splitKeys = new byte[9][];
  for (int i = 0; i < splitKeys.length; i++) {
    // every slot must be filled: a null split key is not valid
    splitKeys[i] = Bytes.toBytes(String.valueOf(i + 1));
  }

  HBaseAdmin admin = new HBaseAdmin(conf);
  admin.createTable(tableDescriptor, splitKeys);

[2]
https://github.com/sematext/HBaseWD
http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
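To see why split keys "1" through "9" spread salted keys of the form [0-9]:[timestamp] over ten regions, the mapping can be simulated without a cluster. This is a rough sketch in plain Java with no HBase dependency; the class SaltedSplits and its methods are hypothetical names. It relies only on the fact that HBase orders row keys lexicographically by unsigned byte value.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical simulation for illustration; not part of the HBase API.
final class SaltedSplits {

    // Nine split keys "1".."9"; the region before "1" starts at the empty key.
    static byte[][] splitKeys() {
        byte[][] keys = new byte[9][];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = String.valueOf(i + 1).getBytes(StandardCharsets.UTF_8);
        }
        return keys;
    }

    // Lexicographic comparison of unsigned bytes; a shorter array that is a
    // prefix of a longer one sorts first.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Region index for a row key: the number of split keys that are <= rowKey.
    // Index 0 is the region that starts at the empty key.
    static int regionIndex(byte[][] splits, String rowKey) {
        byte[] row = rowKey.getBytes(StandardCharsets.UTF_8);
        int idx = 0;
        while (idx < splits.length && compare(splits[idx], row) <= 0) {
            idx++;
        }
        return idx;
    }

    public static void main(String[] args) {
        byte[][] splits = splitKeys();
        for (String key : new String[] {"0:1343074465420", "7:1343074465420", "9:1343074465420"}) {
            System.out.println(key + " -> region " + regionIndex(splits, key));
        }
    }
}
```

With a single region (no split keys), the same function returns 0 for every key, which matches the one-server behaviour seen in the scan earlier in the thread.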
-
Re: Row distribution (Mohit Anchlia, 2012-07-26, 15:41)
On Thu, Jul 26, 2012 at 7:16 AM, Alex Baranau wrote:

> If you want your writes to be distributed from the start (without waiting
> for HBase to fill table enough to split it in many regions), you should
> pre-split your table. In your case you can pre-split table with 10 regions
> (just an example, you can define more), with start keys: "", "1", "2", ...,
> "9" [1].

Thanks a lot! Is there any specific best practice on how many regions one
should split a table into?

> Btw, since you are salting your keys to achieve distribution, you might
> also find this small lib helpful which implements most of the stuff for you
> [2].

I'll take a look.
-
Re: Row distribution (Mohit Anchlia, 2012-07-26, 15:43)
On Thu, Jul 26, 2012 at 7:16 AM, Alex Baranau wrote:

> In your case you can pre-split table with 10 regions
> (just an example, you can define more), with start keys: "", "1", "2", ...,
> "9" [1].

Just one more question: are the split keys you described below based on
the first byte value of the key?
-
Re: Row distribution (Alex Baranau, 2012-07-26, 17:34)
> Is there any specific best practice on how many regions one
> should split a table into?

As always, "it depends". Usually you don't want your RegionServers to
serve more than 50 regions or so. The fewer the better. But at the same
time you usually want your regions to be distributed over the whole
cluster (so that you use all of its power). So, if you don't know your
data size, it might make sense to start with one region per RS (provided
your writes are more or less evenly distributed across the pre-split
regions). If you know that you'll need more regions because of how big
your data is, then you might create more regions at the start (with
pre-splitting), so that you avoid region split operations (you really want
to avoid them if you can).
Of course, you need to take into account the other tables in your cluster
as well. I.e. "usually not more than 50 regions" total per RegionServer.

> Just one more question, in the split keys that you described below, is it
> based on the first byte value of the Key?

Yes. And the first byte contains a readable char, because of
Bytes.toBytes(String.valueOf(i)). If you want to prefix with (byte) 0, ...,
(byte) 9 (i.e. with 0x00, 0x01, ..., 0x09), then there is no need to
convert to String.

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
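The difference between the two prefix styles comes down to which byte value ends up at the start of the row key. A minimal sketch in plain Java, with the hypothetical class name SaltPrefix and standard-library calls standing in for the HBase Bytes helper:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; uses only the Java standard library.
final class SaltPrefix {

    // Salt stored as a readable character, e.g. 3 -> "3" -> byte 0x33.
    // Valid for single-digit salts 0..9.
    static byte charSalt(int salt) {
        return String.valueOf(salt).getBytes(StandardCharsets.UTF_8)[0];
    }

    // Salt stored as the raw byte value, e.g. 3 -> byte 0x03.
    static byte rawSalt(int salt) {
        return (byte) salt;
    }

    public static void main(String[] args) {
        for (int salt = 0; salt <= 9; salt++) {
            System.out.println(String.format(
                "salt %d: char prefix 0x%02x, raw prefix 0x%02x",
                salt, charSalt(salt), rawSalt(salt)));
        }
    }
}
```

Whichever style is chosen, the split keys and the keys written by the client have to use the same one; otherwise all rows sort before or after every split point and land in a single region.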
-
Re: Row distribution (Mohit Anchlia, 2012-07-26, 19:50)
On Thu, Jul 26, 2012 at 10:34 AM, Alex Baranau wrote:

> Of course, you need to take into account other tables in your cluster as
> well. I.e. "usually not more than 50 regions" total per regionserver.

Thanks for the detailed explanation. I understand regions per
RegionServer: essentially ranges of rows of a given table, distributed
across the cluster. But who decides how many RegionServers to have in the
cluster?

> yes. And the first byte contains readable char, because of
> Bytes.ToBytes(String.valueOf(i)). If you want to prefix with (byte) 0, ...,
> (byte) 9 (i.e. with 0x00, 0x01, ..., 0x09) then no need to convert to
> String.

How does this mechanism differ from RegionSplitter, which uses an MD5
string split by default? I'm just trying to understand how the resulting
key ranges differ.
-
Re: Row distributionAlex Baranau 2012-07-26, 20:29
> But who decides on how many regionservers to
> have in the cluster?

RegionServer is a process started on each slave in your cluster. So the
number of RS is the same as the number of slaves. You might want to take a
look at one of the Intro to HBase presentations (which have pictures!) [1]

> How different is this mechanism as compared to regionsplitter that uses
> default string md5 split. Just trying to understand the difference in how
> different the key range is.

You can use any of the splitter algorithms, but note that it probably will
not take into account the row keys you are going to use. E.g.:
* if your row keys have format <country><state><company><...> and
* you know that you will have most of the data about US companies (if e.g.
  this is your target audience) then
* based on the example I gave, you can create regions defined by these
  start keys:
  ""
  "US"
  "US_FL"
  "US_KN"
  "US_MS"
  "US_NC"
  "US_VM"
  "V"
so that data is more or less evenly distributed (note: there's no need to
split other countries into regions, as they will have a small amount of
data).

No standard splitter will know what your data is (at the time of creation
of the table).

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

[1]
http://blog.sematext.com/2012/07/09/introduction-to-hbase/
http://blog.sematext.com/2012/07/09/intro-to-hbase-internals-and-schema-desig/
or any other "intro to hbase" presentations over the web.

On Thu, Jul 26, 2012 at 3:50 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> On Thu, Jul 26, 2012 at 10:34 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
>
> > > Is there any specific best practice on how many regions one
> > > should split a table into?
> >
> > As always, "it depends". Usually you don't want your RegionServers to
> > serve more than 50 regions or so. The fewer the better. But at the same
> > time you usually want your regions to be distributed over the whole
> > cluster (so that you use all power). So, it might make sense to start
> > with one region per RS (if your writes are more or less evenly
> > distributed across pre-split regions) if you don't know your data
> > size. If you know that you'll need to have more regions because of how
> > big your data is, then you might create more regions at the start (with
> > pre-splitting), so that you avoid region split operations (you really
> > want to avoid them if you can).
> > Of course, you need to take into account other tables in your cluster
> > as well. I.e. "usually not more than 50 regions" total per regionserver.
> >
> > Thanks for the detailed explanation. I understand the regions per
> > regionserver, which is essentially range of rows distributed across the
> > cluster for a given table. But who decides on how many regionservers to
> > have in the cluster?
> >
> > > Just one more question, in the split keys that you described below, is
> > > it based on the first byte value of the Key?
> >
> > yes. And the first byte contains readable char, because of
> > Bytes.toBytes(String.valueOf(i)). If you want to prefix with (byte) 0,
> > ..., (byte) 9 (i.e. with 0x00, 0x01, ..., 0x09) then no need to convert
> > to String.
>
> How different is this mechanism as compared to regionsplitter that uses
> default string md5 split. Just trying to understand the difference in how
> different the key range is.
>
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> >
> > On Thu, Jul 26, 2012 at 11:43 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> >
> > > On Thu, Jul 26, 2012 at 7:16 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> > >
> > > > Looks like you have only one region in your table. Right?
> > > >
> > > > If you want your writes to be distributed from the start (without
> > > > waiting for HBase to fill table enough to split it in many regions),
> > > > you should pre-split your table. In your case you can pre-split table
> > > > with 10

Alex Baranau Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
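To make the custom start keys from the reply concrete, here is a minimal Java sketch of how a sorted list of region start keys partitions the row key space: a row belongs to the region whose start key is the largest one that is not greater than the row key, comparing raw bytes. The class SplitKeyRegions, its method names and the sample row keys are made up for illustration and are not part of the HBase API; in a real cluster the HBase client does this lookup for you.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Illustration only: mimics how sorted region start keys partition the row key space. */
final class SplitKeyRegions {

    /** Compares byte arrays as unsigned bytes, left to right; a shorter prefix sorts first. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /**
     * Returns the index of the last start key that is <= rowKey.
     * Assumes startKeys is sorted and begins with "" (the first region).
     */
    static int regionIndexFor(List<String> startKeys, String rowKey) {
        byte[] row = rowKey.getBytes(StandardCharsets.UTF_8);
        int index = 0;
        for (int i = 0; i < startKeys.size(); i++) {
            byte[] start = startKeys.get(i).getBytes(StandardCharsets.UTF_8);
            if (compareBytes(start, row) <= 0) {
                index = i;
            } else {
                break; // keys are sorted, so no later start key can match
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Start keys taken from the example in the thread.
        List<String> startKeys = Arrays.asList(
                "", "US", "US_FL", "US_KN", "US_MS", "US_NC", "US_VM", "V");
        // Hypothetical row keys in a <country>_<state>_<company> layout.
        String[] rows = {"FR_Paris_acme", "US_CA_acme", "US_GA_acme",
                         "US_TX_acme", "ZA_Durban_acme"};
        for (String row : rows) {
            int i = regionIndexFor(startKeys, row);
            System.out.println(row + " -> region " + i
                    + " (start key \"" + startKeys.get(i) + "\")");
        }
    }
}
```

With these start keys, non-US rows fall into the first or the last region, while US rows are spread over the six regions in between; for instance "US_GA_acme" lands in region 2, whose start key is "US_FL". If pre-splitting through the Java API, the same keys (as byte arrays, and as far as I know without the leading empty key) would be what goes into the splitKeys argument of HBaseAdmin.createTable(desc, splitKeys).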
-
Re: Row distribution Mohit Anchlia 2012-07-26, 20:31
On Thu, Jul 26, 2012 at 1:29 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> > But who decides on how many regionservers to
> > have in the cluster?
>
> RegionServer is a process started on each slave in your cluster. So the
> number of RS is the same as the number of slaves. You might want to take a
> look at one of the Intro to HBase presentations (which have pictures!) [1]
>
> > How different is this mechanism as compared to regionsplitter that uses
> > default string md5 split. Just trying to understand the difference in how
> > different the key range is.
>
> You can use any of the splitter algorithms, but note that it probably will
> not take into account the row keys you are going to use. E.g.:
> * if your row keys have format <country><state><company><...> and
> * you know that you will have most of the data about US companies (if e.g.
>   this is your target audience) then
> * based on the example I gave, you can create regions defined by these
>   start keys:
>   ""
>   "US"
>   "US_FL"
>   "US_KN"
>   "US_MS"
>   "US_NC"
>   "US_VM"
>   "V"
> so that data is more or less evenly distributed (note: there's no need to
> split other countries into regions, as they will have a small amount of
> data).
>

Thanks for great explanation!!

>
> No standard splitter will know what your data is (at the time of creation
> of the table).
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> [1]
> http://blog.sematext.com/2012/07/09/introduction-to-hbase/
> http://blog.sematext.com/2012/07/09/intro-to-hbase-internals-and-schema-desig/
> or any other "intro to hbase" presentations over the web.
>
> On Thu, Jul 26, 2012 at 3:50 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>
> > On Thu, Jul 26, 2012 at 10:34 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> >
> > > > Is there any specific best practice on how many regions one
> > > > should split a table into?
> > >
> > > As always, "it depends". Usually you don't want your RegionServers to
> > > serve more than 50 regions or so. The fewer the better. But at the same
> > > time you usually want your regions to be distributed over the whole
> > > cluster (so that you use all power). So, it might make sense to start
> > > with one region per RS (if your writes are more or less evenly
> > > distributed across pre-split regions) if you don't know your data
> > > size. If you know that you'll need to have more regions because of how
> > > big your data is, then you might create more regions at the start (with
> > > pre-splitting), so that you avoid region split operations (you really
> > > want to avoid them if you can).
> > > Of course, you need to take into account other tables in your cluster
> > > as well. I.e. "usually not more than 50 regions" total per regionserver.
> > >
> > > Thanks for the detailed explanation. I understand the regions per
> > > regionserver, which is essentially range of rows distributed across the
> > > cluster for a given table. But who decides on how many regionservers to
> > > have in the cluster?
> > >
> > > > Just one more question, in the split keys that you described below,
> > > > is it based on the first byte value of the Key?
> > >
> > > yes. And the first byte contains readable char, because of
> > > Bytes.toBytes(String.valueOf(i)). If you want to prefix with (byte) 0,
> > > ..., (byte) 9 (i.e. with 0x00, 0x01, ..., 0x09) then no need to convert
> > > to String.
> >
> > How different is this mechanism as compared to regionsplitter that uses
> > default string md5 split. Just trying to understand the difference in how
> > different the key range is.
> >
> > > Alex Baranau
> > > ------
> > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> > >
> > > On Thu, Jul 26, 2012 at 11:43 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Thu, Jul 26, 2012 at 7:16 AM, Alex Baranau <
|