HBase >> mail # user >> Row distribution


Mohit Anchlia 2012-07-25, 05:32
Adrien Mogenet 2012-07-25, 05:59
Alex Baranau 2012-07-25, 13:53
Re: Row distribution
On Wed, Jul 25, 2012 at 6:53 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> Hi Mohit,
>
> 1. When talking about a particular table:
>
> For viewing row distribution you can check out how the regions are
> distributed. Each region is defined by its start/stop key, so depending on
> your key format, etc. you can see which records go into each region. You
> can see the region distribution in the web UI as Adrien mentioned. It may also
> be handy for you to query the .META. table [1], which holds the regions' info.
>
> In cases when you use random keys, or when you are just not sure how data is
> distributed across key buckets (which are regions), you may also want to look
> at the HBase data on HDFS [2]. Since data is stored for each region separately,
> you can see how much space each one occupies on HDFS.
>
I did a scan and the data looks as pasted below. It appears all my
writes are going to just one server. My keys are of this type:
[0-9]:[current timestamp]. The number between 0 and 9 is generated randomly. I
thought that by having this random number I would be able to place my keys on
multiple nodes. How should I approach this so that I am able to use the other
nodes as well?

 SESSION_TIMELINE1,,1343074465420.5831bbac53e591c609918c0e2d7da7bf.
   column=info:regioninfo, timestamp=1343170773523, value=REGION => {NAME =>
   'SESSION_TIMELINE1,,1343074465420.5831bbac53e591c609918c0e2d7da7bf.',
   STARTKEY => '', ENDKEY => '', ENCODED => 5831bbac53e591c609918c0e2d7da7bf,
   TABLE => {{NAME => 'SESSION_TIMELINE1', FAMILIES => [{NAME => 'S_T_MTX',
   BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'GZ',
   VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536',
   IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
 SESSION_TIMELINE1,,1343074465420.5831bbac53e591c609918c0e2d7da7bf.
   column=info:server, timestamp=1343178912655, value=dsdb3.:60020
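
The scan above shows a single region (STARTKEY and ENDKEY are both empty), so every
write lands in that one region regardless of the random [0-9] prefix; the prefix only
spreads load once there is a region per bucket. A minimal sketch of one way to get
there, assuming the 0.92/0.94-era HBase Java API that was current at the time (the
table and column family names are taken from the scan above; the class name and the
split-at-each-digit scheme are just illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PreSplitSessionTimeline {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);

      HTableDescriptor desc = new HTableDescriptor("SESSION_TIMELINE1");
      desc.addFamily(new HColumnDescriptor("S_T_MTX"));

      // One split point per salt prefix '1'..'9'; together with the implicit
      // first region this yields 10 regions, one per [0-9] key bucket.
      byte[][] splits = new byte[9][];
      for (int i = 1; i <= 9; i++) {
        splits[i - 1] = Bytes.toBytes(String.valueOf(i));
      }
      admin.createTable(desc, splits);
    }
  }

With a table pre-split this way, a key such as 3:1343074465420 falls into the region
whose start key is '3', so writes can spread across as many region servers as the
balancer places those regions on. An existing single-region table would have to be
recreated with split points (or split manually) to get the same effect.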

> 2. When talking about the whole cluster, it makes sense to use a cluster
> monitoring tool [3] to find out more about the overall load distribution, how
> regions of multiple tables are distributed, request counts, and many more
> such things.
>
> And of course, you can use the HBase Java API to fetch some of the cluster
> state data as well. I guess you should start by looking at the HBaseAdmin
> class.
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
> [1]
>
> hbase(main):001:0> scan '.META.', {LIMIT=>1, STARTROW=>"mytable,,"}
> ROW                                                        COLUMN+CELL
>  mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845.
>   column=info:regioninfo, timestamp=1341279432625, value=REGION => {NAME =>
>   'mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845.', STARTKEY =>
>   'chicago', ENDKEY => 'new_york', ENCODED => 8fd61cd7ef426d2f233a4cd7e8b73845,
>   TABLE => {{NAME => 'mytable', FAMILIES => [{NAME => 'job', BLOOMFILTER =>
>   'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '1',
>   TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
>   BLOCKCACHE => 'true'}]}}
>  mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845.
>   column=info:server, timestamp=1341279432673, value=myserver:60020
>  mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845.
>   column=info:serverstartcode, timestamp=1341279432673, value=1341267474257
> 1 row(s) in 0.1980 seconds
>
> [2]
>
> ubuntu@ip-10-80-47-73:~$ sudo -u hdfs hadoop fs -du /hbase/mytable
> Found 130 items
> 3397        hdfs://hbase.master/hbase/mytable/02925d3c335bff7e273f392324f16dca
> 2682163424  hdfs://hbase.master/hbase/mytable/03231b8ae2b73317c4858b1a85c09ad2
> 1038862956  hdfs://hbase.master/hbase/mytable/04f911571593e931a9a3d9e2a6616236
> 1039181555  hdfs://hbase.master/hbase/mytable/0a177633196cae7b158836181d69dc0f
> 1076888812  hdfs://hbase.master/hbase/mytable/0d52fc477c41a9a236803234d44c7c06
>
> [3]
> You can get data from JMX directly using any tool you like or use:
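
The HBaseAdmin route mentioned above is not spelled out in the thread; as a minimal
sketch, assuming the same 0.92/0.94-era Java API (newer HBase versions expose this
differently, and the class name RegionLoadReport is just illustrative), per-region-server
load can be dumped like this to spot a single hot server:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.ClusterStatus;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HServerLoad;
  import org.apache.hadoop.hbase.ServerName;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class RegionLoadReport {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);

      // ClusterStatus carries one load entry per live region server, including
      // how many regions it hosts and its recent request count.
      ClusterStatus status = admin.getClusterStatus();
      for (ServerName server : status.getServers()) {
        HServerLoad load = status.getLoad(server);
        System.out.println(server.getHostname()
            + " regions=" + load.getNumberOfRegions()
            + " requests=" + load.getNumberOfRequests());
      }
    }
  }

If one server reports most of the regions or requests while the others sit idle, the
row key and region layout discussed above is the first thing to check.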
Alex Baranau 2012-07-26, 14:16
Mohit Anchlia 2012-07-26, 15:43
Alex Baranau 2012-07-26, 17:34
Mohit Anchlia 2012-07-26, 19:50
Alex Baranau 2012-07-26, 20:29
Mohit Anchlia 2012-07-26, 20:31
Mohit Anchlia 2012-07-26, 15:41