HBase >> mail # user >> Row distribution
Re: Row distribution
On Thu, Jul 26, 2012 at 7:16 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> Looks like you have only one region in your table. Right?
>
> If you want your writes to be distributed from the start (without waiting
> for HBase to fill table enough to split it in many regions), you should
> pre-split your table. In your case you can pre-split table with 10 regions
> (just an example, you can define more), with start keys: "", "1", "2", ...,
> "9" [1].
>
Just one more question: in the split keys that you described below, is it
based on the first byte value of the key?
> Btw, since you are salting your keys to achieve distribution, you might
> also find this small lib helpful which implements most of the stuff for you
> [2].
>
> Hope this helps.
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
> [1]
>
>     // 9 split points "1".."9" give 10 regions; the first region
>     // (starting with the empty key) is created automatically
>     byte[][] splitKeys = new byte[9][];
>     for (int i = 0; i < splitKeys.length; i++) {
>       splitKeys[i] = Bytes.toBytes(String.valueOf(i + 1));
>     }
>
>     HBaseAdmin admin = new HBaseAdmin(conf);
>     admin.createTable(tableDescriptor, splitKeys);
>
> [2]
> https://github.com/sematext/HBaseWD
>
> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
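For illustration, the key-salting idea behind such a lib can be sketched in plain Java. Note the assumptions: the bucket count (10, matching the 10 pre-split regions) and the "bucket:key" layout are made up for this sketch, not necessarily HBaseWD's actual key format; also, this variant picks the bucket by hashing the original key (deterministic, so a reader can reconstruct the salted key), whereas the keys described earlier in the thread use a random prefix.

```java
import java.util.HashMap;
import java.util.Map;

public class SaltedKeys {
    // Assumed bucket count, matching the 10 pre-split regions "", "1".."9".
    static final int BUCKETS = 10;

    // Prefix a sequential key with a deterministic bucket so consecutive
    // writes spread across all pre-split regions instead of one.
    static String salt(String sequentialKey) {
        int bucket = Math.abs(sequentialKey.hashCode() % BUCKETS);
        return bucket + ":" + sequentialKey;
    }

    // Strip the prefix to recover the original key, e.g. after a scan.
    static String unsalt(String saltedKey) {
        return saltedKey.substring(saltedKey.indexOf(':') + 1);
    }

    public static void main(String[] args) {
        // Simulate 1000 sequential timestamp keys and count keys per bucket.
        Map<Integer, Integer> perBucket = new HashMap<>();
        for (long ts = 1343074465420L; ts < 1343074465420L + 1000; ts++) {
            String salted = salt(String.valueOf(ts));
            int bucket = Integer.parseInt(salted.substring(0, salted.indexOf(':')));
            perBucket.merge(bucket, 1, Integer::sum);
        }
        // Sequential keys now land in multiple buckets (regions), not one.
        System.out.println("buckets used: " + perBucket.size());
    }
}
```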
>
> On Wed, Jul 25, 2012 at 7:54 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>
> > On Wed, Jul 25, 2012 at 6:53 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Mohit,
> > >
> > > 1. When talking about a particular table:
> > >
> > > For viewing row distribution you can check out how the regions are
> > > distributed. Each region is defined by its start/stop key, so depending
> > > on your key format, etc. you can see which records go into each region.
> > > You can see the region distribution in the web UI as Adrien mentioned.
> > > It may also be handy for you to query the .META. table [1], which holds
> > > the region info.
> > >
> > > In cases when you use random keys, or when you're just not sure how data
> > > is distributed across key buckets (which are regions), you may also want
> > > to look at the HBase data on HDFS [2]. Since data is stored for each
> > > region separately, you can see the size each one occupies on HDFS.
> > >
> > I did a scan and the data looks as pasted below. It appears all my
> > writes are going to just one server. My keys are of this type
> > [0-9]:[current timestamp]. The number between 0-9 is generated randomly. I
> > thought by having this random number I'd be able to place my keys on
> > multiple nodes. How should I approach this such that I am able to use the
> > other nodes as well?
> >
> >  SESSION_TIMELINE1,,1343074465420.5831bbac53e591c609918c0e2d7da7bf.
> >    column=info:regioninfo, timestamp=1343170773523,
> >    value=REGION => {NAME => 'SESSION_TIMELINE1,,1343074465420.5831bbac53e591c609918c0e2d7da7bf.',
> >    STARTKEY => '', ENDKEY => '', ENCODED => 5831bbac53e591c609918c0e2d7da7bf,
> >    TABLE => {{NAME => 'SESSION_TIMELINE1', FAMILIES => [{NAME => 'S_T_MTX',
> >    BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'GZ',
> >    VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536',
> >    IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
> >  SESSION_TIMELINE1,,1343074465420.5831bbac53e591c609918c0e2d7da7bf.
> >    column=info:server, timestamp=1343178912655, value=dsdb3.:60020
> >
> > > 2. When talking about the whole cluster, it makes sense to use a cluster
> > > monitoring tool [3] to find out more about overall load distribution,
> > > distribution of regions across multiple tables, request counts, and many
> > > more such things.
> > >
> > > And of course, you can use the HBase Java API to fetch some data about
> > > the cluster state as well. I guess you should start looking at it from
> > > HBaseAdmin