HBase >> mail # user >> Uneven write request to regions


Re: Uneven write request to regions
Hi Tom,
Isn't the Bucket we use the same thing?
So if I understand correctly, you are not using automatic splitting, but do
this through a manual process, or a process running alongside HBase?
Regarding the recommendation to merge empty regions - how did you merge the
regions? I thought this capability exists only in 0.96?
On Saturday, November 16, 2013, Tom Brown wrote:

> We have solved this by prefixing each key with a single byte. The byte is
> based on a very simple 8-bit hash of the record. If you know exactly which
> row you are looking for you can rehash your row to create the true key.
>
> Scans are a little more complex because you have to issue 256 scans instead
> of 1 scan, and merge the results.
>
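The single-byte salting scheme described above can be sketched as follows. This is a minimal illustration: the class and method names are mine, and XOR-of-bytes stands in for the unspecified "very simple 8-bit hash".

```java
import java.nio.charset.StandardCharsets;

// Sketch of the salting scheme: the salt is an 8-bit hash of the logical
// row, so a point lookup can recompute it, while a full scan must fan out
// over all 256 possible prefixes.
public class SaltedKeys {

    // One possible "very simple" 8-bit hash: XOR of all key bytes.
    static byte saltFor(byte[] logicalKey) {
        byte salt = 0;
        for (byte b : logicalKey) {
            salt ^= b;
        }
        return salt;
    }

    // True (physical) key = salt byte + logical key.
    static byte[] saltedKey(byte[] logicalKey) {
        byte[] out = new byte[logicalKey.length + 1];
        out[0] = saltFor(logicalKey);
        System.arraycopy(logicalKey, 0, out, 1, logicalKey.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] row = "customer-42|event-7".getBytes(StandardCharsets.UTF_8);
        byte[] physical = saltedKey(row);
        // A point lookup re-derives the same salt, so it finds the same key;
        // a scan would need one range per possible salt value: 256 scans.
        System.out.println("salt=" + (physical[0] & 0xFF));
    }
}
```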
> Another thing we did was write a utility to compute all the region sizes in
> a list, and recommend merges of now-empty regions, and splits of hot
> regions.
>
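The region-size utility described above can be approximated in pure Java. The names and the decision rule are illustrative assumptions; a real tool would pull store-file sizes from HDFS or region metrics rather than a hard-coded map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: given per-region store sizes, recommend merging now-empty
// regions and splitting hot ones.
public class RegionAdvisor {

    static List<String> advise(Map<String, Long> regionSizesBytes, long splitThreshold) {
        List<String> recs = new ArrayList<>();
        for (Map.Entry<String, Long> e : regionSizesBytes.entrySet()) {
            if (e.getValue() == 0L) {
                recs.add("MERGE " + e.getKey());   // empty: fold into a neighbor
            } else if (e.getValue() > splitThreshold) {
                recs.add("SPLIT " + e.getKey());   // hot: split to spread load
            }
        }
        return recs;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("region-a", 0L);                 // drained by TTL
        sizes.put("region-b", 300L * 1024 * 1024); // past a 256MB threshold
        sizes.put("region-c", 10L * 1024 * 1024);  // healthy
        System.out.println(advise(sizes, 256L * 1024 * 1024));
    }
}
```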
> Together, those two items solve the problem quite nicely for us. We haven't
> quite got to your scale yet, so YMMV.
>
> --Tom
>
> On Friday, November 15, 2013, Ted Yu wrote:
>
> > bq. you must have your customerId, timestamp in the rowkey since you
> > query on it
> >
> > Have you looked at this API in Scan ?
> >
> >   public Scan setTimeRange(long minStamp, long maxStamp)
> >
> >
> > Cheers
> >
> >
> > On Fri, Nov 15, 2013 at 1:28 PM, Asaf Mesika <[EMAIL PROTECTED]>
> > wrote:
> >
> > > The problem is that I do know my rowkey design, and it follows people's
> > > best practices, but it generates a really bad use case which I can't
> > > seem to figure out how to solve yet.
> > >
> > > The rowkey as I said earlier is:
> > > <customerId><bucket><timestampInMs><uniqueId>
> > > So when, for example, you have 1,000 customers, and bucket ranges from
> > > 1 to 16, you eventually end up with:
> > > * 30k regions - What happens, as I presume: you start with one region
> > > hosting ALL customers. As you pour in more customers and more data,
> > > region splitting kicks in. So, after a while, you get to a situation in
> > > which most regions host a specific customerId, bucket and time duration.
> > > For example: customer #10001, bucket 6, 01/07/2013 00:00 - 02/07/2013
> > > 17:00.
> > > * Empty regions - the first really bad consequence of the above is that
> > > when the time duration is over, no data will ever be written to this
> > > region again. And worse - when the TTL you set (let's say 1 month) has
> > > passed and it's 03/08/2013, this region becomes empty!
> > >
> > > The thing is that you must have your customerId and timestamp in the
> > > rowkey since you query on them, but when you do, you will essentially
> > > get regions which will not receive any more writes, and which after the
> > > TTL become zombie regions :)
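The rowkey layout under discussion can be sketched as below. The field widths are assumptions for illustration only - the thread does not specify them - with customerId and uniqueId as 8-byte longs and bucket as a single byte.

```java
import java.nio.ByteBuffer;

// Sketch of the rowkey layout from the thread:
// <customerId><bucket><timestampInMs><uniqueId>
public class RowKey {

    static byte[] build(long customerId, byte bucket, long timestampMs, long uniqueId) {
        return ByteBuffer.allocate(8 + 1 + 8 + 8)
                .putLong(customerId)   // groups all of a customer's data together
                .put(bucket)           // e.g. 1..16, spreads one customer over 16 ranges
                .putLong(timestampMs)  // supports queries by time range...
                .putLong(uniqueId)     // ...and this tail keeps keys unique
                .array();
    }

    public static void main(String[] args) {
        byte[] key = build(10001L, (byte) 6, 1372636800000L, 42L);
        // Because timestamp is ordered within (customer, bucket), a region
        // that covers an old time range stops receiving writes; once its TTL
        // passes, it drains to empty - the "zombie region" effect described.
        System.out.println("key length = " + key.length);
    }
}
```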
> > >
> > > The second bad part of this rowkey design is that some customers will
> > > have significantly less traffic than others, so in essence their
> > > regions get written to at a very slow rate compared with the
> > > high-traffic customers'. When this happens on the same RS - bam: the
> > > slow region's Puts cause the WAL queue to grow over time, since that
> > > region never reaches the max region size (256MB in our case), thus
> > > never gets flushed, and thus stays in the 1st WAL file. Until when?
> > > Until we hit the max number of WAL files permitted (32), and then
> > > regions are flushed forcibly. When this happens, we get about 100
> > > regions with 3KB-3MB store files. You can imagine what happens next.
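The forced-flush behavior described above is governed by region server settings along these lines (property names as used in HBase releases of that era; values mirror the numbers in the post - verify both against your version's defaults):

```xml
<!-- hbase-site.xml fragment (illustrative) -->
<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>32</value> <!-- WAL file count that triggers forced flushes -->
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>268435456</value> <!-- 256MB memstore flush threshold -->
</property>
```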
> > >
> > > The weirdest thing here is that this rowkey design is very common -
> > > nothing fancy here - so in essence this phenomenon should have happened
> > > to a lot of people, but for some reason I don't see much written about
> > > it.
> > >
> > > Thanks!
> > >
> > > Asaf
> > >
> > >
> > >
> > > On Fri, Nov 15, 2013 at 3:51 AM, Jia Wang <[EMAIL PROTECTED]> wrote:
> > >
> > > > Then the case is simple, as I said: "check your row key design, you
> > > > can find the start and end row key for each region, from which you
> > > > can know