

Re: Uneven write request to regions
Hi Tom,
Isn't the bucket we use the same thing?
So if I understand correctly, you are not using automatic splitting, but do
this through a manual process or a process running alongside HBase?
Regarding the recommendation to merge empty regions: how did you merge regions?
I thought this capability exists only in 0.96?
On Saturday, November 16, 2013, Tom Brown wrote:

> We have solved this by prefixing each key with a single byte. The byte is
> based on a very simple 8-bit hash of the record. If you know exactly which
> row you are looking for you can rehash your row to create the true key.
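
A minimal sketch of the one-byte prefix described above, using only the JDK. The thread does not say which 8-bit hash was used or what it was computed over, so the `salt` function and the choice to hash the logical row key are assumptions, and the class and method names are hypothetical.

```java
import java.util.Arrays;

final class SaltedKey {
    // Hypothetical 8-bit hash: a multiply-and-add over the key bytes, reduced to 0..255.
    // Any function that is stable for a given key would serve the same purpose.
    static byte salt(byte[] logicalKey) {
        int h = 0;
        for (byte b : logicalKey) {
            h = (h * 31 + (b & 0xFF)) & 0xFF;
        }
        return (byte) h;
    }

    // Stored key = one salt byte followed by the logical key.
    static byte[] toStoredKey(byte[] logicalKey) {
        byte[] stored = new byte[logicalKey.length + 1];
        stored[0] = salt(logicalKey);
        System.arraycopy(logicalKey, 0, stored, 1, logicalKey.length);
        return stored;
    }

    // Drops the salt byte to recover the logical key from a stored key.
    static byte[] toLogicalKey(byte[] storedKey) {
        return Arrays.copyOfRange(storedKey, 1, storedKey.length);
    }
}
```

Because the salt is derived from the key itself, a point lookup can recompute the stored key from the logical key without any extra lookup table.
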
>
> Scans are a little more complex because you have to issue 256 scans instead
> of 1 scan, and interleave the results.
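
Combining the per-prefix scans can be sketched as a k-way merge. This is an illustration under assumptions, not the poster's code: each inner list stands in for the rows returned by one prefix's scan, already sorted and with the prefix byte stripped, and `SaltedScanMerge` is a hypothetical name. No HBase client calls are shown.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class SaltedScanMerge {
    // One cursor per prefix: the current head row plus the remaining rows of that scan.
    private static final class Cursor {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    // Merges per-prefix results (each sorted by logical key) into one sorted list.
    static List<String> merge(List<List<String>> perPrefixResults) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.head));
        for (List<String> rows : perPrefixResults) {
            Iterator<String> it = rows.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it)); // skip prefixes that returned nothing
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();      // smallest head across all prefixes
            merged.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next();
                heap.add(c);             // re-insert with its next row
            }
        }
        return merged;
    }
}
```
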
>
> Another thing we did was write a utility to compute all the region sizes in
> a list, and recommend merges of now-empty regions, and splits of hot
> regions.
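
A rough sketch of what such a utility's decision step might look like. Everything here is assumed for illustration: the `Region` holder, the use of size as a stand-in for "hot", the split threshold, and the rule of merging an empty region into its next neighbour. Collecting the real sizes from a cluster is not shown.

```java
import java.util.ArrayList;
import java.util.List;

final class RegionAdvisor {
    // Hypothetical holder for one region's name and on-disk size.
    static final class Region {
        final String name;
        final long sizeBytes;

        Region(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }
    }

    // Regions are assumed to be listed in row-key order, so list neighbours are adjacent.
    // Returns human-readable advice; nothing is executed against a cluster.
    static List<String> recommend(List<Region> regions, long splitThresholdBytes) {
        List<String> advice = new ArrayList<>();
        for (int i = 0; i < regions.size(); i++) {
            Region r = regions.get(i);
            if (r.sizeBytes > splitThresholdBytes) {
                advice.add("SPLIT " + r.name);
            } else if (r.sizeBytes == 0 && i + 1 < regions.size()) {
                // An empty region is paired with its next neighbour.
                // A trailing empty region is left alone in this simplified version.
                advice.add("MERGE " + r.name + " + " + regions.get(i + 1).name);
            }
        }
        return advice;
    }
}
```

The output is meant to be reviewed by a person before acting on it, since a run of consecutive empty regions produces overlapping merge suggestions.
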
>
> Together, those two items solve the problem quite nicely for us. We haven't
> quite got to your scale yet, so YMMV.
>
> --Tom
>
> On Friday, November 15, 2013, Ted Yu wrote:
>
> > bq. you must have your customerId, timestamp in the rowkey since you
> > query on it
> >
> > Have you looked at this API in Scan ?
> >
> >   public Scan setTimeRange(long minStamp, long maxStamp)
> >
> >
> > Cheers
> >
> >
> > On Fri, Nov 15, 2013 at 1:28 PM, Asaf Mesika <[EMAIL PROTECTED]>
> > wrote:
> >
> > > The problem is that I do know my rowkey design, and it follows people's
> > > best practice, but it generates a really bad use case which I can't seem
> > > to figure out how to solve yet.
> > >
> > > The rowkey, as I said earlier, is:
> > > <customerId><bucket><timestampInMs><uniqueId>
> > > So when, for example, you have 1000 customers, and bucket ranges from 1
> > > to 16, you eventually end up with:
> > > * 30k regions - What happens, as I presume: you start with one region
> > > hosting ALL customers, which is just one. As you pour in more customers
> > > and more data, region splitting kicks in. So, after a while, you get to
> > > a situation in which most regions host a specific customerId, bucket and
> > > time range. For example: customer #10001, bucket 6, 01/07/2013 00:00 -
> > > 02/07/2013 17:00.
> > > * Empty regions - the first really bad consequence of the above is that
> > > when the time range is over, no data will ever be written to this
> > > region again. And worse: when the TTL you set (let's say 1 month) is
> > > over and it's 03/08/2013, this region becomes empty!
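
To make the time-slicing effect concrete, here is a small sketch of how a key of the form <customerId><bucket><timestampInMs><uniqueId> could be laid out as bytes. The field widths (4-byte customerId, 1-byte bucket, 8-byte timestamp, 8-byte uniqueId) are assumptions, since the thread does not give them, and `RowKeys` is a hypothetical helper.

```java
import java.nio.ByteBuffer;

final class RowKeys {
    // Builds the key in big-endian order so that, for non-negative values,
    // byte order matches numeric order field by field.
    static byte[] rowKey(int customerId, byte bucket, long timestampMs, long uniqueId) {
        return ByteBuffer.allocate(4 + 1 + 8 + 8)
                .putInt(customerId)
                .put(bucket)
                .putLong(timestampMs)
                .putLong(uniqueId)
                .array();
    }

    // Unsigned lexicographic comparison, the order in which HBase sorts row keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

With this layout, all rows for one customer and bucket sort by timestamp, so new writes always land at the end of that customer/bucket range and earlier key ranges stop receiving data.
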
> > >
> > > The thing is that you must have your customerId, timestamp in the
> > > rowkey since you query on it, but when you do, you will essentially get
> > > regions which will not get any more writes to them, and after TTL become
> > > zombie regions :)
> > >
> > > The second bad part of this rowkey design is that some customers will
> > > have significantly less traffic than other customers, thus in essence
> > > their regions will get written at a very slow rate compared with the
> > > high-traffic customers. When this happens on the same RS - bam: the slow
> > > region's Puts cause the WAL queue to get bigger over time, since that
> > > region never gets to Max Region Size (256MB in our case), thus never
> > > gets flushed, thus stays in the 1st WAL file. Until when? Until we hit
> > > the maximum number of log files permitted (32), and then regions are
> > > forcibly flushed. When this happens, we get about 100 regions with
> > > 3 KB-3 MB store files. You can imagine what happens next.
> > >
> > > The weirdest thing here is that this rowkey design is very common -
> > > nothing fancy here, so in essence this phenomenon should have happened
> > > to a lot of people - but for some reason, I don't see that much written
> > > about it.
> > >
> > > Thanks!
> > >
> > > Asaf
> > >
> > >
> > >
> > > On Fri, Nov 15, 2013 at 3:51 AM, Jia Wang <[EMAIL PROTECTED]> wrote:
> > >
> > > > Then the case is simple, as I said "check your row key design, you
> > > > can find the start and end row key for each region, from which you can know