HBase >> mail # user >> Uneven write request to regions


Asaf Mesika 2013-11-14, 08:59
Jia Wang 2013-11-14, 10:06
Asaf Mesika 2013-11-14, 12:47
Jia Wang 2013-11-15, 01:51
Bharath Vissapragada 2013-11-15, 03:39
Asaf Mesika 2013-11-15, 21:28
Ted Yu 2013-11-15, 21:34
Asaf Mesika 2013-11-16, 05:56
Ted Yu 2013-11-16, 06:16
Asaf Mesika 2013-11-16, 18:41
Mike Axiak 2013-11-16, 17:25
Asaf Mesika 2013-11-16, 19:16
Himanshu Vashishtha 2013-11-20, 01:05
Asaf Mesika 2013-11-20, 06:00
Otis Gospodnetic 2013-11-20, 15:43
Tom Brown 2013-11-20, 17:04
Asaf Mesika 2013-11-20, 17:14
Ted Yu 2013-11-20, 17:17
Asaf Mesika 2013-11-20, 17:01
Re: Uneven write request to regions
We have solved this by prefixing each key with a single byte. The byte is
based on a very simple 8-bit hash of the record. If you know exactly which
row you are looking for you can rehash your row to create the true key.

Scans are a little more complex because you have to issue 256 scans instead
of one and merge the results.
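The single-byte salting scheme described above can be sketched roughly as follows; the class name and the choice of hash function (XOR of the key bytes) are illustrative assumptions, not details from the thread:

```java
// Sketch of a 1-byte salt prefix for HBase rowkeys. The prefix is a
// deterministic 8-bit hash of the logical key, so a point read can
// recompute it; a full scan must fan out over all 256 prefixes.
public class SaltedKey {

    // Prepend a single salt byte derived from the logical row key.
    public static byte[] salt(byte[] logicalKey) {
        byte[] salted = new byte[logicalKey.length + 1];
        salted[0] = (byte) hash8(logicalKey);
        System.arraycopy(logicalKey, 0, salted, 1, logicalKey.length);
        return salted;
    }

    // A very simple 8-bit hash: XOR of all key bytes.
    // (One of many possible choices; the thread doesn't specify one.)
    public static int hash8(byte[] key) {
        int h = 0;
        for (byte b : key) {
            h ^= (b & 0xFF);
        }
        return h & 0xFF;
    }
}
```

Because the salt is a function of the key itself, writes spread across up to 256 key ranges while lookups stay cheap: rehash the logical key, prepend the byte, and issue a normal Get.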

Another thing we did is write a utility that computes all the region sizes in
a list and recommends merges of now-empty regions and splits of hot regions.
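The recommendation step of such a utility might look like the sketch below. This assumes the region sizes have already been fetched from the cluster (e.g. via the HBase admin API); the class name, thresholds, and output shape are all hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical advisor: given region name -> size in bytes, flag
// near-empty regions as merge candidates and oversized regions as
// split candidates. Fetching the sizes themselves is out of scope here.
public class RegionAdvisor {

    public static Map<String, List<String>> recommend(Map<String, Long> regionSizes,
                                                      long mergeThreshold,
                                                      long splitThreshold) {
        List<String> merges = new ArrayList<>();
        List<String> splits = new ArrayList<>();
        for (Map.Entry<String, Long> e : regionSizes.entrySet()) {
            if (e.getValue() <= mergeThreshold) {
                merges.add(e.getKey());          // now-empty region
            } else if (e.getValue() >= splitThreshold) {
                splits.add(e.getKey());          // hot/oversized region
            }
        }
        Map<String, List<String>> out = new HashMap<>();
        out.put("merge", merges);
        out.put("split", splits);
        return out;
    }
}
```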

Together, those two items solve the problem quite nicely for us. We haven't
quite got to your scale yet, so YMMV.

--Tom

On Friday, November 15, 2013, Ted Yu wrote:

> bq. you must have your customerId, timestamp in the rowkey since you query
> on it
>
> Have you looked at this API in Scan ?
>
>   public Scan setTimeRange(long minStamp, long maxStamp)
>
>
> Cheers
>
>
> On Fri, Nov 15, 2013 at 1:28 PM, Asaf Mesika <[EMAIL PROTECTED]>
> wrote:
>
> > The problem is that I do know my rowkey design, and it follows people's
> > best practice, but generates a really bad use case which I can't seem to
> > know how to solve yet.
> >
> > The rowkey as I said earlier is:
> > <customerId><bucket><timestampInMs><uniqueId>
> > So when, for example, you have 1,000 customers and the bucket ranges from
> > 1 to 16, you eventually end up with:
> > * 30k regions - what happens, I presume, is: you start with one region
> > hosting ALL customers. As you pour in more customers and more data,
> > region splitting kicks in. So, after a while, you get to a situation in
> > which most regions host a specific customerId, bucket and time duration.
> > For example: customer #10001, bucket 6, 01/07/2013 00:00 -
> > 02/07/2013 17:00.
> > * Empty regions - the first really bad consequence of the above is that
> > when the time duration is over, no data will ever be written to this
> > region. And worse - when the TTL you set (let's say 1 month) has passed
> > and it's 03/08/2013, this region becomes empty!
> >
> > The thing is that you must have customerId and timestamp in the rowkey
> > since you query on them, but when you do, you essentially get regions
> > which will never receive any more writes, and which after the TTL become
> > zombie regions :)
> >
> > The second bad part of this rowkey design is that some customers will
> > have significantly less traffic than others, so in essence their regions
> > get written to at a very slow rate compared with the high-traffic
> > customers. When this happens on the same RS - bam: the slow region's Puts
> > cause the WAL queue to grow over time, since its region never reaches the
> > max region size (256MB in our case), thus never gets flushed, and thus
> > stays in the 1st WAL file. Until when? Until we hit the maximum number of
> > log files permitted (32), and then regions are forcibly flushed. When
> > this happens, we get about 100 regions with 3KB-3MB store files. You can
> > imagine what happens next.
> >
> > The weirdest thing here is that this rowkey design is very common -
> > nothing fancy here - so in essence this phenomenon should have happened
> > to a lot of people, but for some reason I don't see much written about
> > it.
> >
> > Thanks!
> >
> > Asaf
> >
> >
> >
> > On Fri, Nov 15, 2013 at 3:51 AM, Jia Wang <[EMAIL PROTECTED]> wrote:
> >
> > > Then the case is simple, as I said: "check your row key design; you
> > > can find the start and end row key for each region, from which you can
> > > know why your request with a specific row key doesn't hit a specified
> > > region"
> > >
> > > Cheers
> > > Ramon
> > >
> > >
> > > On Thu, Nov 14, 2013 at 8:47 PM, Asaf Mesika <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > It's from the same table.
> > > > The thing is that some <customerId> simply have less data saved in
> > > > HBase, while others have up to 50x more.
> > > > I'm trying to check how people designed their rowkey around this, or
> > > > had other out-of-the-box solutions for it.
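The rowkey layout discussed throughout the thread, <customerId><bucket><timestampInMs><uniqueId>, could be built like this; the fixed field widths (8-byte ids, 1-byte bucket) are assumptions, since the thread never specifies them:

```java
import java.nio.ByteBuffer;

// Sketch of the composite rowkey from the thread:
// <customerId><bucket><timestampInMs><uniqueId>.
// Field widths (8 + 1 + 8 + 8 = 25 bytes) are assumed, not from the thread.
public class RowKeys {

    public static byte[] build(long customerId, byte bucket,
                               long tsMillis, long uniqueId) {
        return ByteBuffer.allocate(25)
                .putLong(customerId)   // leading field groups a customer's data
                .put(bucket)           // spreads one customer over N key ranges
                .putLong(tsMillis)     // keeps rows time-ordered within a bucket
                .putLong(uniqueId)     // disambiguates same-millisecond rows
                .array();
    }
}
```

With customerId leading, all of one customer's writes for a given bucket and time window land in the same region, which is exactly the uneven-write and zombie-region behavior the thread describes.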