Re: Uneven write request to regions
Not with the amount of data we have. We store roughly 50TB of data for this
table across 30 RS. Since we use the default max region size (10GB) and the
default split policy, we end up with roughly 10k regions containing data and
20k empty regions (regions whose rowkey time range has already passed, as
explained in previous replies).
So I guess when we started ingesting data we had one region per customer, but
given the overall volume we quickly reached a point where each region covered
a specific customer id and bucket (out of 16 buckets), and then, after a
while, a specific date range within that bucket.
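
For reference, a minimal, hypothetical sketch of how a rowkey of the form
<customerId><bucket><timestampInMs><uniqueId> (spelled out further down this
thread) might be assembled with the HBase Bytes utility; the field widths and
types are assumptions for illustration, not something stated in the thread:

    // Hypothetical rowkey builder: <customerId><bucket><timestampInMs><uniqueId>.
    // Widths are illustrative: 8-byte customerId, 1-byte bucket, 8-byte timestamp, 8-byte uniqueId.
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeys {
      public static byte[] rowKey(long customerId, byte bucket, long timestampMs, long uniqueId) {
        return Bytes.add(
            Bytes.add(Bytes.toBytes(customerId), new byte[] { bucket }),
            Bytes.add(Bytes.toBytes(timestampMs), Bytes.toBytes(uniqueId)));
      }
    }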

On Sat, Nov 16, 2013 at 8:16 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> bq. all regions of that customer
>
> Since the rowkey starts with <customerId>, any single customer would only
> span a few regions (normally 1 region), right?
>
>
> On Fri, Nov 15, 2013 at 9:56 PM, Asaf Mesika <[EMAIL PROTECTED]>
> wrote:
>
> > But when you read, you have to hit all the regions of that customer,
> > instead of pinpointing just the one that contains, for example, the hour
> > you want.
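
To make that fan-out concrete, a rough sketch (only the key layout and the
16-bucket count come from the thread; everything else is assumed): reading one
customer's data for a time window means one bounded scan per bucket rather
than a single pinpointed scan, and each of those scans may still cross several
regions.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Scan;

    public class CustomerScans {
      // One bounded Scan per bucket for a single customer and time window.
      // RowKeys.rowKey is the hypothetical key builder sketched earlier on this page.
      public static List<Scan> scansForCustomer(long customerId, long startMs, long endMs) {
        List<Scan> scans = new ArrayList<Scan>();
        for (byte bucket = 0; bucket < 16; bucket++) {      // 16 buckets, per the thread
          byte[] start = RowKeys.rowKey(customerId, bucket, startMs, 0L);
          byte[] stop  = RowKeys.rowKey(customerId, bucket, endMs, Long.MAX_VALUE);
          scans.add(new Scan(start, stop));                 // stop row is exclusive
        }
        return scans;
      }
    }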
> >
> > On Friday, November 15, 2013, Ted Yu wrote:
> >
> > > bq. you must have your customerId, timestamp in the rowkey since you
> > > query on it
> > >
> > > Have you looked at this API in Scan?
> > >
> > >   public Scan setTimeRange(long minStamp, long maxStamp)
> > >
> > >
> > > Cheers
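
As a minimal sketch of that suggestion (row bounds and names here are
placeholders): setTimeRange restricts a scan by cell timestamps on the server
side, independently of any timestamp embedded in the rowkey.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Scan;

    public class TimeBoundedScan {
      // Bound the scan by row range and by cell-timestamp range [minStamp, maxStamp).
      public static Scan build(byte[] startRow, byte[] stopRow, long minStamp, long maxStamp)
          throws IOException {
        Scan scan = new Scan(startRow, stopRow);
        scan.setTimeRange(minStamp, maxStamp);
        return scan;
      }
    }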
> > >
> > >
> > > On Fri, Nov 15, 2013 at 1:28 PM, Asaf Mesika <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > The problem is that I do know my rowkey design, and it follows accepted
> > > > best practices, but it produces a really bad situation that I can't
> > > > seem to figure out how to solve yet.
> > > >
> > > > The rowkey, as I said earlier, is:
> > > > <customerId><bucket><timestampInMs><uniqueId>
> > > > So when, for example, you have 1,000 customers and buckets ranging from
> > > > 1 to 16, you eventually end up with:
> > > > * 30k regions - What happens, as I presume, is this: you start with one
> > > > region hosting ALL customers. As you pour in more customers and more
> > > > data, region splitting kicks in. So, after a while, you get to a
> > > > situation in which most regions host a specific customerId, bucket and
> > > > time range. For example: customer #10001, bucket 6, 01/07/2013 00:00 -
> > > > 02/07/2013 17:00.
> > > > * Empty regions - the first really bad consequence of the above is that
> > > > once that time range has passed, no data will ever be written to this
> > > > region again. And worse - when the TTL you set (let's say 1 month) has
> > > > expired and it's 03/08/2013, this region becomes empty!
> > > >
> > > > The thing is that you must have your customerId and timestamp in the
> > > > rowkey since you query on them, but when you do, you essentially get
> > > > regions that will never receive any more writes and that, after the
> > > > TTL, become zombie regions :)
> > > >
> > > > The second bad part of this rowkey design is that some customers will
> > > > have significantly less traffic than others, so in essence their
> > > > regions get written to at a very slow rate compared with the
> > > > high-traffic customers. When this happens on the same RS - bam: the
> > > > slow region's Puts cause the WAL queue to grow over time, since that
> > > > region never reaches the flush size (256MB in our case), thus never
> > > > gets flushed, and thus stays pinned in the 1st WAL file. Until when?
> > > > Until we hit the maximum number of WAL files permitted (32), and then
> > > > regions are flushed forcibly. When this happens, we get about 100
> > > > regions with 3KB-3MB store files. You can imagine what happens next.
> > > >
> > > > The weirdest thing here is that this rowkey design is very common -
> > > > nothing fancy here, so in essence this phenomenon should have happened
> > > > to a lot of people - but for some reason, I don't see that much writing
> > > > about
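
A back-of-the-envelope sketch of the WAL pressure described a few paragraphs
above: only the 256MB flush threshold and the 32-WAL-file limit come from the
thread; the quiet-customer write rate is a made-up number purely for
illustration.

    public class WalPressureSketch {
      public static void main(String[] args) {
        long flushThresholdBytes = 256L * 1024 * 1024;     // 256MB flush threshold, per the thread
        int maxWalFiles = 32;                              // WAL file limit, per the thread
        long quietRegionBytesPerHour = 2L * 1024 * 1024;   // assumption: ~2MB/h for a low-traffic customer
        double hoursToFlushOnItsOwn = (double) flushThresholdBytes / quietRegionBytesPerHour; // ~128h
        // The quiet region would need days to reach the flush threshold on its own, so its edits
        // stay pinned in the oldest WAL files; the region server hits the WAL-file limit first and
        // force-flushes many such regions, producing the tiny (3KB-3MB) store files described above.
        System.out.printf("Hours for the quiet region to reach the flush threshold: %.0f%n",
            hoursToFlushOnItsOwn);
        System.out.printf("WAL file limit that forces the flush much earlier: %d%n", maxWalFiles);
      }
    }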