-Re: [Shadow Regions / Read Replicas ] Block Affinity
Jonathan Hsieh 2013-12-04, 00:53
On Tue, Dec 3, 2013 at 11:37 AM, Enis Söztutar <[EMAIL PROTECTED]> wrote:
> Responses inlined.
> On Mon, Dec 2, 2013 at 10:00 PM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
> > For the most efficient consistent read-recovery (shadow
> > it would make sense to have them assigned to the rs's where the Hlogs are
> > local. Thus this approach would want to assign shadow regions for regions
> > X, Y, and Z on RS-L and RS-M.
> I don't this this is the case.
Clarification: For achieving the goal of low latency for recovery using the
fewest resources (efficiency for this goal) this is the design point that
best achieves the goal.
> Recovery is a multi step process, and
> reading and
> applying the log is only one step.
Yes and the replay to recover read consistency seems to be one of the more
> After the region is opened, you
> definitely want
> the data files to be local as much as possible.
Ideal but not necessary for correctness or faster consistent read recovery.
Today we don't have data files local as much as possible and normal hbase
user's can't use the feature yet.
> Considering the relative
> sizes of
> the files and the WALs, I think we will always want to use hdfs affinity
> groups for
> hfiles rather than hlogs to assign secondary replicas. This will help both
> stale reads
> and local reads in case of a promotion to primary.
This is why i'm suggesting separating the two functions (read replica and
shadow memstore) to separate logical types with different placement
I agree with you on selecting hfile related region servers for read
replicas, (which I hadn't fully considered when I was working on the shadow
memstore writeup). However for minimizing recovery time replay is more
costly (either the n^2 tailers).
I don't think hlog tailers should be coupled to the hfile affinity groups
for the same reasons as you -- n^2 tailers where n is # of regions.
> > A simple optimal solution for both read replicas and shadow regions would
> > be to assign the regions and the HLog to the same set of machines so that
> > the RS's for the logs and region x, y, and z hosted are on the same
> > machines -- let's say RS-A, RS-H, and RS-I. This has some non-optimal
> > balancing ramifications upon machine failure -- the work of RS-A would be
> > split between to RS-H and RS-I.
> I don't think we want this. This implies that we are creating region
> assignment groups ( group-based
> assignment as described in the doc). The problem is that in case of a
> crash, you cannot evenly
> distribute out the regions from the primary otherwise you will still end up
> tailing all the logs for
> all the region servers. Plus if you want to load balance, it will be even
> harder to satisfy the constraints while
> keeping the balance.
> In your example, if you have replication=2 for example, we cannot simply
> move all the primary regions
> of RS-A to RS-H, which will then suddenly have twice the number of regions.
I think a realistic use would be to set replication 3 since we have three
replicas of the logs. Instead the client would just choose to hit the
first two replicas (rep0=primary and rep1=secondary) to reduce the memory
pressure on the 3rd node (rep=secondary, not read from).
Here's an extension to this hybrid approach which potentially buys us both
good recency and high availability (at the cost of poor balance if we
enforce optimal locality). We essentially group a set of regions to the
same three RS's for all regions, and hlogs in that group in a little cycle.
Region X on RS-A (rep=0), RS-B(rep=1), RS-C(rep=2).
Region Y on RS-A (rep=1), RS-B(rep=2), RS-C(rep=0).
Region Z on RS-A (rep=2), RS-B(rep=0), RS-C(rep=1).
RS-A's log on RS-A, RS-B, RS-C.
RS-B's log on RS-B, RS-C, RS-A.
RS-C's log on RS-C, RS-A, RS-B.
If any RS's go down the load is spread between the other two in the group.
> > A more complex solution for both would be to choose machines for the
I think there is a misunderstanding here -- clarifying. In this combined
approach, we have a pool of RS's, each of which will can host a combination
of primary regions, secondary read replica regions, and shadow memstore
regions. If we don't have separate out the shadow memstore regions from
the secondary read replicas, we end up with the inefficient design implied
in the read replica write up -- where all region servers need to
essentially read all hlogs from the other region servers causing n^2
tailers across the region instead of n or 2n tailers.
We will want to have different metrics for monitoring and log for the read
replicas and the shadow memstore. Ideally we'd know how far behind we are.
In the combined approach, this isn't a move here -- the flush is completed
by the shadow to the nodes that are assigned as the secondaries than we
promote the secondary to primary by closing and then opening the region as
Region X on RS-A (rep=0), RS-B(rep=1), RS-C(rep=2).
Region Y on RS-A (rep=0), RS-D(rep=1), RS-E(rep=2).
RS-A's log on RS-A, RS-F, RS-G.
RS-F and RS-G shadow RS-A's Hlog.
RS-A goes down.
RS-F and RS-G catch up to the end of RS-A's HLog.
RS-F flushes the region X shadow memstore to nodes RS-B, RS-C (and some
RS-G flushes the region Y shadow memstore to nodes RS-D, RS-E (and some
Master promotes region X on secondary RS-B (closeReplica, open X) with all
its stores local to RS-B and RS-C.
Master promotes region Y on secondary RS-D (closeReplica, open Y) with all
its stores local to RS-D and RS-E.
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]