> > > I am late to the game so take my comments w/ a grain of salt -- I'll
> > > take a look at HBASE-10070 -- but high-level do we have to go the read
> > > route? IMO, having our current already-strained AssignmentManager code
> > > base manage three replicas instead of one will ensure that Jimmy Xiang
> > > and Jeffrey Zhong do nothing else for the next year or two but work on
> > > the new interesting use cases introduced by this new level of complexity
> > > put upon a system that has just achieved a hard-won stability.
> > >
> > Stack, the model is that the replicas (HRegionInfo with an added field
> > 'replicaId') are treated just as any other region in the AM. You can
> > see the code - it's not adding much at all in terms of new code to
> > handle replicas.
Adding to what Devaraj said, we opted for creating one more region per
replica so that the assignment state machine is not affected. The
high-level change is that we create replicaCount x numRegions regions and
assign them all. The LB ensures that replicas are placed for high
availability across hosts and racks.
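To make the model concrete for readers following along: each replica is just another region carrying a replicaId, so the assignment machinery sees replicaCount x numRegions ordinary regions. The sketch below is illustrative only; ReplicaInfo and makeReplicas are invented stand-ins, not the actual HBase HRegionInfo code.

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicaModel {
    // Stand-in for HRegionInfo plus the added 'replicaId' field.
    public static final class ReplicaInfo {
        public final String regionName;
        public final int replicaId;          // 0 = primary, 1..n-1 = replicas
        public ReplicaInfo(String regionName, int replicaId) {
            this.regionName = regionName;
            this.replicaId = replicaId;
        }
        public boolean isPrimary() { return replicaId == 0; }
    }

    // Expand the base regions into replicaCount copies; each copy is then
    // assigned like any other region, so the state machine is untouched.
    public static List<ReplicaInfo> makeReplicas(List<String> regions,
                                                 int replicaCount) {
        List<ReplicaInfo> out = new ArrayList<>();
        for (String r : regions) {
            for (int id = 0; id < replicaCount; id++) {
                out.add(new ReplicaInfo(r, id));
            }
        }
        return out;
    }
}
```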
> > > A few of us chatting offline -- Jimmy, Jon, Elliott, and I -- were
> > > wondering if you couldn't solve this read-replicas problem in a more
> > > hbase 'native' way* by just bringing up three tables -- a main table
> > > and then two snapshot clones with the clones refreshed on a period (via
> > > snapshot or via in-cluster replication) -- and then a shim on top of an
> > > HBase client would read from the main table until failure and then from
> > > a snapshot until main came back. Reads from snapshot tables could be
> > > marked 'stale'. You'd have to modify the balancer so the tables -- or
> > > at least their regions -- were physically distinct... you might be able
> > > to just have the three each in a different namespace.
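For readers, the shim proposed in the quote above could look roughly like this. This is a hedged sketch under invented names (StaleResult, ShimClient), not real HBase client code: it reads from the main table first and falls through to the snapshot clones on failure, marking those answers stale.

```java
import java.util.List;
import java.util.function.Function;

public class ShimClient {
    public static final class StaleResult {
        public final String value;
        public final boolean stale;   // true when served from a snapshot clone
        public StaleResult(String value, boolean stale) {
            this.value = value;
            this.stale = stale;
        }
    }

    // sources.get(0) is the main table; the rest are snapshot clones,
    // ordered from freshest to stalest.
    public static StaleResult get(List<Function<String, String>> sources,
                                  String row) {
        for (int i = 0; i < sources.size(); i++) {
            try {
                return new StaleResult(sources.get(i).apply(row), i > 0);
            } catch (RuntimeException copyDown) {
                // fall through to the next (staler) copy
            }
        }
        throw new RuntimeException("all copies unavailable for row " + row);
    }
}
```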
Doing region replicas via tables vs multiplying the num regions will
involve a very similar amount of code changes. The LB still has to be
aware that regions of the different tables should not be co-hosted. As per
above, in neither case is the assignment state machine altered. However,
with different tables it will be unintuitive, since the meta and the
client side would have to tie together regions of different tables to make
sense of them. Those tables would not have any associated data of their
own, but would refer to the other tables, etc.
> > >
> > At a high level, considering all the work that would be needed in the
> > client (for it to be able to be aware of the primary and the snapshot
> > regions)
> Minor. Right? Snapshot tables would have a _snapshot suffix?
> > and in the master (to do with managing the placements of the
> > regions),
> Balancer already factors myriad attributes. Adding one more rule seems
> like it would be narrow in scope.
> And this would be work not in the client but in a layer above the client.
> > I am not convinced. Also, consider that you will be taking a
> > lot of snapshots and adding to the filesystem's load for the file
> > creations.
> Snapshotting is a well-worn and tested code path. Making them is a pretty
> lightweight op. Frequency would depend on what the app needs.
> Could go the replication route too, another well-worn and tested code
> path. Trying to minimize the new code getting to the objective.
I think these should be addressed by the region-changes section in the
design doc. In the region-snapshots section, we detail how this will work
like single-region snapshots. We do not need table snapshots per se, since
we are opening the region replica from the files of the primary. There is
already a working patch for this in the branch. In the async-wal
replication section, we mention how this can be built using the existing
replication mechanism. We cannot directly replicate to a different table,
since we do not want to multiply the actual data in hdfs. But we will tap
into the replica sink to do the in-cluster replication.
> That won't happen without major architecture surgery in HBase.
HBASE-10070 is some major work, but it is in no way a major arch change, I
would say. Hydrabase / megastore is also across DCs, while we are mostly
interested in intra-DC availability right now.
I think the Consistency API from the client and the shell is intuitive and
can be configured per request, which is the expected behavior.
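To illustrate the per-request nature of the knob: in HBASE-10070 the client chooses consistency per Get/Scan (Consistency.TIMELINE vs the default STRONG), and stale answers are flagged via Result.isStale(). The classes below are simplified stand-ins modeling that contract, not the real org.apache.hadoop.hbase.client API.

```java
public class ConsistencyDemo {
    public enum Consistency { STRONG, TIMELINE }

    // Minimal model of a read request carrying its own consistency level.
    public static final class Get {
        public final String row;
        private Consistency consistency = Consistency.STRONG;  // default
        public Get(String row) { this.row = row; }
        public Get setConsistency(Consistency c) {
            this.consistency = c;
            return this;
        }
        public Consistency getConsistency() { return consistency; }
    }

    // A STRONG read may only be served by the primary (replicaId 0);
    // a TIMELINE read may be served by any replica, possibly stale.
    public static boolean mayServe(Get get, int replicaId) {
        return get.getConsistency() == Consistency.TIMELINE || replicaId == 0;
    }
}
```

The design point is that consistency is a property of the request, not of the table or the connection, so strongly-consistent and timeline reads can be mixed freely against the same table.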