Re: Efficient backup and a reasonable restore of an ensemble
Does someone have the answers for Sergey's questions?

I want to make sure I am fully understanding the procedures of zookeeper
backup and disaster recovery:

For the backup procedure on a ZooKeeper ensemble:
(1) Log in to any host whose state is "Serving".
                  Do I have to log in to the leader node, or is any node OK?
(2) Copy the latest snapshot file and transaction log from the version-2
directory.
                  How do we make sure we do not copy corrupt files if the
snapshot/transaction log is in the middle of an update? Do we have to shut
down the node to make the copy?
                  Besides the transaction log and snapshot, do we have to
copy other files, such as the epoch files?
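
To make step (2) concrete, here is a minimal sketch in Python of how a backup
script might choose which files to copy. It assumes the usual file naming in
version-2, `snapshot.<zxid>` and `log.<zxid>` with the zxid written in hex,
and it assumes that the logs needed alongside a snapshot are the newest log
that starts at or before the snapshot's zxid plus every later log. The helper
name `pick_backup_files` is made up for illustration. It does not address the
question of files that are in the middle of an update.

```python
import re

# Matches ZooKeeper data file names such as "snapshot.1a2b" or "log.1a2c".
_NAME = re.compile(r"^(snapshot|log)\.([0-9a-fA-F]+)$")


def pick_backup_files(names):
    """Return (snapshot_name, [log_names]) for the newest snapshot.

    `names` is a listing of the version-2 directory (or of the snapshot and
    log directories combined, if they are kept apart). Files that are
    neither snapshots nor logs (for example the epoch files) are ignored.
    """
    snaps, logs = [], []
    for name in names:
        m = _NAME.match(name)
        if not m:
            continue
        zxid = int(m.group(2), 16)  # the suffix is a hex zxid
        (snaps if m.group(1) == "snapshot" else logs).append((zxid, name))
    if not snaps:
        raise ValueError("no snapshot file found")

    snap_zxid, snap_name = max(snaps)

    # Assumption: the newest log that starts at or before the snapshot zxid
    # may still hold transactions the fuzzy snapshot needs, so keep it and
    # everything after it.
    older = [z for z, _ in logs if z <= snap_zxid]
    start = max(older) if older else 0
    needed = sorted((z, n) for z, n in logs if z >= start)
    return snap_name, [n for _, n in needed]
```

A script would call it with something like `os.listdir(version_2_dir)` and
then copy the returned names to the backup location.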

For the disaster recovery procedure on a ZooKeeper ensemble:
(1) Recreate the machines for the ZooKeeper ensemble.
(2) Copy the snapshot/transaction log we backed up into the ZooKeeper
dataDir\version-2 and dataLogDir\version-2.
                 Do we have to copy the epoch files?
                 Do we have to copy the backed-up snapshot/transaction log to
all the ZooKeeper nodes, or just to the first node we start?
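
For restore step (2), a rough per-node sketch in Python follows. It assumes
the server on that node is stopped, that the backup directory holds only the
`snapshot.*` and `log.*` files to restore, and that snapshots belong under
dataDir/version-2 and logs under dataLogDir/version-2 (or both under dataDir
when no separate log directory is configured). The function name
`restore_node` and its parameters are hypothetical. Note that it wipes
version-2 entirely, epoch files included, which is only one possible answer
to the epoch-file question; the myid file sits outside version-2 and is left
alone.

```python
import shutil
from pathlib import Path


def restore_node(backup_dir, data_dir, data_log_dir=None):
    """Replace a stopped node's version-2 contents with backed-up files.

    Returns the list of destination paths that were written.
    """
    backup = Path(backup_dir)
    snap_target = Path(data_dir) / "version-2"
    # Without a separate log directory, logs live next to the snapshots.
    log_target = Path(data_log_dir or data_dir) / "version-2"

    # Assumption: start from an empty version-2 (this also drops epoch files).
    for target in {snap_target, log_target}:
        if target.exists():
            shutil.rmtree(target)
        target.mkdir(parents=True)

    copied = []
    for src in sorted(backup.iterdir()):
        if src.name.startswith("snapshot."):
            dest = snap_target / src.name
        elif src.name.startswith("log."):
            dest = log_target / src.name
        else:
            continue  # ignore anything that is not a snapshot or a log
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied
```

The same call would be repeated on each node that is to be seeded from the
backup before the servers are restarted.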

Appreciate your time and help.
On Mon, Jul 8, 2013 at 9:25 PM, Sergey Maslyakov <[EMAIL PROTECTED]> wrote:

> These are interesting points, Thawan. I'd like to make sure that I get them
> right.
> 1. Are you saying that a snapshot file may not be sufficient to restore
> Zookeeper to a consistent state? Does it always require a transaction log
> file or is it required to get to the most current state? I was hoping that
> a snapshot is self-sufficient to do a restore to recent but not necessarily
> most current state. Was I wrong?
> 2. Do you suggest that the same pair of a snapshot (and a transaction log)
> needs to be copied to all servers before they are brought online? Then what
> about the "epoch" files? Do they need to be purged, preserved, or the same
> ones populated across the whole ensemble?
> On Mon, Jul 8, 2013 at 7:53 PM, Thawan Kooburat <[EMAIL PROTECTED]> wrote:
> > Just saw that this is the corresponding use case to the question posted
> > on the dev list.
> >
> > In order to restore the data to a given point in time correctly, you need
> > both snapshot and txnlog. This is because zookeeper snapshot is fuzzy and
> > snapshot alone may not represent a valid state of the server if there are
> > in-flight requests.
> >
> > The 4wl command should cause the server to roll the log and take a
> > snapshot, similar to the periodic snapshotting operation. Your backup
> > script needs to grab the snapshot and the corresponding txnlog file from
> > the data dir.
> >
> > To restore, just shut down all hosts, clear the data dir, copy over the
> > snapshot and txnlog, and restart them.
> >
> >
> > --
> > Thawan Kooburat
> >
> >
> >
> >
> >
> > On 7/8/13 3:28 PM, "Sergey Maslyakov" <[EMAIL PROTECTED]> wrote:
> >
> > >Thank you for your response, Flavio. I apologize, I did not provide a
> > >clear explanation of the use case.
> > >
> > >This backup/restore is not intended to be tied to any write event,
> > >instead, it is expected to run as a periodic (daily?) cron job on one of
> > >the servers, which is not guaranteed to be the leader of the ensemble.
> > >There is no expectation that all recent changes are committed and
> > >persisted to disk. The system can sustain the loss of several hours worth
> > >of recent changes in the event of restore.
> > >
> > >As for finding the leader dynamically and performing backup on it, this
> > >approach could be more difficult as the leader can change from time to
> > >time and I still need to fetch the file to store it in my designated
> > >backup location. Taking backup on one server and picking it up from a
> > >local file system looks less error-prone. Even if I went the fancy route
> > >and had Zookeeper send me the serialized DataTree in response to the 4wl,
> > >this approach would involve a lot of moving parts.
> > >
> > >I have already made a PoC for a new 4wl that invokes takeSnapshot() and