Zookeeper >> mail # user >> Re: Efficient backup and a reasonable restore of an ensemble


Re: Efficient backup and a reasonable restore of an ensemble
Agree, we already have such a tool. In fact we use it to reconstruct the
sequence of events that led to a failure and actually restore the system to
a previous stable point and replay the events. Unfortunately this is tied
closely with Helix but it should be easy to make this a generic tool.

Sergey, is this something that would be useful in your case?

Thanks,
Kishore G
On Mon, Jul 8, 2013 at 8:09 PM, Thawan Kooburat <[EMAIL PROTECTED]> wrote:

> On the restore part, I think having a separate utility to manipulate the
> data/snap dir (by truncating the log and removing snapshots beyond a given
> zxid) would be easier than modifying the server.
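Below is a minimal Python sketch of the planning half of such a utility. It illustrates the idea and is not anything that ships with ZooKeeper. It assumes the data dir uses the `snapshot.<hex zxid>` / `log.<hex zxid>` file naming, where a log's suffix is the zxid of its first transaction. `plan_rollback` is a hypothetical name. The sketch only decides two things: which files start after the target zxid and can be deleted outright, and which log straddles the target. Truncating that log at the record level would require parsing the txnlog format, which is not shown.

```python
import re

# Assumed naming: "snapshot.<hex zxid>" and "log.<hex zxid of first txn>".
NAME_RE = re.compile(r"^(snapshot|log)\.([0-9a-fA-F]+)$")


def parse_name(name):
    """Return (kind, zxid) for a snapshot/log file name, or None."""
    m = NAME_RE.match(name)
    if m is None:
        return None
    return m.group(1), int(m.group(2), 16)


def plan_rollback(file_names, target_zxid):
    """Hypothetical helper: plan a rollback of the data dir to target_zxid.

    Returns (delete, truncate): the sorted files that start after the target
    and can be removed, and the one log (or None) that may hold transactions
    on both sides of the target and would have to be truncated.
    """
    delete, logs = [], []
    for name in file_names:
        parsed = parse_name(name)
        if parsed is None:
            continue  # ignore files that do not match the naming pattern
        kind, zxid = parsed
        if zxid > target_zxid:
            delete.append(name)  # snapshot or log entirely newer than target
        elif kind == "log":
            logs.append((zxid, name))
    # The newest log starting at or before the target may run past it.
    truncate = max(logs)[1] if logs else None
    return sorted(delete), truncate
```

For example, with a target of `0x1a0`, the files `snapshot.200` and `log.300` are marked for deletion, and `log.180` is the log that would need truncating.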
>
>
> --
> Thawan Kooburat
>
>
>
>
>
> On 7/8/13 6:34 PM, "kishore g" <[EMAIL PROTECTED]> wrote:
>
> >I think what we are looking at is point-in-time restore functionality.
> >How about adding a feature that says "go back to a specific zxid/timestamp"?
> >This way, before making any change to zookeeper, simply note down the
> >timestamp/zxid on the leader. If things go wrong after making changes, bring
> >down the zookeeper servers and provide an additional zxid/timestamp parameter
> >while restarting. The server can go to that exact point and make it current.
> >The followers can be started blank.
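To make the proposed semantics concrete, here is a toy model in Python. It is not ZooKeeper code. The tree is a plain `dict` mapping path to value, and each logged transaction is a hypothetical `Txn` record carrying a zxid and a timestamp. Restoring means starting from a snapshot and replaying transactions in zxid order until the target zxid or timestamp is passed. The model assumes that timestamps grow with zxid.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Txn:
    zxid: int
    timestamp: float  # time the txn was logged; assumed to grow with zxid
    path: str
    value: str


def restore_to_point(snapshot, snapshot_zxid, txns,
                     target_zxid: Optional[int] = None,
                     target_time: Optional[float] = None):
    """Toy "go back to a specific zxid/timestamp": snapshot + bounded replay.

    Returns (state, last_applied_zxid).
    """
    if (target_zxid is None) == (target_time is None):
        raise ValueError("pass exactly one of target_zxid or target_time")
    if target_zxid is not None and target_zxid < snapshot_zxid:
        raise ValueError("snapshot is newer than the target; use an older one")
    state = dict(snapshot)
    last_zxid = snapshot_zxid
    for txn in sorted(txns, key=lambda t: t.zxid):
        if txn.zxid <= snapshot_zxid:
            continue  # already applied before the snapshot was started
        if target_zxid is not None and txn.zxid > target_zxid:
            break
        if target_time is not None and txn.timestamp > target_time:
            break
        state[txn.path] = txn.value
        last_zxid = txn.zxid
    return state, last_zxid
```

Recording "the timestamp/zxid on the leader" before a change then amounts to remembering the value you would later pass as `target_zxid` or `target_time`.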
> >
> >
> >
> >On Mon, Jul 8, 2013 at 5:53 PM, Thawan Kooburat <[EMAIL PROTECTED]> wrote:
> >
> >> Just saw that this is the use case corresponding to the question posted
> >> in the dev list.
> >>
> >> In order to restore the data to a given point in time correctly, you need
> >> both the snapshot and the txnlog. This is because a zookeeper snapshot is
> >> fuzzy, and the snapshot alone may not represent a valid state of the
> >> server if there are in-flight requests.
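A toy model in Python shows why the snapshot alone is not enough. It does not reproduce ZooKeeper's serializer. The snapshot is labeled with the last zxid applied when copying starts. Writes keep landing while nodes are copied one by one. Replaying every logged transaction after that label then repairs the copy. The replay is safe only because the toy transactions are idempotent "set" operations. All names are hypothetical.

```python
def take_fuzzy_snapshot(live, last_applied_zxid, writes, write_after_key):
    """Copy `live` node by node while `writes` land part-way through.

    Each write is (zxid, path, value). Returns (snapshot_zxid, snapshot).
    """
    snapshot_zxid = last_applied_zxid  # label taken when copying starts
    snapshot = {}
    for path in sorted(live):  # key list is fixed before copying begins
        snapshot[path] = live[path]
        if path == write_after_key:
            for _zxid, p, value in writes:  # server keeps processing writes
                live[p] = value
    return snapshot_zxid, snapshot


def replay(snapshot, snapshot_zxid, txn_log):
    """Apply every logged write newer than the snapshot label."""
    state = dict(snapshot)
    for zxid, path, value in txn_log:
        if zxid > snapshot_zxid:
            state[path] = value  # idempotent "set": re-applying is harmless
    return state
```

Suppose the state at zxid 10 is `{"/a": "a0", "/b": "b0"}` and the writes `(11, "/a", "a1")` and `(12, "/b", "b1")` arrive after `/a` has been copied. The snapshot then holds `{"/a": "a0", "/b": "b1"}`, which matches the tree at none of zxids 10, 11, or 12. Replaying the log from zxid 10 yields the live state.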
> >>
> >> The 4lw command should cause the server to roll the log and take a
> >> snapshot, similar to the periodic snapshotting operation. Your backup
> >> script needs to grab the snapshot and the corresponding txnlog files from
> >> the data dir.
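Below is a Python sketch of the file-selection step for such a backup script. It assumes the `snapshot.<hex zxid>` / `log.<hex zxid>` naming. Its selection rule is modeled on how the server picks logs at load time, as we understand it: keep the newest log whose first zxid is at or below the snapshot's zxid, plus every later log. `files_for_backup` and `backup` are hypothetical helpers. A production script should also confirm that the newest snapshot has been completely written before copying it.

```python
import os
import re
import shutil

# Assumed naming: "snapshot.<hex zxid>" and "log.<hex zxid of first txn>".
NAME_RE = re.compile(r"^(snapshot|log)\.([0-9a-fA-F]+)$")


def files_for_backup(names):
    """Pick the newest snapshot plus the txnlogs needed to replay past it."""
    snaps, logs = [], []
    for name in names:
        m = NAME_RE.match(name)
        if m:
            entry = (int(m.group(2), 16), name)
            (snaps if m.group(1) == "snapshot" else logs).append(entry)
    if not snaps:
        raise ValueError("no snapshot found")
    snap_zxid, snap_name = max(snaps)
    logs.sort()
    # Start at the newest log that began at or before the snapshot zxid;
    # it may hold transactions that were in flight during the snapshot.
    start = 0
    for i, (zxid, _name) in enumerate(logs):
        if zxid <= snap_zxid:
            start = i
    return [snap_name] + [name for _zxid, name in logs[start:]]


def backup(version_dir, dest_dir):
    """Copy the selected files from version_dir (e.g. <dataDir>/version-2)."""
    os.makedirs(dest_dir, exist_ok=True)
    chosen = files_for_backup(os.listdir(version_dir))
    for name in chosen:
        shutil.copy2(os.path.join(version_dir, name), os.path.join(dest_dir, name))
    return chosen
```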
> >>
> >> To restore, just shut down all hosts, clear the data dir, copy over the
> >> snapshot and txnlog, and restart them.
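The restore steps could look roughly like the following Python sketch. Run it on a host only after that host's server has been shut down. It makes two assumptions. First, snapshots and txnlogs live in `<data dir>/version-2`, with no separate `dataLogDir`. Second, only that subdirectory is cleared, so the host's `myid` file survives. `restore` is a hypothetical helper.

```python
import os
import shutil
import tempfile


def restore(backup_dir, data_dir):
    """Replace <data_dir>/version-2 with the files in backup_dir.

    Hypothetical helper: run only while the local server is stopped.
    Assumes no separate dataLogDir; the myid file in data_dir is untouched.
    """
    version_dir = os.path.join(data_dir, "version-2")
    if os.path.isdir(version_dir):
        shutil.rmtree(version_dir)  # "clear the data dir" without losing myid
    os.makedirs(version_dir)
    restored = sorted(os.listdir(backup_dir))
    for name in restored:
        shutil.copy2(os.path.join(backup_dir, name), os.path.join(version_dir, name))
    return restored


if __name__ == "__main__":
    # Demo on a throwaway directory layout.
    with tempfile.TemporaryDirectory() as root:
        backup_dir = os.path.join(root, "backup")
        data_dir = os.path.join(root, "data")
        os.makedirs(backup_dir)
        os.makedirs(os.path.join(data_dir, "version-2"))
        for name in ("snapshot.50", "log.40"):
            open(os.path.join(backup_dir, name), "w").close()
        open(os.path.join(data_dir, "version-2", "log.99"), "w").close()
        open(os.path.join(data_dir, "myid"), "w").close()
        print(restore(backup_dir, data_dir))
```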
> >>
> >>
> >> --
> >> Thawan Kooburat
> >>
> >>
> >>
> >>
> >>
> >> On 7/8/13 3:28 PM, "Sergey Maslyakov" <[EMAIL PROTECTED]> wrote:
> >>
> >> >Thank you for your response, Flavio. I apologize; I did not provide a
> >> >clear explanation of the use case.
> >> >
> >> >This backup/restore is not intended to be tied to any write event;
> >> >instead, it is expected to run as a periodic (daily?) cron job on one of
> >> >the servers, which is not guaranteed to be the leader of the ensemble.
> >> >There is no expectation that all recent changes are committed and
> >> >persisted to disk. The system can sustain the loss of several hours' worth
> >> >of recent changes in the event of a restore.
> >> >
> >> >As for finding the leader dynamically and performing the backup on it,
> >> >this approach could be more difficult, as the leader can change from time
> >> >to time and I still need to fetch the file to store it in my designated
> >> >backup location. Taking the backup on one server and picking it up from
> >> >the local file system looks less error-prone. Even if I went the fancy
> >> >route and had Zookeeper send me the serialized DataTree in response to the
> >> >4lw, this approach would involve a lot of moving parts.
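Locating the leader does not require a new command. The existing `srvr` 4lw replies with a `Mode:` line whose value is `leader`, `follower`, or `standalone`. The Python sketch below assumes plain TCP access to each server's client port and that the server closes the connection after replying. Newer releases may require `srvr` to be allowed through configuration. `four_letter_word`, `parse_mode`, and `find_leader` are hypothetical names.

```python
import socket


def four_letter_word(host, port, cmd=b"srvr", timeout=5.0):
    """Send a four-letter-word command and return the full text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break  # server closes the connection after replying
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_reply):
    """Extract the 'Mode:' value from srvr output, or None if absent."""
    for line in srvr_reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def find_leader(servers):
    """Return the first (host, port) reporting Mode: leader, else None."""
    for host, port in servers:
        try:
            if parse_mode(four_letter_word(host, port)) == "leader":
                return host, port
        except OSError:
            continue  # server down or unreachable; try the next one
    return None
```

Sergey's objection still stands: the leader can change between the `srvr` check and the backup, and the file still has to be fetched from whichever host took it.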
> >> >
> >> >I have already made a PoC for a new 4lw that invokes takeSnapshot() and
> >> >returns the absolute path to the snapshot it drops on disk. I have also
> >> >protected takeSnapshot() from concurrent invocation, which would likely
> >> >corrupt the snapshot file on disk. This approach works, but I'm thinking
> >> >of taking it one step further by passing the desired path name as an
> >> >argument to my new 4lw and having the Zookeeper server drop the snapshot
> >> >into the specified file and report success/failure back. This way I can
> >> >avoid cluttering the data directory and interfering with what Zookeeper
> >> >finds when it scans the data directory.
> >>