Zookeeper, mail # user - Re: Efficient backup and a reasonable restore of an ensemble


Sergey Maslyakov 2013-07-08, 22:28
Thank you for your response, Flavio. I apologize for not providing a clear
explanation of the use case.

This backup/restore is not intended to be tied to any write event; instead,
it is expected to run as a periodic (daily?) cron job on one of the
servers, which is not guaranteed to be the leader of the ensemble. There is
no expectation that all recent changes are committed and persisted to disk.
The system can sustain the loss of several hours' worth of recent changes in
the event of a restore.
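A minimal sketch of such a periodic job, assuming the snapshot files live in `<dataDir>/version-2` under ZooKeeper's `snapshot.<zxid in hex>` naming and that the job simply copies the newest one to a backup location. The function names and any paths you pass in are hypothetical. Two caveats worth keeping in mind: the newest file may still be in the middle of being written when the job runs, and ZooKeeper snapshots are fuzzy (normally made consistent by replaying the transaction logs written after them), so copying the matching `log.*` files too may be worth considering.

```python
import shutil
from pathlib import Path
from typing import Optional


def snapshot_zxid(path: Path) -> Optional[int]:
    """Return the zxid encoded in a 'snapshot.<hex zxid>' file name, or None."""
    prefix, _, suffix = path.name.partition(".")
    if prefix != "snapshot":
        return None
    try:
        return int(suffix, 16)
    except ValueError:
        return None  # e.g. a stray 'snapshot.tmp' that is not a real snapshot


def newest_snapshot(snapshot_dir: Path) -> Path:
    """Pick the snapshot file with the highest zxid in snapshot_dir."""
    candidates = []
    for path in snapshot_dir.glob("snapshot.*"):
        zxid = snapshot_zxid(path)
        if zxid is not None and path.is_file():
            candidates.append((zxid, path))
    if not candidates:
        raise FileNotFoundError(f"no snapshot files in {snapshot_dir}")
    return max(candidates)[1]


def backup_newest_snapshot(snapshot_dir: Path, backup_dir: Path) -> Path:
    """Copy the newest snapshot into backup_dir, keeping its original name."""
    source = newest_snapshot(snapshot_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    destination = backup_dir / source.name
    shutil.copy2(source, destination)  # copy2 also preserves the mtime
    return destination
```

A cron entry would then call something like `backup_newest_snapshot(Path("/var/lib/zookeeper/version-2"), Path("/backups/zookeeper"))`, where both paths are placeholders for your own `dataDir` and backup target. Comparing zxids numerically rather than by file modification time avoids surprises if files are ever copied around with new timestamps.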

As for finding the leader dynamically and performing the backup on it, this
approach could be more difficult, as the leader can change from time to time
and I still need to fetch the file to store it in my designated backup
location. Taking a backup on one server and picking it up from the local file
system looks less error-prone. Even if I went the fancy route and had
Zookeeper send me the serialized DataTree in response to the 4lw, that
approach would involve a lot of moving parts.
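For comparison, locating the leader is mechanically simple: the built-in `srvr` four-letter word returns a status report that includes a `Mode:` line (`leader`, `follower`, `standalone`). The sketch below assumes a plain TCP connection to each server's client port (2181 by default here; adjust to your `clientPort`) and a host list you supply; the helper names are hypothetical. It does not remove the objection above: the leader can change between this check and the backup.

```python
import socket
from typing import Iterable, Optional


def four_letter_word(host: str, port: int, cmd: str, timeout: float = 5.0) -> str:
    """Send a four-letter word such as 'srvr' and return the whole reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        # The server writes its answer and then closes the connection.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def server_mode(srvr_output: str) -> Optional[str]:
    """Return the value of the 'Mode:' line, e.g. 'leader' or 'follower'."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def find_leader(hosts: Iterable[str], port: int = 2181) -> Optional[str]:
    """Ask each host in turn; return the first one that reports leader."""
    for host in hosts:
        try:
            if server_mode(four_letter_word(host, port, "srvr")) == "leader":
                return host
        except OSError:
            continue  # unreachable or timed out: try the next server
    return None
```

For example, `find_leader(["zk1", "zk2", "zk3"])`, with placeholder host names, returns the first server that currently reports itself as leader, or `None` if none answers that way.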

I have already made a PoC for a new 4lw that invokes takeSnapshot() and
returns the absolute path of the snapshot it drops on disk. I have also
protected takeSnapshot() from concurrent invocation, which would likely
corrupt the snapshot file on disk. This approach works, but I'm thinking of
taking it one step further: pass the desired path name as an argument to my
new 4lw and have the Zookeeper server write the snapshot to that file and
report success or failure back. This way I can avoid cluttering the data
directory and interfering with what Zookeeper finds when it scans the data
directory.

The approach of having an additional server that would take the leadership
and populate the ensemble is just a theory. I don't see a clean way of
making a particular quorum member the leader of the quorum. Am I overlooking
something simple?

For backup and restore of an ensemble, the biggest unknown for me remains
populating the ensemble with the desired data. I can think of two ways:

1. Clear out all servers by stopping them, purge the version-2 directories,
restore a snapshot file on one server that will be brought up first, and
then bring up the rest of the ensemble. This way I somewhat force the first
server to be the leader, because it has data and it will be the only member
of the quorum with data, given the way I start the ensemble. This looks
like a hack, though.

2. Clear out the ensemble and reload it with a dedicated client using the
provided Zookeeper API.
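Option 2 requires recreating znodes parent-first, since ZooKeeper refuses to create a znode under a missing parent. A minimal sketch, assuming the backup has already been loaded into a dict mapping each znode path to its data and that every intermediate parent is present in it; `create(path, data)` is a hypothetical callback you would wire to a real client's create call. The root and the built-in `/zookeeper` subtree are skipped because the server maintains them itself; ACLs, ephemeral and sequential nodes are not handled.

```python
from typing import Callable, Dict, List


def is_builtin(path: str) -> bool:
    """True for the root and the /zookeeper subtree the server maintains itself."""
    return path == "/" or path == "/zookeeper" or path.startswith("/zookeeper/")


def creation_order(tree: Dict[str, bytes]) -> List[str]:
    """Order znode paths so every parent comes before its children."""
    # Sorting by depth (count of '/') puts '/a' ahead of '/a/b' and '/a/b/c';
    # the path itself is only a tie-breaker for a stable, readable order.
    return sorted(
        (p for p in tree if not is_builtin(p)),
        key=lambda p: (p.count("/"), p),
    )


def reload_tree(tree: Dict[str, bytes], create: Callable[[str, bytes], None]) -> int:
    """Recreate every znode from the backup; returns how many were created."""
    created = 0
    for path in creation_order(tree):
        create(path, tree[path])  # hypothetical hook to a real client's create
        created += 1
    return created
```

Clearing the ensemble beforehand is the mirror image: delete in the reverse of this order so children go before their parents.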

With the approach of backing up an actual snapshot file, option #1 appears
to be more practical.
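The per-server file work in option #1 could look like the sketch below. It assumes the server is stopped, that transaction logs live under the same `dataDir` (a separately configured `dataLogDir` would have to be purged as well), and that the restored file keeps its `snapshot.<zxid>` name so the server recognises it on startup. The `myid` file sits directly in `dataDir`, outside `version-2`, so it survives the purge. The other servers would get only `purge_version_dir`, leaving them empty so they sync from whichever server leads. ZooKeeper's leader election generally prefers the candidate with the most recent zxid, which may be why the restored server ends up leading, but that is worth verifying on the version in use rather than relying on start order alone. Both function names are hypothetical.

```python
import shutil
from pathlib import Path


def purge_version_dir(data_dir: Path) -> None:
    """Delete data_dir/version-2 (snapshots and, by assumption, logs)."""
    version_dir = data_dir / "version-2"
    if version_dir.exists():
        shutil.rmtree(version_dir)


def restore_snapshot(data_dir: Path, snapshot_file: Path) -> Path:
    """Leave a stopped server with exactly one snapshot in version-2.

    myid sits directly in data_dir, outside version-2, so it is not touched.
    """
    purge_version_dir(data_dir)
    version_dir = data_dir / "version-2"
    version_dir.mkdir(parents=True)
    # Keep the original 'snapshot.<zxid>' name; the server relies on it.
    destination = version_dir / snapshot_file.name
    shutil.copy2(snapshot_file, destination)
    return destination
```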

I wish I could start the ensemble with a designated leader that would
bootstrap the ensemble with data, after which the ensemble would go about
its normal business...

On Mon, Jul 8, 2013 at 4:30 PM, Flavio Junqueira <[EMAIL PROTECTED]> wrote:

> One bit that is still a bit confusing to me in your use case is whether you
> need to take a snapshot right after some event in your application. Even if
> you're able to tell ZooKeeper to take a snapshot, there is no guarantee
> that it will happen at the exact point you want it if update operations
> keep coming.
>
> If you use your four-letter word approach, then would you search for the
> leader or would you simply take a snapshot at any server? If it has to go
> through the leader so that you make sure to have the most recent committed
> state, then it might not be a bad idea to have an API call that tells the
> leader to take a snapshot in some directory of your choice. Informing you of
> the name of the snapshot file so that you can copy it sounds like an option,
> but perhaps it is not as convenient.
>
> The approach of adding another server is not very clear. How do you force
> it to be the leader? Keep in mind that if it crashes, then it will lose
> leadership.
>
> -Flavio
>
> On Jul 8, 2013, at 8:34 AM, Sergey Maslyakov <[EMAIL PROTECTED]> wrote:
>
> > It looks like the "dev" mailing list is rather inactive. Over the past few
> > days I only saw several automated emails from JIRA and this is pretty