Zookeeper, mail # dev - Does abrupt kill corrupts the datadir?


Re: FW: Does abrupt kill corrupts the datadir?
Benjamin Reed 2011-07-28, 16:05
almost everything we do in zookeeper is to make sure that we don't
lose data in much worse scenarios. the probability of a loss in this
scenario is really just the probability of a bug in the code. i don't
think that kill -TERM vs kill -KILL changes that probability at all
either way.

ben
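
The escalation pattern discussed in this thread (send SIGTERM, wait for a
timeout, then fall back to kill -9) can be sketched in Python as follows.
This is a minimal illustration, not ZooKeeper's actual shutdown script; the
function name stop_process and the grace_seconds parameter are assumptions
made up for the example, and it assumes a POSIX system.

```python
import signal
import subprocess


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> str:
    """Stop a child process: SIGTERM first, SIGKILL if it does not exit in time.

    Hypothetical helper for illustration. Returns "already-exited",
    "terminated" (exited within the grace period), or "killed"
    (had to be escalated to SIGKILL).
    """
    # Nothing to do if the process has already exited.
    if proc.poll() is not None:
        return "already-exited"

    # Polite request: the process may handle, ignore, or be unable to act on it.
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        # Escalate: SIGKILL cannot be caught or ignored by the process.
        proc.kill()
        proc.wait()
        return "killed"
```

As noted in the thread, ZooKeeper has no exit hooks to run, so the graceful
step would not change what happens on shutdown; the sketch only shows what the
extra timeout logic would involve.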

On Thu, Jul 28, 2011 at 12:50 AM, Laxman <[EMAIL PROTECTED]> wrote:
> Thanks for the responses Mahadev, Pat and Ben.
> I understand your explanation.
>
> My only question is "Is there any probability of data loss in the scenario
> mentioned?"
>
>>>>In worst case, if latest snaps in all zookeeper nodes gets corrupted
> there is a chance of data loss.
>
>>>if we use sigterm in the script, we would want to put a timeout in to
> escalate to a -9
>
> As Ben mentioned, even if we escalate to "kill -9" to ensure shutdown, we may
> still have data loss. But the probability is much lower if we give the process
> a chance to shut down gracefully.
>
> Please do correct me if my understanding is wrong.
> --
> Laxman
>
> -----Original Message-----
> From: Benjamin Reed [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, July 28, 2011 11:40 AM
> To: [EMAIL PROTECTED]
> Subject: Re: FW: Does abrupt kill corrupts the datadir?
>
> i agree with pat. if we use sigterm in the script, we would want to
> put a timeout in to escalate to a -9 which makes the script a bit more
> complicated without reason since we don't have any exit hooks that we
> want to run. zookeeper is designed to recover well from hard failures,
> much worse than a kill -9. i don't think we want to change that.
>
> ben
>
> On Wed, Jul 27, 2011 at 10:25 AM, Patrick Hunt <[EMAIL PROTECTED]> wrote:
>> ZK has been built around the "fail fast" approach. In order to
>> maintain high availability we want to ensure that restarting a server
>> will result in it attempting to rejoin the quorum. IMO we would not
>> want to change this (kill -9).
>>
>> Patrick
>>
>> On Tue, Jul 26, 2011 at 2:02 AM, Laxman <[EMAIL PROTECTED]> wrote:
>>> Hi Everyone,
>>>
>>> Any thoughts?
>>> Do we need to consider changing abrupt shutdown to
>>>
>>> Implementations in some other Hadoop ecosystem projects for your reference.
>>> Hadoop - kill [SIGTERM]
>>> HBase - kill [SIGTERM] and then "kill -9" [SIGKILL] if process hung
>>> ZooKeeper - "kill -9" [SIGKILL]
>>>
>>>
>>> -----Original Message-----
>>> From: Laxman [mailto:[EMAIL PROTECTED]]
>>> Sent: Wednesday, July 13, 2011 12:36 PM
>>> To: '[EMAIL PROTECTED]'
>>> Subject: RE: Does abrupt kill corrupts the datadir?
>>>
>>> Hi Mahadev,
>>>
>>> Shutdown hook is just a quick thought. Another approach is to just send a
>>> kill [SIGTERM], which the process can interpret.
>>>
>>> First look at the "kill -9" triggered the following scenario.
>>>>In worst case, if latest snaps in all zookeeper nodes gets corrupted there
>>>>is a chance of dataloss.
>>>
>>> How can zookeeper deal with this scenario gracefully?
>>>
>>> Also, I feel we should give the application a chance to shut down gracefully
>>> before an abrupt shutdown.
>>>
>>> http://en.wikipedia.org/wiki/SIGKILL
>>>
>>> Because SIGKILL gives the process no opportunity to do cleanup operations on
>>> terminating, in most system shutdown procedures an attempt is first made to
>>> terminate processes using SIGTERM, before resorting to SIGKILL.
>>>
>>> http://rackerhacker.com/2010/03/18/sigterm-vs-sigkill/
>>>
>>> The application can determine what it wants to do once a SIGTERM is
>>> received. While most applications will clean up their resources and stop,
>>> some may not. An application may be configured to do something completely
>>> different when a SIGTERM is received. Also, if the application is in a bad
>>> state, such as waiting for disk I/O, it may not be able to act on the signal
>>> that was sent.
>>>
>>> Most system administrators will usually resort to the more abrupt signal
>>> when an application doesn't respond to a SIGTERM.
>>>
>>> -----Original Message-----
>>> From: Mahadev Konar [mailto:[EMAIL PROTECTED]]