Does abrupt kill corrupts the datadir?
Laxman 2011-07-13, 06:16
When we stop zookeeper through zkServer.sh stop, we are aborting the zookeeper process using "kill -9".
    stop)
        echo -n "Stopping zookeeper ... "
        if [ ! -f "$ZOOPIDFILE" ]
        then
          echo "error: could not find file $ZOOPIDFILE"
          exit 1
        else
          $KILL -9 $(cat "$ZOOPIDFILE")
          rm "$ZOOPIDFILE"
          echo STOPPED
          exit 0
        fi
        ;;
This may corrupt the snapshot and transaction logs. Also, it's not recommended to use "kill -9".
In the worst case, if the latest snapshots on all ZooKeeper nodes get corrupted, there is a chance of data loss.
How about introducing a shutdown hook which will ensure ZooKeeper is shut down gracefully when we call stop?
Note: This is just an observation; it was not found in a test.
--
Thanks,
Laxman
Re: Does abrupt kill corrupts the datadir?
Mahadev Konar 2011-07-13, 06:31
Hi Laxman,
The servers take care of all the issues with data integrity, so a kill -9 is OK. Shutdown hooks are tricky. Also, the best way to make sure everything works reliably is to use kill -9 :).

Thanks
mahadev
RE: Does abrupt kill corrupts the datadir?
Laxman 2011-07-13, 07:05
Hi Mahadev,

The shutdown hook was just a quick thought. Another approach would be to send a plain kill [SIGTERM], which the process can catch and handle.

My first look at the "kill -9" raised the following scenario:

> In the worst case, if the latest snapshots on all ZooKeeper nodes get corrupted, there is a chance of data loss.

How can ZooKeeper deal with this scenario gracefully?

Also, I feel we should give the application a chance to shut down gracefully before an abrupt shutdown.

http://en.wikipedia.org/wiki/SIGKILL
Because SIGKILL gives the process no opportunity to do cleanup operations on terminating, in most system shutdown procedures an attempt is first made to terminate processes using SIGTERM, before resorting to SIGKILL.

http://rackerhacker.com/2010/03/18/sigterm-vs-sigkill/
The application can determine what it wants to do once a SIGTERM is received. While most applications will clean up their resources and stop, some may not. An application may be configured to do something completely different when a SIGTERM is received. Also, if the application is in a bad state, such as waiting for disk I/O, it may not be able to act on the signal that was sent.

Most system administrators will usually resort to the more abrupt signal when an application doesn't respond to a SIGTERM.
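The difference between the two signals can be seen with a small shell sketch. This is an illustration only, not ZooKeeper code: the `worker`, `wait_ready` and `demo_signal` helpers and the marker files are made up for the example. A background worker installs a SIGTERM handler that writes a marker file; the script then checks whether the marker exists after sending each signal.

```shell
#!/bin/sh
# Illustration only: worker, wait_ready and demo_signal are hypothetical helpers.

worker() {
    marker=$1
    # The handler runs on SIGTERM; SIGKILL can never be trapped.
    trap 'echo cleaned > "$marker"; exit 0' TERM
    : > "$marker.ready"              # tell the parent the handler is installed
    while :; do sleep 1; done
}

wait_ready() {
    # Poll until the worker has created its ".ready" file.
    while [ ! -e "$1.ready" ]; do sleep 1; done
}

demo_signal() {
    # $1: signal name (TERM or KILL), $2: marker file path
    rm -f "$2" "$2.ready"
    worker "$2" &
    pid=$!
    wait_ready "$2"
    kill -s "$1" "$pid"
    wait "$pid" 2>/dev/null || true
    if [ -e "$2" ]; then
        echo "$1: cleanup ran"
    else
        echo "$1: no cleanup"
    fi
}

tmpdir=$(mktemp -d)
demo_signal TERM "$tmpdir/term"
demo_signal KILL "$tmpdir/kill"
rm -rf "$tmpdir"
```

With SIGTERM the handler gets to run and the script reports "TERM: cleanup ran"; with SIGKILL the worker is removed without running any handler and the script reports "KILL: no cleanup". Whether a server actually needs such a handler is a separate question, which is what the rest of this thread is about.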
FW: Does abrupt kill corrupts the datadir?
Laxman 2011-07-26, 09:02
Hi Everyone,

Any thoughts?
Do we need to consider changing abrupt shutdown to

Implementations in some other Hadoop ecosystem projects, for your reference:
Hadoop - kill [SIGTERM]
HBase - kill [SIGTERM] and then "kill -9" [SIGKILL] if the process hangs
ZooKeeper - "kill -9" [SIGKILL]
Re: FW: Does abrupt kill corrupts the datadir?
Patrick Hunt 2011-07-27, 17:25
ZK has been built around the "fail fast" approach. In order to maintain high availability we want to ensure that restarting a server will result in it attempting to rejoin the quorum. IMO we would not want to change this (kill -9).

Patrick
Re: FW: Does abrupt kill corrupts the datadir?
Benjamin Reed 2011-07-28, 06:09
i agree with pat. if we use sigterm in the script, we would want to put a timeout in to escalate to a -9, which makes the script a bit more complicated without reason, since we don't have any exit hooks that we want to run. zookeeper is designed to recover well from hard failures, much worse than a kill -9. i don't think we want to change that.

ben
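For reference, the SIGTERM-then-escalate logic Ben describes could look roughly like the sketch below. It is not part of zkServer.sh and is not being proposed here; `stop_with_escalation` and its timeout argument are hypothetical names, and the sketch mainly shows how much extra logic the stop branch would need compared with a single `kill -9`.

```shell
# Hypothetical helper, not taken from zkServer.sh: send SIGTERM, wait up to
# $2 seconds (default 10) for the process to exit, then escalate to SIGKILL.
stop_with_escalation() {
    pid=$1
    timeout=${2:-10}

    # If the signal cannot be delivered, the process is already gone.
    kill -s TERM "$pid" 2>/dev/null || return 0

    waited=0
    # kill -0 sends no signal; it only checks that the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "no exit after ${timeout}s, sending SIGKILL"
            kill -s KILL "$pid" 2>/dev/null
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "exited after SIGTERM"
}

# Possible use inside a stop branch (assumes a PID file like $ZOOPIDFILE):
#   stop_with_escalation "$(cat "$ZOOPIDFILE")" 10 && rm "$ZOOPIDFILE"
```

Note that `kill -0` only checks that the PID exists, so a sketch like this inherits the usual PID-file caveats (stale files, PID reuse) that the single-line `kill -9` version has as well.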
RE: FW: Does abrupt kill corrupts the datadir?
Laxman 2011-07-28, 07:50
Thanks for the responses, Mahadev, Pat and Ben.
I understand your explanation.

My only question is: "Is there any probability of data loss in the scenario mentioned?"

>>> In the worst case, if the latest snapshots on all ZooKeeper nodes get corrupted, there is a chance of data loss.

>> if we use sigterm in the script, we would want to put a timeout in to escalate to a -9

As Ben mentioned, even if we escalate to "kill -9" to ensure shutdown, we may still have data loss. But the probability is much lower if we give the process a chance to shut down gracefully.

Please correct me if my understanding is wrong.
--
Laxman
Re: FW: Does abrupt kill corrupts the datadir?
Benjamin Reed 2011-07-28, 16:05
almost everything we do in zookeeper is to make sure that we don't lose data in much worse scenarios. the probability of a loss in this scenario is really just the probability of a bug in the code. i don't think that kill -TERM vs kill -KILL changes that probability at all either way.

ben
Re: FW: Does abrupt kill corrupts the datadir?
Andrei Savu 2011-07-28, 23:14
I've been doing some testing in the past for this scenario and I've seen no data loss over an extended period of time (a day).

Testing steps:
0. start an ensemble running 5 servers
1. start a workload generator (e.g. push a strictly increasing sequence of numbers to a queue stored in zookeeper)
2. every few seconds: kill the cluster leader (-9) and restart it

You should be careful how you handle ConnectionLossException & OperationTimeoutException

You can find the code for this test here (executed against the trunk version):
https://github.com/andreisavu/zookeeper-mq

-- Andrei Savu / andreisavu.ro
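The verification side of a test like Andrei's can be sketched in a few lines of shell. This is not the code from his repository: `check_sequence` is a hypothetical helper, and it assumes the producer pushed consecutive integers (1, 2, 3, ...) and that the values drained from the queue were written out one per line, in consumption order. It does not talk to ZooKeeper at all; it only inspects that output.

```shell
# Hypothetical checker: reads one integer per line on stdin and reports every
# value that is not exactly one more than the previous value.
check_sequence() {
    awk '
        NR > 1 && $1 != prev + 1 {
            printf "expected %d, got %d\n", prev + 1, $1
            bad = 1
        }
        { prev = $1 }
        END {
            if (bad) exit 1
            print "ok: " NR " values, no gaps"
        }
    '
}

# Example: a consumer log with one value missing.
printf '1\n2\n4\n' | check_sequence || echo "sequence check failed"
```

A skipped number would point at a lost write. A repeated number is more likely a client that retried after a ConnectionLossException even though the first attempt had succeeded, which is why the exception handling mentioned above matters when interpreting the results.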
Re: FW: Does abrupt kill corrupts the datadir?
Patrick Hunt 2011-08-01, 18:37
Andrei, you might find this useful for such testing:
https://github.com/toddlipcon/gremlins

Patrick
RE: FW: Does abrupt kill corrupts the datadir?
Laxman 2011-07-29, 09:26
Thank you all for your inputs. That clears my doubt.
I will perform some tests and publish in case of any finding.

-----Original Message-----
From: Andrei Savu [mailto:[EMAIL PROTECTED]]
Sent: Friday, July 29, 2011 4:45 AM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: FW: Does abrupt kill corrupts the datadir?

I've been doing some testing in the past for this scenario and I've seen no data loss over an extended period of time (a day).

Testing steps:
0. start an ensemble running 5 servers
1. start a workload generator (e.g. push a strictly increasing sequence of numbers to a queue stored in zookeeper)
2. every few seconds: kill the cluster leader (-9) and restart

You should be careful how you handle ConnectionLossException & OperationTimeoutException

You can find the code for this test here (executed against the trunk version):
https://github.com/andreisavu/zookeeper-mq

-- Andrei Savu / andreisavu.ro

On Thu, Jul 28, 2011 at 9:05 AM, Benjamin Reed <[EMAIL PROTECTED]> wrote:
> almost everything we do in zookeeper is to make sure that we don't
> lose data in much worse scenarios. the probability of a loss in this
> scenario is really just the probability of a bug in the code. i don't
> think that kill -TERM vs kill -KILL changes that probability at all
> either way.
>
> ben
>
> On Thu, Jul 28, 2011 at 12:50 AM, Laxman <[EMAIL PROTECTED]> wrote:
>> Thanks for the responses Mahadev, Pat and Ben.
>> I understand your explanation.
>>
>> My only question is "Will there be any probability of data loss in the scenario mentioned?"
>>
>>>>>In worst case, if latest snaps in all zookeeper nodes gets corrupted there is a chance of data loss.
>>
>>>>if we use sigterm in the script, we would want to put a timeout in to escalate to a -9
>>
>> As Ben mentioned, even if we escalate to "kill -9" to ensure shutdown, still
>> we may have data loss. But the probability is much lower if we give the process a chance
>> to shut down gracefully.
>>
>> Please do correct me if my understanding is wrong.
>> --
>> Laxman
>>
>> -----Original Message-----
>> From: Benjamin Reed [mailto:[EMAIL PROTECTED]]
>> Sent: Thursday, July 28, 2011 11:40 AM
>> To: [EMAIL PROTECTED]
>> Subject: Re: FW: Does abrupt kill corrupts the datadir?
>>
>> i agree with pat. if we use sigterm in the script, we would want to
>> put a timeout in to escalate to a -9 which makes the script a bit more
>> complicated without reason since we don't have any exit hooks that we
>> want to run. zookeeper is designed to recover well from hard failures,
>> much worse than a kill -9. i don't think we want to change that.
>>
>> ben
>>
>> On Wed, Jul 27, 2011 at 10:25 AM, Patrick Hunt <[EMAIL PROTECTED]> wrote:
>>> ZK has been built around the "fail fast" approach. In order to
>>> maintain high availability we want to ensure that restarting a server
>>> will result in it attempting to rejoin the quorum. IMO we would not
>>> want to change this (kill -9).
>>>
>>> Patrick
>>>
>>> On Tue, Jul 26, 2011 at 2:02 AM, Laxman <[EMAIL PROTECTED]> wrote:
>>>> Hi Everyone,
>>>>
>>>> Any thoughts?
>>>> Do we need consider changing abrupt shutdown to
>>>>
>>>> Implementations in some other hadoop eco system projects for your reference.
>>>> Hadoop - kill [SIGTERM]
>>>> HBase - kill [SIGTERM] and then "kill -9" [SIGKILL] if process hung
>>>> ZooKeeper - "kill -9" [SIGKILL]
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Laxman [mailto:[EMAIL PROTECTED]]
>>>> Sent: Wednesday, July 13, 2011 12:36 PM
>>>> To: '[EMAIL PROTECTED]'
>>>> Subject: RE: Does abrupt kill corrupts the datadir?
>>>>
>>>> Hi Mahadev,
>>>>
>>>> Shutdown hook is just a quick thought. Another approach can be just give a
>>>> kill [SIGTERM] call which can be interpreted by process.
>>>>
>>>> First look at the "kill -9" triggered the following scenario.
>>>>>In worst case, if latest snaps in all zookeeper nodes gets corrupted there
>>>>>is a chance of dataloss.
>>>>
>>>> How does zookeeper can deal with this scenario gracefully?
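The testing steps Andrei lists rely on a strictly increasing sequence, so that any lost write shows up as a gap on the consumer side. A minimal sketch of such a check in shell is given below. It assumes the consumed values have been written one integer per line, in the order they were read. The name check_sequence is hypothetical and the code is not taken from the linked repository.

```shell
# Hypothetical helper: reads one integer per line on stdin and reports
# every value that is not exactly one more than the previous value
# (i.e. a gap, a repeat, or an out-of-order item).
# Exit status is 0 if the sequence is contiguous, 1 otherwise.
check_sequence() {
    awk '
        NR > 1 && $1 != prev + 1 {
            printf "line %d: expected %d, got %d\n", NR, prev + 1, $1
            bad = 1
        }
        { prev = $1 }
        END { exit bad }
    '
}
```

For example, printf '1\n2\n4\n' | check_sequence reports "line 3: expected 3, got 4" and returns a non-zero status. A repeated value is flagged as well; depending on how the producer retries after a ConnectionLossException, a duplicate may be a retry rather than a loss, so the two cases are worth telling apart when reading the output.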