At LinkedIn, the most common type of failure is controlled shutdown for
code/config pushes. For that, we have a tool for reducing
the unavailability window (
can happen once or twice a month. The next most common type of failure is
disk/RAID failure, which seems to happen once every month or two. The
remaining types of failure include Linux crashes, JVM bugs, and other types
of hardware failures. They happen a few times a year.
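
As a rough sketch of how rates like these could feed an availability
estimate: the event counts below are midpoints of the frequencies mentioned
above, while the downtime-per-event figures are purely hypothetical
placeholders (nothing in this thread measures recovery time), so substitute
your own numbers. It is also an assumption that all rates apply to the same
unit (one broker or one cluster); the figures above do not say which.

```python
HOURS_PER_YEAR = 365 * 24

# name -> (events per year, ASSUMED hours of unavailability per event)
# Event rates: midpoints of the rough frequencies quoted above.
# Downtime values: hypothetical placeholders, not measurements.
failure_modes = {
    "controlled_shutdown": (18, 0.05),   # once or twice a month
    "disk_or_raid_failure": (9, 1.0),    # once every month or two
    "other": (3, 2.0),                   # a few times a year
}


def mttf_hours(modes, hours_per_year=HOURS_PER_YEAR):
    """Mean time between failure events of any kind, in hours."""
    events_per_year = sum(rate for rate, _ in modes.values())
    return hours_per_year / events_per_year


def availability(modes, hours_per_year=HOURS_PER_YEAR):
    """Fraction of the year not spent in any failure or recovery window."""
    downtime = sum(rate * hours for rate, hours in modes.values())
    return 1.0 - downtime / hours_per_year


if __name__ == "__main__":
    print(f"MTTF: {mttf_hours(failure_modes):.0f} hours")
    print(f"Availability: {availability(failure_modes):.4%}")
```

With replication, a partition only becomes unavailable when all of its
replicas are down at once, so a figure computed this way for a single node
should be read as a pessimistic bound for the service as a whole.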
On Tue, Jun 11, 2013 at 1:22 AM, Pankaj Misra <[EMAIL PROTECTED]> wrote:
> We are using version 0.8 of Kafka and are planning high-availability
> testing with replication. While the overall scheme for making the cluster
> highly available is clear, I wanted to get some idea of the Kafka service
> lifetime in terms of mean time to failure and time to recovery in case of
> failure. Any historical evidence would also help, as it will be vital for us
> in calculating the actual availability of the system across a year.
> While I understand that Kafka provides more of an active/active mode of
> seamless high availability, any failure will impact performance to
> some extent, and this calculation will help in deriving the actual number of
> nodes that we should consider so that performance is not compromised
> while the system remains available.
> Any ideas/facts would be very helpful.
> Thanks & Regards
> Pankaj Misra