Re: Kafka/ZK Cluster Example
Ok, I had overlooked that... so, more redundant RAID arrays would decrease
the chances that we lose data, but it wouldn't help much with availability
because rebuilding the arrays after failures interferes with Kafka's normal
IO.

Really looking forward to sync replication with KAFKA-50 :D !!

--
Felix

On Thu, Jan 12, 2012 at 12:15 PM, Jun Rao <[EMAIL PROTECTED]> wrote:

> Felix,
>
> We use RAID too. One potential problem with RAID is that if you replace a
> broken disk, RAID goes into rebuild mode. This could significantly slow
> down I/O and make a broker not fully functional for new requests. Adding
> more mirrors doesn't alleviate this problem.
>
> Jun
>
> On Wed, Jan 11, 2012 at 3:50 PM, Felix GV <[EMAIL PROTECTED]> wrote:
>
> > We've been thinking about this stuff a lot recently, at work.
> >
> > We've had some HD failures in our Kafka cluster. I don't know all the
> > details, but from what I heard, the HDs were mirrored in RAID but several
> > of them failed in a close time interval and the array did not have time to
> > fully rebuild itself, so we lost all of that data from the Kafka cluster.
> > Thankfully, the data was being consumed in near real time, so we only
> > really lost a small unconsumed window of data.
> >
> > Now, we're wondering what we could improve to prevent this scenario in the
> > future. I investigated Kafka mirroring but since it relies on consuming
> > data, the risk of losing the unconsumed window is still there. If we
> > had consumers that were more batch-oriented (like Hadoop) rather than
> > real-time, the benefits of a mirrored Kafka cluster would be greater, but
> > for our use cases, where data is consumed near real-time, we would still
> > lose as much data as before. Am I right?
> >
> > KAFKA-50, with sync replication, would have solved our problem, but until
> > that's done, what are our options?
> >
> > I came to the conclusion that simply adding more mirrored copies in our
> > RAID arrays would be the most cost-effective way to give us both more
> > availability and more redundancy. This doesn't deal with the scenario where
> > a machine fails and becomes unavailable, in which case the data on it would
> > be temporarily unavailable but not lost (although, again, there could be a
> > small window of uncommitted data). However, in terms of protection against
> > data loss from HD failures, it seems like the best option for now, no?
> >
> > It doesn't feel right to just throw more hardware at problems hehe... but I
> > guess sometimes it's the only choice :) ...
> >
> > Please tell me if that makes sense!
> >
> > --
> > Felix
> >
> >
> >
> > On Wed, Jan 11, 2012 at 6:16 PM, Felix GV <[EMAIL PROTECTED]> wrote:
> >
> > > As I understand it, you cannot use a mirrored Kafka cluster as a hot
> > > fail-over.
> > >
> > > You could probably use it as a manual fail-over, but I don't know the
> > > complexity involved in doing that.
> > >
> > > Also, if your source cluster fails while producers were putting data into
> > > it, there will be an "unconsumed window" of data that is lost. This
> > > corresponds to the data that the embedded consumer in the mirrored cluster
> > > did not have time to consume from the source cluster.
> > >
> > > All in all, the mirrored cluster is akin to asynchronous replication,
> > > without any hot fail-over capability. Thus, it provides data redundancy
> > > (outside of the unconsumed window described above) but no extra
> > > availability (unless you count manual interventions).
> > >
> > > KAFKA-50 <https://issues.apache.org/jira/browse/KAFKA-50>, on the other
> > > hand, will provide both asynchronous AND synchronous replication (although
> > > the latter will incur a latency penalty) and will be able to use the
> > > replicas (data redundancy) as hot fail-overs.
> > >
> > > Depending on your personal definition of "highly reliable" (whether it
> > > includes data redundancy and/or availability), I think that should probably