Re: Segment recovery and replication

Can't he get this automatically, though, with Sriram's controlled
shutdown stuff?

-Jay
On Thu, Aug 29, 2013 at 2:12 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote:

> >> How do you automate waiting for the broker to come up? Just keep
> >> monitoring the process and keep trying to connect to the port?
>
> Every leader in a Kafka cluster exposes the UnderReplicatedPartitionCount
> metric. The safest way to issue a controlled shutdown is to wait until that
> metric reports 0 on the brokers. If you try to shut down the last broker in
> the ISR, the controlled shutdown cannot succeed since there is no other
> broker to move the leader to. Waiting until the under-replicated partition
> count hits 0 prevents you from hitting this issue.
>
> This also solves the problem of waiting for the broker to come up, since
> you will automatically wait until it has come back up and rejoined the ISR.
>
>
> Thanks,
> Neha
>
>
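
(A minimal sketch of the wait-for-zero polling Neha describes, assuming JMX
is enabled on the broker; the JMX port and the exact MBean name are
assumptions, so verify both against your broker version:)

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class WaitForNoUnderReplicated {
        public static void main(String[] args) throws Exception {
            // Assumes the broker exposes JMX on port 9999 (hypothetical).
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // MBean name is an assumption; check your version's naming.
                ObjectName bean = new ObjectName(
                        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
                // Poll until no partitions led by this broker are under-replicated.
                while (((Number) mbs.getAttribute(bean, "Value")).intValue() != 0) {
                    Thread.sleep(5000);
                }
            }
            // Safe to issue a controlled shutdown against this broker now.
        }
    }
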
> On Thu, Aug 29, 2013 at 12:59 PM, Sam Meder <[EMAIL PROTECTED]> wrote:
>
> > Ok, I spent some more time staring at our logs and figured out that it
> > was our fault. We were not waiting around for the Kafka broker to fully
> > initialize before moving on to the next broker, and loading the data
> > logs can take quite some time (~7 minutes in one case), so we ended up
> > with no replicas online at some point, and the replica that came back
> > first was a little short on data...
> >
> > How do you automate waiting for the broker to come up? Just keep
> > monitoring the process and keep trying to connect to the port?
> >
> > /Sam
> >
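
(For Sam's literal approach of retrying a connect against the broker port,
a minimal sketch follows; the host and port are placeholders. Note that an
open port only shows the broker is listening, not that it has finished
loading logs or rejoined the ISR, which is why the metric above is the
safer signal:)

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class WaitForPort {
        // Retry a TCP connect until the broker accepts connections.
        static void waitForPort(String host, int port) throws InterruptedException {
            while (true) {
                try (Socket socket = new Socket()) {
                    socket.connect(new InetSocketAddress(host, port), 1000);
                    return; // the port is open, so the broker is listening
                } catch (IOException e) {
                    Thread.sleep(1000); // not up yet, try again
                }
            }
        }

        public static void main(String[] args) throws InterruptedException {
            waitForPort("broker1", 9092); // placeholder host and port
        }
    }
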
> > On Aug 29, 2013, at 6:40 PM, Sam Meder <[EMAIL PROTECTED]> wrote:
> >
> > >
> > > On Aug 29, 2013, at 5:50 PM, Sriram Subramanian <[EMAIL PROTECTED]> wrote:
> > >
> > >> Do you know why you timed out on a regular shutdown?
> > >
> > > No, though I think it may just have been that the timeout we put in
> > > was too short.
> > >
> > >> If the replica had fallen out of the ISR and shutdown was forced on
> > >> the leader, this could happen.
> > >
> > > Hmm, but it shouldn't really be made leader if it isn't even in the
> > > ISR, should it?
> > >
> > > /Sam
> > >
> > >> With ack = -1, we guarantee that all the replicas in the in-sync
> > >> set have received the message before exposing the message to the
> > >> consumer.
> > >>
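
(For reference, the ack = -1 setting Sriram mentions maps to the 0.8-era
producer's request.required.acks property; a minimal sketch, with the
broker list and topic made up:)

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class AckAllProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // placeholders
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // -1: the leader does not acknowledge until every replica in the
            // ISR has the message, which is the guarantee discussed above.
            props.put("request.required.acks", "-1");
            Producer<String, String> producer =
                    new Producer<String, String>(new ProducerConfig(props));
            producer.send(new KeyedMessage<String, String>("test-topic", "key", "value"));
            producer.close();
        }
    }
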
> > >> On 8/29/13 8:32 AM, "Sam Meder" <[EMAIL PROTECTED]> wrote:
> > >>
> > >>> We've recently come across a scenario where we see consumers
> > >>> resetting their offsets to earliest, which as far as I can tell may
> > >>> also lead to data loss (we're running with ack = -1 to avoid loss).
> > >>> This seems to happen when we time out on doing a regular shutdown
> > >>> and instead kill -9 the Kafka broker, but it obviously applies to
> > >>> any scenario that involves an unclean exit. As far as I can tell,
> > >>> what happens is:
> > >>>
> > >>> 1. On restart the broker truncates the data for the affected
> > >>> partitions, i.e. not all data was written to disk.
> > >>> 2. The new broker then becomes a leader for the affected partitions
> > >>> and consumers get confused because they've already consumed beyond
> > >>> the now-available offset.
> > >>>
> > >>> Does that seem like a possible failure scenario?
> > >>>
> > >>> /Sam
> > >>
> > >
> >
> >
>
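
(The reset-to-earliest behavior Sam describes is what the 0.8 high-level
consumer does when its stored offset turns out to be out of range on the
new leader; the direction of the reset is controlled by auto.offset.reset.
A minimal sketch, with the ZooKeeper address and group id as placeholders:)

    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class ResetToEarliestConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zk1:2181"); // placeholder
            props.put("group.id", "example-group");     // placeholder
            // If the committed offset is beyond what the (truncated) leader
            // now has, "smallest" rewinds to the earliest available offset,
            // which is the reset-to-earliest behavior described above.
            props.put("auto.offset.reset", "smallest");
            ConsumerConnector consumer =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            // ... create message streams and consume ...
            consumer.shutdown();
        }
    }
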

 