Kafka >> mail # user >> Instances became unresponsive
Re: Instances became unresponsive
Vadim,

Without knowing the original cause, it's hard for me to say how to
recover from it or prevent it from happening again. If you stop all producers
and restart the whole cluster, does that bring the cluster to a healthy state?

Going forward, I recommend that you add monitoring of the brokers and keep
the log4j logs for a few days. This way, if the problem shows up again, we
can see which broker first has the problem and what's causing it.
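
As a rough illustration of that last step, here is a minimal Python sketch that scans retained broker logs and reports the earliest WARN/ERROR/FATAL line per component. It assumes the log4j line layout seen in the excerpt quoted further down in this thread ("[timestamp] LEVEL [component], message"); the function name first_problems is hypothetical and not part of Kafka.

```python
import re
from datetime import datetime

# Matches the start of a broker log4j line such as:
# [2013-08-26 12:38:18,850] WARN [ReplicaFetcherThread--1-6], Error in fetch
# (layout assumed from the log excerpt in this thread)
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] "
    r"(?P<level>WARN|ERROR|FATAL) "
    r"\[(?P<component>[^\]]+)\],? ?(?P<msg>.*)$"
)


def first_problems(lines):
    """Return {component: (timestamp, level, message)} holding the
    earliest WARN/ERROR/FATAL line seen for each component."""
    earliest = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # INFO/DEBUG lines and continuation lines are skipped
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        comp = m.group("component")
        if comp not in earliest or ts < earliest[comp][0]:
            earliest[comp] = (ts, m.group("level"), m.group("msg"))
    return earliest
```

Running this over each broker's log file and comparing the earliest timestamps is one way to tell which broker hit trouble first; it only helps if the logs are retained long enough to cover the incident.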

Thanks,

Jun
On Tue, Aug 27, 2013 at 11:50 PM, Vadim Keylis <[EMAIL PROTECTED]> wrote:

> Hello Jun. Unfortunately, I do not have logs from broker 6 to find out
> why it became unresponsive, but yes, it was not healthy. I found it
> to be unresponsive as well.
> How can I recover from all these failures with minimal data loss?
>
>
>
>
> On Tue, Aug 27, 2013 at 8:51 PM, Jun Rao <[EMAIL PROTECTED]> wrote:
>
> > It seems the replica fetch thread died because of a socket timeout
> > (defaults to 30 secs). Was broker 6 healthy at that point?
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Tue, Aug 27, 2013 at 11:36 AM, Vadim Keylis <[EMAIL PROTECTED]> wrote:
> >
> > > We do not use controlled shutdown through JMX; it's configured in the
> > > property file. I do not see a controlled shutdown message at the time I
> > > initiated the shutdown. However, searching for the string produced the
> > > following error messages, which happened hours before I started shutting
> > > down the service.
> > >
> > > [2013-08-26 12:38:18,850] WARN [ReplicaFetcherThread--1-6], Error in fetch
> > > Name: FetchRequest; Version: 0; CorrelationId: 1541; ClientId:
> > > ReplicaFetcherThread--1-6; ReplicaId: 5; MaxWait: 500 ms; MinBytes: 1
> > > bytes; RequestInfo: [pets_pageview,25] ->
> > > PartitionFetchInfo(0,1048576),[mm_msg,26] ->
> > > PartitionFetchInfo(0,1048576),[pets_cashruns_spin,22] ->
> > > PartitionFetchInfo(0,1048576),[page_timings,5] ->
> > > PartitionFetchInfo(0,1048576),[cafe_purchases,13] ->
> > > PartitionFetchInfo(0,1048576),[meetme_spotlight_action,33] ->
> > > PartitionFetchInfo(0,1048576),[mob_gift,5] ->
> > > PartitionFetchInfo(0,1048576),[cafe_coin_purchases,18] ->
> > > PartitionFetchInfo(0,1048576),[security_trigger,11] ->
> > > PartitionFetchInfo(0,1048576),[pysm_click,30] ->
> > > PartitionFetchInfo(0,1048576),[gold_blacklist,21] ->
> > > PartitionFetchInfo(0,1048576),[meetme_oops,9] ->
> > > PartitionFetchInfo(0,1048576),[m3_session_info,15] ->
> > > PartitionFetchInfo(0,1048576),[link_review,10] ->
> > > PartitionFetchInfo(0,1048576),[cafe_debug,28] ->
> > > PartitionFetchInfo(0,1048576),[m3_login_button,0] ->
> > > PartitionFetchInfo(0,1048576),[pets_level,24] ->
> > > PartitionFetchInfo(0,1048576),[login_detail,9] ->
> > > PartitionFetchInfo(0,1048576),[click_mail,31] ->
> > > PartitionFetchInfo(0,1048576),[pets_wish,15] ->
> > > PartitionFetchInfo(0,1048576),[page_view_admin,6] ->
> > > PartitionFetchInfo(0,1048576),[hi5_image_cleanup,9] ->
> > > PartitionFetchInfo(0,1048576),[pets_wish,24] ->
> > > PartitionFetchInfo(0,1048576),[cafe_food_spoiled,23] ->
> > > PartitionFetchInfo(0,1048576),[pets_wish,33] ->
> > > PartitionFetchInfo(0,1048576),[account_notifications,11] ->
> > > PartitionFetchInfo(0,1048576),[google_transactions,7] ->
> > > PartitionFetchInfo(0,1048576),[hi5_image_cleanup,3] ->
> > > PartitionFetchInfo(0,1048576),[pets_economy_change,15] ->
> > > PartitionFetchInfo(0,1048576),[payment,13] ->
> > > PartitionFetchInfo(0,1048576),[validation,0] ->
> > > PartitionFetchInfo(0,1048576),[meetme_new_contact_count,5] ->
> > > PartitionFetchInfo(0,1048576),[mail_send,18] ->
> > > PartitionFetchInfo(0,1048576),[lightbox_click,28] ->
> > > PartitionFetchInfo(0,1048576),[rso_scanner_append,5] ->
> > > PartitionFetchInfo(0,1048576),[mob_gift,2] ->
> > > PartitionFetchInfo(0,1048576),[cafe_waiter_tips,24] ->
> > > PartitionFetchInfo(0,1048576),[groups_user_actions,22] ->
> > > PartitionFetchInfo(0,1048576),[jstiming,23] ->
> > > PartitionFetchInfo(0,1048576),[viral_contact_inviters,26] ->