Kafka, mail # user - Surprisingly high network traffic between kafka servers


Re: Surprisingly high network traffic between kafka servers
Neha Narkhede 2014-02-07, 04:25
So, if you start from scratch (new environment and download of the Kafka
release), could you post the list of steps to reproduce this issue?
On Thu, Feb 6, 2014 at 7:48 PM, Carl Lerche <[EMAIL PROTECTED]> wrote:

> So, the "good news" is that the problem came back again. The bad news
> is that I disabled debug logs as it was filling disk (and I had other
> fires to put out). I will re-enable debug logs and wait for it to
> happen again.
>
> On Thu, Feb 6, 2014 at 4:05 AM, Neha Narkhede <[EMAIL PROTECTED]>
> wrote:
> > Carl,
> >
> > It will help if you can list the steps to reproduce this issue starting
> > from a fresh installation. Your setup, the way it stands, seems to have
> > gone through some config and state changes.
> >
> > Thanks,
> > Neha
> >
> >
> > On Wed, Feb 5, 2014 at 5:17 PM, Joel Koshy <[EMAIL PROTECTED]> wrote:
> >
> >> On Wed, Feb 05, 2014 at 04:51:16PM -0800, Carl Lerche wrote:
> >> > So, I tried enabling debug logging, I also made some tweaks to the
> >> > config (which I probably shouldn't have) and craziness happened.
> >> >
> >> > First, some more context. Besides the very high network traffic, we
> >> > were seeing some other issues that we were not focusing on yet.
> >> >
> >> > * Even though the log retention was set to 50GB & 24 hours, data logs
> >> > were getting cleaned up far quicker. I'm not entirely sure how
> >> > much quicker, but there was definitely far less than 12 hours and 1GB
> >> > of data.
> >> >
> >> > * Kafka was not properly balanced. We had 3 servers, and only 2 of
> >> > them were partition leaders. One server was a replica for all
> >> > partitions. We tried to run a rebalance command, but it did not work.
> >> > We were going to investigate later.
> >>
> >> Were any of the brokers down for an extended period? If the preferred
> >> replica election command failed it could be because the preferred
> >> replica was catching up (which could explain the higher than expected
> >> network traffic). Do you monitor the under-replicated partitions count
> >> on your cluster? If you have that data it could help confirm this.
> >>
> >> Joel
> >>
> >> >
> >> > So, after restarting all the kafkas, something happened with the
> >> > offsets. The offsets that our consumers had no longer existed. It
> >> > looks like somehow all the contents were lost? The logs show many
> >> > exceptions like:
> >> >
> >> > `Request for offset 770354 but we only have log segments in the range
> >> > 759234 to 759838.`
> >> >
> >> > So, I reset all the consumer offsets to the head of the queue as I did
> >> > not know of anything better to do. Once the dust settled, all the
> >> > issues we were seeing vanished. Communication between Kafka nodes
> >> > appears to be normal, Kafka was able to rebalance, and hopefully log
> >> > retention will be normal.
> >> >
> >> > I am unsure what happened or how to get more debug information.
> >> >
> >> > On Wed, Feb 5, 2014 at 12:31 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> >> > > Can you enable DEBUG logging in log4j and see what requests are
> >> > > coming in?
> >> > >
> >> > > -Jay
> >> > >
> >> > >
> >> > > On Tue, Feb 4, 2014 at 9:51 PM, Carl Lerche <[EMAIL PROTECTED]> wrote:
> >> > >
> >> > >> Hi Jay,
> >> > >>
> >> > >> I do not believe that I have changed the replica.fetch.wait.max.ms
> >> > >> setting. Here I have included the kafka config as well as a
> >> > >> snapshot of jnettop from one of the servers.
> >> > >>
> >> > >> https://gist.github.com/carllerche/4f2cf0f0f6d1e891f482
> >> > >>
> >> > >> The bottom row (89.9K/s) is the producer (it lives on a Kafka
> >> > >> server). The top two rows are Kafkas on other servers, you can see
> >> > >> the combined throughput is ~80MB/s
> >> > >>
> >> > >> On Tue, Feb 4, 2014 at 9:36 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> >> > >> > No this is not normal.
> >> > >> >
> >> > >> > Checking twice a second (using 500ms default) for new data shouldn't
> >> > >> > cause high network traffic (that should be like < 1KB of overhead). I