Kafka >> mail # user >> Surprisingly high network traffic between kafka servers


Re: Surprisingly high network traffic between kafka servers
So, if you start from scratch (new environment and download of the Kafka
release), could you post the list of steps to reproduce this issue?
On Thu, Feb 6, 2014 at 7:48 PM, Carl Lerche <[EMAIL PROTECTED]> wrote:

> So, the "good news" is that the problem came back again. The bad news
> is that I disabled debug logs as it was filling disk (and I had other
> fires to put out). I will re-enable debug logs and wait for it to
> happen again.
>
> On Thu, Feb 6, 2014 at 4:05 AM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
> > Carl,
> >
> > It will help if you can list the steps to reproduce this issue starting
> > from a fresh installation. Your setup, the way it stands, seems to have
> > gone through some config and state changes.
> >
> > Thanks,
> > Neha
> >
> >
> > On Wed, Feb 5, 2014 at 5:17 PM, Joel Koshy <[EMAIL PROTECTED]> wrote:
> >
> >> On Wed, Feb 05, 2014 at 04:51:16PM -0800, Carl Lerche wrote:
> >> > So, I tried enabling debug logging, I also made some tweaks to the
> >> > config (which I probably shouldn't have) and craziness happened.
> >> >
> >> > First, some more context. Besides the very high network traffic, we
> >> > were seeing some other issues that we were not focusing on yet.
> >> >
> >> > * Even though the log retention was set to 50GB & 24 hours, data logs
> >> > were getting cleaned up far quicker. I'm not entirely sure how
> >> > much quicker, but there was definitely far less than 12 hours and 1GB
> >> > of data.
> >> >
> >> > * Kafka was not properly balanced. We had 3 servers, and only 2 of
> >> > them were partition leaders. One server was a replica for all
> >> > partitions. We tried to run a rebalance command, but it did not work.
> >> > We were going to investigate later.
> >>
> >> Were any of the brokers down for an extended period? If the preferred
> >> replica election command failed it could be because the preferred
> >> replica was catching up (which could explain the higher than expected
> >> network traffic). Do you monitor the under-replicated partitions count
> >> on your cluster? If you have that data it could help confirm this.
> >>
> >> Joel
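
[Editor's sketch] The under-replicated partitions check Joel mentions boils down to comparing each partition's assigned replica list with its in-sync replica (ISR) set. The snippet below is a minimal illustration of that comparison only. The `PartitionState` class and the sample data are hypothetical and are not taken from Carl's cluster or from any Kafka client API; on a real cluster the count would normally come from the broker's own metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition's replication state."""
    topic: str
    partition: int
    replicas: tuple  # broker ids assigned to the partition
    isr: tuple       # broker ids currently in sync with the leader


def under_replicated(partitions):
    """Return the partitions whose ISR is smaller than the assigned replica set."""
    return [p for p in partitions if len(p.isr) < len(p.replicas)]


# Made-up example: broker 3 has fallen out of the ISR for events-1.
snapshot = [
    PartitionState("events", 0, (1, 2, 3), (1, 2, 3)),
    PartitionState("events", 1, (2, 3, 1), (2, 1)),
]
lagging = under_replicated(snapshot)
print([(p.topic, p.partition) for p in lagging])
```

A count that stays above zero for a long time would be consistent with a replica that is still catching up, which is the scenario Joel describes.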
> >>
> >> >
> >> > So, after restarting all the kafkas, something happened with the
> >> > offsets. The offsets that our consumers had no longer existed. It
> >> > looks like somehow all the contents were lost? The logs show many
> >> > exceptions like:
> >> >
> >> > `Request for offset 770354 but we only have log segments in the range
> >> > 759234 to 759838.`
> >> >
> >> > So, I reset all the consumer offsets to the head of the queue as I did
> >> > not know of anything better to do. Once the dust settled, all the
> >> > issues we were seeing vanished. Communication between Kafka nodes
> >> > appears to be normal, Kafka was able to rebalance, and hopefully log
> >> > retention will be normal.
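
[Editor's sketch] The exception quoted above means the consumer asked for an offset outside the range the broker still holds, and Carl's fix was to reset the consumer offsets. A rough model of that decision follows. The function name `resolve_offset` and the policy names `"earliest"` and `"latest"` are illustrative assumptions, not Kafka API calls, and it is not clear from the thread which end of the log Carl reset to.

```python
def resolve_offset(requested, log_start, log_end, policy="latest"):
    """Pick the offset to resume from, given the range the broker still has.

    requested: offset the consumer last stored
    log_start, log_end: first and last offsets covered by the log segments
    policy: where to jump when the requested offset is out of range
    """
    if log_start <= requested <= log_end:
        return requested  # still valid, nothing to reset
    if policy == "earliest":
        return log_start  # re-read everything that is still retained
    if policy == "latest":
        return log_end    # skip ahead and only read new data
    raise ValueError(f"unknown reset policy: {policy!r}")


# Numbers from the exception message quoted in the thread.
print(resolve_offset(770354, 759234, 759838))              # 759838
print(resolve_offset(770354, 759234, 759838, "earliest"))  # 759234
```

Note that the requested offset here is higher than anything in the log, so either policy moves the consumer backwards and some messages may be read again.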
> >> >
> >> > I am unsure what happened or how to get more debug information.
> >> >
> >> > On Wed, Feb 5, 2014 at 12:31 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> >> > > Can you enable DEBUG logging in log4j and see what requests are
> >> > > coming in?
> >> > >
> >> > > -Jay
> >> > >
> >> > >
> >> > > On Tue, Feb 4, 2014 at 9:51 PM, Carl Lerche <[EMAIL PROTECTED]> wrote:
> >> > >
> >> > >> Hi Jay,
> >> > >>
> >> > >> I do not believe that I have changed the replica.fetch.wait.max.ms
> >> > >> setting. Here I have included the kafka config as well as a
> >> > >> snapshot of jnettop from one of the servers.
> >> > >>
> >> > >> https://gist.github.com/carllerche/4f2cf0f0f6d1e891f482
> >> > >>
> >> > >> The bottom row (89.9K/s) is the producer (it lives on a Kafka
> >> > >> server). The top two rows are Kafkas on other servers; you can see
> >> > >> the combined throughput is ~80MB/s.
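
[Editor's sketch] To put the jnettop figures in perspective: with replication, each byte a leader receives is fetched once by every follower, so inter-broker traffic would be expected to be on the order of the producer rate times (replication factor - 1). The back-of-the-envelope calculation below assumes a replication factor of 3 (the thread only says there are 3 servers), reads "89.9K/s" as 89.9 KB/s, and uses decimal units; all of these are assumptions.

```python
def expected_replication_rate(producer_bytes_per_s, replication_factor):
    """Rough inter-broker rate: every produced byte is copied to each follower."""
    return producer_bytes_per_s * (replication_factor - 1)


producer_rate = 89.9e3   # bytes/s, from the jnettop snapshot (assumed to be KB/s)
observed_rate = 80e6     # bytes/s, combined broker-to-broker traffic
replication_factor = 3   # assumption: not stated in the thread

expected = expected_replication_rate(producer_rate, replication_factor)
ratio = observed_rate / expected
print(f"expected ~{expected / 1e3:.1f} KB/s, observed is ~{ratio:.0f}x that")
```

Even under these generous assumptions the observed traffic is hundreds of times larger than plain replication of the produced data would explain.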
> >> > >>
> >> > >> On Tue, Feb 4, 2014 at 9:36 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> >> > >> > No this is not normal.
> >> > >> >
> >> > >> > Checking twice a second (using 500ms default) for new data shouldn't
> >> > >> > cause high network traffic (that should be like < 1KB of overhead). I