I think the root problem is that replicas are falling behind and hence
are effectively "failed" under normal load and also that you have
unclean leader election enabled which "solves" this catastrophic
failure by electing new leaders without complete data.

Starting in 0.8.2 you will be able to selectively disable unclean
leader election.

The root problem for the spuriously failing replicas is the
configuration replica.lag.max.messages. This configuration defaults to
4000. But throughput can be really high, like a million messages per
second. At a million messages per second, 4k messages of lag is only
4ms behind, which can happen for all kinds of reasons (e.g. just
normal linux i/o latency jitter).

Jiang, I suspect you can resolve your issue by just making this higher.

However, raising this setting is not a panacea. The higher you raise
it the longer it will take to detect a partition that is actually
falling behind.

We have been discussing this setting, and if you think about it the
setting is actually somewhat impossible to set right in a cluster
which has both low volume and high volume topics/partitions. For the
low-volume topic it will take a very long time to detect a lagging
replica, and for the high-volume topic it will have false-positives.
One approach to making this easier would be to have the configuration
be something like replica.lag.max.ms and translate this into a number
of messages dynamically based on the throughput of the partition.

On Fri, Jul 11, 2014 at 2:55 PM, Jiang Wu (Pricehistory) (BLOOMBERG/
731 LEX -) <[EMAIL PROTECTED]> wrote:

NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB