Kafka, mail # dev - [jira] [Commented] (KAFKA-691) Fault tolerance broken with replication factor 1

Jun Rao 2013-01-14, 01:01

    [ https://issues.apache.org/jira/browse/KAFKA-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552371#comment-13552371 ]

Jun Rao commented on KAFKA-691:
-------------------------------

That's a good point. Increasing the timeout will work for most cases. If only the broker process goes down, the client request will get a socket exception immediately, independent of the timeout, so setting a large timeout doesn't hurt. If the whole broker host goes down while the client is waiting for a response from the server, I think the client will have to wait until the timeout expires. A larger timeout therefore means the client waits longer before realizing the broker is down. However, since this is the rarer case, I think setting a larger timeout for now is probably good enough.
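
The two cases above can be reproduced with plain local sockets. The following is only a sketch under stated assumptions, not Kafka client code: closed_port, silent_server and time_request are hypothetical helpers, a port with no listener stands in for a dead broker process, and a listener that never replies stands in for a dead host.

```python
import socket
import time


def closed_port():
    """Return a localhost port that nothing is listening on."""
    s = socket.socket()
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]
    s.close()
    return port


def silent_server():
    """Listen on a free port but never accept or reply.

    The kernel completes the TCP handshake, so the client connects and
    sends successfully, then blocks waiting for a response.
    """
    s = socket.socket()
    s.bind(("127.0.0.1", 0))
    s.listen(1)
    return s


def time_request(port, timeout):
    """Send one request and wait for a reply; return (outcome, seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout) as c:
            c.sendall(b"produce")
            c.recv(1024)
        outcome = "ok"
    except socket.timeout:
        # No response arrived within the timeout (the "host down" case).
        outcome = "timeout"
    except OSError:
        # Connection refused or reset (the "process down" case).
        outcome = "socket error"
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    # Dead process: fails at once even though the timeout is large.
    print(time_request(closed_port(), timeout=5.0))

    # Dead host: the client only finds out once the timeout expires.
    server = silent_server()
    print(time_request(server.getsockname()[1], timeout=0.5))
    server.close()
```

On a typical Linux machine the first call should fail almost immediately regardless of the 5 second timeout, while the second should take roughly the full timeout, which matches the trade-off described above.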
                
> Fault tolerance broken with replication factor 1
> ------------------------------------------------
>
>                 Key: KAFKA-691
>                 URL: https://issues.apache.org/jira/browse/KAFKA-691
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Jay Kreps
>            Assignee: Maxime Brugidou
>             Fix For: 0.8
>
>         Attachments: KAFKA-691-v1.patch, KAFKA-691-v2.patch
>
>
> In 0.7, if a partition was down we would just send the message elsewhere. This meant that the partitioning was really more of a "stickiness" than a hard guarantee. This made it impossible to depend on it for partitioned, stateful processing.
> In 0.8, when running with replication, this should not generally be a problem, as the partitions are now highly available and fail over to other replicas. However, with replication factor = 1 this no longer really works for most cases, as a dead broker will now give errors for that broker.
> I am not sure of the best fix. Intuitively, I think this is something that should be handled by the Partitioner interface. However, the partitioner currently has no knowledge of which nodes are available. So you could use a random partitioner, but that would keep going back to the down node.
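
One way to picture the partitioner idea in the issue description is a partition chooser that is told which partitions are currently available. This is only an illustrative sketch, not the Kafka Partitioner interface and not necessarily what the attached patches do: choose_partition, PartitionUnavailable and the available set are all hypothetical names. It assumes that keyed messages must keep the hard guarantee (fail rather than move), while unkeyed messages may go to any live partition.

```python
import random
import zlib


class PartitionUnavailable(Exception):
    """Raised when no acceptable partition is currently available."""


def choose_partition(key, num_partitions, available, rng=random):
    """Pick a partition given the set of currently available partition ids.

    key: bytes, or None for messages without a key.
    available: set of partition ids whose broker is believed to be up.
    """
    if key is None:
        # No key, so no ordering guarantee to protect: avoid dead partitions.
        if not available:
            raise PartitionUnavailable("no partitions available")
        return rng.choice(sorted(available))

    # Keyed message: always map the key to the same partition.
    # crc32 is used because it is stable across processes.
    partition = zlib.crc32(key) % num_partitions
    if partition not in available:
        # Sending elsewhere would turn the guarantee back into "stickiness".
        raise PartitionUnavailable(f"partition {partition} is down")
    return partition
```

With this shape, a random choice no longer keeps going back to the down node, and consumers doing partitioned, stateful processing still see every message for a given key on one partition or see an error.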

