We're using Mesos 1.1.0 and have observed some unexpected behavior with
regard to Agent reregistration on our cluster.
When a health check failure happens, our framework (in this case Apache
Aurora) receives an Agent Lost message along with a TASK_LOST message for
each of the tasks that were running on the agent that failed the health
check (i.e. it stopped responding for *max_agent_ping_timeouts* pings).
We expected the same behavior to take place when an Agent does not
reregister before the *agent_reregister_timeout* is up. However, while our
framework did receive an Agent Lost message once 10 minutes (the default
*agent_reregister_timeout* value) had passed since leader election, it did
not receive any messages concerning the tasks that were running on that node.
This can create a scenario where, if the Agent goes away permanently, we
have tasks that are unaccounted for and won't be restarted on another Agent
until an explicit reconciliation is done.
On the other hand, if the Agent does come back after the reregister
timeout, and the framework has replaced the missing instances, the
instances that were previously running will continue to run until an
implicit reconciliation is done.
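
To make the two cases above concrete, here is a toy model of how I
understand explicit and implicit reconciliation to differ. It is only a
sketch: plain dicts stand in for master and framework state, the task IDs
are made up, and none of the function names are Mesos or Aurora APIs.

```python
# Toy model only: dicts map a task ID to its last known state.
# Nothing here is a real Mesos or Aurora call.

def explicit_reconcile(master_tasks, framework_tasks):
    """Framework asks about the specific tasks it believes exist.

    A task the master no longer knows about is answered with TASK_LOST,
    which is what a non-partition-aware framework would expect to see.
    """
    return {task_id: master_tasks.get(task_id, "TASK_LOST")
            for task_id in framework_tasks}


def implicit_reconcile(master_tasks):
    """Framework sends an empty list; master reports every task it knows."""
    return dict(master_tasks)


# Case 1: the Agent never comes back. The master has dropped its tasks,
# but the framework still believes the task is running.
case1_master = {}
case1_framework = {"job/0": "TASK_RUNNING"}
case1_result = explicit_reconcile(case1_master, case1_framework)

# Case 2: the Agent comes back after the timeout, and the framework has
# already launched a replacement. The old task is still running but the
# framework no longer tracks it.
case2_master = {"job/0-old": "TASK_RUNNING", "job/0-new": "TASK_RUNNING"}
case2_framework = {"job/0-new": "TASK_RUNNING"}
case2_result = implicit_reconcile(case2_master)
case2_orphans = sorted(set(case2_result) - set(case2_framework))

print(case1_result)
print(case2_orphans)
```

In the first case only an explicit reconciliation surfaces the lost task;
in the second, only an implicit one surfaces the leftover instance.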
I understand some behavior may have changed with partition-aware
frameworks, so I'm trying to understand if this is the expected behavior.
For what it's worth, Aurora is not a partition-aware framework.
Any help would be appreciated,