Not sending TASK_LOST is a breaking change compared to previous behavior.
From the docs here:

"When it is time to remove an agent, the master removes the agent from the …"
And then from the section on agent reregistration:

"If an agent does not reregister with the new master within a timeout …"
Clearly the idea was that frameworks would see TASK_LOST every time the
agent is marked as lost.

This behavior appears to have been broken by this commit:

Reconciliation is still required because message delivery is best-effort,
but the fundamental difference is that frameworks now *rely* on reconciliation
for basic operation. We have plans to eventually add partition-awareness
to Aurora, but IMO this change in behavior was an oversight when trying
to maintain backwards compatibility and can be (harmlessly) fixed in Mesos.
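For anyone following along: since frameworks now have to reconcile to learn about tasks on removed agents, here is a minimal sketch of building a RECONCILE call for the v1 scheduler HTTP API (the framework ID below is made up for illustration; in practice the call is POSTed to the leading master's /api/v1/scheduler endpoint).

```python
import json

# Hypothetical framework ID, for illustration only.
FRAMEWORK_ID = "my-framework-id"

def build_reconcile_call(framework_id, task_ids):
    """Build a v1 scheduler API RECONCILE call body.

    An empty task list requests implicit reconciliation: the master
    replies with the latest state of every task it knows about for
    this framework. A non-empty list asks only about those tasks.
    """
    return {
        "framework_id": {"value": framework_id},
        "type": "RECONCILE",
        "reconcile": {
            "tasks": [{"task_id": {"value": t}} for t in task_ids]
        },
    }

# Implicit reconciliation: ask about everything the master knows of.
call = build_reconcile_call(FRAMEWORK_ID, [])
print(json.dumps(call))
```

The master answers with TASK_LOST/TASK_UNKNOWN-style status updates for tasks it no longer knows about, which is what Aurora would have to do periodically under the current behavior.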


On 2017-07-17 09:20 (-0700), Ilya Pronin <[EMAIL PROTECTED]> wrote:
> it simply doesn't know about them. The only thing the master can do in
> this situation is send a LostSlaveMessage, which will tell the scheduler
> that tasks on this agent are LOST/UNREACHABLE.
>
> The situation where the agent came back after the reregistration timeout
> doesn't sound good. The only way for the framework to learn about tasks
> that are still running on such an agent is either from status updates or
> implicit reconciliation. Perhaps the master could send updates for tasks
> it learned about when such an agent is readmitted?
> On Sun, Jul 16, 2017 at 5:54 AM, Meghdoot bhattacharya
> <[EMAIL PROTECTED]> wrote:
> > This looks like a serious bug unless we are missing something. Hoping for
> > clarifications.
> >
> > Thx
> >
> > > On Jul 14, 2017, at 3:52 PM, Renan DelValle <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > Hi all,
> > >
> > > We're using Mesos 1.1.0 and have observed some unexpected behavior with
> > > regard to Agent reregistration on our cluster.
> > >
> > > When a health check failure happens, our framework (in this case
> > > Aurora) receives an Agent Lost message along with TASK_LOST messages for
> > > each of the tasks that were running on the agent that failed the
> > > health check (not responding after *max_agent_ping_timeouts*).
> > >
> > > We expected the same behavior to take place when an Agent does not
> > > reregister before the *agent_reregister_timeout* is up. However, while
> > > our framework did receive an Agent Lost message after 10 minutes (the
> > > agent_reregister_timeout value) had passed since leader election, it did
> > > not receive any messages concerning the tasks that were running on that
> > > node.
> > >
> > > This can create a scenario where, if the Agent goes away permanently, …