Yes, we've confirmed this internally too (Santhosh did the work here):
When an agent becomes unreachable while the master is running, it sends
The separate code path for markUnreachableAfterFailover appears to have
been added by this commit: https://github.com/apache/mesos/commit/937c85f2f6528d1ac56ea9a7aa174ca0bd371d0c
And I think this totally breaks the promise of introducing the
PARTITION_AWARE stuff in a backwards-compatible way.
So right now, yes, we rely on reconciliation to eventually mark the tasks as
LOST and reschedule their replacements.
I think the only reason we haven't been more impacted by this at Twitter is that
our Mesos master is remarkably stable (compared to Aurora's daily
We have two paths forward here: push forward and embrace the new partition
awareness features in Aurora and/or push back on the above change with the
Mesos community and have a better story for non-partition aware APIs in the
On Sat, Jul 15, 2017 at 2:01 AM, Meghdoot bhattacharya <
[EMAIL PROTECTED]> wrote: