[JS-2018] Controller Cluster should support repeated fail-over in case of network isolation - SOS JIRA

Current Situation

If a Controller Cluster is operated in a network environment that frequently isolates individual Controller instances from the network for short periods, then the cluster stops working after repeated fail-over at short intervals.
- Network isolation means that a Controller instance can neither connect to other components nor be reached by them. At the same time, the remaining components (Secondary Controller, Cluster Watch Agent) can continue to communicate with each other.
- Repeated fail-over means that fail-over occurs at intervals that are shorter than the time the Controller Cluster requires to synchronize after a previous fail-over.
As a result of this situation, both the Primary and the Secondary Controller instance can become active at the same point in time.
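
The definition of repeated fail-over above can be illustrated with a small check over recorded fail-over times. This is only a sketch for reasoning about the condition: the function name, the timestamps and the synchronization duration are hypothetical and are not taken from the Controller's logs or API.

```python
from typing import List, Tuple


def find_repeated_failovers(
    failover_times: List[float], sync_duration: float
) -> List[Tuple[float, float]]:
    """Return consecutive fail-over pairs whose gap is shorter than sync_duration.

    failover_times: points in time (seconds) at which a fail-over occurred.
    sync_duration:  assumed time (seconds) the cluster needs to synchronize
                    after a fail-over.
    """
    ordered = sorted(failover_times)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier < sync_duration
    ]


# Hypothetical example: fail-over at 0s, 40s, 55s and 200s, synchronization takes 30s.
times = [0.0, 40.0, 55.0, 200.0]
print(find_repeated_failovers(times, sync_duration=30.0))  # [(40.0, 55.0)]
```

Only the fail-over at 55s follows its predecessor before synchronization could have completed, which is the case described in this issue.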

Desired Behavior

The Controller Cluster should cope with situations in which individual Controller instances are repeatedly isolated from the network for short periods and in which fail-over repeatedly occurs at short intervals.
No two Controller instances in a cluster should be active at the same point in time.
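
One common way to enforce such an invariant is a time-limited lease granted by a third party. The following sketch illustrates that general technique and does not describe how the Controller Cluster actually resolves this issue. The names `LeaseArbiter`, `Lease` and `is_active` are hypothetical, the arbiter only stands in for a component such as the Cluster Watch Agent, and a single clock value shared by all parties is assumed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str        # name of the instance allowed to be active
    epoch: int         # increases whenever the holder changes
    expires_at: float  # point in time after which the lease is void


class LeaseArbiter:
    """Hypothetical third party that grants the active role to one instance."""

    def __init__(self, lease_seconds: float) -> None:
        self.lease_seconds = lease_seconds
        self.lease: Optional[Lease] = None
        self.epoch = 0

    def acquire_or_renew(self, node: str, now: float) -> Optional[Lease]:
        current = self.lease
        # Refuse while a different instance still holds an unexpired lease.
        if current is not None and current.holder != node and now < current.expires_at:
            return None
        # A change of holder starts a new epoch.
        if current is None or current.holder != node:
            self.epoch += 1
        self.lease = Lease(node, self.epoch, now + self.lease_seconds)
        return self.lease


def is_active(lease: Optional[Lease], node: str, now: float) -> bool:
    """An instance may act as active only while it holds an unexpired lease."""
    return lease is not None and lease.holder == node and now < lease.expires_at


arbiter = LeaseArbiter(lease_seconds=10.0)
primary_lease = arbiter.acquire_or_renew("primary", now=0.0)

# The primary becomes isolated and cannot renew; the secondary must wait.
assert arbiter.acquire_or_renew("secondary", now=5.0) is None

# Once the lease has expired, the secondary takes over in a new epoch.
secondary_lease = arbiter.acquire_or_renew("secondary", now=10.0)
assert not is_active(primary_lease, "primary", now=10.0)
assert is_active(secondary_lease, "secondary", now=10.0)
```

In this sketch an isolated instance loses the active role by itself when its lease expires, so a takeover never overlaps with the previous holder regardless of how quickly fail-over repeats. A real deployment would also have to account for clock drift between instances, which the sketch ignores.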