[JS-2209] Improve resilience of Controller/Director Agent Cluster in case of hardware clock leaps - SOS JIRA

Details

Type: Feature
Status: Open
Priority: High
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: 2.8.1
Component/s: JS7 Agent, JS7 Controller
Labels: None

Description

Current Situation

The Controller Cluster and the Director Agent Cluster rely on synchronized server clocks; details are available on the Wiki page.
For clock leaps, the following thresholds determine the cluster's behavior:
- 3s: cluster will catch up
- 10s: cluster is affected, but usually will recover
- 20s: cluster will fail
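
The thresholds above can be read as a simple classification of a clock leap by its magnitude. The following is a minimal sketch only, not the product's implementation: the function name, the category labels and the treatment of leaps between 10s and 20s (which the list does not specify) are assumptions made for illustration.

```python
from datetime import timedelta

# Threshold values are taken from the list above. How the ranges between
# them are handled is an assumption of this sketch.
CATCH_UP = timedelta(seconds=3)
AFFECTED = timedelta(seconds=10)
FAILURE = timedelta(seconds=20)


def classify_clock_leap(leap: timedelta) -> str:
    """Map a clock leap (forward or backward) to an expected cluster outcome."""
    magnitude = abs(leap)  # a backward leap is treated like a forward one
    if magnitude <= CATCH_UP:
        return "catch-up"   # cluster will catch up
    if magnitude <= AFFECTED:
        return "affected"   # cluster is affected, but usually will recover
    if magnitude < FAILURE:
        return "at-risk"    # not specified in the list; hypothetical label
    return "failure"        # cluster will fail
```

For example, `classify_clock_leap(timedelta(seconds=25))` falls into the last category, which corresponds to the fail-over scenario described under Problem.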

Problem
If the threshold value for clock leaps is exceeded and the hardware clock of the Active Controller instance is slower than that of the Standby Controller instance, then the Cluster Watch (JOC Cockpit) and the Standby Controller will initiate fail-over because they consider the messages of the Active Controller to be outdated.

However, the Active Controller is still alive (while all others consider it dead due to the time difference) and is still connected to Agents.

At the same time, the Standby Controller becomes active and starts exchanging events with the Agents. This state does not last long: after 1-3s the Cluster Watch will instruct the Active Controller (if reachable) to become standby. In the meantime, however, the (former) Active Controller may have exchanged events with Agents that the new Active Controller does not know about (and vice versa). This can result in journal corruption, which is indicated by warnings such as “inapplicable event”.

The cluster no longer couples if both Controller instances receive events from Agents during the period in which they are active at the same time. Agent responses arrive for requests made by the other Controller instance, and a Controller instance cannot do anything with a response to a request that it did not send. Both Controller instances then assume they are on standby because they do not receive current events, and neither instance takes the lead in the cluster. As a result, the cluster is inoperable.

Desired Behavior

Logging
- Clock leaps will be logged. If thresholds are exceeded, warnings and errors will be written to the log.
- The difference in server clock times between the active instance, the standby instance and the Cluster Watch will be written to the log. If thresholds are exceeded, warnings and errors will be written to the log.
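
One common way to detect a clock leap is to compare the progress of the wall clock with the progress of a monotonic clock between two checks. The sketch below illustrates that technique under stated assumptions: the class and function names are hypothetical, and mapping the 3s and 10s thresholds to warning and error levels is a guess for illustration, not something this issue specifies.

```python
import logging
import time

logger = logging.getLogger("clock-leap-sketch")  # hypothetical logger name

WARN_THRESHOLD_S = 3.0    # assumption: above this, log a warning
ERROR_THRESHOLD_S = 10.0  # assumption: above this, log an error


def level_for(leap_seconds: float) -> int:
    """Choose a log level from the magnitude of a leap."""
    magnitude = abs(leap_seconds)
    if magnitude > ERROR_THRESHOLD_S:
        return logging.ERROR
    if magnitude > WARN_THRESHOLD_S:
        return logging.WARNING
    return logging.INFO


class ClockLeapDetector:
    """Compares wall-clock progress with monotonic-clock progress."""

    def __init__(self, wall_clock=time.time, monotonic_clock=time.monotonic):
        self._wall = wall_clock
        self._mono = monotonic_clock
        self._last_wall = wall_clock()
        self._last_mono = monotonic_clock()

    def check(self) -> float:
        """Return the leap in seconds since the previous check and log it."""
        wall_now, mono_now = self._wall(), self._mono()
        # Wall-clock time that passed minus real time that passed.
        leap = (wall_now - self._last_wall) - (mono_now - self._last_mono)
        self._last_wall, self._last_mono = wall_now, mono_now
        logger.log(level_for(leap), "clock leap of %.3fs detected", leap)
        return leap
```

A periodic task could call `check()` every second; the clock parameters exist so that the detector can be exercised with fake clocks in tests.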
Resilience
- If both the active and the standby instance exchange events with Agents at the same time, Agents will deny this when they identify too frequent fail-over operations between instances that send events in parallel.
- The cluster will shut down if it cannot catch up with clock leaps that exceed the thresholds. Once server clocks are synchronized, the cluster can be restarted and normal operation can be resumed.
This applies to both Controller Cluster and Agent Director Cluster.
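
The Agent-side denial described under Resilience could be approached with a sliding-window count of sender switches. This is a hypothetical sketch, not the planned design: the class name, the limit of one switch per 10 seconds and the instance identifiers are all assumptions chosen for illustration.

```python
from collections import deque


class FailoverGuard:
    """Denies requests when the sending instance switches too often."""

    def __init__(self, max_switches: int = 1, window_seconds: float = 10.0):
        self._max_switches = max_switches   # hypothetical limit
        self._window = window_seconds       # hypothetical window length
        self._switch_times = deque()        # times at which the sender changed
        self._current_sender = None

    def accept(self, sender_id: str, now: float) -> bool:
        """Return True if a request from sender_id may be processed.

        `now` should come from a monotonic clock so that the guard itself
        is not affected by clock leaps.
        """
        if self._current_sender is not None and sender_id != self._current_sender:
            self._switch_times.append(now)
        self._current_sender = sender_id
        # Forget switches that fall outside the sliding window.
        while self._switch_times and now - self._switch_times[0] > self._window:
            self._switch_times.popleft()
        return len(self._switch_times) <= self._max_switches
```

With these settings a single fail-over is accepted, while two instances that alternate requests within the window are denied until the switching stops and the window has passed.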

People

Assignee: Joacim Zschimmer

Reporter: Andreas Püschel

Dates

Created: 19 July 2025 06:53

Updated: 19 July 2025 08:11