[JS-2141] Confirm loss of Subagent to restart jobs - SOS JIRA

Current Situation
Consider an Agent Cluster executing jobs with Subagents:

If the Subagent crashes before a job starts, the active Director Agent selects the next available Subagent to execute the job.
If the Subagent crashes while a job is running, the related order is put into the blocked state and no operations are available on the order. When the Subagent is restarted, the job is restarted.
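
The two cases above can be summarized as a small decision function. This is only an illustrative sketch: the names `JobPhase`, `Action` and `on_subagent_crash` are hypothetical and are not part of the JS7 code base or API.

```python
from enum import Enum


class JobPhase(Enum):
    """Where the job was when the Subagent crashed (hypothetical model)."""
    NOT_STARTED = "not started"
    RUNNING = "running"


class Action(Enum):
    """What happens next, as described in the current situation."""
    SELECT_NEXT_SUBAGENT = "Director Agent selects the next available Subagent"
    BLOCK_ORDER = "order is blocked; job restarts when the Subagent is restarted"


def on_subagent_crash(phase: JobPhase) -> Action:
    # A job that has not started yet can safely be moved to another Subagent.
    if phase is JobPhase.NOT_STARTED:
        return Action.SELECT_NEXT_SUBAGENT
    # A running job is not moved automatically: the order stays blocked
    # until the crashed Subagent comes back.
    return Action.BLOCK_ORDER


if __name__ == "__main__":
    for phase in JobPhase:
        print(f"{phase.value}: {on_subagent_crash(phase).value}")
```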

Desired Behavior

Restarting jobs when a crashed Subagent is restarted is correct behavior. In addition, users would like to restart jobs on a different Subagent if the crashed Subagent is not restarted, for example when its server is out of order.
A Director Agent cannot tell whether the Subagent is down or is merely unreachable while still executing the job, for example due to network issues. Automatically restarting such jobs could therefore result in double job execution.
Users who wish to restart jobs on a different Subagent can confirm the loss of the crashed Subagent (a Controller command). The next Subagent is then selected based on the Subagent Cluster configuration.
- The command is sent from JOC Cockpit to the Controller.
- The Controller forwards the command to the Director Agent, which must be reachable from the Controller.

Maintainer Note

The functionality is available through the "Reset" operation offered on the "Manage Controllers/Agents" page. The "Reset" operation for a crashed Subagent causes crashed jobs to be restarted on the next Subagent in the given Subagent Cluster, unless the job is marked as non-restartable.
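
The effect of confirming the loss of a Subagent can be sketched as follows. This is a simplified model under stated assumptions, not the actual implementation: the names `BlockedJob` and `confirm_loss` are hypothetical, and the Subagent Cluster is assumed to be a plain priority-ordered list, whereas the real selection depends on the Subagent Cluster configuration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class BlockedJob:
    """A job whose order is blocked because its Subagent crashed (hypothetical)."""
    name: str
    subagent: str
    restartable: bool = True


def confirm_loss(
    lost_subagent: str,
    cluster: List[str],
    jobs: List[BlockedJob],
) -> Dict[str, Optional[str]]:
    """Map each job of the lost Subagent to the Subagent that restarts it.

    A value of None means the job is not restarted, either because it is
    marked as non-restartable or because no other Subagent is available.
    """
    # Assumption: the cluster is ordered by priority and the first
    # remaining Subagent counts as the "next" one.
    candidates = [s for s in cluster if s != lost_subagent]
    restarts: Dict[str, Optional[str]] = {}
    for job in jobs:
        if job.subagent != lost_subagent:
            continue  # jobs on other Subagents are unaffected
        if job.restartable and candidates:
            restarts[job.name] = candidates[0]
        else:
            restarts[job.name] = None
    return restarts


if __name__ == "__main__":
    jobs = [
        BlockedJob("job-a", "subagent-1"),
        BlockedJob("job-b", "subagent-1", restartable=False),
        BlockedJob("job-c", "subagent-2"),
    ]
    print(confirm_loss("subagent-1", ["subagent-1", "subagent-2"], jobs))
```

Because the restart only happens after an explicit confirmation by the user, the sketch never moves a job on its own, which mirrors the reasoning about avoiding double job execution.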