Details
- Fix
- Status: Released
- Minor
- Resolution: Fixed
- 1.10
Description
Current Situation
- The Universal Agent continues executing a task if the connection to its Master is lost.
- This is inconsistent with the Classic Agent, which kills tasks in the same situation.
- Because the connection between Master and Agent changed from TCP to HTTP, an interruption of the connection is not detected immediately.
Desired Behavior
- The Universal Agent kills tasks if the connection to its Master is lost. A connection loss is detected by missing heartbeats, which Master and Agent send to each other.
- This behavior is intended to prevent simultaneous duplicate execution of tasks. If the connection loss is caused by a failure of the JobScheduler Master, then after coming back up the Master would request the Agent to start the respective task again, as it has no knowledge of the previous execution result.
Heartbeat Implementation
- The Master and Agent send heartbeats to each other.
- The Agent receives HTTP POST requests from the Master and responds within 5s, regardless of whether the command requested by the Master has completed.
- The Master keeps sending HTTP POST requests and accepting acknowledgements until the Agent sends the final response, i.e. after completion of the task.
- If the Agent does not receive a heartbeat from the Master within twice that period (10s), the Agent assumes the connection is lost and kills the task.
- If the Master does not receive a heartbeat from the Agent, the Master considers the task lost and assigns it an error state.
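The Agent-side timeout rule described above can be illustrated with a minimal sketch. This is not the JobScheduler implementation. The class name `HeartbeatWatchdog`, the `kill_task` callback, and the injectable `clock` are hypothetical names introduced for illustration only. Only the 5s response period and the 10s timeout come from this issue. The sketch also assumes the task is killed once the silence strictly exceeds 10s, which the issue does not specify.

```python
import time
from typing import Callable

HEARTBEAT_PERIOD_S = 5.0                      # response period stated in the issue
HEARTBEAT_TIMEOUT_S = 2 * HEARTBEAT_PERIOD_S  # "double period" (10s)


class HeartbeatWatchdog:
    """Hypothetical Agent-side watchdog (illustration, not JobScheduler code)."""

    def __init__(self,
                 kill_task: Callable[[], None],
                 timeout_s: float = HEARTBEAT_TIMEOUT_S,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._kill_task = kill_task      # assumed callback that terminates the task
        self._timeout_s = timeout_s
        self._clock = clock              # injectable so the logic can be tested
        self._last_heartbeat = clock()
        self.task_killed = False

    def on_heartbeat(self) -> None:
        """Record a heartbeat, i.e. an HTTP POST request from the Master."""
        self._last_heartbeat = self._clock()

    def check(self) -> bool:
        """Kill the task (at most once) if the Master has been silent too long.

        Returns True if the task has been killed.
        """
        silent_for = self._clock() - self._last_heartbeat
        if silent_for > self._timeout_s and not self.task_killed:
            self._kill_task()
            self.task_killed = True
        return self.task_killed
```

In such a sketch, `check()` would be called periodically by the Agent, for example once per second, while `on_heartbeat()` would be called from the HTTP request handler. The Master-side rule (marking the task as lost and assigning an error state) could follow the same pattern with a different callback.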
Issue Links
- is related to
  - JS-1567 Agent should not kill a quiet task being this task the only one running (Released)
- is required by
  - JS-1518 Handling of Agent tasks in case of failure and connection loss (Dismissed)
  - JOE-204 JOE supports heartbeat attributes for Agents (Released)
  - JS-1524 Universal Agent supports reconciliation after connection loss (Released)
- relates to
  - JS-1456 Keep-Alive for JobScheduler Universal Agent (Released)