  JS - JobScheduler / JS-1518

Handling of Agent tasks in case of failure and connection loss


    Details

      Description

      Current Situation

      Desired Behavior for Use Cases

      • Network Connection Loss (covered by JS-1524)
        • Agent
          • Running tasks continue to execute on the Agent.
          • The Agent stores log output and the execution history of tasks in local files (see JS-1521).
        • Master
          • The Master repeatedly attempts to re-connect to the Agent.
        • In case of successful re-connect:
          • The Agent reports the log information of running and completed tasks to the Master.
          • The Agent reports the execution history of running and completed tasks to the Master.
          • The Master adds the information received from re-connected Agents to its history.
          • After re-connect, the Master reports the tasks that are still running on the Agent.
        • In case of unsuccessful re-connect:
          • Tasks are killed by the Agent (see JS-1523).
      • Master Service Failure
        • The Agent handles tasks in the same way as in the case of Network Connection Loss with unsuccessful re-connect.
        • The Master can be configured to start in paused mode to allow users to check the job history and the task logs of Agents before continuing task execution (see JS-1511 and JS-1522).
      • Database Service Failure
        • Task handling is the same as in the case of Master Service Failure.
        • The Master retries the connection to the database every minute:
          • For a JobScheduler Active Cluster, the re-connect has to succeed within 120 s (see JS-1032).
          • For a JobScheduler Passive Cluster or a single instance, the number of re-connect attempts can be configured to be unlimited.
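
      The Agent-side behavior described for Network Connection Loss can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not JobScheduler code: the names AgentSketch, MasterSketch, on_reconnect, on_reconnect_failed and the (task_id, message) record shape are all hypothetical, and an in-memory list stands in for the local log and history files of JS-1521.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Event = Tuple[str, str]  # (task_id, message); hypothetical record shape


@dataclass
class AgentSketch:
    """Hypothetical Agent model; not the JobScheduler API."""
    running: Set[str] = field(default_factory=set)
    # Stands in for the local log/history files mentioned in JS-1521.
    local_history: List[Event] = field(default_factory=list)

    def start(self, task_id: str) -> None:
        # Tasks keep running and are recorded locally, connected or not.
        self.running.add(task_id)
        self.local_history.append((task_id, "started"))

    def complete(self, task_id: str) -> None:
        self.running.discard(task_id)
        self.local_history.append((task_id, "completed"))

    def on_reconnect(self) -> Dict[str, list]:
        """Successful re-connect: hand over buffered history and running tasks."""
        report = {
            "history": list(self.local_history),
            "running": sorted(self.running),
        }
        self.local_history.clear()
        return report

    def on_reconnect_failed(self) -> List[str]:
        """Unsuccessful re-connect: kill whatever is still running (JS-1523)."""
        killed = sorted(self.running)
        for task_id in killed:
            self.local_history.append((task_id, "killed"))
        self.running.clear()
        return killed


@dataclass
class MasterSketch:
    """Hypothetical Master model that merges Agent reports into its history."""
    history: List[Event] = field(default_factory=list)
    running: List[str] = field(default_factory=list)

    def merge(self, report: Dict[str, list]) -> None:
        self.history.extend(report["history"])
        self.running = list(report["running"])


if __name__ == "__main__":
    agent, master = AgentSketch(), MasterSketch()
    agent.start("t1")
    agent.start("t2")
    agent.complete("t1")
    master.merge(agent.on_reconnect())
    print(master.running, master.history)
```

      The sketch deliberately leaves out how a re-connect is judged unsuccessful; the issue text does not specify a timeout or attempt limit for the Agent connection.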

      Maintainer Notes

      • This feature has been dismissed because the resilience capabilities will be completely reworked with JobScheduler 2.0.
      • Future 2.0 releases of Agents will include semi-autonomous behavior that works in case of connection loss and in case of a Master outage.
      • Therefore, we will not focus on providing this functionality for a Master that will only be in use for a limited time (release 1.12 will be the last minor release before the major release 2.0).


              People

              • Assignee: sos_engine_team TeamEngine
              • Reporter: ap Andreas Püschel
