JS - JobScheduler / JS-2148

Kill child processes of shell jobs terminating after SIGTERM

Details

    • Type: Feature
    • Status: Approved
    • Priority: High
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.7.2
    • Component/s: JS7 Agent
    • Labels: None

    Description

      Current Situation

      • The article "JS7 - FAQ - Does JS7 reliably kill running jobs" explains the steps performed by the Agent to terminate job processes.
      • When a job process is terminated by a SIGTERM signal, its child processes continue to run. This is the built-in behavior of Unix shells.
      • Users who wish to have child processes terminated can add the following trap to their job script:
        trap "wait && exit 143" TERM
        

        The trap prevents the job process from terminating until all child processes have terminated. If the Grace Timeout is exceeded, the Agent executes the kill_task.sh | .cmd script to kill the job process and any child processes.
        The exit command is required in the trap, because otherwise the script resumes execution after the trap handler returns. The exit command must also specify a return value: a non-zero value such as 143 (128 + 15, the conventional exit code for termination by SIGTERM) signals abnormal termination of the job in the workflow.

      Desired Behavior for use with cancel/kill and suspend/kill operations

      • Users should not be forced to add a trap to their job scripts as killing of child processes is considered the default behavior.
      • When killing a job process, the Agent performs the following steps:
        • collect the PIDs of the job process's child processes
        • send SIGTERM to the job process
        • wait for one of the following events, whichever occurs first:
          • the Grace Timeout configured for the job expires
          • stdout/stderr are released by the job process
        • send SIGKILL to the job process if the Grace Timeout is exceeded
        • send SIGTERM to the child processes whose PIDs were collected previously, and recursively to the child processes of each child process
        • wait for 50% of the Grace Timeout or for 1s, whichever is longer
        • send SIGKILL recursively to any remaining child processes
      • The Agent drops use of the kill_task.sh | .cmd scripts. Instead, Java is used for process management.
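
      The kill sequence listed above could be sketched with the standard java.lang.ProcessHandle API roughly as follows. This is an illustration only, not the Agent's actual code: the class and method names (JobProcessKiller, kill, childWait, awaitExit) are hypothetical, and the sketch waits only for the job process to exit rather than for stdout/stderr to be released. It assumes a Unix-like platform on which destroy() delivers SIGTERM and destroyForcibly() delivers SIGKILL.

```java
import java.time.Duration;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not the Agent's actual class.
final class JobProcessKiller {

    /** Wait applied to child processes: 50% of the Grace Timeout, but at least 1s. */
    static Duration childWait(Duration graceTimeout) {
        Duration half = graceTimeout.dividedBy(2);
        Duration oneSecond = Duration.ofSeconds(1);
        return half.compareTo(oneSecond) > 0 ? half : oneSecond;
    }

    /** Blocks until all given processes have exited or the timeout elapses. */
    static void awaitExit(Set<ProcessHandle> handles, Duration timeout) throws InterruptedException {
        CompletableFuture<?>[] exits = handles.stream()
                .map(ProcessHandle::onExit)
                .toArray(CompletableFuture[]::new);
        try {
            CompletableFuture.allOf(exits).get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            // Some processes may still be running; the caller escalates to SIGKILL.
        }
    }

    static void kill(ProcessHandle job, Duration graceTimeout) throws InterruptedException {
        // Collect descendants first: once the job process exits, its children
        // are re-parented and can no longer be found through the job's PID.
        Set<ProcessHandle> children = job.descendants()
                .collect(Collectors.toCollection(LinkedHashSet::new));

        job.destroy();                           // SIGTERM to the job process
        awaitExit(Set.of(job), graceTimeout);    // simplification: waits for exit only
        if (job.isAlive()) {
            job.destroyForcibly();               // SIGKILL after the Grace Timeout
        }

        // Include processes that the collected children started in the meantime.
        for (ProcessHandle child : List.copyOf(children)) {
            child.descendants().forEach(children::add);
        }
        children.forEach(ProcessHandle::destroy);            // SIGTERM to child processes
        awaitExit(children, childWait(graceTimeout));
        children.stream()
                .filter(ProcessHandle::isAlive)
                .forEach(ProcessHandle::destroyForcibly);    // SIGKILL to the remainder
    }

    public static void main(String[] args) throws InterruptedException {
        if (args.length < 2) {
            System.err.println("usage: JobProcessKiller <pid> <grace-timeout-seconds>");
            return;
        }
        Optional<ProcessHandle> job = ProcessHandle.of(Long.parseLong(args[0]));
        if (job.isPresent()) {
            kill(job.get(), Duration.ofSeconds(Long.parseLong(args[1])));
        }
    }
}
```

      With a Grace Timeout of 10s, childWait returns 5s. With a Grace Timeout of 1s, it returns the 1s minimum.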

      Desired Behavior in the event of Agent crash

      • If the Agent is killed, then the watchdog.sh | .cmd script will terminate remaining child processes.
      • The Agent keeps track of the PIDs of job processes that have been started. The information is available from a temporary file.
      • The Agent offers a Java class that is used by the watchdog.sh | .cmd script. The Java class will use the information about the PIDs of any started processes and will apply the following sequence of actions:
        • for each job process identify child processes recursively,
        • send SIGTERM to all job processes and child processes,
        • wait for 3s,
        • send SIGKILL to all remaining job processes and child processes.
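
      A minimal sketch of such a Java class is given below. Several details are assumptions that the issue does not specify: the class name WatchdogCleanup, the method names, and the file format (one PID per line, passed as the first argument). A real implementation would also have to guard against PIDs being reused by unrelated processes after a crash.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class name and file format (one PID per line); neither is taken from the issue.
final class WatchdogCleanup {

    /** Parses one PID per line, skipping blank or malformed lines. */
    static List<Long> parsePids(List<String> lines) {
        List<Long> pids = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            try {
                pids.add(Long.parseLong(trimmed));
            } catch (NumberFormatException e) {
                // Ignore lines that do not contain a PID.
            }
        }
        return pids;
    }

    /** Resolves each job PID that still exists and adds its descendants recursively. */
    static Set<ProcessHandle> collectTargets(List<Long> pids) {
        Set<ProcessHandle> targets = new LinkedHashSet<>();
        for (long pid : pids) {
            ProcessHandle.of(pid).ifPresent(job -> {
                targets.add(job);
                job.descendants().forEach(targets::add);
            });
        }
        targets.remove(ProcessHandle.current()); // never signal the cleanup process itself
        return targets;
    }

    /** Sends SIGTERM to all targets, waits, then sends SIGKILL to those still alive. */
    static void terminate(Set<ProcessHandle> targets, long waitMillis) throws InterruptedException {
        targets.forEach(ProcessHandle::destroy);
        Thread.sleep(waitMillis);
        targets.stream()
                .filter(ProcessHandle::isAlive)
                .forEach(ProcessHandle::destroyForcibly);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 1) {
            System.err.println("usage: WatchdogCleanup <pid-file>");
            return;
        }
        List<Long> pids = parsePids(Files.readAllLines(Path.of(args[0])));
        terminate(collectTargets(pids), 3000); // 3s wait as described above
    }
}
```

      The watchdog script would invoke the class with the path of the PID file, for example java WatchdogCleanup followed by the file path. Both the invocation and the path are illustrative.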

            People

              Joacim Zschimmer
              Andreas Püschel
              Oliver Haufe