Details
Feature
Status: Released
High
Resolution: Fixed
2.0.0
None
Description
Current Situation
- The JS7 - FAQ - How does JobScheduler terminate Jobs article explains the steps performed by the Agent to terminate job processes.
- For job processes that are terminated by a SIGTERM signal, any child processes will continue to run. This is the built-in behavior of Unix shells.
- Users who wish to terminate child processes can add to their job script:
trap "wait && exit 143" TERM
The trap will prevent a job process from terminating until all child processes are terminated. If the Grace Timeout is exceeded then the Agent executes the kill_task.sh | .cmd script to kill the job process and any child processes.
The trap must include an exit command with a return value. The return value (143 in this example, that is 128 + 15 for SIGTERM) can be used to signal abnormal termination of a job in the workflow.
Desired Behavior for use with cancel/kill and suspend/kill operations
- Users should not be forced to add a trap to their job scripts as killing of child processes is considered the default behavior.
- When killing a job process, the Agent performs the following steps:
  - collect the child process PIDs of the job process
  - send SIGTERM to the job process
  - wait for one of the following events, whichever arrives first:
    - the Grace Timeout configured with the job is exceeded
    - stdout/stderr are released by the job process
  - send SIGKILL to the job process if the Grace Timeout is exceeded
  - send SIGTERM to the child processes for which PIDs have previously been collected; send SIGTERM recursively to child processes of a child process
  - wait for 50% of the duration of the Grace Timeout or for 1s, whichever is the higher value
  - send SIGKILL recursively to the remaining child processes
- The Agent drops use of the kill_task.sh | .cmd scripts. Instead, Java is used for process management.
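The kill sequence described above could be sketched in Java roughly as follows. This is an illustration only, not the Agent's implementation: the class name KillSequence and its methods are hypothetical, and the sketch relies on the standard java.lang.ProcessHandle API (Java 9+). On Unix JDKs destroy() typically sends SIGTERM and destroyForcibly() sends SIGKILL, but this is implementation dependent. Release of stdout/stderr cannot be observed through ProcessHandle, so the sketch approximates that event by waiting for the process to exit.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.stream.Collectors;

// Hypothetical sketch of the kill sequence; not the Agent's actual code.
final class KillSequence {

    // Wait applied before SIGKILL is sent to child processes:
    // 50% of the Grace Timeout or 1s, whichever is higher.
    static Duration childWait(Duration graceTimeout) {
        Duration half = graceTimeout.dividedBy(2);
        Duration minimum = Duration.ofSeconds(1);
        return half.compareTo(minimum) > 0 ? half : minimum;
    }

    static void kill(ProcessHandle job, Duration graceTimeout) throws InterruptedException {
        // 1. Collect child processes (recursively) before any signal is sent.
        List<ProcessHandle> children = job.descendants().collect(Collectors.toList());

        // 2. Request normal termination of the job process (SIGTERM on Unix).
        job.destroy();

        // 3. Wait up to the Grace Timeout; force termination if it is exceeded.
        if (!awaitExit(job, graceTimeout)) {
            job.destroyForcibly();
        }

        // 4. Request normal termination of all collected child processes.
        children.forEach(ProcessHandle::destroy);

        // 5. Give the child processes a shared deadline to exit.
        long deadline = System.nanoTime() + childWait(graceTimeout).toNanos();
        for (ProcessHandle child : children) {
            long remaining = Math.max(0L, deadline - System.nanoTime());
            awaitExit(child, Duration.ofNanos(remaining));
        }

        // 6. Force termination of any child process that is still alive.
        children.stream()
                .filter(ProcessHandle::isAlive)
                .forEach(ProcessHandle::destroyForcibly);
    }

    // Returns true if the process exited within the timeout.
    static boolean awaitExit(ProcessHandle process, Duration timeout) throws InterruptedException {
        try {
            process.onExit().get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (ExecutionException e) {
            return !process.isAlive();
        }
    }
}
```

Collecting the descendants before SIGTERM is sent matters: once the job process has exited, its children are re-parented and can no longer be found by walking the process tree from the job's PID.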
Desired Behavior in the event of Agent crash
- If the Agent is killed, then the watchdog.sh | .cmd script will terminate remaining child processes.
- The Agent keeps track of the PIDs of job processes that have been started. The information is available from a temporary file.
- The Agent offers a Java class that is used by the watchdog.sh | .cmd script. The Java class will use the information about the PIDs of any started processes and will apply the following sequence of actions:
- for each job process identify child processes recursively,
- send SIGTERM to all job processes and child processes,
- wait for 3s,
- send SIGKILL to all remaining job processes and child processes.
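The crash cleanup sequence above might look roughly like the following Java sketch. It is an illustration under stated assumptions, not the class shipped with the Agent: the name WatchdogCleanup, its methods, and the temporary file format (one PID per line, path passed as the first argument) are all hypothetical. The sketch uses the standard java.lang.ProcessHandle API, where destroy() typically maps to SIGTERM and destroyForcibly() to SIGKILL on Unix.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the cleanup run by the watchdog; not the Agent's actual class.
final class WatchdogCleanup {

    // Assumed file format: one PID per line; blank lines are ignored.
    static List<Long> parsePids(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .map(Long::parseLong)
                .collect(Collectors.toList());
    }

    // For each job process that still exists, collect it and its child processes recursively.
    static List<ProcessHandle> collectTargets(List<Long> pids) {
        List<ProcessHandle> targets = new ArrayList<>();
        ProcessHandle self = ProcessHandle.current();
        for (long pid : pids) {
            ProcessHandle.of(pid)
                    .filter(job -> !job.equals(self)) // never signal this JVM itself
                    .ifPresent(job -> {
                        targets.add(job);
                        job.descendants().forEach(targets::add);
                    });
        }
        return targets;
    }

    // Request normal termination, wait, then force termination of the remaining processes.
    static void terminateAll(List<ProcessHandle> targets, Duration wait) throws InterruptedException {
        targets.forEach(ProcessHandle::destroy);
        Thread.sleep(wait.toMillis());
        targets.stream()
                .filter(ProcessHandle::isAlive)
                .forEach(ProcessHandle::destroyForcibly);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: WatchdogCleanup <pid-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Path.of(args[0]));
        terminateAll(collectTargets(parsePids(lines)), Duration.ofSeconds(3));
    }
}
```

PIDs recorded in the file may no longer exist or may have been reused by unrelated processes after a crash; the sketch only skips missing PIDs, and a production implementation would likely need an additional check such as comparing process start times.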