  1. JS - JobScheduler
  2. JS-684

System Monitor (Nagios, op5) should notify if a JobScheduler Universal Agent is not available


    Details

      Description

      Current Situation

      • JobScheduler does not know whether an Agent is available until a job is launched for execution on that Agent.

      Desired Behavior

      • Users would like to be notified when any Agent in the network is unavailable so that they can take measures before a job is executed.

      Implementation

      • The Perl script check_jobscheduler_agent.pl can be used for integration with Nagios, op5 and compatible System Monitors.
      • The script takes the following command arguments:
        • the host and port of the JobScheduler Universal Agent, given as the URL
          http://host:port/jobscheduler/agent/api/overview
        • the list of attributes that are used to perform the check
        • the list of attributes that are added to the output of the script
        • the maximum timeout for connecting to the Agent
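
      The argument handling described above can be sketched as follows. This is an illustration in Python, not the Perl script itself: the option letters -u, -a, -o and -t and the brace syntax for attribute lists are taken from the Service Command and Service Parameters shown in this issue, while the function names, the default values and the interpretation of the timeout as seconds are assumptions.

      ```python
      import argparse
      import re


      def parse_attribute_list(spec):
          """Turn a list such as '{totalTaskCount},{currentTaskCount}' into attribute names."""
          return re.findall(r"\{([^{}]+)\}", spec)


      def build_parser():
          """Build a parser for the four arguments (hypothetical helper, not part of the Perl script)."""
          parser = argparse.ArgumentParser(description="Check a JobScheduler Universal Agent")
          parser.add_argument("-u", dest="url", required=True,
                              help="URL of the Agent overview web service")
          parser.add_argument("-a", dest="check_attributes", required=True,
                              type=parse_attribute_list,
                              help="attributes used to perform the check")
          parser.add_argument("-o", dest="output_attributes", default=[],
                              type=parse_attribute_list,
                              help="attributes added to the output")
          # Assumption: the timeout is given in seconds; the default of 20 is arbitrary.
          parser.add_argument("-t", dest="timeout", type=float, default=20,
                              help="maximum timeout for connecting to the Agent")
          return parser


      if __name__ == "__main__":
          print(build_parser().parse_args())
      ```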

      Operation

      • Check
        • The script uses the attributes totalTaskCount and currentTaskCount from the Agent response to check whether the Agent is available.
      • Output
        • If the Agent is not available:
          • Message:
            Check JobScheduler Universal Agent - Agent is CRITICAL - Connection failed: 500 Can't connect to 192.11.0.38:4445 (connect: timeout)
          • The timeout is configurable, see "Service Parameters" below.
        • If the Agent is available:
          • Message:
            Check JobScheduler Universal Agent - Agent is OK - startedAt: 2015-07-17T12:05:52.245Z, totalTaskCount: 170422, currentTaskCount: 52, isTerminating: 0
      • Service Command
        A Service Command has to be declared before configuring the Nagios/op5 Service that makes use of this Command. The following declaration for the Command is recommended:
         define command{
            command_name                   check_jobscheduler_agent
            command_line                   /opt/plugins/check_jobscheduler_agent.pl -u $ARG1$ -a $ARG2$ -o $ARG3$ -t $ARG4$
            }
      • Service Parameters
        When configuring the Nagios/op5 Service, the parameters have to be specified, e.g.:
        http://galadriel.sos:4455/jobscheduler/agent/api/overview!'{totalTaskCount},{currentTaskCount}'!'{startedAt},{totalTaskCount},{currentTaskCount},{isTerminating}'!20

        where

        • the first argument is the URL for the HTTP connection
        • the second argument is the list of attributes that are used to check the Agent availability
        • the third argument is the list of attributes that are used for the output of the script and that will be displayed in the System Monitor
        • the fourth argument is the timeout for the connection to the Agent
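
      How the parameters and the Agent response could lead to the messages shown under "Output" is sketched below in Python. The sketch rests on several assumptions that this issue does not confirm: that the Agent response is a JSON object, that the Agent counts as available when every check attribute is present in the response, and that booleans are printed as 0 or 1 (as the sample value "isTerminating: 0" suggests). The function names and the "Missing attributes" message are hypothetical. The exit codes 0 (OK) and 2 (CRITICAL) are the usual Nagios plugin return codes.

      ```python
      OK, CRITICAL = 0, 2  # usual Nagios plugin return codes

      PREFIX = "Check JobScheduler Universal Agent - Agent is"


      def split_service_parameters(parameters):
          """Split the '!'-separated parameter string into its four arguments.

          The single quotes only protect the braces on the command line,
          so they are stripped here for illustration.
          """
          return [part.strip("'") for part in parameters.split("!")]


      def format_value(value):
          """Render booleans as 0/1 (assumption based on the sample output)."""
          if isinstance(value, bool):
              return str(int(value))
          return str(value)


      def evaluate(overview, check_attributes, output_attributes):
          """Return (exit_code, message) for a parsed Agent response."""
          missing = [name for name in check_attributes if name not in overview]
          if missing:
              # Hypothetical message: the issue only documents the connection failure case.
              return CRITICAL, f"{PREFIX} CRITICAL - Missing attributes: {', '.join(missing)}"
          details = ", ".join(
              f"{name}: {format_value(overview.get(name))}" for name in output_attributes
          )
          return OK, f"{PREFIX} OK - {details}"


      if __name__ == "__main__":
          sample = {"startedAt": "2015-07-17T12:05:52.245Z", "totalTaskCount": 170422,
                    "currentTaskCount": 52, "isTerminating": False}
          code, message = evaluate(
              sample,
              ["totalTaskCount", "currentTaskCount"],
              ["startedAt", "totalTaskCount", "currentTaskCount", "isTerminating"],
          )
          print(message)
          raise SystemExit(code)
      ```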

      Maintainer Notes

      • This task is preferably performed by a System Monitor such as Nagios, HP OpenView, SCOM etc. as such monitors provide a better overview of network related problems and escalation rules that JobScheduler cannot be aware of.
      • The JobScheduler Universal Agent accepts and responds to an HTTP web service request that can be used to check the Agent status. This check can be performed either by a System Monitor or by a JobScheduler job on a cyclic basis.
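
      The HTTP request mentioned in the notes above could look roughly like the following Python sketch, which a monitor or a cyclic job might call. It assumes that the Agent answers with JSON and that the timeout is given in seconds. The function names and the injectable opener parameter (used here so that the logic can be exercised without a running Agent) are hypothetical.

      ```python
      import json
      import urllib.request


      def fetch_overview(url, timeout, opener=urllib.request.urlopen):
          """Request the Agent overview and parse the body as JSON (assumed format)."""
          with opener(url, timeout=timeout) as response:
              return json.load(response)


      def agent_is_reachable(url, timeout, opener=urllib.request.urlopen):
          """Return (True, 'OK') if the Agent answered, else (False, reason)."""
          try:
              fetch_overview(url, timeout, opener)
              return True, "OK"
          except (OSError, ValueError) as error:
              # OSError covers refused connections, timeouts and urllib's URLError;
              # ValueError covers a response body that is not valid JSON.
              return False, f"Connection failed: {error}"


      if __name__ == "__main__":
          # host and port are placeholders, as in the URL pattern given in this issue.
          print(agent_is_reachable("http://host:4445/jobscheduler/agent/api/overview", 20))
      ```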

              People

              • Assignee: Andreas Püschel
              • Reporter: Stefan Schädlich (Inactive)
              • Approver: Andreas Püschel