JS - JobScheduler / JS-1830

Agent fails all the tasks when more than five tasks are run at a time

    Details

    • Type: Fix
    • Status: Released
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.12.8
    • Fix Version/s: 1.12.9
    • Component/s: None
    • Labels:
      None

      Description

      Current situation

      • The user has 6 standalone jobs, each running a shell script that sleeps (e.g. sleep 60), configured to be executed on a remote Linux JobScheduler Universal Agent.
      • The user starts the first 5 jobs at the same time. Once all 5 jobs are executing, the user starts the 6th job. The 6th job hangs with the message "waiting for agent", and the 5 jobs previously started on the agent then fail as well.
      • The agent log shows the following messages once the limit is reached, see the attached logs in JS-1830-LOGS.zip:
         
        .... 
        ... 
        876476.sh) Process started 'job7' 
        2019-01-14 11:52:01,508 +0100 [INFO ] com.sos.scheduler.engine.taskserver.task.process.RichProcess - Start process 
        2019-01-14 11:52:01,520 +0100 [INFO ] com.sos.scheduler.engine.taskserver.task.process.RichProcess - (AgentTaskId(14-3259309389149839280) Pid(4367) /tmp/JobScheduler-Agent-job8-6376966759914975551.sh) Process started 'job8' 
        2019-01-14 11:53:01,982 +0100 [ERROR] com.sos.scheduler.engine.agent.task.TaskHandler - AgentTaskId(7-6892326541133305520): StandardTaskServer(master=127.0.0.1:45637) has no connection activity since 2019-01-14T10:52:00.471Z. Task is being killed 
        2019-01-14 11:53:02,012 +0100 [ERROR] com.sos.scheduler.engine.agent.task.TaskHandler - AgentTaskId(10-3865485915042489164): StandardTaskServer(master=127.0.0.1:45637) has no connection activity since 2019-01-14T10:52:00.499Z. Task is being killed 
        2019-01-14 11:53:02,022 +0100 [ERROR] com.sos.scheduler.engine.agent.task.TaskHandler - AgentTaskId(8-3936377009075590372): StandardTaskServer(master=127.0.0.1:45637) has no connection activity since 2019-01-14T10:52:00.551Z. Task is being killed 
        2019-01-14 11:53:02,024 +0100 [ERROR] com.sos.scheduler.engine.agent.task.TaskHandler - AgentTaskId(9-7284851154719276103): StandardTaskServer(master=127.0.0.1:45637) has no connection activity since 2019-01-14T10:52:00.607Z. Task is being killed 
        2019-01-14 11:53:02,026 +0100 [ERROR] com.sos.scheduler.engine.agent.task.TaskHandler - AgentTaskId(11-6508967328914310046): StandardTaskServer(master=127.0.0.1:45637) has no connection activity since 2019-01-14T10:52:00.654Z. Task is being killed 
        2019-01-14 11:53:02,029 +0100 [ERROR] com.sos.scheduler.engine.agent.task.TaskHandler - AgentTaskId(12-1681751001463131093): StandardTaskServer(master=127.0.0.1:45637) has no connection activity since 2019-01-14T10:52:00.662Z. Task is being killed 
        2019-01-14 11:53:02,029 +0100 [INFO ] com.sos.scheduler.engine.agent.command.AgentCommandHandler - #21 CloseTask(AgentTaskId(14-3259309389149839280),false) 
        2019-01-14 11:53:02,072 +0100 [INFO ] com.sos.scheduler.engine.agent.command.AgentCommandHandler - #22 CloseTask(AgentTaskId(13-2219642504410684700),false) 
        2019-01-14 11:53:02,111 +0100 [ERROR] com.sos.scheduler.engine.agent.task.TaskHandler - AgentTaskId(13-2219642504410684700): StandardTaskServer(master=127.0.0.1:45637) has no connection activity since 2019-01-14T10:52:01.642Z. Task is being killed 
        2019-01-14 11:53:02,280 +0100 [INFO ] com.sos.scheduler.engine. 
        .... 
        .... 
        
      • For the customer, the limit is 5 parallel tasks; in support's tests, the limit is 6-8 tasks.
      • The agent runs on an Ubuntu x64 Linux server with 4 GB RAM and 2 CPUs.
      • See attached agent.log and scheduler.log files
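To check whether another agent log shows the same pattern, the tasks killed for lack of connection activity can be listed by their task ID. This is a minimal sketch, assuming the log is available locally; the function name killed_tasks is hypothetical and not part of JobScheduler, and the match relies only on the message text quoted in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper, not part of JobScheduler.
# Prints the distinct agent task IDs that the TaskHandler killed because
# the task server showed no connection activity.
# $1: path to the agent log (e.g. agent.log)
killed_tasks() {
    grep 'has no connection activity' "$1" \
        | sed -n 's/.*AgentTaskId(\([0-9-]*\)).*/\1/p' \
        | sort -u
}

# Example call (assumes agent.log is in the current directory):
#   killed_tasks agent.log | wc -l
```

Piping the result to wc -l gives the number of affected tasks, which can be compared with the number of jobs started in parallel.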

      How to Reproduce

      • Deploy the attached job configuration JS-1830.zip to the ./config/live directory.
      • Change the process class configuration to match your Linux agent.
      • Start 5 jobs at the same time.
      • As soon as the tasks for the 5 jobs have started, start the 6th, 7th and 8th job one after another. After some time all jobs fail.

      Desired Behavior

      • The agent should execute multiple tasks simultaneously.

      Workaround

      • Create a configuration file AGENT_HOME/config/agent.conf and insert the following setting. For a default agent setup the path might look like ./sos-berli/agent/var_4445/config/agent.conf
        akka.actor.default-dispatcher.fork-join-executor.parallelism-min=20
      • After creating the agent.conf file with this setting, restart the agent.
      • The setting akka.actor.default-dispatcher.fork-join-executor.parallelism-min reserves the threads required to start the shell script processes. It is recommended to allocate somewhat more threads than the expected number of concurrent processes, e.g. if 15 concurrent processes are expected, set the value to 20.
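The workaround can be scripted roughly as follows. This is a sketch under stated assumptions: the function name write_agent_conf is hypothetical, the agent home directory is passed in by the caller, and the fixed headroom of 5 threads only mirrors the "15 expected, set 20" example from the ticket rather than any documented rule. The function overwrites an existing agent.conf, so it fits only the case described here, where the file does not exist yet.

```shell
#!/bin/sh
# Hypothetical helper, not part of JobScheduler.
# Writes the workaround setting to <agent home>/config/agent.conf.
# $1: agent home directory (AGENT_HOME)
# $2: expected number of concurrent processes
write_agent_conf() {
    agent_home=$1
    expected=$2
    # Assumption: add 5 threads of headroom, as in the ticket's example (15 -> 20).
    parallelism=$((expected + 5))
    mkdir -p "$agent_home/config"
    # Overwrites any existing agent.conf.
    printf 'akka.actor.default-dispatcher.fork-join-executor.parallelism-min=%s\n' \
        "$parallelism" > "$agent_home/config/agent.conf"
}

# Example call (the path is a placeholder):
#   write_agent_conf /path/to/agent_home 15
```

The agent still has to be restarted afterwards for the setting to take effect, as described in the workaround.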

      Maintainers Note

      • Once the fix is released with version 1.12.9 and the agent has been updated to that release, delete the AGENT_HOME/config/agent.conf file.


          Activity

          Mahendra Patidar added a comment -

          Please see the OTRS for more information.

          Your SOS Support Team

          Mahendra Patidar


          Software- und Organisations-Service GmbH
          Giesebrechtstr. 15, 10629 Berlin
          Germany
          Tel. +49 (30) 86 47 90-0
          Mail mahendra.patidar@sos-berlin.com
          Web http://www.sos-berlin.com

          Managing Director/Geschäftsführung: Andreas Püschel
          Registered/Gerichtsstand: Amtsgericht Berlin-Charlottenburg, HRB 21015

          ---- Forwarded message from "Tin Aye" <tin.aye@cgi.com> ----
          From: "Tin Aye" <tin.aye@cgi.com>
          To: -Standard Support
          Subject: Job Scheduler - "waiting_for_agent" issue related to JS-1830
          Date: 2019-02-06 18:03:58
          Hello SOS Team,

          We are running into the issue reported in this JIRA ticket - https://change.sos-berlin.com/browse/JS-1830. We see Status is Resolved. Could you let us know where we can obtain the fix for this?

          Here is the screenshot from our JOC. Our log files are attached. The issue is created around 08:30-09:00 EST. Our Job Scheduler Master is 1.12.8 and Job Scheduler agents are 1.12.7.

          ---- End forwarded message ----


            People

            • Assignee:
              Joacim Zschimmer
              Reporter:
              Mahendra Patidar
              Approver:
              Mahendra Patidar
            • Votes:
              1
              Watchers:
              4
