  1. JS - JobScheduler
  2. JS-1830

Agent fails all the tasks when more than five tasks are run at a time

Details

    Description

      Current situation

      • A user has 6 standalone jobs, each running a shell script that sleeps (e.g. sleep 60), configured to be executed on a remote Linux JobScheduler Universal Agent.
      • The user starts the first 5 jobs at the same time. Once all 5 jobs are executing, the user starts the 6th job: the 6th job hangs with the message "waiting for agent", and the 5 previously started jobs on the Agent fail as well.
      • Once the limit is reached, the Agent log shows the following messages (see the attached logs in JS-1830-LOGS.zip):
         
        .... 
        ... 
        876476.sh) Process started 'job7' 
        2019-01-14 11:52:01,508 +0100 [INFO ] com.sos.scheduler.engine.taskserver.task.process.RichProcess - Start process 
        2019-01-14 11:52:01,520 +0100 [INFO ] com.sos.scheduler.engine.taskserver.task.process.RichProcess - (AgentTaskId(14-3259309389149839280) Pid(4367) /tmp/JobScheduler-Agent-job8-6376966759914975551.sh) Process started 'job8' 
        2019-01-14 11:53:01,982 +0100 [ERROR] com.sos.scheduler.engine.agent.task.TaskHandler - AgentTaskId(7-6892326541133305520): StandardTaskServer(master=127.0.0.1:45637) has no connection activity since 2019-01-14T10:52:00.471Z. Task is being killed 
        2019-01-14 11:53:02,012 +0100 [ERROR] com.sos.scheduler.engine.agent.task.TaskHandler - AgentTaskId(10-3865485915042489164): StandardTaskServer(master=127.0.0.1:45637) has no connection activity since 2019-01-14T10:52:00.499Z. Task is being killed 
        2019-01-14 11:53:02,022 +0100 [ERROR] com.sos.scheduler.engine.agent.task.TaskHandler - AgentTaskId(8-3936377009075590372): StandardTaskServer(master=127.0.0.1:45637) has no connection activity since 2019-01-14T10:52:00.551Z. Task is being killed 
        2019-01-14 11:53:02,024 +0100 [ERROR] com.sos.scheduler.engine.agent.task.TaskHandler - AgentTaskId(9-7284851154719276103): StandardTaskServer(master=127.0.0.1:45637) has no connection activity since 2019-01-14T10:52:00.607Z. Task is being killed 
        2019-01-14 11:53:02,026 +0100 [ERROR] com.sos.scheduler.engine.agent.task.TaskHandler - AgentTaskId(11-6508967328914310046): StandardTaskServer(master=127.0.0.1:45637) has no connection activity since 2019-01-14T10:52:00.654Z. Task is being killed 
        2019-01-14 11:53:02,029 +0100 [ERROR] com.sos.scheduler.engine.agent.task.TaskHandler - AgentTaskId(12-1681751001463131093): StandardTaskServer(master=127.0.0.1:45637) has no connection activity since 2019-01-14T10:52:00.662Z. Task is being killed 
        2019-01-14 11:53:02,029 +0100 [INFO ] com.sos.scheduler.engine.agent.command.AgentCommandHandler - #21 CloseTask(AgentTaskId(14-3259309389149839280),false) 
        2019-01-14 11:53:02,072 +0100 [INFO ] com.sos.scheduler.engine.agent.command.AgentCommandHandler - #22 CloseTask(AgentTaskId(13-2219642504410684700),false) 
        2019-01-14 11:53:02,111 +0100 [ERROR] com.sos.scheduler.engine.agent.task.TaskHandler - AgentTaskId(13-2219642504410684700): StandardTaskServer(master=127.0.0.1:45637) has no connection activity since 2019-01-14T10:52:01.642Z. Task is being killed 
        2019-01-14 11:53:02,280 +0100 [INFO ] com.sos.scheduler.engine. 
        .... 
        .... 
        
      • For the customer, the limit is 5 parallel jobs; in the support team's test, the limit was 6-8 jobs.
      • The Agent runs on an Ubuntu x64 server with 4 GB RAM and 2 CPUs.
      • See attached agent.log and scheduler.log files
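
      The symptom can be confirmed in an agent.log by counting the tasks that the Agent killed for lack of connection activity. The following is a minimal sketch, assuming a POSIX shell with grep and sed. The function names count_killed_tasks and list_killed_task_ids are hypothetical helpers, not part of JobScheduler, and the match string is copied from the TaskHandler error lines in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a JobScheduler Agent log.
# The match string is taken from the TaskHandler error lines in this issue.

# Print the number of log lines that report a task killed
# because its task server showed no connection activity.
count_killed_tasks() {
    grep -c 'has no connection activity' "$1"
}

# Print the AgentTaskId of each killed task, one per line.
list_killed_task_ids() {
    grep 'has no connection activity' "$1" |
        sed -n 's/.*AgentTaskId(\([0-9-]*\)): .*/\1/p'
}
```

      After sourcing the functions, `count_killed_tasks agent.log` prints the number of killed tasks, and `list_killed_task_ids agent.log` prints their task IDs for comparison with the jobs that failed.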

      How to Reproduce

      • Deploy the attached job configuration JS-1830.zip to ./config/live.
      • Change the process class configuration to match your Linux server.
      • Start the 5 jobs at the same time.
      • As soon as the tasks for the 5 jobs have started, start the 6th, 7th and 8th jobs one after another. After some time all jobs fail.

      Desired Behavior

      • The agent should execute multiple tasks simultaneously.

      Workaround

      • Create a configuration file AGENT_HOME/config/agent.conf and insert the following setting. The path for a default Agent setup might look like ./sos-berli/agent/var_4445/config/agent.conf
        akka.actor.default-dispatcher.fork-join-executor.parallelism-min=20
      • After creating the agent.conf file with this setting, restart the Agent.
      • The setting akka.actor.default-dispatcher.fork-join-executor.parallelism-min reserves the threads required to start the shell script processes. Allocate somewhat more threads than the expected number of concurrent processes, e.g. if 15 concurrent processes are expected, set the value to 20.

      • This workaround is applicable for up to 50 parallel job executions with an Agent. For a higher number of job executions the workaround will fail. We therefore recommend switching to release 1.12.9 or later.
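
      The workaround steps can be sketched as a small shell function. This is a minimal sketch, not an official installation script: the function name write_agent_conf and its arguments are hypothetical, the Agent home directory must be supplied by the caller, and the default of 20 is the example value from this issue. The function appends the setting only if the key is not already present, so an existing agent.conf is not overwritten.

```shell
#!/bin/sh
# Hypothetical helper: add the workaround setting to AGENT_HOME/config/agent.conf.
# $1 = Agent home directory (assumption: supplied by the caller)
# $2 = value for parallelism-min (defaults to 20, the example value in this issue)
write_agent_conf() {
    agent_home=$1
    parallelism=${2:-20}
    key='akka.actor.default-dispatcher.fork-join-executor.parallelism-min'
    conf="$agent_home/config/agent.conf"

    mkdir -p "$agent_home/config" || return 1

    # Leave an existing setting untouched rather than adding a duplicate key.
    if [ -f "$conf" ] && grep -qF "$key" "$conf"; then
        echo "setting already present in $conf" >&2
        return 0
    fi

    printf '%s=%s\n' "$key" "$parallelism" >> "$conf"
}
```

      For example, `write_agent_conf /path/to/agent_home 20` writes the setting with the value 20; the Agent still has to be restarted afterwards for the setting to take effect.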

      Maintainer's Note

      • Once the fix is released with version 1.12.9 and the Agent has been updated to that release, delete the AGENT_HOME/config/agent.conf file.

      Attachments

        1. JS-1830.zip
          3 kB
        2. JS-1830-LOGS.zip
          378 kB

            People

              jz Joacim Zschimmer
              mp Mahendra Patidar
              Mahendra Patidar Mahendra Patidar