JS - JobScheduler / JS-1721

HeartBeat-Watchdog-Thread in a cluster configuration doesn't work if the database server is very slow (takes more than 1 minute to respond)

Details

    Description

      Current Situation

      • We assume we have a JobScheduler cluster configuration.
      • If the database server is very slow, e.g. takes more than 1 minute to respond, then a warning is raised that a thread lock is blocked for more than 15s:
        com.sos.scheduler.engine.cplusplus.runtime.ThreadLock [WARN ] - Waiting for Scheduler ThreadLock, currently acquired by Thread[main,5,main], current stack trace:
        com.sos.scheduler.engine.cplusplus.runtime.ThreadLock$LoggingLock$1
        	at java.net.SocketInputStream.socketRead0(Native Method)
        	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        	at java.net.SocketInputStream.read(SocketInputStream.java:171)
        	at java.net.SocketInputStream.read(SocketInputStream.java:141)
        	at oracle.net.ns.Packet.receive(Packet.java:311)
        	at oracle.net.ns.DataPacket.receive(DataPacket.java:105)
        	at oracle.net.ano.CryptoDataPacket.receive(Unknown Source)
        	at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:305)
        	at oracle.net.ns.NetInputStream.read(NetInputStream.java:249)
        	at oracle.net.ns.NetInputStream.read(NetInputStream.java:171)
        	at oracle.net.ns.NetInputStream.read(NetInputStream.java:89)
        	at oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:123)
        	at oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:79)
        	at oracle.jdbc.driver.T4CMAREngineStream.unmarshalUB1(T4CMAREngineStream.java:429)
        	at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:397)
        	at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:257)
        	at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:587)
        	at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:210)
        	at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:30)
        	at oracle.jdbc.driver.T4CStatement.executeForRows(T4CStatement.java:931)
        	at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1150)
        	at oracle.jdbc.driver.OracleStatement.executeUpdateInternal(OracleStatement.java:1707)
        	at oracle.jdbc.driver.OracleStatement.executeUpdate(OracleStatement.java:1670)
        	at oracle.jdbc.driver.OracleStatementWrapper.executeUpdate(OracleStatementWrapper.java:310)
        

        As a result, the heartbeat-watchdog-thread, which should abort the JobScheduler if the last heartbeat is too old, does not do so.
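
One plausible reading of the trace above (an assumption; the issue does not spell out the mechanism) is that the watchdog needs the same scheduler lock that the main thread holds while it waits on the database. The Java sketch below reproduces that pattern with a plain `ReentrantLock`. The class and method names are hypothetical and are not taken from JobScheduler.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; none of these names come from JobScheduler.
final class BlockedWatchdogSketch {

    /**
     * One thread takes the lock and keeps it for holdMillis (standing in for a
     * slow database call). A second thread, standing in for the watchdog, then
     * tries to take the same lock for at most waitMillis.
     * Returns true if the watchdog got the lock in time.
     */
    static boolean watchdogGetsLock(long holdMillis, long waitMillis) {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch lockHeld = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            lock.lock();
            try {
                lockHeld.countDown();      // signal that the lock is now held
                Thread.sleep(holdMillis);  // simulated slow database response
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        }, "main-loop");
        holder.start();

        try {
            lockHeld.await();              // wait until the holder owns the lock
            boolean acquired = lock.tryLock(waitMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.unlock();
            }
            holder.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("slow database: " + watchdogGetsLock(500, 50));
        System.out.println("fast database: " + watchdogGetsLock(50, 2000));
    }
}
```

If the watchdog can only act while it holds the lock, a database call that outlasts the watchdog's patience leaves it unable to act. A watchdog that reads its heartbeat timestamp without taking the lock would not have that dependency.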

      Desired Behavior

      • The heartbeat-watchdog-thread should abort the JobScheduler when necessary, even while the thread lock is blocked.
      • To activate this behavior, the default setting for automated restart has to be disabled with the following setting in ./config/scheduler.xml (see JS-1035):
        <params>
            <param name="scheduler.cluster.restart_after_emergency_abort" value="false"/>
        </params>
        
      • If looping through pending objects takes more than 2 minutes in a cluster configuration, then JobScheduler will perform the following actions:
        • Kill all child processes
        • Remove its PID file
        • Abort operation and terminate the JobScheduler instance
        • A backup instance will become active (if configured)
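
The abort sequence listed above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the JobScheduler implementation: the class, its methods and the step labels are hypothetical, and only the 2-minute limit and the order of the steps come from this issue. The loop start time is kept in an `AtomicLong` so that the check does not need the scheduler lock.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; class, method and step names are not JobScheduler APIs.
final class EmergencyAbortSketch {
    // Limit taken from the issue text: more than 2 minutes triggers the abort.
    static final Duration LIMIT = Duration.ofMinutes(2);

    // Written by the main loop, read by the watchdog; no shared lock is needed.
    private final AtomicLong loopStartedNanos = new AtomicLong();
    private final List<Runnable> abortSteps;

    EmergencyAbortSketch(long startNanos, List<Runnable> abortSteps) {
        this.loopStartedNanos.set(startNanos);
        this.abortSteps = abortSteps;
    }

    /** Called by the main loop whenever it starts a new pass over pending objects. */
    void loopStarted(long nowNanos) {
        loopStartedNanos.set(nowNanos);
    }

    /** Called periodically by the watchdog thread; returns true if the abort steps ran. */
    boolean check(long nowNanos) {
        long elapsed = nowNanos - loopStartedNanos.get();
        if (elapsed <= LIMIT.toNanos()) {
            return false;
        }
        for (Runnable step : abortSteps) {
            step.run();
        }
        return true;
    }

    /** Simulates one loop pass that has run for loopSeconds and returns the steps executed. */
    static List<String> demo(long loopSeconds) {
        List<String> log = new ArrayList<>();
        List<Runnable> steps = new ArrayList<>();
        steps.add(() -> log.add("kill child processes"));
        steps.add(() -> log.add("remove PID file"));
        steps.add(() -> log.add("abort and terminate"));

        EmergencyAbortSketch watchdog = new EmergencyAbortSketch(0L, steps);
        watchdog.check(Duration.ofSeconds(loopSeconds).toNanos());
        return log;
    }
}
```

In this sketch the steps only record their labels. A real last step would have to end the process itself, and the activation of a backup instance would happen in that other instance rather than in the one being aborted.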

          People

            Joacim Zschimmer
            Oliver Haufe