We occasionally see an intermediate timer not firing in our continuous integration environment. Recently these failures have become more frequent as a result of increased load on our CI machine. I spent some time tracking down the failure to provide some information regarding the failure.
Our application uses the flowable-spring-boot-starter-basic project to configure the process engine and allow spring to manage transaction demarcation. The failure can be reproduced with a simple intermediate timer catch event flowing into a wait state. Debugging through the code I see that the DefaultJobManager adds a CommandContextCloseListener, specifically AsyncJobAddedNotification, to the command context in the hintAsyncExecutor method when the AcquireTimerJobsRunnable thread runs. Then the execute method in the CommandContextInterceptor proceeds to close the context which triggers the CommandContextCloseListeners to run. At this point in time there are two CommandContextCloseListeners set on the CommandContext. The first is the TransactionCommandContextCloseListener and the second is the AsyncJobAddedNotification. The CommandContextInterceptor eventually calls the afterSessionFlush method for each listener. The TransactionCommandContextCloseListener simply calls ‘commit’ on the associated transaction context. The transaction context is an instance of SpringTransactionContext and the commit method is a NO-OP (i.e doesn’t call commit on the PlatformTransactionManager). Eventually execution is returned back to the CommandContextInterceptor which invokes the closed method on the listeners. The closed method for the AsyncJobAddedNotification then proceeds to execute the job on another thread using the ExecuteAsyncRunnable class and the async executor. The ExecuteAsyncJobCmd eventually runs and is unable to find the job (line 31) if the original thread which moved the timer job to the job table has not committed yet. This problem is exacerbated in our environment more so due to the use of Spring TransactionSynchronizationAdapters that are used to send events in the’beforeCommit’ callback which delays the commit on the AcquireTimerJobsRunnable thread. The described behavior is different than that of a process engine configuration without spring which uses a StandaloneMybatisTransactionContext that guarantees that the move of the timer job to the job table is committed prior to the AsyncJobAddedNotification listener is invoked.
In our CI environment when the test fails the timer never fires for the following reasons:
- We move the clock forward via processEngineConfiguration.getClock().setCurrentTime()
- The ExecuteAsyncJobCmd fails to find the job because the thread that moved the timer job to the job table has not committed.
- When the timer is copied to the job table the lock expiration time is set to be 5 minutes later than the current time as defined by the getTimerLockTimeInMillis of DefaultAsyncJobExecutor.
- The current time is now fixed and therefore will never be after job’s lock expiration time. Therefore the ResetExpiredJobsRunnable will never find it.
- The AcquireAsyncJobsDueRunnable will never find it since the job’s lock expiration time is not null.
- Our tests use JobTestHelper.waitForJobExecutorToProcessAllJobsAndExecutableTimerJobs(…) to wait for all the jobs to process. (This call eventually times out.)
Even though the fact that the timer job never executes in our test environment can be attributed to a fixed clock time this behavior still seems like a bug since there is no guarantee that the timer will fire at its intended time when using the SpringProcessEngineConfiguration due to the race condition between the AcquireTimerJobsRunnableThread and the ExecuteAsyncRunnable thread. In a production environment, which would not manipulate the clock, when the ExecuteAsyncRunnable thread fails to find the job, the timer will fire 5 minutes after its intended time due to the value of the lock expiration time.
I apologize in advance for the long post but this was a tricky one to understand and explain.