I have a similar problem, and I’ve found a way to reproduce it.
I have a simple process, very fast to execute, with this timer start event running each minute:
When I run that BPMN, everything works at first, but sometimes the timer stops firing until I restart the engine (I clean the Flowable DBs at startup). It’s not related to the retries reaching 0, because I use a very high value.
To reproduce, I go into the run() method of AcquireTimerJobsRunnable and, once there is one value in the acquiredJobs array, I make the code throw an exception before the job is actually launched, so that the job in acquiredJobs never runs. To do that, I set a breakpoint on the commandExecutor.execute method. At this point I can see in the database that the ACT_RU_TIMER_JOB table has an entry whose LOCK_EXP_TIME_ field is not null; it’s a date a couple of minutes in the future. LOCK_OWNER_ is also not null and REV_ is 2.
Then, in Eclipse, I step into the code until I reach a logger statement where the code calls “config.getTransactionPropagation()”. config is a local variable, so I just set its value to null to cause a NullPointerException. After that, if I let execution continue, the entry in ACT_RU_TIMER_JOB never disappears and the timer never starts again. Its field values don’t change. Nothing appears in ACT_RU_DEADLETTER_JOB or ACT_RU_JOB for this timer. Other timers continue to work normally.
Here I simulated the failure with a NullPointerException, but in production I saw a MySQL timeout in the logs.
I’ve found that there’s a flowable-reset-expired-jobs thread, and while debugging it I found that it searches for expired jobs, correctly I think, but in the ACT_RU_JOB table, not in the ACT_RU_TIMER_JOB table. I didn’t find a class in the package org.flowable.job.service.impl.asyncexecutor that runs a job to watch for expired TIMER_JOB entries.
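To make clear what kind of check I expected to find for ACT_RU_TIMER_JOB, here is a minimal sketch in plain Java. It is not Flowable code and uses no Flowable API: TimerJobRow, findExpiredLocks and the sample data are hypothetical names for illustration only, and the only columns modeled are the LOCK_OWNER_ and LOCK_EXP_TIME_ fields mentioned above. It assumes a "stuck" timer job is one that still has a lock owner although its lock expiration time has passed.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StuckTimerJobCheck {

    /** Hypothetical in-memory view of one ACT_RU_TIMER_JOB row (not a Flowable class). */
    static final class TimerJobRow {
        final String id;
        final String lockOwner;     // LOCK_OWNER_, null when the job is not locked
        final Instant lockExpTime;  // LOCK_EXP_TIME_, null when the job is not locked

        TimerJobRow(String id, String lockOwner, Instant lockExpTime) {
            this.id = id;
            this.lockOwner = lockOwner;
            this.lockExpTime = lockExpTime;
        }
    }

    /** Returns the rows that are still locked although their lock expiration time is before 'now'. */
    static List<TimerJobRow> findExpiredLocks(List<TimerJobRow> rows, Instant now) {
        List<TimerJobRow> expired = new ArrayList<>();
        for (TimerJobRow row : rows) {
            boolean locked = row.lockOwner != null && row.lockExpTime != null;
            if (locked && row.lockExpTime.isBefore(now)) {
                expired.add(row);
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        // Made-up sample data: one unlocked job, one job with a valid lock, one with an expired lock.
        Instant now = Instant.parse("2020-01-01T12:00:00Z");
        List<TimerJobRow> rows = Arrays.asList(
                new TimerJobRow("unlocked", null, null),
                new TimerJobRow("in-progress", "engine-1", now.plusSeconds(120)),
                new TimerJobRow("stuck", "engine-1", now.minusSeconds(600)));

        for (TimerJobRow row : findExpiredLocks(rows, now)) {
            System.out.println("expired lock on timer job: " + row.id);
        }
    }
}
```

In my reproduction, the entry left behind in ACT_RU_TIMER_JOB would match the "stuck" case once its LOCK_EXP_TIME_ has passed, and I would expect something to unlock it so it can be acquired again.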
I think something is missing. Any help would be appreciated. Thank you.