No retry triggered in case of exception

Hi everyone,

We are using Flowable 6.2 and want to implement some retry logic. In some cases our delegates are interrupted by runtime exceptions, and we want to rely on the default retry mechanism (we haven’t changed any retry settings). This mostly works, but sometimes the job executor doesn’t pick up the failed jobs again and the process times out. I debugged this and found something strange: the engine tried to fetch all async and timer jobs from the DB, but no results were found. Does the AsyncJobExecutor pick up failed jobs once their due date expires, and do you think there could be a bug in the engine?
The exceptions are thrown in asynchronous (not exclusive) service tasks.

Thanks and best regards
Iliyan Videnov

Hi Iliyan,

For async service tasks, the AsyncJobExecutor will execute the async job the first time. Then, when it fails, the job will be moved to the timer job table with a due date of the current time + the asyncFailedJobWaitTime property of the process engine configuration. The timer job query will then fetch the job again when the due date has passed, and move the job to the async job table and then the AsyncJobExecutor will execute it again.

When the job has failed 3 times, it will be moved to the dead letter job table. The process instance will not continue, and you will have to manually execute the job from the dead letter job table.
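
The lifecycle described above can be sketched as a small state model. This is not Flowable code: the class, enum, and method names are hypothetical, and the 10-second wait is only an illustrative stand-in for the asyncFailedJobWaitTime property. It assumes a job starts with 3 retries and that each failure uses one of them.

```java
import java.time.Duration;
import java.time.Instant;

// Simplified model of the retry lifecycle described in the post.
// NOT Flowable code: all names here are hypothetical.
class RetryLifecycleSketch {

    enum Table { ASYNC_JOB, TIMER_JOB, DEADLETTER_JOB }

    static final class Job {
        Table table = Table.ASYNC_JOB;
        int retries;
        Instant dueDate; // only set while the job sits in TIMER_JOB

        Job(int retries) { this.retries = retries; }
    }

    // Called when executing the job throws an exception.
    static void onFailure(Job job, Instant now, Duration failedJobWaitTime) {
        job.retries--;
        if (job.retries <= 0) {
            // No retries left: park the job until someone moves it back manually.
            job.table = Table.DEADLETTER_JOB;
            job.dueDate = null;
        } else {
            // Retry later: due date = current time + wait time.
            job.table = Table.TIMER_JOB;
            job.dueDate = now.plus(failedJobWaitTime);
        }
    }

    // Called by the timer acquisition step: due jobs go back to the async table.
    static void acquireIfDue(Job job, Instant now) {
        if (job.table == Table.TIMER_JOB && !now.isBefore(job.dueDate)) {
            job.table = Table.ASYNC_JOB;
            job.dueDate = null;
        }
    }

    public static void main(String[] args) {
        Duration wait = Duration.ofSeconds(10); // illustrative value
        Instant now = Instant.parse("2018-01-01T00:00:00Z");
        Job job = new Job(3);

        while (job.table != Table.DEADLETTER_JOB) {
            onFailure(job, now, wait);          // every attempt fails in this demo
            System.out.println(job.table + ", retries left: " + job.retries);
            now = now.plus(wait);
            acquireIfDue(job, now);
        }
    }
}
```

In this model, a job that is neither re-acquired nor dead-lettered can only be one that is still sitting in the timer table, so its due date is the first thing to look at.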

Best regards,

Tijs

Hi Tijs,

Everything you said is clear to me. The problem is that sometimes the timer job is not converted back to an async job, and it is never executed again.

Thanks and best regards,
Iliyan Videnov

Hi Iliyan,

OK, did you look in the dead letter job table?
If the job is not there, did you see any exception messages related to this?

Best regards,

Tijs

Hi Tijs,

I haven’t checked the dead letter table yet. As soon as I have access to a computer I will check it and write back. There are no Flowable exceptions; only my custom exceptions are thrown (which should trigger the desired retries). No error messages are logged either.

Thanks and best regards,
Iliyan Videnov

Hi Tijs,

I have tried to reproduce the problem, but without success. We will continue working, and if the problem occurs again I will check all the job tables and report back. Thanks for the help!

Best Regards,
Iliyan

Hi Tijs,

Happy new year! Wish you all the best!

I have reproduced the problem - there was an exception and only one retry was triggered. I checked the ACT_RU_JOB table but didn’t find any records. Then I checked ACT_RU_TIMER_JOB and found the job I was looking for, but with a strange due date: the create time was 11:21:34.8 and the due date was 11:38:14.8. That is almost 17 minutes between retries, which is why my processes sometimes time out. Do you have any explanation for this? Am I missing something?
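
For reference, the exact gap between the two timestamps above can be computed with java.time. This is only a small arithmetic check; the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalTime;

// Arithmetic check on the two timestamps quoted in the post.
class RetryGapSketch {

    // Both arguments are ISO local times, e.g. "11:21:34.8".
    static Duration gap(String created, String due) {
        return Duration.between(LocalTime.parse(created), LocalTime.parse(due));
    }

    public static void main(String[] args) {
        Duration gap = gap("11:21:34.8", "11:38:14.8");
        System.out.println(gap.getSeconds() + " s"); // prints "1000 s"
    }
}
```

The gap is exactly 1000 seconds (16 min 40 s). Such a round number may point to a configured wait time rather than a random delay, but that is only a guess on my part.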

Thanks and best regards,
Iliyan Videnov

I have a similar problem, and I’ve found a way to reproduce.

I have a simple process, very fast to execute, with this timer start event running every minute:
[Image: timerstart]

When I run that BPMN, everything goes fine at first, but sometimes the timer stops firing until I restart the engine (I clean the Flowable DBs at startup). It’s not related to the retries reaching 0, because I use a very high value.

To reproduce, I go to the run() method of AcquireTimerJobsRunnable and, once there is one value in the acquiredJobs array, make the code throw an exception before the job is actually launched, so that the acquired job is never executed. To do this I set a breakpoint on the commandExecutor.execute method. At this point I can see in the database that the ACT_RU_TIMER_JOB table contains an entry whose LOCK_EXP_TIME_ field is not null; it is a date a couple of minutes in the future. LOCK_OWNER_ is also not null and REV_ is 2. Then, in Eclipse, I step into the code until I reach a logger statement where the code calls “config.getTransactionPropagation()”. Since config is a local variable, I just set its value to null to cause a NullPointerException. If I then let the engine run, the entry in ACT_RU_TIMER_JOB never disappears and the timer never fires again. Its field values do not change. Nothing appears in ACT_RU_DEADLETTER_JOB or ACT_RU_JOB for this timer. Other timers continue to work normally.

Here I simulated the exception with a NullPointerException, but in production I saw a MySQL timeout in the logs.

I’ve found that there’s a flowable-reset-expired-jobs thread. Debugging it, I found that it searches for expired jobs, correctly I think, but in the ACT_RU_JOB table, not the ACT_RU_TIMER_JOB table. I didn’t find a class in the package org.flowable.job.service.impl.asyncexecutor that watches for expired TIMER_JOB entries.
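
To make the gap concrete, here is a minimal sketch of the kind of check I would expect to exist for timer jobs as well. It is not Flowable code: the class and method names are hypothetical, and the two fields only model the LOCK_OWNER_ and LOCK_EXP_TIME_ columns mentioned above. A row counts as stuck when it still has a lock owner but the lock expiry time has passed.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of resetting expired locks on timer job rows.
// NOT Flowable code; it only illustrates the missing check.
class ExpiredLockSketch {

    static final class TimerJobRow {
        final String id;
        String lockOwner;    // models LOCK_OWNER_
        Instant lockExpTime; // models LOCK_EXP_TIME_

        TimerJobRow(String id, String lockOwner, Instant lockExpTime) {
            this.id = id;
            this.lockOwner = lockOwner;
            this.lockExpTime = lockExpTime;
        }
    }

    // A lock is expired when the row is still owned but the expiry has passed.
    static boolean isLockExpired(TimerJobRow row, Instant now) {
        return row.lockOwner != null
                && row.lockExpTime != null
                && row.lockExpTime.isBefore(now);
    }

    // Clears expired locks so the rows can be acquired again; returns their ids.
    static List<String> resetExpired(List<TimerJobRow> rows, Instant now) {
        List<String> resetIds = new ArrayList<>();
        for (TimerJobRow row : rows) {
            if (isLockExpired(row, now)) {
                row.lockOwner = null;
                row.lockExpTime = null;
                resetIds.add(row.id);
            }
        }
        return resetIds;
    }
}
```

In my reproduction the row stays in the "owned and expired" state forever, which is the state such a check would be looking for.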

I think something is missing. Any help would be appreciated. Thank you.

Issue: https://github.com/flowable/flowable-engine/issues/2354

The discussion was continued in the issue linked above.