Upgrade to v6, timer jobs end up in deadletter_job

We are in the process of upgrading from Activiti 5.14 to Flowable 6.
In our current setup we have 11 timer jobs which reside in the act_ru_job table.
When we started the server running Flowable 6, the database was updated with the new job tables.
According to the migration guide, our timer jobs should have ended up in the new act_ru_timer_job table.
But in our case they end up in the act_ru_deadletter_job table with the exception message: “No process definition found with id null”.
In Activiti 5.14 all these timer jobs in act_ru_job actually have a null value in the column proc_def_id_, but this is how it has always been. The column “handler_cfg_” refers to the actual process definition.

I also have another question. In the old job table, rerunning failed jobs was as easy as updating the retries_ column. Now when jobs are moved to the deadletter table, what is the easiest way of rerunning failed jobs when working in the database?

Thanks in advance for any help.

Hi,

Thanks for the feedback. When a job ends up in the deadletter job table, there can be two reasons. One is that the retries value was already 0 in the act_ru_job table in the old version. The migration logic will move these jobs to the deadletter job table. The second reason is that after the migration the job is executed until the retries value is 0 and then it’s moved to the deadletter job table. Can you check if the retries value was already 0 before migrating?

You can still easily rerun a failed job, but now the easiest way is to call moveDeadLetterJobToExecutableJob on the ManagementService. This will move the deadletter job to the executable job table (act_ru_job).

Best regards,

Tijs

Hey,

To add to Tijs’s explanation: in case you want to immediately execute the moved dead letter job, you can additionally call ManagementService#executeJob(String) with the id of the job.
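
The two calls mentioned above can be chained to rerun a batch of dead letter jobs. What follows is only a minimal sketch. It assumes the Flowable 6 signatures are moveDeadLetterJobToExecutableJob(String jobId, int retries), returning the moved job, and executeJob(String jobId); check them against the Javadoc of your version. DeadLetterOps and JobRef are hypothetical stand-ins for ManagementService and Job so that the snippet compiles without Flowable on the classpath, and DeadLetterRerun is an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Job returned by the move call (assumption).
interface JobRef {
    String getId();
}

// Hypothetical stand-in mirroring the two ManagementService methods named in this thread.
interface DeadLetterOps {
    JobRef moveDeadLetterJobToExecutableJob(String jobId, int retries);
    void executeJob(String jobId);
}

final class DeadLetterRerun {
    // Moves each dead letter job back to the executable job table with the given
    // number of retries and, if executeNow is set, executes it right away.
    // Returns the ids of the moved jobs in the order they were processed.
    static List<String> rerun(DeadLetterOps ops, List<String> deadLetterJobIds,
                              int retries, boolean executeNow) {
        List<String> movedIds = new ArrayList<>();
        for (String deadLetterJobId : deadLetterJobIds) {
            JobRef moved = ops.moveDeadLetterJobToExecutableJob(deadLetterJobId, retries);
            if (executeNow) {
                // Use the id of the returned job rather than assuming it is unchanged.
                ops.executeJob(moved.getId());
            }
            movedIds.add(moved.getId());
        }
        return movedIds;
    }
}
```

With Flowable available, the stand-in would be backed by the real ManagementService, and the job ids can be read from the id_ column of act_ru_deadletter_job.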

Cheers,
Filip

Hello again,
Thanks for your replies.
I have been testing a bit more and still I have not found how to get the timer processes to run correctly through a direct upgrade.
This is what they look like in our current 5.14 installation; as you can see, proc_def_id_ is null here.
image
So when I started the Flowable 6 version, the database was upgraded, and these entries were moved to the act_ru_timer_job table, also here with proc_def_id_ = null.
When the process engine attempted to run these jobs, it gave an error saying “can’t find process definition with id null”, and then they were moved to the deadletter table.

I also tried another approach: before I started the Flowable version, I deleted all timer jobs in the job table and removed all timer process definitions from act_re_procdef and act_ge_bytearray.
This caused the definitions to be redeployed and all timer jobs ended up in the timer_job table, with proc_def_id_ now set to the corresponding definition.

When the process engine started to execute these timer processes it initially seemed to work fine.
But then I discovered that some processes started to disappear, seemingly at random. By disappear I mean deleted from the act_ru_timer_job table, and not placed in any of the other job tables.
Investigating this further I have concluded that if more than one job has the same due_date (and time) the jobs disappear after execution.

Would be really grateful for any help on how to work further on this issue. For now we are pretty stuck.

@filiphr, @tijs, any thoughts about this?

Hi,

I think the problem is that the logic to execute a job with a process definition key in the HANDLER_CFG_ column is part of the Flowable 5 embedded engine. But that won’t be used in this case because we can’t determine whether this is a V5 job or not. We could implement additional logic so that, when the PROC_DEF_ID_ column is null and the HANDLER_CFG_ column contains a valid V5 process definition key, the job is executed by the TimerStartEventJobHandler class of the Flowable 5 engine. Then it would work.
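
The check described above could look roughly like this. It is a sketch only, not Flowable code: V5TimerJobCheck and looksLikeV5TimerJob are illustrative names, and knownV5Keys is a hypothetical set holding the keys of process definitions that were deployed under the 5.x engine. It also assumes HANDLER_CFG_ holds the plain process definition key, as described for the 5.14 rows in this thread.

```java
import java.util.Set;

final class V5TimerJobCheck {
    // Returns true when a timer job row should be routed to the V5 engine:
    // PROC_DEF_ID_ is null and HANDLER_CFG_ holds a known V5 process definition key.
    // knownV5Keys is a hypothetical input, not something Flowable provides as-is.
    static boolean looksLikeV5TimerJob(String procDefId, String handlerCfg,
                                       Set<String> knownV5Keys) {
        if (procDefId != null) {
            // The job already points at a concrete definition; nothing to infer.
            return false;
        }
        if (handlerCfg == null || handlerCfg.trim().isEmpty()) {
            return false;
        }
        return knownV5Keys.contains(handlerCfg.trim());
    }
}
```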

That you see processes disappearing is something that definitely shouldn’t happen. The only logical explanation would be that the timer was executed successfully and therefore removed. There’s no logic that needs the due date to be unique in some way.

Would the change in determining if the job is a V5 job or not work for you?

Best regards,

Tijs

Hi again,
Yes, it would seem that implementing that additional logic could solve the problem. But as I mentioned previously, we can work around it by redeploying the definitions, thus using the Flowable 6 engine.

I am more concerned about the disappearing jobs.

I did some more testing. I set the due date on all jobs in the act_ru_timer_job table one day ahead.
Then I changed the due date, directly in the database, on one job to a point in the past.
The engine picked it up, removed it from the act_ru_timer_job table, executed it, and added a new entry in the act_ru_timer_job table with a new due date for the next execution.
I tested this a couple of times for different jobs and all went well.

Then I did the same procedure on two jobs at the same time. That is, I changed the due date on two jobs, to a point in the past.
Committed the changes in the database.
The engine picked up both jobs, removed them from the act_ru_timer_job table, and executed them, but no new records were added back to the act_ru_timer_job table.

I will attempt to debug the flowable code to see if I can figure out where it goes wrong.
It would be helpful if you could point me to where in your code the jobs are inserted back into the act_ru_timer_job table; that would give me a good place to start.

Regards
Jarle

Hey Jarle,

Inserting a timer back into the timer job tables depends on the configuration of the timer. Only timers with a set repeat (cycle) will be inserted back (and only if the repeat is not down to 0, or the end time is not finished).
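
The rule above can be written down as a small predicate. This is a sketch of the rule as stated in this post, not the actual Flowable implementation: TimerRepeatCheck and shouldReschedule are illustrative names, remainingRepeats (null meaning unbounded) and endTime (null meaning no end time) are hypothetical parameters, and it reads the post as requiring both remaining repeats and an end time that has not passed.

```java
import java.time.Instant;

final class TimerRepeatCheck {
    // Decides whether a timer job should be inserted back after it has executed.
    // repeat           the repeat (cycle) expression of the timer, or null if none is set
    // remainingRepeats repetitions left, or null for an unbounded cycle (assumption)
    // endTime          end time of the cycle, or null if there is none (assumption)
    static boolean shouldReschedule(String repeat, Integer remainingRepeats,
                                    Instant endTime, Instant now) {
        if (repeat == null) {
            // No repeat configured: the timer fires once and is then removed.
            return false;
        }
        if (remainingRepeats != null && remainingRepeats <= 0) {
            // The repeat count is down to 0.
            return false;
        }
        if (endTime != null && !now.isBefore(endTime)) {
            // The end time has been reached.
            return false;
        }
        return true;
    }
}
```

Under this reading, a timer whose repeat value comes back as null is never inserted back, regardless of the other two conditions.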

Have a look at

Thanks for the pointer, Filip.
I debugged this code just now and found that if just one timer job is picked up, timerEntity.getRepeat() is not null but returns the repeat expression as expected. Thus a new timer job is inserted.

If two or more jobs are picked up, timerEntity.getRepeat() returns null, thus no new job is scheduled.
So the next step would be to figure out why the repeat property is null when more than one job is read at a time.
As you can see from the attached screenshot, the left side is how the job looks when picked up as the only timer job; here repeat is set correctly. The exact same job is also on the right, but this is when it is picked up along with another job, and now the repeat value is null.
image