I'm experiencing odd behavior with Flowable 5.21. We have a process definition with a number of script tasks in sequence (no parallel branches), and every activity has the "async=true" flag set. What I see is that each script task is only executed after the async job acquire wait time has elapsed. My assumption so far was that after executing an activity, the async executor directly submits the next job to itself. From what I can see in the debugger, this does happen. However, this new thread tries to lock the process instance before the first (original) thread is able to unlock it. In the logs, I see that the second thread throws an ActivitiOptimisticLockingException in ExecutionEntityManager#updateProcessInstanceLockTime.
My observation is also that the first thread uses two different transactions: one for executing the job and one for unlocking it. Since the submission to the async executor happens in a transaction commit listener, it runs at least partly in parallel with the actual unlocking.
This looks like a timing issue to me. Any hints or guidance?
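For reference, a minimal process definition of the shape described above might look like this (the v5 `activiti` extension namespace is assumed; ids and script bodies are illustrative, not taken from the attached reproducer):

```xml
<process id="sequentialAsyncProcess">
  <startEvent id="start"/>
  <sequenceFlow sourceRef="start" targetRef="script1"/>
  <scriptTask id="script1" scriptFormat="groovy" activiti:async="true">
    <script>println "script 1"</script>
  </scriptTask>
  <sequenceFlow sourceRef="script1" targetRef="script2"/>
  <scriptTask id="script2" scriptFormat="groovy" activiti:async="true">
    <script>println "script 2"</script>
  </scriptTask>
  <sequenceFlow sourceRef="script2" targetRef="end"/>
  <endEvent id="end"/>
</process>
```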
By default, an async job should be executed immediately. Are you using the async executor or the old job executor?
Only when the async job fails is a due date set; the job is then not executed immediately.
Can you reproduce the behaviour in a unit test?
I created an issue in github for this and attached the resources there:
Looking forward to your feedback.
Thanks for the unit test in the issue, that makes things easy to talk about!
First of all: the unit test can be made to work by setting 'exclusive' to false. Now for the long explanation. What's happening is the following:
- All the steps in the process definition are marked as async. The default is to be ‘exclusive’.
- When a step is marked as exclusive, the process instance is locked before the step is executed, making sure only one activity of the same process instance is executed at any time. When the locking fails, an optimistic locking exception is thrown. This is actually expected when there is contention for the lock. (Arguably, the logging level should be lowered.)
- This is indeed done in two transactions, and it needs to be two transactions for correctness: the current executor must first hold the lock in a guaranteed way (by committing the transaction) before executing the async logic.
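Conceptually, the exclusive-job lock behaves like optimistic concurrency control on the process instance row: each update checks the revision it originally read and fails if another transaction bumped it first. A minimal, self-contained sketch of that idea (this is not Flowable's actual code; the class and method names are invented for illustration):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a process instance row with a revision column.
class ProcessInstanceRow {

    private final AtomicLong revision = new AtomicLong(1);

    // Mimics "UPDATE ... SET LOCK_TIME_ = ?, REV_ = REV_ + 1 WHERE REV_ = ?":
    // the lock is only acquired if no other transaction changed the revision
    // since we read it. In the engine, a failure here surfaces as an
    // optimistic locking exception.
    boolean tryLock(long expectedRevision) {
        return revision.compareAndSet(expectedRevision, expectedRevision + 1);
    }

    long currentRevision() {
        return revision.get();
    }
}
```

With this model, the observed race is easy to state: two threads read the same revision, the first `tryLock` wins, and the second fails with the stale revision.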
- The race condition here is not good though, as it normally shouldn't happen when everything is sequential. It is probably due to the post-commit transaction listener firing on the execution of the job logic rather than on the unlock. However, changing this is not easy. We'll look into it and post an update here.
- Normally this isn't a problem, however, as the wait time before the next acquire cycle is small and the job will be retried. In the unit test here, however, the wait time is set to a high value, and the job is never retried.
So in short: the optimistic locking exception is to be expected when using exclusive jobs and is part of the 'correct' working.
Now, in this case nothing can run in parallel; it is all sequential. Locking the process instance is not needed here. You can avoid it by setting the 'exclusive' flag to false for all the steps: no locking will happen then.
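For example, on the v5 engine (assuming the `activiti` BPMN extension namespace; the id and script body are illustrative), a non-exclusive async script task would be declared like this:

```xml
<scriptTask id="script1" scriptFormat="groovy"
            activiti:async="true" activiti:exclusive="false">
  <script>println "script 1"</script>
</scriptTask>
```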
The reason exclusive is the default is that it is the safest choice (turning it off requires understanding the underlying mechanism), but it comes with a potential performance hit from the extra optimistic locking exceptions, as seen in the test. Eventually, though, the process instance will execute correctly.
Fixed the (potential) race condition here: https://github.com/flowable/flowable-engine/commit/829da955b12218967f830c691f4b2aaacc25b043
Unlocking of the job (if it's exclusive) is now part of the same transaction as the execution of the job. This way, the follow-up jobs (in the example above, the second script task) are inserted into the database (and thus become visible to the async executor) at the same moment as the unlocking. The newly created jobs can no longer try to lock before the old job has fully finished.
In case of job failure, the unlocking is now done in a new transaction. This is needed because the original job transaction should not be used for this (as it's failing).
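The effect of the fix can be sketched with a toy model: if the unlock and the insert of the follow-up job are flushed in the same transaction, the acquirer can never observe the new job while the instance is still locked. All names here are invented for illustration; this is not Flowable code:

```java
import java.util.ArrayList;
import java.util.List;

// Toy transaction: queued writes become visible all at once on commit.
class ToyTransaction {
    private final List<Runnable> writes = new ArrayList<>();
    void enqueue(Runnable write) { writes.add(write); }
    void commit() { writes.forEach(Runnable::run); }
}

class ToyEngine {
    boolean instanceLocked = true;     // lock taken before the job ran
    boolean followUpJobVisible = false;

    void executeJobWithFix() {
        ToyTransaction tx = new ToyTransaction();
        // ... job logic runs here ...
        tx.enqueue(() -> instanceLocked = false);    // unlock in the same tx (the fix)
        tx.enqueue(() -> followUpJobVisible = true); // follow-up job inserted in the same tx
        tx.commit(); // the acquirer sees both effects atomically
    }
}
```

Before the fix, the unlock effectively happened in a separate, later transaction, so there was a window in which the follow-up job was visible while `instanceLocked` was still true.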
@tijs : Can you verify whether you think this is a correct change?
@Stefan: this is now fixed on the v6 branch (it was the same problem there). Do you need the fix on v5 too (we can port it) or do you plan to migrate to v6 soon?
Thanks Joram for fixing this so fast. From a conceptual point of view the fix you are describing looks very promising.
We are currently looking into v6 and will start the migration in the very near future. The only reason a backport to v5 might make sense is the embedded v5 engine that is shipped with v6. It would be very beneficial (not just for us) if the embedded v5 engine also included this fix.
Yes, the embedded v5 engine needs to have this fix as well. We’ll make sure it gets applied.
Super. Thanks a lot for your immediate help!