We are seeing ACTIVITY_CANCELLED events received in a different order depending on how we run our tests (each test individually in IDE, vs one at a time in IDE, vs maven command line).
In this particular case, the definition is using a terminate end event. When the process flows to terminate end event, there are five active executions. TerminateEndEventActivityBehavior.deleteExecutionEntities, line 200 calls into the executionEntityManager.collectChildren().
These executions are returned in a different order depending on how the test is run. This is why the tests for my pull request failled, https://github.com/flowable/flowable-engine/pull/470 The event order appears to be determined by how the db returns the result set.
With this pull request you will see the test testMultiInstanceCancelledWhenFlowToTerminateEnd(). Sometimes I see the activity cancelled events in this order:
multi-instance task cancelled->multi-instance task cancelled->boundary event cancelled->root of multi-instance cancelled-> end event cancelled
Other times I see
end event cancelled->multi-instance task cancelled->multi-instance task cancelled->boundary event cancelled->root of multi-instance cancelled
Should the query be sorted in some manner to guarantee the results come back in a deterministic order?
I think you’re right, there’s probably some default sorting happening, most likely the db id. In the IDE, running as a single test, the ids will be different from the maven run (where the process is cached for all the tests).
Imo, the correct event order should be:
Looking at the code, the collectChildren already does a parent->child sorting (which is why the loop is traversing them in reverse order).
Also looking at your example, I think the problem is about the ‘end event cancelled’. Imho, this event should be fired before the others. So the fix might be as simple as firing that event first and checking in the loop where the events are normally fired the id of the execution, and skip firing it when it’s the execution at the terminate end.
Is this something you’d like to add to your PR (as you’ve pretty much figured it out) or do you want us to add it?
Thanks. You can make the fix as far as this ordering issue discussed here.
Just to clarify, the pull request does reproduce this ordering issue, but it also reproduces a separate issue which is we receive different events depending on how a multi-instance user task is cancelled. The multi-instance issue is a separate issue (not an ordering issue). Depending on how the multi-instance user task is cancelled, we receive totally different events. In one scenario we receive only an activity cancelled event for the root multi-user instance, and in another scenario, we receive a cancelled event for the root along with each instance.
Ok, thanks for the clarification.
I’ll look into adding the ordering and fixing the other issue you mention. Will add a comment when done.
I’ve just pushed a fix for this issue, see https://github.com/flowable/flowable-engine/commit/9ea509ca0e0c84e1098a9a9108996386cfa245fa
The commit changes the following:
- Makes the ordering of the collectChildren call consistent (ordering on start time)
- There was often a cancel and a complete event being fired for the same activity, most of the times a terminate end event. This is now only the complete event.
- The order of of cancel events is now from inner steps to scopes (so a task first, then it’s subprocess, callactivity, etc.)
Given your the test, the order is now (wrt the cancel events only):
task 2 (multi instance root)
message boundary flow:
task 2 (multi instance root)
task 1 - execution 5
Note that the end event is not in these lists anymore. It is not cancelled, but completed.
Can you verify this fix and update your PR?
I will take a look at your changes now. THanks!
For reference: issue (https://github.com/flowable/flowable-engine/pull/470) was resolved and PR with unit test accepted. The order in which cancel events wrt multi instance, terminate end event and boundary events being sent is fixed.
The order of collectChildren is now set to create time, however observations showed that this could still lead to child executions sometimes being swapped (probably by having same timestamp or not enough timestamp resolution). However, this was deemed not a problem.