Dear Flowable experts,
I am observing different results when retrying a CallActivity with multi-instance behavior, depending on the value of completeAsync
.
This topic is similar to this other topic. However, they are focusing on inconsistencies of the event context in case of a retry of a failed CallActivity. I decided to create a dedicated topic for this because they only mention my problem in a comment in their linked PR.
TL/DR:
If the CallActivity (with multi-instance behavior) is configured with completeAsync=true
and retried after an error, only one subflow completes successfully, the other subflows get canceled.
Unit Test:
Here is a commit with a complete test setup. For completeness of this post, I will outline the scenario here as well.
The parent process contains a CallActivity with multi-instance behavior, which starts 3 subflows. The CallActivity is either configured with completeAsync=true
or completeAsync=false
. Additionally, the CallActivity contains a VariableAggregation, which sole purpose is to provoke an error inside ParallelMultiInstanceBehavior::leave
. The subflow only contains an start and end event.
The test where completeAsync
is true
fails (only one of the subflows completes successfully, the other two subflows get canceled), but the test where it is false
passes.
@Test
@Deployment(resources = {
"org/flowable/engine/test/api/event/CallActivityTest.testCallActivityWithMultiInstanceBehaviorAndAsyncCompleteRetry.bpmn20.xml",
"org/flowable/engine/test/api/event/CallActivityTest.testCallActivityWithMultiInstanceBehaviorAndAsyncCompleteRetry_subflow.bpmn20.xml",
})
public void testCallActivityWithMultiInstanceBehaviorAndAsyncCompleteRetry() throws Exception {
testCallActivityWithMultiInstanceBehaviorRetryAfterFailure();
}
@Test
@Deployment(resources = {
"org/flowable/engine/test/api/event/CallActivityTest.testCallActivityWithMultiInstanceBehaviorAndSyncCompleteRetry.bpmn20.xml",
"org/flowable/engine/test/api/event/CallActivityTest.testCallActivityWithMultiInstanceBehaviorAndAsyncCompleteRetry_subflow.bpmn20.xml",
})
public void testCallActivityWithMultiInstanceBehaviorAndSyncCompleteRetry() throws Exception {
testCallActivityWithMultiInstanceBehaviorRetryAfterFailure();
}
public void testCallActivityWithMultiInstanceBehaviorRetryAfterFailure() throws Exception {
// Set number of retries to 1. In case of an error, the Job becomes a DeadLetterJob
processEngineConfiguration.getJobServiceConfiguration().setAsyncExecutorNumberOfRetries(1);
ProcessInstance processInstance = runtimeService.startProcessInstanceByKey("callActivity");
waitForJobExecutorToProcessAllJobsAndExecutableTimerJobs(20000L, 200L);
// Check erroneous state:
// The CallActivity is configured with VariableAggregation. However, it is intentionally referencing a non-existing variable. This will cause the CallActivity to fail.
assertThat(processEngine.getManagementService().createDeadLetterJobQuery().withException().count() == 3).isTrue();
// Set the missing process variable. After it is set, the process should complete successfully.
runtimeService.setVariable(processInstance.getId(), "nonExistingVariable", "target");
// Retry failed jobs
managementService.createDeadLetterJobQuery().list().forEach(job -> managementService.moveDeadLetterJobToExecutableJob(job.getId(), 3));
waitForJobExecutorToProcessAllJobsAndExecutableTimerJobs(20000L, 200L);
// Check that the parent process and its subflows completed
assertProcessEnded(processInstance.getId());
}
Error Analysis: I will now explain what goes wrong, if completeAsync
is true
:
When the exception is thrown due to the VariableAggregation, the execution of the current job is set to “inactive” (JobRetryCmd::execute
). In this test, the execution of the current job refers to the “CallActivity” (in the attached image, these are the executions with the IDs 17
, 18
, and 19
). This differs from the test where completeAsync
is false
(there, the execution of the jobs refer to the EndEvent of the subflows (see IDs 23
, 29
, and 35
in the attached image)). After the retry (i.e. moving the DeadLetterJobs), all three CallActivity-executions are executed again. However, they remain “inactive” (i.e. the column IS_ACTIVE
in the DB has the value false
).
Every time an instance of the CallActivity finishes, the number of “completed” instances is compared to the total number of instances (in ParallelMultiInstanceBehavior::internalLeave
). If nrOfCompletedInstances >= nrOfInstances || isCompletionConditionSatisfied
evaluates to true
all remaining instances of the CallActivity get cancelled. However, here is the problem: when the first instance reaches this point (after the retry), the value of nrOfCompletedInstances
is 4
(which is wrong) and the value of nrOfInstances
is 3
(which is correct). Hence, the other two instances get cancelled.
The value of nrOfCompletedInstances
is calculated by counting the number of CallActivity-executions where IS_ACTIVE==false
(i.e. 17
, 18
, 19
) (+ 1 for the current instance).
Question / Feedback Needed:
In my opinion the test should pass in both cases. Therefore, it looks like a bug to me, but I am not sure. I would like to know if this is indeed a bug, or if the current behavior is intended. As of now, I don’t have a proposal for a fix, but I would be happy to give it a try if that is desired.
Appendix: