Unexpected behavior when retrying a CallActivity depending on the completeAsync-flag

Dear Flowable experts,

I am observing different results when retrying a CallActivity with multi-instance behavior, depending on the value of completeAsync.

This topic is similar to this other topic. However, they are focusing on inconsistencies of the event context in case of a retry of a failed CallActivity. I decided to create a dedicated topic for this because they only mention my problem in a comment in their linked PR.

TL/DR:
If the CallActivity (with multi-instance behavior) is configured with completeAsync=true and retried after an error, only one subflow completes successfully, the other subflows get canceled.

Unit Test:
Here is a commit with a complete test setup. For completeness of this post, I will outline the scenario here as well.

The parent process contains a CallActivity with multi-instance behavior, which starts 3 subflows. The CallActivity is either configured with completeAsync=true or completeAsync=false. Additionally, the CallActivity contains a VariableAggregation, which sole purpose is to provoke an error inside ParallelMultiInstanceBehavior::leave. The subflow only contains an start and end event.

The test where completeAsync is true fails (only one of the subflows completes successfully, the other two subflows get canceled), but the test where it is false passes.

    @Test
    @Deployment(resources = {
            "org/flowable/engine/test/api/event/CallActivityTest.testCallActivityWithMultiInstanceBehaviorAndAsyncCompleteRetry.bpmn20.xml",
            "org/flowable/engine/test/api/event/CallActivityTest.testCallActivityWithMultiInstanceBehaviorAndAsyncCompleteRetry_subflow.bpmn20.xml",
    })
    public void testCallActivityWithMultiInstanceBehaviorAndAsyncCompleteRetry() throws Exception {
        testCallActivityWithMultiInstanceBehaviorRetryAfterFailure();
    }
    
    @Test
    @Deployment(resources = {
            "org/flowable/engine/test/api/event/CallActivityTest.testCallActivityWithMultiInstanceBehaviorAndSyncCompleteRetry.bpmn20.xml",
            "org/flowable/engine/test/api/event/CallActivityTest.testCallActivityWithMultiInstanceBehaviorAndAsyncCompleteRetry_subflow.bpmn20.xml",
    })
    public void testCallActivityWithMultiInstanceBehaviorAndSyncCompleteRetry() throws Exception {
        testCallActivityWithMultiInstanceBehaviorRetryAfterFailure();
    }
    
    public void testCallActivityWithMultiInstanceBehaviorRetryAfterFailure() throws Exception {
        // Set number of retries to 1. In case of an error, the Job becomes a DeadLetterJob
        processEngineConfiguration.getJobServiceConfiguration().setAsyncExecutorNumberOfRetries(1);

        ProcessInstance processInstance = runtimeService.startProcessInstanceByKey("callActivity");

        waitForJobExecutorToProcessAllJobsAndExecutableTimerJobs(20000L, 200L);

        // Check erroneous state:
        // The CallActivity is configured with VariableAggregation. However, it is intentionally referencing a non-existing variable. This will cause the CallActivity to fail.
        assertThat(processEngine.getManagementService().createDeadLetterJobQuery().withException().count() == 3).isTrue();

        // Set the missing process variable. After it is set, the process should complete successfully.
        runtimeService.setVariable(processInstance.getId(), "nonExistingVariable", "target");

        // Retry failed jobs
        managementService.createDeadLetterJobQuery().list().forEach(job -> managementService.moveDeadLetterJobToExecutableJob(job.getId(), 3));
        waitForJobExecutorToProcessAllJobsAndExecutableTimerJobs(20000L, 200L);
        
        // Check that the parent process and its subflows completed
        assertProcessEnded(processInstance.getId());
    }

Error Analysis: I will now explain what goes wrong, if completeAsync is true:

When the exception is thrown due to the VariableAggregation, the execution of the current job is set to “inactive” (JobRetryCmd::execute). In this test, the execution of the current job refers to the “CallActivity” (in the attached image, these are the executions with the IDs 17, 18, and 19). This differs from the test where completeAsync is false (there, the execution of the jobs refer to the EndEvent of the subflows (see IDs 23, 29, and 35 in the attached image)). After the retry (i.e. moving the DeadLetterJobs), all three CallActivity-executions are executed again. However, they remain “inactive” (i.e. the column IS_ACTIVE in the DB has the value false).

Every time an instance of the CallActivity finishes, the number of “completed” instances is compared to the total number of instances (in ParallelMultiInstanceBehavior::internalLeave). If nrOfCompletedInstances >= nrOfInstances || isCompletionConditionSatisfied evaluates to true all remaining instances of the CallActivity get cancelled. However, here is the problem: when the first instance reaches this point (after the retry), the value of nrOfCompletedInstances is 4 (which is wrong) and the value of nrOfInstances is 3 (which is correct). Hence, the other two instances get cancelled.
The value of nrOfCompletedInstances is calculated by counting the number of CallActivity-executions where IS_ACTIVE==false (i.e. 17, 18, 19) (+ 1 for the current instance).

Question / Feedback Needed:
In my opinion the test should pass in both cases. Therefore, it looks like a bug to me, but I am not sure. I would like to know if this is indeed a bug, or if the current behavior is intended. As of now, I don’t have a proposal for a fix, but I would be happy to give it a try if that is desired.

Appendix:


I have created a bug report in the Flowable-Engine repo, because it is rather a bug report than a discussion for the community. This post can be closed