We’re encountering a very reproducible problem with asynchronous service tasks (event registry “send-event”) executed in parallel. I’ve assembled a minimal reproduction of the problem here: GitHub - chaserb/parallel-async-service-tasks: Demonstration of zombie process instances. The problem is that about 40-50% of the time, my process instance winds up in a zombie state: it is left with a single record in the ACT_RU_EXECUTION table whose ACT_ID_, PARENT_ID_, and SUPER_EXEC_ columns are all null, even though all the service tasks complete successfully.
I think I’m on to the solution here. I noticed I was getting a FlowableOptimisticLockingException when my parent process had an explicitly declared parallelGateway on the join side, so I updated my TestInboundChannel to catch FlowableOptimisticLockingException and retry the event. I believe (and am hoping) this simulates the NACKs back to the RabbitMQ message broker that the production event registry should perform. I’ll test that.
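For anyone following along, the catch-and-retry idea looks roughly like the sketch below. This is a self-contained illustration, not the actual TestInboundChannel code: the OptimisticLockingException class here is a stand-in for Flowable's FlowableOptimisticLockingException, and deliverWithRetry is a hypothetical helper name.

```java
// Sketch of the retry-on-optimistic-lock idea. In production, catching the
// exception would translate to NACKing the message so RabbitMQ redelivers it;
// in the test channel we just loop and retry in-process.
public class OptimisticLockRetrier {

    /** Stand-in for org.flowable.common.engine.api.FlowableOptimisticLockingException. */
    public static class OptimisticLockingException extends RuntimeException {}

    /**
     * Runs the event-delivery action, retrying up to maxAttempts times when an
     * optimistic-locking conflict occurs. Rethrows once the budget is exhausted.
     */
    public static void deliverWithRetry(Runnable deliver, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                deliver.run();
                return; // delivered without a locking conflict
            } catch (OptimisticLockingException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up, surface the conflict to the caller
                }
                // otherwise fall through and retry (simulating broker redelivery)
            }
        }
    }
}
```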
I think the real issue is that my original process definition had implicit join gateways, which is what produced the zombie process instances described above. Using explicit parallel gateways, combined with the retry on FlowableOptimisticLockingException, makes my tests pass in the example repo.
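For reference, a minimal sketch of what the explicit join looks like in the BPMN XML (the element ids and flow names here are hypothetical, not copied from the repo):

```xml
<!-- Hypothetical fragment: two async send-event tasks forked and then joined
     by an explicit parallelGateway, instead of relying on an implicit join. -->
<parallelGateway id="fork"/>
<sequenceFlow id="toA" sourceRef="fork" targetRef="sendEventA"/>
<sequenceFlow id="toB" sourceRef="fork" targetRef="sendEventB"/>

<serviceTask id="sendEventA" flowable:type="send-event" flowable:triggerable="true"/>
<serviceTask id="sendEventB" flowable:type="send-event" flowable:triggerable="true"/>

<sequenceFlow id="fromA" sourceRef="sendEventA" targetRef="join"/>
<sequenceFlow id="fromB" sourceRef="sendEventB" targetRef="join"/>
<parallelGateway id="join"/>
```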
I had a quick look at the code and have the following questions:
You’re executing the jobs yourself through managementService.executeJob. Is there a reason for not wanting to use the async executor? (There are some extra pieces of logic that happen when doing so.)
What’s the purpose of making the send task a wait state (i.e. triggerable)? I’m not sure I’m getting the use case here yet.