Zombie Processes With Parallel Async Service Tasks

Hello,

We’re encountering a very reproducible problem with asynchronous service tasks (event registry “send-event”) executed in parallel. I’ve assembled a minimal reproduction of the problem here: GitHub - chaserb/parallel-async-service-tasks: Demonstration of zombie process instances. The problem is that about 40-50% of the time, my process instance winds up in a zombie state: it retains a single record in the ACT_RU_EXECUTION table with a null ACT_ID_, PARENT_ID_, and SUPER_EXEC_, even though all the service tasks complete successfully.
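For anyone wanting to check for the same symptom, a query along these lines should surface the stuck instances (this is my own sketch against the standard Flowable table layout, not something from the repo; the NOT EXISTS clause distinguishes a genuinely orphaned root execution from a healthy one that still has active children):

```sql
-- Candidate "zombie" root executions: no current activity, no parent,
-- no super execution, and no remaining child executions.
SELECT e.ID_, e.PROC_INST_ID_, e.PROC_DEF_ID_
FROM ACT_RU_EXECUTION e
WHERE e.ACT_ID_ IS NULL
  AND e.PARENT_ID_ IS NULL
  AND e.SUPER_EXEC_ IS NULL
  AND NOT EXISTS (
    SELECT 1 FROM ACT_RU_EXECUTION c WHERE c.PARENT_ID_ = e.ID_
  );
```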

I noticed a similar problem here: ParallelGateway - Process Instance remains after all sub-processes complete, but I’m positive our async executor is running normally.

I also tried the various serviceTask options suggested here: ParallelGateway - Process Instance remains after all sub-processes complete - #2 by adymlincoln, but they did not help the situation.

Thanks for your help,
Chase

LET’S HOLD OFF ON THIS FOR A BIT. I don’t think this is demonstrating what I intended. Let me work it a little more.

I think I’m on to the solution here. I noticed I was getting a FlowableOptimisticLockingException when my parent process had an explicitly declared parallelGateway on the join side, so I updated my TestInboundChannel to catch FlowableOptimisticLockingException and retry the event. I believe (am hoping :grinning_face:) this simulates the NACKs back to the RabbitMQ message broker that the production event registry should perform…I’ll test that.
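The retry idea can be sketched as a catch-and-redeliver loop. This is a self-contained illustration, not the actual TestInboundChannel code: the `OptimisticLockingException` class below is a local stand-in for Flowable’s `FlowableOptimisticLockingException` so the sketch compiles without the Flowable jars, and `deliverWithRetry` is a hypothetical helper name.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RetryDemo {
    // Stand-in for org.flowable.common.engine.api.FlowableOptimisticLockingException,
    // so this sketch compiles without the Flowable dependency.
    static class OptimisticLockingException extends RuntimeException {}

    /** Re-deliver the event up to maxAttempts times, mimicking a broker NACK/redelivery cycle. */
    static void deliverWithRetry(Runnable deliverEvent, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                deliverEvent.run();
                return; // delivered successfully
            } catch (OptimisticLockingException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up; in production the message would be dead-lettered
                }
                // otherwise loop and redeliver, as the broker would on a NACK
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Fail twice with the locking exception, then succeed on the third delivery.
        deliverWithRetry(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new OptimisticLockingException();
            }
        }, 5);
        System.out.println("delivered after " + calls.get() + " attempts");
    }
}
```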

I think the real issue is that my original process definition had implicit join gateways, which is what produced the zombie process instances described above. Having explicit parallel gateways, combined with a retry on FlowableOptimisticLockingException, makes my tests pass in the example repo.
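For comparison, the explicit fork/join shape looks roughly like this in the BPMN XML (the ids here are made up for illustration; the real definitions are in the linked repo):

```xml
<parallelGateway id="fork" />
<sequenceFlow id="toTaskA" sourceRef="fork" targetRef="taskA" />
<sequenceFlow id="toTaskB" sourceRef="fork" targetRef="taskB" />

<serviceTask id="taskA" flowable:type="send-event" flowable:async="true">...</serviceTask>
<serviceTask id="taskB" flowable:type="send-event" flowable:async="true">...</serviceTask>

<!-- explicit join: both branches must arrive before the process continues -->
<sequenceFlow id="fromTaskA" sourceRef="taskA" targetRef="join" />
<sequenceFlow id="fromTaskB" sourceRef="taskB" targetRef="join" />
<parallelGateway id="join" />
```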

I haven’t looked at the example, but in that situation, I would expect a deadletter job for the joining gateway. Did you see that?

No, I don’t have any records in the dead letter job table for those process instances.

I had a quick look at the code and have the following questions:

  • You’re executing the jobs yourself through managementService.executeJob. Is there a reason for not wanting to use the async executor? (There are some extra pieces of logic that happen when it runs the jobs.)
  • What’s the purpose of making the send task a wait state (i.e. triggerable)? Not sure I’m getting the use case here yet.
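For context on the second question: a triggerable send-event task sends its outbound event and then behaves as a wait state until a correlated inbound event triggers it. In Flowable’s BPMN XML that is declared roughly as follows (the event names here are invented for illustration):

```xml
<serviceTask id="sendOrder" flowable:type="send-event" flowable:triggerable="true">
  <extensionElements>
    <flowable:eventType>orderRequested</flowable:eventType>
    <!-- with triggerable=true, the task waits here until this inbound event arrives -->
    <flowable:triggerEventType>orderConfirmed</flowable:triggerEventType>
  </extensionElements>
</serviceTask>
```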