Zombie Processes With Parallel Async Service Tasks

Hello,

We’re encountering a very reproducible problem with asynchronous service tasks (event registry “send-event”) executed in parallel. I’ve assembled a minimal reproduction of the problem here: GitHub - chaserb/parallel-async-service-tasks: Demonstration of zombie process instances. The problem is that roughly 40-50% of the time, my process instance winds up in a zombie state: it retains a single record in the ACT_RU_EXECUTION table with a null ACT_ID_, PARENT_ID_, and SUPER_EXEC_, even though all the service tasks complete successfully.

I noticed a similar problem here: ParallelGateway - Process Instance remains after all sub-processes complete, but I’m positive our async executor is running normally.

I also tried the various serviceTask options suggested here: ParallelGateway - Process Instance remains after all sub-processes complete - #2 by adymlincoln, but they did not help the situation.

Thanks for your help,
Chase

LET’S HOLD OFF ON THIS FOR A BIT. I don’t think this is demonstrating what I intended. Let me work it a little more.

I think I’m on to the solution here. I noticed I was getting a FlowableOptimisticLockingException when my parent process had an explicitly declared parallelGateway on the join side, so I updated my TestInboundChannel to catch FlowableOptimisticLockingException and retry the event. I believe (and hope) this simulates the NACKs back to the RabbitMQ message broker that the production event registry should perform. I’ll test that.
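The catch-and-retry idea can be sketched as a small self-contained loop. This is not the actual TestInboundChannel code; the class and exception here are stand-ins (the real code would catch Flowable’s FlowableOptimisticLockingException), and the retry limit is illustrative:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RetryingInboundChannel {

    // Stand-in for FlowableOptimisticLockingException in this self-contained sketch.
    static class OptimisticLockingException extends RuntimeException {}

    static final int MAX_ATTEMPTS = 3; // illustrative limit

    // Delivers an event, retrying when a concurrent update loses the optimistic
    // lock race -- roughly what a NACK + redelivery from RabbitMQ would achieve.
    static int deliverWithRetry(Runnable delivery) {
        for (int attempt = 1; ; attempt++) {
            try {
                delivery.run();
                return attempt; // delivered; report which attempt succeeded
            } catch (OptimisticLockingException e) {
                if (attempt >= MAX_ATTEMPTS) {
                    throw e; // give up: the broker would dead-letter the message
                }
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulate two optimistic-lock conflicts before the update lands.
        int attempts = deliverWithRetry(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new OptimisticLockingException();
            }
        });
        System.out.println("succeeded on attempt " + attempts); // prints "succeeded on attempt 3"
    }
}
```

The point of the retry (rather than swallowing the exception) is that an optimistic-lock conflict on a parallel join means another branch updated the same execution row first, so redelivering the event lets the join logic run again against the fresh state.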

I think the real issue is that my original process definition had implicit join gateways, which was what produced the zombie process instances described above. Having explicit parallel gateways with retry on FlowableOptimisticLockingException causes my tests to pass in the example repo.

I haven’t looked at the example, but in that situation I would expect a dead-letter job for the joining gateway. Did you see that?

No, I don’t have any records in the dead letter job table for those process instances.

I had a quick look at the code and had the following questions:

  • You’re executing the jobs yourself through managementService.executeJob. Is there a reason for not wanting to use the async executor (as there is some extra logic that happens when doing so)?
  • What’s the purpose of making the send task a wait state (i.e. triggerable)? I’m not sure I’m getting the use case here yet.

Sorry, I could have been more clear. Thanks for your quick reply.

Regarding the triggerable flag, we use that to implement the Request-Reply pattern with async callback, using the execution ID as the correlation ID. For example, one of our serviceTasks will dispatch a “send email request” event to our service that accomplishes this, and then check the success of that request on the “send email response” event.
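As a generic illustration of that request-reply correlation (this is not Flowable’s API; the class and method names are made up for the sketch), the requester registers a pending future keyed by the execution ID, and the response handler completes it:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class RequestReplyCorrelator {

    // Pending requests keyed by correlation ID (the execution ID in our processes).
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Called when the "send email request" event is dispatched.
    public CompletableFuture<String> register(String executionId) {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(executionId, future);
        return future;
    }

    // Called when the "send email response" event arrives.
    public boolean complete(String executionId, String payload) {
        CompletableFuture<String> future = pending.remove(executionId);
        if (future == null) {
            return false; // unknown or already-handled correlation ID
        }
        return future.complete(payload);
    }

    public static void main(String[] args) {
        RequestReplyCorrelator correlator = new RequestReplyCorrelator();
        CompletableFuture<String> reply = correlator.register("exec-42"); // execution ID as correlation ID
        correlator.complete("exec-42", "email-sent");
        System.out.println(reply.join()); // prints "email-sent"
    }
}
```

In the actual process, the response handler would resume the triggerable service task via Flowable’s runtimeService.trigger(executionId) rather than completing a future, but the correlation bookkeeping is the same shape.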

Regarding the managementService.executeJob(), my only intent was to ensure the test was truly multi-threaded to simulate multiple cluster nodes receiving responses nearly simultaneously. I didn’t realize the async executor was an option in a test setup. I can give that a try.
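For reference, activating the async executor in a plain test engine is a one-flag configuration change; a sketch (adjust the JDBC URL and configuration class to your setup):

```java
// Configuration fragment: build an in-memory engine with the async executor active,
// so jobs are picked up by the executor's thread pool instead of managementService.executeJob.
ProcessEngine engine = ProcessEngineConfiguration
        .createStandaloneInMemProcessEngineConfiguration()
        .setAsyncExecutorActivate(true)
        .setJdbcUrl("jdbc:h2:mem:flowable;DB_CLOSE_DELAY=1000")
        .buildProcessEngine();
```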

Thanks,
Chase

A question I have regarding this relates to the async executor, which we have configured with the default of 3 retries. Will the event registry “response” events benefit from this setting? I ask because the request events are dispatched on threads named “task-123”, but the response events are received on threads named “org.flowable.eventregistry.rabbit.ChannelRabbitListenerEndpointContainer#workflowInbound-1”

If your send task is async (which it is), then the sending will be done by the async executor.

On the receiving side, you would also need to make the first step async, or it will run on the thread of the receiver. It’s the same story as, e.g., a web request: it will be handled by the web container thread unless you make a step async.

Let me rephrase what you said just so I’m sure I understand. I currently have this in the child process:

  • (startEvent) → [serviceTask (async=true)] → (endEvent)

To incorporate the async executor on the receiving side, I would need to make the first step async, which in the case above becomes the following:

  • (startEvent) → [serviceTask (async=true)] → (endEvent (async=true))

Is that correct?

Thanks again,
Chase

Yes - you mention above that you’re using RabbitMQ, right?
This means that the receive will be handled on the RabbitMQ listener thread, and making the step async will then hand off to the async executor, freeing up the RabbitMQ thread.
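In BPMN XML terms, that hand-off might look like the fragment below. This is a sketch, not the repo’s actual definition: the event type, IDs, and delegate expression are placeholders, while flowable:async and flowable:triggerable are the real Flowable attributes being discussed:

```xml
<!-- Receiving side: the first step after the event-registry start is async,
     so the RabbitMQ listener thread only inserts a job and returns. -->
<startEvent id="start">
  <extensionElements>
    <flowable:eventType>sendEmailResponse</flowable:eventType>
  </extensionElements>
</startEvent>
<sequenceFlow id="flow1" sourceRef="start" targetRef="handleResponse"/>
<serviceTask id="handleResponse"
             flowable:async="true"
             flowable:triggerable="true"
             flowable:delegateExpression="${responseHandler}"/>
```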