We’re encountering a very reproducible problem with asynchronous service tasks (event registry “send-event”) executed in parallel. I’ve assembled a minimal facsimile of the problem here: GitHub - chaserb/parallel-async-service-tasks: Demonstration of zombie process instances. The problem is that about 40-50% of the time, my process instance winds up in a zombie state: it has a single record in the ACT_RU_EXECUTION table with null ACT_ID_, PARENT_ID_, and SUPER_EXEC_ columns, even though all the service tasks complete successfully.
I think I’m on to the solution here. I noticed I was getting FlowableOptimisticLockingException when my parent process had an explicitly declared parallelGateway on the join side, so I updated my TestInboundChannel to catch FlowableOptimisticLockingException and retry the event, which I believe (and hope) simulates the NACKs back to the RabbitMQ message broker that the production event registry should perform. I’ll test that.
I think the real issue is that my original process definition had implicit join gateways, which produced the zombie process instances described above. Using explicit parallel gateways, combined with a retry on FlowableOptimisticLockingException, makes my tests pass in the example repo.
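For reference, the catch-and-retry idea can be sketched roughly like this. This is a minimal, self-contained illustration, not the actual TestInboundChannel: the exception class below is a stand-in for Flowable’s FlowableOptimisticLockingException, and the retry count and delivery callback are hypothetical.

```java
import java.util.function.Consumer;

public class RetryingInboundChannel {

    // Stand-in for org.flowable.common.engine.api.FlowableOptimisticLockingException
    static class OptimisticLockingException extends RuntimeException {
        OptimisticLockingException(String msg) { super(msg); }
    }

    // Hypothetical retry budget; a real channel would likely NACK back to the broker instead
    private static final int MAX_ATTEMPTS = 3;

    /** Deliver an event, retrying when a concurrent join update causes an optimistic-lock conflict. */
    static void deliverWithRetry(Runnable deliver) {
        for (int attempt = 1; ; attempt++) {
            try {
                deliver.run();
                return; // delivered successfully
            } catch (OptimisticLockingException e) {
                if (attempt >= MAX_ATTEMPTS) {
                    // Give up; in production this is where the NACK to RabbitMQ would happen
                    throw e;
                }
                // Otherwise retry: another node's update to the join won the race
            }
        }
    }
}
```

The point of the retry is that an optimistic-lock conflict on a parallel join is transient: the losing delivery can simply be replayed once the winning transaction has committed.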
I had a quick look at the code and had the following questions:
You’re executing the jobs yourself through managementService.executeJob. Is there a reason for not wanting to use the async executor (as there are some extra pieces of logic that happen when doing so)?
What’s the purpose of making the send task a wait state (i.e. triggerable)? Not sure I’m getting the use case here yet.
Sorry, I could have been clearer. Thanks for your quick reply.
Regarding the triggerable flag, we use it to implement the Request-Reply pattern with an async callback, using the execution ID as the correlation ID. For example, one of our serviceTasks dispatches a “send email request” event to the service that handles it, and then checks the success of that request on the “send email response” event.
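As a rough illustration of that correlation (the class and method names here are hypothetical; in the real setup the execution ID comes from Flowable and resuming the wait state would go through runtimeService.trigger):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of request/reply correlation keyed by execution ID. */
public class ReplyCorrelator {

    // Requests that have been dispatched and are awaiting their async callback
    private final Map<String, String> pendingRequests = new ConcurrentHashMap<>();

    /** Triggerable send task: dispatch the request and leave the execution parked. */
    public void sendRequest(String executionId, String payload) {
        pendingRequests.put(executionId, payload);
        // ... publish the "send email request" event carrying executionId ...
    }

    /** Inbound "send email response" event: correlate by execution ID and resume. */
    public boolean onResponse(String executionId) {
        // In Flowable this is where runtimeService.trigger(executionId) would be called
        return pendingRequests.remove(executionId) != null;
    }
}
```

The triggerable send task parks the execution at the task until the correlated response arrives, which is exactly what makes the execution ID usable as a correlation ID.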
Regarding managementService.executeJob(), my only intent was to ensure the test was truly multi-threaded, to simulate multiple cluster nodes receiving responses nearly simultaneously. I didn’t realize the async executor was an option in a test setup; I’ll give that a try.
A related question concerns the async executor, which we have configured with the default of 3 retries. Will the event registry “response” events benefit from this setting? I ask because the request events are dispatched on threads named “task-123”, while the response events are received on threads named “org.flowable.eventregistry.rabbit.ChannelRabbitListenerEndpointContainer#workflowInbound-1”.
If your send task is async (which it is), then the sending will be done by the async executor.
On the receiving side, you would also need to make the first step async, or it will run on the thread of the receiver. It’s the same story as, e.g., a web request: it’ll be handled by the web container thread unless you make a step async.
Yes - you mention above you’re using RabbitMQ, right?
This means that receiving will be handled on the RabbitMQ thread, and making the step async will then hand off to the async executor, freeing up the RabbitMQ thread.
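The hand-off pattern being described can be sketched like this. This is an illustration only: in Flowable, marking the first step async does this hand-off for you, and the listener/executor wiring below is hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch: the broker listener thread only enqueues work; an executor pool runs it. */
public class AsyncHandOff {

    // Stand-in for the async executor's job thread pool
    private final ExecutorService asyncExecutor = Executors.newFixedThreadPool(4);

    /** Called on the RabbitMQ listener thread; returns immediately after enqueueing. */
    public void onMessage(String body, Runnable ack) {
        asyncExecutor.submit(() -> {
            // Process the event on an executor thread, not the broker thread
            handleEvent(body);
        });
        ack.run(); // the broker thread is freed right away
    }

    /** Continue the process instance; overridable here purely for illustration. */
    protected void handleEvent(String body) { }

    public void shutdown() throws InterruptedException {
        asyncExecutor.shutdown();
        asyncExecutor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Because the actual work runs as an async job, it also picks up the async executor’s retry behavior instead of tying up (or failing on) the broker’s listener thread.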