Parallel subprocess and locking problem

Hello,
I’m using Flowable 6.7.2 in a multi-server environment. I have a simple master-slave process (definitions below).

When I start it, I see records like these in the act_ru_job table.

When the master process enters the “subprocess” element, it should create 100 subprocesses (slave), but these jobs are “exclusive”, so they all try to lock the same process_id, that of the parent. I don’t understand why these records don’t contain the process_id of the slave. This is a parallel execution, so locking on the parent id effectively serializes it.

We see this in our production env, where a process with 100,000 subprocesses takes a very long time, and most of that time is spent waiting for the lock on the parent process id (thread dump below). Maybe there is some problem in our definitions.

<?xml version="1.0" encoding="UTF-8"?>
<definitions
	xmlns="http://www.omg.org/spec/BPMN/20100524/MODEL"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:xsd="http://www.w3.org/2001/XMLSchema"
	xmlns:bpmndi="http://www.omg.org/spec/BPMN/20100524/DI"
	xmlns:omgdc="http://www.omg.org/spec/DD/20100524/DC"
	xmlns:omgdi="http://www.omg.org/spec/DD/20100524/DI"
	xmlns:flowable="http://flowable.org/bpmn"
	typeLanguage="http://www.w3.org/2001/XMLSchema"
	expressionLanguage="http://www.w3.org/1999/XPath"
	targetNamespace="http://www.flowable.org/processdef">
	<process id="master" name="master" isExecutable="true">
		<startEvent id="startMaster" />
		<scriptTask id="scriptTaskMaster" name="scriptTaskMaster"
			flowable:exclusive="true" flowable:async="true" scriptFormat="groovy">
			<script>
				println ' START master '
			</script>
		</scriptTask>


		<callActivity id="subprocessMaster" calledElement="slave"
			flowable:async="true" flowable:completeAsync="true"
			flowable:exclusive="true">
			<multiInstanceLoopCharacteristics
				isSequential="false" flowable:collection="EmployeeRecords"></multiInstanceLoopCharacteristics>
		</callActivity>

		<scriptTask id="scriptTaskMaster2" name="scriptTaskMaster2"
			flowable:exclusive="true" flowable:async="true" scriptFormat="groovy">
			<script>
				println ' STOP master '
			</script>
		</scriptTask>
		<userTask id="userTaskMaster" name="userTaskMaster" flowable:async="true" />

		<sequenceFlow sourceRef="startMaster" targetRef="scriptTaskMaster" />
		<sequenceFlow sourceRef="scriptTaskMaster"
			targetRef="subprocessMaster" />
		<sequenceFlow sourceRef="subprocessMaster"
			targetRef="scriptTaskMaster2" />
		<sequenceFlow sourceRef="scriptTaskMaster2"
			targetRef="userTaskMaster" />
		<sequenceFlow sourceRef="userTaskMaster" targetRef="endMaster" />


		<endEvent id="endMaster" />


	</process>

</definitions>

Slave

<?xml version="1.0" encoding="UTF-8"?>
<definitions
	xmlns="http://www.omg.org/spec/BPMN/20100524/MODEL"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:xsd="http://www.w3.org/2001/XMLSchema"
	xmlns:bpmndi="http://www.omg.org/spec/BPMN/20100524/DI"
	xmlns:omgdc="http://www.omg.org/spec/DD/20100524/DC"
	xmlns:omgdi="http://www.omg.org/spec/DD/20100524/DI"
	xmlns:flowable="http://flowable.org/bpmn"
	typeLanguage="http://www.w3.org/2001/XMLSchema"
	expressionLanguage="http://www.w3.org/1999/XPath"
	targetNamespace="http://www.flowable.org/processdef">

	<process id="slave" name="slave"
		isExecutable="true">
		<startEvent id="start" />
		<scriptTask id="scriptTask" name="scriptTask"
			flowable:async="true" scriptFormat="groovy" flowable:exclusive="true">
			<script>
				println '	START slave scriptTask'
				Thread.sleep(2000)
			</script>
		</scriptTask>
		<sequenceFlow sourceRef="start" targetRef="scriptTask" />
		<sequenceFlow sourceRef="scriptTask" targetRef="end" />

		<endEvent id="end" flowable:async="true">
		</endEvent>
	</process>

</definitions>

Thread dump

org.flowable.engine.impl.persistence.entity.data.impl.MybatisExecutionDataManager.updateProcessInstanceLockTime(MybatisExecutionDataManager.java:309)
org.flowable.engine.impl.persistence.entity.ExecutionEntityManagerImpl.updateProcessInstanceLockTime(ExecutionEntityManagerImpl.java:1065)
org.flowable.engine.impl.cfg.DefaultInternalJobManager.lockJobScopeInternal(DefaultInternalJobManager.java:167)
org.flowable.job.service.ScopeAwareInternalJobManager.lockJobScope(ScopeAwareInternalJobManager.java:79)
org.flowable.job.service.impl.cmd.LockExclusiveJobCmd.execute(LockExclusiveJobCmd.java:59)
org.flowable.engine.impl.interceptor.CommandInvoker$1.run(CommandInvoker.java:67)
org.flowable.engine.impl.interceptor.CommandInvoker.executeOperation(CommandInvoker.java:140)
org.flowable.engine.impl.interceptor.CommandInvoker.executeOperations(CommandInvoker.java:114)
org.flowable.engine.impl.interceptor.CommandInvoker.execute(CommandInvoker.java:72)
org.flowable.engine.impl.interceptor.BpmnOverrideContextInterceptor.execute(BpmnOverrideContextInterceptor.java:26)
org.flowable.common.engine.impl.interceptor.TransactionContextInterceptor.execute(TransactionContextInterceptor.java:53)
org.flowable.common.engine.impl.interceptor.CommandContextInterceptor.execute(CommandContextInterceptor.java:105)
org.flowable.common.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:30)
org.flowable.common.engine.impl.cfg.CommandExecutorImpl.execute(CommandExecutorImpl.java:56)
org.flowable.common.engine.impl.cfg.CommandExecutorImpl.execute(CommandExecutorImpl.java:51)
org.flowable.job.service.impl.asyncexecutor.ExecuteAsyncRunnable.lockJob(ExecuteAsyncRunnable.java:178)
org.flowable.job.service.impl.asyncexecutor.ExecuteAsyncRunnable.run(ExecuteAsyncRunnable.java:112)
java.base@11.0.6/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)

After debugging the Flowable source code I have some thoughts.
On the callActivity element we used flowable:completeAsync="true", which means that the completion of each subprocess should be handled by the async executor with exclusive behaviour, and this is correct. But in my opinion the algorithm the async executor uses for “exclusive” jobs is not good. Imagine 8 servers, each with a thread pool size of 8. So we have 64 threads, and now a process with 100k subprocesses. These 64 threads all try to execute these jobs. Only one thread at a time can win this race by placing a lock on the parent process (an update with lockTime); the other threads get a FlowableLockException and execute unacquireJob for future execution, over and over. In my opinion “AcquireAsyncJobsDueRunnable” should be smarter. It should skip records in act_ru_job which are already locked, because executing them will probably fail with a FlowableLockException.
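
The retry loop described above can be sketched in plain Java. This is only an illustration and does not use any Flowable classes: an AtomicBoolean stands in for the lock on the parent process instance, and a failed compareAndSet stands in for a FlowableLockException followed by unacquireJob. The names ExclusiveLockContentionSketch, simulate and Result are made up for this sketch, and the real engine takes the lock through a database update, not in memory.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class ExclusiveLockContentionSketch {

    /** Counters collected during one simulated run (hypothetical type). */
    public static final class Result {
        public final int completedJobs;
        public final int failedLockAttempts;
        public final int maxConcurrentHolders;

        Result(int completedJobs, int failedLockAttempts, int maxConcurrentHolders) {
            this.completedJobs = completedJobs;
            this.failedLockAttempts = failedLockAttempts;
            this.maxConcurrentHolders = maxConcurrentHolders;
        }
    }

    /** Runs 'jobs' exclusive jobs of ONE parent instance on 'threads' workers. */
    public static Result simulate(int threads, int jobs) {
        AtomicBoolean parentLocked = new AtomicBoolean(false); // stand-in for the parent lock
        AtomicInteger completed = new AtomicInteger();
        AtomicInteger failed = new AtomicInteger();
        AtomicInteger holders = new AtomicInteger();
        AtomicInteger maxHolders = new AtomicInteger();

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                while (completed.get() < jobs) {
                    if (parentLocked.compareAndSet(false, true)) {
                        try {
                            // Only the lock holder gets here, so jobs run one at a time.
                            if (completed.get() < jobs) {
                                int now = holders.incrementAndGet();
                                maxHolders.accumulateAndGet(now, Math::max);
                                completed.incrementAndGet();
                                holders.decrementAndGet();
                            }
                        } finally {
                            parentLocked.set(false);
                        }
                    } else {
                        // Lost the race: in the engine this would be a lock
                        // exception and the job would be handed back for a retry.
                        failed.incrementAndGet();
                        Thread.yield();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
                pool.shutdownNow();
                throw new IllegalStateException("simulation timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
        return new Result(completed.get(), failed.get(), maxHolders.get());
    }

    public static void main(String[] args) {
        Result r = simulate(64, 1000);
        System.out.println("completed=" + r.completedJobs
                + " failedLockAttempts=" + r.failedLockAttempts
                + " maxConcurrentHolders=" + r.maxConcurrentHolders);
    }
}
```

The number of failed attempts varies from run to run, but maxConcurrentHolders stays at 1, which is the point: with a single parent to lock, adding threads adds failed attempts rather than throughput.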

Add something like this to the “selectJobsToExecute” query:
and RES.PROCESS_INSTANCE_ID_ NOT IN (SELECT LOCKED.PROCESS_INSTANCE_ID_ FROM ${prefix}ACT_RU_JOB LOCKED WHERE LOCKED.LOCK_OWNER_ IS NOT NULL and LOCKED.EXCLUSIVE_ = 1)
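
What the extra NOT IN condition is meant to do can be shown with a small in-memory sketch in Java. It is not Flowable code: JobRow and acquirableJobIds are hypothetical names, the fields only mirror the PROCESS_INSTANCE_ID_, EXCLUSIVE_ and LOCK_OWNER_ columns, and it assumes the base query already returns only rows that are not locked themselves.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AcquireFilterSketch {

    /** Hypothetical in-memory stand-in for one row of act_ru_job. */
    public static final class JobRow {
        final String id;
        final String processInstanceId;
        final boolean exclusive;
        final String lockOwner; // null means the row is not locked

        public JobRow(String id, String processInstanceId, boolean exclusive, String lockOwner) {
            this.id = id;
            this.processInstanceId = processInstanceId;
            this.exclusive = exclusive;
            this.lockOwner = lockOwner;
        }
    }

    /** Returns ids of unlocked jobs whose process instance has no locked exclusive job. */
    public static List<String> acquirableJobIds(List<JobRow> rows) {
        // Equivalent of the sub-select: instances that already have a locked exclusive job.
        Set<String> lockedInstances = new HashSet<>();
        for (JobRow row : rows) {
            if (row.exclusive && row.lockOwner != null) {
                lockedInstances.add(row.processInstanceId);
            }
        }
        // Equivalent of the outer query plus the proposed NOT IN condition.
        List<String> result = new ArrayList<>();
        for (JobRow row : rows) {
            if (row.lockOwner == null && !lockedInstances.contains(row.processInstanceId)) {
                result.add(row.id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<JobRow> rows = new ArrayList<>();
        rows.add(new JobRow("j1", "P1", true, "node-a"));  // locked exclusive job of P1
        rows.add(new JobRow("j2", "P1", true, null));      // skipped: P1 is busy
        rows.add(new JobRow("j3", "P2", true, null));      // acquirable
        rows.add(new JobRow("j4", "P3", false, "node-b")); // locked, but not exclusive
        rows.add(new JobRow("j5", "P3", true, null));      // acquirable
        System.out.println(acquirableJobIds(rows)); // prints [j3, j5]
    }
}
```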

Those are some harsh words (“is not good”, “should be smarter”, …). The algorithm as it is today is written to guarantee safety of the data and it has been running with great success for many users/customers.

Do take into account you are doing 100k subprocesses for one instance … this is not regular modelling or usage as far as we know. In fact, 100k instances means 100k execution rows/entities, and I’m surprised the database/JDK actually executes that and doesn’t time out.

That is something to consider, but do take into account we had a lot of these things on the query in the past, and it was bad for performance (table scans when having 1M+ jobs). But we’ll review this suggestion and see what the impact is, thank you for your feedback.

Thanks for the answer.
Sorry for the harsh words. Flowable works great in normal cases, with 200k processes per day :).

Could you tell me how many subprocesses you have in a standard production env?

I’m thinking about one more solution. On the regular engines I could add an additional where condition to the “selectJobsToExecute” query:
and (RES.HANDLER_TYPE_ != 'async-complete-call-actiivty' AND RES.HANDLER_TYPE_ != 'parallel-multi-instance-complete')
so that these engines skip this problematic step. Then I would create one engine with threadPoolSize=1 that handles only (RES.HANDLER_TYPE_ = 'async-complete-call-actiivty' OR RES.HANDLER_TYPE_ = 'parallel-multi-instance-complete'). Of course this could still be a problem, because Flowable tries to hand jobs straight to the thread pool queue without querying the db, but it could reduce the db locks on the parent id.
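
One detail in the exclusion is easy to get wrong: the two “not equal” checks have to be combined with AND. Joined with OR, the condition is true for every handler type, because no value can equal both strings at once. A minimal Java sketch of the two predicates follows. HandlerTypeFilterSketch and its methods are hypothetical names, and "async-continuation" is used only as an example of some other handler type.

```java
import java.util.Arrays;
import java.util.List;

public class HandlerTypeFilterSketch {

    // Handler type strings exactly as written in the queries above.
    static final String ASYNC_COMPLETE = "async-complete-call-actiivty";
    static final String MI_COMPLETE = "parallel-multi-instance-complete";

    /** "t != A OR t != B": accepts every handler type, so it filters nothing. */
    public static boolean orForm(String handlerType) {
        return !ASYNC_COMPLETE.equals(handlerType) || !MI_COMPLETE.equals(handlerType);
    }

    /** "t != A AND t != B": what the regular engines would pick up. */
    public static boolean regularEngineAccepts(String handlerType) {
        return !ASYNC_COMPLETE.equals(handlerType) && !MI_COMPLETE.equals(handlerType);
    }

    /** "t = A OR t = B": what the single-threaded engine would pick up. */
    public static boolean dedicatedEngineAccepts(String handlerType) {
        return ASYNC_COMPLETE.equals(handlerType) || MI_COMPLETE.equals(handlerType);
    }

    public static void main(String[] args) {
        List<String> types = Arrays.asList("async-continuation", ASYNC_COMPLETE, MI_COMPLETE);
        for (String type : types) {
            System.out.println(type
                    + " orForm=" + orForm(type)
                    + " regular=" + regularEngineAccepts(type)
                    + " dedicated=" + dedicatedEngineAccepts(type));
        }
    }
}
```

With the AND form, every handler type is accepted by exactly one of the two engine groups, so no job is picked up twice and none is left without an engine.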