Flowable 6.5.0 - Deadlock detected ACT_RU_EXECUTION, ACT_RE_PROCDEF

I recently upgraded from 6.3.0 to 6.5.0, and these deadlock errors started appearing after the upgrade.
Database - postgres
And I have “Multi-Schema Multi-Tenancy” configured.

The deadlocks occur on the ACT_RU_EXECUTION and ACT_RE_PROCDEF tables.

I debugged one of the failures further and found that a request to delete a process instance was sent, and it took around 10 minutes to get a response from the RuntimeService.deleteProcessInstance(String, String) API. Since no response came for a long time, another delete request was sent, and hence the deadlock.

I have two questions here:

  1. What could be the probable reason for such a simple operation taking so much time? The Flowable engine is running on the same server, so there is no network latency.
  2. Once the deadlock error occurs, operations for every tenant start failing. For example, even process launches fail:

Error querying database. Cause: org.postgresql.util.PSQLException: ERROR: current transaction is aborted, commands ignored until end of transaction block
The error may exist in org/flowable/db/mapping/entity/ProcessDefinition.xml
The error may involve org.flowable.engine.impl.persistence.entity.ProcessDefinitionEntityImpl.selectProcessDefinition-Inline
The error occurred while setting parameters
SQL: select * from ACT_RE_PROCDEF where ID_ = ?
Cause: org.postgresql.util.PSQLException: ERROR: current transaction is aborted, commands ignored until end of transaction block
at org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:30)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:149)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:140)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectOne(DefaultSqlSession.java:76)
at org.flowable.common.engine.impl.db.DbSqlSession.selectById(DbSqlSession.java:304)
at org.flowable.common.engine.impl.db.AbstractDataManager.findById(AbstractDataManager.java:70)
at org.flowable.common.engine.impl.persistence.entity.AbstractEntityManager.findById(AbstractEntityManager.java:40)
at org.flowable.engine.impl.cmd.StartProcessInstanceCmd.getProcessDefinition(StartProcessInstanceCmd.java:202)
at org.flowable.engine.impl.cmd.StartProcessInstanceCmd.execute(StartProcessInstanceCmd.java:111)
at org.flowable.engine.impl.cmd.StartProcessInstanceCmd.execute(StartProcessInstanceCmd.java:52)
at org.flowable.engine.impl.interceptor.CommandInvoker$1.run(CommandInvoker.java:51)
at org.flowable.engine.impl.interceptor.CommandInvoker.executeOperation(CommandInvoker.java:93)
at org.flowable.engine.impl.interceptor.CommandInvoker.executeOperations(CommandInvoker.java:72)
at org.flowable.engine.impl.interceptor.CommandInvoker.execute(CommandInvoker.java:56)
at org.flowable.engine.impl.interceptor.BpmnOverrideContextInterceptor.execute(BpmnOverrideContextInterceptor.java:25)
at org.flowable.common.engine.impl.interceptor.TransactionContextInterceptor.execute(TransactionContextInterceptor.java:53)
at org.flowable.common.engine.impl.interceptor.CommandContextInterceptor.execute(CommandContextInterceptor.java:72)
at org.flowable.common.spring.SpringTransactionInterceptor.execute(SpringTransactionInterceptor.java:51)
at org.flowable.common.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:30)
at org.flowable.common.engine.impl.cfg.CommandExecutorImpl.execute(CommandExecutorImpl.java:56)
at org.flowable.common.engine.impl.cfg.CommandExecutorImpl.execute(CommandExecutorImpl.java:51)
at org.flowable.engine.impl.RuntimeServiceImpl.startProcessInstanceById(RuntimeServiceImpl.java:156)

Could someone please respond: what could be the probable reason for the RuntimeService.deleteProcessInstance(String, String) operation taking so long? This is an intermittent issue and occurs when many operations are happening in parallel.

  1. Deleting isn’t necessarily a simple operation. When lots of data is involved, you’re looking at deleting executions, variables, tasks, jobs, etc. Can you describe your setup a bit more? I.e. how much data do you have in the runtime tables, because 10 minutes sounds very long.

  2. The fact that a select statement is failing is very strange. There shouldn’t be any updates or changes happening to a process definition. Is anything else changing or deleting the process definition at the same time?
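For reference, here is one way to quantify the runtime data. This is only a sketch (the table list covers the common runtime tables, and with Multi-Schema Multi-Tenancy it would need to be run once per tenant schema):

```sql
-- Row counts for the main Flowable runtime tables (run per tenant schema).
SELECT 'ACT_RU_EXECUTION' AS tbl, count(*) AS rows FROM ACT_RU_EXECUTION
UNION ALL SELECT 'ACT_RU_TASK',     count(*) FROM ACT_RU_TASK
UNION ALL SELECT 'ACT_RU_VARIABLE', count(*) FROM ACT_RU_VARIABLE
UNION ALL SELECT 'ACT_RU_JOB',      count(*) FROM ACT_RU_JOB;
```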

  1. There is not much data involved. For example, the act_ru_execution table has only 22 rows, and act_ru_task has 12 rows.
  2. No, nothing is changing or deleting the process definition at the same time. Many operations are happening in parallel, but they don’t interfere with each other; fetches of the process definition could happen in parallel, though. All of this used to run without any error before the upgrade.
    I ran the query below:

SELECT activity.pid AS blocked_id,
       activity.query AS blocked_query,
       blocking.pid AS blocking_id,
       blocking.query AS blocking_query
FROM pg_stat_activity AS activity
JOIN pg_stat_activity AS blocking ON blocking.pid = ANY(pg_blocking_pids(activity.pid));

Which returns blocking_query as
select RES.* from ACT_RE_PROCDEF RES WHERE RES.ID_ = $1 order by RES.ID_ asc
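To see exactly which lock the blocking session holds, the standard Postgres catalogs can be queried directly. A sketch (plain pg_locks, nothing Flowable-specific):

```sql
-- Pair each waiting (ungranted) lock with a granted lock on the same
-- relation, so you can see which session holds what and in which mode.
SELECT waiting.pid                 AS waiting_pid,
       waiting.mode                AS wanted_mode,
       waiting.relation::regclass  AS relation,
       holder.pid                  AS holding_pid,
       holder.mode                 AS held_mode
FROM pg_locks waiting
JOIN pg_locks holder
  ON holder.relation = waiting.relation
 AND holder.granted
 AND holder.pid <> waiting.pid
WHERE NOT waiting.granted;
```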

One thing to mention: we have asyncExecutorActivate=“false”.
Since the docs don’t mention whether enabling it is mandatory, we didn’t change it when we migrated, as it would affect other things in our project.

I did another run; attaching a screenshot in case it helps.

Similarly, one more instance today. It seems the locks are not released for a long time. Are there any Flowable-specific settings we need to configure for this? Every time, it gets stuck during the delete operation only.

No, that’s a configuration on the database side.
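As a database-side safety net while debugging, Postgres timeouts can make waiters fail fast instead of queuing indefinitely behind a stuck transaction. This is only a sketch, not a fix for the root cause, and flowable_user is a hypothetical role name:

```sql
-- Abort a statement that waits more than 30s for a lock.
ALTER ROLE flowable_user SET lock_timeout = '30s';
-- Terminate sessions that sit idle inside an open transaction.
ALTER ROLE flowable_user SET idle_in_transaction_session_timeout = '5min';
```

Note that with these settings the blocked Flowable commands will surface as exceptions that need to be retried, rather than waiting for the lock.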

Looking at the screenshots, I don’t have any idea yet how this can happen. So it only happens when deleting a process instance, while other operations (not related to that process instance) are happening? Is there a way we could try to reproduce this? A simple db dump + test/code?

Yes, while the delete is happening, other operations are also happening for other processes/definitions.
The lock is not released while the process instance is being deleted, and hence the other operations are stuck.

This issue occurs when we run the automation scripts for our product. These scripts are specific to our product and internally call the Flowable APIs.