Hi,
I’m using Flowable with an Oracle database, and I’m running into an issue with the following query:
update ACT_RU_EXECUTION set LOCK_TIME_ = :1 where ID_ = :2 and (LOCK_TIME_ is null OR LOCK_TIME_ < :3 )
It says:
The SQL statement with SQL_ID “ab46ty50gvtw4” was blocked on row locks.
That SQL_ID corresponds to the update query above.
In my research, I found that this error occurs when two or more sessions try to access the same resource at the same time.
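For example, the next time the block occurs, a query along these lines should show which session holds the row lock and what it is running (a sketch; it assumes you can read the v$session dynamic performance view, and on RAC you would query gv$session instead):

-- Sketch: list sessions blocked on row locks together with their blocker
SELECT sid, serial#, blocking_session, sql_id, event, seconds_in_wait
  FROM v$session
 WHERE blocking_session IS NOT NULL;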
Are there any schema changes recommended for Oracle databases?
What situation could be triggering this error?
We have started seeing this in an AWS Aurora cluster. It surfaces as a connection timeout in the MariaDB driver:
### Error updating database. Cause: java.sql.SQLException: (conn=1882029) Communications link failure with primary host redacted.us-east-1.rds.amazonaws.com:3306. Connection timed out
### The error may exist in org/flowable/db/mapping/entity/Execution.xml
### The error may involve org.flowable.engine.impl.persistence.entity.ExecutionEntityImpl.updateProcessInstanceLockTime-Inline
### The error occurred while setting parameters
### SQL: update ACT_RU_EXECUTION set LOCK_TIME_ = ? where ID_ = ? and (LOCK_TIME_ is null OR LOCK_TIME_ < ?)
### Cause: java.sql.SQLException: (conn=1882029) Communications link failure with primary host redacted.us-east-1.rds.amazonaws.com:3306. Connection timed out
I suspect a SIGTERM at the wrong moment triggered the problem. It is ongoing, and it is not obvious whether it comes from one bad job record or affects everything. The system as a whole still appears functional.
I will try an index on LOCK_TIME_ as suggested in a previous post and report back.
Indexes added on (LOCK_TIME_) and (ID_, LOCK_TIME_) did not resolve the issue. The only thing that seems to resolve it for a short period is rebooting the Aurora master/writer.
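For reference, the statements were along these lines (index names are illustrative):

CREATE INDEX idx_act_ru_exe_lock_time ON ACT_RU_EXECUTION (LOCK_TIME_);
CREATE INDEX idx_act_ru_exe_id_lock ON ACT_RU_EXECUTION (ID_, LOCK_TIME_);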
EDIT: it is worth noting that this exception occurs hundreds of times per second.
We have not isolated a root cause yet. It does not happen all the time; once the bad ACT_RU_JOB row is deleted, things clear up. It has happened about 3 or 4 times in total over the last 2 weeks. At current load we run about 10 async jobs per second, which is not much. In certain respects it looks like an application-level deadlock in Flowable’s custom locking code. The REV_ field was slowly increasing, but the number of retries per REV_ increment was in the tens of thousands.
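For anyone else debugging this, a query along these lines can help spot the bad job row (a sketch using standard ACT_RU_JOB columns; adjust to your Flowable version):

-- Sketch: surface the job whose lock keeps being fought over
SELECT ID_, REV_, RETRIES_, LOCK_OWNER_, LOCK_EXP_TIME_, PROCESS_INSTANCE_ID_
  FROM ACT_RU_JOB
 ORDER BY REV_ DESC;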
I don’t think so. What Flowable does is try to acquire a lock on the process instance by issuing the update from your first post. If that update returns 0 rows, it means the lock was taken by someone else. The only way that can deadlock is when multiple nodes are trying to acquire the lock (which by itself is not a problem) while other transactions are concurrently working on that process instance.
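In other words, the update doubles as a test-and-set on the process instance, and the caller inspects the affected-row count rather than expecting an error. A sketch of the pattern (parameter names are illustrative; the third bind is presumably a lock-expiry threshold, so a stale lock can be re-acquired):

UPDATE ACT_RU_EXECUTION
   SET LOCK_TIME_ = :newLockExpiry
 WHERE ID_ = :processInstanceId
   AND (LOCK_TIME_ IS NULL OR LOCK_TIME_ < :expiryThreshold);
-- update count = 1 -> lock acquired, proceed with the job
-- update count = 0 -> someone else holds a valid lock; back off and retry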
Do you happen to know whether access to the failing process instance is limited to the async jobs only, or are there other interactions (e.g. task completions) happening concurrently on those instances?