Update on ACT_RU_EXECUTION

Hi,
I’m using flowable with an Oracle database and I´m having an issue on the following query:

update ACT_RU_EXECUTION set LOCK_TIME_ = :1 where ID_ = :2 and (LOCK_TIME_ is null OR LOCK_TIME_ < :3 )

It says:

The SQL statement with SQL_ID “ab46ty50gvtw4” was blocked on row locks.

The SQL_ID shows the previous update query.

In my research, I found that this error occurs when two or more sessions want simultaneous access to a resource.
Are there some changes in the schema for Oracle databases?
What situation could be triggering this error?

This particular update happens when a a job gets executed and it needs to have a lock on the process instance due to being an exclusive job.

We haven’t heard of this problem yet on Oracle. Which Oracle version are you using? Which isolation level?

One option would be to add/remove indexes on the LOCK_TIME … but want to understand first why this would be a problem for a database.

Nakamura and I work together and I will answer the questions.

The version of Database is 12.1.0.2.0

On the isolation level, the base in question is shared by Flowable’s unique schema and other schemas for other applications.

Thanks. We QA against various versions of Oracle, but haven’t seen this problem yet.

That’s normal behavior, however I would expect one of the two session doesn’t get access and doesn’t update.

Hello,

We have started seeing this in an AWS Aurora cluster. It surfaces as a connection timeout in the mariadb driver.

### Error updating database.  Cause: java.sql.SQLException: (conn=1882029) Communications link failure with primary host redacted.us-east-1.rds.amazonaws.com:3306. Connection timed out
### The error may exist in org/flowable/db/mapping/entity/Execution.xml
### The error may involve org.flowable.engine.impl.persistence.entity.ExecutionEntityImpl.updateProcessInstanceLockTime-Inline
### The error occurred while setting parameters
### SQL: update ACT_RU_EXECUTION     set       LOCK_TIME_ = ?     where ID_ = ?       and (LOCK_TIME_ is null OR LOCK_TIME_ < ?)
### Cause: java.sql.SQLException: (conn=1882029) Communications link failure with primary host redacted.us-east-1.rds.amazonaws.com:3306. Connection timed out

I suspect a SIGTERM at the wrong moment started the problem. It is ongoing, and not obvious whether it is one bad job record, or affects everything. The system as a whole appears to be functional.

I will try an index on LOCK_TIME_ as suggested in a previous post and report back.

Indexes added on (LOCK_TIME_) and (ID_, LOCK_TIME_) did not resolve the issue. The only thing that seems to resolve it for a short period is rebooting the Aurora master/writer.

EDIT: it is worth noting this exception happens 100s of times per second

Do you know when this happens? All the time? Only with certain definitions?

The update query is the one used to lock a process instance. How much concurrent async jobs do you typically have at a given time?

We have not isolated a root cause yet. It does not happen all the time, once the bad ACT_RU_JOB is deleted, things clear up. It has happened about 3 or 4 times total over the last 2 weeks. At current load, we have about 10 or so async jobs per second. It’s not too much. In certain respects it looks like an application level deadlock with Flowable’s custom locking code. The REV_ field was slowly increasing, but the number of retries per REV_ increment was in the 10s of thousands.

Indeed, that load should be no problem.

I don’t think so - the thing what Flowable does is trying to acquire a lock on the process instance by doing the update from your first post. If that update returns 0, it means the lock was taken by someone else. The only way that can deadlock is when multiple are trying to lock (that on itself is no problem), and current transactions are doing a transaction on that process.

Do you happen to know if the access to the failing process instance is limited to the async jobs only? Or are there other interactions (e.g. task completes, etc.) happening concurrently on those instances?

Hello Joram,

The forum has limited me to 3 responses, so I have been adding additional information to a github issue. https://github.com/flowable/flowable-engine/issues/2442.