Job is still locked after it is unacquired

Hello to all,
I have the following situation:
I want to update my application with zero downtime. My “graceful shutdown” calls the shutdown() method of the AsyncExecutor. The jobs are then unacquired (the values of lock_exp_time_ and lock_owner_ in ACT_RU_JOB become null), but after the new application starts, the jobs hang until the initial value of lock_exp_time_ has passed (even though it is already null in ACT_RU_JOB).

After inspecting the DB, I saw that there is a lock_time_ value kept in the ACT_RU_EXECUTION table. If I manually update it to null, the jobs do not hang but start execution, which is my expected behaviour.

Is there a chance that not setting the lock_time_ to null in ACT_RU_EXECUTION during the unacquire is a bug?

Hi Radoslav,

I see that a default time of 5 minutes will be applicable when a new scheduler picks up the job. It will be NULL only when Tomcat shuts down properly.

Hi Bajana,
Let me explain my case again:

A job was picked by the scheduler. It was in progress when the AsyncExecutor was shut down. The job did not finish within the asyncExecutorSecondsToWaitOnShutdown time and it was killed. In the ACT_RU_JOB table, lock_exp_time_ and lock_owner_ are NULL (which is expected), but in the ACT_RU_EXECUTION table the corresponding lock_time_ value was not set to null (which I think is not expected). After the new scheduler starts, it waits for the lock_time_ in ACT_RU_EXECUTION to pass before picking up the job again.

I have the following configurations:

  • asyncExecutorSecondsToWaitOnShutdown is 8 minutes;
  • asyncExecutorAsyncJobLockTimeInMillis is 30 minutes;
    and let's say that my job takes 15 minutes.
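
For context, these timeouts roughly correspond to a process engine configuration along the following lines (a minimal sketch only, assuming a programmatically built StandaloneProcessEngineConfiguration; the JDBC settings are placeholders and the values simply mirror the numbers above):

import org.flowable.engine.ProcessEngine;
import org.flowable.engine.impl.cfg.StandaloneProcessEngineConfiguration;

public class EngineConfigSketch {

    public static ProcessEngine buildEngine() {
        StandaloneProcessEngineConfiguration configuration = new StandaloneProcessEngineConfiguration();
        configuration.setJdbcUrl("jdbc:postgresql://localhost:5432/flowable"); // placeholder connection settings
        configuration.setJdbcUsername("flowable");
        configuration.setJdbcPassword("flowable");

        configuration.setAsyncExecutorActivate(true);
        // wait up to 8 minutes for running jobs before the executor gives up on shutdown
        configuration.setAsyncExecutorSecondsToWaitOnShutdown(8 * 60);
        // lock async jobs for 30 minutes once acquired
        configuration.setAsyncExecutorAsyncJobLockTimeInMillis(30 * 60 * 1000);

        return configuration.buildProcessEngine();
    }
}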

The following events occur:

  • The job is started;
  • Immediately after the job is started, a shutdown on the AsyncExecutor is called;
  • In parallel, a new AsyncExecutor is started;
  • After 8 minutes (asyncExecutorSecondsToWaitOnShutdown), the job is killed and the old AsyncExecutor is down;
  • The job is started again after around 22 minutes by the new AsyncExecutor.

If I go into the DB after the old AsyncExecutor is down and execute UPDATE act_ru_execution SET lock_time_ = NULL WHERE id_ = ?, there is no such 22-minute delay and the job is started immediately by the new AsyncExecutor.
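
Just for illustration, the same manual workaround could be automated with a small helper around plain JDBC (a sketch only; the ExecutionLockCleaner class and the DataSource wiring are my own assumptions, not Flowable API). It issues exactly the UPDATE statement shown above:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

// Illustrative helper: clears the lock_time_ of a single execution, mirroring the manual
// UPDATE described above. The DataSource is assumed to point at the Flowable database.
public class ExecutionLockCleaner {

    private final DataSource dataSource;

    public ExecutionLockCleaner(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public int clearExecutionLock(String executionId) throws SQLException {
        String sql = "UPDATE act_ru_execution SET lock_time_ = NULL WHERE id_ = ?";
        try (Connection connection = dataSource.getConnection();
             PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, executionId);
            return statement.executeUpdate(); // number of rows whose lock was cleared
        }
    }
}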

@Radoslav : That analysis is spot-on, many thanks. It has been fixed here: https://github.com/flowable/flowable-engine/commit/ee11c2bed83a12ba261da3a16fd35db82e0ff084, by introducing the lockOwner on the instance level and clearing it on shutdown.

@joram Thank you very much for the immediate action. Your resolution will fix a serious problem in our application, and we are really eager to know when you will come out with a new release. I will be very thankful if you can provide us with any information.

There’s currently no fixed date for the 6.6.0 version yet. We’re working on some features right now that need some time to settle, so it’s hard to pinpoint when they will land.

Hi @joram, when can we expect a new release including this fix? Because of this issue, some processes get stuck until the async job lock time expires when we restart the application, and this slows down the overall execution of processes.

We’re working on the last bits for the release. We did some major refactorings (e.g. around the UI apps), which need more testing than usual. But we’re close :wink:

Hello again @joram ,

Thank you for the release; I am working on adopting version 6.6.0. However, I still have a problem with the acquisition of jobs by the new executor. What actually happens is that lock_time_ and lock_owner_ in act_ru_execution are not cleared after the async executor is shut down. I saw that there is a ClearProcessInstanceLockTimesCmd command, and I tried to include it in the shutdown logic of the async executor:

import org.flowable.engine.impl.cmd.ClearProcessInstanceLockTimesCmd;
import org.flowable.job.service.impl.asyncexecutor.DefaultAsyncJobExecutor;

public class MtaAsyncJobExecutor extends DefaultAsyncJobExecutor {

    private boolean unlockOwnedExecutions;

    public MtaAsyncJobExecutor() {
        super();
        super.shutdownTaskExecutor = true; // Side note: this cannot be configured from the ProcessEngineConfiguration object
    }

    @Override
    protected void shutdownAdditionalComponents() {
        super.shutdownAdditionalComponents();
        if (unlockOwnedExecutions) {
            unlockOwnedExecutions();
        }
    }

    // Clears the process instance lock times held by this executor's lock owner
    protected void unlockOwnedExecutions() {
        jobServiceConfiguration.getCommandExecutor()
                               .execute(new ClearProcessInstanceLockTimesCmd(getLockOwner()));
    }

    public boolean isUnlockOwnedExecutions() {
        return unlockOwnedExecutions;
    }

    public void setUnlockOwnedExecutions(boolean unlockOwnedExecutions) {
        this.unlockOwnedExecutions = unlockOwnedExecutions;
    }
}
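
For completeness, here is how such a custom executor would then be plugged into the engine before it is built (a sketch under the assumption that the engine configuration is created programmatically; setAsyncExecutor and setAsyncExecutorActivate are the standard hooks for replacing the default executor):

import org.flowable.engine.ProcessEngine;
import org.flowable.engine.impl.cfg.StandaloneProcessEngineConfiguration;

public class MtaEngineBootstrap {

    public static ProcessEngine buildEngine(StandaloneProcessEngineConfiguration configuration) {
        MtaAsyncJobExecutor asyncJobExecutor = new MtaAsyncJobExecutor();
        asyncJobExecutor.setUnlockOwnedExecutions(true);

        // Replace the engine's default async executor with the custom one and activate it
        configuration.setAsyncExecutor(asyncJobExecutor);
        configuration.setAsyncExecutorActivate(true);

        return configuration.buildProcessEngine();
    }
}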

After testing this approach, the code blocks on the execution of the command (ClearProcessInstanceLockTimesCmd) as if there were a deadlock. I also tried running the UPDATE SQL statement from the pgAdmin console (while the application was running), which was also blocked by something and did not execute.
Do you have anything in mind that can help me? Any advice will be highly appreciated.
Many thanks and best regards,
Rado

The ClearProcessInstanceLockTimesCmd is tested on all databases: https://github.com/flowable/flowable-engine/blob/master/modules/flowable-engine/src/test/java/org/flowable/engine/test/jobexecutor/ClearProcessInstanceLocksTest.java so a deadlock sounds really strange.

Could your connection pool have been closed off already at the point where your executor is closed?

Note that the default cmd is executed by ProcessEngineConfigurationImpl#getProcessEngineCloseRunnable(). You could override that method and return your own logic; that would make it consistent with the default order.

Hello @joram ,

Yesterday I got some new findings regarding the case where I experience a deadlock during the ClearProcessInstanceLockTimesCmd. But first, let me explain my scenario again:

  • I have zero-downtime update logic for my Tomcat app, where:
    • the new async job executor is started and begins acquiring jobs;
    • shutdown() is called on the old job executor → this internally shuts down the AsyncTaskExecutor, which in turn shuts down the underlying ExecutorService and awaits its termination;
    • the executor service awaits thread termination but does not actually kill the threads in the pool in which the jobs are executed (as we know, a thread cannot really be killed unless the JVM is killed; see the JDK sketch after this list);
    • the thread in which the job is running continues to live in the background and keeps an open transaction and an exclusive lock on the current execution in the act_ru_execution table → this blocks the execution of the ClearProcessInstanceLockTimesCmd;
    • only if the JVM is killed is the job thread killed, and only then is the DB lock on the exclusive job execution released; after that, if I clear the lock_time_ in act_ru_execution (with pgAdmin), the new job executor acquires the job; otherwise it waits until the lock_time_ passes.
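
To illustrate the point about threads not being killed, here is a minimal, self-contained JDK sketch (an illustration only, not Flowable code): shutdown() merely stops the pool from accepting new work, awaitTermination() waits, and shutdownNow() only interrupts the workers; a task that never checks its interrupt flag keeps running, together with whatever transaction and locks it holds, until the JVM exits.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class StuckWorkerDemo {

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Simulates a long-running job that ignores interruption, similar to a job
        // that holds an open transaction and a row lock in the database.
        Runnable stuckJob = () -> {
            while (true) {
                // busy "work" that never checks Thread.currentThread().isInterrupted()
            }
        };
        pool.submit(stuckJob);

        pool.shutdown();                            // stop accepting new tasks
        pool.awaitTermination(2, TimeUnit.SECONDS); // give the running job a chance to finish
        pool.shutdownNow();                         // only *interrupts* the worker thread

        // The worker is still alive: isTerminated() prints false, and because the pool
        // thread is non-daemon, the JVM itself will not exit until it is killed.
        System.out.println("Pool terminated: " + pool.isTerminated());
    }
}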

Can you confirm my thesis? Do you have any recommendations on how I can introduce the ClearProcessInstanceLockTimesCmd?

EDIT: I debugged the code and it seems that the new job executor goes through ExecuteAsyncRunnable::lockJob, where the FlowableOptimisticLockingException is caught, traced back to here. All of this seems to be handled nicely in the Flowable engine, but what would you suggest for my case? To sum it up:

  • I have a long-running, stuck job executed by the old executor;
  • the old executor goes through shutdown;
  • until the JVM is down, the job cannot really be interrupted, and the DB lock prevents the ClearProcessInstanceLockTimesCmd from executing;
  • the execution of this job keeps its lock_time_ even after the death of the old executor;
  • the new executor cannot grab the job because of the execution lock.

Many thanks and best regards,
Radoslav