ACT_RU_JOB jobs are not executed

Hello Flowable team,

When our service is under heavy load, we notice performance degradation between Flowable service task executions.

After checking the ACT_RU_JOB table, I noticed that many jobs do not have LOCK_OWNER_ and LOCK_EXPIRATION_TIME_. From the forum, I read that the acquisition thread does not enforce an order when assigning jobs to the Flowable async executor [1].

I suspect that this thread cannot handle the high number of jobs, leading to jobs either not being executed promptly or experiencing significant delays, which causes major slowdowns in our processes.

Our service contains numerous service tasks and parallel call activities, potentially resulting in thousands of jobs running in parallel [2]. It operates on five VMs (Cloud Foundry instances), each running a Flowable async executor, and uses a PostgreSQL database.

Is there a way to allocate more threads so that job assignment is not handled by a single thread?

What else can I check or improve to prevent job starvation?

Is there a way to enforce job assignment in the order they are created, so delays are at least more predictable rather than random?
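Conceptually, what we would like is for acquisition to hand out jobs oldest-first. A toy sketch of that ordering (plain Java, not Flowable code; the `Job` record and method names are made up for illustration):

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FifoAcquisitionSketch {

    // Hypothetical job row: only the fields relevant to ordering.
    record Job(String id, Instant createTime) {}

    // Acquire up to 'blockSize' jobs, oldest first, so delays stay predictable.
    public static List<Job> acquireOldestFirst(List<Job> unlockedJobs, int blockSize) {
        return unlockedJobs.stream()
                .sorted(Comparator.comparing(Job::createTime))
                .limit(blockSize)
                .toList();
    }

    public static void main(String[] args) {
        List<Job> jobs = new ArrayList<>();
        jobs.add(new Job("c", Instant.parse("2024-01-01T10:02:00Z")));
        jobs.add(new Job("a", Instant.parse("2024-01-01T10:00:00Z")));
        jobs.add(new Job("b", Instant.parse("2024-01-01T10:01:00Z")));

        List<Job> acquired = acquireOldestFirst(jobs, 2);
        if (!acquired.get(0).id().equals("a") || !acquired.get(1).id().equals("b")) {
            throw new AssertionError("expected oldest-first order, got " + acquired);
        }
        System.out.println("acquired oldest first: " + acquired);
    }
}
```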

Flowable config: multiapps-controller/multiapps-controller-web/src/main/java/org/cloudfoundry/multiapps/controller/web/configuration/FlowableConfiguration.java at master · cloudfoundry/multiapps-controller · GitHub

Best regards,
Ivan

[1] Job acquisition order
[2] multiapps-controller/multiapps-controller-process/src/main/resources/org/cloudfoundry/multiapps/controller/process/xs2-bg-deploy.bpmn at master · cloudfoundry/multiapps-controller · GitHub

Flowable: 6.8.0

If you have job rows with no lock owner, it typically means the internal queue of the node the job was created on is full. The job is then inserted without a lock owner, so other nodes can pick it up.
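In other words, the insert decision behaves roughly like this (a simplified stdlib sketch of the idea, not the actual Flowable code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class UnlockedInsertSketch {

    // On job creation: try to keep the job on the local node's bounded
    // internal queue; if that queue is full, insert the row without a
    // lock owner so any node's acquisition thread can pick it up later.
    public static void insertJob(String jobId, BlockingQueue<String> internalQueue,
            List<String> unlockedRows) {
        if (!internalQueue.offer(jobId)) {
            unlockedRows.add(jobId);
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> internalQueue = new ArrayBlockingQueue<>(2);
        List<String> unlockedRows = new ArrayList<>();
        for (int i = 1; i <= 5; i++) {
            insertJob("job-" + i, internalQueue, unlockedRows);
        }
        // Queue capacity is 2, so 3 of the 5 jobs end up without a lock owner.
        if (internalQueue.size() != 2 || unlockedRows.size() != 3) {
            throw new AssertionError("unexpected sizes");
        }
        System.out.println("queued locally: " + internalQueue + ", unlocked: " + unlockedRows);
    }
}
```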

The acquisition thread is typically not the problem; most likely you need more execution threads to work through the jobs on the internal queue faster.
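For example, execution capacity is raised through the engine configuration rather than on the acquisition side. A sketch, assuming the standard `ProcessEngineConfigurationImpl` setters; verify the names against the exact Flowable version you run:

```java
// Sketch: give the async executor more execution threads and a deeper queue.
processEngineConfiguration.setAsyncExecutorCorePoolSize(32);
processEngineConfiguration.setAsyncExecutorMaxPoolSize(64);
processEngineConfiguration.setAsyncExecutorThreadPoolQueueSize(2048);
// Acquire more jobs per query so the single acquisition thread stays ahead.
processEngineConfiguration.setAsyncExecutorMaxAsyncJobsDuePerAcquisition(64);
```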

Here’s a series of articles that describe the architecture of the acquiring: Handling asynchronous operations with Flowable – Part 1: Introducing the new Async Executor

What are your current settings, i.e. thread pool sizes, acquisition block size, etc.?

Hi joram,
Thank you for the details!

I am testing with different configurations.
Global lock is disabled.
Acquisition size is 1, for both async and timer jobs.
I have tested with thread pools of:
taskExecutor.setQueueSize(Short.MAX_VALUE); // 32767
taskExecutor.setCorePoolSize(1024);
taskExecutor.setMaxPoolSize(1024);
→ DefaultAsyncTaskExecutor is used
→ AsyncJobAcquireWaitTime → 3 seconds

But I still see the same issue: the jobs are added without a lock owner, and the number of such jobs exceeds 1000 in some cases.
I believe the queue is not full: I monitor it during process execution, and even with the queue set to a very large number the result is the same.

Our usual production settings are 64 core threads, 96 max threads, and a queue size of 4, but the same issue is observed with a very large executor and queue.
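For reference, with a queue of only 4 the pool saturates very quickly under the standard JDK semantics; a miniature stdlib demonstration (plain `java.util.concurrent`, not Flowable code):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SmallQueueDemo {

    // Submit 'submissions' blocking tasks and count how many are rejected.
    public static int countRejections(int core, int max, int queueCap, int submissions) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 1, TimeUnit.SECONDS, new ArrayBlockingQueue<>(queueCap));
        int rejected = 0;
        for (int i = 0; i < submissions; i++) {
            try {
                pool.execute(() -> {
                    try { release.await(); } catch (InterruptedException ignored) { }
                });
            } catch (RejectedExecutionException e) {
                rejected++;
            }
        }
        release.countDown();
        pool.shutdown();
        try { pool.awaitTermination(5, TimeUnit.SECONDS); } catch (InterruptedException ignored) { }
        return rejected;
    }

    public static void main(String[] args) {
        // Miniature version of "core 64, max 96, queue 4": core 2, max 3, queue 2.
        // 2 tasks run on core threads, 2 sit in the queue, 1 runs on the extra
        // thread, and the 6th submission is rejected, which is the point where
        // Flowable would insert the job row without a lock owner instead.
        System.out.println("rejected: " + countRejections(2, 3, 2, 6));
    }
}
```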

I’ve read the “async flag and async executor” article. In our service we noticed a performance degradation when the global lock was enabled, so we’re still using the ‘previous’ version of the executor (the default settings of Flowable 6.6.0, before the global lock was introduced) with an acquisition size of 1.

Best regards,
Ivan

That is very odd, as in all the benchmarks we’ve conducted over the last years, the global lock approach always beats the old algorithm by a significant margin.

With a large queue size, this shouldn’t happen, so it makes me wonder whether the properties you are setting are applied in the correct place. I ran a benchmark for a customer last month that executed millions of jobs in an hour, so 1000 is by no means a large number.

Do you have anything that you can share so we can reproduce what you are seeing?

Hello Joram,

Sorry for the late reply.
I have been experimenting with different Flowable settings, and the best results that I observe are with these settings:
• MaxAsyncJobsDuePerAcquisition = 1
• MaxTimerJobsPerAcquisition = 1
• GlobalAcquireLockEnabled = false
• ParallelMultiInstanceAsyncLeave = false
• CorePoolSize / MaxPoolSize does not really matter; the results are similar with 64 or 512 threads. The queue size also does not matter, since it is never full.

I tried the global lock once again, and the results are still worse.
I tried increasing MaxAsyncJobsDuePerAcquisition and MaxTimerJobsPerAcquisition (from 1 up to 50), with and without the global lock, and no value performs better than 1.
I also increased the core and max pool sizes and the queue size, but these settings do not change the results either.
Finally, I tried to “add” more acquisition threads using an extension of the DefaultAsyncJobExecutor:

public class ThreadedAsyncJobExecutor extends DefaultAsyncJobExecutor {

    private List<AcquireAsyncJobsDueRunnable> asyncJobAcquisitionRunnables;
    private List<AcquireTimerJobsRunnable> timerJobAcquisitionRunnables;

    @Override
    protected void startJobAcquisitionThread() {
        super.startJobAcquisitionThread();
        // Start 30 extra async job acquisition threads next to the default one.
        asyncJobAcquisitionRunnables = new ArrayList<>(30);
        for (int i = 0; i < 30; i++) {
            JobInfoEntityManager<? extends JobInfoEntity> jobEntityManagerToUse = jobEntityManager != null ? jobEntityManager
                : jobServiceConfiguration.getJobEntityManager();
            AcquireAsyncJobsDueRunnable asyncRunnable = new AcquireAsyncJobsDueRunnable("flowable-acquire-async-jobs-" + (i + 1),
                this, jobEntityManagerToUse, asyncJobsDueLifecycleListener, new AcquireAsyncJobsDueRunnableConfiguration());
            asyncJobAcquisitionRunnables.add(asyncRunnable);
            Thread thread = new Thread(asyncRunnable);
            thread.setName("flowable-acquire-async-jobs-" + (i + 1));
            thread.start();
            LOGGER.info("Started job acquisition thread: {}", thread.getName());
        }
    }

    @Override
    protected void startTimerAcquisitionThread() {
        super.startTimerAcquisitionThread();
        // Same idea for the timer job acquisition threads.
        timerJobAcquisitionRunnables = new ArrayList<>(30);
        for (int i = 0; i < 30; i++) {
            AcquireTimerJobsRunnable timerJobRunnable = new AcquireTimerJobsRunnable(this, jobServiceConfiguration.getJobManager(),
                timerLifecycleListener, new AcquireTimerRunnableConfiguration(), configuration.getMoveTimerExecutorPoolSize());
            timerJobAcquisitionRunnables.add(timerJobRunnable);
            Thread thread = new Thread(timerJobRunnable);
            thread.setName("flowable-acquire-timer-jobs-" + (i + 1));
            thread.start();
            LOGGER.info("Started timer job acquisition thread: {}", thread.getName());
        }
    }
}

This, again, did not lead to better performance (it was actually worse), even though it seems like the jobs in ACT_RU_JOB are not getting executed for a long time.
I also checked the database queries, and there are no slow ones: all of them complete in under 900 ms, and most in under 200-300 ms.

“If you have job rows with no lock owner, it typically means the internal queue of the node it was created on is full. It is inserted without lock owner, so other nodes can pick it up.”

Does this apply to “async = true” service tasks, since all of our jobs are async? I see that all of the jobs are inserted without LOCK_OWNER_ and LOCK_EXPIRATION_TIME_.

Also, one more question regarding the REV_ counter. As far as I understand, it should be incremented when the job is executed and is used for optimistic locking in update queries. Is this correct? I’ve seen that for some jobs the counter is 8 or 9; why is that?
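To make sure I understand the mechanism I’m asking about: version-column optimistic locking is essentially a compare-and-set, so every successful update of the row (acquire, unacquire, retry, and so on), not just execution, would bump REV_. A toy sketch of that pattern (not Flowable code):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class OptimisticLockSketch {

    // Toy stand-in for a job row with a REV_ column.
    public static class JobRow {
        public final AtomicInteger rev = new AtomicInteger(1);
        public String lockOwner;
    }

    // Mimics: UPDATE ... SET REV_ = REV_ + 1 WHERE ID_ = ? AND REV_ = ?
    // The update succeeds only if nobody changed the row since we read it.
    public static boolean update(JobRow row, int expectedRev, String newOwner) {
        if (row.rev.compareAndSet(expectedRev, expectedRev + 1)) {
            row.lockOwner = newOwner;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        JobRow row = new JobRow();
        // Each lifecycle step is a separate update, so REV_ can climb
        // well past 1 for a single job before it ever executes.
        update(row, 1, "node-1");                 // acquired
        update(row, 2, null);                     // unacquired again
        update(row, 3, "node-2");                 // re-acquired elsewhere
        boolean stale = update(row, 1, "node-3"); // stale read: fails
        System.out.println("rev=" + row.rev.get() + ", staleUpdateSucceeded=" + stale);
    }
}
```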

In our setup, all the jobs are async, and we have a lot of inner call activities that trigger more parallel processes, which in turn trigger other parallel call activities. They are usually not slow, but there might be some async service tasks that take a few seconds.

We usually start to notice degradation after 15-20 minutes of our performance tests, after ~200 parallel processes like this [1] start and run continuously.
If I run 10 parallel processes (less load), this issue is not observed.

I can provide you access to our Cloud Foundry account and, respectively, the database if that can help with the investigation, or even set up a meeting to show our setup if you think this is more appropriate.

[1] multiapps-controller/multiapps-controller-process/src/main/resources/org/cloudfoundry/multiapps/controller/process/xs2-deploy.bpmn at master · cloudfoundry/multiapps-controller · GitHub

Best regards,
Ivan