Checking Process Engine health

So I have a process engine instance that gets set up with a dbcp2 BasicDataSource (based on the Flowable docs for how to set up a pooled datasource and link the process engine to a pooled connection). I have some custom REST endpoints that use the Java APIs so I can get some custom outputs on my running tasks and processes.

My getTasks endpoint does a task query of all the open tasks in the system. When I first hit it after bringing my app server online (this process initializes the process engine), the getTasks response time is a little bit slow, on the order of 1-8 seconds (the variance seems mostly due to the connection speed between the app server and database server).

Then my app server and connection pool get wise and start caching, so the response times bottom out around 500ms or so (which I’m happy about). I let the app sit idle for a few hours. My response time now goes anywhere from 1.5 seconds (which was my most recent attempt), to upwards of 3 minutes (what I had happen yesterday).

Is there a way for me to test the process engine and/or some kind of health check? In particular any sort of function or command that would check that my connection to the database is still alive and well, so that I could ensure my task query doesn’t take 3 minutes to respond to me. Or is this more a matter of I need to test the datasource and then return an engine using that datasource?

Hey Jeff,

Which DB are you using if I can ask?

When I first hit it after bringing my app server online (this process initializes the process engine), the getTasks response time is a little bit slow, on the order of 1-8 seconds (the variance seems mostly due to the connection speed between the app server and database server).

Isn’t the engine being initialised when you start your application? Usually the engine is initialised when the application starts (that takes a few seconds), so it should not add any time to a query made later through the Java API.

Then my app server and connection pool get wise and start caching, so the response times bottom out around 500ms or so (which I’m happy about). I let the app sit idle for a few hours. My response time now goes anywhere from 1.5 seconds (which was my most recent attempt), to upwards of 3 minutes (what I had happen yesterday).

This really should not affect the engine in a big way. If you are using a connection pool the engine just uses this, it does nothing in particular with it. Can you share the settings for your connection pooling? What’s the max, min connections, the lifetime of a connection, etc.

Is there a way for me to test the process engine and/or some kind of health check? In particular any sort of function or command that would check that my connection to the database is still alive and well, so that I could ensure my task query doesn’t take 3 minutes to respond to me. Or is this more a matter of I need to test the datasource and then return an engine using that datasource?

The engine just uses the datasource to get a connection, so I would say it is more you need to make sure that your pooling is OK.
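A minimal sketch of what such a pool check could look like, assuming the same javax.sql.DataSource that is handed to the engine is also reachable from the endpoint code. The class and method names (DataSourceHealth, isHealthy) are made up for illustration; Connection.isValid(int) is the standard JDBC call for asking the driver whether a connection is still usable.

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

// Hypothetical helper, not part of Flowable or DBCP.
final class DataSourceHealth {
    private DataSourceHealth() {}

    /**
     * Borrows one connection from the pool and asks the JDBC driver to validate it.
     * The timeout only bounds the validation itself; getConnection() can still
     * block for as long as the pool's own max-wait setting allows.
     */
    static boolean isHealthy(DataSource dataSource, int timeoutSeconds) {
        try (Connection connection = dataSource.getConnection()) {
            return connection.isValid(timeoutSeconds);
        } catch (SQLException | RuntimeException e) {
            // Could not borrow or validate a connection: report unhealthy.
            return false;
        }
    }
}
```

A health endpoint could call DataSourceHealth.isHealthy(dataSource, 5) before running the task query, and report the pool as unhealthy instead of waiting on a dead connection.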

If you don’t mind me asking, what does your task query look like? How many tasks / processes do you have in your application? Are you doing queries where you include the task / process variables and / or query with variable equals, like, etc.?

Cheers,
Filip

This is against an Oracle 12c database, using org.apache.commons.dbcp2.BasicDataSource with its default settings (I didn’t change any of them). I’m just setting the connection string on the datasource and then feeding this datasource into the process engine configuration using the setDataSource() function.

My task query is just using taskService.createTaskQuery().list(), where taskService is obtained using the processEngine.getTaskService() method, shown below (this is ColdFusion code, translatable to Java by just using the proper object definition syntax).

var taskService = application.processEngine.getTaskService();
var results = taskService.createTaskQuery().list();

There are around 105 process instances, each with a user task so my tasks are returning around 105 records, give or take a few.

I’m using the Java APIs because I’m actually wrapping the Java within ColdFusion, so when I start the application I have to establish the process engine config and build the process engine. The resulting engine is then stored into an application level variable so I can reference it where I need to instead of having to constantly rebuild the engine.

Hey Jeff,

From the info you have provided, it seems like the bottleneck is the database, more precisely the pool. I just checked https://commons.apache.org/proper/commons-dbcp/configuration.html and it seems that the defaults are quite low (0 initial connections, 8 max connections, and waiting indefinitely for a connection). How many requests are hitting your endpoints?

I would suggest that you have a look at the defaults in the Flowable applications

Those are the Spring Boot properties. For you this would mean:

  • minimumIdle - Minimum number of idle connections in the pool. In dbcp: minIdle
  • maximumPoolSize - The maximum number of connections in the pool. In dbcp: maxTotal

We also have connectionTimeout which is not in the properties, and it is 30s. This would be maxWaitMillis in dbcp. I would suggest that you increase the settings for your pool and see if it is better.
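As a rough sketch of how those settings could be expressed for DBCP 2, the snippet below collects them in a java.util.Properties object using DBCP's documented property names. Only the 30 second wait comes from this thread; the pool sizes, the validation settings, the driver class, and the PoolSettings class itself are illustrative assumptions to be tuned for your load.

```java
import java.util.Properties;

// Hypothetical helper; every numeric value except maxWaitMillis is an assumption.
final class PoolSettings {
    private PoolSettings() {}

    static Properties dbcpProperties(String jdbcUrl, String username, String password) {
        Properties props = new Properties();
        props.setProperty("url", jdbcUrl);
        props.setProperty("username", username);
        props.setProperty("password", password);
        props.setProperty("driverClassName", "oracle.jdbc.OracleDriver");

        // Pool sizing (illustrative values, not the Flowable defaults).
        props.setProperty("initialSize", "5");
        props.setProperty("minIdle", "5");
        props.setProperty("maxTotal", "50");

        // Fail after 30 s instead of waiting indefinitely for a free connection.
        props.setProperty("maxWaitMillis", "30000");

        // Validate connections so stale ones are dropped after idle periods.
        props.setProperty("validationQuery", "SELECT 1 FROM DUAL");
        props.setProperty("validationQueryTimeout", "5"); // seconds
        props.setProperty("testOnBorrow", "true");
        props.setProperty("testWhileIdle", "true");
        props.setProperty("timeBetweenEvictionRunsMillis", "60000");
        return props;
    }
}
```

With commons-dbcp2 on the classpath, these properties can be passed to org.apache.commons.dbcp2.BasicDataSourceFactory.createDataSource(props), and the resulting datasource given to the engine configuration as before. The same values can also be set through the matching setters on BasicDataSource.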

I am not sure if ColdFusion can have some impact as well. I have never used it so I can’t help with that.

One other question: are the times you mentioned really only for executing the task query, or for the entire request including your logic?

Cheers,
Filip

I’ll tune the pooling based on your advice and let you know if that alleviates the delay spikes.

The response time measurements are coming out of SoapUI’s load test runner. The experiment I set up was to have 1, 5, 10, and 100 concurrent users. Each experiment was run for 1, 10, and 100 total requests per user. Then I’m just examining the growth rate for the average response time to see if there’s a serious problem in there. I’m expecting linear growth rates because my current REST service does a for loop over the results obtained by the task query API. So what I am looking at in my response times is for problems like this, where my response time curves suddenly had a massive spike in the values.

So what I am looking at in my response times is for problems like this, where my response time curves suddenly had a massive spike in the values.

And I suppose the spike occurs for 100 users?

If you can add some logging to your method and log the time the Flowable query takes to return, it would be easier for you to pinpoint where exactly the problem is. I would also suggest that you warm up the server (a few thousand calls) before doing the actual measurements. This would allow the Java JIT to kick in.
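One way to separate the query time from the rest of the request is a small timing wrapper around the call. This is only a sketch: the Timed class is a hypothetical helper, and it measures wall-clock time with System.nanoTime(), which includes any time spent waiting for a pooled connection.

```java
import java.util.function.Supplier;

// Hypothetical helper: holds a result together with how long it took to produce.
final class Timed<T> {
    final T result;
    final long elapsedMillis;

    private Timed(T result, long elapsedMillis) {
        this.result = result;
        this.elapsedMillis = elapsedMillis;
    }

    /** Runs the action once and records the elapsed wall-clock time in milliseconds. */
    static <T> Timed<T> call(Supplier<T> action) {
        long start = System.nanoTime();
        T result = action.get();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Timed<>(result, elapsedMillis);
    }
}

// Usage with the query from this thread (needs Flowable on the classpath):
//   Timed<List<Task>> timed = Timed.call(() -> taskService.createTaskQuery().list());
//   log the value of timed.elapsedMillis, or add it to the response payload
```

Comparing elapsedMillis against the total request time reported by the load test shows whether a spike comes from the query (and the pool behind it) or from the surrounding endpoint logic.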

The spikes are based on the idle time, not the number of users as far as I can tell. It is worse for large user hits (100 users has the worst spikes), which suggests that I’m probably overloading the connection pool with lots of concurrent connection requests. I will try the logging approach and check when the requests come in and out to see what the response times actually are. I can easily return the API response time as part of the payload (start and stop a timer right around the command) and then attach that info into the response JSON to see if there’s an obvious difference.

And as you noted, warming up the server definitely helps. When I keep the server warm by pinging it periodically, I get gradually faster average response times as the request count piles up.
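For reference, the periodic ping can also live inside the application instead of an external client. The sketch below uses a JDK ScheduledExecutorService; the KeepWarm class is a made-up name, and what the ping actually does (for example a cheap task query or a connection validation) is left to the caller.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper that runs a ping at a fixed delay on a daemon thread.
final class KeepWarm implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread thread = new Thread(runnable, "keep-warm");
            thread.setDaemon(true); // do not block application shutdown
            return thread;
        });

    KeepWarm(Runnable ping, long period, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                ping.run();
            } catch (RuntimeException e) {
                // Swallow failures so one bad ping does not cancel the schedule.
            }
        }, 0, period, unit);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}

// Usage, assuming a taskService as in this thread:
//   KeepWarm warm = new KeepWarm(() -> taskService.createTaskQuery().count(), 5, TimeUnit.MINUTES);
```

This keeps connections in use during idle periods, but it only hides the symptom; pool validation and eviction settings are still what decide how stale connections are handled.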