We have problems under heavy load (8 requests per second) with our setup. We don't currently know what is causing the problem, but I will describe our observations. We are using Activiti 5.21.
- NullPointerExceptions in logfile
2017-05-08T10:40:34.255 ERROR [pool-4-thread-9] Job 46568299 failed
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
- Jobs in the act_ru_job table with retries_ less than 0 (down to -8)
- These jobs contain the message “JobEntity [id=4431434] was updated by another transaction concurrently”
We have one Activiti engine running, with a configured AsyncJobExecutor. The process itself looks as follows:
3 parallel subprocesses, each containing a Java service task and a user task. The service tasks are all configured with async = true.
Before and after these parallel subprocesses we have parallel gateways to fork/join. The first parallel gateway is also set to async = true.
All elements are also exclusive.
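To make the structure concrete, here is a simplified sketch of what the described process could look like in BPMN XML (all ids, the delegate class, and the use of plain branches instead of embedded subprocesses are my invention; only one of the three parallel paths is shown, and the Activiti 5 extension namespace is assumed):

```xml
<process id="sampleProcess">
  <startEvent id="start" />
  <sequenceFlow id="flow1" sourceRef="start" targetRef="fork" />
  <!-- fork gateway, currently also marked async -->
  <parallelGateway id="fork" activiti:async="true" />
  <!-- one of the three parallel branches; the other two look the same -->
  <serviceTask id="serviceTask1" activiti:class="org.example.MyDelegate"
               activiti:async="true" activiti:exclusive="true" />
  <sequenceFlow id="flow2" sourceRef="fork" targetRef="serviceTask1" />
  <userTask id="userTask1" />
  <sequenceFlow id="flow3" sourceRef="serviceTask1" targetRef="userTask1" />
  <sequenceFlow id="flow4" sourceRef="userTask1" targetRef="join" />
  <parallelGateway id="join" />
  <sequenceFlow id="flow5" sourceRef="join" targetRef="end" />
  <endEvent id="end" />
</process>
```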
I have the feeling that our configuration is not set up well, because 99.9% of the failing jobs are on the first parallel gateway. But currently I simply don't understand why this happens. I read the section about the AsyncJobExecutor as well as the one about exclusive jobs. With the knowledge I gained there, I would expect that such concurrency cannot happen.
Any hint is appreciated.
Did you try this with the latest Flowable 5 release as well?
There were some changes after Activiti 5.21 and it would be good to know if that fixes some of the issues you are seeing.
The "JobEntity was updated by another transaction concurrently" message can be ignored, because this can happen in high-load environments. That jobs have a retry count of less than 0 is odd, because they won't be picked up by the job executor. Could it be that these jobs are being triggered manually?
No, we haven't tried it with a newer version yet. Since we want to switch to Flowable anyway, that would be an option; the plan was to switch later.
The jobs cannot have been triggered manually. During the load test, nobody is allowed to use the system.
Thinking about it, it doesn't happen under high load if everything is fine. But if there are problems (e.g. latency that blocks the threads), the number of jobs in the act_ru_job table grows to, let's say, 30,000. When the execution of jobs then continues, we see this behavior.
I don't understand why it happens, because as I understood the documentation, everything that is exclusive should be handled by one thread to avoid concurrency problems…
Thanks a lot so far, and best regards
We have now tried Flowable 5.23 and encounter the same problems. We still have jobs in the database with retries_ = -8.
Thanks for trying this with Flowable.
The NPE seems to be caused by an execution that could not be found anymore, so it seems that the process instance moved to a different state or even ended in the meantime. When you say this happens when there are latency issues, what kind of delays are we talking about? Is it more than a minute of delay, for example? After some time (off the top of my head, after 5 minutes), jobs that were locked by the job executor but didn't finish will be retried. This could lead to unexpected problems.
Did you do any special job executor configuration for Flowable? Or are you just using the default config values?
We are talking about a delay of 1 second. We have Java service tasks which basically do an HTTP request. For that request we simulate a latency of 1 second before we get the answer back.
Is it possible to see whether the jobs were locked? The lock_exp_time_ and lock_owner_ fields are empty.
I've added a picture of the process definition to make it clearer what it looks like. In the process, everything is exclusive. All Java service tasks are asynchronous too, to make sure they are not executed twice if something goes wrong.
We do some configuration on the job executor, yes:
asyncFailedJobWaitTime = 10
asyncJobLockTimeInMillis = 30000
timerLockTimeInMillis = 30000
keepAliveTime = 0
corePoolSize = 10
maxPoolSize = 10
queueSize = 1000
defaultQueueSizeFullWaitTime = 10000
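For context, here is a sketch of how values like these could be wired in a Spring XML setup (the bean id is my invention, and the property names simply mirror the parameters listed above; the exact setter names may differ slightly between Activiti/Flowable versions):

```xml
<bean id="asyncExecutor"
      class="org.activiti.engine.impl.asyncexecutor.DefaultAsyncJobExecutor">
  <property name="corePoolSize" value="10" />
  <property name="maxPoolSize" value="10" />
  <property name="keepAliveTime" value="0" />
  <property name="queueSize" value="1000" />
  <!-- these two are the values that later turned out to be too low -->
  <property name="asyncJobLockTimeInMillis" value="30000" />
  <property name="timerLockTimeInMillis" value="30000" />
  <property name="defaultQueueSizeFullWaitTime" value="10000" />
</bean>
```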
Looking at the parameters, I think we missed a 0 in each of the LockTimeInMillis parameters. I guess that was not intended to be changed.
I'm on vacation for the next few days, but I will check any recommendations as soon as possible.
Thanks for sharing the diagram.
Your configuration and the 1 second delay should not cause an issue with locked jobs, so I don’t expect this will be the issue.
In your process diagram the boundary error events appear to be non-interrupting (dotted line), is that correct? This would mean that in case of an error the service task will still be retried, but the user task will also be created. Is this the desired behaviour?
In a previous post you said: "I have the feeling that our configuration is not well set, because 99.9% of the failing jobs are on the first parallel gateway". Are you sure it's on the first parallel gateway and not on the second? You can remove the async attribute from the first parallel gateway and add one to the joining parallel gateway with exclusive set to true, because it's important that the parallel gateway join is executed asynchronously for each joining path. For the fork, the async attribute doesn't add anything and only adds complexity.
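In BPMN XML, the suggested change amounts to something like this (gateway ids are placeholders; in Activiti 5 the attributes live in the activiti extension namespace, and exclusive already defaults to true, so it is spelled out here only for clarity):

```xml
<!-- fork: no async attribute needed -->
<parallelGateway id="fork" />

<!-- join: async so each arriving path is handled as an exclusive job -->
<parallelGateway id="join" activiti:async="true" activiti:exclusive="true" />
```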
A database dump would also help to diagnose the issue, if that's possible. A test showing the issue would be even better of course ;-), but I would guess that's difficult to make reproducible.
The boundary error events should be interrupting. So indeed, we have a configuration problem there. How can this be set? We are using the Eclipse Designer to create the processes.
In one of the last runs, we tried it without the async flag on the parallel gateway and had the same problems.
In the latest run with Flowable 5.23 we have the problem on the first Java service task.
I'll provide you a database dump of the latest run with Flowable 5.23. There are not that many failures because we only ran the test for approx. 1.5 hours. I will prepare it and maybe clean it up a bit so only the essential parts are in there.
I can also try to reproduce the problem in a test, but I guess it will be hard to reproduce locally.
I created a database dump with pgadmin and postgresql 9.4. You can find it here: https://drive.google.com/open?id=0B6uRHfCQCSLdVmRoX1lPNFlCUUU
I hope it helps to understand the problem.
Setting the parameters asyncJobLockTimeInMillis and timerLockTimeInMillis to the default value of 300000 (adding the '0' that was accidentally missing from our configuration) helps to avoid the problems.
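In the engine configuration this is just the corrected pair of properties (shown here as Spring-style property entries, assuming a setup where the values are set via XML):

```xml
<!-- default lock time of 5 minutes instead of the accidental 30 seconds -->
<property name="asyncJobLockTimeInMillis" value="300000" />
<property name="timerLockTimeInMillis" value="300000" />
```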
Anyway, we will also change the configuration of our join gateways to be asynchronous.
I still don't know how to configure the error boundary events to be interrupting. When I drag them into the process, they appear with a non-dotted line, but the generated XML looks the same:
<boundaryEvent id="boundaryerror1" name="Error" attachedToRef="MyJavaTask">
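For what it's worth, in BPMN 2.0 the interrupting behaviour is controlled by the cancelActivity attribute on the boundary event, which defaults to true (interrupting); error boundary events are in fact always interrupting per the spec, which would explain why the generated XML looks the same either way. Spelled out explicitly, the element could look like this (the errorEventDefinition child is assumed here, since the original snippet is cut off):

```xml
<!-- cancelActivity defaults to true, i.e. interrupting -->
<boundaryEvent id="boundaryerror1" name="Error"
               attachedToRef="MyJavaTask" cancelActivity="true">
  <errorEventDefinition />
</boundaryEvent>
```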