Bug Selecting Historical Task Instances in 6.6?

I was trying to figure out why a value doesn’t show up in my UI for the list of tasks a user has. The query goes through the history service and queries historic tasks, because the tasks may or may not be completed (and I want to report on both kinds). Here’s the query I originally had:

historyService.createHistoricTaskInstanceQuery().taskCandidateUser(username).taskCreatedAfter(fiscalYearStart).includeProcessVariables().orderByTaskDueDate().asc().list();

I noticed, however, that I have a task which shows up in the ACT_HI_TASKINST table that I can’t seem to pull with this query. The only way I am able to bring in this task is with one of the following additions to the query (bussObjID is a correlation value I store in the workflow so I can look up workflows by that identifier):

.processVariableValueEqualsIgnoreCase("bussObjID", "11111")

or

.unfinished()

so that I have one of these two queries:

historyService.createHistoricTaskInstanceQuery().taskCandidateUser(username).taskCreatedAfter(fiscalYearStart).includeProcessVariables().processVariableValueEqualsIgnoreCase("bussObjID", "11111").orderByTaskDueDate().asc().list();

or

historyService.createHistoricTaskInstanceQuery().unfinished().taskCandidateUser(username).taskCreatedAfter(fiscalYearStart).includeProcessVariables().orderByTaskDueDate().asc().list();

So it’s weird: if I use either of the two workarounds, I get the task I’m looking for (it is an active task), but if I leave them out, the task doesn’t get pulled. I have no idea why this happens, and before I spend time adding .unfinished() to all of my historic task queries that also need to pull live tasks, I figured I would check in here to see whether this is a bug in the 6.6 query APIs or something I’m just not understanding about how to use the historyService (my local instance of the HistoryService class).
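For what it’s worth, the shape of the workaround I’d end up with looks roughly like this: run the historic query twice, once for finished and once for unfinished tasks, and merge the results. This is only a sketch (the merge-by-id helper and the re-sort are mine, not anything from the Flowable API):

```java
import java.util.Comparator;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.flowable.engine.HistoryService;
import org.flowable.task.api.history.HistoricTaskInstance;

public class CandidateTaskLookup {

    private final HistoryService historyService;

    public CandidateTaskLookup(HistoryService historyService) {
        this.historyService = historyService;
    }

    // Workaround sketch: query finished and unfinished tasks separately, then merge.
    public List<HistoricTaskInstance> candidateTasksSince(String username, Date fiscalYearStart) {
        List<HistoricTaskInstance> finished = historyService.createHistoricTaskInstanceQuery()
                .finished()
                .taskCandidateUser(username)
                .taskCreatedAfter(fiscalYearStart)
                .includeProcessVariables()
                .list();

        List<HistoricTaskInstance> unfinished = historyService.createHistoricTaskInstanceQuery()
                .unfinished()
                .taskCandidateUser(username)
                .taskCreatedAfter(fiscalYearStart)
                .includeProcessVariables()
                .list();

        // De-duplicate by task id, then restore the due-date ordering across both result sets.
        Map<String, HistoricTaskInstance> byId = new LinkedHashMap<>();
        finished.forEach(t -> byId.putIfAbsent(t.getId(), t));
        unfinished.forEach(t -> byId.putIfAbsent(t.getId(), t));
        return byId.values().stream()
                .sorted(Comparator.comparing(HistoricTaskInstance::getDueDate,
                        Comparator.nullsLast(Comparator.naturalOrder())))
                .collect(Collectors.toList());
    }
}
```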

I do remember that there was a potential bug in this area (the details are fuzzy). The whole logic has been seriously refactored recently: Rework how we do paging and including data from other tables by filiphr · Pull Request #2892 · flowable/flowable-engine · GitHub. Is there a way you could build the snapshot version and try it on a copy of your data (a copy, so the schema of your existing data doesn’t get migrated)?

In theory, yes. I will have to see if I can pull down that snapshot version and do a compile of it (this will be the first time in a decade of software dev that I’ll be pulling down and compiling from source like this). I’ll spend some time looking at this and let you know.

All the more reason to try it :wink:

git clone the repo and then do mvn -Pdistro clean install -DskipTests and you’ll have all the snapshot jars in your local Maven repository. The distro profile is there to make sure you get all possible jars (if you’re only using the engine, it’s probably not needed).

So I pulled down and compiled from the 6.6.2.2 snapshot and ran it against a copy of my data because it upgrades the schema to 6.6.2.2. The record does come back now in the base query without needing to do any additional steps, but the execution time has nearly doubled. I was looking at 2 min 45 sec return time on the original query before swapping out to 6.6.2.2, and now I’m at 4 min 42 sec with the 6.6.2.2 JARs in place. Not sure I’m a fan of that, though it’s nice to see the data I wanted returning in the result set now.

It’s good that the data is now coming back (which means our refactoring did do the right thing).

Is there any way you could share your data, the rows in the DB, or an explain plan? Something we can use to investigate this further.

One possible reason for the increased duration might be the fact that now all the data is coming back.

Before, the results coming from the DB were capped at 20K rows (with variables etc.). However, this cap is now gone, which means you could be getting far more than 20K rows from the database, and the query takes longer because there is more data to transfer.

Do you have to get the processVariables for the tasks? If the amount of data is the problem, it might speed up the query if you don’t include them and fetch them in a slightly different way.
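Very roughly, this is the kind of thing I mean (only a sketch; the variable name and the per-task lookup are for illustration, and it trades the one big join for a number of small queries):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.flowable.engine.HistoryService;
import org.flowable.task.api.history.HistoricTaskInstance;
import org.flowable.variable.api.history.HistoricVariableInstance;

public class LeanTaskLookup {

    private final HistoryService historyService;

    public LeanTaskLookup(HistoryService historyService) {
        this.historyService = historyService;
    }

    public Map<String, Object> correlationIdsForTasks(String username, java.util.Date fiscalYearStart) {
        // Query the tasks without includeProcessVariables() to keep the main query lean.
        List<HistoricTaskInstance> tasks = historyService.createHistoricTaskInstanceQuery()
                .taskCandidateUser(username)
                .taskCreatedAfter(fiscalYearStart)
                .orderByTaskDueDate().asc()
                .list();

        // Pull only the single variable that is actually needed, per process instance.
        Map<String, Object> bussObjIdByProcessInstance = new HashMap<>();
        for (HistoricTaskInstance task : tasks) {
            HistoricVariableInstance variable = historyService.createHistoricVariableInstanceQuery()
                    .processInstanceId(task.getProcessInstanceId())
                    .variableName("bussObjID")
                    .singleResult();
            if (variable != null) {
                bussObjIdByProcessInstance.put(task.getProcessInstanceId(), variable.getValue());
            }
        }
        return bussObjIdByProcessInstance;
    }
}
```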

I can’t share the data, unfortunately. But I will take a look because your point about getting more records back is valid. If I’m seeing a 100-fold increase in record size, for example, but only a 2x increase in time, then the performance is actually quite good and I just need to reduce the amount of data we are fetching (or make sure to be up front about the performance costs of fetching lots of data).

We do actually use the process variables as part of the final result, and I have a generic function that parses out the process variables into a flattened object structure. I call that parser function after every task/workflow query, so even though I’m not using all the data in this particular instance, it does get used elsewhere.
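The parser itself is nothing fancy; conceptually it does something like this (a simplified sketch, not the real code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.flowable.task.api.history.HistoricTaskInstance;

public final class ProcessVariableFlattener {

    private ProcessVariableFlattener() {
    }

    // Flattens the task's process variables into dot-notation keys (e.g. "order.total"),
    // so the UI layer can treat every task the same way regardless of workflow shape.
    // Only useful when the query was run with includeProcessVariables().
    public static Map<String, Object> flatten(HistoricTaskInstance task) {
        Map<String, Object> flat = new LinkedHashMap<>();
        flatten("", task.getProcessVariables(), flat);
        return flat;
    }

    @SuppressWarnings("unchecked")
    private static void flatten(String prefix, Map<String, Object> source, Map<String, Object> target) {
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            String key = prefix.isEmpty() ? entry.getKey() : prefix + "." + entry.getKey();
            if (entry.getValue() instanceof Map) {
                flatten(key, (Map<String, Object>) entry.getValue(), target);
            } else {
                target.put(key, entry.getValue());
            }
        }
    }
}
```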

So it may not be as bad as all that. I looked at the raw statistics for the API test, and running the 6.6.0 code:

  • 573 records
  • 1 min, 13.36 sec response time

When I switched over to the 6.6.2.2 branch and re-ran the exact same code against the copy of the data set, I got these statistics:

  • 1,058 records (1.84x increase)
  • 2 min, 14.94 sec response time (1.92x increase)

Given that I saw 1.84x on the record count and 1.92x on the response time, it looks like the query performance is scaling linearly, or close to it. What’s stranger is that network latency seems to be playing a bigger role than the query itself, given that both of these runs came back much faster than the numbers I posted earlier.

The upgrade to 6.6.2.2 has solved my original problem with missing data.

I did a deeper dive into the scaling and performance statistics to make sure the changes really do scale linearly. I did a bunch of different runs and came out with the same result: if the record set didn’t change in size at all, I saw only a tiny increase in response times (1-4%), which I don’t consider significant.

For two of my data runs, I did see a sizeable jump in response times, but when I look at the amount of data that came back, there’s an almost perfect 1:1 ratio between the increase in the amount of data and the increase in the response time to get that data back. In other words, the Flowable query I have scales linearly. I will still want to talk with my requirements team about what they want to do, because there are some situations where the app is just going to be too slow to be useful (I now have a call that takes over 6 min to return because it has to chew through 3,208 records), but that is not exactly a bad problem to have (previously I was only getting 575 records back).

It is good to hear that the new changes do not have a significant performance impact on such complex queries as yours.

A note regarding your logic: apart from taking 6 min, I would guess it is also consuming more memory now. 3,208 records is not that big a jump compared to 575 records. However, by including variables, the amount of data loaded from the database is most likely quite large.

A possible solution would be for you to split the processing across different threads and use paging. With the new changes, paging should work smoothly.
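Something along these lines (a rough sketch; the page size and the processing hook are placeholders you would replace with your own logic):

```java
import java.util.Date;
import java.util.List;

import org.flowable.engine.HistoryService;
import org.flowable.task.api.history.HistoricTaskInstance;
import org.flowable.task.api.history.HistoricTaskInstanceQuery;

public class PagedTaskProcessor {

    private static final int PAGE_SIZE = 200; // placeholder; tune to your data

    private final HistoryService historyService;

    public PagedTaskProcessor(HistoryService historyService) {
        this.historyService = historyService;
    }

    public void processAll(String username, Date fiscalYearStart) {
        HistoricTaskInstanceQuery query = historyService.createHistoricTaskInstanceQuery()
                .taskCandidateUser(username)
                .taskCreatedAfter(fiscalYearStart)
                .includeProcessVariables()
                .orderByTaskDueDate().asc();

        int firstResult = 0;
        List<HistoricTaskInstance> page;
        do {
            // listPage fetches one slice of the result set at a time instead of everything at once.
            page = query.listPage(firstResult, PAGE_SIZE);
            page.forEach(this::handleTask); // hand each page (or task) off to your own processing/threads
            firstResult += PAGE_SIZE;
        } while (page.size() == PAGE_SIZE);
    }

    private void handleTask(HistoricTaskInstance task) {
        // placeholder for the flattening/parsing you already do
    }
}
```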

Another possible solution would be to reduce the amount of data you are loading by providing a list of the variables you want to load. This is currently not supported in the API; however, you could provide a PR with such functionality.

I may look at modifying my custom parsing, because I only need 2 data points from the process variables for the situation that’s taking so long. They’re static values (set when the process starts up), so I could copy them down into task-local variables and only load the task locals (a much smaller data set; tasks have at most 4-6 variables, all of which are useful), which should improve my running speed. I’ll tinker with it and see what I get. I’m also going to look at whether it’s possible to shrink the timespan of records (right now it’s all records for a given fiscal year, which at this point is months of old data; I’m not sure how useful that is, and anything I can do to shrink the record set size is going to get me better performance).
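For my own notes, the rough shape of the task-local idea, assuming a create-event TaskListener wired onto the user tasks (the listener class name and the second variable are just illustrative):

```java
import org.flowable.engine.delegate.TaskListener;
import org.flowable.task.service.delegate.DelegateTask;

// Registered on the user tasks with event="create"; copies the process-level values
// I actually need down into task-local variables, so the lookup query can swap
// includeProcessVariables() for includeTaskLocalVariables() and load far less data.
public class CopyCorrelationToTaskLocal implements TaskListener {

    @Override
    public void notify(DelegateTask delegateTask) {
        delegateTask.setVariableLocal("bussObjID", delegateTask.getVariable("bussObjID"));
        delegateTask.setVariableLocal("fiscalYear", delegateTask.getVariable("fiscalYear")); // illustrative second value
    }
}
```

The read side would then use .includeTaskLocalVariables() on the historic task query instead of .includeProcessVariables().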