Stuck Process Instances - How to Find them?

It happens pretty rarely, but on 3 separate occasions I have had a workflow that appears to have gotten stuck before it completed properly.

In the first instance, the cause of the sticky workflow turned out to be that the server got rebooted/went down. Since the server runs the workflow engine, it’s pretty clear that the server went down, the workflow never had a chance to finish, and it just got stuck where it was.

The second instance showed a similar pattern, but there was no reported server outage. I went into the ACT_RU_ACTINST table and could see that it paused right at the point where it was going to invoke an HTTP service call, but no record of the HTTP service call existed in either ACT_RU_ACTINST or ACT_HI_ACTINST. That said, I could see that the server call was invoked by Flowable, because the data was altered in such a way as to indicate the call did indeed happen and the changes made by the web call were committed.

In the third situation, the workflow is paused on a user task, but I have a record in ACT_HI_VARINST that shows the user approved the task. I also have data showing that the HTTP services that fire after the task is approved were invoked (because the data was edited). Yet I have no history in either ACT_RU_ACTINST or ACT_HI_ACTINST showing that the task was completed or that a failure occurred. It’s like Flowable just decided to stop the transaction and roll it back without a clear indication of why. I have no reported server outage in this situation, either.

My question, as I dig into how this happened: is there a way to recover the history of these workflows somewhere? Is the transaction that’s about to be committed stored somewhere I can view it, or is it totally rolled back and lost to the ether? Is there a setting I can turn on so I can see failed transactions in a history table somewhere?

Alternatively, is there a way to get Flowable to resume workflows that were suspended when the engine got stalled? If I could get Flowable to pick up where it left off on these workflows, I wouldn’t have a problem if the server went down on me.

For reference, I’m running Flowable version 6.6.2.2 (I compiled from source between the 6.6 and 6.7 builds).

Hey @jeff.gehly,

What you are explaining sounds like a really strange scenario. Flowable doesn’t do any special magic for storing state. It uses the Database to persist things. If something is missing in the database it means that there were most likely some exceptions after the HTTP task and / or the completion of the user task.

Did you see some error logging in your logs?

For your HTTP task, is that an async HTTP task? If it is, then I would suggest checking the dead letter table.

For the User Task, it is possible that whatever was supposed to be executed after the user task threw an exception. If this exception led to an HTTP 4xx then nothing would have been logged. If it was HTTP 500 then you should see some error logging here.

If the application stops in the middle of an execution then Flowable will continue from its last known state. This means that an HTTP task might be executed again. In the upcoming release we are adding functionality that would allow you to configure an async exit of an activity, e.g. once the HTTP call is done and it returns, Flowable is going to commit the transaction and create a new job to execute the rest of the flow.
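To make the "last known state" point concrete, here is a toy model in plain Java. It is not Flowable code, and `RollbackSketch`, `completeUserTask`, `committedActivity` and `remoteData` are made-up names for illustration. It only sketches the general idea that engine state is committed at the end of a transaction, while a synchronous HTTP call is a side effect outside that transaction and stays in place if a later step fails.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: all names are hypothetical and nothing here is Flowable API.
class RollbackSketch {
    /** Stands in for the remote system changed by the HTTP call; it is not part of the transaction. */
    final List<String> remoteData = new ArrayList<>();

    /** Stands in for the engine state that was last committed to the database. */
    String committedActivity = "userTask";

    /**
     * Runs everything after the user task as one unit of work.
     * The new position is committed only if every step succeeds.
     */
    void completeUserTask(boolean failAfterHttpCall) {
        try {
            remoteData.add("approved"); // side effect of the HTTP call, cannot be rolled back
            if (failAfterHttpCall) {
                throw new IllegalStateException("failure in a step after the HTTP call");
            }
            committedActivity = "end"; // commit: the new state becomes visible
        } catch (IllegalStateException e) {
            // Rollback: committedActivity is untouched, but remoteData keeps the change.
        }
    }

    public static void main(String[] args) {
        RollbackSketch sketch = new RollbackSketch();
        sketch.completeUserTask(true);
        // prints "userTask [approved]"
        System.out.println(sketch.committedActivity + " " + sketch.remoteData);
    }
}
```

Calling `completeUserTask(false)` on the same object afterwards moves the state to `end` and adds a second `approved` entry, which is the "HTTP task might be executed again" case. Whether this is what happened in the reported instances is not something the sketch can show; it only illustrates one way the observed combination of changed data and unchanged workflow state can arise.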

Cheers,
Filip

Yeah, it’s a head scratcher. It’s been identified a total of 5 times now, in thousands of successful workflow executions, so it falls into the category of “Very rare but high impact” problems for me to figure out. To answer your questions (in order):

The HTTP tasks are synchronous. The idea here was that if the HTTP task threw a 400 or 500, even if I didn’t have error logs, I would still be able to see something because I’d have entries in the history table indicating the user task was completed but no side effects to the data because the web call failed. In one of the examples above, I have what looks like a completed HTTP task but the ACT_HI_ACTINST table only shows the wire leading up to the HTTP task and nothing beyond that point (the call itself must have succeeded or the data wouldn’t have been acted upon, so the failure point must be a wire after the HTTP task yet I see nothing to indicate that would be the case).

The logs are mysteriously quiet. I’m not seeing any HTTP error status values (400 or 500) on the web server, nor is Flowable dumping any explosions out to the server error log that would indicate something went wrong processing one of the activity nodes in the workflow diagram (by node I really mean any of the various things you put into a workflow diagram like tasks, sequence flow wires, etc.).

All I can think of is to add a detection/prevention mechanism where I write out a record when a task is completed and another when the next user task or the workflow end is reached, so that I can detect/prevent the situation where the workflow gets out of sync with the data. I may have to go down this rabbit hole because, while it’s a very rare situation, the impact is high.
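A rough sketch of that detection idea, in plain Java with no Flowable dependency. `CheckpointLedger`, `taskCompleted`, `waitStateReached` and `findSuspects` are hypothetical names, and the in-memory map stands in for what would presumably be a table of my own. One assumption worth stating: the records would need to be written outside the engine’s transaction, otherwise they would be rolled back together with the workflow state.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: tracks task completions that never reached the next checkpoint.
class CheckpointLedger {
    // processInstanceId -> time at which a task completion was requested
    private final Map<String, Instant> pendingCompletions = new HashMap<>();

    /** Record that the application asked the engine to complete a task. */
    void taskCompleted(String processInstanceId, Instant when) {
        pendingCompletions.put(processInstanceId, when);
    }

    /** Record that the workflow reached its next user task or its end; clears the pending entry. */
    void waitStateReached(String processInstanceId) {
        pendingCompletions.remove(processInstanceId);
    }

    /** Returns instances whose completion is older than the grace period with no checkpoint after it. */
    List<String> findSuspects(Instant now, Duration gracePeriod) {
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : pendingCompletions.entrySet()) {
            if (Duration.between(entry.getValue(), now).compareTo(gracePeriod) > 0) {
                suspects.add(entry.getKey());
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        CheckpointLedger ledger = new CheckpointLedger();
        Instant t0 = Instant.parse("2021-01-01T10:00:00Z");
        ledger.taskCompleted("proc-1", t0);
        ledger.waitStateReached("proc-1"); // normal case: next checkpoint was reached
        ledger.taskCompleted("proc-2", t0); // never reaches the next checkpoint
        // prints "[proc-2]"
        System.out.println(ledger.findSuspects(t0.plusSeconds(600), Duration.ofMinutes(5)));
    }
}
```

A scheduled check over `findSuspects` would at least turn the silent cases into something I can alert on, even if it doesn’t explain why they happen.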