Orchestrations stuck in running state #280
Comments
I am also seeing this; for example, instance ID |
@davidmrdavid, it's a bit hard to tell, but it appears that a failure in the fan-out `call_activity_with_retry` created something weird with the Python implementation. We may want to see if we can repro this ourselves. @pcopley, each of these stuck orchestration bugs can have fairly independent root causes, so I would recommend opening a separate ticket for your issue.
Just a couple of questions:
Finally, as some unrelated advice, I am noticing that your activities are returning incredibly large outputs (and passing them as inputs to subsequent orchestrations). Some of these are on the order of MB. It is worth noting that when executing an orchestration function, we have to load the entire history into memory at once, which includes all orchestration, suborchestration, and activity inputs/outputs. If you have fan-out/fan-in operations with 100+ MB worth of history, this could easily cause you to run into the 1.5 GB limit of consumption function apps. In the meantime, I am going to transfer this to a separate repo, as this bug seems specific to the Durable Python implementation. |
I realize my previous comments regarding potential memory issues didn't cover the workaround. Instead of passing huge inputs directly as activity inputs/outputs, you can have your activity store the processed data in a blob and then pass around a reference to that blob. This has huge benefits from both a perf standpoint and a storage IO cost perspective, as you only fetch the data when you actually need it. The current downside is that it involves boilerplate code. We have an upcoming feature that will make this substantially easier.
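For example, a minimal sketch of this pattern (the container name, the helper, and the activity shape below are illustrative, not taken from your app):

```python
import json
import os
import uuid

from azure.storage.blob import BlobServiceClient

RESULTS_CONTAINER = "route-profiles"  # assumed, pre-created container


def compute_route_profile(route_params: dict) -> dict:
    # Placeholder for the real work that produces a large payload.
    return {"params": route_params, "profile": []}


def main(route_params: dict) -> str:
    """Activity: compute a large result, store it in a blob, return only the blob name."""
    profile = compute_route_profile(route_params)

    blob_service = BlobServiceClient.from_connection_string(
        os.environ["AzureWebJobsStorage"])
    blob_name = f"profile-{uuid.uuid4()}.json"
    blob_service.get_blob_client(RESULTS_CONTAINER, blob_name).upload_blob(
        json.dumps(profile))

    # Only this small reference flows through the orchestration history;
    # the next activity downloads the blob when it actually needs the data.
    return blob_name
```

The downstream activity (and, eventually, a clean-up activity) would then call `download_blob()` / `delete_blob()` on the same reference when it actually needs the data.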
I just realised that I used a different Github account for this and the following comment, please excuse!
I will try to figure this out and will update this issue with my findings.
I will see what I can come up with; and again, update this issue with my findings.
Thanks very much for the hint (as well as the workaround you provided later!). I think I understand where the problem lies. We did see "Python exited with code 137" errors at times (but I did not notice any in the execution under investigation here).
Unfortunately I do not seem to be able to collect the information you asked for in 1.; I cannot find the failing cases. What I did is use the Monitoring menu of the function in the portal and then run the query in Application Insights (cf 02_app_insight_success_true.png). Still, only "Success == True" traces show up, which is more or less expected. I then changed the filter on the Success column to "not True" (cf 03_success_not_true.png). Confusingly, that does not produce any result at all (cf 04_successs_not_true_no_result.png). This surprises me because the Monitor reported errors in the last 30 days (which is also the search time frame in App Insights). What is the best/most straightforward way to find the answer to your first question?
I have switched to using blob storage and references to blobs to pass the results from one activity function to the next. As you assumed, this alone does not solve the issue though. I could not find any orchestration instance that fulfils your request in 1., i.e. one that did succeed but had a failed activity call. Currently it seems to me as if the orchestrations with a failed (and retried) call are the ones that get stuck, but I am not sure about the details. My next test will be a run without retry. I will report back as soon as I am done with the test.
I now have an orchestration id of a run where there does not seem to be any error. I also have two orchestrations that seem to fail right after they have been rewound. These are all in the same storage account.
Thank you for the detailed information, this helps us a lot. I want to take a step back and ask about this sentence from two comments ago: "And I see the difference in number of calls to CalculateRouteProfile vs CalculateCo2FromRouteProfile!" When you say that, do you mean that you see the difference in the orchestration history log, or that you see the difference via a log statement that counts the number of elements in the relevant lists? cc @ConnorMcMahon.
Unfortunately I was referring to the first: I see a difference in the orchestration history log table. Somehow I am not able to find the logs for my orchestration instances (other than when live streaming?). There is a log statement that should show the counts; if only I was able to get at the logs... |
@multipleofzero, we should be able to help with that. Do you mind showing us how you're logging and how you're trying to retrieve those logs? And does your application have Application Insights enabled?
Awesome! I have tried two approaches, one via `logging.info` and one via `context.set_custom_status`.

```python
if not context.is_replaying:
    msg = (f"orchestrator has {len(chunk_list)} chunks with chunk size {FAN_OUT_CHUNK_SIZE} "
           f"for calculating {len(list_of_route_params)} route items")
    logging.info(msg)
```

Then, before calling `CalculateRouteProfile`:

```python
custom_status["msg"] = "processing route profiles"
custom_status["routes_in_current_chunk"] = len(chunks_of_route_params)
custom_status["current_chunk"] = idx
context.set_custom_status(custom_status)
```

And before calling `CalculateCo2FromRouteProfile`:

```python
custom_status["msg"] = f"processing co2 from {len(chunk_profiles)} route profiles"
context.set_custom_status(custom_status)
```

And I have a new Activity Function to clean up the referenced input data from the blob; before calling it:

```python
custom_status["msg"] = (f"cleaning up after co2 calculation of {len(chunk_profiles)} "
                        "route profiles")
context.set_custom_status(custom_status)
```

This should all be visible for that orchestration instance. Note that I will be away for a week from about now, but I will come back to this when I am back, at the latest.
Hi @scherand! Great to hear that Application Insights is enabled, that's half the battle! My understanding is that logging to Application Insights from a Python application requires just a tad more setup. There might be an easier way to do this, but let me give you my way and hopefully that's enough to fast-track us towards getting this runtime data. So we're going to follow the general setup outlined here. In particular, you'll want to:

(1) Install the OpenCensus handler for Azure Functions. To do this, please add the OpenCensus Azure log exporter dependency to your `requirements.txt`. As an example, my full `requirements.txt` when I was reproducing this issue looked roughly like the example below.
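A minimal sketch (assuming only the core Functions packages plus the OpenCensus Azure exporter are needed; pin versions as appropriate for your app):

```
azure-functions
azure-functions-durable
opencensus-ext-azure
```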
(2) Next, we'll set up the OpenCensus handler in your orchestrator file. You just need to add the following import statement

```python
from opencensus.ext.azure.log_exporter import AzureLogHandler
```

and then provide your Application Insights InstrumentationKey to this snippet right here:

```python
logger_opencensus = logging.getLogger('opencensus')
logger_opencensus.addHandler(
    AzureLogHandler(
        connection_string='InstrumentationKey=<the rest of your key here>'
    )
)
# the key has the following format: InstrumentationKey=<something>;IngestionEndpoint=<something>
# you can find it on the top-level banner when you access your Application Insights resource in the Azure portal
```

So, for example, the top of your orchestrator file might now look as follows:

```python
import logging
import json

import azure.functions as func
import azure.durable_functions as df
from opencensus.ext.azure.log_exporter import AzureLogHandler

logger_opencensus = logging.getLogger('opencensus')
logger_opencensus.addHandler(
    AzureLogHandler(
        connection_string='<the key>'
    )
)


def orchestrator_function(context: df.DurableOrchestrationContext):
```

(3) Now we can start logging! The API looks as follows:

```python
logger_opencensus.info('your message here!')
```

So a simple orchestrator that logs to Application Insights might look like:

```python
def orchestrator_function(context: df.DurableOrchestrationContext):
    if not context.is_replaying:
        logger_opencensus.info('my_debug_category hi there!')
    result1 = yield context.call_activity('Hello', "Tokyo")
    result2 = yield context.call_activity('Hello', "Seattle")
    result3 = yield context.call_activity('Hello', "London")
    return [result1, result2, result3]
```

(4) Now we can find your logs by going to your Application Insights resource in the Azure portal, then to the Logs option, and finally writing a query that looks like this:

```
traces
| where * contains "<a prefix of your log>"
```

So in the example above, I would query for:

```
traces
| where * contains "my_debug_category"
```

And that's about it! Try this out with a simple test orchestration and let us know if you're able to locate your own logs in Application Insights. Afterwards, we can make your logs more granular by including, in the prefix, the instanceID of your orchestration. You can access this value through the orchestration context.
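For example, a rough sketch of such an instance-ID prefix (this reuses the `logger_opencensus` handler set up above and assumes the context exposes its instance ID via an `_instance_id` attribute; adjust to your SDK version):

```python
def orchestrator_function(context: df.DurableOrchestrationContext):
    # Prefix each log line with the orchestration instance ID so the
    # Application Insights "traces" query can be narrowed to one instance.
    inst_id = context._instance_id  # assumed attribute name
    if not context.is_replaying:
        logger_opencensus.info(f'[{inst_id}] my_debug_category orchestrator started')
    result = yield context.call_activity('Hello', 'Tokyo')
    return result
```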
Thank you very much for the detailed instructions! After a bit of trial and error I realise that the two main pieces of information I was missing were the "traces" query and the fact that I can get the orchestration instance Id via `context._instance_id`. With this, my code, which currently uses statements like

```python
import logging
...

def orchestrator_function(context: df.DurableOrchestrationContext):
    """This is an orchestrator function"""
    inst_id = context._instance_id
    ...
    if not context.is_replaying:
        msg = (f"[{inst_id}] client input in orchestrator function: "
               f"{len(list)} items.")
        logging.info(msg)
```

is producing logs I can find with the above-mentioned query without any additional effort! I already successfully ran a smaller batch of calculations. All orchestration instances completed successfully though, so I have not gained any additional insight into the original issue so far.
Great! Please let us know if you're able to reproduce the error and gather this data! If a week or so goes by with no luck, we'll consider other options.
I ran another batch of calculations today in the storage account mentioned earlier. I am able to get at the logs of these orchestration runs using the traces query described above. I checked about four instances (mainly the History table, less so the traces) and all have at least one failed call to `CalculateRouteProfile`.
It seems that:
Does this help in any way? |
Hi @scherand, thanks for this update; I've been looking at the logs for the instances you shared. A few questions about your comments:

(1) I would appreciate it if you could answer this with concrete numbers.

(2) Could you point me to the timestamps, and maybe a screenshot of where you're seeing this?

Thanks again for your help.
Also, what were the observed and expected sizes of |
OK, one final request: just to understand the internal state during these errors, it would also help if you could also provide/log the values for
Let's see if I can explain what I mean 😄 ...

1st:

For the orchestration in question: on to the call(s) to the two activities — the rest is better explained below (I have analysed the run again).

2nd:

I see this in the history table (using Azure Storage Explorer). But I think I misinterpreted something. After going through the orchestrator instance execution again, I see the following. The orchestration should process 25 inputs, first calling `CalculateRouteProfile` for each of them. As the first three characters of the Ids make them unambiguous, I will just use the first three characters to refer to them.

First, 25 executions of `CalculateRouteProfile` are scheduled. 24 of these 25 execute successfully and one (the one with EventId 19) fails. By looking at the results, I can tell which call handled which input. At Timestamp 2021-05-06T12:48:28.051Z (_Timestamp 2021-05-06T12:48:27.687Z) a Timer with EventId 25 is scheduled (to fire at 2021-05-06T12:47:30.573Z). This timer fires at Timestamp 2021-05-06T13:47:00.785Z (_Timestamp 2021-05-06T12:48:27.687Z), which is way after what I would expect from the "fire at" time recorded when the timer was scheduled (2021-05-06T12:47:30.573Z).

After the timer fired, a single call to `CalculateRouteProfile` (the retry) is scheduled. This scheduled call (EventId 26) completes at Timestamp 2021-05-06T13:48:33.818Z/_Timestamp 2021-05-06T13:47:03.523Z and returns the expected result for a run of `CalculateRouteProfile`.

This is where it starts to be strange. 23 executions of `CalculateCo2FromRouteProfile` are scheduled, together with another Timer (EventId 50). At Timestamp 2021-05-06T13:49:44.905Z/_Timestamp 2021-05-06T13:48:33.226Z the timer just created (the one with EventId 50) fires, and 19 completed calls to `CalculateCo2FromRouteProfile` are recorded. Apparently the calls with EventIds 28, 29 and 33 are not done yet (and they should handle three of the inputs 227, 53e, 3fc, 0f9, or fd6; three of five because two were never scheduled).

Next, at Timestamp 2021-05-06T13:49:44.909Z/_Timestamp 2021-05-06T13:49:44.220Z another call is scheduled. The missing results for EventIds 33 (Input 227) and 28 (Input fd6) arrive now. In the next run of the orchestrator, the result for EventId 29 (Input 3fc) arrives. This is where we are stuck now: all scheduled calls to `CalculateCo2FromRouteProfile` have completed, but the orchestration does not progress any further.

Then, regarding the sizes you asked about: they match/are OK. Both should be/are 25.

I will only be able to look into your last request (about the values to log) later.
Hi @scherand, thank you for this detailed answer. I'm going to write down some utilities that should help you gather the data we need to understand the internal state of the framework when it makes these decisions, but I'll need some time. Let me ping back early next week. Until then, no need to log those values.
Hi @scherand, just wanted to provide an update from our end. We recently discovered a few issues in the C# engine that powers Durable Functions for Python, and I believe that these might be to blame for the weird issues we've discussed in this thread.

In short, whenever we build a Durable Orchestration in Python, there's a kind of translation (or compilation, if you prefer that term) process that declares an orchestrator in C#. What we recently realized is that, in this translation process, due to the asynchronous nature of Durable Functions activities, it's sometimes possible for the C# version of the orchestrator to become "out of sync" with the Python definition. This is especially likely to happen when you have timers and failures, which create extra opportunities for race conditions. Ultimately, this problem was due to us relying on minimal metadata about the Python orchestration when doing this translation.

The good news is that I just merged a patch that addresses this issue, and the patch should be made available in a couple of days. After that, I'll need to make a few tiny changes in the Durable Python SDK to enable these fixes; that should take just a few extra days. I apologize for keeping you waiting.

After I push all these bits online, we should try to reproduce this issue one more time. I have a gut feeling that the issue you experienced is caused, at least partially, by the aforementioned bug. I also added some extra telemetry bits that should enable me to reproduce this issue locally in case it persists.

And that's all. I'll ping you again in a couple of days once the C# package is released, and then once more once the Python changes are in and we're ready to test. Thank you for your patience and for helping us debug this case, we really appreciate it!
Thanks heaps for this update! This is exciting news. Of course I hope that the bug(s) fixed were the root cause of the issue we experienced, but I can only guess at the complexity of the system and understand that only testing will tell for sure. For now, let me tell you that I am very impressed with the support you guys provide. The effort and transparency are exceptional; I would not have dreamed of this when I created this issue. Keep that spirit up! Looking forward to hearing from you again.
Hi @scherand. We just released a new version of the Python SDK. Here's how to enable this new replay logic.

First, you'll want to install the latest version of the Python SDK package (`azure-functions-durable`).

Second, and this is the slightly more difficult part, you'll want to swap your built-in version of the durable-extension with version 2.5.0. By default, your app automatically obtains the durable extension via Extension Bundles. These are simply pre-deployed binaries that contain C# dependencies for your Python Functions app. We want to remove these, since they give you less control over which version of these C# dependencies you install. Removing them is simple. Go to your `host.json` and remove the following section:

```json
"extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[2.*, 3.0.0)"
}
```

So, for example, if your `host.json` used to look like this:

```json
{
  "version": "2.0",
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[1.*, 2.0.0)"
  }
}
```

then now it should look like this:

```json
{
  "version": "2.0"
}
```

We're almost there! Now you need to select the durable extension version you need to install. We're going to be following these instructions. In short, run the following command on your CLI:
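Something along these lines (the package id and version shown are an assumption, matching the 2.5.0 durable extension mentioned above):

```
# assumes Azure Functions Core Tools is installed; run from the root of the function app
func extensions install --package Microsoft.Azure.WebJobs.Extensions.DurableTask --version 2.5.0
```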
You'll know this worked if you now have an `extensions.csproj` file in your project. And that's it! Please let me know if you succeed with these instructions. If you do, I would love to do another test run and see if you still see these "stuck" orchestrations in the presence of activity failures. Thanks!
Thank you so much for the update and the detailed instructions. I will look into this as soon as I can, though I might have to wait a couple of days before I can run another batch of calculations. I will update this issue as soon as I have had a chance to tackle this.
Dear @davidmrdavid,

It seems that I managed to swap in the AZ Function Python SDK and the durable extension. I say "seems" because a few warnings showed up along the way: after running the install command, after clicking the "reconfigure project" button present in the warning and restarting VS Code, and after I started the function in debugging mode (via the VS Code button), which apparently installed the extension (again?). The warnings eventually disappeared.

I will soon kick off another batch of calculations with this configuration and report on how it is going.

In the meantime, two questions popped up in the process. First, how can I revert to using the "regular" extension bundle again? Can I just undo the changes (i.e. re-add the snippet in `host.json`)? Second, will this also work with the GitHub Action we use to deploy the AZ function? I mention it here in case that matters.
Wow. I ran the same batch of calculations again, with 1263 orchestrations. This time, only one seems to be stuck! And it seems to be a different "kind of stuck" to me :) The calculation was done in the storage account I mentioned before.

The main difference to before seems to be that the orchestration actually finished successfully, but this is not reported/recorded correctly. When I check the status URL, the orchestration is still reported as "Running", and in Application Insights I only see twelve calls to the activity. But when I check the history table (using Azure Storage Explorer), I see the expected 25 calls and even messages stating "ExecutionCompleted" and "OrchestratorCompleted". The only orchestration of this run that has a failed task, as far as I can tell, is this stuck one.

For good measure, I am adding a couple of (randomly selected) other orchestration Ids of the same run here.
|
Hi @scherand, I'm really excited to discuss your last post! But, before getting to it, let me answer your previous post's questions.

(1) On how to revert back to "regular" extension bundles: yes, you can just undo the change, i.e. re-add the `extensionBundle` section to your `host.json`.
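That is, the same section that was removed earlier goes back in, for example:

```json
"extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[2.*, 3.0.0)"
}
```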
After adding it back, for safety and peace of mind, you can remove the `extensions.csproj` file but, if all is working correctly, you should be able to keep it around. Then, after enough time goes by, the extension bundles should automatically contain the durable-extension bits that you manually downloaded with the instructions you followed.

(2) On whether this affects your deployment strategy:

Finally, I'm just a tad concerned about some of the magically disappearing warnings you experienced, but if they're no longer there, then it must be that they got resolved on their own. Maybe it was a timing or ordering issue :)

Now getting to your latest comment... I'm cautiously optimistic about it! Just for clarity: the orchestration code no longer gets stuck, correct? It's just that the Application Insights telemetry seems to be inconsistent for a single orchestration and still reports it as Running. Does that match your understanding as well? If so, I'll take that partial win :)

Looking at the logs on our end, I can also see that these orchestrations appear to have completed successfully. So this appears to be "just" an issue in propagating that information to Application Insights. Please let me know if this matches your understanding, and we can tackle this derivative issue next. Thank you!
Thanks heaps for following up with the questions about reverting the change and GitHub Actions for deployment. Sounds great to me 😃. I agree about the warnings: since they are gone and I did not run into any issues yet, I assume all is good.

I need to better understand the build process at some point; that might provide additional clues. Somehow it seems that the build process (oryx?) is using a different environment (venv?) than the one configured in VS Code. For a long time (I believe since before I even opened this issue) it has been reporting that a newer version is available.

From the Python shell in VS Code:

From the build process:

Now for the fun part 😃: this is about the "stuck" orchestration mentioned above. It is true that the orchestration itself seems to have completed. And it is also true that the information in Application Insights does not seem to reflect everything from the history table: mainly, I only see twelve calls to the activity there. BUT, and that seems to be a bit more serious, when calling the status URL the orchestration is still reported as "Running".

Cheers
@scherand, we have found that this is caused by a race condition: if a crash happens during the last replay of an orchestration, between updating the orchestration history and updating the state table, the two can end up inconsistent. We have filed an issue here: Azure/durabletask#561. This bug has been around for a couple of years, so I am surprised we haven't seen it until now. I'm presuming it is incredibly rare, but we will make sure to fix this for our next Durable Functions extension release.
Great! In that case, it appears to me that we've fixed the issue originally described in this thread :) As Connor mentioned, the final issue you're seeing is now being tracked in another thread, and we'll look to patch it as soon as possible. I'll be closing this thread for now, but do feel free to re-open it if the original issue re-occurs. Thank you for your collaboration and patience throughout this process; we've fixed some major correctness concerns while investigating this incident, so it was really productive on our end. Thank you, and reach out again if you need anything!
Not sure if this is considered solved, but I wanted to let you know that this is still an issue with the default extension pack. We had some retried IO activity functions that produced a lot of stuck orchestrations. I followed the guide for installing the latest durable extensions (#280 (comment)), and so far it seems to be solving our issues (at least when running locally). But it took a bit of digging to find this issue, and the solution within it.
Hi @Vidarls, thank you for reaching out and sorry that you were having this issue. As you've identified, there's a bit of a delay from the time when we make a durable-extension release to its inclusion in the default extension bundles. In the long term, my hope is to reduce that time difference so that users can leverage extension bundles to get the latest fixes automatically and almost as soon as they're released, but for now that's not possible. I'm glad to know that the issue seems resolved on your end now, please feel free to reach out if it re-appears. In the meantime, if there was anything we could have done to help bring more visibility to this thread's bug, and its mitigation, I would be open to discussing that in a new GitHub issue. Thanks! |
Description
I am using fan-out/fan-in to do a two-step calculation. The first step, "CalculateRouteProfile", calls a third-party API that returns data used in the second step, "CalculateCo2FromRouteProfile". The calculation is triggered by an HTTPStarter.

When running this for about 15'000 inputs, some of the calculations never seem to finish (most of them do, though). The ones that do not finish are stuck in the "Running" state.

Example of a finished calculation: c553159a6c994173bf6c4381376954dc. Examples of unfinished calculations follow below.
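For illustration, a minimal sketch of the fan-out/fan-in shape described above (not the actual app code; retry settings and the chunking of inputs are simplified):

```python
import azure.durable_functions as df


def orchestrator_function(context: df.DurableOrchestrationContext):
    route_params = context.get_input()  # list of route inputs from the HTTPStarter

    retry = df.RetryOptions(first_retry_interval_in_milliseconds=60_000,
                            max_number_of_attempts=3)

    # Step 1: fan out to the third-party API.
    profile_tasks = [
        context.call_activity_with_retry("CalculateRouteProfile", retry, params)
        for params in route_params
    ]
    profiles = yield context.task_all(profile_tasks)

    # Step 2: fan out again to compute CO2 from each route profile.
    co2_tasks = [
        context.call_activity_with_retry("CalculateCo2FromRouteProfile", retry, profile)
        for profile in profiles
    ]
    results = yield context.task_all(co2_tasks)

    return results


main = df.Orchestrator.create(orchestrator_function)
```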
Expected behavior
All ~15'000 inputs are processed.
Actual behavior
Out of 754 orchestrators, all but 14 eventually complete. The remaining 14 stay in the "Running" state.
Relevant source code snippets
host.json:

Known workarounds
Using small batches of input (around 100 routes) seems to work reliably.
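For example, one way to form such batches (a sketch; the constant name follows the `FAN_OUT_CHUNK_SIZE` used elsewhere in this thread, and the value is illustrative):

```python
FAN_OUT_CHUNK_SIZE = 100  # illustrative batch size


def chunk(items, size=FAN_OUT_CHUNK_SIZE):
    """Split the full input list into batches that are fanned out one at a time."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```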
App Details
If deployed to Azure