Orchestrations stuck in running state #280


Closed
multipleofzero opened this issue Apr 5, 2021 · 33 comments

@multipleofzero

Description

I am using Fan-out/fan-in to do a two step calculation. The first step, "CalculateRouteProfile",
calls a third party API that returns data used in the second step "CalculateCo2FromRouteProfile".

The calculation is triggered by an HTTPStarter.

When running this for about 15'000 inputs, some of the calculations never seem to finish (most
of them do, though). The ones that do not finish are stuck in the "Running" state.

Example of a finished calculation: c553159a6c994173bf6c4381376954dc. Examples of not finished
calculations follow below.

Expected behavior

All ~15'000 inputs are processed.

Actual behavior

Out of 754 orchestrations, all but 14 eventually complete. The remaining 14 stay in the "Running" state.

Relevant source code snippets

import logging
import azure.durable_functions as df
from azure.durable_functions.models.RetryOptions import RetryOptions


MAX_ROUTES_PER_POST_REQUEST = 100
FAN_OUT_CHUNK_SIZE = 25


def chunks(lst, n):
    """Yield successive n-sized chunks from list."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


def orchestrator_function(context: df.DurableOrchestrationContext):
    """Fan out over the input routes in chunks of FAN_OUT_CHUNK_SIZE, chaining two activities per route."""
    list_of_route_params = context.get_input()
    logging.info(
        f"client input in orchestrator function: {len(list_of_route_params)} items."
    )
    if len(list_of_route_params) > 0:
        logging.info(
            f"first item: {list_of_route_params[0]}."
        )
    custom_status = {
        "msg": "just started",
        "total_number_of_routes": len(list_of_route_params),
        "current_chunk": None,
        "routes_in_current_chunk": None
    }
    context.set_custom_status(custom_status)

    out = []  # this will contain our final output
    # We need to avoid a single call immediately trying to calculate thousands
    # of routes without limits, so we split the total list of input routes into
    # chunks, with FAN_OUT_CHUNK_SIZE as the limit per chunk.
    chunk_list = list(chunks(list_of_route_params, FAN_OUT_CHUNK_SIZE))
    logging.info(
        (f"I have {len(chunk_list)} chunks  with chunk size {FAN_OUT_CHUNK_SIZE} "
         f"for calculating {len(list_of_route_params)} route items")
    )
    custom_status["number_of_chunks"] = len(chunk_list)
    context.set_custom_status(custom_status)
    for idx, chunks_of_route_params in enumerate(chunk_list):
        # this is a chained "fan-in-fan-out" call: first get the profiles,
        # then calculate the profile and the emissions

        custom_status["msg"] = "processing route profiles"
        custom_status["routes_in_current_chunk"] = len(chunks_of_route_params)
        custom_status["current_chunk"] = idx
        context.set_custom_status(custom_status)

        parallel_tasks = []
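        # RetryOptions(first_retry_interval_in_milliseconds, max_number_of_attempts):
        # wait 1 second before the first retry and allow up to 3 attempts in total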
        retry_options = RetryOptions(1000, 3)
        for route_params in chunks_of_route_params:
            # first calculate the routes including the profile
            parallel_tasks.append(
                context.call_activity_with_retry("CalculateRouteProfile",
                                                 retry_options, route_params)
            )
        chunk_profiles = yield context.task_all(parallel_tasks)

        custom_status["msg"] = "processing from route profiles"
        context.set_custom_status(custom_status)
        parallel_tasks = []
        for profile in chunk_profiles:
            # calculate emission based on the profile
            parallel_tasks.append(
                context.call_activity(
                    "CalculateCo2FromRouteProfile",
                    {
                        "profile": profile["serialized_profile"],
                        "route_params": profile["route_params"]
                    },
                )
            )
        out += yield context.task_all(parallel_tasks)

    custom_status["msg"] = "done with processing"
    custom_status["routes_in_current_chunk"] = len(chunk_list[-1])
    custom_status["current_chunk"] = len(chunk_list)
    context.set_custom_status(custom_status)

    return out


main = df.Orchestrator.create(orchestrator_function)

host.json:

{
    "version": "2.0",
    "logging": {
        "applicationInsights": {
            "samplingSettings": {
                "isEnabled": true,
                "excludedTypes": "Request"
            }
        }
    },
    "extensionBundle": {
        "id": "Microsoft.Azure.Functions.ExtensionBundle",
        "version": "[2.*, 3.0.0)"
    },
    "extensions": {
        "durableTask": {
            "maxConcurrentActivityFunctions": 3,
            "maxConcurrentOrchestratorFunctions": 1
        }
    }
}

Known workarounds

Using small batches of input (around 100 routes) seems to work reliably.
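
For illustration only, that workaround could look roughly like this on the calling side (a hypothetical client-side helper; the starter URL, the requests dependency, and the helper name are assumptions, not part of the actual app):

import requests  # hypothetical client-side script, not part of the Function App

ROUTES_PER_REQUEST = 100  # mirrors MAX_ROUTES_PER_POST_REQUEST in the orchestrator module


def submit_in_batches(all_route_params, starter_url):
    """Send the routes in small batches so each orchestration stays small."""
    status_urls = []
    for i in range(0, len(all_route_params), ROUTES_PER_REQUEST):
        batch = all_route_params[i:i + ROUTES_PER_REQUEST]
        resp = requests.post(starter_url, json=batch)
        resp.raise_for_status()
        # the HTTP starter responds with management URLs, including statusQueryGetUri
        status_urls.append(resp.json().get("statusQueryGetUri"))
    return status_urls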

App Details

  • Durable Functions extension version (e.g. v1.8.3): 1.3.0 (just called "Azure Functions" in VS Code)
  • Azure Functions runtime version (1.0 or 2.0): 2.0
  • Programming language used: Python

If deployed to Azure

  • Timeframe issue observed: "createdTime": "2021-03-31T15:29:19Z"
  • Function App name: co2-route-calculator-dev
  • Function name(s): CalculateCo2FromRouteProfile, CalculateRouteProfile
  • Azure region: Switzerland North
  • Orchestration instance ID(s): a9ee8c72713644619b3e1d1eaeed7811, 1d986a195d1f4708be368e86c9e5cd06, adacc094cf934a68a5ae45f4562ec961, ...
  • Azure storage account name: co2routecalculatordev
@ConnorMcMahon ConnorMcMahon self-assigned this Apr 6, 2021
@pcopley

pcopley commented Apr 10, 2021

I am also seeing this; for example, instance ID 64877078bdf54fde97cc8dea383d9a78. I have an orchestrator that saves a single record to CosmosDB (Mongo API) then publishes a corresponding Service Bus message.

@ConnorMcMahon

@davidmrdavid, it's a bit hard to tell, but it appears that a failure in the fan-out call_activity_with_retry created something weird with the Python implementation. We may want to see if we can repro this ourselves.

@pcopley, each of these stuck orchestration bugs can have fairly independent root causes, so I would recommend opening a separate ticket for your issue.

@ConnorMcMahon

@multipleofzero

Just a couple of questions:

  1. For the ~750 that didn't run into this issue, did any of them have one of the initial call_activity_with_retry("CalculateRouteProfile") calls fail? If so, that orchestration instance id would be a great comparison point.
  2. Would it be possible to get a reproduction of this issue where you compare the size of chunk_list and chunk_profiles? For some reason, we don't see as many CalculateCo2FromRouteProfile executions scheduled as CalculateRouteProfile calls.

Finally, as some unrelated advice, I am noticing that your activities are returning incredibly large outputs (and passing them as inputs to subsequent orchestrations). Some of these are on the order of MB. It is worth noting that when executing an orchestration function, we have to load the entire history into memory at once, which includes all orchestration, suborchestration, and activity inputs/outputs. If you have fan-out/fan-in operations with 100+ MB worth of history, this could easily cause you to run into the 1.5 GB limit of consumption function apps.

In the meantime, I am going to transfer this to a separate repo, as this bug seems specific to the Durable Python implementation.

@ConnorMcMahon ConnorMcMahon transferred this issue from Azure/azure-functions-durable-extension Apr 12, 2021
@ConnorMcMahon

I realize my previous instructions regarding potential memory issues didn't cover the workaround. Instead of passing huge inputs directly as activity inputs/outputs, you can have your activity store the processed data in a blob and then pass around a reference to that blob. This has huge benefits from both a perf standpoint and a storage IO cost perspective, as you only fetch the data when you actually need it.

The current downside is that it involves boilerplate code. We have an upcoming feature that will make this substantially easier.
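
To make that pattern concrete, here is a minimal sketch of a hypothetical activity written this way (assumptions: azure-storage-blob is installed, the connection string is available via the AzureWebJobsStorage app setting, a "profiles" container exists, and compute_route_profile stands in for the real calculation):

import json
import os
import uuid

from azure.storage.blob import BlobServiceClient


def compute_route_profile(route_params: dict) -> dict:
    """Placeholder for the real (large) profile calculation."""
    return {"route_params": route_params, "profile_points": []}


def main(route_params: dict) -> dict:
    """Hypothetical activity: compute a large profile, upload it to a blob,
    and return only a small reference instead of the payload itself."""
    profile = compute_route_profile(route_params)

    blob_service = BlobServiceClient.from_connection_string(
        os.environ["AzureWebJobsStorage"]
    )
    blob_name = f"{uuid.uuid4()}.json"
    blob_service.get_blob_client("profiles", blob_name).upload_blob(
        json.dumps(profile), overwrite=True
    )

    # downstream activities receive this small dict and fetch the blob themselves
    return {"profile_blob": blob_name, "route_params": route_params}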

@scherand

scherand commented Apr 13, 2021

I just realised that I used a different GitHub account for this and the following comment, please excuse!

1. For the ~750 that didn't run into this issue, did any of them have one of the initial `call_activity_with_retry("CalculateRouteProfile")` calls fail? If so, that orchestration instance id would be a great comparison point.

I will try to figure this out and will update this issue with my findings.

2. Would it be possible to get a reproduction of this issue where you compare the size of `chunk_list` and `chunk_profiles`? For some reason, we don't see as many `CalculateCo2FromRouteProfile` executions scheduled as `CalculateRouteProfile` calls.

I will see what I can come up with; and again, update this issue with my findings.

Finally, as some unrelated advice, I am noticing that your activities are returning incredibly large outputs (and passing them as inputs to subsequent orchestrations). Some of these are on the order of MB. It is worth noting that when executing an orchestration function, we have to load the entire history into memory at once, which includes all orchestration, suborchestration, and activity inputs/outputs. If you have fan-out/fan-in operations with 100+ MB worth of history, this could easily cause you to run into the 1.5 GB limit of consumption function apps.

Thanks very much for the hint (as well as the workaround you provided later!). I think I understand where the problem lies. We did see "Python exited with code 137" errors some time (but I did not notice any in the execution under investigation here).

@scherand

Unfortunately I do not seem to be able to collect the information you asked for in 1. I cannot find the cases where CalculateRouteProfile failed.

What I did is use the Monitoring menu of the CalculateRouteProfile function (cf 01_calc_profile_monitor.png) where 124 Errors are reported in the last 30 days. They do not show up in the list below, presumably because the list shows the most recent traces (?) and the errors are older.

01_calc_profile_monitor

I ran the query in Application Insights (cf 02_app_insight_success_true.png). Still only "Success == True" traces show up, which is more or less expected.

02_app_insight_success_true

I then changed the filter on the Success column to "not True" (cf 03_success_not_true.png).

03_success_not_true

Confusingly, that does not produce any results at all (cf 04_successs_not_true_no_result.png). This surprises me because the Monitor reported Errors in the last 30 days (which is also the search time frame in App Insights).

04_successs_not_true_no_result

What is the best/most straightforward way to find the answer to your first question?

@multipleofzero
Author

multipleofzero commented Apr 21, 2021

I have switched to using blob storage and references to blobs to pass the results from one activity function to the next now. As you assumed, this alone does not solve the issue though.

I could not find any orchestration instance that fulfills your request in 1., i.e. that did succeed but had a failed call to CalculateRouteProfile. And I see the difference in number of calls to CalculateRouteProfile vs CalculateCo2FromRouteProfile!

Currently it seems to me as if the _with_retry part in call_activity_with_retry is causing the issue. When looking at the history table of an orchestration instance, it seems that the triggers that are scheduled for the retry do not (properly) fire? Because I suspected the 1 second backoff time might be too short for the system to properly register the timer, I changed the wait time before the retry to 10 seconds (which is correctly reflected in the difference between Timestamp and FireAt in the screenshot below).
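
For reference, that change amounts to something like this in the orchestrator (a sketch; only the RetryOptions arguments differ from the code in the issue description):

# 10 seconds before the first retry instead of 1 second, still up to 3 attempts
retry_options = RetryOptions(
    first_retry_interval_in_milliseconds=10_000,
    max_number_of_attempts=3,
)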

I am not sure about the Timestamp vs _Timestamp column though. This screenshot is from Azure Storage Explorer. (Note: this orchestration instance ran in a different storage account: co2routecalculatortest)

az_storage_explorer_history_table_from_test

My next test will be a run without retry. I will report back as soon as I am done with the test.

@multipleofzero
Author

I now have an orchestration id of a run where there does not seem to be any error in CalculateRouteProfile but still it does not terminate: 9c7a8fceb0dd4ddf9f94119cc4262184. This is from a run without _with_retry which seems to make the "perpetual orchestrations" (almost) go away (the one I have listed is the only one in 1000+). However, I have to deal with failed orchestrations now (rewinding them).

I also have two orchestrations that seem to fail right after they have been rewound: 3a453dc5bcb24ca388a2f1bac253aa4e and 647205c287ee4a1a98c2468c9ae0eb0d.

These are all in the storage account named co2routecalculatortest.

@davidmrdavid
Collaborator

davidmrdavid commented Apr 21, 2021

Hi @multipleofzero

Thank you for the detailed information, this helps us a lot. I want to take a step back and ask about this sentence you made two comments ago: "And I see the difference in number of calls to CalculateRouteProfile vs CalculateCo2FromRouteProfile!"

When you say that, do you mean that you see the difference in orchestration history log, or that you see the difference via a log statement that counts the length of elements in chunk_profiles versus chunk_list? Having the latter would be ideal for us and it would help pinpoint if this issue is occurring at the level of the Python SDK, or deeper in the framework.

cc @ConnorMcMahon.

@multipleofzero
Author

@davidmrdavid

ask about this sentence you made two comments ago: "And I see the difference in number of calls to CalculateRouteProfile vs CalculateCo2FromRouteProfile!"
When you say that, do you mean that you see the difference in orchestration history log, or that you see the difference via a log statement that counts the length of elements in chunk_profiles versus chunk_list?

Unfortunately I was referring to the first: I see a difference in the orchestration history log table. Somehow I am not able to find the logs for my orchestration instances (other than when live streaming?). There is a log statement that should show the counts; if only I was able to get at the logs...

@davidmrdavid
Collaborator

@multipleofzero, we should be able to help with that. Do you mind showing us how you're logging and how you're trying to retrieve those logs? And does your application have Application Insights enabled?

@scherand

scherand commented Apr 22, 2021

@davidmrdavid

we should be able to help with that

Awesome!
And yes, Application Insights is (are :P) enabled.

I have tried two approaches, one via logging.info() and one via setting custom_status. For example, this is logged when the orchestrator_function starts:

if not context.is_replaying:
    msg = (f"orchestrator has {len(chunk_list)} chunks  with chunk size {FAN_OUT_CHUNK_SIZE} "
           f"for calculating {len(list_of_route_params)} route items")
    logging.info(msg)

Then, before calling CalculateRouteProfile a custom status is set:

custom_status["msg"] = "processing route profiles"
custom_status["routes_in_current_chunk"] = len(chunks_of_route_params)
custom_status["current_chunk"] = idx
context.set_custom_status(custom_status)

And before calling CalculateCo2FromRouteProfile the custom status is updated:

custom_status["msg"] = f"processing co2 from {len(chunk_profiles)} route profiles"
context.set_custom_status(custom_status)

And there is a new Activity Function, CalculationCleanup, that cleans up the referenced input data from the blob; before calling it, the custom status is updated once more:

custom_status["msg"] = (f"cleaning up after co2 calculation of {len(chunk_profiles)} "
                        "route profiles")
context.set_custom_status(custom_status)

This should all be visible for the orchestration 9c7a8fceb0dd4ddf9f94119cc4262184 in the storage account named co2routecalculatortest.

Note that I will be away for about a week starting now, but I will come back to this as soon as I am back.

@davidmrdavid
Collaborator

Hi @scherand!

Great to hear that Application Insights is enabled, that's half the battle! My understanding is that logging to Application Insights from a Python application requires just a tad more setup. There might be an easier way to do this, but let me give you my way and hopefully that's enough to fast-track us towards getting this runtime data.

So we're going to follow the general setup outlined here. In particular, you'll want to:

(1) Install the OpenCensus handler for Azure Functions. To do this, please add the following dependency to your requirements.txt

opencensus-ext-azure

As an example, my full requirements.txt when I was reproducing this issue looks like:

azure-functions
azure-functions-durable
opencensus-ext-azure

(2) Next, we'll set up the OpenCensus handler in your orchestrator file. You just need to add the following import statement

from opencensus.ext.azure.log_exporter import AzureLogHandler

and then provide your Application Insights InstrumentationKey to this snippet right here:

logger_opencensus = logging.getLogger('opencensus')
logger_opencensus.addHandler(
    AzureLogHandler(
        connection_string='InstrumentationKey=<the rest of your key here>'
    )
)

# the key has the following format: InstrumentationKey=<something>;IngestionEndpoint=<something>
# you can find it on the top-level banner when you access your Application Insights resource in the Azure portal

So, for example, the top of your orchestrator file might now look as follows

import logging
import json

import azure.functions as func
import azure.durable_functions as df
from opencensus.ext.azure.log_exporter import AzureLogHandler

logger_opencensus = logging.getLogger('opencensus')
logger_opencensus.addHandler(
    AzureLogHandler(
        connection_string='<the key>'
    )
)

def orchestrator_function(context: df.DurableOrchestrationContext):

(3) Now we can start logging! The API looks as follows:

logger_opencensus.info('your message here!')

So a simple orchestrator that logs to Application Insights might look like:

def orchestrator_function(context: df.DurableOrchestrationContext):
    if not context.is_replaying:
        logger_opencensus.info('my_debug_category hi there!')
    result1 = yield context.call_activity('Hello', "Tokyo")
    result2 = yield context.call_activity('Hello', "Seattle")
    result3 = yield context.call_activity('Hello', "London")
    return [result1, result2, result3]

(4) Now we can find your logs by going to your Application Insights resource in the Azure portal, then to the Logs option, and finally writing a query that looks like this:

traces
| where * contains "<a prefix of your log>"

So in the example above, I would query for:

traces
| where * contains "my_debug_category"

I'm attaching a picture of what your output should roughly look like. See below:
mylogs

And that's about it!

Try this out with a simple test orchestration and let us know if you're able to locate your own logs in Application Insights. Afterwards, we can make your logs more granular by including, in the prefix, the instanceID of your orchestration. You can access this value via context._instance_id. With that, we should have enough to find the data we're looking for! Please let me know if anything about this was unclear; I realize this comment is rather long by now ⚡. Thanks!

@scherand

scherand commented May 3, 2021

Hi @davidmrdavid

Thank you very much for the detailed instructions!

After a bit of trial and error I realise that the two main pieces of information I was missing were the "traces-query"

traces
| where * contains "my_debug_category"

and the fact that I can get the orchestration instance Id via context._instance_id.

With this, my code which is currently using statements like

import logging
...
def orchestrator_function(context: df.DurableOrchestrationContext):
    """This is an orchestrator function"""
    inst_id = context._instance_id
    ...
    if not context.is_replaying:
        msg = (f"[{inst_id}] client input in orchestrator function: "
               f"{len(list)} items.")
        logging.info(msg)

is producing logs I can find with the above mentioned query without any other additional effort!

I already successfully ran a smaller batch of calculations. All orchestration instances successfully completed though, so I did not gain any additional insight for the original issue until now.

@davidmrdavid
Collaborator

Great! Please let us know if you're able to reproduce the error and gather this data! If a week or so goes by with no luck, we'll consider other options.

@scherand

scherand commented May 6, 2021

I ran another batch of calculations today in the storage account co2routecalculatordev. I deliberately re-enabled call_activity_with_retry for this run, hoping to get better telemetry.
The calculation started 1263 orchestration instances of which 66 seem to be stuck in processing state now.

I am able to get at the logs of these orchestration runs using the traces query @davidmrdavid provided. From these, it seems that the orchestration has the correct counts for routes/profiles in all cases (25 in this run).

I checked about four instances (mainly the History table, little on the traces) and all have at least one failed call to CalculateRouteProfile (some have two). Here are the orchestration ids:

  • 9fdf1d3ac0494b459f1372d6ac489452
  • 498e703978064766a5581efdf1b7a7d4
  • fed64ff38a5341dbbdf7c15511e9432b
  • f3728ea551434445846ff759d44a6753

It seems that:

  • the orchestrations that are "stuck" fail to schedule as many tasks for CalculateCo2FromRouteProfile as originally failed (but succeeded after the retry)
  • the orchestrations that are "stuck" do not "wake up" after the same number of CalculateCo2FromRouteProfile tasks that they scheduled have finished (but because not all were scheduled, not all that are "available" do finish)
  • (in instance 9fdf1d3ac0494b459f1372d6ac489452) the failed call to CalculateRouteProfile then completed twice for CalculateCo2FromRouteProfile (once with empty input data, presumably from the failed call to CalculateRouteProfile and once with correct input data) whereas the first two "calculation requests" (by order of the input data) did not complete
  • I cannot tell if the "missing" results/"stuck" executions of CalculateCo2FromRouteProfile were scheduled at all or not (probably not)
  • Something strange is going on with the timers (at least if the "Timestamp" and "FireAt" fields in the History table should make sense)

Does this help in any way?

@davidmrdavid
Collaborator

Hi @scherand,

Thanks for this update, I've been looking at the logs for 9fdf1d3ac0494b459f1372d6ac489452 which appears to have a single failed execution attempt for CalculateRouteProfile.

A few questions about your comments.
(1)
What did you mean by:

the orchestrations that are "stuck" fail to schedule as many tasks for CalculateCo2FromRouteProfile as originally failed (but succeeded after the retry)

I would appreciate it if you could answer this with concrete numbers.

(2)

(in instance 9fdf1d3ac0494b459f1372d6ac489452) the failed call to CalculateRouteProfile then completed twice for CalculateCo2FromRouteProfile (once with empty input data, presumably from the failed call to CalculateRouteProfile and once with correct input data) whereas the first two "calculation requests" (by order of the input data) did not complete

Could you point me to the timestamps, and maybe a screenshot or where you're seeing this?

Thanks again for your help

@davidmrdavid
Collaborator

davidmrdavid commented May 7, 2021

Also, what were the observed and expected sizes of chunk_profiles for InstanceID 9fdf1d3ac0494b459f1372d6ac489452 ?

@davidmrdavid
Collaborator

OK, final request: it would also help, just to understand the internal state during these errors, if you could provide/log the values of context._histories right before you schedule the CalculateCo2FromRouteProfile activities. With that, we can understand the decisions the Python SDK made. On a single orchestration execution (until it gets stuck) you should receive multiple snapshots of that, once per orchestration replay, so having those snapshots would be key to understanding what scheduling decisions were made. Thanks again!
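
A sketch of what that logging could look like inside the orchestrator, placed just before the CalculateCo2FromRouteProfile fan-out (logging on every replay on purpose, so each replay's snapshot is captured; _histories is an internal attribute, so treat this as debug-only and as my assumption about how you would wire it in):

# debug-only: dump the SDK's internal history snapshot on every replay,
# right before scheduling the CalculateCo2FromRouteProfile tasks
logging.info(
    f"[{context._instance_id}] chunk {idx} history before CO2 fan-out: "
    f"{context._histories}"
)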

@scherand

scherand commented May 7, 2021

Hi @davidmrdavid

Let's see if I can explain what I mean 😄 ...

What did you mean by:

the orchestrations that are "stuck" fail to schedule as many tasks for CalculateCo2FromRouteProfile as originally failed (but succeeded after the retry)

Orchestration 9fdf1d3ac0494b459f1372d6ac489452 is currently "stuck" in running state (i.e. it is in running state without any change for quite some time now). When it ran, it had one failed execution attempt for CalculateRouteProfile (as you indicated). This was (automatically) re-tried and the second attempt succeeded. So we first had 25 calls to CalculateRouteProfile of which one failed, that was re-tried and succeeded. In total we have 26 calls to CalculateRouteProfile (for 25 different inputs).

On to the call(s) to CalculateCo2FromRouteProfile. We would expect 25 calls here, one for every successful output of CalculateRouteProfile we had before. But I can only see 23 calls to CalculateCo2FromRouteProfile.

The rest is better explained below (I have analysed the run again).

2nd:

(in instance 9fdf1d3ac0494b459f1372d6ac489452) the failed call to CalculateRouteProfile then completed twice for CalculateCo2FromRouteProfile (once with empty input data, presumably from the failed call to CalculateRouteProfile and once with correct input data) whereas the first two "calculation requests" (by order of the input data) did not complete

Could you point me to the timestamps, and maybe a screenshot or where you're seeing this?

I see this in the history table (using Azure Storage Explorer). But I think I misinterpreted something. After going through the orchestrator instance execution again, I see the following.

The orchestration should process 25 inputs, first calling CalculateRouteProfile and then, with the result of CalculateRouteProfile, calling CalculateCo2FromRouteProfile. My inputs can be identified by an Id I gave them. These are the input Ids for this orchestration instance (I do not know if the order is important):

1.  53e24650050d9a48
2.  0f9ea52d96486ac7
3.  ea523befc643e630
4.  fd6ca3d38706a2b2
5.  3fccb482633e519d
6.  51fe1ede558e4342
7.  0688660722f778cb
8.  fdf8faa56187b2a9
9.  227574a6b1faa39b
10. f1e3af68a4ec72d7
11. d7ae9422578c2077
12. 6e6c30c4af06eb1f
13. 9bde252dde80baa4
14. a06c32cc1693690b
15. 965b1e49f691104e
16. d9beee6e40fd5bfe
17. d5472e4d79566ab9
18. 716bfb95896f5e06
19. fe54b25c1951b5e9
20. 7a54c38670bf7916
21. 6379e1a7c1d9b472
22. cdcf9406190790f9
23. 99d63b74a1f2f657
24. 5557b17fa530a05c
25. fdb76472bff8e78f

As the first three characters of the Ids make them unambiguous, I will just use the first three characters to refer to them.

First, 25 executions of CalculateRouteProfile are scheduled (as expected):

01_call_route_profile

24 of these 25 execute successfully and one (the one with EventId 19) fails. By looking at the results, I conclude that CalculateRouteProfile failed for input 7a5 (that one is missing).

02_result_route_profile

At Timestamp 2021-05-06T12:48:28.051Z (_Timestamp 2021-05-06T12:48:27.687Z) a Timer with EventId 25 is scheduled (to fire at 2021-05-06T12:47:30.573Z).

This timer fires at Timestamp 2021-05-06T13:47:00.785Z (_Timestamp 2021-05-06T12:48:27.687Z), which is way after what I would expect by looking at the "fire at" when the timer was scheduled (2021-05-06T12:47:30.573Z).

After the timer fired, a single call to CalculateRouteProfile is scheduled (EventId 26), Timestamp 2021-05-06T13:47:00.786Z, _Timestamp 2021-05-06T13:47:00.240Z.

03_timer_route_profile

This scheduled call (EventId 26) completes at Timestamp 2021-05-06T13:48:33.818Z/_Timestamp 2021-05-06T13:47:03.523Z and returns the expected result for a run of CalculateRouteProfile for input 7a5.

This is where it starts to be strange.

23 executions of CalculateCo2FromRouteProfile are scheduled (expected: 25). They have EventIds from 27 to 49.
Also, a timer is created (EventId 50, Timestamp 2021-05-06T13:48:33.825Z/_Timestamp 2021-05-06T13:48:33.226Z) and set to fire at 2021-05-06T12:47:30.573Z. This is a "fire at" in the past? It is the same "fire at" time that was used for the previous timer (EventId 25).

04_call_calc_co2

At Timestamp 2021-05-06T13:49:44.905Z/_Timestamp 2021-05-06T13:48:33.226Z the timer just created (the one with EventId 50) fires and 19 completed calls to CalculateCo2FromRouteProfile start to roll in. Here is the breakdown (EventId/Input Id):

EventId / Input Id
27 / ea5
28
29
30 / 51f
31 / 068
32 / fdf
33
34 / f1e
35 / d7a
36 / 6e6
37 / 9bd
38 / a06
39 / 965
40 / d9b
41 / d54
42 / 716
43 / fe5
44 / 7a5 (this is the one that first had a failed execution of "CalculateRouteProfile"); it contains the "real" result (only possible after a successful call to "CalculateRouteProfile")
45 / 637
46 / cdc
47 / 99d
48 / 555
49 / fdb

Apparently the calls with EventIds 28, 29 and 33 are not done yet (and they should handle three of the inputs 227, 53e, 3fc, 0f9, or fd6; three of five because two were never scheduled).

Next, at Timestamp 2021-05-06T13:49:44.909Z/_Timestamp 2021-05-06T13:49:44.220Z, another call to CalculateRouteProfile is scheduled (EventId 51). Why?
It completes at Timestamp 2021-05-06T13:50:02.909Z/_Timestamp 2021-05-06T13:49:45.363Z and has the result of a call to CalculateRouteProfile for input 7a5 again (the one that initially failed; the exact same result was already received in Event 26).

05_call_route_profile_again

The missing results for EventIds 33 (Input 227) and 28 (Input fd6) arrive now. In the next run of the orchestrator, the result for EventId 29 (Input 3fc) arrives.

This is where we are stuck now: all scheduled calls to CalculateRouteProfile and CalculateCo2FromRouteProfile completed, but two (for inputs 53e and 0f9, which are the first two by the ordering of the input) are missing. Probably there never was a call to CalculateCo2FromRouteProfile for these?

Then:

what were the observed and expected sizes of chunk_profiles for InstanceID 9fdf1d3ac0494b459f1372d6ac489452

They match: both should be, and are, 25.

I will only be able to look into your last request (about context._histories) next week, I am afraid.

@davidmrdavid
Collaborator

Hi @scherand,

Thank you for this detailed answer. I'm going to write down some utilities that should help you get us data to understand the internal state of the framework when making these decisions, but I'll need some time. Let me ping back early next week. Until then, no need to log context._histories, you'll probably need some of those utils to get the right data from there anyways :)

@davidmrdavid
Collaborator

davidmrdavid commented May 26, 2021

Hi @scherand,

Just wanted to provide an update from our end. We recently discovered a few issues in the C# engine that powers Durable Functions for Python, and I believe that these might be to blame for the weird issues we've discussed in this thread.

In short, whenever we build a Durable Orchestration in Python, there's a kind of translation (or compilation if you prefer that term) process that declares an orchestrator in C#. What we recently realized is that, in this translation process, due to the asynchronous nature of Durable Functions activities, it's sometimes possible for the C#-version of the orchestrator to become "out of sync" with the Python definition. This is especially likely to happen when you have timers, and failures, which create extra opportunities for race conditions.

Ultimately, this problem was due to us relying on minimal metadata about the Python orchestration when doing this translation process. The good news is that I just merged a patch that addresses this issue here and the patch should be made available in a couple of days. After that, I'll need to make a few tiny changes in the Durable Python SDK to enable these patches to take place, that should take just a few extra days. I apologize for keeping you waiting.

After I push all these bits online, we should try to reproduce this issue one more time. I have the gut feeling that the issue you experienced is caused, at least partially, by the aforementioned bug. I also added some extra telemetry bits that should enable me to reproduce this issue locally in case it persists.

And that's all, I'll ping you again in a couple of days once the C# package is released, and then once more once the Python changes are in and we're ready to test. Thank you for your patience and for helping us debug this case, we really appreciate it!

@scherand

Hi @davidmrdavid

Thanks heaps for this update! This is exciting news. Of course I do hope that the fixed bug(s) were the root cause of the issue we experienced. But I can only guess at the complexity of the system and understand that only testing will determine this definitively.

For now, let me tell you that I am very impressed with the support you guys provide. The effort and transparency are exceptional. I would not have dreamed this when I created this issue. Keep that spirit up!

Looking forward to hearing from you again.

@davidmrdavid
Collaborator

Hi @scherand.

We just released a new version of the Python SDK, version 1.0.1, containing improved replay logic! When paired with the correct version of the durable-extension, version 2.5.0, this improved logic should kick in and, ideally, it should mitigate this issue. Even if it doesn't, the changes we made should still allow us to obtain improved telemetry on our end that should help us better diagnose just what is going on when your application misbehaves.

Here's how to enable this new replay logic.

First, you'll want to install the latest version of the azure-functions-durable PyPI package on your Functions app. There are many ways to do this. One of them is simply to modify your requirements.txt and fix the package target version to be 1.0.1. Then, perform a pip install -r requirements.txt.
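
For example, a pinned requirements.txt (using the same packages as in the earlier example) might look like:

azure-functions
azure-functions-durable==1.0.1
opencensus-ext-azure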

Second, and this is the slightly more difficult part, you'll want to swap your built-in version of the durable-extension with version 2.5.0. By default, your app automatically obtains the durable extension via Extension Bundles. These are simply pre-deployed binaries that contain C# dependencies for your Python Functions app. We want to remove these, since they give you less control over which version of these C# dependencies you install.

Removing them is simple. Go to your host.json file and remove the section for the extensionBundle property. Normally, that property looks as follows:

    "extensionBundle": {
        "id": "Microsoft.Azure.Functions.ExtensionBundle",
        "version": "[2.*, 3.0.0)"
    }

So, for example, if your host.json currently looks like this:

{
    "version": "2.0",
    "extensionBundle": {
        "id": "Microsoft.Azure.Functions.ExtensionBundle",
        "version": "[1.*, 2.0.0)"
    }
}

then now it should look like this:

{
    "version": "2.0"
}

We're almost there! Now you need to select the durable extension version you need to install. We're going to be following these instructions. In short, run the following command on your CLI:

func extensions install --package Microsoft.Azure.WebJobs.Extensions.DurableTask --version 2.5.0

You'll know this worked if you now have an extensions.csproj file in your project, and if a func host start succeeds without any issues.

And that's it!

Please let me know if you succeed with these instructions. If you do, I would love to do another test run and see if you still see these "stuck" orchestrations in the presence of activity failures. Thanks!

@scherand

Hi @davidmrdavid

Thank you so much for the update and the detailed instructions. I will look into this as soon as I can. I might have to wait a couple of days though before I can run another batch of calculations. I will update this issue as soon as I had a chance to tackle this.

@scherand

Dear @davidmrdavid

It seems that I managed to swap in the azure-functions-durable Python SDK 1.0.1 as well as the durable extension version 2.5.0 using your instructions.

I say "seems" because:

After running the func extensions install command, I did receive a warning in VS Code saying

Your project is not configured to work with "extensions.csproj".

After clicking the "reconfigure project" button present in the warning and restarting VS Code, a simple func host start produced another error:

The 'CalculateCo2FromRouteProfile' function is in error: The binding type(s) 'blob' are not registered.
Please ensure the type is correct and the binding extension is installed.

After I started the function in debugging mode (via the VS Code button) which apparently installed the extension (again?), func host start is now working. I assume that this means all is good 😅.

I will soon kick off another batch of calculations with this configuration and report on how it is going.

In the meantime two questions popped up in the process.

First, how can I revert to using the "regular" extension bundle again? Can I just undo the changes (i.e. re-add the snippet in host.json and delete the extensions.csproj file)? Or is this more involved?

Second, will this also work with the GitHub Action we use to deploy the AZ function (Azure/functions-action@v1)? We already use "remote build" (I believe it is called)

scm-do-build-during-deployment: true
enable-oryx-build: true

there in case that matters.
I have deployed the Function App via VS Code for the test I am running now, that worked as expected. But I assume using GitHub Actions could be different?

@scherand

scherand commented Jul 1, 2021

Hi @davidmrdavid

Wow.

I ran the same batch of calculations again, with 1263 orchestrations. This time, only one seems to be stuck! And it seems to be a different "kind of stuck" to me :)

The calculation was done on the co2routecalculatortest storage account. And the orchestration Id in question is 458aa5e1e7b04f0096ac072d251f4f89.

The main difference to before seems to be that the orchestration actually finished successfully but this is not reported/recorded correctly.

When I check the status URL a runtimeStatus of Running is reported. When I check the logs in Application Insights

traces
| where * contains '458aa5e1e7b04f0096ac072d251f4f89'

I see twelve calls to CalculationCleanup, which is the last step/Activity Function, logged (expected: 25).

But when I check the history table (using Azure Storage Explorer), I see the expected 25 calls and even messages stating "ExecutionCompleted" and "OrchestratorCompleted".

AzureStorageExplorer

The only orchestration that has a failed task, as far as I can tell, is 296301a800994053960f956ecadc8c8a. "Usually" the failed activity function was CalculateRouteProfile until now; here it is a different one, though (CalculationPersist, which comes later in the process). This orchestration later also completed correctly (i.e. the retry succeeded).

For good measure, I am adding a couple of (randomly selected) other orchestration Ids of the same run here.

965865fa677045978caeb1f15364aeef
721ea0e4bb114ed3a3de61046ca44a86
bdddb6a56cff48cf832691c711830f23
482d5f3a69264d5ea17a464f4384bcea
79818f625af545aaaea8c46227189db8

@davidmrdavid
Collaborator

davidmrdavid commented Jul 1, 2021

Hi @scherand,

I'm really excited to discuss your last post! But, before getting to it, let me answer your previous post's questions.

(1) On how to revert back to "regular" extension bundles
You got it right, just adding back the text in host.json should suffice. In other words, just add the following:

    "extensionBundle": {
        "id": "Microsoft.Azure.Functions.ExtensionBundle",
        "version": "[2.*, 3.0.0)"
    }

After adding it, for safety and peace of mind, you can remove the extensions.csproj file but, if all is working correctly, you should be able to keep it around. Then, after enough time goes by, the extension bundles should automatically contain the durable-extension bits that you manually downloaded with the instructions you followed.

(2) On whether this affects your deployment strategy
It should not! You should be able to deploy just the same :)

Finally, I'm just a tad concerned about some of the magically disappearing warnings you experienced, but if they're no longer there, then it must be that they got resolved on their own. Maybe it was a timing or ordering issue :)

Now getting to your latest comment....

I'm cautiously optimistic about it! Just for clarity, the orchestration code no longer gets stuck, correct? It's just that the Application Insights telemetry seems to be inconsistent for a single orchestration and still reports it as Running. Does that match your understanding as well? If so, I'll take that partial win :)

Looking at the logs on our end, I can also see these orchestrations appear to have completed successfully. So this appears to be "just" an issue in propagating that information to Application Insights. Please let me know if this matches your understanding as well, we can tackle this derivative issue next.

Thank you!

@scherand
Copy link

scherand commented Jul 5, 2021

Hi @davidmrdavid

Thanks heaps for following up with the questions about reverting the change and GitHub Actions for deployment. Sounds great to me 😃.

I agree about the warnings: since they are gone and I have not run into any issues yet, I assume all is good. I need to better understand the build process at some point; that might provide additional clues. Somehow it seems that the build process (oryx?) is using a different environment (venv?) than the one configured in VS Code. For a long time (I believe since before I even opened this issue) it has been reporting that a newer version of pip is available, even though the newest version is installed in the venv used by VS Code.

From the Python shell in VS Code:

(.venv) PS C:\Users\...> pip --version
pip 21.1.3 from c:\users\...\.venv\lib\site-packages\pip (python 3.8)

From the build process:

...: WARNING: You are using pip version 20.2.3; however, version 21.1.3 is available.
...: You should consider upgrading via the '/tmp/oryx/platforms/python/3.8.6/bin/python3.8 -m pip install --upgrade pip' command.

Now for the fun part 😃:

This is about the "stuck" orchestration 458aa5e1e7b04f0096ac072d251f4f89 again. It is correct that I believe the execution actually succeeded. I deduct this from the entries in the history table of the durable function. Mainly the two entries stating "ExecutionCompleted" and "OrchestratorCompleted".

And it is also true that the information in Application Insights does not seem to reflect everything from the history table. Mainly I only see twelve calls to CalculationCleanup instead of the expected 25.

BUT, and that seems to be a bit more serious, when calling the statusQueryGetUri (the one like https://...azurewebsites.net/runtime/webhooks/durabletask/instances/458aa5e1e7b04f0096ac072d251f4f89?taskHub=...&connection=Storage&code=0e...w==) it comes back with a runtimeStatus of Running. So client code assumes the function is not done (yet). At the end of the day I do not know if this is "just" related to a failing propagation of information to Application Insights or if this is an issue on its own.
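
For what it's worth, the same runtime status can also be checked programmatically from a client function instead of calling the statusQueryGetUri URL; a minimal sketch (the "starter" binding name and the hard-coded instance id are just for illustration):

import azure.functions as func
import azure.durable_functions as df


async def main(req: func.HttpRequest, starter: str) -> func.HttpResponse:
    client = df.DurableOrchestrationClient(starter)
    # reports the same runtime status that statusQueryGetUri returns
    status = await client.get_status("458aa5e1e7b04f0096ac072d251f4f89")
    return func.HttpResponse(f"runtimeStatus: {status.runtime_status}")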

Cheers

@ConnorMcMahon

ConnorMcMahon commented Jul 6, 2021

@scherand, we have found this is a race condition if a crash happens during the last replay of an orchestration between updating the orchestration history and updating the state table. We have filed an issue here: Azure/durabletask#561.

This bug has been around for a couple of years, so I am surprised we haven't seen it until now. I'm presuming it is incredibly rare, but we will make sure to fix this for our next Durable Functions extension release.

@davidmrdavid
Collaborator

Great! In that case, it appears to me that we've fixed the issue originally described in this thread :) . As Connor mentioned, the final issue you're seeing is now being tracked in another thread, and we'll look to patch it as soon as possible.

I'll be closing this thread for now, but do feel free to re-open it if the original issue re-occurs. Thank you for your collaboration and patience with us throughout this process, we've fixed some major correctness concerns in the process of investigating this incident, so this was really productive on our end, thank you and reach out again if you need anything!

@Vidarls

Vidarls commented Aug 30, 2021

Not sure if this is considered solved, but I wanted to let you know that this is still an issue with the default extension pack.

We had some retried IO activity functions that produced a lot of stuck orchestrations.

I followed the guide for installing the latest durable extensions (#280 (comment)), and so far it seems to be solving our issues (at least when running locally).

But it took a bit of digging to find this issue, and the solution within it.

@davidmrdavid
Collaborator

Hi @Vidarls, thank you for reaching out and sorry that you were having this issue.

As you've identified, there's a bit of a delay from the time when we make a durable-extension release to its inclusion in the default extension bundles. In the long term, my hope is to reduce that time difference so that users can leverage extension bundles to get the latest fixes automatically and almost as soon as they're released, but for now that's not possible.

I'm glad to know that the issue seems resolved on your end now, please feel free to reach out if it re-appears. In the meantime, if there was anything we could have done to help bring more visibility to this thread's bug, and its mitigation, I would be open to discussing that in a new GitHub issue. Thanks!
