Allow reraising the root exception if instrumentation fails #3545

Merged
merged 6 commits into from
Jun 4, 2025

Conversation

phillipuniverse
Contributor

@phillipuniverse phillipuniverse commented May 27, 2025

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

I would rather completely fail startup in my services if instrumentation fails for whatever reason instead of just logging an exception and continuing.

Use case:

from opentelemetry.instrumentation import auto_instrumentation

auto_instrumentation.initialize(swallow_exceptions=False)

Fixes #3550

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

New unit test

Does This PR Require a Core Repo Change?

  • Yes. - Link to PR:
  • No.

Checklist:

See contributing.md for styleguide, changelog guidelines, and more.

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@phillipuniverse phillipuniverse requested a review from a team as a code owner May 27, 2025 13:22
@tammy-baylis-swi
Contributor

Hi thanks for your input and contribution.

Generally speaking, this opt-in does go against the OTel specification on error handling and its mission: we avoid disrupting regular service operation if instrumentation fails and, whenever possible, record errors as telemetry.

Could you please share more details about your use case? E.g. production requirements, user onboarding friction, which docs are being followed.

@phillipuniverse
Contributor Author

@tammy-baylis-swi thanks for the context. My use case is as follows:

  • I centralize my organization's otel support in a common Python library, something like mycompany-common-python. We apply various standards on top of stock otel, like setting versions, making some instrumentation customizations, etc.

  • All of our microservices depend on mycompany-common-python and transitively pull in all the otel dependencies.

  • In each service they call something like:

    from mycompany_common_python import otel
    
    otel.initialize(service_name="myservice")
  • The mycompany_common_python.otel.initialize() ends up calling autoinstrumentation.initialize() (from my example above)
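
For illustration, here is a minimal sketch of what that wrapper might look like. mycompany_common_python, its otel.initialize() signature, and the swallow_exceptions flag are taken from the descriptions above and are hypothetical, not a real package or the final API:

# mycompany_common_python/otel.py (hypothetical)
import os

from opentelemetry.instrumentation import auto_instrumentation


def initialize(service_name: str) -> None:
    # Apply org-wide conventions before handing off to OpenTelemetry.
    os.environ.setdefault("OTEL_SERVICE_NAME", service_name)
    # Opt in to failing loudly if instrumentation setup raises
    # (flag name taken from this PR's description).
    auto_instrumentation.initialize(swallow_exceptions=False)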

So when I do something like upgrade the otel version, that process looks like this:

  1. Bump otel versions in mycompany-common-python
  2. Ensure tests pass, etc in mycompany-common-python
  3. Release new version of mycompany-common-python

We then automate the upgrade of mycompany-common-python with Dependabot (this isn't a super important detail, but I think it will help illustrate how I get into a problematic state). So over the next few days, Dependabot goes through and bumps the version of mycompany-common-python in each of my services. Eventually that code makes it into production.

Now, sometime later, it turns out there's some sort of bug in the otel auto-instrumentation due to the release, or maybe, even within my own service, a new version of something like FastAPI or Django has been released that causes auto-instrumentation to fail in some way. But I'm not notified by automated tests anywhere, since it's only an exception log, not a raised exception. This has happened at various times and in various ways over the years, but it can eventually be traced back to an exception in the auto-instrumentation.

The issue is that this problem is hard to spot from a single error log. I'll end up noticing it when I go to look for some specific span and can't find it, only to discover that the instrumentation failed in some way and nothing was actually being instrumented. But by then it's too late, because I have already lost out on an important aspect of traceability for whatever problem I'm trying to diagnose.

I personally would much rather fail very loudly whenever there are instrumentation problems and prevent the service from starting up. Since I have a staging environment on k8s, a pod will fail to start if instrumentation fails, which matches the level of importance I put on my services emitting spans and being instrumented.


To make matters worse, if there's an exception in one instrumentation it affects all of the instrumentations after it, because of how we re-raise here (in _load_instrumentors()):

except Exception as exc:  # pylint: disable=broad-except
    _logger.exception("Instrumenting of %s failed", entry_point.name)
    raise exc

So one instrumentation failure in something like asyncpg could prevent everything loaded after it (fastapi, whatever) from being loaded too. Again, something I really want to know about as loudly as possible.
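
A minimal sketch of the opt-in direction this PR proposes, assuming a flag on the loading loop; the flag name and the simplified loop are illustrative, not necessarily the merged implementation:

from importlib.metadata import entry_points
import logging

_logger = logging.getLogger(__name__)


def _load_instrumentors(distro, swallow_exceptions=True):  # flag name assumed
    for entry_point in entry_points(group="opentelemetry_instrumentor"):
        try:
            distro.load_instrumentor(entry_point, skip_dep_check=True)
            _logger.debug("Instrumented %s", entry_point.name)
        except Exception:  # pylint: disable=broad-except
            _logger.exception("Instrumenting of %s failed", entry_point.name)
            if not swallow_exceptions:
                raise  # opt-in: abort startup loudly
            # default: keep going, so one bad instrumentor doesn't block the rest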

@phillipuniverse
Contributor Author

Also @tammy-baylis-swi, I read through the docs you linked. Probably the most important aspect that this PR may seem at odds with is this one:

We assume that users would prefer to lose telemetry data rather than have the library significantly change the behavior of the instrumented application.

I agree with this stance; however, I would argue this refers to when spans are emitted. My scenario is more about the initialization procedure. During initialization (especially if I'm the one manually invoking the auto-instrumentation), I would much rather see a hard failure and then decide on my own what to do with it (swallow it or not).

I do think keeping the log-don't-raise behavior is better (and it is preserved in this PR) for places where the auto-instrumentation is kicked off by e.g. the opentelemetry-instrument command.
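
As a caller-side sketch of that split (the flag name follows this PR's description and the import path assumes the opentelemetry-instrumentation package; neither is a statement of the final API):

from opentelemetry.instrumentation import auto_instrumentation

try:
    # Manual invocation during service startup: ask for failures to propagate.
    auto_instrumentation.initialize(swallow_exceptions=False)
except Exception:
    # Now it's the caller's decision: crash the service, alert, or
    # continue without telemetry.
    raise

# The opentelemetry-instrument entry point would keep calling initialize()
# with the default (swallow-and-log) behavior, per the paragraph above.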

Member

@aabmass aabmass left a comment

I think I'm on board with this option, especially since it's opt-in. The spec allows failing fast, IIRC.

@phillipuniverse phillipuniverse requested a review from aabmass May 28, 2025 01:36
@tammy-baylis-swi
Contributor

Thank you for that great amount of detail.

I think that's fair. This is the part of the spec Aaron might have been referring to about fail-fast, which can fit here since this is about init, as long as nothing fails later at run time:

  1. The API or SDK MAY fail fast and cause the application to fail on initialization, e.g. because of a bad user config or environment, but MUST NOT cause the application to fail later at run time, e.g. due to dynamic config settings received from the Collector.

Alternatively, individual vendors could patch Otel init or set up some log filters for better visibility.

@phillipuniverse please could you also create and link a feature request Issue to this PR.

@phillipuniverse
Contributor Author

@tammy-baylis-swi done!

I would rather completely fail startup in my services if instrumentation fails for whatever reason instead of just logging an exception and continuing.

Use case:

from opentelemetry import autoinstrumentation

autoinstrumentation.initialize(reraise=True)
Contributor

@tammy-baylis-swi tammy-baylis-swi left a comment

Thanks again!

@xrmx xrmx enabled auto-merge (squash) June 4, 2025 08:44
@xrmx xrmx merged commit b730182 into open-telemetry:main Jun 4, 2025
625 checks passed

Successfully merging this pull request may close these issues.

Allow failing fast for auto-instrumentation problems