Allow reraising the root exception if instrumentation fails #3545

Merged
merged 6 commits into from
Jun 4, 2025

Conversation

phillipuniverse
Contributor

@phillipuniverse phillipuniverse commented May 27, 2025

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

I would rather completely fail startup in my services if instrumentation fails for whatever reason instead of just logging an exception and continuing.

Use case:

from opentelemetry.instrumentation import auto_instrumentation

auto_instrumentation.initialize(swallow_exceptions=False)

Fixes #3550

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

New unit test

Does This PR Require a Core Repo Change?

  • Yes. - Link to PR:
  • No.

Checklist:

See contributing.md for styleguide, changelog guidelines, and more.

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@phillipuniverse phillipuniverse requested a review from a team as a code owner May 27, 2025 13:22
@tammy-baylis-swi
Contributor

Hi thanks for your input and contribution.

Generally speaking, this opt-in does go against the OTel specification on error handling and its mission: we avoid disrupting regular service operation if instrumentation fails and, whenever possible, record errors as telemetry.

Could you please share more details about your use case? E.g. production requirements, user onboarding friction, which docs are being followed.

@phillipuniverse
Contributor Author

@tammy-baylis-swi thanks for the context. My use case is as follows:

  • I centralize my organization's otel support in a common Python library, something like mycompany-common-python. We apply various standards on top of stock otel, like setting versions, making some instrumentation customizations, etc.

  • All of our microservices depend on mycompany-common-python and transitively pull in all the otel dependencies.

  • In each service they call something like:

    from mycompany_common_python import otel
    
    otel.initialize(service_name="myservice")
  • The mycompany_common_python.otel.initialize() ends up calling autoinstrumentation.initialize() (from my example above)
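
For illustration, here is a minimal sketch of what that wrapper might look like. mycompany_common_python, its otel.initialize() signature, and the swallow_exceptions flag are taken from the descriptions above and are hypothetical, not a real package or the final API:

# mycompany_common_python/otel.py (hypothetical)
import os

from opentelemetry.instrumentation import auto_instrumentation


def initialize(service_name: str) -> None:
    # Apply org-wide conventions before handing off to OpenTelemetry.
    os.environ.setdefault("OTEL_SERVICE_NAME", service_name)
    # Opt in to failing loudly if instrumentation setup raises
    # (flag name taken from this PR's description).
    auto_instrumentation.initialize(swallow_exceptions=False)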

So when I do something like upgrade the otel version, that process looks like this:

  1. Bump otel versions in mycompany-common-python
  2. Ensure tests pass, etc in mycompany-common-python
  3. Release new version of mycompany-common-python

We then automate the upgrade of mycompany-common-python with Dependabot (this isn't a super important detail, but I think it will help illustrate how I get into a problematic state). So over the next few days, Dependabot goes through and bumps the version of mycompany-common-python in each of my services. Eventually that code makes it into production.

Now, sometime later, it turns out there's some sort of bug in the otel auto-instrumentation due to the release, or maybe, even within my own service, a new version of something like FastAPI or Django has been released that causes auto-instrumentation to fail in some way. But I'm not notified by automated tests anywhere, since it's only an exception log, not a raised exception. This has happened at various times and in various ways over the years, but it can eventually be traced back to an exception in the auto-instrumentation.

The issue is that this problem is hard to spot from a single error log. I'll end up noticing it when I go to look for some specific span and can't find it, only to discover that the instrumentation failed in some way and nothing was actually being instrumented. But by then it's too late, because I have already lost out on an important aspect of traceability for whatever problem I'm trying to diagnose.

I personally would much rather fail very loudly whenever there are instrumentation problems and prevent the service from starting up. Since I have a staging environment on k8s, a pod will fail to start if instrumentation fails, which matches the level of importance I put on my services emitting spans and being instrumented.


To make matters worse, if there's an exception in one instrumentation it affects all of the instrumentations after it, because of how we re-raise here (in _load_instrumentors()):

except Exception as exc:  # pylint: disable=broad-except
    _logger.exception("Instrumenting of %s failed", entry_point.name)
    raise exc

So one instrumentation failure in something like asyncpg could prevent everything loaded after it (fastapi, whatever) from being loaded too. Again, something I really want to know about as loudly as possible.
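
A minimal sketch of the opt-in direction this PR proposes, assuming a flag on the loading loop; the flag name and the simplified loop are illustrative, not necessarily the merged implementation:

from importlib.metadata import entry_points
import logging

_logger = logging.getLogger(__name__)


def _load_instrumentors(distro, swallow_exceptions=True):  # flag name assumed
    for entry_point in entry_points(group="opentelemetry_instrumentor"):
        try:
            distro.load_instrumentor(entry_point, skip_dep_check=True)
            _logger.debug("Instrumented %s", entry_point.name)
        except Exception:  # pylint: disable=broad-except
            _logger.exception("Instrumenting of %s failed", entry_point.name)
            if not swallow_exceptions:
                raise  # opt-in: abort startup loudly
            # default: keep going, so one bad instrumentor doesn't block the rest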

@phillipuniverse
Contributor Author

Also @tammy-baylis-swi, I read through the docs you linked. Probably the most important aspect that this PR may seem at odds with is this one:

We assume that users would prefer to lose telemetry data rather than have the library significantly change the behavior of the instrumented application.

I agree with this stance; however, I would argue this refers to when spans are emitted. My scenario is more about the initialization procedure. During initialization (especially if I'm the one manually invoking the auto-instrumentation), I would much rather see a hard failure and then decide on my own what to do with it (swallow it or not).

I do think keeping the log-don't-raise behavior is better (and it is preserved in this PR) for places where the auto-instrumentation is kicked off by e.g. the opentelemetry-instrument command.
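
As a caller-side sketch of that split (the flag name follows this PR's description and the import path assumes the opentelemetry-instrumentation package; neither is a statement of the final API):

from opentelemetry.instrumentation import auto_instrumentation

try:
    # Manual invocation during service startup: ask for failures to propagate.
    auto_instrumentation.initialize(swallow_exceptions=False)
except Exception:
    # Now it's the caller's decision: crash the service, alert, or
    # continue without telemetry.
    raise

# The opentelemetry-instrument entry point would keep calling initialize()
# with the default (swallow-and-log) behavior, per the paragraph above.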

Member

@aabmass aabmass left a comment

I think I'm on board with this option, especially since it's opt-in. The spec allows failing fast, IIRC.

@phillipuniverse phillipuniverse requested a review from aabmass May 28, 2025 01:36
@tammy-baylis-swi
Contributor

Thank you for that great amount of detail.

I think that's fair. This is the part of the spec Aaron might have been referring to about fail-fast, which can fit here since this is about init, as long as nothing fails later at run time:

  1. The API or SDK MAY fail fast and cause the application to fail on initialization, e.g. because of a bad user config or environment, but MUST NOT cause the application to fail later at run time, e.g. due to dynamic config settings received from the Collector.

Alternatively, individual vendors could patch Otel init or set up some log filters for better visibility.

@phillipuniverse please could you also create and link a feature request Issue to this PR.

@phillipuniverse
Contributor Author

@tammy-baylis-swi done!

I would rather completely fail startup in my services if instrumentation fails for whatever reason instead of just logging an exception and continuing.

Use case:

from opentelemetry import autoinstrumentation

autoinstrumentation.initialize(reraise=True)
Contributor

@tammy-baylis-swi tammy-baylis-swi left a comment

Thanks again!

@xrmx xrmx enabled auto-merge (squash) June 4, 2025 08:44
@xrmx xrmx merged commit b730182 into open-telemetry:main Jun 4, 2025
625 checks passed

Successfully merging this pull request may close these issues.

Allow failing fast for auto-instrumentation problems