Allow reraising the root exception if instrumentation fails #3545
Conversation
Force-pushed from a262d64 to 27d9977
Hi, thanks for your input and contribution. Generally speaking, the opt-in does go against the OTel specification's error-handling guidance and its mission: avoid disrupting regular service operation if instrumentation fails and, whenever possible, record errors as telemetry. Could you please share more details about your use case? e.g. production requirements, user onboarding friction, which docs you are following.
@tammy-baylis-swi thanks for the context. My use case is as follows:
So when I do something like upgrading the OTel version, the process looks like this: we automate the upgrade of the OTel dependencies.

Now, sometime later, it turns out there's some sort of bug in the OTel auto-instrumentation due to the release, or maybe, even in my own service, a new version of something like FastAPI or Django is released that causes auto-instrumentation to fail in some way. But I'm not notified by automated tests anywhere, since the failure is only an exception log, not a raised exception. This has happened at various times and in various ways over the years, and it can eventually be traced back to an exception in the auto-instrumentation.

The issue is that this problem is hard to spot from a single error log. I end up noticing it when I go looking for some specific span and can't find it, only to discover that the instrumentation failed in some way and nothing was actually being instrumented. But by then it's too late, because I've already lost an important aspect of traceability for whatever problem I'm trying to diagnose.

I personally would much rather fail very loudly whenever there are instrumentation problems and have the service fail to start up. Since I have a staging environment on Kubernetes, a pod will fail to start if instrumentation fails, which matches the level of importance I put on my services emitting spans and being instrumented.

To make matters worse, if there's an exception in one instrumentation, it affects all instrumentations after it because of how we re-raise here:

Lines 104 to 106 in a164d37
So one instrumentation failure in something like asyncpg could prevent everything loaded after it (FastAPI, whatever) from being loaded too. Again, something I really want to know about as loudly as possible.
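To make that failure mode concrete, here is a minimal sketch. It is not the actual OTel source; the function names and log message are illustrative. It shows a loading loop that re-raises on the first bad instrumentor while the caller only logs, so everything after the failing entry point is silently skipped:

```python
import logging

logger = logging.getLogger(__name__)


def _load_instrumentors(entry_points):
    # Illustrative: one exception here aborts the whole loop, so instrumentors
    # that come after the failing one (fastapi, django, ...) never get loaded.
    for entry_point in entry_points:
        instrumentor_cls = entry_point.load()  # e.g. an asyncpg instrumentor
        instrumentor_cls().instrument()


def initialize(entry_points):
    try:
        _load_instrumentors(entry_points)
    except Exception:
        # Current behavior: the failure becomes a single error log and the
        # service keeps starting up with little or no instrumentation applied.
        logger.exception("Failed to auto-initialize OpenTelemetry")
```

In this scenario a pod in staging still starts cleanly, which is exactly the silent failure described above.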
Force-pushed from 342014b to 6606756
Also, @tammy-baylis-swi, I read through the docs you linked. Probably the most important aspect that this PR may seem at odds with is this one:
I agree with this stance; however, I would argue it refers to when spans are emitted. My scenario is more about the initialization procedure. During initialization (especially if I'm the one manually invoking the auto-instrumentation), I would much rather see a hard failure and then decide on my own what to do with it (swallow it or not). I do think keeping the log-don't-raise behavior is better (and it is preserved in this PR) for places where the auto-instrumentation is kicked off automatically rather than by an explicit call.
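As a rough sketch of the initialization-time behavior being argued for here, assuming the `reraise` flag proposed in this PR (the exact import path and signature may differ from what finally lands):

```python
from opentelemetry.instrumentation.auto_instrumentation import initialize


def start_service():
    try:
        # Opt in to a hard failure if any instrumentor fails to load.
        initialize(reraise=True)
    except Exception:
        # The service owner decides what a hard failure means: re-raise to
        # fail startup (and the pod), or swallow to keep the old
        # log-and-continue behavior.
        raise
```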
I think I'm on board with this option, especially since it's opt-in. The spec allows failing fast, IIRC.
Thank you for that great amount of detail. I think that's fair. This is the part of the spec Aaron might have been referring to about fail-fast, which can fit since this is about init, as long as nothing fails later:
Alternatively, individual vendors could patch OTel init or set up some log filters for better visibility. @phillipuniverse could you please also create a feature-request issue and link it to this PR.
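For the log-filter alternative mentioned above, a minimal sketch might look like the following; the logger name and the decision to exit the process are assumptions, not anything the SIG has prescribed:

```python
import logging
import sys


class FailOnInstrumentationError(logging.Handler):
    """Escalate instrumentation-failure logs instead of letting them scroll by."""

    def emit(self, record):
        if record.levelno >= logging.ERROR:
            # Crash the process so a broken rollout is caught in staging/CI
            # rather than discovered later as missing spans.
            sys.exit(f"instrumentation failed: {record.getMessage()}")


# Assumed logger namespace for the instrumentation packages.
logging.getLogger("opentelemetry.instrumentation").addHandler(
    FailOnInstrumentationError()
)
```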
I would rather completely fail startup in my services if instrumentation fails for whatever reason instead of just logging an exception and continuing. Use case:

```python
from opentelemetry import autoinstrumentation

autoinstrumentation.initialize(reraise=True)
```
Force-pushed from c0141b8 to 773d4ce
Thanks again!
Description
I would rather completely fail startup in my services if instrumentation fails for whatever reason instead of just logging an exception and continuing.
Use case:
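A minimal sketch of that use case, mirroring the snippet quoted earlier in the thread (the `autoinstrumentation` import path and the `reraise` keyword are the API proposed by this PR, not an existing public interface):

```python
# Proposed opt-in: re-raise instead of only logging if instrumentation fails.
from opentelemetry import autoinstrumentation

autoinstrumentation.initialize(reraise=True)
```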
Fixes #3550
Type of change
How Has This Been Tested?
New unit test
Does This PR Require a Core Repo Change?
Checklist:
See contributing.md for styleguide, changelog guidelines, and more.