- [Tracing Requests and Exporting Spans](#tracing-requests-and-exporting-spans)
15
+
- [Connected Traces with Nested Spans](#connected-traces-with-nested-spans)
15
16
- [Running the OpenTelemetry Collector](#running-the-opentelemetry-collector)
16
17
- [Kubelet Configuration](#kubelet-configuration)
17
18
- [Design Details](#design-details)
@@ -43,16 +44,16 @@
43
44
44
45
Items marked with (R) are required *prior to targeting to a milestone / release*.
45
46
46
-
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
47
-
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
48
-
- [ ] (R) Design details are appropriately documented
49
-
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
50
-
- [ ] (R) Graduation criteria is in place
51
-
- [ ] (R) Production readiness review completed
52
-
- [ ] Production readiness review approved
53
-
- [ ] "Implementation History" section is up-to-date for milestone
54
-
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
55
-
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
47
+
- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
48
+
- [X] (R) KEP approvers have approved the KEP status as `implementable`
49
+
- [X] (R) Design details are appropriately documented
50
+
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
51
+
- [X] (R) Graduation criteria is in place
52
+
- [X] (R) Production readiness review completed
53
+
- [X] Production readiness review approved
54
+
- [X] "Implementation History" section is up-to-date for milestone
55
+
- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
56
+
- [X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
56
57
57
58
## Summary
58
59
@@ -105,7 +106,7 @@ From there, OpenTelemetry trace data can be exported to a tracing backend of cho
105
106
specified within `kubernetes/component-base` with the default URL to make the kubelet send its spans to the collector.
106
107
Alternatively, I can point the kubelet at an OpenTelemetry collector listening on a different port or URL if I need to.
107
108
108
-
#### Continuous Trace Collection
109
+
#### Continuous trace collection
109
110
110
111
As a cluster administrator or cloud provider, I would like to collect gRPC and HTTP trace data from the transactions between the API server and the
111
112
kubelet and interactions with a node's container runtime (Container Runtime Interface) to debug cluster problems. I can set the `SamplingRatePerMillion`
@@ -114,7 +115,7 @@ debug, I can search span metadata or specific nodes to find a trace which displa
114
115
The sampling rate for trace exports can be configured based on my needs. I can collect each node's kubelet trace data as distinct tracing services
115
116
to diagnose node issues.
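
To make the sampling rate concrete, here is a minimal sketch of how a parts-per-million setting could translate into a per-trace decision. It is not the kubelet's actual code: the local `TracingConfiguration` type only mirrors the `SamplingRatePerMillion` field named above, the helper names `samplingFraction` and `shouldSample` are hypothetical, and treating an unset value as "sample nothing" is an assumption. The decision logic is modeled on a trace-ID ratio sampler, so all spans of one trace get the same answer.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// TracingConfiguration is a hypothetical local mirror of the kubelet tracing
// configuration; only the sampling field is modeled here.
type TracingConfiguration struct {
	SamplingRatePerMillion *int32
}

// samplingFraction converts parts-per-million into a fraction in [0, 1].
// Assumption: an unset rate means no traces are sampled.
func samplingFraction(cfg TracingConfiguration) float64 {
	if cfg.SamplingRatePerMillion == nil {
		return 0
	}
	r := *cfg.SamplingRatePerMillion
	switch {
	case r <= 0:
		return 0
	case r >= 1_000_000:
		return 1
	}
	return float64(r) / 1_000_000
}

// shouldSample derives a deterministic decision from the trace ID, so every
// component that sees the same trace ID reaches the same decision.
func shouldSample(traceID [16]byte, fraction float64) bool {
	if fraction >= 1 {
		return true
	}
	if fraction <= 0 {
		return false
	}
	// Treat the low 8 bytes as a uniformly distributed 63-bit number.
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	bound := uint64(fraction * (1 << 63))
	return x < bound
}

func main() {
	rate := int32(250_000) // 25% of traces
	cfg := TracingConfiguration{SamplingRatePerMillion: &rate}
	fraction := samplingFraction(cfg)

	var lowID, highID [16]byte
	for i := 8; i < 16; i++ {
		highID[i] = 0xFF
	}
	fmt.Println(fraction, shouldSample(lowID, fraction), shouldSample(highID, fraction))
}
```

Deriving the decision from the trace ID rather than a random draw keeps a trace either fully sampled or fully dropped across components that use the same rate.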
116
117
117
-
##### Example Scenarios
118
+
##### Example scenarios
118
119
119
120
* Latency or timeout experienced when:
120
121
* Attach or exec to running containers
@@ -140,6 +141,17 @@ to generate spans for sampled incoming requests and propagate context with clien
140
141
141
142
OpenTelemetry-Go provides the [propagation package](https://github.com/open-telemetry/opentelemetry-go/blob/main/propagation/propagation.go) with which you can add custom key-value pairs known as [baggage](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/baggage/api.md). Baggage data will be propagated across services within contexts.
142
143
144
+
### Connected Traces with Nested Spans
145
+
146
+
With the initial implementation of this proposal, kubelet tracing produced disconnected spans, because context was not wired through kubelet CRI calls.
147
+
With [this PR](https://github.com/kubernetes/kubernetes/pull/113591), context is now plumbed between CRI calls and kubelet.
148
+
It is now possible to connect spans for CRI calls. Nested spans with top-level traces in the kubelet will connect CRI calls together.
149
+
Nested spans will be created for the following:
150
+
* Sync Loops (e.g. syncPod, eviction manager, various gc routines) where the kubelet initiates new work.
151
+
  * [top-level traces for pod sync and GC](https://github.com/kubernetes/kubernetes/pull/114504)
* Outgoing requests (CNI, CSI, device plugin, k8s API calls)
154
+
143
155
### Running the OpenTelemetry Collector
144
156
145
157
Although this proposal focuses on running the [OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector), note that any
@@ -187,74 +199,32 @@ type TracingConfiguration struct {
187
199
188
200
### Test Plan
189
201
190
-
<!--
191
-
**Note:** *Not required until targeted at a release.*
192
-
The goal is to ensure that we don't accept enhancements with inadequate testing.
193
-
All code is expected to have adequate tests (eventually with coverage
194
-
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
of a node? **No, restarting the kubelet with feature-gate disabled will disable tracing**
307
285
308
286
##### Does enabling the feature change any default behavior?
309
-
No. The feature is disabled unlesss the feature gate is enabled and the TracingConfiguration is populated in Kubelet Configuration.
287
+
No. The feature is disabled unless the feature gate is enabled and the TracingConfiguration is populated in Kubelet Configuration.
310
288
When the feature is enabled, it doesn't change behavior from the users' perspective; it only adds tracing telemetry.
311
289
312
290
##### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
@@ -316,8 +294,8 @@ GA
316
294
It will start generating and exporting traces again.
317
295
318
296
##### Are there any tests for feature enablement/disablement?
319
-
Unit tests switching feature gates will be added. Manual testing of disabling, reenabling the feature on nodes, ensuring the kubelet comes up w/out error will
320
-
also be performed.
297
+
Because enabling or disabling kubelet tracing is an in-memory switch, explicit enablement/disablement tests would add little value and will not be added.
298
+
Manual testing of disabling and re-enabling the feature on nodes, ensuring the kubelet comes up without error, will be performed and documented.
321
299
322
300
### Rollout, Upgrade and Rollback Planning
323
301
@@ -328,7 +306,17 @@ _This section must be completed when targeting beta graduation to a release._
328
306
No impact to running workloads, logs will indicate the problem.
329
307
330
308
###### What specific metrics should inform a rollback?
331
-
To be determined.
309
+
310
+
* This KEP is tracking [opentelemetry-go issue #2547](https://github.com/open-telemetry/opentelemetry-go/issues/2547), which states:
311
+
312
+
```
313
+
...using the OTLP trace exporter, it isn't currently possible to monitor (with metrics) whether or not spans are being successfully collected and exported.
314
+
For example, if my SDK cannot connect to an opentelemetry collector, and isn't able to send traces, I would like to be able to measure how many traces are collected,
315
+
vs how many are not sent. I would like to be able to set up SLOs to measure successful trace delivery from my applications.
316
+
```
317
+
318
+
* Pod Lifecycle and Kubelet [SLOs](https://github.com/kubernetes/community/tree/master/sig-scalability/slos) are the signals that should guide a rollback. In particular, the `kubelet_pod_start_duration_seconds_count`, `kubelet_runtime_operations_errors_total`, and `kubelet_pleg_relist_interval_seconds_bucket` metrics would surface issues affecting kubelet performance.
319
+
332
320
333
321
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
334
322
Upgrades and rollbacks will be tested while the feature gate is experimental.
@@ -357,7 +345,7 @@ _This section must be completed when targeting beta graduation to a release._
357
345
##### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?