Commit 33f7b95

Merge pull request #3714 from sallyom/kep2831-update-1.27
KEP-2831: adding beta graduation criteria
2 parents c80f656 + a9c36f7 commit 33f7b95

3 files changed: +72 -78 lines changed
Original file line number | Diff line number | Diff line change
@@ -1,4 +1,6 @@
11
kep-number: 2831
22
alpha:
33
approver: "@ehashman"
4+
beta:
5+
approver: "@wojtek-t"
46

keps/sig-instrumentation/2831-kubelet-tracing/README.md

+63 -74
@@ -9,9 +9,10 @@
99
- [Non-Goals](#non-goals)
1010
- [Proposal](#proposal)
1111
- [User Stories](#user-stories)
12-
- [Continuous Trace Collection](#continuous-trace-collection)
13-
- [Example Scenarios](#example-scenarios)
12+
- [Continuous trace collection](#continuous-trace-collection)
13+
- [Example scenarios](#example-scenarios)
1414
- [Tracing Requests and Exporting Spans](#tracing-requests-and-exporting-spans)
15+
- [Connected Traces with Nested Spans](#connected-traces-with-nested-spans)
1516
- [Running the OpenTelemetry Collector](#running-the-opentelemetry-collector)
1617
- [Kubelet Configuration](#kubelet-configuration)
1718
- [Design Details](#design-details)
@@ -43,16 +44,16 @@
4344

4445
Items marked with (R) are required *prior to targeting to a milestone / release*.
4546

46-
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
47-
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
48-
- [ ] (R) Design details are appropriately documented
49-
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
50-
- [ ] (R) Graduation criteria is in place
51-
- [ ] (R) Production readiness review completed
52-
- [ ] Production readiness review approved
53-
- [ ] "Implementation History" section is up-to-date for milestone
54-
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
55-
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
47+
- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
48+
- [X] (R) KEP approvers have approved the KEP status as `implementable`
49+
- [X] (R) Design details are appropriately documented
50+
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
51+
- [X] (R) Graduation criteria is in place
52+
- [X] (R) Production readiness review completed
53+
- [X] Production readiness review approved
54+
- [X] "Implementation History" section is up-to-date for milestone
55+
- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
56+
- [X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
5657

5758
## Summary
5859

@@ -105,7 +106,7 @@ From there, OpenTelemetry trace data can be exported to a tracing backend of cho
105106
specified within `kubernetes/component-base` with the default URL to make the kubelet send its spans to the collector.
106107
Alternatively, I can point the kubelet at an OpenTelemetry collector listening on a different port or URL if I need to.
107108

108-
#### Continuous Trace Collection
109+
#### Continuous trace collection
109110

110111
As a cluster administrator or cloud provider, I would like to collect gRPC and HTTP trace data from the transactions between the API server and the
111112
kubelet and interactions with a node's container runtime (Container Runtime Interface) to debug cluster problems. I can set the `SamplingRatePerMillion`
@@ -114,7 +115,7 @@ debug, I can search span metadata or specific nodes to find a trace which displa
114115
The sampling rate for trace exports can be configured based on my needs. I can collect each node's kubelet trace data as distinct tracing services
115116
to diagnose node issues.
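
To make the sampling knob concrete, here is a minimal Python sketch of how a per-million rate could translate into a per-trace decision. It is an illustration only and not the kubelet's implementation: the function name `should_sample` and the choice to derive the decision from the low 64 bits of the trace ID are assumptions, loosely modeled on ratio-based samplers that decide deterministically per trace ID.

```python
"""Toy ratio-based sampler driven by a per-million rate (illustrative only)."""

MILLION = 1_000_000


def should_sample(trace_id: int, rate_per_million: int) -> bool:
    """Return True if the trace identified by trace_id should be sampled.

    The decision is deterministic per trace ID, so every caller that sees
    the same trace ID and the same rate reaches the same answer.
    """
    if rate_per_million <= 0:
        return False  # a rate of 0 disables sampling entirely
    if rate_per_million >= MILLION:
        return True  # a rate of 1,000,000 samples every trace
    # Treat the low 64 bits of the 128-bit trace ID as a uniform value.
    low_bits = trace_id & 0xFFFFFFFFFFFFFFFF
    # Portion of the 64-bit space that falls inside the configured rate.
    threshold = (rate_per_million * (1 << 64)) // MILLION
    return low_bits < threshold


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    ids = [rng.getrandbits(128) for _ in range(100_000)]
    sampled = sum(should_sample(t, 10_000) for t in ids)  # 1% target rate
    print(f"sampled {sampled} of {len(ids)} traces")
```

Under these assumptions, a rate of 10,000 per million keeps roughly 1% of traces, which is the kind of trade-off between export volume and coverage the user story describes.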
116117

117-
##### Example Scenarios
118+
##### Example scenarios
118119

119120
* Latency or timeout experienced when:
120121
* Attach or exec to running containers
@@ -140,6 +141,17 @@ to generate spans for sampled incoming requests and propagate context with clien
140141

141142
OpenTelemetry-Go provides the [propagation package](https://github.com/open-telemetry/opentelemetry-go/blob/main/propagation/propagation.go) with which you can add custom key-value pairs known as [baggage](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/baggage/api.md). Baggage data will be propagated across services within contexts.
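
As a rough illustration of what propagating baggage involves, the Python sketch below serializes key-value pairs into a comma-separated header value and parses them back. It is a simplification and not the OpenTelemetry-Go propagation package: the helper names `encode_baggage` and `decode_baggage` are hypothetical, the example keys are made up, and optional `;property` metadata on list members is dropped rather than modeled.

```python
"""Toy encoder/decoder for a baggage-style header value (illustrative only)."""
from typing import Dict
from urllib.parse import quote, unquote


def encode_baggage(entries: Dict[str, str]) -> str:
    """Serialize key-value pairs as a comma-separated header value."""
    # Percent-encode values so commas, semicolons and spaces survive transport.
    return ",".join(
        f"{key}={quote(value, safe='')}" for key, value in entries.items()
    )


def decode_baggage(header: str) -> Dict[str, str]:
    """Parse a header value produced by encode_baggage back into a dict."""
    entries: Dict[str, str] = {}
    for member in header.split(","):
        # Drop optional ";property" metadata; this sketch does not model it.
        pair = member.strip().split(";", 1)[0]
        if "=" not in pair:
            continue  # skip empty or malformed list members
        key, value = pair.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


if __name__ == "__main__":
    header = encode_baggage({"node": "worker-1", "note": "a b,c"})
    print(header)
    print(decode_baggage(header))
```

The receiving service would call the decoding side on the incoming header and attach the resulting pairs to its own request context.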
142143

144+
### Connected Traces with Nested Spans
145+
146+
With the initial implementation of this proposal, kubelet tracing produced disconnected spans, because context was not wired through kubelet CRI calls.
147+
With [this PR](https://github.com/kubernetes/kubernetes/pull/113591), context is now plumbed between CRI calls and kubelet.
148+
It is now possible to connect spans for CRI calls. Nested spans with top-level traces in the kubelet will connect CRI calls together.
149+
Nested spans will be created for the following:
150+
* Sync Loops (e.g. syncPod, eviction manager, various gc routines) where the kubelet initiates new work.
151+
* [top-level traces for pod sync and GC](https://github.com/kubernetes/kubernetes/pull/114504)
152+
* Incoming requests (exec, attach, port-forward, metrics endpoints, podresources)
153+
* Outgoing requests (CNI, CSI, device plugin, k8s API calls)
154+
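
The difference between disconnected and connected spans can be sketched with a toy model in Python. This is not the OpenTelemetry API or kubelet code: the `Span` type, `start_span`, and `sync_pod` are hypothetical names, and the span names are only examples. The point is that a child span joins its parent's trace only when the parent context is actually passed along the call.

```python
"""Toy model of connected vs. disconnected spans (illustrative only)."""
import secrets
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Start a span; inherit the trace from parent when one is passed in."""
    if parent is None:
        # No incoming context: this span becomes the root of a new trace.
        return Span(name, secrets.token_hex(16), secrets.token_hex(8))
    # Context was propagated: reuse the trace ID and record the parent span.
    return Span(name, parent.trace_id, secrets.token_hex(8), parent.span_id)


def sync_pod(propagate_context: bool) -> List[Span]:
    """Simulate a sync loop that issues two runtime calls."""
    root = start_span("syncPod")
    parent = root if propagate_context else None
    calls = [start_span(n, parent) for n in ("RunPodSandbox", "CreateContainer")]
    return [root, *calls]


if __name__ == "__main__":
    for propagate in (False, True):
        spans = sync_pod(propagate)
        traces = {s.trace_id for s in spans}
        print(f"propagate={propagate}: {len(spans)} spans in {len(traces)} trace(s)")
```

Without propagation each call starts its own trace, which mirrors the disconnected spans described above; with propagation all three spans share one trace rooted at the sync loop.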
143155
### Running the OpenTelemetry Collector
144156

145157
Although this proposal focuses on running the [OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector), note that any
@@ -187,74 +199,32 @@ type TracingConfiguration struct {
187199

188200
### Test Plan
189201

190-
<!--
191-
**Note:** *Not required until targeted at a release.*
192-
The goal is to ensure that we don't accept enhancements with inadequate testing.
193-
All code is expected to have adequate tests (eventually with coverage
194-
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
195-
when drafting this test plan.
196-
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
197-
-->
198-
199202
[x] I/we understand the owners of the involved components may require updates to
200203
existing tests to make this code solid enough prior to committing the changes necessary
201204
to implement this enhancement.
202205

203206
##### Prerequisite testing updates
204207

205-
<!--
206-
Based on reviewers feedback describe what additional tests need to be added prior
207-
implementing this enhancement to ensure the enhancements have also solid foundations.
208-
-->
209-
210208
An integration test will verify that spans exported by the kubelet match what is
211209
expected from the request. We will also add an integration test that verifies
212210
spans propagated from kubelet to API server match what is expected from the request.
213211

214212
##### Unit tests
215213

216-
<!--
217-
In principle every added code should have complete unit test coverage, so providing
218-
the exact set of tests will not bring additional value.
219-
However, if complete unit test coverage is not possible, explain the reason of it
220-
together with explanation why this is acceptable.
221-
-->
222-
223-
<!--
224-
Additionally, for Alpha try to enumerate the core package you will be touching
225-
to implement this enhancement and provide the current unit coverage for those
226-
in the form of:
227-
- <package>: <date> - <current test coverage>
228-
The data can be easily read from:
229-
https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit
230-
This can inform certain test coverage improvements that we want to do before
231-
extending the production code to implement this enhancement.
232-
-->
233-
234-
- `k8s.io/component-base/traces`: no test grid results - k8s.io/component-base/traces/config_test.go
214+
- https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/validation/validation_test.go#L503-#L532
215+
- https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cri/remote/remote_runtime_test.go#L65-#L97
216+
- https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/options/tracing_test.go
217+
- https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-base/tracing/api/v1/config_test.go
235218

236219
##### Integration tests
237220

238-
<!--
239-
This question should be filled when targeting a release.
240-
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
241-
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
242-
https://storage.googleapis.com/k8s-triage/index.html
243-
-->
244-
245-
An integration test will verify that spans exported by the kubelet match what is
246-
expected from the request. We will also add an integration test that verifies
221+
Integration tests verify that spans exported by the kubelet match what is
222+
expected from the request. Also an integration test that verifies
247223
spans propagated from kubelet to API server match what is expected from the request.
248224

249-
##### e2e tests
225+
- _component-base tracing/api/v1 integration test_ https://github.com/kubernetes/kubernetes/blob/master/test/integration/apiserver/tracing/tracing_test.go
250226

251-
<!--
252-
This question should be filled when targeting a release.
253-
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
254-
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
255-
https://storage.googleapis.com/k8s-triage/index.html
256-
We expect no non-infra related flakes in the last month as a GA graduation criteria.
257-
-->
227+
##### e2e tests
258228

259229
- A test with kubelet-tracing & apiserver-tracing enabled to ensure no issues are introduced, regardless
260230
of whether a tracing backend is configured.
@@ -263,14 +233,22 @@ of whether a tracing backend is configured.
263233

264234
Alpha
265235

266-
- [] Implement tracing of incoming and outgoing gRPC, HTTP requests in the kubelet
267-
- [] Integration testing of tracing
236+
- [X] Implement tracing of incoming and outgoing gRPC, HTTP requests in the kubelet
237+
- [X] Integration testing of tracing
238+
- [X] Unit testing of kubelet tracing and tracing configuration
268239

269240
Beta
270241

271-
- [] Publish examples of how to use the OT Collector with kubernetes
272-
- [] Allow time for feedback
273-
- [] Revisit the format used to export spans.
242+
- [X] OpenTelemetry reaches GA
243+
- [X] Publish examples of how to use the OT Collector with kubernetes
244+
- [X] Allow time for feedback
245+
- [ ] Test and document results of upgrade and rollback while feature-gate is enabled.
246+
- [ ] Add top level traces to connect spans in sync loops, incoming requests, and outgoing requests.
247+
- [ ] Unit/integration test to verify connected traces in kubelet.
248+
- [ ] Revisit the format used to export spans.
249+
- [ ] Parity with the old text-based Traces
250+
- [ ] Connecting traces from container runtimes via the Container Runtime Interface
251+
- https://github.com/kubernetes/kubernetes/pull/114504
274252

275253
GA
276254

@@ -306,7 +284,7 @@ GA
306284
of a node? **No, restarting the kubelet with feature-gate disabled will disable tracing**
307285

308286
##### Does enabling the feature change any default behavior?
309-
No. The feature is disabled unlesss the feature gate is enabled and the TracingConfiguration is populated in Kubelet Configuration.
287+
No. The feature is disabled unless the feature gate is enabled and the TracingConfiguration is populated in Kubelet Configuration.
310288
When the feature is enabled, it doesn't change behavior from the users' perspective; it only adds tracing telemetry.
311289

312290
##### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
@@ -316,8 +294,8 @@ GA
316294
It will start generating and exporting traces again.
317295

318296
##### Are there any tests for feature enablement/disablement?
319-
Unit tests switching feature gates will be added. Manual testing of disabling, reenabling the feature on nodes, ensuring the kubelet comes up w/out error will
320-
also be performed.
297+
Enabling and disabling kubelet tracing is an in-memory switch. Explicit enablement/disablement tests will not provide value so will not be added.
298+
Manual testing of disabling, reenabling the feature on nodes, ensuring the kubelet comes up w/out error will be performed and documented.
321299

322300
### Rollout, Upgrade and Rollback Planning
323301

@@ -328,7 +306,17 @@ _This section must be completed when targeting beta graduation to a release._
328306
No impact to running workloads, logs will indicate the problem.
329307

330308
###### What specific metrics should inform a rollback?
331-
To be determined.
309+
310+
* This KEP is following the [opentelemetry-go issue #2547](https://github.com/open-telemetry/opentelemetry-go/issues/2547).
311+
312+
```
313+
...using the OTLP trace exporter, it isn't currently possible to monitor (with metrics) whether or not spans are being successfully collected and exported.
314+
For example, if my SDK cannot connect to an opentelemetry collector, and isn't able to send traces, I would like to be able to measure how many traces are collected,
315+
vs how many are not sent. I would like to be able to set up SLOs to measure successful trace delivery from my applications.
316+
```
317+
318+
* Pod Lifecycle and Kubelet [SLOs](https://github.com/kubernetes/community/tree/master/sig-scalability/slos) are the signals that should guide a rollback. In particular, the [`kubelet_pod_start_duration_seconds_count`, `kubelet_runtime_operations_errors_total`, and `kubelet_pleg_relist_interval_seconds_bucket`] would surface issues affecting kubelet performance.
319+
332320

333321
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
334322
Upgrades and rollbacks will be tested while feature-gate is experimental
@@ -357,7 +345,7 @@ _This section must be completed when targeting beta graduation to a release._
357345
##### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
358346

359347
- [] Metrics
360-
- Metric name: tbd
348+
- Metric name: tbd [opentelemetry-go issue #2547](https://github.com/open-telemetry/opentelemetry-go/issues/2547)
361349
- Components exposing the metric: kubelet
362350

363351
##### Are there any missing metrics that would be useful to have to improve observability
@@ -442,6 +430,7 @@ _This section must be completed when targeting beta graduation to a release._
442430
- 2022-07-22: KEP merged, targeted at Alpha in 1.24
443431
- 2022-03-29: KEP deemed not ready for Alpha in 1.24
444432
- 2022-06-09: KEP targeted at Alpha in 1.25
433+
- 2023-01-09: KEP targeted at Beta in 1.27
445434

446435
## Drawbacks
447436

keps/sig-instrumentation/2831-kubelet-tracing/kep.yaml

+7 -4
@@ -3,6 +3,7 @@ kep-number: 2831
33
authors:
44
- "@husky-parul"
55
- "@somalley"
6+
- "@dashpole"
67
owning-sig: sig-instrumentation
78
participating-sigs:
89
- sig-architecture
@@ -12,18 +13,20 @@ creation-date: 2021-07-21
1213
reviewers:
1314
- "@dashpole"
1415
- "@ehashman"
16+
- "@wojtek-t"
1517
approvers:
1618
- "@dashpole"
1719
- "@ehashman"
20+
- "@wojtek-t"
1821
see-also:
1922
- "https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/647-apiserver-tracing"
2023
replaces:
21-
stage: alpha
22-
latest-milestone: "v1.25"
24+
stage: beta
25+
latest-milestone: "v1.27"
2326
milestone:
2427
alpha: "v1.25"
25-
beta: "v1.26"
26-
stable: "v1.27"
28+
beta: "v1.27"
29+
stable: "v1.28"
2730
feature-gates:
2831
- name: KubeletTracing
2932
components:
