@@ -660,8 +662,22 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
660
662
- The Feature is implemented behind `WatchList` feature flag
661
663
- Initial e2e tests completed and enabled
662
664
- Scalability/Performance tests confirm gains of this feature
665
+
- Add support for watchlist to APF
666
+
667
+
#### Beta
663
668
- Metrics are added to the kube-apiserver (see the [monitoring-requirements](#monitoring-requirements) section for more details)
664
669
- Implement `SendInitialEvents` for `watch` requests in the etcd storage implementation
670
+
- The feature is enabled for kube-apiserver and kube-controller-manager
671
+
- The generic feature gate mechanism is implemented in client-go.
672
+
It will be used to enable new functionality for reflectors/informers.
673
+
- Implement a consistency check detector that will compare data received through a new watchlist request
674
+
with data obtained through a standard list request. The detector will be added to the reflector
675
+
and activated when an environment variable is set. The environment variable will be set for all jobs run in the Kube CI.
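The consistency check described above can be pictured roughly as follows. This is only a minimal sketch, not the detector that will be added to the reflector: the `item` type, the `sameItems` helper, and the environment variable name `EXAMPLE_WATCHLIST_CONSISTENCY_CHECK` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"sort"
)

// item is a hypothetical, simplified stand-in for an API object.
type item struct {
	Name            string
	ResourceVersion string
}

// detectorEnabled reports whether the check is switched on.
// The variable name is an assumption made for this sketch.
func detectorEnabled() bool {
	return os.Getenv("EXAMPLE_WATCHLIST_CONSISTENCY_CHECK") == "true"
}

// sameItems reports whether both slices hold the same objects, ignoring order.
// It assumes object names are unique within each slice.
func sameItems(fromWatchList, fromList []item) bool {
	if len(fromWatchList) != len(fromList) {
		return false
	}
	// Copy before sorting so the callers' slices are left untouched.
	a := append([]item(nil), fromWatchList...)
	b := append([]item(nil), fromList...)
	sort.Slice(a, func(i, j int) bool { return a[i].Name < a[j].Name })
	sort.Slice(b, func(i, j int) bool { return b[i].Name < b[j].Name })
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	viaWatchList := []item{{"b", "12"}, {"a", "10"}}
	viaList := []item{{"a", "10"}, {"b", "12"}}
	if detectorEnabled() && !sameItems(viaWatchList, viaList) {
		panic("data from watchlist differs from data from list")
	}
	fmt.Println("consistent:", sameItems(viaWatchList, viaList))
}
```

Gating the comparison behind an environment variable keeps the extra LIST request out of production paths while still letting CI jobs exercise it.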
676
+
677
+
#### GA
678
+
- Consider using WatchProgressRequester to request progress notifications directly from etcd.
679
+
This mechanism was developed in [Consistent Reads from Cache KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/2340-Consistent-reads-from-cache#use-requestprogress-to-enable-automatic-watch-updates)
680
+
and could reduce the overall latency for watchlist requests.
665
681
666
682
<!--
667
683
**Note:** *Not required until targeted at a release.*
@@ -745,9 +761,9 @@ components? What are the guarantees? Make sure this is in the test plan.
745
761
746
762
Consider the following in developing a version skew strategy for this
747
763
enhancement:
748
-
- Does this enhancement involve coordinating behavior in the control plane and
749
-
in the kubelet? How does an n-2 kubelet without this feature available behave
750
-
when this feature is used?
764
+
- Does this enhancement involve coordinating behavior in the control plane and nodes?
765
+
- How does an n-3 kubelet or kube-proxy without this feature available behave when this feature is used?
766
+
- How does an n-1 kube-controller-manager or kube-scheduler without this feature available behave when this feature is used?
751
767
- Will any other components on the node change? For example, changes to CSI,
752
768
CRI or CNI may require updating that component before the kubelet.
753
769
-->
@@ -797,23 +813,33 @@ Pick one of these and delete the rest.
797
813
798
814
-[x] Feature gate (also fill in values in `kep.yaml`)
799
815
- Feature gate name: WatchList
800
-
- Components depending on the feature gate: the kube-apiserver
816
+
- Components depending on the feature gate:
817
+
- kube-apiserver
818
+
- Feature gate name: WatchListClient (the actual name might be different because it hasn't been added yet)
819
+
- Components depending on the feature gate:
820
+
- kube-controller-manager via client-go library
801
821
-[ ] Other
802
822
- Describe the mechanism:
803
823
- Will enabling / disabling the feature require downtime of the control
804
-
plane?
824
+
plane?
805
825
- Will enabling / disabling the feature require downtime or reprovisioning
806
826
of a node?
807
827
808
828
###### Does enabling the feature change any default behavior?
809
-
No.
829
+
No, because users must explicitly enable the feature on the client side (client-go).
810
830
<!--
811
831
Any change of default behavior may be surprising to users or break existing
812
832
automations, so be extremely careful here.
813
833
-->
814
834
815
835
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
816
-
Yes, in that scenario the kube-apiserver will reject WATCH requests with the new query parameter forcing informers to fall back to the previous mode.
836
+
Yes, by disabling the `WatchList` FeatureGate for `kube-apiserver`.
837
+
In this case `kube-apiserver` will reject WATCH requests with the new query parameter, forcing informers to fall back to the previous mode.
838
+
839
+
Yes, by disabling the `WatchListClient` FeatureGate for `kube-controller-manager`.
840
+
In this case informers will follow standard LIST/WATCH semantics.
841
+
842
+
Note that for safety reasons, reflectors/informers will always fall back to a regular LIST operation regardless of the error that occurred.
817
843
<!--
818
844
Describe the consequences on existing workloads (e.g., if this is a runtime
819
845
feature, can it break the existing applications?).
@@ -825,7 +851,8 @@ NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
825
851
The expected behavior of the feature will be restored.
826
852
827
853
###### Are there any tests for feature enablement/disablement?
828
-
No.
854
+
Yes. There is [an integration test](https://github.com/kubernetes/kubernetes/pull/120971) that verifies the fallback mechanism
855
+
of the reflector when interacting with servers that have the `WatchList` feature enabled/disabled.
829
856
<!--
830
857
The e2e framework does not currently support enabling or disabling feature
831
858
gates. However, unit tests in each component dealing with managing data, created
@@ -839,7 +866,13 @@ conversion tests if API types are being modified.
839
866
This section must be completed when targeting beta to a release.
840
867
-->
841
868
###### How can a rollout or rollback fail? Can it impact already running workloads?
869
+
The feature does not have a direct impact on rollout/rollback.
842
870
871
+
However, faulty behavior of the feature can result in incorrect functioning
872
+
of components that rely on that feature. For the Beta version, we plan to enable it exclusively for kube-controller-manager.
873
+
The main issues can arise during the initial informer synchronization, which may result in controller failures.
874
+
875
+
Furthermore, if data consistency issues arise, such as missing data, the controllers will simply not act on the missing data.
843
876
<!--
844
877
Try to be as paranoid as possible - e.g., what if some components will restart
845
878
mid-rollout?
@@ -852,21 +885,154 @@ will rollout across nodes.
852
885
853
886
###### What specific metrics should inform a rollback?
854
887
888
+
`apiserver_terminated_watchers_total` - a large number of terminated watchers might indicate synchronization issues.
889
+
For example, there may be a client-side error where the client is not receiving data from the server, or a server-side error where the buffer is filling up.
890
+
891
+
`apiserver_request_duration_seconds_bucket` - in general, a large number of "short" watch requests can indicate synchronization issues.
892
+
893
+
`apiserver_watch_list_duration_seconds` - the absence of this metric may indicate that the client did not receive a special bookmark.
894
+
The issue here could be that the server never sent it due to an error or didn't even receive it from the database.
895
+
896
+
`apiserver_watch_list_duration_seconds` - long synchronization times may indicate that the server is lagging behind etcd.
897
+
For example, the server may not be receiving progress notifications from the database frequently enough.
898
+
899
+
`apiserver_watch_cache_lag` - tells how far behind the server is compared to the database.
900
+
Significant discrepancies affect the times for full data synchronization.
901
+
902
+
A good metric can also be the number of kube-controller-manager restarts.
903
+
Frequent restarts may indicate issues with informer synchronization.
904
+
855
905
<!--
856
906
What signals should users be paying attention to when the feature is young
857
907
that might indicate a serious problem?
858
908
-->
859
909
860
910
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
911
+
Upgrade->downgrade->upgrade testing was done manually using the following steps:
912
+
913
+
Build and run Kubernetes from the master branch using Kind.
Check if the `kube-apiserver` (aka `kas`) has recorded the watchlist latency metric.
925
+
```
926
+
kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"
927
+
# HELP apiserver_watch_list_duration_seconds [ALPHA] Response latency distribution in seconds for watch list requests broken by group, version, resource and scope.
928
+
# TYPE apiserver_watch_list_duration_seconds histogram
Check if the `kas` has recorded the watchlist latency metric.
1021
+
```
1022
+
kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"
1023
+
# HELP apiserver_watch_list_duration_seconds [ALPHA] Response latency distribution in seconds for watch list requests broken by group, version, resource and scope.
1024
+
# TYPE apiserver_watch_list_duration_seconds histogram
Describe manual testing that was done and the outcomes.
864
1030
Longer term, we may want to require automated upgrade/rollback tests, but we
865
1031
are missing a bunch of machinery and tooling and can't do that now.
866
1032
-->
867
1033
868
1034
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
869
-
1035
+
No.
870
1036
<!--
871
1037
Even if applying deprecation policies, they may still surprise some users.
872
1038
-->
@@ -878,6 +1044,7 @@ This section must be completed when targeting beta to a release.
878
1044
-->
879
1045
880
1046
###### How can an operator determine if the feature is in use by workloads?
1047
+
If the `apiserver_watch_list_duration_seconds` metric has recorded data, then this feature is in use.
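As a rough illustration, the check can be automated by scanning the text returned by the `/metrics` endpoint. The sketch below is an assumption about how one might do that, not part of the KEP: `hasSamples` is a hypothetical helper, and the sample input, including its label values, is made up.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hasSamples reports whether the Prometheus text exposition in metrics
// contains at least one `<name>_count` sample line, which a histogram
// exposes once it has recorded observations.
func hasSamples(metrics, name string) bool {
	sc := bufio.NewScanner(strings.NewReader(metrics))
	for sc.Scan() {
		line := sc.Text()
		// Skip HELP/TYPE comments: they are present even without samples.
		if strings.HasPrefix(line, "#") {
			continue
		}
		rest, ok := strings.CutPrefix(line, name+"_count")
		if ok && (strings.HasPrefix(rest, "{") || strings.HasPrefix(rest, " ")) {
			return true
		}
	}
	return false
}

func main() {
	// Made-up sample output; the label values are illustrative only.
	sample := `# TYPE apiserver_watch_list_duration_seconds histogram
apiserver_watch_list_duration_seconds_count{group="",resource="pods",scope="cluster",version="v1"} 3
`
	fmt.Println(hasSamples(sample, "apiserver_watch_list_duration_seconds"))
}
```

The metrics text could come from `kubectl get --raw '/metrics'`, which is the command used elsewhere in this document.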
881
1048
882
1049
<!--
883
1050
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
@@ -887,6 +1054,15 @@ logs or events for this purpose.
887
1054
888
1055
###### How can someone using this feature know that it is working for their instance?
889
1056
1057
+
Assuming that historical data is available, comparing the number of LIST and WATCH requests to the server will tell whether the feature was enabled.
1058
+
When this feature is enabled, the number of LIST requests will be smaller.
1059
+
The difference primarily arises from switching informers to a new mode of operation.
1060
+
1061
+
Check whether the `WatchListClient` FeatureGate has been set for the given component.
1062
+
1063
+
If the `username` of a component is known, the audit logs can be examined to see whether `sendInitialEvents=true` has been set in the `requestURI` for that user.
1064
+
1065
+
Scan the component's logs for the phrase `Reflector WatchList`. Traces are reported for requests lasting more than 10 seconds.
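The audit-log check mentioned above could be scripted along these lines. This is a minimal sketch: `usesWatchList` is a hypothetical helper, the example URIs are made up, and it assumes watch requests are expressed with the `watch=true` query parameter.

```go
package main

import (
	"fmt"
	"net/url"
)

// usesWatchList reports whether a requestURI taken from an audit event is a
// watch request that asked the server to send initial events.
func usesWatchList(requestURI string) bool {
	u, err := url.ParseRequestURI(requestURI)
	if err != nil {
		return false
	}
	q := u.Query()
	return q.Get("watch") == "true" && q.Get("sendInitialEvents") == "true"
}

func main() {
	// Illustrative URIs only; real audit entries may carry more parameters.
	fmt.Println(usesWatchList("/api/v1/pods?allowWatchBookmarks=true&sendInitialEvents=true&watch=true"))
	fmt.Println(usesWatchList("/api/v1/pods?limit=500&resourceVersion=0"))
}
```

Parsing the query string rather than searching for a substring avoids false positives such as `sendInitialEvents=false` or a parameter value that merely contains the text.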
890
1066
<!--
891
1067
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
892
1068
for each individual pod.
@@ -905,6 +1081,7 @@ Recall that end users cannot usually observe component logs or access metrics.
905
1081
- Details:
906
1082
907
1083
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
1084
+
None have been defined yet.
908
1085
909
1086
<!--
910
1087
This is your opportunity to define what "normal" quality of service looks like
@@ -928,16 +1105,15 @@ Pick one more of these and delete the rest.
928
1105
-->
929
1106
930
1107
-[ ] Metrics
931
-
- Metric name: apiserver_cache_watcher_buffer_length (histogram, what was the buffer size)
932
-
- Metric name: apiserver_watch_cache_lag (histogram, for how far the cache is behind the expected RV)
933
1108
- Metric name: apiserver_terminated_watchers_total (counter, already defined, needs to be updated (by an attribute) so that we count closed watch requests due to an overfull buffer in the new mode)
1109
+
- Metric name: apiserver_watch_list_duration_seconds (histogram, measures latency of watch-list requests)
934
1110
-[Optional] Aggregation method:
935
1111
- Components exposing the metric:
936
1112
-[ ] Other (treat as last resort)
937
1113
- Details:
938
1114
939
1115
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
940
-
1116
+
No.
941
1117
<!--
942
1118
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
943
1119
implementation difficulties, etc.).
@@ -950,7 +1126,7 @@ This section must be completed when targeting beta to a release.
950
1126
-->
951
1127
952
1128
###### Does this feature depend on any specific services running in the cluster?
953
-
1129
+
No.
954
1130
<!--
955
1131
Think about both cluster-level services (e.g. metrics-server) as well
956
1132
as node-level agents (e.g. specific version of CRI). Focus on external or
@@ -1067,8 +1243,18 @@ details). For now, we leave it here.
1067
1243
1068
1244
###### How does this feature react if the API server and/or etcd is unavailable?
1069
1245
1070
-
###### What are other known failure modes?
1246
+
When the kube-apiserver is unavailable, this feature is also unavailable.
1071
1247
1248
+
When etcd is unavailable, requests attempting to retrieve the most recent state of the cluster will fail.
1249
+
1250
+
###### What are other known failure modes?
1251
+
- kube-controller-manager is unable to start.
1252
+
- Detection: How can it be detected via metrics? Examine the prometheus `up` time series or examine the pod status or the number of restarts.
1253
+
- Mitigations: What can be done to stop the bleeding, especially for already
1254
+
running user workloads? Disable the feature by passing `WatchList=false` to the `feature-gates` command-line flag.
1255
+
- Diagnostics: What are the useful log messages and their required logging
1256
+
levels that could help debug the issue? N/A
1257
+
- Testing: Are there any tests for failure mode? If not, describe why. Yes, if kube-controller-manager is unable to start then a lot of existing e2e tests will fail.
1072
1258
<!--
1073
1259
For each of them, fill in the following information by copying the below template:
1074
1260
- [Failure mode brief description]
@@ -1083,6 +1269,7 @@ For each of them, fill in the following information by copying the below templat
1083
1269
-->
1084
1270
1085
1271
###### What steps should be taken if SLOs are not being met to determine the problem?