- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
- (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Currently, for most volume plugins, kubelet applies fsGroup ownership and permission changes by recursively `chown`ing and `chmod`ing the files and directories inside a volume. For certain CSI drivers this may not be possible, because `chown` and `chmod` are Unix primitives that the underlying CSI driver may not support. This enhancement proposes providing the fsGroup to the CSI driver as an explicit field so that the CSI driver can apply it at mount time.
Since some CSI drivers (AzureFile, for example) do not support chmod/chown, we propose that the pod's fsGroup be provided to the CSI driver in the `NodeStageVolume` and `NodePublishVolume` CSI RPC calls. This allows the CSI driver to apply the fsGroup as a mount option during `NodeStageVolume` or `NodePublishVolume`, and frees kubelet from the responsibility of applying a recursive ownership and permission change.
This feature is therefore a prerequisite for CSI migration of the Azure file driver and removal of the Azure cloud provider.
- Allow CSI drivers to mount volumes with the provided `fsGroup`.
- We are not supplying `fsGroup` as a generic ownership and permission handle to the CSI driver. We do not expect CSI drivers to `chown` or `chmod` files.
We are updating the CSI spec by adding an additional field called `volume_mount_group` to the `NodeStageVolume` and `NodePublishVolume` RPC calls. The CSI proposal is available at container-storage-interface/spec#468.
The CSI spec change deliberately avoids asking drivers to use the supplied `fsGroup` as a generic handle for ownership and permissions. The reason is that Kubernetes may expect ownership and permissions in ways that are very platform/OS specific, and we do not think the CSI driver is the right place to enforce all of the different kinds of permissions expected by Kubernetes. The full scope of that discussion is out of scope for this enhancement; interested folks can follow along at container-storage-interface/spec#449.
I am not aware of any associated risks. If a driver cannot support using `fsGroup` as a mount option, it can always use `FileFSGroupPolicy` and let kubelet handle the ownership and permissions.
We are proposing that when kubelet determines a CSI driver has the `VOLUME_MOUNT_GROUP` node capability, kubelet will use the proposed CSI field `volume_mount_group` to pass the pod's `fsGroup` to the CSI driver. Kubelet will expect the driver to use this field to mount the volume with the given `fsGroup`, so that no further permission/ownership change is necessary.
It should be noted that if a CSI driver advertises the `VOLUME_MOUNT_GROUP` node capability, the value defined in `CSIDriver.Spec.FSGroupPolicy` will be ignored and kubelet will always use `fsGroup` as a mount option.
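A minimal sketch of the kubelet-side decision described above, assuming the driver's `VOLUME_MOUNT_GROUP` support has already been discovered (for example via `NodeGetCapabilities`); the function and parameter names are illustrative and do not mirror the actual kubelet code.

```go
package main

import (
	"fmt"
	"strconv"
)

// volumeMountGroupFor is an illustrative helper (not actual kubelet code): if the
// DelegateFSGroupToCSIDriver feature gate is enabled and the driver advertises
// VOLUME_MOUNT_GROUP, the pod's fsGroup is serialized into the CSI
// volume_mount_group field and kubelet skips its recursive ownership/permission
// walk; otherwise it returns "" and kubelet falls back to the existing
// FSGroupPolicy-driven behavior.
func volumeMountGroupFor(featureGateEnabled, driverHasVolumeMountGroup bool, podFSGroup *int64) string {
	if !featureGateEnabled || !driverHasVolumeMountGroup || podFSGroup == nil {
		return ""
	}
	// When this path is taken, CSIDriver.Spec.FSGroupPolicy is ignored.
	return strconv.FormatInt(*podFSGroup, 10)
}

func main() {
	fsGroup := int64(2000)
	fmt.Println(volumeMountGroupFor(true, true, &fsGroup))  // "2000": delegated to the driver
	fmt.Println(volumeMountGroupFor(true, false, &fsGroup)) // "": kubelet applies fsGroup itself
}
```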
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
N/A.
The new code added should have complete test coverage, even though the module overall does not:
- `pkg/volume/csi`: Thu 06 Oct 2022 - 76.2%
No integration tests are required. This feature is better tested with e2e tests.
We already have quite a few e2e tests that verify generic fsGroup functionality for existing drivers - https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/testsuites/fsgroupchangepolicy.go. This should give us reasonable confidence that we won't break any existing drivers.
- [sig-storage] CSI mock volume Delegate FSGroup to CSI driver [LinuxOnly] should pass FSGroup to CSI driver if it is set in pod and driver supports VOLUME_MOUNT_GROUP [Suite:openshift/conformance/parallel] [Suite:k8s]
- [sig-storage] In-tree Volumes [Driver: azure-file] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (Always)[LinuxOnly], pod created with an initial fsgroup, new pod fsgroup applied to volume contents [Suite:openshift/conformance/parallel] [Suite:k8s]
- [sig-storage] In-tree Volumes [Driver: azure-file] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (Always)[LinuxOnly], pod created with an initial fsgroup, volume contents ownership changed via chgrp in first pod, new pod with different fsgroup applied to the volume contents [Suite:openshift/conformance/parallel] [Suite:k8s]
- [sig-storage] In-tree Volumes [Driver: azure-file] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (Always)[LinuxOnly], pod created with an initial fsgroup, volume contents ownership changed via chgrp in first pod, new pod with same fsgroup applied to the volume contents [Suite:openshift/conformance/parallel] [Suite:k8s]
- [sig-storage] In-tree Volumes [Driver: azure-file] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (OnRootMismatch)[LinuxOnly], pod created with an initial fsgroup, new pod fsgroup applied to volume contents [Suite:openshift/conformance/parallel] [Suite:k8s]
- [sig-storage] In-tree Volumes [Driver: azure-file] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (OnRootMismatch)[LinuxOnly], pod created with an initial fsgroup, volume contents ownership changed via chgrp in first pod, new pod with different fsgroup applied to the volume contents [Suite:openshift/conformance/parallel] [Suite:k8s]
- [sig-storage] In-tree Volumes [Driver: azure-file] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (OnRootMismatch)[LinuxOnly], pod created with an initial fsgroup, volume contents ownership changed via chgrp in first pod, new pod with same fsgroup skips ownership changes to the volume contents [Suite:openshift/conformance/parallel] [Suite:k8s]
- [sig-storage] In-tree Volumes [Driver: azure-file] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with mount options [Suite:openshift/conformance/parallel] [Suite:k8s]
- Since this feature is a must-have for azurefile CSI migration, we will perform testing of the driver.
- Currently the CSI spec change is being introduced as an alpha change, and we will work to move the API change in the CSI spec to stable.
- The CSI spec change should be stable.
- Tested via e2e and manually using the azurefile CSI driver.
Currently there is no way to make a volume readable/writable using azurefile and `fsGroup` unless the pod is running as root.
When the feature gate is disabled, kubelet will no longer pass `fsGroup` to CSI drivers, and such volumes will not be readable/writable by the pod; this use case is already broken today anyway.
- How can this feature be enabled / disabled in a live cluster?
  - Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name: DelegateFSGroupToCSIDriver
    - Components depending on the feature gate:
      - Kubelet
  - Other
    - Describe the mechanism:
    - Will enabling / disabling the feature require downtime of the control plane?
    - Will enabling / disabling the feature require downtime or reprovisioning of a node?
- Does enabling the feature change any default behavior?
  Enabling this feature gate could result in `CSIDriver.Spec.FSGroupPolicy` being ignored for a driver that now has the `VOLUME_MOUNT_GROUP` capability. Enabling this feature will cause the volume to be mounted with the pod's `fsGroup` rather than kubelet performing the permission/ownership change. This should result in quicker pod startup but may still surprise some users; this will be covered via release notes. We expect that once a CSI driver accepts the provided `fsGroup` via `volume_mount_group` and the mount operation is successful, the permissions will be correct on the volume. A user may write an additional healthcheck to verify the permissions if necessary.
- Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
  Yes, the feature gate can be disabled once enabled, and kubelet will fall back to its current behavior. Any CSI driver that requires using `fsGroup` as a mount option will then not receive the mount option, in which case mounted volumes will be unwritable by any pods other than those running as root.
- What happens if we reenable the feature if it was previously rolled back?
  It will cause kubelet to use the `volume_mount_group` field of CSI whenever applicable, as discussed in the design above. The pods running on affected nodes have to be restarted for the feature to take effect, though.
- Are there any tests for feature enablement/disablement?
  Not yet.
- How can a rollout fail? Can it impact already running workloads?
  One of the ways the rollout could fail is if a CSI driver declares the `VOLUME_MOUNT_GROUP` capability but does not implement it correctly. This change should not affect running workloads (i.e. running pods). Rolling out the feature, however, may cause group permissions to be applied correctly where they weren't applied before.
- What specific metrics should inform a rollback?
  If, after enabling this feature, a spike in the `storage_operation_status_count{operation_name="volume_mount", status="fail-unknown"}` metric is observed, then the cluster admin should look into identifying the root cause and rolling back the feature.
- Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
  Describe manual testing that was done and the outcomes. Longer term, we may want to require automated upgrade/rollback tests, but we are missing a bunch of machinery and tooling and can't do that now.
- Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
  Even if applying deprecation policies, they may still surprise some users.
This section must be completed when targeting beta graduation to a release.
- How can an operator determine if the feature is in use by workloads?
  The feature is in use if the feature gate `DelegateFSGroupToCSIDriver` is enabled in kubelet and the CSI driver supports the `VOLUME_MOUNT_GROUP` node service capability (a sketch of one way to check the driver capability follows below). We have considered introducing a new metric with a label that identifies which fsGroup logic is used (kubernetes/kubernetes#98667), but because this feature is small and simple enough, the benefit of such a label would be marginal.
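For illustration, the sketch below shows one way an operator or test harness could check whether a driver advertises `VOLUME_MOUNT_GROUP` by calling `NodeGetCapabilities` on the driver's node socket. This is a sketch only; the socket path is an assumption and depends on how the driver is deployed.

```go
package main

import (
	"context"
	"fmt"
	"log"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// Sketch: query a CSI driver's node service and report whether it advertises
// VOLUME_MOUNT_GROUP, i.e. whether kubelet would delegate fsGroup to it when the
// DelegateFSGroupToCSIDriver feature gate is enabled. The socket path below is a
// placeholder; adjust it to the driver's actual deployment.
func main() {
	conn, err := grpc.Dial("unix:///var/lib/kubelet/plugins/example.csi.driver/csi.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	resp, err := csi.NewNodeClient(conn).NodeGetCapabilities(context.Background(), &csi.NodeGetCapabilitiesRequest{})
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range resp.GetCapabilities() {
		if c.GetRpc().GetType() == csi.NodeServiceCapability_RPC_VOLUME_MOUNT_GROUP {
			fmt.Println("driver advertises VOLUME_MOUNT_GROUP")
			return
		}
	}
	fmt.Println("driver does NOT advertise VOLUME_MOUNT_GROUP")
}
```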
- What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  - Metrics
    - Metric name: `csi_operations_seconds`
    - [Optional] Aggregation method: filter by `method_name=NodeStageVolume|NodePublishVolume`, `driver_name` (CSI driver name), `grpc_status_code`.
    - Components exposing the metric: kubelet
  - Other (treat as last resort)
    - Details:

  The `csi_operations_seconds` metric reports a latency histogram of kubelet-initiated CSI gRPC calls by gRPC status code. Filtering by `NodeStageVolume` and `NodePublishVolume` gives us latency data for the respective gRPC calls, which include FSGroup operations for drivers with the `VOLUME_MOUNT_GROUP` capability, but analyzing driver logs is necessary to further isolate a problem to this feature. An SLI isn't necessary for the kubelet logic since it just passes the FSGroup parameter to the CSI driver.
- What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
  For a particular CSI driver, per-day percentage of gRPC calls with `method_name=NodeStageVolume|NodePublishVolume` returning error status codes (as defined by the CSI spec) <= 1%. Latency SLO would be specific to each driver.
- Are there any missing metrics that would be useful to have to improve observability of this feature?
  No.
- Does this feature depend on any specific services running in the cluster?
  This feature depends on the presence of the `fsGroup` field in the Pod and on the CSI driver having the `VOLUME_MOUNT_GROUP` capability (a sketch of a pod that sets `fsGroup` follows this list).
  - [CSI driver]
    - The CSI driver must have the `VOLUME_MOUNT_GROUP` capability:
      - If this capability is not available in the CSI driver, then kubelet will try to use the default mechanism of applying `fsGroup` to the volume (which is basically `chown` and `chmod`). If the underlying driver does not support applying group permissions via `chown` and `chmod`, then Pods will not run correctly.
      - Workloads may not run correctly, or the volume has to be mounted with permissions `777`.
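For reference, a minimal sketch (using `k8s.io/api/core/v1` types) of a pod that sets `fsGroup`; the pod name, image, and claim name are placeholders. With the feature enabled and a `VOLUME_MOUNT_GROUP`-capable driver backing the claim, kubelet would pass `2000` as `volume_mount_group` instead of recursively changing ownership.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// examplePod builds a pod whose fsGroup (2000) would be delegated to a CSI
// driver advertising VOLUME_MOUNT_GROUP. All names are placeholders.
func examplePod() *corev1.Pod {
	fsGroup := int64(2000)
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "fsgroup-demo"},
		Spec: corev1.PodSpec{
			SecurityContext: &corev1.PodSecurityContext{FSGroup: &fsGroup},
			Containers: []corev1.Container{{
				Name:         "app",
				Image:        "registry.k8s.io/pause:3.9",
				VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/data"}},
			}},
			Volumes: []corev1.Volume{{
				Name: "data",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ClaimName: "my-claim"},
				},
			}},
		},
	}
}

func main() {
	fmt.Printf("fsGroup: %d\n", *examplePod().Spec.SecurityContext.FSGroup)
}
```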
For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.
- Will enabling / using this feature result in any new API calls?
  No.
- Will enabling / using this feature result in introducing new API types?
  No.
- Will enabling / using this feature result in any new calls to the cloud provider?
  No.
- Will enabling / using this feature result in increasing size or count of the existing API objects?
  No.
- Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? Think about adding additional work or introducing new steps in between (e.g. need to do X to start a container), etc. Please describe the details.
  This feature does not have any impact on scalability.
- Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? Things to keep in mind include: additional in-memory state, additional non-trivial computations, excessive access to disks (including increased log volume), significant amount of data sent and/or received over network, etc. Think through this both in small and large cases, again with respect to the supported limits.
  Not in Kubernetes components. CSI drivers may vary in their implementation and may increase resource usage.
The Troubleshooting section currently serves the Playbook
role. We may consider
splitting it into a dedicated Playbook
document (potentially with some monitoring
details). For now, we leave it here.
This section must be completed when targeting beta graduation to a release.
- How does this feature react if the API server and/or etcd is unavailable?
  This feature is part of the volume mount path in kubelet, and does not add extra communication with the API server, so this does not introduce new failure modes in the presence of API server or etcd downtime.
- What are other known failure modes? For each of them, fill in the following information by copying the below template:
  - [Failure mode brief description]
    - Detection: How can it be detected via metrics? Stated another way: how can an operator troubleshoot without logging into a master or worker node?
    - Mitigations: What can be done to stop the bleeding, especially for already running user workloads?
    - Diagnostics: What are the useful log messages and their required logging levels that could help debug the issue? Not required until feature graduated to beta.
    - Testing: Are there any tests for failure mode? If not, describe why.

  In addition to existing k8s volume and CSI failure modes:
  - Driver fails to apply FSGroup (due to a driver error).
    - Detection: SLI above, in conjunction with the `DelegateFSGroupToCSIDriver` feature gate and `VOLUME_MOUNT_GROUP` node service capability in the CSI driver to determine if this feature is being used.
    - Mitigations: Revert the CSI driver version to one without the issue, or avoid specifying an FSGroup in the pod's security context, if possible.
    - Diagnostics: Depends on the driver. Generally look for FSGroup-related messages in `NodeStageVolume` and `NodePublishVolume` logs.
    - Testing: Will add an e2e test with a test driver (csi-driver-host-path) simulating an FSGroup failure.
- What steps should be taken if SLOs are not being met to determine the problem?
  The CSI driver log should be inspected to look for `NodeStageVolume` and/or `NodePublishVolume` errors.
- Feb 2 2021: KEP merged into kubernetes/enhancements repo
- Sep 26 2022: Use latest template
- N/A
- An alternative to supplying the pod's fsGroup to the CSI driver is to mount volumes with 777 permissions or to run pods as root. Neither of these alternatives is great, and neither is backward compatible.
- Azure Cloudprovider access