# Resolution
- In upstream Kubernetes, and therefore in OpenShift, two issues can significantly delay container startup when Persistent Volumes are mounted into pods at container creation:
- File permissions relabeling causes a significant delay in container startup.
- SELinux file context relabeling causes a significant delay in container startup.
- The below section describes the current state of resolution for each issue.
- For a deeper understanding of the technical aspect of the issue, proceed to the below "Root Cause" section.
### File Permissions Relabeling
- This issue can be mitigated by setting `fsGroupChangePolicy` to `OnRootMismatch` in the pod's `securityContext`. With this setting, ownership and permissions are re-applied recursively only when the volume's root directory does not already match the expected values, so the full pass is usually skipped, although it is not eliminated completely.
- As an example, the `securityContext` of a pod may include the following setting to avoid this problem:
~~~
securityContext:
  fsGroupChangePolicy: "OnRootMismatch"
~~~
- The volume being mounted into the pod *must* support `fsGroup` permissions functionality, otherwise the above parameter will have no effect.
- A more direct fix, allowing skipping of recursive file permission changes, is being worked on and can be read about in the "Root Cause" section below.
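For context, a minimal sketch of a complete pod manifest using this setting could look like the following. The pod name, image, claim name, and `fsGroup` value are hypothetical placeholders, not values from this article. The sketch assumes an `fsGroup` is set in the pod-level `securityContext`, since `fsGroupChangePolicy` only governs how that group ownership is applied to the volume.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                 # hypothetical pod name
spec:
  securityContext:
    fsGroup: 2000                   # hypothetical group ID applied to the volume
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
  - name: app
    image: registry.example.com/example/app:latest   # hypothetical image
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: example-pvc        # hypothetical claim name
```

On OpenShift, the `fsGroup` value a pod may request depends on the SCC it is admitted under, so the value above should be treated as illustrative only.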
### SELinux File Context Relabeling
- There currently exist two workarounds for skipping the SELinux relabeling for a volume.
- Note that both of these workarounds are implemented at the CRI-O level. A proper fix is being tracked upstream as well as within Red Hat.
- Please contact [Red Hat Technical Support](https://access.redhat.com/support/contact/technicalSupport) for direct assistance with this issue.
- For appropriate links and technical explanations, please refer to the "Root Cause" section below.
#### Skip SELinux Relabeling if `spc_t`
- This approach is the simplest, but it requires the user to have SCC permission to update the SELinux type of the pod. Further, it gives the workload access to any file on the host. As such, it should only be used when the workload is being deployed by a trusted user.
- This approach is the only one that will be made available in 4.7, first appearing in 4.7.37. It is also present in 4.8.16 and 4.9.2.
- To implement this workaround, the user must specify `type: "spc_t"` either on a pod or container `securityContext`:
```
securityContext:
  seLinuxOptions:
    type: "spc_t"
```
If a pod `securityContext` has the type `spc_t` set, then this type will be inherited by containers that have no type specified at all.
- When this option is configured, CRI-O will skip the relabel, leaving the volume's existing labels unchanged.
- `spc_t` is a special SELinux type, standing for "super privileged container type", meaning containers with this type will be able to access anything on the host.
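The inheritance rule above can be illustrated with a short sketch. The pod name, container names, and images are hypothetical, and `container_t` is used only as an example of a different type set at the container level; a container-level `securityContext` takes precedence over the pod-level one.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-spc                 # hypothetical pod name
spec:
  securityContext:
    seLinuxOptions:
      type: "spc_t"                 # pod-level type
  containers:
  - name: inherits-type             # no type set here, so spc_t is inherited
    image: registry.example.com/example/app:latest      # hypothetical image
  - name: overrides-type
    image: registry.example.com/example/sidecar:latest  # hypothetical image
    securityContext:
      seLinuxOptions:
        type: "container_t"         # container-level type overrides the pod-level one
```

In this sketch only the first container runs as `spc_t`, so only that container falls under the relabel skip described above.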
#### Skip SELinux Relabeling if already done with an annotation
- This option is a bit more complex, but is also more secure. It involves adding a custom `RuntimeClass`, which, when configured correctly in CRI-O, can interpret an annotation to skip the relabel if the top-level of the volume is found to have the correct label.
- It requires 4.8.16, 4.9.2, or any 4.10 or later release.
- A drawback of this approach is that the volume will have to be labeled at least once.
- This will be done automatically by CRI-O, but could incur a container creation timeout.
- A consequence of this is that the container processes may fail to access sub-paths of the volume *if they're relabeled*.
- An improvement in the SELinux relabeling code causes the top level of the directory to be labeled last. Thus, assuming another process doesn't attempt to relabel a file in the volume, and assuming CRI-O doesn't crash during the initial relabel, the volume should be accessible to the container after the initial relabel.
1. First, create a MachineConfig that configures CRI-O with a customized runtime class. This runtime class is the same as the default one, except that it sets `allowed_annotations`. The resulting CRI-O configuration file and the MachineConfig that delivers it could look like:
```toml
[crio.runtime.runtimes.selinux]
runtime_path = "/usr/bin/runc"
runtime_root = "/run/runc"
runtime_type = "oci"
allowed_annotations = ["io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel"]
```
```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-selinux-configuration
spec:
  config:
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS5ydW50aW1lcy5zZWxpbnV4XQpydW50aW1lX3BhdGggPSAiL3Vzci9iaW4vcnVuYyIKcnVudGltZV9yb290ID0gIi9ydW4vcnVuYyIKcnVudGltZV90eXBlID0gIm9jaSIKYWxsb3dlZF9hbm5vdGF0aW9ucyA9IFsiaW8ua3ViZXJuZXRlcy5jcmktby5UcnlTa2lwVm9sdW1lU0VMaW51eExhYmVsIl0K
        mode: 750
        overwrite: true
        path: /etc/crio/crio.conf.d/01-selinux.conf
  osImageURL: ""
```
2. Next, a RuntimeClass must be created in the API server. The name should match that described in the CRI-O config above:
```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: selinux
handler: selinux
```
3. Finally, the pod must carry the annotation in its metadata and reference the runtime class in its spec:
```
apiVersion: v1
kind: Pod
metadata:
  name: sandbox
  annotations:
    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: "true"
  ...
spec:
  runtimeClassName: selinux
  ...
```
- Now, when the pod is created, CRI-O will run it with the RuntimeClass `selinux`, which is configured to be allowed to process the annotation `io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel`. This pod has the value "true" for `io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel`, which means the SELinux relabel will be skipped *if the volume is already correctly labeled*.
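When pods are created through a controller such as a Deployment rather than directly, the annotation and runtime class belong on the pod template, because annotations on the Deployment's own metadata are not copied to its pods. A minimal sketch follows; the Deployment name, labels, image, and claim name are hypothetical placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sandbox                     # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sandbox
  template:
    metadata:
      labels:
        app: sandbox
      annotations:
        # Must be set on the pod template so it reaches the pod
        io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: "true"
    spec:
      runtimeClassName: selinux     # matches the RuntimeClass created in step 2
      containers:
      - name: app
        image: registry.example.com/example/app:latest   # hypothetical image
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: example-pvc    # hypothetical claim name
```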
# Root Cause
There are two distinct issues described in this article. The cause for each is discussed below.
## Recursive File Permissions Delays
- When a pod is created and requires a volume to be mounted, the node's `kubelet` process performs a recursive file permission change across the entire volume, causing significant delays in pod creation.
- This issue, while not technically "solved", can be worked around by instructing the `kubelet` to perform the bare minimum of file system permissions changes by implementing the workaround mentioned in the above "Resolution" section.
- This [Kubernetes Enhancement Proposal](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/695-skip-permission-change) discusses a more long-term fix to completely avoid permissions checking by the `kubelet`.
- Work is ongoing upstream to include this in Kubernetes. Provided the solution is accepted by the community, it should be rebased into the Red Hat OpenShift Container Platform product.
- Please contact Red Hat Support if you believe you are experiencing this issue.
## Recursive SELinux File Context Delays
- When a pod is created and requires a volume to be mounted, the container runtime (either Docker or CRI-O on OpenShift nodes depending on version) is instructed by the node's `kubelet` process upon pod creation to relabel the entire volume with proper SELinux contexts.
- There exist some ways to work around this. By default, volume SELinux context relabeling happens for every volume on container startup. It can cause significant delays when the volume has many files and directories, because the procedure has to walk the entire volume recursively. However, if a pod is configured to have the `spc_t` type, or is correctly configured to have `io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel` and the volume is already correctly labeled, then the relabel will be skipped.
- A [Kubernetes Enhancement Proposal](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling) exists, with work from Red Hat within it, attempting to provide a method in a future version of Kubernetes (and therefore OpenShift) to avoid recursive relabeling without requiring such invasive or insecure changes to the pod spec.
- The official technical solution will require an API change, and is likely to be very complex, and therefore has no targeted release upstream at this time.
- Discussion of relying on the upcoming feature `fsGroupChangePolicy` has occurred [within this GitHub pull request](https://github.com/kubernetes/kubernetes/pull/96376) and can be read about within [this KEP update](https://github.com/kubernetes/enhancements/pull/1879), but it is subject to change.
- Please contact Red Hat Support if you believe you are experiencing this issue.