
Advanced Static Volume Provisioning with IBM Spectrum Scale on Red Hat OpenShift

By GERO SCHMIDT posted Sat April 30, 2022 10:23 AM

  

Abstract

As a clustered parallel file system, IBM Spectrum Scale can provide a global namespace for all your data, supporting POSIX file access as well as access through various protocols like NFS, SMB, Object, and HDFS. This enables data ingest from a variety of data sources and provides data access to different analytics and big data platforms like High Performance Computing (via POSIX direct file access), Hadoop (via HDFS transparency) and OpenShift (via IBM Spectrum Scale CNSA and IBM Spectrum Scale Container Storage Interface Driver / CSI) without the need to duplicate or copy data from one storage silo to another. This reduces costs (no waste of storage capacity on duplicate data) and time to insights (no waiting for data to be copied).

In OpenShift/Kubernetes, persistent storage for containerized applications is consumed through persistent volumes and persistent volume claims. A storage provider in OpenShift typically creates a new and empty persistent volume (PV) in response to a persistent volume claim (PVC) through dynamic provisioning using storage classes. But what about providing and sharing access to pre-existing data in IBM Spectrum Scale and making the data available to containerized applications running in OpenShift - without the need to copy and duplicate the data?

In this article we take a closer look at how containerized applications in OpenShift can consume persistent storage in IBM Spectrum Scale. Specifically, we explore how we can leverage static volume provisioning to provide and share access to pre-existing data in IBM Spectrum Scale across OpenShift users and namespaces. We discuss some of the necessary considerations that need to be taken into account with regard to controlling the binding of PVs to PVCs (by using Kubernetes labels and claimRef) as well as using a proper container security context and custom SCCs (Security Context Constraints) with regard to user ID (uid), group ID (gid) and, especially, SELinux MCS labels for safely sharing data access across namespaces. Furthermore, the article also briefly touches on the general impact of SELinux relabeling on persistent volumes with large amounts of files and discusses options to disable SELinux relabeling on selected pods/containers using SELinux type "spc_t".

Note 1: The article was originally based on Red Hat OpenShift 4.9.22 with IBM Spectrum Scale Container Native Storage Access (CNSA) v5.1.2 and IBM Spectrum Scale CSI Driver 2.4.0.

Note 2: Starting with IBM Spectrum Scale CNSA 5.1.7.0 (released on March 16, 2023) the SELinux default behavior has changed. IBM Spectrum Scale CNSA 5.1.7.0 will now - by default - mount the file system with a container permissive SELinux context by setting the mount context of the file system to system_u:object_r:container_file_t:s0. All files inside the file system will be considered to have the SELinux context defined on the file system mount, and all containers running as the container_t SELinux type will be allowed to access files on the file system if permitted by standard file permissions. See IBM Spectrum Scale container native and SELinux for more details.

Disclaimer: Please note that the following statements and examples in this article are no formal support statements by IBM. This article is an exploratory journey to discuss the different ways of volume provisioning in OpenShift/Kubernetes with the IBM Spectrum Scale CSI Driver and how to use standard Kubernetes methodologies to create statically provisioned PVs and bind them to specific PVCs with the goal of providing and sharing access to existing data hosted in IBM Spectrum Scale. Kubernetes and OpenShift are quickly evolving projects, so options and behaviors may change and the IBM Spectrum Scale CSI Driver might need to adapt accordingly. So always make sure to test the proposed options carefully and, when necessary, obtain a proper support statement from IBM for future directions where needed. The article is based on behaviors observed in Red Hat OpenShift 4.9.22 with IBM Spectrum Scale Container Native Storage Access (CNSA) v5.1.2 and IBM Spectrum Scale CSI Driver 2.4.0. The author and IBM assume no responsibility or liability for any errors or omissions, or for any results obtained from the use of the information provided in the content of this article. The information is provided on an "as is" basis with no guarantees of completeness, accuracy, usefulness or fitness for a particular purpose.


A pdf version of this blog post for download is available at: Advanced Static Volume Provisioning with IBM Spectrum Scale on Red Hat OpenShift (pdf)

Basic concepts of volume provisioning in OpenShift/Kubernetes

In OpenShift and Kubernetes the fundamental ways of provisioning persistent storage in the form of persistent volumes (PVs) to your containerized applications are:

  1. Dynamic provisioning of volumes through a storage class, and
  2. Static provisioning of volumes through the cluster administrator.

With dynamic provisioning the cluster admin only needs to create a storage class, and the corresponding PV and backing directory in IBM Spectrum Scale are automatically created (provisioned) by the CSI driver and bound to the originating persistent volume claim (PVC) and OpenShift namespace (also referred to as project in OpenShift) on demand in a self-service fashion. The PVC can then be used in all pods within that namespace to provide persistent storage to the containerized applications. However, dynamic provisioning provides a fresh and empty volume for the user to start with. Therefore, it does not provide access to existing data.

Static provisioning, on the other hand, requires the cluster admin to manually create each persistent volume (PV) that can later be claimed by users through persistent volume claims (PVCs). A user's PVC will be bound to a PV based on a best match with regard to requested storage size and access mode. Here, the cluster admin typically would create a whole pool of static PVs backed by pre-created empty directories in IBM Spectrum Scale to ensure that each user will get an empty PV that is bound to the user's PVC request and namespace on demand. By default, users in any namespace would be able to claim any PV from the pool that is a best match to the requested storage size and access mode in the PVC request. 

However, static provisioning also allows the cluster admin to create PVs which are actually backed by non-empty directories in IBM Spectrum Scale with existing data. For example, one could think of directories that contain huge amounts of data, such as training data sets for Deep Learning (DL) projects, which we want to share and make available to a larger group of users in OpenShift. So these users can develop, train and run their own AI/DL models without the need to copy and duplicate the data. Here we can leverage static provisioning to make existing data in IBM Spectrum Scale available to a specific user, multiple users, or multiple namespaces. The first problem to solve is to ensure that a user can selectively request a specific PV from the pool of static PVs that actually holds the data that the user is interested in and that the binding of any PV to the PVC request does not happen by chance. We introduce methods of how we can control the binding between a statically provisioned PV and a user's PVC request in the section Advanced static volume provisioning below. Before we dive deeper into static volume provisioning we first take a quick look at dynamic provisioning as its configuration (e.g. with or without a default storage class) directly affects the general behavior of static provisioning. 

Dynamic volume provisioning

Dynamic volume provisioning was promoted to stable in the Kubernetes 1.6 release (see Dynamic Provisioning and Storage Classes in Kubernetes) and is the preferred way of providing persistent storage to users in OpenShift/Kubernetes today. Prior to the availability of dynamic provisioning a cluster admin had to manually pre-provision the storage and create the persistent volume (PV) objects for the users' persistent volume claims (PVCs). With dynamic provisioning the cluster admin only needs to create a storage class (or multiple storage classes) and a new empty volume is automatically provisioned and created on demand for each user's PVC request. Storage classes use provisioners that are specific to the individual storage backend; in this article we focus on the IBM Spectrum Scale Container Storage Interface Driver or, in short, the IBM Spectrum Scale CSI Driver.

IBM Spectrum Scale CSI Driver v2.5.0 supports different storage classes for creating

  • lightweight volumes (backed by directories in IBM Spectrum Scale),
  • fileset-based volumes (backed by independent/dependent filesets in IBM Spectrum Scale),
  • consistency group volumes (backed by dependent filesets jointly embedded in an independent fileset for consistent snapshots of all contained volumes).

Please refer to the Storage class section in the IBM Spectrum Scale CSI Driver documentation for more details about these different storage classes. Here we will briefly look at the full deployment cycle for using a storage class for lightweight volumes.

The cluster admin only needs to create a storage class. An example for a storage class for lightweight volumes is given below:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ibm-spectrum-scale-light-sc
provisioner: spectrumscale.csi.ibm.com
parameters:
  volBackendFs: "fs1"
  volDirBasePath: "pvc-volumes"
reclaimPolicy: Delete

All PVs provisioned from this storage class will be located in individual directories located under [mount point fs1]/pvc-volumes/ in IBM Spectrum Scale. The target directory pvc-volumes in the IBM Spectrum Scale file system fs1 must exist prior to creating the storage class.
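
A minimal sketch of this preparation, assuming the file system fs1 is mounted at /mnt/fs1 on a node where you have shell access and the storage class manifest above is saved as ibm-spectrum-scale-light-sc.yaml (the file name is just an example):

# mkdir /mnt/fs1/pvc-volumes                      (create the volDirBasePath target directory in fs1)
# oc apply -f ibm-spectrum-scale-light-sc.yaml
storageclass.storage.k8s.io/ibm-spectrum-scale-light-sc created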

A user can now issue a persistent volume claim (PVC) against this storage class ibm-spectrum-scale-light-sc as shown below:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ibm-spectrum-scale-pvc
spec:
  storageClassName: ibm-spectrum-scale-light-sc
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi

The IBM Spectrum Scale CSI driver will automatically create a new directory, here pvc-01a53f89-8d14-4862-abe0-98fe6fe57dfc, in the IBM Spectrum Scale file system fs1 under the mount path /mnt/fs1/pvc-volumes/, create a new PV object backed by this directory, and bind this PV to the user's PVC request in the user's namespace (note that a PVC is always a namespaced object):

# oc get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
ibm-spectrum-scale-pvc Bound pvc-01a53f89-8d14-4862-abe0-98fe6fe57dfc 10Gi RWX ibm-spectrum-scale-light-sc 5s

# oc get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS AGE
pvc-01a53f89-8d14-4862-abe0-98fe6fe57dfc 10Gi RWX Delete Bound user-namespace/ibm-spectrum-scale-pvc ibm-spectrum-scale-light-sc 3s

# ls -al /mnt/fs1/pvc-volumes/
drwxrwx--x. 2 root root 4096 Mar 29 18:26 pvc-01a53f89-8d14-4862-abe0-98fe6fe57dfc

The user can mount and use this PVC in all pods in the user's namespace as shown below, simply by referring to the PVC name, here ibm-spectrum-scale-pvc:

apiVersion: v1
kind: Pod
metadata:
  name: ibm-spectrum-scale-test-pod
spec:
  containers:
  - name: ibm-spectrum-scale-test-pod
    image: registry.access.redhat.com/ubi8/ubi-minimal:latest
    command: [ "/bin/sh" ]
    args: [ "-c","while true; do echo $(hostname) $(date +%Y%m%d-%H:%M:%S) | tee -a /data/stream1.out ; sleep 5 ; done;" ]
    volumeMounts:
    - name: vol1
      mountPath: "/data"
  volumes:
  - name: vol1
    persistentVolumeClaim:
      claimName: ibm-spectrum-scale-pvc

In this example, the PVC/PV will be mounted under the local path /data in the pod's container. Its data is located in IBM Spectrum Scale at /mnt/fs1/pvc-volumes/pvc-01a53f89-8d14-4862-abe0-98fe6fe57dfc, with /mnt/fs1 being the mount point of the IBM Spectrum Scale file system fs1 on the OpenShift cluster nodes.
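
To verify that the data written by the pod actually lands in the backing directory in IBM Spectrum Scale, you can compare the file from inside the container and from a node where fs1 is mounted (a quick sketch, output is illustrative):

# oc exec ibm-spectrum-scale-test-pod -- tail -1 /data/stream1.out
ibm-spectrum-scale-test-pod 20220330-09:15:27

# tail -1 /mnt/fs1/pvc-volumes/pvc-01a53f89-8d14-4862-abe0-98fe6fe57dfc/stream1.out
ibm-spectrum-scale-test-pod 20220330-09:15:27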

The default storage class

The cluster admin can define multiple storage classes if needed and mark one of the storage classes as default storage class. In this case any persistent volume claim (PVC) that does not explicitly request a storage class through the storageClassName (i.e. this line is omitted in the PVC manifest) will be provisioned by the default storage class.

An existing storage class can be marked as default storage class as follows:

# oc patch storageclass ibm-spectrum-scale-light-sc -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# oc get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
ibm-spectrum-scale-light-sc (default) spectrumscale.csi.ibm.com Delete Immediate false 5d18h
ibm-spectrum-scale-sample spectrumscale.csi.ibm.com Delete Immediate false 23d

You can unmark the default storage class with

# oc patch storageclass ibm-spectrum-scale-light-sc -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'

Note that the behavior of static or dynamic provisioning for persistent volume claims (PVCs) may change in the presence of a default storage class!

With no default storage class defined:

  • A PVC with no or an empty storageClassName uses static provisioning and is matched with available PVs from the pool of statically provisioned volumes.
  • A PVC with a provided storageClassName uses dynamic provisioning with the specified storage class.

With a default storage class defined:

  • A PVC with no storageClassName uses dynamic provisioning with the default storage class.
  • A PVC with an empty ("") storageClassName uses static provisioning and is matched against available PVs from the pool of statically provisioned volumes.
  • A PVC with a provided storageClassName uses dynamic provisioning with the specified storage class.

We will extensively make use of the empty ("") storageClassName in a PVC in the following sections to make sure that we explicitly request a statically provisioned volume even in the presence of a default storage class.

Static volume provisioning

Static provisioning was the way of providing persistent storage in Kubernetes before dynamic provisioning became generally available. Today, you would typically use dynamic provisioning for the provisioning of new volumes to users. However, static provisioning offers ways of providing and sharing access to specific directories and thus existing data in IBM Spectrum Scale as we will explore in the coming sections. In order to better understand how static provisioning generally works we will briefly walk through the involved steps in the following paragraph.

With static provisioning the cluster admin would need to manually provision the storage (i.e. creating a directory in IBM Spectrum Scale) and create the persistent volume (PV) object as follows:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv01
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: spectrumscale.csi.ibm.com
    volumeHandle: "835838342966509310;099B6A7A:5EB99721;path=/mnt/fs1/data/pv01"

The volumeHandle given in this example is the original volumeHandle as used up to IBM Spectrum Scale CSI Driver v2.1.0. It is still compatible with IBM Spectrum Scale CSI Driver v2.5.0 for static volumes. The following parameters need to be provided in the volumeHandle:

  • 835838342966509310 is the clusterID of the local (primary) IBM Spectrum Scale CNSA cluster
  • 099B6A7A:5EB99721 is the file system ID of the IBM Spectrum Scale file system
  • /mnt/fs1/data/pv01 is the local path in CNSA (i.e. on the OpenShift nodes) to the backing directory in the specified IBM Spectrum Scale file system.

Ensure that the backing directory /mnt/fs1/data/pv01 (or any subdirectory within) has the proper POSIX user (uid) and group (gid) file access permissions set in the IBM Spectrum Scale file system so that the intended scope of read/write/execute (rwx) access to the PV for the user process in the container is granted. For example, a regular non-privileged user in OpenShift, who is running under the restricted SCC (Security Context Constraints) policy, is assigned an arbitrary uid and gid 0 (root) for the user process in a container by default when accessing the file system in the PV. In this case you may want to set the file access permissions to "rwx" for the root group (gid 0) on the backing directory in IBM Spectrum Scale to grant full read/write access to the static PV, e.g.

 drwxrwxr-x. root root /mnt/fs1/data/pv01
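
A minimal sketch of preparing such a backing directory on a node where the file system fs1 is mounted (using the path and permissions from the example above):

# mkdir -p /mnt/fs1/data/pv01
# chgrp 0 /mnt/fs1/data/pv01        (root group, gid 0, the default gid of pods running under the restricted SCC)
# chmod 775 /mnt/fs1/data/pv01      (results in drwxrwxr-x, i.e. rwx for owner and group, r-x for others)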

Please refer to the section User ID (uid) / group ID (gid) security context for more information on this topic.

The cluster admin would typically create a pool of pre-provisioned PVs (e.g., pv01, pv02, ...) each backed by pre-provisioned empty directories in IBM Spectrum Scale to ensure that each user obtains an empty PV bound to the user's PVC request. The IBM Spectrum Scale CSI Driver provides a script to help with the generation of static PVs, see Generating static provisioning manifests.  
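
As an alternative to the script, a simple hedged sketch of how an admin could generate such a pool of PVs in a loop is shown below (assuming the backing directories /mnt/fs1/data/pv01 to /mnt/fs1/data/pv03 already exist and reusing the volumeHandle values from the example above):

for i in 01 02 03; do
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv${i}
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: spectrumscale.csi.ibm.com
    volumeHandle: "835838342966509310;099B6A7A:5EB99721;path=/mnt/fs1/data/pv${i}"
EOF
done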

The PV that was created in the previous step can be claimed by any user through a regular persistent volume claim (PVC) such as

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ibm-spectrum-scale-pvc
spec:
  storageClassName: ""
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi

Typically the storageClassName: "" line could be omitted if no default storage class is present in the OpenShift cluster. We explicitly use an empty ("") storageClassName here in our manifest to ensure that we always skip dynamic provisioning, especially in the presence of a default storage class. A PVC with storageClassName: "" is always interpreted as requesting a PV without dynamic provisioning through a storage class (and the associated storage provider).

In general, the PVC will be bound to any available PV from the pool of pre-provisioned PVs based on a best match with regard to requested storage size and access mode. This means that a PV with a larger capacity (e.g. 100Gi instead of a requested 10Gi) and a broader access mode (i.e. RWX instead of the requested RWO) may be matched to a PVC that requests less capacity and a narrower access mode. If no matching volume exists, the claim remains unbound indefinitely; following the declarative paradigm of Kubernetes/OpenShift, it will be bound as soon as a matching volume becomes available. Refer to Binding for more details on the binding between a PV and a PVC.

The access modes in Kubernetes are:

  • ReadWriteOnce (RWO): the volume can be mounted as read-write by a single node. ReadWriteOnce access mode still can allow multiple pods to access the volume when the pods are running on the same node.
  • ReadOnlyMany (ROX): the volume can be mounted as read-only by many nodes.
  • ReadWriteMany (RWX): the volume can be mounted as read-write by many nodes.
  • ReadWriteOncePod (RWOP): the volume can be mounted as read-write by a single pod. Use ReadWriteOncePod access mode if you want to ensure that only one pod across the whole cluster can access this PVC. This is only supported for CSI volumes and Kubernetes version 1.22+.

Note that although the access mode appears to be controlling access to the volume, it is actually used similarly to labels to match a PVC to a proper PV dependent on what the resource provider supports - there are currently no access rules enforced based on the selected accessModes. See Access Modes for more information.

After the PV is bound to a PVC in a user namespace it can be consumed by all pods in that namespace simply referencing the PVC name in the volumes section of the pod manifest as shown in the example above. In our example, the user will obtain a new and empty PV from the pre-provisioned pool of available PVs that best matches the requested criteria and that is exclusively bound to the user's PVC request in the user's namespace. Note that a PVC is a namespaced object, which means it is namespace-bound in contrast to a PV. Once a PVC is deleted the associated static PV is released based on its reclaim policy (see Reclaiming). The default reclaim policy (persistentVolumeReclaimPolicy) is Retain which means that the PV still exists but is in a released state (not in an available state) so it cannot be claimed and bound to another PVC request. The cluster admin needs to manually decide what to do with the released PV and reclaim it (e.g. delete and recreate the PV, delete the user data, etc.).
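
For example, one common way (a sketch, not a formal recommendation) to make such a released static PV claimable again without recreating it - after deciding that the existing data may be reused or has been cleaned up - is to remove the stale claimRef binding from the PV (output is illustrative):

# oc patch pv pv01 --type=json -p='[{"op":"remove","path":"/spec/claimRef"}]'
persistentvolume/pv01 patched

# oc get pv pv01
NAME   CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   AGE
pv01   1Gi        RWX            Retain           Available                          2d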

The IBM Spectrum Scale CSI Driver volumeHandle

In our example and throughout this article we use the original volumeHandle as it was used up to IBM Spectrum Scale CSI Driver v2.1.0, see Creating a persistent volume (PV):

  csi:                                                            # IBM Spectrum Scale CSI Driver v2.1.0
    driver: spectrumscale.csi.ibm.com
    volumeHandle: "835838342966509310;099B6A7A:5EB99721;path=/mnt/fs1/data/pv01"

It is still compatible with IBM Spectrum Scale CSI Driver v2.5.0 for static volumes although the volumeHandle itself has changed with release v2.5.0 as follows:

  csi:                                                            # IBM Spectrum Scale CSI Driver v2.5.0
    driver: spectrumscale.csi.ibm.com
    volumeHandle: "0;0;835838342966509310;099B6A7A:5EB99721;;;/mnt/fs1/data/pv01"

The new volumeHandle for IBM Spectrum Scale CSI Driver v2.5.0 has introduced additional fields as described in Creating a persistent volume (PV) for v2.5.0 with 0;[Volume type];[Cluster ID];[Filesystem UUID];;[Fileset name];[Path to the directory or fileset linkpath]. For statically provisioned PVs the 1st field is "0" and the 5th field is always empty. Volume type is "0" for directory based volumes, "1" for dependent fileset based volumes and "2" for independent fileset based volumes. For directory based volumes, fileset name is always empty.

In any case, when defining a static PV with IBM Spectrum Scale CSI Driver we require specific information from IBM Spectrum Scale for the volumeHandle:

    volumeHandle: "835838342966509310;099B6A7A:5EB99721;path=/mnt/fs1/data/pv01"
[local cluster ID];[file system UID];[local path to directory]

(1) First we need the cluster ID of the local (primary) IBM Spectrum Scale CNSA cluster that is running on OpenShift. This can be retrieved from any of the IBM Spectrum Scale CNSA core pods (here we pick the pod worker1a in the ibm-spectrum-scale namespace) by executing the mmlscluster command as follows:

# oc exec worker1a  -n ibm-spectrum-scale -- mmlscluster -Y | grep clusterSummary | tail -1 | cut -d':' -f8
Defaulted container "gpfs" out of: gpfs, logs, mmbuildgpl (init), config (init)
835838342966509310

Alternatively, you can also retrieve this information from the IBM Spectrum Scale CSI custom resource (CR) csiscaleoperators.csi.ibm.com in the ibm-spectrum-scale-csi namespace as follows:

# oc get csiscaleoperators.csi.ibm.com ibm-spectrum-scale-csi -n ibm-spectrum-scale-csi -o yaml | grep -A5 " id:"
  - id: "835838342966509310"
    primary:
      inodeLimit: ""
      primaryFs: fs1
      primaryFset: primary-fileset-fs1-835838342966509310
      remoteCluster: "215057217487177715"
--
  - id: "215057217487177715"
    primary:
      inodeLimit: ""
      primaryFs: ""
      primaryFset: ""
      remoteCluster: ""

It will be the entry that has the primaryFs defined (i.e. non-empty), here, primaryFs: fs1, with fs1 being the local IBM Spectrum Scale file system name in the local IBM Spectrum Scale CNSA cluster on OpenShift. Note that IBM Spectrum Scale CNSA will use the local file system name (fs1) and local cluster ID (835838342966509310) also as part of the default name for the primary fileset that will be created and used by the IBM Spectrum Scale CSI Driver, here primary-fileset-fs1-835838342966509310. So you might also be able to tell these parameters from the primary fileset name even on the remote storage cluster:

# mmlsfileset fs1 -L
Filesets in file system 'fs1':
Name Id RootInode ParentId Created InodeSpace MaxInodes AllocInodes Comment
root 0 3 -- Mon May 11 20:19:22 2020 0 15490304 500736 root fileset
primary-fileset-fs1-835838342966509310 8 2621443 0 Fri Mar 11 17:59:18 2022 5 1048576 52224 Fileset created by IBM Container Storage Interface driver

(2) The second parameter that we need is the UID of the IBM Spectrum Scale file system where our target directory for the PV object will be located. We can obtain the file system UID from any of the IBM Spectrum Scale CNSA core pods (here we pick the pod worker1a in the ibm-spectrum-scale namespace) by executing the mmlsfs fs1 --uid command as follows:

# oc exec worker1a -n ibm-spectrum-scale -- mmlsfs fs1 --uid
Defaulted container "gpfs" out of: gpfs, logs, mmbuildgpl (init), config (init)
flag value description
------------------- ------------------------ -----------------------------------
--uid 099B6A7A:5EB99721 File system UID

In this example, our local file system fs1 is remotely mounted from an IBM Spectrum Scale storage cluster, for example, an IBM Elastic Storage Server (ESS), as we can see with the mmremotefs command:

# oc exec worker1a -- mmremotefs show all
Defaulted container "gpfs" out of: gpfs, logs, mmbuildgpl (init), config (init)
Local Name Remote Name Cluster name Mount Point Mount Options Automount Drive Priority
fs1 ess3000_1M ess3000.bda.scale.ibm.com /mnt/fs1 rw yes - 0

We can identify the local path /mnt/fs1 where the file system is mounted on all participating OpenShift worker nodes and which we will need in the next step.
Note that if the local file system fs1 is a remote mount of a remote file system (here ess3000_1M) on a remote storage cluster, then both have the same UID, i.e. running mmlsfs ess3000_1M --uid on the storage cluster will return the same file system UID.
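
If in doubt, the local mount point can also be read directly from the file system configuration with mmlsfs -T (output shortened):

# oc exec worker1a -n ibm-spectrum-scale -- mmlsfs fs1 -T
flag                value                    description
------------------- ------------------------ -----------------------------------
 -T                 /mnt/fs1                 Default mount point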

(3) The third parameter is the full local path to the destination or target directory in IBM Spectrum Scale that we want to use as backing directory for the PV. This path consists of two parts: the local mount point of the file system (/mnt/fs1) and the actual target directory (/data/pv01) within the file system where all the data of the PV will be located. So the complete local path will be /mnt/fs1/data/pv01.

For more details about the usual options for static provisioning with IBM Spectrum Scale CSI driver please refer to Static provisioning in the IBM Spectrum Scale Container Storage Interface Driver documentation.

Advanced static volume provisioning

Now that we understand the workflow for static provisioning with IBM Spectrum Scale CSI Driver, let's take a look at some use cases where static provisioning can help. While we would preferably use dynamic provisioning to provide new (and empty) volumes to users, we can make use of static provisioning if we want to provide or share access to existing data in IBM Spectrum Scale, especially, if we want to use IBM Spectrum Scale as a "Data Lake" for a variety of applications and across different data analytics platforms and architectures without the need to copy and duplicate all the data.

Static provisioning offers ways of providing and sharing access to specific directories and existing data in IBM Spectrum Scale. These could be selected directories that only specific users (i.e. OpenShift namespaces/projects) are allowed to access or shared directories where multiple users should be able to easily claim access to.

For selected directories accessed only by specific users (i.e. user namespaces) we need to make sure that these PVs can only be bound to specific PVC requests in specific namespaces and not by any PVC in any namespace. For this use case we will describe how to work with static PVs using the claimRef option.

Shared directories, on the other hand, can be directories where huge amounts of data are stored, for example, data to train and run Machine Learning (ML) / Deep Learning (DL) models, and where multiple users should be able to easily request access by simply claiming a static PV from a pre-provisioned pool. Here, the use of regular Kubernetes labels attached to the pre-provisioned static PVs can help to characterize the data (e.g. type: training, dept: ds) and would generally offer a way to allow users to use selectors with matchLabels in their PVC manifests to ensure that only selected PVs with the data of interest bind to the user's PVC request.

Other use cases may include advanced features of IBM Spectrum Scale, like Active File Management (AFM), where data from a home location (AFM Home cluster) is made available on a remote edge location (AFM Cache cluster). Here, static provisioning can also be used to make data in AFM filesets available to containerized applications in OpenShift.

Advanced static volume provisioning using labels

First, we will look at a use case with shared directories with existing data that we want to make available to multiple user namespaces in OpenShift through statically provisioned PVs. Here, users shall be able to simply claim such a PV from a pre-provisioned pool and also be able to select between different kinds of PVs based on labels that characterize the data behind the PV. Kubernetes labels are key/value pairs that are attached to objects and can be used to select objects based on their labels through label selectors (see Labels and Selectors). 

Let's assume we want to share access to a specific data directory in IBM Spectrum Scale to a group of data scientists. This directory holds huge amounts of data to be processed either for training new models or for applying models and making proper classifications and predictions. Each data scientist works in a private namespace/project on OpenShift. The local path to the destination directory is /mnt/fs1/training-data. And we will characterize the data through two labels, type: training and dept: ds.

The cluster admin would need to manually prepare a pool of PVs (each with a unique PV name like train-pv01, train-pv02, train-pv03,...) with the following persistent volume (PV) manifest:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: train-pv01
  labels:
    type: training
    dept: ds
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: spectrumscale.csi.ibm.com
    volumeHandle: "835838342966509310;099B6A7A:5EB99721;path=/mnt/fs1/training-data"

The PVs use the labels type: training and dept: ds so a PVC from a user can claim a specific volume from all available PVs characterized specifically by these labels. The labels should characterize the data behind the PV and its backing directory.

Any user can now claim a PV from this pool and ensure to get a PV with the data that the user is actually interested in by using a selector with the respective labels in a corresponding persistent volume claim (PVC) as follows:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc
spec:
  storageClassName: ""
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  selector:
    matchLabels:
      type: training
      dept: ds

We explicitly use an empty storageClassName: "" here in order to ensure that we always skip dynamic provisioning with a default storage class (in case a default storage class is present). Otherwise the PVC will get stuck in a pending state when the default storage class is invoked with a selector in the manifest:

Events:
Type Reason Message
------- ------ -------
Warning ProvisioningFailed failed to provision volume with StorageClass "ibm-spectrum-scale-light-sc": claim Selector is not supported

A PVC with a non-empty selector is not supposed to have a PV dynamically provisioned for it (see Persistent Volumes). The PVC above will be bound to any available PV from the pool of pre-provisioned PVs that is available and matches the specific labels given under matchLabels in the selector section. In addition to the labels the requested storage capacity as well as the accessModes are still taken into account in the overall matching criteria.

Once the PV is bound to the user's PVC it can be used in a pod in the same way as provided in the pod example above (in the Dynamic volume provisioning section) simply by referencing the PVC name in the volumes section of the pod manifest. The PV is bound to the PVC in the user's namespace and no longer available to other users in other namespaces. Other users in other namespaces can issue an identical PVC request which will bind to another PV from the pool matching the requested labels (if any such PV is available). So if multiple users in different namespaces need access to the same data directory in IBM Spectrum Scale, the admin would need to create multiple identical PVs with the same labels and same volumeHandle but each with its own unique PV name.
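
For example, the cluster admin can quickly check how many PVs exist for this data set and which of them are already bound by filtering on the labels (the namespace shown in the CLAIM column is just illustrative):

# oc get pv -l type=training,dept=ds
NAME         CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                        STORAGECLASS   AGE
train-pv01   1Gi        RWX            Retain           Bound       ds-user1/training-data-pvc                  8m
train-pv02   1Gi        RWX            Retain           Available                                               8m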

IMPORTANT: When using multiple static PVs to share data access to the exact same backing directory in IBM Spectrum Scale across different users or multiple namespaces then additional measures need to be taken to ensure that identical SELinux MCS labels (container security context) are used on the data. Otherwise different users may accidentally access the data with different SELinux MCS labels and lock out other users from their access to the data (due to SELinux relabeling). This also applies to nested backing directories. Please refer to the section OpenShift SELinux and user (uid) / group (gid) security context below for important additional considerations when sharing access to the same data in IBM Spectrum Scale across users and namespaces in OpenShift!

A variant of the above approach is that the cluster admin can also create static PVs associated with a fake (i.e. a non-existent) storageClassName of their own, such as "static" as we use in the example below:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: train-pv01
  labels:
    type: training
    dept: ds
spec:
  storageClassName: static
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: spectrumscale.csi.ibm.com
    volumeHandle: "835838342966509310;099B6A7A:5EB99721;path=/mnt/fs1/training-data"

In addition to the labels type: training and dept: ds the user's PVC request now also would need to reference this specific storageClassName with storageClassName: static rather than referencing the empty ("") storage class (to skip dynamic provisioning if a default storage class is defined):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc
spec:
  storageClassName: static
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  selector:
    matchLabels:
      type: training
      dept: ds

This would help to improve overall volume management as all statically provisioned PVs can now - in addition to the labels - also be associated and grouped with their own storage class (although there is no provisioner and no real storage class associated with it). Associating these static volumes with their own storageClassName not only skips dynamic provisioning through a default storage class, it also allows the use of storage resource quotas (see section Storage Resource Quota) to restrict access to these volumes by number or capacity per namespace. The OpenShift documentation provides a similar example for using a storageClassName in a manually created PV in Persistent storage using hostPath.

Note that the chosen storageClassName for the static PVs must be different from any existing real storage class that is or will be present in the OpenShift cluster!

Preventing access by namespace

Associating the created PVs with their own storageClassName allows us to make use of Storage Resource Quota in order to limit access to these statically provisioned PVs based on their associated storageClassName. For example, by applying the following ResourceQuota manifest with an allowed maximum number of 0 (zero) persistent volume claims from the storage class static we would ensure that a user in the namespace dean cannot claim any PVs associated with the storageClassName static:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
  namespace: dean
spec:
  hard:
    static.storageclass.storage.k8s.io/persistentvolumeclaims: 0

Claiming any PVs from the pool which are associated with the storage class static is now prevented for any PVC request in the namespace dean:

Error from server (Forbidden): error when creating "pvc01.yaml": persistentvolumeclaims "pvc01" is forbidden: exceeded quota: storage-quota, requested: static.storageclass.storage.k8s.io/persistentvolumeclaims=1, used: static.storageclass.storage.k8s.io/persistentvolumeclaims=0, limited: static.storageclass.storage.k8s.io/persistentvolumeclaims=0

For more information about the use of storage resource quota and the available options please refer to the respective Storage Resource Quota section below.

Enforcing read-only access

In other cases you may, for example, want to ensure that multiple users can access the shared data for read but shall not be able to write or change the data. In this case you might think of the admin creating the PVs with the additional readOnly CSI option as described in Kubernetes Volumes - Out-of-tree volume plugins - CSI:

apiVersion: v1
kind: PersistentVolume
spec:
  [...]
  csi:
    driver: spectrumscale.csi.ibm.com
    volumeHandle: "ClusterId;FSID;path=/gpfs/fs0/data"
    readOnly: true

Unfortunately, the CSI readOnly flag does not seem to be properly honored as of today by the Container Storage Interface (CSI) and therefore it is not yet implemented in IBM Spectrum Scale CSI Driver. See Kubernetes issues 61008 and 70505.

Of course, a user can always use the readOnly flag in the volumeMounts section of the pod manifest to ensure that a volume is mounted in read-only mode inside the container to prevent any changes to the data in that volume:

  spec:
    containers:
    - name: ibm-spectrum-scale-test-pod
      image: registry.access.redhat.com/ubi8/ubi-minimal:latest
      volumeMounts:
      - name: vol1
        mountPath: "/data"
        readOnly: true

However, this is not an option for an admin to generally protect the shared data from any changes by regular users as it would be up to the individual user to actually honor the request to mount the shared volume in read-only mode (i.e. trust the user to do the right thing). It might be an option for selected automated workloads started by admins, workload schedulers, or other privileged users.

Other options to ensure read-only access on shared data in IBM Spectrum Scale would be to carefully work with POSIX file permissions (uid, gid, mode bits), ACLs, or the immutability options in IBM Spectrum Scale as described in Immutability and appendOnly features or Creating immutable filesets and files.
For example, the storage admin could set the immutable flag in IBM Spectrum Scale on file level by running

# mmchattr -i yes my_immutable_file

# mmlsattr -L my_immutable_file
file name: my_immutable_file
metadata replication: 1 max 2
data replication: 1 max 2
immutable: yes
appendOnly: no
flags:
storage pool name: system
fileset name: root
snapshot name:
creation time: Tue Apr 12 18:09:10 2022
Misc attributes: ARCHIVE READONLY
Encrypted: no

# ls -alZ my_immutable_file
-rw-rw-rw-. 1 1000680000 root system_u:object_r:container_file_t:s0:c15,c26 10133 Apr 14 10:24 my_immutable_file

Setting the immutable flag in IBM Spectrum Scale would prevent any changes to the file when accessed from containerized applications in OpenShift:

$ oc rsh test-pod
sh-4.4$ id
uid=1000680000(1000680000) gid=0(root) groups=0(root),1000680000

sh-4.4$ ls -alZ /data/my_immutable_file
-rw-rw-rw-. 1 1000680000 root system_u:object_r:container_file_t:s0:c15,c26 10133 Apr 14 10:24 /data/my_immutable_file

sh-4.4$ echo xxxxxx >> /data/my_immutable_file
sh: /data/my_immutable_file: Read-only file system

Depending on when the immutable flag was applied, i.e. before or after the PV was mounted in a container, the error message in the container on an attempt to write to an immutable file shows either "Read-only file system" or "Permission denied", respectively. Note that once a file or directory in the backing directory of a statically provisioned PV is set to immutable, the SELinux labeling for the entire PV may need to be applied manually (see section SELinux relabeling).

Advanced static volume provisioning using claimRef

In this section we look at selected directories in IBM Spectrum Scale that we want to make available only to specific users or, more precisely, to specific namespaces. So we need to make sure that these PVs can only be bound to the specific namespace and not be claimed by any other user in any other namespace.

Instead of using labels as before we will now make use of the claimRef option as described in Reserving a PersistentVolume. By using claimRef we can declare (and enforce) a bi-directional binding between the statically provisioned PV and a PVC based on the PVC name and its originating namespace. Therefore, we do not need to make use of ResourceQuota to control which namespace can or cannot consume the static PVs. Although the volume binding with claimRef happens based on the PVC name and namespace, the control plane still verifies that storage class, access modes, and requested storage size are valid and meet the matching criteria for the PVC request (e.g. the specified capacity in the PV manifest must be equal to or larger than the requested size in the PVC to achieve a successful binding; it does not necessarily need to reflect the exact size of the volume, although matching sizes may help manageability). You also need to reference an empty storage class (storageClassName: "") in the persistent volume claim (PVC) to ensure that dynamic volume provisioning from the default storage class is skipped.

Let's assume we want to provide access to a selected directory in IBM Spectrum Scale with confidential business data only to the financial department which runs their analytic applications in a specific namespace called finance in OpenShift. On a user's request, the cluster admin would prepare a persistent volume (PV) with the following manifest:

# cat create-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: business-pv01
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  claimRef:
    name: business-data-pvc
    namespace: finance
  csi:
    driver: spectrumscale.csi.ibm.com
    volumeHandle: "835838342966509310;099B6A7A:5EB99721;path=/mnt/fs1/business-data"

A user in the finance namespace can now easily claim the pre-provisioned persistent volume (PV) through a persistent volume claim (PVC) by using the exact same PVC name business-data-pvc as specified in the claimRef section:

# cat create-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: business-data-pvc
spec:
  storageClassName: ""
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi

By using claimRef we can declare a 1:1 bidirectional binding between a PV and a PVC and make sure that this PV will only bind to a PVC request in the finance namespace with the PVC name business-data-pvc. This way we can safely provide access to the confidential directory /mnt/fs1/business-data in IBM Spectrum Scale to the selected namespace. The binding with claimRef happens regardless of other volume matching criteria. However, the OpenShift/Kubernetes control plane still checks that storage class, access modes, and requested storage size are valid.

The cluster admin can also issue the PV creation request (oc apply -f create-pv.yaml) immediately followed by the PVC request (oc apply -f create-pvc.yaml -n finance) and verify the proper binding (oc get pvc -n finance) instead of waiting for the user to properly issue the PVC request with the specified name in the specified namespace and complete the binding.
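
A minimal sketch of this admin-driven sequence (output shortened and illustrative):

# oc apply -f create-pv.yaml
persistentvolume/business-pv01 created

# oc apply -f create-pvc.yaml -n finance
persistentvolumeclaim/business-data-pvc created

# oc get pvc -n finance
NAME                STATUS   VOLUME          CAPACITY   ACCESS MODES   STORAGECLASS   AGE
business-data-pvc   Bound    business-pv01   1Gi        RWX                           4s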

Only with claimRef can we ensure that the created PV is bound to the corresponding persistent volume claim (PVC) and namespace and that it is not bound to any other pending PVC request from users in other namespaces that would meet the regular volume matching criteria.

In case the finance department in our example has multiple namespaces where access to the same data is needed, the cluster administrator could prepare another PV with a similar manifest but with a different PV name (business-pv02) and a different target namespace in the claimRef section, as sketched below. In this case we would share access to the same data across more than one namespace, so the cluster admin would have to create multiple PVs, i.e. one PV per target namespace, as each PV binds to a PVC from another namespace. Note that only one PV per namespace is needed as the PVC can be used in all pods within the same namespace.
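
As a sketch, such a second PV for an additional namespace (the namespace name finance-reporting is purely hypothetical) would only differ in the PV name and the claimRef target:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: business-pv02
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  claimRef:
    name: business-data-pvc
    namespace: finance-reporting
  csi:
    driver: spectrumscale.csi.ibm.com
    volumeHandle: "835838342966509310;099B6A7A:5EB99721;path=/mnt/fs1/business-data"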

IMPORTANT: When using multiple static PVs to share data access to the exact same backing directory in IBM Spectrum Scale across different users or multiple namespaces then additional measures need to be taken to ensure that identical SELinux MCS labels (container security context) are used on the data. Otherwise different users may accidentally access the data with different SELinux MCS labels and lock out other users from their access to the data (due to SELinux relabeling). This also applies to nested backing directories. Please refer to the section OpenShift SELinux and user (uid) / group (gid) security context below for important additional considerations when sharing access to the same data in IBM Spectrum Scale across users and namespaces in OpenShift!

Storage Resource Quota

Storage resource quotas allow limiting the total sum of storage resources that can be consumed in a given namespace, including the number of persistent volume claims (PVCs). The consumption of these storage resources can even be limited selectively based on associated storage classes which, of course, primarily aims at dynamic provisioning. However, as we showed in the Preventing access by namespace paragraph above with static PVs using labels, this can be used to exclude selected user namespaces from being able to claim any statically provisioned PVs associated with a specific storageClassName (i.e. a non-existing "fake" storage class). Resource quotas do not need to be applied with static PVs using claimRef, which already ensures a 1:1 binding to a specific namespace and persistent volume claim (PVC) only.

Here is a more general example of setting storage resource quota for the namespace dean:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
  namespace: dean
spec:
  hard:
    requests.storage: 500Gi
    persistentvolumeclaims: 10
    static.storageclass.storage.k8s.io/persistentvolumeclaims: 0
    ibm-spectrum-scale-sc.storageclass.storage.k8s.io/requests.storage: 100Gi
    ibm-spectrum-scale-sc.storageclass.storage.k8s.io/persistentvolumeclaims: 5

With this quota, the user in the namespace "dean" cannot claim any PVs from the pool of static PVs associated with the storage class static, can only have a maximum of 10 persistent volume claims (PVCs), and can claim a maximum storage capacity of 500Gi in total. Furthermore, the user can only consume a maximum of 5 persistent volume claims (PVCs) and a maximum storage capacity of 100Gi from the dynamic storage class ibm-spectrum-scale-sc. For more details about storage resource quotas please refer to Storage Resource Quota.

Summary of volume provisioning use cases

Typical use cases for the discussed volume provisioning methodologies above may be summarized as follows:

  • Dynamic provisioning with storage classes is the default method for providing new volumes to users in OpenShift on demand. The persistent volume (PV) is automatically created, provisioned, and bound to the persistent volume claim (PVC) of a user in a self-service fashion. The cluster admin only needs to create the storage class(es).
  • Basic static provisioning (i.e. without labels or claimRef) is generally not the preferred way of providing new volumes to users in OpenShift as this can more conveniently be achieved with dynamic provisioning and storage classes. It can be applied if an admin needs to use dedicated paths and directories in IBM Spectrum Scale for persistent volumes (PVs) in OpenShift. In this case the admin would need to manually provision new directories in IBM Spectrum Scale and create the related PVs - each backed by its own empty directory or fileset. Typically the admin would provide a pool of PVs backed by their own empty directories in IBM Spectrum Scale so that users can claim these PVs through persistent volume claims (PVCs) on demand. However, the admin has no control over which of the pre-provisioned PVs binds to a given PVC request or namespace. A PV will be bound to any PVC from any namespace that best meets the volume matching criteria: storage size and access mode. Therefore it is only an option for empty directories when providing new and empty volumes to users or projects. Using dynamic provisioning with a storage class for creating lightweight volumes still might be a better alternative to consider here.
  • Advanced static provisioning with labels can be used to provide access to existing data in IBM Spectrum Scale to multiple users on demand in a self-service fashion. The labels help to characterize the data behind a PV so that users can selectively request different PVs representing different data and backing directories in IBM Spectrum Scale through a regular persistent volume claim (PVC) with the proper selector. A PV is bound to the PVC based on the requested selector labels (matchLabels) and access modes. Each static PV needs to be created manually by the cluster admin in advance. Typically the admin would create a pool of pre-provisioned PVs so that users can claim these on demand. However, the admin has no control over which user or namespace can claim these PVs and potentially gain access to the data in the backing directory behind the PV. A PV will be bound to any PVC request from any user and any namespace that best meets the volume matching criteria and labels. Within a given namespace a PVC (exclusively bound to a PV) can be used in multiple pods across nodes with RWX (ReadWriteMany) access mode. If the same data is to be accessed in multiple namespaces then the admin would need to create one statically provisioned PV per target namespace and also ensure that correct SELinux MCS labels are used on access.
    • Use of storageClassName: ""
      In order to ensure that we always skip dynamic provisioning with the default storage class (if one is defined in the cluster) the user needs to refer to an empty storageClassName: "" in the PVC manifest. Otherwise a new (and empty) PV would be automatically provisioned by the default storage class and bound to the PVC request. 
    • Use of storageClassName: "static"
      Associating the statically provisioned PVs with a "fake" storageClassName in their PV manifests, e.g. here we use storageClassName: static, can provide additional ease of use. The user can now simply refer to a storage class like "static" (the same way as the user is used to for dynamic provisioning) even for statically provisioned PVs. This also improves overall volume management for the admin as all statically provisioned PVs can now be associated with their own storage class (although there is no provisioner and no real storage class related to it). Such a PV is bound to a PVC based on the selector labels (matchLabels), access mode, and storage class. A big advantage is that it allows the use of ResourceQuota in order to control which namespace can actually claim a statically provisioned PV associated with a specific storageClassName. Note that the chosen storageClassName for the static PVs must be different from any existing real storage class that is or will be present in the OpenShift cluster!
  • Advanced static provisioning with claimRef can be used to provide access to selected data in IBM Spectrum Scale to specific user namespaces in a controlled fashion. With claimRef we can ensure a 1:1 binding between a statically provisioned PV and a persistent volume claim (PVC) from a selected namespace. Only the PVC with the name and namespace as specified in the claimRef section of the PV manifest can bind to the PV. With claimRef we do not need to make use of ResourceQuota to control which namespace can or cannot consume the static PVs. Only by using claimRef can we make sure that a specific PV is actually bound to a specific PVC request and user namespace. In all other static volume provisioning cases without claimRef any persistent volume claim (PVC) from any user in any namespace that meets the volume matching criteria can bind to the PV once it is created by the admin.
    This level of control over the namespace (project) makes static provisioning with claimRef the preferred choice for sharing access to the same data in IBM Spectrum Scale on OpenShift (in contrast to using static PVs with labels) because the cluster admin can ensure that only selected namespaces which can apply the proper security context on their pods can bind to the PV and provide access to the shared data. For regular, non-privileged users the security context is typically determined by the default attributes of the namespace and the available service accounts within a namespace.
    Although the volume binding with claimRef happens regardless of other volume matching criteria (including the specified storage class in the PVC) it is generally a good idea to reference an empty storage class (storageClassName: "") in the persistent volume claim (PVC) to ensure that a PVC will not accidentally bind to a newly provisioned volume from the default storage class in case the static PV has not yet been provisioned by the admin, there is a typo in the PVC request name, or the PVC request is being made from the wrong namespace. Each static PV needs to be created manually by the cluster admin in advance. Within a given namespace a PVC (exclusively bound to a PV) can be used in multiple pods across nodes with RWX (ReadWriteMany) access mode.  If the same data is to be accessed in multiple namespaces then the admin would need to create at least one statically provisioned PV per target namespace and also ensure that correct SELinux MCS labels are used on access. 
    With claimRef we reserve a statically provisioned PV for a selected PVC request in a selected namespace so we have full control over the binding of a PV to a PVC from a given namespace. In addition, we can also apply regular Kubernetes labels to the static PVs simply for ease of management and filtering, e.g. oc get pv -l pv=home to display all PVs with the additional label pv=home.

IMPORTANT: In any case when access to the same data in IBM Spectrum Scale is shared across different users or multiple namespaces by creating multiple static PVs with the exact same backing directory (or nested backing directories), then additional measures need to be taken to ensure that identical SELinux MCS labels (container security context) are applied to the shared data when accessed simultaneously. Otherwise different users may accidentally access the data with different SELinux MCS labels and lock out other users from their access to the data (due to SELinux relabeling). Furthermore, the admin also needs to ensure that the backing directories (or subdirectories within) have the proper POSIX user (uid) and group (gid) access permissions set in the IBM Spectrum Scale file system so that the intended scope of read/write/execute (rwx) access to the PV for the user process in the container is granted.
Please refer to the next section OpenShift SELinux and user (uid) / group (gid) security context for important additional considerations when sharing access to the same data in IBM Spectrum Scale across users and namespaces in OpenShift!

OpenShift SELinux and user (uid) / group (gid) security context

OpenShift applies strict security standards. Users interacting with Red Hat OpenShift Container Platform (OCP) authenticate through configurable identity providers (for example, HTPasswd or LDAP, see Understanding identity provider configuration). Through role-based access control (RBAC) objects like rules, roles, and role bindings OpenShift determines whether a user is authorized to perform an action within a namespace (or project).

In addition, Security Context Constraints (SCCs) define a set of conditions which a pod must comply with. Pods eventually run the user workloads, and SCCs control the permissions for these pods and determine the actions that the pods (and their collections of containers) can perform. SCCs are composed of settings and strategies that control the security features that a pod can access. For more information, see Managing Security Context Constraints. A pod submitted to the OpenShift cluster is authorized by the OpenShift user credentials but a pod itself is running under its associated service account. If no dedicated service account is specified then the pod runs under the default service account in the namespace. Based on the OpenShift user, group and service account authorization a set of available SCCs is evaluated and verified against the requested security context of the pod. If the validation reveals no match with any available SCCs the pod is rejected.

User workloads in OpenShift are running in the security context of pods and their containers which are scoped to namespaces. Data access to IBM Spectrum Scale is provided through persistent volumes (PVs) binding to persistent volume claims (PVCs) which are bound to namespaces. OpenShift and Kubernetes make use of namespaces to isolate resources (see Namespaces) while POSIX file systems like IBM Spectrum Scale rely on POSIX user ID (uid) and group ID (gid) file permissions and ACLs to control user access to the data.

Without any further customization, a regular (non-privileged) user in OpenShift is running under the restricted SCC by default and is not associated with a fixed user ID (uid) or group ID (gid) that correlates with a uid and gid in the IBM Spectrum Scale file system. Typically, when such a user starts a pod in OpenShift, by default, the pod is running with the default service account in the namespace and assigned an arbitrary uid from a pre-allocated range as well as a pre-allocated SELinux security context with a SELinux MCS label which is recursively applied to all files in the mounted PV (SELinux relabeling). Furthermore, these pre-allocated values for the uid and SELinux MCS label vary by namespace. Even pods from the same OpenShift user default to different pre-allocated values in different namespaces! The pre-allocated values for the security context of the pods are derived from the namespace annotations (see About pre-allocated security context constraints values) as shown in the example below:

apiVersion: v1
kind: Namespace
metadata:
  name: nmspc1
  annotations:
    openshift.io/requester: user1
    openshift.io/sa.scc.mcs: s0:c27,c24
    openshift.io/sa.scc.supplemental-groups: 1000750000/10000
    openshift.io/sa.scc.uid-range: 1000750000/10000
[...]

By default, a pod in this namespace that is submitted by a non-privileged user under the restricted SCC with no customized securityContext will run with the user ID (uid) 1000750000, primary group ID 0 (root), fsGroup ID 1000750000 and recursively enforce the SELinux MCS label "s0:c27,c24" on all the data in the attached PVs (and the backing directories in IBM Spectrum Scale).
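To verify which SCC and which security context have actually been applied to a running pod, the pod's runtime manifest can be inspected, for example as follows (the pod name "my-pod" is just a placeholder):

# oc get pod my-pod -o yaml | grep 'openshift.io/scc'
# oc get pod my-pod -o yaml | grep -A4 -i 'securitycontext'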

Default Security Context with restricted SCC

A cluster admin can manage and scope these security settings in OpenShift through custom SCCs (security context constraints) so that pods/containers can run with a selected securityContext to access shared data in IBM Spectrum Scale with proper user ID (uid)/group ID (gid) ranges and, most importantly, a uniform SELinux MCS label.

Security Context with Custom SCC and Pod securityContext

If pods with different SELinux labels access the same data in IBM Spectrum Scale then some pods will experience access loss to the shared data at the moment the data is accessed by another pod that is using a different SELinux MCS label!

SELinux is preventing shared data access across namespaces with restricted SCC

Custom SCCs (security context constraints) can be associated with users, groups and service accounts to allow pods to run with a customized security context. Typically, a privileged user like a cluster admin would define custom SCCs and service accounts in the user namespace to grant the required privileges and permissions for an application. Such a service account can be associated with a custom SCC through RBAC roles and rolebindings. An application (i.e. pods) in that namespace can now request a customized security context in its pods through this service account associated with a custom SCC that grants the necessary privileges and permissions. Alternatively, the cluster admin can also create custom SCCs, which are derived from the default SCCs with carefully scoped privileges and permissions, and assign them directly to OpenShift users and groups (of users).

When sharing access to the same data in IBM Spectrum Scale on OpenShift the cluster admin has to ensure the following:

  1. The POSIX file permissions of the PV's backing directory in IBM Spectrum Scale with its user ID (uid), group ID (gid) and mode bits (rwxrwxrwx) - and potentially configured ACLs - need to be aligned with the user ID (runAsUser) and group ID (runAsGroup, fsGroup, supplementalGroups) of the security context of the pods in order to successfully grant access to the files and subdirectories in the PV to all selected OpenShift users and applications.

  2. All pods accessing the same backing directory (including all nested subdirectories within the backing directory) in IBM Spectrum Scale must apply the same SELinux context (i.e. an identical SELinux MCS label) through their associated security context. Shared data access from pods in different namespaces or from different users will fail if different SELinux MCS labels are used!

Note: The article was originally based on Red Hat OpenShift 4.9.22 with IBM Spectrum Scale Container Native Storage Access (CNSA) v5.1.2 and IBM Spectrum Scale CSI Driver 2.4.0. The concepts described here generally explain the fundamental behavior of how the SELinux context is typically applied to volumes in OpenShift. It can be observed with IBM Spectrum Scale CNSA releases up to 5.1.6.0. Starting with IBM Spectrum Scale CNSA 5.1.7.0 (released on March 16, 2023) the SELinux default behavior has changed. IBM Spectrum Scale CNSA 5.1.7.0 will now - by default - mount the file system with a container permissive SELinux context by setting the mount context of the file system to system_u:object_r:container_file_t:s0. All files inside the file system will be considered to have the SELinux context defined on the file system mount, allowing all containers running as container_t SELinux type to access files on the file system if permitted by standard file permissions. See IBM Spectrum Scale container native and SELinux for more details.

User ID (uid) / group ID (gid) security context

The POSIX file system permissions comprise three classes of users called owner, group, and other. Each of them is associated with a set of permissions. With regard to these file permissions the storage admin needs to ensure that the uid, gid (using chown) and mode bits (using chmod) of each file and directory in the backing directory of a PV in IBM Spectrum Scale (plus potentially configured ACLs) allow proper access to all selected OpenShift users and applications.

By default, a pod running under the restricted SCC uses pre-allocated values from its namespace for the security context, i.e. it runs with an arbitrary user ID (e.g. uid=1000750000) and a primary group ID 0 (gid=0/root), for example, as determined by the given namespace in the previous section above:

sh-4.4$ id
uid=1000750000(1000750000) gid=0(root) groups=0(root),1000750000

Here, the storage admin may simply choose to set the access permissions (read/write/execute) for the root group (gid 0) on the backing directory in IBM Spectrum Scale to "r-x" for read-only or "rwx" for full read/write access so that any pod running under the restricted SCC with an arbitrary user ID and gid 0 has either read-only or full read/write access to the PV:

## READ-ONLY ACCESS:
# chmod g+rx-w /mnt/fs1/data/training-data
# ls -ld /mnt/fs1/data/training-data
drwxr-xr-x. 2 root root 4096 Apr 16 14:29 /mnt/fs1/data/training-data

## READ/WRITE ACCESS:
# chmod g+rwx /mnt/fs1/data/models
# ls -ld /mnt/fs1/data/models
drwxrwxr-x. 2 root root 4096 Apr 16 14:30 /mnt/fs1/data/models

This might be sufficient when providing quick access to user-owned (non-shared) data or uniformly shared data in IBM Spectrum Scale.

However, for shared access to backing directories with many subdirectories owned by different users or groups (similar to a /home directory), a more granular approach is typically advisable to also control access within the shared backing directory by using specific user IDs (uid), group IDs (gid) and file access permissions, as shown in the example below:

/mnt/fs1/ocp-home:
drwxr-xr-x. 7 root root 4096 Apr 16 14:29 .
drwxr-xr-x. 26 root root 262144 Apr 16 13:58 ..
drwxrwxrwx. 2 root root 4096 Apr 16 14:39 scratch
drwxrwx---. 2 root 5500 4096 Apr 16 14:31 shared
drwxr-x---. 2 5001 5001 4096 Apr 16 14:04 user1
drwxr-x---. 2 5002 5002 4096 Apr 16 14:04 user2
drwxr-x---. 2 5003 5003 4096 Apr 16 14:04 user3
drwx------. 2 5004 5004 4096 Apr 16 14:04 user4
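
A layout like the one above could, for example, be prepared by the storage admin with standard chown/chmod commands on the IBM Spectrum Scale file system (a sketch only; the path, uids and gids are the illustrative values from the listing above):

# mkdir -p /mnt/fs1/ocp-home/{scratch,shared,user1,user2,user3,user4}
# chmod 777 /mnt/fs1/ocp-home/scratch          ## world-writable scratch space
# chown root:5500 /mnt/fs1/ocp-home/shared     ## shared directory for supplemental group 5500
# chmod 770 /mnt/fs1/ocp-home/shared
# chown 5001:5001 /mnt/fs1/ocp-home/user1      ## per-user directories (analogously for user2/user3)
# chmod 750 /mnt/fs1/ocp-home/user1
# chown 5004:5004 /mnt/fs1/ocp-home/user4      ## private user directory
# chmod 700 /mnt/fs1/ocp-home/user4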

The way to align the file permissions in the file system with the security context in the pods of users and workloads in OpenShift is to make use of custom SCCs that allow users to apply the required securityContext to their pods with runAsUser (uid) and runAsGroup (gid) - for the user ID/uid and primary group ID/gid of the process in the containers - as well as supplemental groups defined through fsGroup and supplementalGroups as shown in the example below:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: registry.access.redhat.com/ubi8/ubi-minimal:latest
    securityContext:
      runAsUser: 5001
      runAsGroup: 5001
    command: [ "/bin/sh", "-c", "--" ]
    args: [ "while true; do ... ; done;" ]
    volumeMounts:
    - name: vol1
      mountPath: "/data"
  serviceAccountName: shared
  securityContext:
    fsGroup: 5500
    supplementalGroups: [5002, 5003]
  volumes:
  - name: vol1
    persistentVolumeClaim:
      claimName: home

In this example, we use a service account ("shared") in the pod manifest which is associated with a custom SCC through a role and a rolebinding (see section (1) Shared data access using custom SCCs and service accounts for more details). The custom SCC grants the necessary privileges and permissions to request a custom security context for a pod or container with:

  • runAsUser: user ID (uid) of process in the container
  • runAsGroup: primary group ID (gid) of process in the container
  • fsGroup: set as group ID/SGID on the mounted volume & added as supplemental group ID
    (Note: The fsGroup option must be supported by the storage provider for the volume. All files and directories created in this directory inherit the fsGroup ID through the special SGID bit.)
  • supplementalGroups: (additional) supplemental groups for shared group access; must be set in the securityContext on pod level

When applying the pod from the example above, a user process in the container would run as follows:

sh-4.4$ id
uid=5001(5001) gid=5001 groups=5001,5002,5003,5500

Note that SCCs do not currently provide a way to impose any constraints on the primary gid (runAsGroup) that a user can request in the securityContext of a pod or a container. Any primary gid including 0 (root) can be selected with runAsGroup for the user process in the container.

A note on fsGroup: The behavior of applying fsGroup by setting the group ownership and SGID bit (drwxrwsrwx) on the mounted directory depends on the type of volume and its support for ownership management. It differs between volumes managed by Kubernetes, like EmptyDir {}, and CSI volumes managed by a CSI driver. By default, Kubernetes recursively changes ownership and permissions of the contents of each volume to match the fsGroup specified in a pod's securityContext when the volume is mounted (see Configure volume permission and ownership change policy for Pods). This process can slow down pod startup times and can be controlled with the fsGroupChangePolicy option (Kubernetes v1.23). With CSI volumes the process of setting file ownership and permissions based on the fsGroup is performed by the CSI driver instead of Kubernetes and depends on the scope of support and the specific implementation in the CSI driver.
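For Kubernetes-managed volumes that support ownership management, the recursive ownership change on pod startup can be limited with the fsGroupChangePolicy field in the pod-level securityContext, for example as sketched below (whether this takes effect for a given volume depends on the volume type and the CSI driver, as described above):

securityContext:
  fsGroup: 5500
  fsGroupChangePolicy: "OnRootMismatch"   # only change ownership/permissions if the root of the
                                          # volume does not already match the requested fsGroup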

Generally, when sharing access to existing data in IBM Spectrum Scale with static provisioning, we would not want the existing file permissions in the shared directory to be changed, nor the SGID bit to be applied on the mounted directory, once we have carefully aligned the file permissions for users and groups across OpenShift and POSIX users. Applying fsGroup may be an option (if supported by the CSI driver) when a volume is used similarly to an EmptyDir volume for simple, flat data sharing among containers within OpenShift.

The fsGroup is always configured for a container in OpenShift (but not necessarily applied on the mounted volume) and added to the supplemental groups of the user process in the container. Either the default pre-allocated value from the namespace annotation is used for fsGroup (openshift.io/sa.scc.supplemental-groups: the minimum value of the range is the only allowed value for fsGroup) or the value specified in the securityContext is used if permitted by the SCC.

See Configure a Security Context for a Pod or Container to learn more about how to specify a securityContext for a pod or container and manage the uid/gid of the process in the container. Refer to the tutorial in Tutorial: Use SCCs to restrict and empower OpenShift workloads to see examples of how to create a custom SCC, define a service account, apply RBAC roles and rolebindings, and run pods with the proper user ID (uid) and group ID (gid) context.

SELinux security context

A solid understanding and management of the applied security context, and especially the SELinux security context of pods in OpenShift, is key to successful data sharing across users or namespaces with IBM Spectrum Scale. It makes it possible to take full advantage of the advanced features of IBM Spectrum Scale as a parallel file system, for example, parallel file access across multiple namespaces and cluster nodes (RWX access mode), a single file system namespace for all the data across different platforms and protocols, or even a global file system namespace across sites around the world with Active File Management (AFM).

Note: The article was originally based on Red Hat OpenShift 4.9.22 with IBM Spectrum Scale Container Native Storage Access (CNSA) v5.1.2 and IBM Spectrum Scale CSI Driver 2.4.0. The concepts described here generally explain the fundamental behavior of how the SELinux context is typically applied to volumes in OpenShift. It can be observed with IBM Spectrum Scale CNSA releases up to 5.1.6.0. Starting with IBM Spectrum Scale CNSA 5.1.7.0 (released on March 16, 2023) the SELinux default behavior has changed. IBM Spectrum Scale CNSA 5.1.7.0 will now - by default - mount the file system with a container permissive SELinux context by setting the mount context of the file system to system_u:object_r:container_file_t:s0. All files inside the file system will be considered to have the SELinux context defined on the file system mount, allowing all containers running as container_t SELinux type to access files on the file system if permitted by standard file permissions. See IBM Spectrum Scale container native and SELinux for more details.

With the default settings for the SELinux context in CNSA 5.1.7.0 (and higher) you no longer need to manually manage and ensure a proper SELinux context for shared data access in IBM Spectrum Scale.

SELinux relabeling

The SELinux security context is applied by recursively relabeling all the files and nested sub-directories in the backing directory of a PV with a SELinux MCS label. The SELinux relabeling does not happen when a PV is bound to a PVC and its namespace. It happens at the very moment when a pod consuming this PVC is started and attaches the associated volumes to its containers (ContainerCreating phase). If pods from different namespaces with different SELinux contexts are started and access the same data on the backend in IBM Spectrum Scale simultaneously, then the last pod started "wins" and takes over exclusive access to all data in the shared backing directory of the PV while all other pods lose their previous access to the data. This loss of data access happens almost undetected as it is only visible in the SELinux logs on the OpenShift worker node where the pod is running (i.e. by logging in as core user directly on the worker node console and using commands like sudo aureport -a or sudo ausearch -m avc to reveal SELinux violations).

If a persistent volume is statically provisioned with the IBM Spectrum Scale CSI driver using a regular directory (or a dependent fileset) in IBM Spectrum Scale as backing directory in the volumeHandle of the static PV manifest, then SELinux relabeling takes place for all pods running under a regular non-privileged user in OpenShift. All contents of the backing directory will be recursively relabeled with the SELinux MCS label of the pod/container accessing the PV, which is determined by the security context of the pod and depends on the OpenShift Security Context Constraints (SCCs) as well as the user, group, service account and pre-allocated defaults from the namespace.

If a persistent volume is statically provisioned with the IBM Spectrum Scale CSI driver using an independent IBM Spectrum Scale fileset as backing directory in the volumeHandle of the static PV manifest (e.g. an AFM fileset) then the contents of this fileset may remain untouched and not undergo SELinux relabeling. In this case the pre-existing attributes and SELinux labels in the file system take precedence but also may prevent access to the PV if no proper SELinux MCS labels were set manually in the file system by the storage admin. This can lead to Permission denied errors in the pod/container when trying to access the mount point of the static PV or any contents within that PV. To grant access for pods/containers in OpenShift the storage admin would need to manually set the SELinux MCS labels on and in the fileset accordingly. For example, the storage admin could grant access to the data in a fileset accessed by pods/containers in OpenShift through a static PV by setting the SELinux MCS label on an entire fileset (or just selected sub-directories within the fileset) to "system_u:object_r:container_file_t:s0" on the IBM Spectrum Scale storage cluster (Note: The option -R will recursively change the SELinux label on all files and directories at the specified destination):

# chcon -R "system_u:object_r:container_file_t:s0" /[absolute-path-to-fileset]/[fileset-link-point]

This would only have to be done once as the applied SELinux settings will persist. If SELinux is not enabled on the storage cluster, the manual SELinux relabeling could also be done by a cluster admin on OpenShift by running a debug pod, for example:

# oc debug node/[worker-node] -- chroot /host chcon -R 'system_u:object_r:container_file_t:s0' /mnt/[local-Scale-file-system-name]/[relative-path-to-fileset]/[fileset-link-point]

Note: As always when executing commands as root user: Be very cautious and careful when doing the SELinux relabeling by running the chcon command directly in a debug pod on an OpenShift worker node. When done wrong you can harm your system!

Without adding a specific category (like c7, c28) to the SELinux MCS label "container_file_t:s0", the data can be accessed by any pod/container independent of the specific SELinux MCS category that is assigned to the container process by OpenShift (for example, "container_t:s0:c7,c28"), because the empty category set is contained in every process's category set. Any pod/container means, of course, any pod/container that has been given explicit access to the data through a statically provisioned PV.

The storage admin can even set the SELinux MCS labels more selectively and granularly with specific categories on selected (or all) objects in the fileset if needed, but that would entail carefully aligning the SELinux labels in the file system with the SELinux security context of the pods/containers in OpenShift, which depends on the OpenShift Security Context Constraints (SCCs) as well as the user, group, service account and pre-allocated defaults from the namespace. For example, setting the SELinux categories of a sub-directory within the fileset to "container_file_t:s0:c14,c27" would restrict access to that sub-directory and only grant access to containers running specifically with the matching categories "s0:c14,c27".
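For example, a sketch of how the storage admin could restrict a sub-directory within the fileset to containers running with the matching SELinux categories "s0:c14,c27" (the sub-directory name is just a placeholder):

# chcon -R "system_u:object_r:container_file_t:s0:c14,c27" /[absolute-path-to-fileset]/[fileset-link-point]/[restricted-subdir]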

Please refer to How SELinux separates containers using Multi-Level Security and Why you should be using Multi-Category Security for your Linux containers for more information about SELinux MCS labels for containers.

Note that recursive SELinux relabeling of all files in an attached volume also does not take place when hostPath volumes are used to directly mount a given path from IBM Spectrum Scale into a container (i.e. not using CSI volumes). However, the use of hostPath volumes is generally discouraged (!) as it requires the container to run in a privileged security context which poses a high security risk. An example of a pod manifest using a hostPath volume is given below:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: [...]
    securityContext:
      privileged: true
    volumeMounts:
    - name: vol1
      mountPath: "/data"
  volumes:
  - name: vol1
    hostPath:
      path: /mnt/fs1/data/pv01

A hostPath volume can also be used in static PV manifests with labels or claimRef as described in the previous chapters to properly control the binding to PVC requests. See Persistent storage using hostPath for a brief overview.
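A minimal sketch of such a static PV manifest using a hostPath volume with claimRef might look as follows (names and the namespace are placeholders; note again that consuming hostPath volumes requires privileged pods and is generally discouraged):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: hostpath-pv01
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteMany
  hostPath:
    path: /mnt/fs1/data/pv01          # path on the worker nodes where IBM Spectrum Scale is mounted
  claimRef:
    name: data
    namespace: target-namespace1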

Disabling SELinux relabeling for selected pods/containers

SELinux relabeling on persistent volumes (PVs) with large amounts of files can lead to a significant delay of a pod entering the Running state or even lead to a CreateContainerError when the recursive SELinux relabeling of all directories and files in the attached volumes does not succeed in time and the container creation process hits a timeout of 120 seconds. SELinux relabeling takes place during the startup phase (ContainerCreating) of a pod with all its containers before the pod actually enters the Running state. It imposes a huge write IO load on the storage backend. Left in this state the pod will repeatedly keep trying to restart and apply the SELinux MCS labels to all files in the attached PVs in the background, imposing a steady additional write IO load on the storage backend which may even further impact other pods or workloads. If such a pod is repeatedly not succeeding to come up after a reasonable amount of time it should be deleted as it may even prevent other pods from starting up and finishing the SELinux relabeling successfully due to the increased workload on the storage backend. This behavior and potential workarounds are described in the following Red Hat solution article: When using Persistent Volumes with high file counts in OpenShift, why do pods fail to start or take an excessive amount of time to achieve "Ready" state?.
The Red Hat solution article proposes a workaround with SELinux type "spc_t" as one of two ways to skip the SELinux relabeling for a volume and overcome the CreateContainerError for volumes with large amounts of files that would otherwise run into the SELinux relabeling timeout issue. The option "spc_t" is a special SELinux type, standing for super privileged container type. It instructs CRI-O to skip the SELinux relabeling. Containers with this type will not be constrained by SELinux policies at all. They do not apply an SELinux MCS label and therefore completely skip the SELinux relabeling process on the data in an attached volume. This also means that SELinux as an additional protection and isolation layer is de facto disabled for containers running with SELinux type "spc_t" in case of a container escape! The SELinux protection comes in addition to the regular security and isolation mechanisms that already come with standard Linux containerization in general (e.g. Linux kernel namespaces, capabilities, control groups, seccomp policies, etc.). In order to ensure that a container process with SELinux type "spc_t" cannot harm the host system in case of a container escape (or container break-out), we must ensure that the container process always runs with a non-root user ID (uid). This is the default for regular users in OpenShift running under the restricted SCC (security context constraints), which enforces a MustRunAsRange strategy for the runAsUser (uid) security context. The restricted SCC is the most restrictive of the default SCCs and is used by default for the group of authenticated users on OpenShift.
The SELinux type "spc_t" can be applied in multiple ways, for example, with a custom SCC that is derived from the restricted SCC and where we change the seLinuxContext strategy from MustRunAs to RunAsAny:
seLinuxContext:
  type: RunAsAny
Here, pods or containers using this custom SCC would need to explicitly request the SELinux type "spc_t" through their securityContext to skip SELinux relabeling:
securityContext:
  seLinuxOptions:
    type: "spc_t"
The securityContext can either be set for an individual container or for the whole pod, in which case it applies to all containers that do not specify their own seLinuxOptions. A running pod/container with this configuration shows the following SELinux context in its runtime manifest and skips SELinux relabeling on attached volumes:
# oc get pod test -o yaml | grep -A3 -i selinux
    seLinuxOptions:
      type: "spc_t"
Note: A custom SCC with a seLinuxContext of RunAsAny does allow a pod or container to apply any SELinux security context - including the type "spc_t".
Such a custom SCC can be made available to selected pods and containers in a given namespace through a service account with a role and rolebinding as described in section (1) Shared data access using custom SCCs and service accounts or it can be assigned to users and groups as described in section (2) Shared data access with custom SCCs for OpenShift users or groups. Another option would be to add the custom SCC (e.g. named "any-selinux-scc") to an existing service account like the default service account in a given namespace with
# oc adm policy add-scc-to-user any-selinux-scc -z default -n [namespace]
This will create a new ClusterRole that allows the use of the custom SCC "any-selinux-scc" (ClusterRole/system:openshift:scc:any-selinux-scc) and a new rolebinding (system:openshift:scc:any-selinux-scc) in the namespace which associates the default service account in that namespace with this ClusterRole and thereby allows it to use the custom SCC "any-selinux-scc".
You can remove the custom SCC from the default service account in the given namespace with
# oc adm policy remove-scc-from-user any-selinux-scc -z default -n [namespace]
This removes the rolebinding in the namespace but the automatically created ClusterRole remains and must be deleted separately:
# oc delete clusterrole system:openshift:scc:any-selinux-scc
Another approach would be to use a custom SCC with a seLinuxContext that already sets the SELinux type "spc_t" together with a predefined and unique SELinux MCS label (e.g. level: "s0:c26,c0") and a MustRunAs strategy (so we do not grant containers the general RunAsAny privilege to run with any SELinux security context):
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: shared-scc
allowPrivilegedContainer: false
[...]
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
  seLinuxOptions:
    type: "spc_t"
    level: "s0:c26,c0"
[...]
Setting the SELinux type "spc_t" under seLinuxOptions in the seLinuxContext of the custom SCC skips SELinux relabeling in attached volumes. In addition, setting a unique SELinux MCS label (e.g. level: "s0:c26,c0") in the custom SCC will attach this predefined SELinux MCS label to all containers using this SCC even if it will not be applied on the files in the attached volumes due to the SELinux type "spc_t". In this case the pods or containers must explicitly request the SELinux type "spc_t" together with the predefined SELinux MCS label "s0:c26,c0" from the custom SCC in their securityContext:
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: test
    image: registry.access.redhat.com/ubi8/ubi-minimal:latest
    securityContext:
      seLinuxOptions:
        type: "spc_t"
        level: "s0:c26,c0"
[...]
A pod/container without this specific securityContext section (or none at all) may otherwise select to run with a different, less privileged SCC (e.g. restricted SCC) and not skip SELinux relabeling or may even fail to start at all if the pod admission process cannot find a suitable SCC for the overall security context (see Note 2).
Here, the root directory of the container will bear the SELinux label provided in the SCC while the SELinux labels on the files in the attached volume will remain unchanged and keep the original SELinux labels as defined in the file system. A running container using this custom SCC will also show a SELinux MCS label along with the SELinux type "spc_t" in its runtime manifest:
# oc get pod test -o yaml | grep -A3 -i selinux
    seLinuxOptions:
      level: "s0:c26,c0"
      type: "spc_t"
This approach is generally similar to the approach that we describe in the upcoming sections when using custom SCCs with a predefined unique SELinux MCS label to enable shared data access across namespaces. The custom SCC can be made available to pods and containers in the same way as already described above. The custom SCC should ideally be derived from the restricted SCC with least privileges and then further customized to fit your specific needs (e.g., specify ranges for runAsUser, fsGroup, supplementalGroups, etc.). As containers with SELinux type "spc_t" will not be constrained by SELinux policies we must ensure that the custom SCC sets the uid with runAsUser to a MustRunAs or MustRunAsRange strategy to properly exclude the uid 0 (root) so that even in the event of a container escape no harm can be done to the host system!
In the referenced article above, Red Hat considers using SELinux type "spc_t" a safe option if the pod/container is running with an SCC that has runAsUser set to MustRunAsRange, like the restricted SCC, which enforces that the container process runs with a random, non-root uid. File access is constrained as the container process runs as a non-root user ID (uid), has no permissions to modify anything on the host system, and would only have read access to world-readable files in case of a container escape.
DISCLAIMER: Note that this section is based on a solution proposed by Red Hat to overcome issues with SELinux relabeling for volumes with large amounts of files. This blog post does not suggest or recommend effectively disabling SELinux security on pods or containers in OpenShift by using the "spc_t" option! This section only provides an example of how to apply the workaround described in the mentioned solution article by Red Hat. When applying the "spc_t" option and disabling SELinux you do this at your own risk! Contact Red Hat Technical Support for assistance.

Ensuring an identical SELinux security context for shared data access

We must ensure that all pods simultaneously accessing the same data in IBM Spectrum Scale (including all nested subdirectories) apply the same SELinux MCS label through their associated security context. This is required to prevent the shared backing directory from being relabeled with a different SELinux MCS label when a new pod accessing the same shared data is started from another namespace. Such a pod would otherwise lock out all previously started pods with a different SELinux label from accessing the data in the shared volume (PV).

SELinux MCS labels can be configured in the securityContext of an individual pod or container or defined in a custom SCC. The default SELinux MCS label that is applied to all pods in a namespace of a regular OpenShift user under the restricted SCC is pre-allocated from the annotations in the associated OpenShift namespace (openshift.io/sa.scc.mcs). A regular (non-privileged) user under the restricted SCC cannot simply edit these pre-allocated defaults in the annotations of a namespace nor request a custom SELinux MCS label in the securityContext of a pod or container other than the associated default MCS label. Only a cluster admin or privileged user can edit the SELinux MCS label of a user namespace or start a specific pod with a custom SELinux MCS label defined in the pod's securityContext. Therefore, the preferred way to share access to the same data in IBM Spectrum Scale with a properly defined SELinux MCS label would be to have the cluster admin create a custom SCC (for example, a custom SCC derived from the restricted SCC) with a pre-selected SELinux MCS label defined in this custom SCC which will then be made available to selected namespaces through service accounts (and RBAC roles and rolebindings) or can directly be assigned to selected OpenShift users or groups (of users).

Examples for safely providing shared data access

In the next sections we show different ways in which a proper security context with selected user IDs (uid), group IDs (gid) and SELinux context can be applied when pods share access to the same data in IBM Spectrum Scale through statically provisioned PVs. Creating custom SCCs is generally the preferred and recommended way to properly manage privileges and permissions in OpenShift and allow users, groups and service accounts to run pods with the required security context. In addition, we also introduce alternative ways that safely enable shared data access across different OpenShift users or namespaces without the need to create any custom SCCs.

The following sections discuss these approaches for safely sharing data access in OpenShift:

Note: The article was originally based on Red Hat OpenShift 4.9.22 with IBM Spectrum Scale Container Native Storage Access (CNSA) v5.1.2 and IBM Spectrum Scale CSI Driver 2.4.0. The concepts described here generally explain the fundamental behavior of how the SELinux context is typically applied to volumes in OpenShift. It can be observed with IBM Spectrum Scale CNSA releases up to 5.1.6.0. Starting with IBM Spectrum Scale CNSA 5.1.7.0 (released on March 16, 2023) the SELinux default behavior has changed. IBM Spectrum Scale CNSA 5.1.7.0 will now - by default - mount the file system with a container permissive SELinux context by setting the mount context of the file system to system_u:object_r:container_file_t:s0. All files inside the file system will be considered to have the SELinux context defined on the file system mount, allowing all containers running as container_t SELinux type to access files on the file system if permitted by standard file permissions. See IBM Spectrum Scale container native and SELinux for more details.

With the default settings for the SELinux context in CNSA 5.1.7.0 (and higher) you no longer need to manually manage and ensure a proper SELinux context for shared data access in IBM Spectrum Scale.

(1) Shared data access using custom SCCs and service accounts

The preferred way to ensure that a proper SELinux context is used on PVs when sharing data access would be to create custom SCCs. Custom SCCs can be associated with service accounts in user namespaces through RBAC roles and rolebindings. All pods in these namespaces that share access to the same data in IBM Spectrum Scale and use these service accounts will automatically enforce a security context with identical SELinux MCS labels and apply a common set of uid/gid ranges accordingly as needed. 

Using custom SCCs to share data access: service accounts

Here is an example of a custom SCC that would enforce a specific SELinux MCS label on all pods using this SCC and sharing access to the same data in IBM Spectrum Scale from any namespace:

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: shared-scc
allowPrivilegedContainer: false
runAsUser:
  type: MustRunAsRange
  uidRangeMin: 5000
  uidRangeMax: 5499
seLinuxContext:
  type: MustRunAs
  seLinuxOptions:
    level: "s0:c26,c0"
fsGroup:
  type: MustRunAs
  ranges:
  - min: 5000
    max: 6000
supplementalGroups:
  type: MustRunAs
  ranges:
  - min: 5000
    max: 6000

In addition, this custom SCC also defines ID ranges for the user process in the associated containers, i.e. runAsUser (uid), fsGroup and supplementalGroups.  

By enforcing a uniform SELinux MCS label (level: "s0:c26,c0") in the custom SCC with a MustRunAs strategy we ensure that any pod running under this custom SCC from any namespace will apply the same uniform SELinux MCS label on the mounted volumes and safely allow shared access to the data in IBM Spectrum Scale:

seLinuxContext:
  type: MustRunAs
  seLinuxOptions:
    level: "s0:c26,c0"

In addition, we also define ranges for user IDs (uid) and supplemental group IDs (gid) in this example with our custom SCC named "shared-scc" through runAsUser, fsGroup and supplementalGroups that allow to select specific user (uid) and supplemental group (gid) IDs in the pod securityContext when accessing the shared data in IBM Spectrum Scale (see section User ID (uid) / group ID (gid) security context ):

runAsUser:
  type: MustRunAsRange
  uidRangeMin: 5000
  uidRangeMax: 5499
fsGroup:
  type: MustRunAs
  ranges:
  - min: 5000
    max: 6000
supplementalGroups:
  type: MustRunAs
  ranges:
  - min: 5000
    max: 6000

Of course, we can apply even more granular custom SCCs with smaller ID ranges or even dedicated uid and gid values as needed. See Security context constraints strategies for more information on the available options and ranges.
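As a sketch of a more narrowly scoped variant, the runAsUser strategy in such a custom SCC could, for example, pin a single dedicated user ID instead of a range (the uid 5001 is just an illustrative value):

runAsUser:
  type: MustRunAs
  uid: 5001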

To grant access to the custom SCC for an application or a pod in any given namespace (here we use "target-namespace1" in our example) the cluster admin must create a service account (here "shared") in that namespace and associate the service account through a role and rolebinding (RBAC) with the custom SCC (here "shared-scc") as shown in the example below:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: shared
  namespace: target-namespace1
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: use-shared-scc
  namespace: target-namespace1
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - "shared-scc"
  resources:
  - securitycontextconstraints
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: use-shared-scc
  namespace: target-namespace1
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: use-shared-scc
subjects:
- kind: ServiceAccount
  name: shared
  namespace: target-namespace1

This manifest creates the necessary role, rolebinding and service account in the target namespace:

$ oc get sa,role,rolebinding -n target-namespace1
NAME                      SECRETS   AGE
serviceaccount/builder    2         11d
serviceaccount/default    2         11d
serviceaccount/deployer   2         11d
serviceaccount/shared     2         36m

NAME                                            CREATED AT
role.rbac.authorization.k8s.io/use-shared-scc   2022-09-21T11:25:33Z

NAME                                                           ROLE                               AGE
rolebinding.rbac.authorization.k8s.io/admin                    ClusterRole/admin                  11d
rolebinding.rbac.authorization.k8s.io/system:deployers         ClusterRole/system:deployer        11d
rolebinding.rbac.authorization.k8s.io/system:image-builders    ClusterRole/system:image-builder   11d
rolebinding.rbac.authorization.k8s.io/system:image-pullers     ClusterRole/system:image-puller    11d
rolebinding.rbac.authorization.k8s.io/use-shared-scc           Role/use-shared-scc                36m

For more information on role-based access control (RBAC) through roles and rolebindings in OpenShift please refer to Using RBAC to define and apply permissions.

The cluster admin needs to adapt and apply the above manifest to every target namespace that requires access to the shared data in IBM Spectrum Scale. Any pod and application in these target namespaces can now make use of the new service account ("shared") and the associated custom SCC simply by referencing the service account in their pod manifests (serviceAccountName: shared).

Another option would be to add the custom SCC (e.g. "shared-scc") to an existing service account like the default service account in a given namespace with

# oc adm policy add-scc-to-user shared-scc -z default -n [namespace]

This will create a new ClusterRole to allow the use of the custom SCC "shared-scc" (ClusterRole/system:openshift:scc:shared-scc) and a new rolebinding (system:openshift:scc:shared-scc) in the namespace to associate the default service account in that namespace with the ClusterRole which grants access (rule "use") to the custom SCC, here named "shared-scc".

You can remove the custom SCC from the default service account with

# oc adm policy remove-scc-from-user shared-scc -z default -n [namespace]

This removes the rolebinding in the namespace but the automatically created ClusterRole remains and must be deleted separately:

# oc delete clusterrole system:openshift:scc:shared-scc

Note: When the custom SCC is added to the default service account in a namespace then a pod/container in that namespace does not need to explicitly set the serviceAccountName in the manifest - it is used by default!

Pods in these namespaces can now request a specific securityContext that aligns with the required file permissions in the shared data in IBM Spectrum Scale and is scoped by the custom SCC associated with the selected service account:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: registry.access.redhat.com/ubi8/ubi-minimal:latest
    securityContext:
      runAsUser: 5001
      runAsGroup: 5001
    command: [ "/bin/sh", "-c", "--" ]
    args: [ "while true; do ... ; done;" ]
    volumeMounts:
    - name: vol1
      mountPath: "/data"
  serviceAccountName: shared
  securityContext:
    fsGroup: 5500
    supplementalGroups: [5002, 5003]
  volumes:
  - name: vol1
    persistentVolumeClaim:
      claimName: home

Here, the cluster admin would already have provided access to the shared data in IBM Spectrum Scale through statically provisioned PVs for each target namespace using the claimRef option as described in the section Advanced static volume provisioning using claimRef with one static PV for each (!) target namespace:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: home-ns1
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteMany
  csi:
    driver: spectrumscale.csi.ibm.com
    volumeHandle: "835838342966509310;099B6A7A:5EB99721;path=/mnt/fs1/ocp-home"
  claimRef:
    name: home
    namespace: target-namespace1

Each of these static PVs has the same backing directory and is reserved to only bind to a specific PVC name ("home") and target namespace ("target-namespace1") as specified in its claimRef section:

$ oc get pv
NAME       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                    AGE
home-ns1   1Gi        RWX            Retain           Available   target-namespace1/home   48s
home-ns2   1Gi        RWX            Retain           Available   target-namespace2/home   55s
home-ns3   1Gi        RWX            Retain           Available   target-namespace3/home   59s

A user in each namespace can claim one of these PVs through a similar PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: home
spec:
  storageClassName: ""
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi

The reserved PV is then bound to the specific PVC and its namespace. It can be used in all pods in that namespace.

Static volume provisioning for sharing data access across different namespaces
Note 1: The described way of using a custom SCC with a service account in this section is just an example to demonstrate how the use of SCCs, service accounts and pod/container securityContexts generally work together. It makes it possible to enforce an identical SELinux MCS label on pods in different namespaces with a custom SCC - in addition to enabling the use of custom user ID (uid) and group ID (gid) ranges. These are typical steps when packaging a containerized application for deployment and scoping the required permissions and privileges through SCCs. The example can be extended and adapted to fit specific needs.
However, this example certainly has its limitations as a PVC can still be mounted by any pod in the namespace, not just the pod that is using the correct service account and securityContext. We cannot enforce that the PVC is only used by pods that apply the correct service account and securityContext with the specific custom SCC. Should a regular user - even by error - mount the PVC in a regular pod using the "default" service account from the namespace then the user might not be able to access the data in the PV but - most importantly - the user will immediately impact others by using a wrong SELinux MCS label and removing all access to the shared data from all previously started pods that were using the correct service account and SELinux MCS label (i.e. starting one pod with a wrong SELinux label immediately causes a relabeling of the whole backing directory and locks out everyone else)!
The same will happen if the user uses the correct service account in the pod manifest but forgets to request any explicit user and group attributes in the securityContext, so that the associated custom SCC is not even selected by the admission controller for the pod because the restricted SCC with fewer privileges is considered sufficient to run the pod. In this case the pod uses the pre-allocated SELinux MCS label from the namespace.
Similar considerations apply when using specific user and (supplemental) group ID ranges in the custom SCC for a group of different OpenShift users who are sharing access to the same data. There is no enforcement in place that will ensure that these users will actually apply the intended user IDs and group IDs in their pod securityContext section. They can apply any valid value from the allowed range in the SCC for their pods!
Furthermore, SCCs currently do not offer a way to restrict the primary group ID of the user process in a container (runAsGroup). Even under the restricted SCC a user can start a pod with any primary group ID wanted - including gid 0 (root)! So in this example with a custom SCC with broader uid/gid ranges we generally have to trust the user (deployer) within the given scope to do the right thing. Of course, you can always further narrow down the scope of ID ranges in custom SCCs even down to custom SCCs for each user with specific user and (supplemental) group IDs. 
The given example is generally a good approach for applications that need to run in multiple user namespaces and require shared access to data in IBM Spectrum Scale with carefully scoped security contexts, privileges and permissions that are not necessarily modified by any users themselves.
Alternatively, if you focus more on users and groups that are developing and deploying their own pods and require shared access to the same data in IBM Spectrum Scale then you can also directly give access to custom SCCs to users and groups without creating service accounts, roles and rolebindings. This solution is introduced in the following section (2) Shared data access with custom SCCs for OpenShift users or groups.
Note 2: When submitting a pod on OpenShift all available SCCs for the involved user, groups, and service accounts will be evaluated for the security context. So when deploying a pod as cluster admin with access to higher privileged SCCs like the privileged or anyuid SCC, a different security context may be applied to the deployed pod than for a non-privileged user associated with the restricted SCC. A privileged user can run a pod as a less-privileged user to test which SCC will actually be selected for the pod as follows:
# oc apply -f pod.yaml --as=non-privileged-user

For example, when deploying the pod manifest from the example above as a privileged user like the cluster admin, the SCC that is finally chosen may not necessarily be the one associated with the service account if another SCC that is available to the privileged user meets the requested security context in the pod. In this case the pod may run with the specified service account (here: shared) but the evaluation of all available SCCs by the admission process picks the anyuid SCC that is also available to the privileged cluster admin user who submitted the pod:

# oc get pod my-pod -o yaml | grep serviceAccount:
serviceAccount: shared

# oc get pod my-pod -o yaml | grep scc
openshift.io/scc: anyuid

Therefore a pod may not necessarily pick the custom SCC associated with its service account if other SCCs are available to the user who submits the pod! In this case, when using the anyuid SCC it would even lead to a wrong (!) SELinux MCS label being applied to the shared data because the specific MCS label "s0:c26,c0" from the custom SCC ("shared-scc") associated with the service account is not applied here. Instead, the default SELinux MCS label from the pre-allocated value in the namespace annotation is used. In order to avoid such an ambiguity it is best to have the user always request the very same SELinux MCS label in the securityContext of the pod as defined in the custom SCC to make the deployment bullet-proof (see Assign SELinux labels to a Container): 

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: registry.access.redhat.com/ubi8/ubi-minimal:latest
    securityContext:
      runAsUser: 5001
      runAsGroup: 5001
      seLinuxOptions:
        level: "s0:c26,c0"
    command: [ "/bin/sh", "-c", "--" ]
    args: [ "while true; do ... ; done;" ]
    volumeMounts:
    - name: vol1
      mountPath: "/data"
  serviceAccountName: shared
  securityContext:
    fsGroup: 5500
    supplementalGroups: [5002, 5003]
  volumes:
  - name: vol1
    persistentVolumeClaim:
      claimName: home

This ensures that the correct SELinux MCS label is always applied by the pod. The custom SCC from our example, here "shared-scc" with SELINUX set to MustRunAs with a predefined and matching SELinux MCS label (seLinuxOptions: level: "s0:c26,c0"), is preferred in the selection process by the admission controller even for a privileged user because it requests the least privileges and permissions with SELINUX=MustRunAs compared to other privileged SCCs with SELINUX=RunAsAny.

Using SCCs to share data access: Pod admission process
If no suitable SCC is available to the pod through the user (who is submitting the pod) or the service account then the attempt to start the pod will fail right away!
Should the pod be submitted by a non-privileged user and request a different (wrong) SELinux context (e.g., "s0:c26,c270") which is not the one permitted by the custom SCC ("shared-scc") associated with the service account, then the attempt fails. The error message reveals in detail how the admission controller tried to find a suitable SCC from all available SCCs for the pod:
Error from server (Forbidden): error when creating "pod.yaml": pods "my-pod" is forbidden: unable to validate against any security context constraint: 
[
provider "anyuid": Forbidden: not usable by user or serviceaccount,
spec.containers[0].securityContext.seLinuxOptions.level: Invalid value: "s0:c26,c270": must be s0:c27,c19, <-- restricted SCC: not matching pre-allocated SElinux label from namespace
spec.containers[0].securityContext.seLinuxOptions.level: Invalid value: "s0:c26,c270": must be s0:c26,c0, <-- shared-scc SCC: not matching the predefined SELinux label in SCC
provider "nonroot": Forbidden: not usable by user or serviceaccount,
provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount,
provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount,
provider "hostnetwork": Forbidden: not usable by user or serviceaccount,
provider "hostaccess": Forbidden: not usable by user or serviceaccount,
provider "spectrum-scale-csiaccess": Forbidden: not usable by user or serviceaccount,
provider "node-exporter": Forbidden: not usable by user or serviceaccount,
provider "ibm-spectrum-scale-privileged": Forbidden: not usable by user or serviceaccount,
provider "privileged": Forbidden: not usable by user or serviceaccount
]

Please refer to Get started with security context constraints on Red Hat OpenShift to learn more about how to use Security Context Constraints (SCCs) on Red Hat OpenShift to manage the security context of your pods on a granular level.

(2) Shared data access with custom SCCs for OpenShift users or groups

Another option to allow pods in different namespaces to safely share access to the same data in IBM Spectrum Scale is using custom SCCs assigned to OpenShift users or groups (of users) instead of service accounts. Here, we focus on OpenShift users and groups that are developing and deploying their own pods in their own namespaces and require shared access to the same data in IBM Spectrum Scale. A group of data scientists, for example, would fall into this category when they develop and train new deep learning models and require access to huge amounts of shared training data without wasting space and time on data copies. Here, the cluster admin would carefully scope the required permissions and privileges in custom SCCs with selected user IDs (uid) and group IDs (gid) that align with the user and group IDs in the IBM Spectrum Scale file system as well as enforcing a uniform SELinux MCS label for safe data sharing (see Assign SELinux labels to a Container).

The cluster admin can organize OpenShift users in groups as follows:

1. Create a new group:

# oc adm groups new homegrp

2. Add users to a group:

# oc adm groups add-users homegrp user1 user2

Having created a custom SCC like the "shared-scc" from the previous section (1) Shared data access using custom SCCs and service accounts the cluster admin can assign the custom SCC directly to a group or to individual users with:

# oc adm policy add-scc-to-group shared-scc homegrp
# oc adm policy add-scc-to-user shared-scc user3

OpenShift users and groups can also be removed from an SCC easily by using similar commands:

# oc adm policy remove-scc-from-group shared-scc homegrp
# oc adm policy remove-scc-from-user shared-scc user3

To list all groups and users bound to a custom SCC use (and adjust the -A10 option accordingly):

$ oc describe clusterrolebinding.rbac | grep -A10 scc:shared-scc
Name:         system:openshift:scc:shared-scc
Labels:       <none>
Annotations:  <none>
Role:
  Kind:  ClusterRole
  Name:  system:openshift:scc:shared-scc
Subjects:
  Kind   Name     Namespace
  ----   ----     ---------
  User   user3
  Group  homegrp

Here, we can see that our custom SCC "shared-scc" is bound to the user "user3" and the group "homegrp". All these users and groups now have access to this custom SCC in addition to their default SCCs like the restricted SCC. No further service accounts, roles and rolebindings are required.

For shared access to the data in IBM Spectrum Scale the cluster admin needs to create one static PV for each (!) target namespace with the same backing directory and use the claimRef option to reserve each PV for a specific PVC in a selected target namespace. This is the exact same procedure as described and depicted in the previous section (1) Shared data access using custom SCCs and service accounts.
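For example, a second static PV for another target namespace would reference the exact same backing directory and differ only in its name and claimRef section, as sketched below (reusing the illustrative volumeHandle from the previous section):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: home-ns2
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteMany
  csi:
    driver: spectrumscale.csi.ibm.com
    volumeHandle: "835838342966509310;099B6A7A:5EB99721;path=/mnt/fs1/ocp-home"
  claimRef:
    name: home
    namespace: target-namespace2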

Using SCCs to share data access: OpenShift users & groups
Users with access to the custom SCC and shared data in IBM Spectrum Scale can now run pods in their own namespaces with a specific securityContext that properly aligns with the required file permissions in the IBM Spectrum Scale file system and apply a uniform SELinux MCS label for safe data sharing.
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: registry.access.redhat.com/ubi8/ubi-minimal:latest
    securityContext:
      runAsUser: 5001
      runAsGroup: 5001
    command: [ "/bin/sh", "-c", "--" ]
    args: [ "while true; do ... ; done;" ]
    volumeMounts:
    - name: vol1
      mountPath: "/data"
  securityContext:
    fsGroup: 5500
    supplementalGroups: [5002, 5003]
  volumes:
  - name: vol1
    persistentVolumeClaim:
      claimName: home
Here, the pods need to request capabilities or specific uid/gid ranges that are only available from the custom SCC in order to ensure that the pod admission process (see Note 2 in section (1) Shared data access using custom SCCs and service accounts) actually selects the intended custom SCC. If no specific capabilities or uid/gid ranges are requested in the pod security context then the restricted SCC is selected, which does not have a common SELinux MCS label defined (in which case the default MCS labels from the namespaces are applied, preventing shared data access across namespaces).
Similarly, if the users have access to higher privileged SCCs like the "anyuid" SCC then we might end up with the pod admission process picking another SCC instead of our intended custom SCC. In these cases the pods would need to explicitly request the very same SELinux MCS label in their securityContext as defined in the custom SCC so that the correct SCC is selected by the admission controller:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: registry.access.redhat.com/ubi8/ubi-minimal:latest
    securityContext:
      seLinuxOptions:
        level: "s0:c26,c0"
      runAsUser: 5001
      runAsGroup: 5001
    command: [ "/bin/sh", "-c", "--" ]
    args: [ "while true; do ... ; done;" ]
    volumeMounts:
    - name: vol1
      mountPath: "/data"
  securityContext:
    fsGroup: 5500
    supplementalGroups: [5002, 5003]
  volumes:
  - name: vol1
    persistentVolumeClaim:
      claimName: home
Here, the same limitations with regard to uid/gid ranges apply as already discussed in Note 1 in section (1) Shared data access using custom SCCs and service accounts. Of course, we can apply even more granular custom SCCs with smaller ranges or even dedicated uid and gid values to each involved OpenShift user or group depending on our needs. See Security context constraints strategies for more information on the available options, capabilities and how to define ID ranges.
Please refer to Managing SCCs in OpenShift to learn more about assigning SCCs to users and groups as well as how an SCC gets selected during the admission process when a pod is submitted by a user. Also have a look at Syncing LDAP groups to learn how you can sync LDAP records with internal OpenShift Container Platform records, enabling you to manage your groups in one place.

(3) Shared data access using a shared namespace for multiple users

A single OpenShift user typically works on multiple projects in multiple namespaces. A single namespace can also be shared by multiple users in different roles (admin|edit|view). Therefore the easiest way to share access to the same data in IBM Spectrum Scale would be to follow these steps:

  • Create a new namespace (aka project) for the collaboration project.
  • Add all users who collaborate on the same project in this namespace with their respective role:
    # oc adm policy add-role-to-user admin user1 -n shared-namespace
    # oc adm policy add-role-to-user edit user2 -n shared-namespace
    # oc adm policy add-role-to-user edit user3 -n shared-namespace
    # oc adm policy add-role-to-user view user4 -n shared-namespace
  • Create a static PV for the shared backing directory in IBM Spectrum Scale, using the claimRef option to control the proper binding of this PV to a PVC from this namespace only.
  • Bind the static PV to the PVC in this collaboration namespace for the shared project (see the PVC sketch after this list).
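A minimal PVC in the collaboration namespace to which the reserved static PV can bind might look like the following sketch; names and sizes are illustrative, and the access mode and requested size must be compatible with the static PV (which reserves itself for exactly this PVC via claimRef):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: home
  namespace: shared-namespace
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: ""                 # empty string so no dynamic provisioning is triggered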

This is quite a safe solution as it does not depend on users applying a correct securityContext in their pods. By default, all non-privileged users in the namespace are bound to the same pre-allocated default values for the SELinux MCS label and uid/gid ranges. This enables safe data sharing as all pods run under the same SELinux security context by default - provided no other SCCs are applied to specific service accounts or users in that namespace. This approach does not involve custom SCCs and is a good solution for non-privileged users and simple data sharing use cases (collaboration). It is well suited for Proofs of Concept (PoCs) to easily demonstrate shared and parallel data access capabilities with IBM Spectrum Scale across OpenShift users and even physical OpenShift worker node boundaries (RWX access mode).

(4) Shared data access by modifying SELinux namespace annotation

If volumes backed by the same data location in IBM Spectrum Scale are accessed in different namespaces - either by the same or different non-privileged users - another option not involving custom SCCs would be to have the cluster admin edit the SELinux MCS label annotation of the involved namespaces in OpenShift and set an identical SELinux MCS label as shown below, for example, by using

# oc edit ns [namespace]
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c26,c0
    openshift.io/sa.scc.supplemental-groups: 1000670000/10000
    openshift.io/sa.scc.uid-range: 1000670000/10000
  [...]

By configuring the same SELinux MCS label in the annotation (openshift.io/sa.scc.mcs) of the selected namespaces any pods submitted by non-privileged users in these namespaces (running under the restricted SCC) will automatically pick up the same pre-allocated SELinux MCS label and safely allow shared access to the data in IBM Spectrum Scale through statically provisioned PVs. Here, using claimRef would be the preferred option for the static provisioning of the PVs to have full control of the binding of these PVs to PVCs in the selected namespaces.
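As an alternative to editing the namespace interactively, a cluster admin could also set the annotation non-interactively, for example along these lines (sketch):

# oc annotate namespace [namespace] openshift.io/sa.scc.mcs=s0:c26,c0 --overwrite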

In the example above, data in IBM Spectrum Scale (/mnt/fs1/shared-data) is accessed by non-privileged pods in namespace #1 through a PVC bound to a static PV. By default, these pods apply pre-allocated values for the SELinux MCS label as well as the uid and supplemental gid ranges when running under the restricted SCC for regular non-privileged users. These pre-allocated values are defined in the annotations of namespace #1 (e.g., openshift.io/sa.scc.mcs: s0:c26,c0). If we plan for other pods in a second namespace (namespace #2) - either owned by the same or a different OpenShift user - to access the same data, then a cluster admin can create a new namespace and manually set the SELinux MCS label in the annotation of this new namespace to the same SELinux MCS label as defined in the first namespace. This ensures that the same SELinux MCS label is applied by default in both namespaces when pods running under the restricted SCC access the same data in IBM Spectrum Scale through statically provisioned PVs. Furthermore, other pre-allocated defaults for the two namespaces like the ID ranges of uids or supplemental groups can also be adjusted as needed.
This solution does not depend on users applying a correct securityContext in their pods, as was required in the previous examples with custom SCCs. By default, all non-privileged users in both namespaces are bound to the same pre-allocated default values for the SELinux MCS label and uid/gid ranges. This enables safe data sharing across namespaces as all pods run under the same SELinux security context by default - provided no other SCCs are applied to specific service accounts or users in these namespaces. This might be a good solution if the same user or application needs access to the same data in IBM Spectrum Scale from different OpenShift clusters or from different sites (for example, two sites with a stretched IBM Spectrum Scale cluster, or, on a global scale, with Active File Management).

Presentation and demo

A brief summary of the contents of this blog post can be found in my presentation from the IBM Spectrum Scale Strategy Days 2022 in Cologne, held in cooperation with the Spectrum Scale User Group. It explains all the necessary steps to provide and share access to the same data in IBM Spectrum Scale for users and applications on OpenShift across namespaces and worker nodes, taking the proper OpenShift security context with regard to SELinux MCS labels and uid/gid settings into account.

The presentation is available here: Secure Data Sharing in OpenShift with IBM Spectrum Scale

Below you find two links to demo videos showing how to safely share access to the same data in IBM Spectrum Scale by using static provisioning (with claimRef) and a custom SCC to set the proper security context in OpenShift with regard to the SELinux MCS label and uid/gid file permissions.

A short version of the demo is available here: Shared Data Access in OpenShift with IBM Spectrum Scale - Demo (13min)

An extended version of the demo is available here: Shared Data Access in OpenShift with IBM Spectrum Scale - Extended Demo (16min)

The extended version of the demo (16 min) shows the same steps as the first demo but includes an example of how the SELinux security context prevents access to the shared data for non-privileged users if the SELinux security context is NOT properly taken into account! It also shows how the default pre-allocated values from the annotations of the namespace are applied to define the default security context of pods submitted by regular, non-privileged users in OpenShift who are running under the "restricted" SCC.

In addition, there is a three-part video sequence of the individual steps with some more details:

  • Part 1: Static provisioning of persistent volumes
    This video shows how to use advanced static volume provisioning of persistent volumes (PVs) with IBM Spectrum Scale Container Native Storage Access (CNSA) on Red Hat OpenShift to share parallel access to pre-existing data in IBM Spectrum Scale across user namespaces.
  • Part 2: SELinux preventing shared access to the same data across namespaces (bad path example)
    This video shows how a different SELinux context in different namespaces generally prevents shared access to the same data in IBM Spectrum Scale on Red Hat OpenShift for non-privileged users (running under the "restricted" SCC) across different user namespaces. This is a "bad path" example and demonstrates what happens when not paying attention to a proper SELinux security context for shared data access across namespaces. The video also shows how the default security context of a pod from a non-privileged user running under the "restricted" SCC (security context constraints) is determined by pre-allocated values from the annotations of the namespace (or project) in OpenShift with regard to the uid, gid, fsGroup and SELinux MCS label of the user process in the container.
  • Part 3: Using a custom SCC to safely share data access across namespaces (good path example)
    This video shows how a custom SCC (security context constraints) with a predefined SELinux MCS label can be used to allow non-privileged users (running under the "restricted" SCC) to run pods in different user namespaces with the same SELinux context and safely share simultaneous access to the same data in IBM Spectrum Scale across different user namespaces and even worker nodes. This is a "good path" example.

Note: The demo was originally based on Red Hat OpenShift 4.10.39 with IBM Spectrum Scale Container Native Storage Access (CNSA) v5.1.5 and IBM Spectrum Scale CSI Driver 2.7.0. The concepts described here generally explain the fundamental behavior of how the SELinux context is typically applied to volumes in OpenShift. It can be observed with IBM Spectrum Scale CNSA releases up to 5.1.6.0. Starting with IBM Spectrum Scale CNSA 5.1.7.0 (released on March 16, 2023) the SELinux default behavior has changed. IBM Spectrum Scale CNSA 5.1.7.0 will now - by default - mount the file system with a container permissive SELinux context by setting the mount context of the file system to system_u:object_r:container_file_t:s0. All files inside the file system will be considered to have the SELinux context defined on the file system mount, and all containers running as the container_t SELinux type will be allowed to access files on the file system if permitted by standard file permissions. See IBM Spectrum Scale container native and SELinux for more details.

With the default settings for the SELinux context in CNSA 5.1.7.0 (and higher) you no longer need to manually manage and ensure the proper SELinux context for shared data access in IBM Spectrum Scale.

The setup used for all these demos is depicted in the figure below. We provide and share access to a directory "ocp-home" on the IBM Spectrum Scale storage cluster for the users dean, petra and gero in different namespaces on OpenShift. The admin creates three PVs through static provisioning and reserves these PVs with claimRef exclusively for the three namespaces dean4, petra4 and gero4. Then the admin defines a custom SCC ("shared-scc") as well as a service account, role and role binding in the user namespaces for dean (dean4) and petra (petra4), while also assigning the custom SCC directly to the user gero. The three non-privileged users log in to OpenShift and start a pod accessing the shared data in IBM Spectrum Scale. We see that all user pods now have access to the same data in IBM Spectrum Scale in parallel across OpenShift namespaces and worker nodes with properly configured user IDs and group IDs as well as identical SELinux MCS labels. Being able to run workloads with parallel access to the same data beyond physical node boundaries and across namespaces in OpenShift is an outstanding feature that a clustered parallel file system like IBM Spectrum Scale can offer.
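For the service-account based variant used for dean and petra, the role granting "use" on the custom SCC and its binding to a service account can be sketched roughly as follows; the role name "use-shared-scc" and the service account name "shared-sa" are hypothetical examples, while the namespace dean4 and the SCC name "shared-scc" are taken from the demo setup:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: use-shared-scc
  namespace: dean4
rules:
- apiGroups: ["security.openshift.io"]
  resources: ["securitycontextconstraints"]
  resourceNames: ["shared-scc"]          # allow "use" of this specific custom SCC
  verbs: ["use"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: use-shared-scc
  namespace: dean4
subjects:
- kind: ServiceAccount
  name: shared-sa                        # hypothetical service account used by the pods
  namespace: dean4
roleRef:
  kind: Role
  name: use-shared-scc
  apiGroup: rbac.authorization.k8s.io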

Demo Setup

References



Comments

Wed March 22, 2023 07:41 AM

Updates v2.125 (2023-03-21)

  • Added notes with regard to the changed default behavior for the SELinux context in IBM Spectrum Scale CNSA 5.1.7.0 (released on March 16, 2023).

The article was originally based on Red Hat OpenShift 4.9.22 with IBM Spectrum Scale Container Native Storage Access (CNSA) v5.1.2 and IBM Spectrum Scale CSI Driver 2.4.0. The concepts described here generally explain the fundamental behavior of how the SELinux context is typically applied to volumes in OpenShift. It can be observed with IBM Spectrum Scale CNSA releases up to 5.1.6.0.

Starting with IBM Spectrum Scale CNSA 5.1.7.0 (released on March 16, 2023) the SELinux default behavior has changed. IBM Spectrum Scale CNSA 5.1.7.0 will now - by default - mount the file system with a container permissive SELinux context by setting the mount context of the file system to system_u:object_r:container_file_t:s0. All files inside the file system will be considered to have the SELinux context defined on the file system mount, and all containers running as the container_t SELinux type will be allowed to access files on the file system if permitted by standard file permissions. See IBM Spectrum Scale container native and SELinux for more details.

With the default settings for the SELinux context in CNSA 5.1.7.0 (and higher) you no longer need to manually manage and ensure the proper SELinux context for shared data access in IBM Spectrum Scale.

Wed February 15, 2023 11:02 AM

Updates v2.12 (2023-02-15)

  • Added some content to the "SELinux relabeling" section, pointing out that SELinux relabeling is handled differently for static provisioning depending on whether the backing directory is a regular directory or a fileset in IBM Spectrum Scale.
  • Provided an example of how to manually relabel a fileset that is used as the backing directory in a static PV using chcon in order to provide data access in OpenShift.

A new pdf version of this blog post is available at:

Advanced Static Volume Provisioning with IBM Spectrum Scale on Red Hat OpenShift (pdf)

Mon January 30, 2023 04:20 AM

Updates v2.11 (2023-01-27)

  • Reorganized the section "Disabling SELinux relabeling for selected pods/containers" and updated the included figure.
  • Added new figures to section "OpenShift SELinux and user (uid) / group (gid) security context" to provide a visual presentation of the described content.
  • Added the option to assign a custom SCC to the default service account in a given namespace with "oc adm policy" in section "(1) Shared data access using custom SCCs and service accounts" (which is similar to adding a custom SCC to users or groups as described in the subsequent section).

A new pdf version of this blog post is available at:

Advanced Static Volume Provisioning with IBM Spectrum Scale on Red Hat OpenShift (pdf)

Wed October 26, 2022 08:41 AM

Updates v2.10 (2022-10-26)

  • Added a new chapter "Presentation and demo" with links to a presentation and two demo videos.
  • Added new sections to the SELinux security context chapter and table of contents to make the navigation within this blog post easier:
    • SELinux relabeling
    • Disabling SELinux to skip SELinux relabeling (using "spc_t")
    • Ensuring an identical SELinux security context for shared data access

A new pdf version of this blog post is available at:

Advanced Static Volume Provisioning with IBM Spectrum Scale on Red Hat OpenShift (pdf)

Fri October 21, 2022 12:33 PM

I presented a summary of the contents in my presentation at the IBM Spectrum Scale Strategy Days 2022 in Cologne yesterday in cooperation with the Spectrum Scale User Group.

It explains all the necessary steps to provide and share access to the same data in IBM Spectrum Scale for users and applications on OpenShift across namespaces and worker nodes, taking the proper OpenShift security context with regard to SELinux MCS labels and uid/gid settings into account.

The presentation is available here:

A short version (13 min) of the demo is available here:

This demo shows how to safely share access to the same data in IBM Spectrum Scale using static provisioning and a custom SCC to set the proper security context in OpenShift with regard to the SELinux MCS label and uid/gid file permissions.

An extended version (16 min) of the demo is available here:

This extended version of the demo (16 min) shows the same steps as the video above but includes an example of how the SELinux security context prevents access to the shared data for non-privileged users if the SELinux security context is NOT properly taken into account! It also shows how the default pre-allocated values from the annotation of the namespace are applied to define the default security context of pods submitted by regular, non-privileged users in OpenShift who are running under the "restricted" SCC.

Fri October 21, 2022 11:49 AM

If you want to see the proposed methods in action, please take a look at the following demo recordings for sharing access to the same data in IBM Spectrum Scale on OpenShift (across worker nodes):

Demos

(1) ADVANCED VOLUME PROVISIONING FOR DATA SHARING

This demo shows how to create and provision the PVs with OpenShift (and CNSA/CSI) to share parallel access to the same data in IBM Spectrum Scale for users in three namespaces. Here we want to provide and share access to the data in the directory "/gpfs/ess3000_1M/ocp-home" on the Spectrum Scale storage cluster. The Spectrum Scale file system is remotely mounted on the OpenShift compute cluster with IBM Spectrum Scale Container Native (CNSA). The subsequent demos show how the data can be accessed from different users in three namespaces on OpenShift.

(2) DATA SHARING DONE WRONG WITH ACCESS DENIED BY SELINUX

This demo shows what happens when the same data in IBM Spectrum Scale is accessed by three non-privileged users in OpenShift from three namespaces without taking the SELinux security context into account. The users are regular users running in their own namespaces under the "restricted" SCC (security context constraints). Here the "restricted" SCC enforces a MustRunAs policy on the SELinux security context, and the default value for the SELinux MCS label is taken from the pre-allocated values given in the annotations of the namespace where the pod is running. So the pod in each namespace runs with a different default SELinux MCS label, and as soon as a pod is started and mounts the volume, all data in the mounted volume is relabeled with the default SELinux MCS label of that namespace. In this case each user shuts out all other users from accessing the data if no further precautions with regard to SELinux relabeling are taken. The PVs have been provisioned as shown in the first demo above.

This demo shows how the pre-allocated default values from the namespace for the SELinux MCS label, uid and fsGroup are applied and how each user locks out any previous user from accessing the shared data as soon as the user starts a pod.

(3) DATA SHARING DONE RIGHT WITH CUSTOM SCC

This demo shows how you can define and apply a custom SCC to ensure that different non-privileged users in three namespaces can safely access the same data in IBM Spectrum Scale across worker nodes in a RWX (read-write-many) access mode with a proper SELinux context and individual uid/gid file permissions. We show how to grant access to a custom SCC ("shared-scc") via a service account (plus role and rolebinding) in the user namespace as well as by assigning the custom SCC directly to the user (could also be a group of users). Here the pods in each namespace run with the same SELinux MCS label as defined in the custom SCC ("shared-scc"). The PVs have been provisioned as shown in the first demo above.

Tue October 04, 2022 05:20 PM

Updates v2.00 (10/2022)

  • Extended SELinux and uid/gid section
    • (1) Shared data access using custom SCCs and service accounts
    • (2) Shared data access with custom SCCs for OpenShift users or groups
    • (3) Shared data access using a shared namespace for multiple users
    • (4) Shared data access by modifying SELinux namespace annotation

A pdf version of this blog post is available at:

Advanced Static Volume Provisioning with IBM Spectrum Scale on Red Hat OpenShift (pdf)

Fri September 02, 2022 08:18 AM

Due to the interest in this article and in applying the proposed methods in recent PoCs, I made some minor additions to help clarify things.

Updates v1.01 (09/2022)

  • If you create a directory in IBM Spectrum Scale that you want to access through a static PV, you need to make sure that the permissions on the backing directory are properly set. For example, a regular user in OpenShift who is running under the restricted SCC (Security Context Constraints) policy will be assigned an arbitrary uid and gid 0 (root) when accessing the file system in the PV. In this case you may want to set "rwx" access permissions for gid 0 (root) on the backing directory in IBM Spectrum Scale in order to grant full read/write access to the static PV, e.g. drwxrwxr-x. root root /mnt/fs1/data/pv01 (see the short example after this list).
  • Note that recursive SELinux relabeling of all files in a mounted volume does not take place when hostPath is used to directly mount a given path from IBM Spectrum Scale into a pod/container (instead of provisioning a volume through CSI). However, the use of hostPath is generally discouraged as it requires the pod/container to run in a privileged security context, which poses a high security risk.
  • Please note that creating customized SCCs and associated service accounts that grant the users the necessary privileges to apply a common securityContext with proper SELinux labels (and additional uid/gid settings as needed) in the pods that share data access to the same backing directory in IBM Spectrum Scale would generally be the preferred option! See Get started with security context constraints on Red Hat OpenShift to learn more about how to use security context constraints (SCCs) on Red Hat OpenShift.
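A minimal sketch for the permission setup mentioned in the first bullet above, assuming /mnt/fs1/data/pv01 is the backing directory on the IBM Spectrum Scale file system:

# chgrp root /mnt/fs1/data/pv01
# chmod 775 /mnt/fs1/data/pv01

This results in drwxrwxr-x with group root, granting "rwx" access to gid 0 (root).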

As requested a pdf version of this blog post is available for download here:
Advanced Static Volume Provisioning with IBM Spectrum Scale on Red Hat OpenShift (pdf)

Fri September 02, 2022 04:49 AM

Hi @renar Grunenberg,
thanks for the comment. A request to add the "--uid" option to the mmlsfs documentation has been made.

Sun May 15, 2022 04:49 PM

Hallo @GERO SCHMIDT,

thanks for the clarification. The fileset mapping is now understood.

As a request to the Scale people: the man page and the short-cmd list should be updated to explain the --uid option. On Scale 5.1.3.1 this option is currently not documented.

Regards, Renar

Sat May 14, 2022 10:44 AM

Hi @renar Grunenberg,
thanks for the feedback. To answer your questions:

(1) IBM Spectrum Scale CSI Driver v2.5.0 introduced a new storage class for creating consistency group volumes (see Storage class for creating consistency group volumes). This storage class allows creating a consistent point-in-time snapshot of all PVs created from that storage class. It is aimed at applications where a consistent snapshot of multiple PVs is needed (not just a snapshot of an individual PV). By placing all PVs from a consistency group storage class (each PV is created as a dependent fileset) into one independent fileset (as the root fileset to host all these dependent filesets) we can use snapshots on the independent fileset to create a consistent point-in-time snapshot of all the nested dependent filesets (each backing a PV from the storage class).

(i) A consistency group is mapped to an independent fileset.
(ii) A volume (PV) in a consistency group is mapped to a dependent fileset within the independent fileset.

(2) The mm-command to retrieve the filesystem ID is indeed a little buried in the article. You can find it in paragraph (2) of the CSI volumeHandle section:

(2) The second parameter that we need is the UID of the IBM Spectrum Scale file system [...] We can obtain the UID [...] by executing the mmlsfs command as follows:
# oc exec worker1a -n ibm-spectrum-scale -- mmlsfs fs1 --uid

So you can obtain the filesystem ID simply by running:

mmlsfs fs1 --uid

with fs1 being the IBM Spectrum Scale filesystem name. The ID is the same whether you run the command on the remote storage cluster using the original file system name (e.g. essfs1) or on the local CNSA cluster using the local file system name for the remote mount (e.g. fs1 as the local name for the mounted remote file system essfs1).


Fri May 06, 2022 05:46 AM

Hallo Gero,

great article. Can you clarify two points here a little bit?

1. dependent fileset jointly embedded in independent Files ??

  • consistency group volumes (backed by dependent filesets jointly embedded in an independent fileset for consistent snapshots of all contained volumes).

2. Is your mentioned filesystem ID the stripe group ID? And where can I find this ID with a mm cmd? (I know the mmfsadm dump fs cmd)

Regards Renar