Hybrid cloud data sharing and collaboration with IBM Spectrum Scale Active File Management

By Nils Haustein posted Tue December 08, 2020 04:19 AM

  

By Nils Haustein and Kedar Karmarkar

IBM Spectrum Scale version 5.1 introduced an enhancement to the Active File Management (AFM) function that allows it to connect to cloud object storage. Customers use data objects stored in cloud object storage buckets to run workloads such as mobile applications, backup and restore, enterprise applications, and big data analytics. AFM to cloud object storage can be used to accelerate these workloads for faster computation while the file and object data is synchronized between the AFM fileset and the cloud object storage.

With this new AFM to cloud object storage function, IBM Spectrum Scale is well positioned for hybrid cloud data solutions. In this blog article we give you a brief overview of this new function and highlight some practical use cases.

Introduction to AFM to cloud object storage

The AFM to cloud object storage enables sharing of files with cloud object storage and caching of objects from cloud object storage as files in the AFM to cloud object storage fileset. This combination of file and object storage combines the advantages of both: Cloud object services such as Amazon S3 and IBM Cloud® Object Storage offer industry-leading scalability, global data availability and security. IBM Spectrum Scale AFM to cloud object storage filesets provide leading performance and scalability, especially for AI and big data workloads.

 

Some AFM basics

The AFM to cloud object storage allows associating an IBM Spectrum Scale fileset with a cloud object storage bucket. The connection between the AFM to cloud object storage fileset and the object storage bucket is established over a secure TLS connection, with valid object storage bucket credentials and optional certificate verification. Once an empty AFM to cloud object storage fileset is initially connected to the cloud object storage bucket, all objects residing in the bucket are presented as files in the AFM to cloud object storage fileset.

 The state of a file in an AFM to cloud object storage fileset can be cached or un-cached. An un-cached file is an empty shell representing the object. It is constructed from object metadata. After the initial connection to the cloud object storage bucket, all files in the AFM to cloud object storage fileset are in un-cached state. When an un-cached file is accessed by the user or prefetched using the AFM prefetch command, then the object data is synchronously downloaded and stored in the file. The file state changes to cached. A cached file is a copy of the associated object in the cloud object storage bucket. The object naming structure in the cloud object storage bucket is reflected in the associated AFM to cloud object storage fileset.

 When a file is modified or changed in the AFM to cloud object storage fileset then the file is marked as “dirty”. Depending on the configured AFM connection and operation mode, dirty files are asynchronously uploaded to the cloud object storage bucket. The path and file names of files stored in an AFM to cloud object storage fileset are reflected in the object name of the associated object in the cloud object storage bucket.

 An AFM to cloud object storage fileset can be configured to synchronize access control lists (ACL) and extended attributes of files with the cloud object storage bucket. These attributes are stored in the associated object metadata. Furthermore, an AFM to cloud object storage fileset can also be configured with quota in terms of the number of files and directories or the usable capacity.

 The IBM Spectrum Scale AFM function is designed for unreliable networks. This means if the network connection between the AFM to object storage fileset and the cloud object storage bucket is not operational, the files in the AFM fileset are still visible and files in cached state are accessible. Files in un-cached state cannot be accessed until the network connection is re-established. Once AFM can connect to the cloud object storage bucket, files in dirty state are uploaded automatically and un-cached files can be accessed again.
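The file states described above (un-cached, cached, dirty) and their behavior during a network outage can be sketched as a small state model. This is a toy illustration in Python, not an IBM Spectrum Scale API: the class `AfmFile` and its methods `read`, `write` and `sync` are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass
class AfmFile:
    """Toy model of one file in an AFM to cloud object storage fileset."""
    name: str
    cached: bool = False  # False: empty shell built from object metadata
    dirty: bool = False   # True: local changes not yet uploaded

    def read(self, connected: bool) -> str:
        # An un-cached file needs a reachable bucket for the download.
        if not self.cached:
            if not connected:
                raise OSError(f"{self.name}: un-cached and bucket unreachable")
            self.cached = True  # object data is downloaded on first access
        return f"data of {self.name}"

    def write(self) -> None:
        # Local modification; the upload happens later, asynchronously.
        # Simplification: a written file is treated as fully cached.
        self.cached = True
        self.dirty = True

    def sync(self, connected: bool) -> bool:
        # Upload a dirty file once the connection is available again.
        if self.dirty and connected:
            self.dirty = False
            return True
        return False
```

In this model a cached file stays readable while the connection is down, an un-cached file raises an error until the connection returns, and a dirty file is uploaded by the first `sync` call that finds the bucket reachable.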

 

AFM connection and operation modes

The AFM connection mode decides if objects updated in the cloud object storage bucket are presented in the AFM to cloud object storage fileset and if files that are modified in the AFM to cloud object storage fileset are uploaded to the cloud object storage bucket. An AFM to cloud object storage fileset can be configured in one of the following connection modes:

  • Single writer (SW): In this mode, only one AFM to cloud object storage fileset does all the writing and this fileset does not check the associated cloud object storage bucket for object updates. After establishing a SW-mode AFM to cloud object storage fileset, pre-existing objects in the cloud object storage bucket are presented as files in the AFM fileset (in un-cached state). The associated data can be downloaded either on access or with the AFM prefetch command.  Changed and modified files are uploaded to the cloud object storage bucket. Objects written to the cloud object storage bucket by other applications or AFM filesets are not cached in the SW-mode fileset.
  • Independent writer (IW): This mode allows multiple AFM to cloud object storage filesets to point to the same cloud object storage bucket. These filesets can be on the same IBM Spectrum Scale cluster or on different clusters. All objects in the cloud object storage bucket are presented in all AFM filesets. Each AFM to cloud object storage fileset can download objects from the cloud object storage bucket. Files that are modified or changed in an IW-mode AFM to cloud object storage fileset are asynchronously uploaded to the cloud object storage bucket. There is no synchronous locking between multiple AFM to cloud object storage filesets while updating objects in a cloud object storage bucket. If the same file is updated in multiple AFM to cloud object storage filesets independently, the conflicting updates can leave the data on the cloud object storage site in an undetermined state.
  • Read-only (RO): In this mode, data in the AFM to cloud object storage fileset is read-only. You cannot create or modify files in a RO-mode AFM to cloud object storage fileset. Objects in the cloud object storage bucket are presented in the AFM fileset and can be downloaded on access or using the AFM prefetch command.
  • Local updates (LU): This mode is like the RO-mode, although you can create and modify objects in an AFM to cloud object storage fileset. Updates in the AFM to cloud object storage fileset are considered local to the AFM to cloud object storage fileset and get decoupled from the corresponding object on a cloud object storage. Local updates are never pushed back to a cloud object storage.
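The four connection modes listed above can be summarized in a small lookup table. The following Python sketch is a simplified reading of the descriptions in this article, not an official reference. The names `ModeTraits`, `CONNECTION_MODES` and `suitable_modes` are hypothetical.

```python
from typing import List, NamedTuple


class ModeTraits(NamedTuple):
    local_writes: bool         # files can be created or modified in the fileset
    uploads: bool              # local changes are pushed to the bucket
    sees_bucket_updates: bool  # later object updates in the bucket are presented


# Simplified reading of the mode descriptions above. For LU, bucket updates
# only apply to files that have not been modified locally.
CONNECTION_MODES = {
    "sw": ModeTraits(local_writes=True, uploads=True, sees_bucket_updates=False),
    "iw": ModeTraits(local_writes=True, uploads=True, sees_bucket_updates=True),
    "ro": ModeTraits(local_writes=False, uploads=False, sees_bucket_updates=True),
    "lu": ModeTraits(local_writes=True, uploads=False, sees_bucket_updates=True),
}


def suitable_modes(local_writes: bool, uploads: bool) -> List[str]:
    """Return the connection modes that match the requested write behavior."""
    return [
        mode
        for mode, traits in CONNECTION_MODES.items()
        if traits.local_writes == local_writes and traits.uploads == uploads
    ]


print(suitable_modes(local_writes=True, uploads=True))
```

A fileset that must write and upload can use SW or IW. The choice between the two then depends on whether other writers update the same bucket.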


The operation mode determines whether the metadata of objects stored in the cloud object storage bucket are automatically synchronized with the AFM to cloud object storage fileset. One of the following operation modes can be configured per AFM fileset:

  • ObjectFS mode: In this mode, an AFM to cloud object storage fileset is synchronized with a cloud object storage bucket. AFM to cloud object storage filesets configured in RO, LU, or IW modes synchronize metadata to and from a cloud object storage. For example, if the directory in the AFM to cloud object storage fileset is read then the object metadata is synchronized from the cloud object storage bucket. Object data is downloaded when the associated file is accessed or by using the AFM prefetch command.
  • ObjectOnly mode: In this mode, an AFM to cloud object storage fileset (AFM RO, LU, and IW mode fileset) is not refreshed from the cloud object storage on demand or at frequent intervals. You need to manually download data or metadata from the cloud object storage to the AFM to cloud object storage fileset. This is done with the AFM prefetch command. Data transfer from the AFM to cloud object storage fileset to the cloud object storage works automatically without manual intervention.

 

Are you stressed by the complexity? Don’t worry, an AFM to cloud object storage fileset can have only one connection and one operation mode and depending on the use case the appropriate mode is selected.

Use cases

In this section we describe some use cases enabling global sharing and collaboration in a hybrid cloud environment. The commands given are examples and must be adjusted for your real use case.

Global sharing

In this use case we want to share files from one IBM Spectrum Scale cluster with other IBM Spectrum Scale clusters in different locations and perhaps geographies. The files are produced in one IBM Spectrum Scale cluster (cluster 1) and made available to all other clusters (cluster 2 and cluster 3). Because all workloads are performed on a file level, it is important to keep the ACL and extended attributes in sync in all clusters. We use an IBM Cloud Object Storage bucket for data sharing. Figure 1 provides an overview of the solution:

Figure 1: Architecture for hybrid cloud global sharing solution

 The cloud object storage bucket (sharedBucket) has already been created and the following access information is available:

This information must be configured on all clusters using the following command. The access and secret keys do not have to be identical for all clusters, but the associated users must have the appropriate privileges on the bucket:

# mmafmcoskeys sharedBucket:cloud.object.storage  set key1234567890 key0987654321

Cluster 1 is the provider of all files which are ingested from different edges and users. In cluster 1 we create an AFM to cloud object storage fileset (fileset1) in Single-Writer mode (SW). SW-mode is appropriate because files are solely produced and uploaded from fileset1. The following command can be used to create the AFM to cloud object storage fileset for the provider. The file system name where the AFM to cloud object storage fileset1 resides is fs1 in this example:
# mmafmcosconfig fs1 fileset1 --endpoint https://cloud.object.storage --xattr --acls --bucket sharedBucket --mode sw --object-fs

Cluster 2 and cluster 3 are consumers. They see and download all files produced and uploaded by the provider cluster 1. In each of these clusters we create an AFM to cloud object storage fileset in Read-Only mode (RO). RO-mode is appropriate because these filesets only read files produced by cluster 1. The fileset in cluster 2 is named fileset2 and the fileset in cluster 3 is named fileset3. The following commands can be used to create these AFM to cloud object storage filesets in RO-mode:
# mmafmcosconfig fs2 fileset2 --endpoint https://cloud.object.storage --xattr --acls --bucket sharedBucket --mode ro --object-fs
# mmafmcosconfig fs3 fileset3 --endpoint https://cloud.object.storage --xattr --acls --bucket sharedBucket --mode ro --object-fs
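When several clusters are configured against the same bucket, it can help to generate the command lines from one place so that the endpoint and bucket name stay consistent. The following Python sketch only builds and prints the command strings shown above and does not execute anything. The helper `afmcos_config_cmd` is a hypothetical name, and only options that appear in this article are used.

```python
import shlex

ENDPOINT = "https://cloud.object.storage"  # placeholder endpoint from this article
BUCKET = "sharedBucket"


def afmcos_config_cmd(fs, fileset, mode, object_fs=True, sync_attrs=True):
    """Return the mmafmcosconfig argument list for one fileset (not executed)."""
    cmd = ["mmafmcosconfig", fs, fileset, "--endpoint", ENDPOINT]
    if sync_attrs:
        cmd += ["--xattr", "--acls"]  # synchronize extended attributes and ACLs
    cmd += ["--bucket", BUCKET, "--mode", mode]
    if object_fs:
        cmd.append("--object-fs")  # omitted for objectOnly operation mode
    return cmd


# Provider in SW mode, consumers in RO mode, as in the global sharing use case.
for fs, fileset, mode in [("fs1", "fileset1", "sw"),
                          ("fs2", "fileset2", "ro"),
                          ("fs3", "fileset3", "ro")]:
    print(shlex.join(afmcos_config_cmd(fs, fileset, mode)))
```

Each printed line still has to be run by an administrator on the matching cluster, after the bucket keys have been configured there.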

Once the AFM to cloud object storage filesets in all clusters are operational, the workloads can start. Files are ingested into the AFM to cloud object storage fileset1 in cluster 1. File ingest can be done through NFS, SMB and POSIX. Because the operation mode is set to object-fs in all filesets, new files created in fileset1 are asynchronously uploaded to the cloud object storage bucket (sharedBucket) and asynchronously presented in the AFM to cloud object storage fileset2 in cluster 2 and fileset3 in cluster 3 respectively. Files in fileset2 and fileset3 are read-only and initially in un-cached state. They can be downloaded on access or by using the AFM prefetch command shown below. This command assumes that the names of the objects are recorded line-by-line in the file /tmp/objlist and that these objects are downloaded to fileset2 in cluster 2:
# mmafmcosctl fs2 fileset2 /gpfs/fs1/fileset2 download --object-list /tmp/objlist --data

The file names, extended attributes and ACL given to files in fileset1 are synchronized with the corresponding files in fileset2 and fileset3 via the cloud object storage bucket. For example, a file named file1 in the root directory of fileset1 will show up as file1 in the root directory of fileset2 and fileset3. Likewise, a file created in a subdirectory of fileset1 (e.g. dir1/file2) shows up as dir1/file2 in the other filesets.
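The name mapping described above can be illustrated with a short Python sketch: the object name is the file path relative to the fileset root, and the same object name is resolved against the root of each consumer fileset. The function names and the fileset paths used here are assumptions for illustration. The actual paths depend on where the file system and fileset are mounted in each cluster.

```python
from pathlib import PurePosixPath


def object_name(fileset_root: str, file_path: str) -> str:
    """Object name in the bucket: the path relative to the fileset root."""
    return PurePosixPath(file_path).relative_to(fileset_root).as_posix()


def local_path(fileset_root: str, name: str) -> str:
    """Path under which an object shows up in a given fileset."""
    return (PurePosixPath(fileset_root) / name).as_posix()


# Hypothetical fileset roots for provider and consumer.
name = object_name("/gpfs/fs1/fileset1", "/gpfs/fs1/fileset1/dir1/file2")
print(name)
print(local_path("/gpfs/fs1/fileset2", name))
```

The directory structure below the fileset root is therefore identical in all filesets, even when the mount points differ between clusters.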

When files in the provider fileset1 are deleted, then these deletions are propagated to the consumer filesets asynchronously, via the cloud object storage bucket.

Summary of this use case:
With this use case it is possible to share all files from a provider fileset1 with other consumer filesets (fileset2 and fileset3) in different clusters residing in different locations. Because AFM is made for unreliable network connections, consumer filesets can reside in clusters located in different geographies than the provider fileset. All files created and modified in the provider fileset1 are automatically uploaded to the cloud object storage bucket, which requires a solid network connection. The metadata of all files residing as objects in the cloud object storage is automatically presented in the consumer fileset2 and fileset3, while the file data is downloaded on demand or using the AFM prefetch command. This allows you to control the download costs that may be associated with cloud object storage. The next use case, Selective sharing, explains methods for more control of download volumes and cost.

 

Selective sharing

This use case aims to limit the volume of data being downloaded from the cloud object storage, providing better control of costs associated with downloading file data and metadata. One AFM to cloud object storage fileset (fileset1) is configured as provider and all files created, modified or changed in this fileset are uploaded to cloud object storage. Two other AFM to cloud object storage filesets (fileset2 and fileset3) are configured as consumers in object-only mode and download files provided by fileset1 from cloud object storage when required. The download of file data and metadata is done by the IBM Spectrum Scale storage administrator using the AFM prefetch command. This use case is similar to the Global sharing use case, with the difference that file data and metadata are not automatically presented in the consumer fileset2 and fileset3. Figure 2 shows an overview of this solution.

Figure 2: Architecture for hybrid cloud selective file sharing solution

The cloud object storage bucket (sharedBucket) is created and users are configured. The information about user credentials and endpoints is available.

Fileset1 in cluster 1 is configured in IW-mode with operation mode objectFS. In this mode the file metadata is asynchronously presented in fileset1 and file data is downloaded on access or with the AFM prefetch command. Furthermore, newly created and modified files are asynchronously uploaded to the cloud object storage bucket sharedBucket. To configure fileset1 the following command can be used:
# mmafmcosconfig fs1 fileset1 --endpoint https://cloud.object.storage --xattr --acls --bucket sharedBucket --mode iw --object-fs

Fileset2 in cluster 2 and fileset3 in cluster 3 are configured in RO-mode and in operation mode objectOnly. With this configuration, files provided by fileset1 into the cloud object storage bucket are not automatically presented in fileset2 and fileset3. The following commands show how to create fileset2 and fileset3:
# mmafmcosconfig fs2 fileset2 --endpoint https://cloud.object.storage --xattr --acls --bucket sharedBucket --mode ro

# mmafmcosconfig fs3 fileset3 --endpoint https://cloud.object.storage --xattr --acls --bucket sharedBucket --mode ro

Note, omitting the option --object-fs automatically turns the AFM to cloud object storage filesets into objectOnly mode.

After the AFM to cloud object storage filesets in all clusters are configured, files created and modified in fileset1 are asynchronously uploaded to the cloud object storage bucket (sharedBucket). Because the operation mode of fileset2 and fileset3 is set to objectOnly the metadata of uploaded files is not yet presented in these filesets. The download of data and metadata to fileset2 and fileset3 can be done on demand using the AFM pre-fetch command:
# mmafmcosctl fs2 fileset2 /gpfs/fs1/fileset2 download --object-list /tmp/objlist --data

This command assumes that the names of the objects are recorded line-by-line in the file /tmp/objlist and that these objects are downloaded to fileset2 in cluster 2. The object names are relative to the bucket name. Determining the object names in the cloud object storage bucket (sharedBucket) requires some additional tooling. The file names, extended attributes and ACL given to files in fileset1 are synchronized with the corresponding files in fileset2 and fileset3 via the cloud object storage bucket.
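As a minimal sketch of such additional tooling, the following Python snippet writes an object list file with one object name per line, optionally restricted to a name prefix to keep the download volume small. It assumes the object names have already been obtained from the bucket with an S3 tool of your choice. The function `write_object_list` and the sample names are hypothetical, and the demo writes to a temporary directory rather than /tmp/objlist.

```python
import os
import tempfile
from pathlib import Path


def write_object_list(names, list_path, prefix=""):
    """Write the selected object names, one per line, and return them."""
    selected = sorted(n for n in names if n.startswith(prefix))
    Path(list_path).write_text("".join(f"{n}\n" for n in selected))
    return selected


# Usage: select only the objects below dir1/ for download.
demo_path = os.path.join(tempfile.mkdtemp(), "objlist")
selected = write_object_list(
    ["file1", "dir1/file2", "dir1/file3"], demo_path, prefix="dir1/"
)
print(selected)
```

The resulting file can then be passed to the AFM prefetch command with the --object-list option.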

It is also possible to download just the metadata from the cloud object storage bucket to fileset2. With this the user can see the file in fileset2 without yet having access to the data. Especially for large files, this gives the user an overview of the available files at minimal download volume. The file data must be downloaded separately. To prefetch just the file metadata the following command can be used. The objects whose metadata is to be downloaded can be specified either in a list or with the option --all, which downloads metadata for all objects in the bucket:
# mmafmcosctl fs2 fileset2 /gpfs/fs1/fileset2 download {--object-list /tmp/objlist | --all} --metadata

Fileset2 and fileset3 are configured in RO-mode, which means that no files can be added or modified in these filesets. It is possible to configure these filesets in IW-mode (similar to the Global collaboration use case), allowing files to be created and modified in fileset2 and fileset3. In IW-mode, files created and modified in fileset2 and fileset3 are automatically uploaded to the cloud object storage bucket.

Summary of this use case:
With this use case it is possible to selectively share files from provider fileset1 with other consumer filesets (fileset2 and fileset3) in different clusters located in different locations. It allows better control of the download cost from the cloud object storage bucket because the download of file data and metadata can be controlled for the consumer fileset2 and fileset3. The identification and download of files required in fileset2 and fileset3 must be performed by an administrator using the AFM pre-fetch command.

 

Global collaboration

The global collaboration use case is similar to the global sharing use case with the exception that files in the AFM to cloud object storage fileset2 (cluster 2) and fileset3 (cluster 3) can be read, written and deleted. All AFM to cloud object storage filesets are enabled for reading and writing. This means that files created in fileset2 of cluster 2 are asynchronously uploaded to the cloud object storage bucket (sharedBucket) and made available to fileset1 and fileset3. Likewise, files modified in fileset3 are asynchronously uploaded to the cloud object storage bucket and made available to fileset1 and fileset2. Accordingly, all filesets are configured in IW mode. Figure 3 gives an overview of the solution:

Figure 3: Architecture for hybrid cloud global collaboration solution

The configuration of the AFM to cloud object storage filesets in all three clusters is like the global sharing solution, with the exception that all filesets are configured in Independent-Writer mode (IW). Once all AFM to cloud object storage filesets are operational, files can be created, modified and deleted in all filesets. Files that are created and modified are asynchronously uploaded to the cloud object storage bucket (sharedBucket) and presented in the other AFM to cloud object storage filesets. When a file is deleted in one AFM to cloud object storage fileset, this deletion is propagated to the other filesets. The names of the files are the same in all AFM to cloud object storage filesets. The path name may differ depending on the mount point of the file system and fileset within a cluster. Additionally, the ACL and extended attributes are synchronized among all filesets via the cloud object storage bucket (sharedBucket).

Attention: Care must be taken when the same file (name) is modified in different clusters at the same time. Due to the asynchronous nature of the uploads the final version of the file in the cloud object storage is not deterministic. Generally, the last writer wins. The best practice for this kind of solution is to control which files are modified in which AFM to cloud object storage fileset at which time. This requires cross-cluster orchestration using additional resource and workload manager tools like IBM Spectrum LSF.
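The following Python sketch illustrates why the final version is hard to predict. It is a toy model under the assumption stated above that the last upload to arrive wins, and it does not reflect AFM internals. The function `final_object` and the sample values are hypothetical.

```python
def final_object(uploads):
    """Return (fileset, content) of the upload that arrives last, or None.

    uploads: list of (arrival_time, fileset, content) tuples. The arrival
    time is when the asynchronous upload reaches the bucket, which is not
    necessarily the order in which the files were modified locally.
    """
    winner = None
    for _, fileset, content in sorted(uploads, key=lambda u: u[0]):
        winner = (fileset, content)  # each later arrival overwrites the object
    return winner


# The same two modifications, uploaded in a different order.
print(final_object([(1.0, "fileset2", "v2"), (2.0, "fileset3", "v3")]))
print(final_object([(2.0, "fileset2", "v2"), (1.0, "fileset3", "v3")]))
```

Both runs contain the same two modifications, yet the surviving version differs. This is the reason for coordinating which fileset modifies which file at which time.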

Summary of this use case:
With this use case it is possible to globally collaborate in the same name space. Different clusters in different locations can participate in this global collaboration. Because all AFM to cloud object storage filesets in all clusters upload files that have been created and modified, there is limited control over upload cost.

 

Object caching

In this use case the data is provided by an application directly into a cloud object storage bucket and cached in multiple AFM to cloud object storage filesets. These filesets reside in different IBM Spectrum Scale clusters. Files cached in the AFM to cloud object storage fileset1 pertaining to cluster 1 are readable and writable, while files in the AFM to cloud object storage fileset2 pertaining to cluster 2 are read-only. Figure 4 shows the high-level architecture of this solution:

Figure 4: Architecture for hybrid cloud object caching solution

The cloud object storage bucket (sharedBucket) is created and users are configured. The information about user credentials and endpoints is available. The cloud object storage bucket and the application are provisioned in the cloud. The application creates, stores and manages objects in the cloud object storage bucket (sharedBucket).

Fileset1 in cluster 1 is configured in IW-mode:
# mmafmcosconfig fs1 fileset1 --endpoint https://cloud.object.storage --bucket sharedBucket --mode iw --object-fs

Fileset2 in cluster 2 is configured in RO-mode:
# mmafmcosconfig fs2 fileset2 --endpoint https://cloud.object.storage --bucket sharedBucket --mode ro --object-fs

After the creation of the filesets, all pre-existing objects in the cloud object storage bucket (sharedBucket) are presented in both filesets. The files are in un-cached state and can be fetched on access or prefetched using the AFM prefetch command. Files stored in AFM to cloud object storage fileset2 of cluster 2 are read-only. Files stored in AFM to cloud object storage fileset1 of cluster 1 can be modified and new files can be created, causing these files to be uploaded to the cloud object storage bucket (sharedBucket).

Attention: Care must be taken because files modified in the IW-mode AFM to cloud object storage fileset overwrite objects created and managed by the application in the cloud object storage bucket.

The object names created by the object storage application are preserved in the AFM to cloud object storage filesets. An object named object1 in the sharedBucket will appear as file object1 in the root path of fileset1. Likewise, when a new file is created in fileset1 with the name newfile1, this file shows up in the sharedBucket as an object with the name newfile1. A file created in a subdirectory of fileset1 (e.g. dir1/newfile2) appears in the sharedBucket as an object named dir1/newfile2.

Summary of this use case:
This use case bridges cloud and on-premises storage with on-demand data access. Data created and managed by applications running in the cloud can be seamlessly made available in one or more IBM Spectrum Scale clusters. These clusters can be deployed on-premises or off-premises. The AFM to cloud object storage filesets created in these IBM Spectrum Scale clusters can be configured in read-only mode to prevent changes to objects. However, it is also possible to allow AFM to cloud object storage filesets to modify objects created by the cloud native application. With this solution true hybrid cloud data collaboration can be achieved.

References

IBM Spectrum Scale AFM to cloud object storage function:

https://www.ibm.com/support/knowledgecenter/STXKQY_5.1.0/com.ibm.spectrum.scale.v5r10.doc/b1lins_quickreference_hpt.htm

Comments

Fri December 11, 2020 12:29 PM

"You need to manually download data or metadata from the cloud object storage to the AFM to cloud object storage fileset."

Can you explain this sentence a bit better?