Containerized workloads offer a consistent runtime environment that makes it quick to move from development to production, along with the required application isolation and low resource overhead. As a result, more and more customers want to run their Hadoop/Spark workloads inside containers.
Containerized Hadoop/Spark workloads can read and write data in an IBM Spectrum Scale file system. If a container instance running Hadoop or Spark jobs crashes and fails over, the integrity of the data that the applications have already written to IBM Spectrum Scale is not affected. Refer to Figure 1 for a high-level view of how IBM Spectrum Scale can be used as platform storage for containerized workloads.
Figure 1: IBM Spectrum Scale as platform storage for running containerized workloads
There are two primary methods to configure IBM Spectrum Scale to work with containerized workloads:
The 1st method: NFS based access
- Configure the IBM Spectrum Scale NFS protocol service and use the container orchestrator to mount the NFS-exported directory into the container instances. The applications running inside the containers can then read and write data through the NFS mount point.
The 2nd method: IBM Storage Enabler for Containers package
- Install the IBM Storage Enabler for Containers package. The package and its user guide can be downloaded from IBM Fix Central.
- Follow the IBM Storage Enabler for Containers guide to configure the container orchestrator. When a container instance is created, it requests a persistent volume (PV) from the orchestrator and receives a volume from the IBM Spectrum Scale file system. The applications running inside the containers can then read and write data directly through the POSIX interface.
IBM Spectrum Scale storage for Containerized standalone Spark workload
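As a rough sketch of what POSIX access means for an application inside a container: once the volume is mounted, the application uses ordinary file I/O and needs no HDFS or NFS client code. The example below assumes the volume appears as a directory in the container; the helper names and the probe file name are hypothetical, and a temporary directory stands in for the real mount point.

```python
import tempfile
from pathlib import Path


def posix_round_trip(mount_dir: Path, name: str = "scale_probe.txt") -> str:
    """Write a small file under mount_dir with plain file I/O and read it back."""
    target = mount_dir / name
    target.write_text("written through the POSIX interface\n", encoding="utf-8")
    return target.read_text(encoding="utf-8")


def to_file_uri(path: Path) -> str:
    """Return a file:// URI for a path on the mounted volume."""
    return path.resolve().as_uri()


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for the mounted volume.
    # Inside a real container this would be the volume's mount path.
    with tempfile.TemporaryDirectory() as tmp:
        mount = Path(tmp)
        print(posix_round_trip(mount), end="")
        print(to_file_uri(mount / "scale_probe.txt"))
```

A Spark job can address data on such a mount with a `file://` URI of the kind `to_file_uri` returns, provided the same path is mounted in every container that runs the job.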
If you are running standalone Spark workloads inside containers, follow the deployment mode in Figure 1 and use the 2nd method (IBM Storage Enabler for Containers package) described above to give the Spark jobs a POSIX interface directly.
IBM Spectrum Scale storage for Containerized Hadoop workload requiring HDFS access
If you are running Hadoop workloads inside containers requiring HDFS based access, then follow the topology shown in Figure 2.
Figure 2: IBM Spectrum Scale as platform storage for containerized Hadoop workloads requiring HDFS access
The following points describe how to configure containerized Hadoop/Spark workloads that require HDFS access with IBM Spectrum Scale as platform storage:
- Follow the Hadoop vendor's guide to set up the Hadoop/Spark instance inside the containers.
- IBM Spectrum Scale runs on physical nodes. These nodes may or may not be part of the Kubernetes cluster.
- On the IBM Spectrum Scale cluster, HDFS Transparency can use different network adapters (IP addresses) for its NameNode(s) and DataNodes.
- All container instances that belong to one Hadoop/Spark cluster must be able to communicate with each other over TCP/IP.
- The Hadoop/Spark container instances must be able to communicate with the HDFS Transparency service over TCP/IP.
- One HDFS Transparency instance can be configured to serve one fixed directory from the IBM Spectrum Scale file system. If more than one Hadoop/Spark instance needs to access different directories in the IBM Spectrum Scale file system, configure a separate HDFS Transparency instance for each one. It is recommended to use different physical nodes for different HDFS Transparency instances, so that each physical node is configured for only one HDFS Transparency instance (refer to Figure 3):
Figure 3: Multiple HDFS Transparency instances over the same Spectrum Scale file system
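The TCP/IP connectivity requirements listed above can be checked from inside a Hadoop/Spark container before any jobs are run. The following is a minimal sketch that only tests whether a TCP connection can be opened; it does not speak the HDFS protocol. The function name is hypothetical, and the host and port are supplied by the caller (8020 is the usual Hadoop NameNode RPC port, but the port configured for your HDFS Transparency instance is an assumption to verify).

```python
import socket
import sys


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: check_reach.py <namenode-host> <port>")
    else:
        host, port = sys.argv[1], int(sys.argv[2])
        state = "reachable" if can_reach(host, port) else "NOT reachable"
        print(f"{host}:{port} is {state}")
```

Running the same check against each DataNode address helps when HDFS Transparency uses different adapters for its NameNode(s) and DataNodes.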
- If only a limited number of physical nodes is available and you need to run several HDFS Transparency instances on them, refer to the "Multiple HDFS Transparency clusters on the same set of physical nodes" section in the IBM Knowledge Center and configure a different TCP port number for each instance to avoid port conflicts.
- If you use a Hadoop distribution as the Hadoop platform inside containers, follow the distribution vendor’s guide to deploy it. Retain the native HDFS that was set up for the Hadoop platform inside the containers: stop the native HDFS, change the HDFS schema to the HDFS Transparency NameNode schema, and then restart the native HDFS service.
Note: To stop open source Apache Hadoop, follow the Hadoop shutdown commands. To change the HDFS schema to use the HDFS Transparency NameNode schema, change the fs.defaultFS value from hdfs://
- If using Open Source Apache Hadoop, then follow the Apache Launching Applications Using Docker Containers website.
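The fs.defaultFS change mentioned in the note above is an edit to the Hadoop core-site.xml file. The following is a minimal sketch of that edit using the Python standard library. The function name, the sample configuration, and both host names and ports are hypothetical placeholders, not values from this document. Note that ElementTree drops XML comments and processing instructions when it rewrites the file, so keep a backup of the original.

```python
import xml.etree.ElementTree as ET


def set_default_fs(core_site_xml: str, new_uri: str) -> str:
    """Return core-site.xml text with fs.defaultFS set to new_uri."""
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == "fs.defaultFS":
            value = prop.find("value")
            if value is None:
                value = ET.SubElement(prop, "value")
            value.text = new_uri
            break
    else:
        # No fs.defaultFS property was found, so add one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = "fs.defaultFS"
        ET.SubElement(prop, "value").text = new_uri
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample configuration; the host name is a placeholder.
SAMPLE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://native-nn.example.com:8020</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    # Placeholder for the HDFS Transparency NameNode address.
    print(set_default_fs(SAMPLE, "hdfs://transparency-nn.example.com:8020"))
```

Apply the change while the native HDFS service is stopped, then restart the service as described above.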