The IBM Spectrum Scale HDFS Transparency implementation integrates both the NameNode and DataNode services and responds to requests as if it were HDFS running on an IBM Spectrum Scale file system.
Starting from HDFS Transparency version 3.1.1 and IBM Spectrum Scale version 18.104.22.168, HDFS Transparency is integrated with both the IBM Spectrum Scale installation toolkit and Cluster Export Services (CES).
Integration advantages
- Ability to quickly configure an IBM Spectrum Scale HDFS Transparency cluster to connect to an existing centralized storage in shared mode (where the HDFS Transparency NameNodes and DataNodes are part of the same GPFS cluster as the centralized storage) using the IBM installation toolkit.
- HDFS Transparency 3.1.1 uses IBM Spectrum Scale Cluster Export Services (CES) to manage NameNode state and configuration. All protocols use the mmces commands.
- Separation of compute and storage model for easy deployment and maintenance.
- An HDFS Transparency cluster can also be used for Hadoop storage tiering. See Hadoop Storage Tiering mode without native HDFS federation for more information.
Figure 1. CES HDFS single HDFS configuration layout in single GPFS cluster in shared mode
Limitations
- Red Hat Enterprise Linux is supported.
- CES HDFS is not supported for Cloudera® distributions.
- CES HDFS supports Open Source Apache Hadoop.
- mmhadoopctl command is used for HDFS Transparency 3.1.0 and below.
- mmhdfs command is used for HDFS Transparency 3.1.1 for CES HDFS management.
- Starting in IBM Spectrum Scale 22.214.171.124, the installation toolkit supports ESS deployment. Note: IBM Spectrum Scale version 126.96.36.199 supports only SAN-based shared storage, so ESS deployment through the installation toolkit is not supported in that version. See IBM Knowledge Center for the CES HDFS support matrix.
See the Support Matrix and Limitations and Recommendations sections of the IBM Spectrum Scale Big Data and Analytics support documentation within the IBM Knowledge Center.
Sample configuration used for this example
This sample adds the CES HDFS nodes into the centralized file system.
Centralized cluster node (Admin node):
c902f05x10.gpfs.net
Note: If using ESS, use the EMS node: c902ems.gpfs.net
NameNodes:
c902f08x01.gpfs.net, c902f08x03.gpfs.net
DataNodes:
c902f08x04.gpfs.net, c902f08x13.gpfs.net, c902f08x14.gpfs.net, c902f08x15.gpfs.net
Installer node IP address:
172.16.1.125
CES Public IPs:
Note: For more information on CES Public IPs assignment and setup, see the CES IP aliasing to network adapters on protocol nodes section in the IBM Spectrum Scale documentation in the IBM Knowledge Center.
Installation steps
This example gives sample instructions on how to deploy a CES HDFS cluster onto an existing centralized storage that is up and running.
- Ensure the HDFS Transparency prerequisite setup steps are done. For example, on all HDFS Transparency nodes:
- Install base packages required
yum -y install kernel-devel cpp gcc gcc-c++ binutils make net-tools java-1.8.0-openjdk* bind-utils
- Set Java path
- Edit /etc/security/limits.conf limits
* soft nofile 65536
* hard nofile 65536
* soft nproc 65536
* hard nproc 65536
- Ensure the CES Shared Root file system is created and available
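The limit and Java prerequisites above can be spot-checked on each node before running the installer. The following is a minimal sketch, not part of the product tooling: the helper names `limit_ok` and `report` are hypothetical, and the threshold 65536 is taken from the limits.conf values in this example. Run it from a fresh login shell so that the new limits are in effect.

```shell
#!/usr/bin/env bash
# Sketch: check that the current shell's soft limits meet the values
# configured in /etc/security/limits.conf for this example.
# limit_ok and report are hypothetical helper names.

REQUIRED=65536

# limit_ok VALUE REQUIRED -> succeeds if VALUE is "unlimited" or >= REQUIRED
limit_ok() {
    [ "$1" = "unlimited" ] && return 0
    [ "$1" -ge "$2" ] 2>/dev/null
}

# report NAME VALUE -> prints one status line for the given limit
report() {
    if limit_ok "$2" "$REQUIRED"; then
        echo "OK   $1=$2"
    else
        echo "LOW  $1=$2 (need >= $REQUIRED)"
    fi
}

# Soft limits of the current shell; "unknown" if the shell lacks the option
report nofile "$(ulimit -n 2>/dev/null || echo unknown)"
report nproc  "$(ulimit -u 2>/dev/null || echo unknown)"

# Java must be resolvable on PATH for HDFS Transparency to start
if command -v java >/dev/null 2>&1; then
    echo "OK   java found at $(command -v java)"
else
    echo "MISSING java is not on PATH"
fi
```

A line starting with LOW or MISSING indicates that the corresponding prerequisite step still needs attention on that node.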
- On the installer node, download and extract the IBM Spectrum Scale installer bin and accept the license.
- On the installer node, cd to the directory where the IBM Spectrum Scale installer resides. By default, it is extracted to the following directory:
- On the installer node, run the following commands to create the CES HDFS cluster to use the centralized storage:
# Configure the installer node using its IP address
./spectrumscale setup -s 172.16.1.125
# Note: If using ESS as the storage, set the -st ess flag
# ./spectrumscale setup -s 172.16.1.125 -st ess
# Discover and populate the existing cluster configuration
./spectrumscale config populate --node c902f05x10.gpfs.net
# Note: If using ESS, then populate using the EMS node
# ./spectrumscale config populate --node c902ems.gpfs.net
Installer will keep backup of existing clusterdefinition.txt file in /usr/lpp/mmfs/188.8.131.52/installer/configuration path and populate a new one. Do you want to continue [Y/n]: y
Do you want to provide IP addresses for NTP [Y/n]: n
Note: Answer n because NTP setup is not supported when adding nodes to an existing cluster.
# Add HDFS Transparency NameNodes as protocol nodes (-p)
./spectrumscale node add c902f08x01.gpfs.net -p
./spectrumscale node add c902f08x03.gpfs.net -p
# Add HDFS Transparency DataNodes
./spectrumscale node add c902f08x04.gpfs.net
./spectrumscale node add c902f08x13.gpfs.net
./spectrumscale node add c902f08x14.gpfs.net
./spectrumscale node add c902f08x15.gpfs.net
./spectrumscale install --precheck
# Verify install
# Enable CES HDFS protocol
./spectrumscale enable hdfs
# Configure the CES public IPs. At least two IPs must be specified.
Note: CES IPs must be unused IPs, belong to a subnet made available by an existing adapter and routes on each HDFS Transparency NameNode, and forward/reverse DNS lookup must be in place for each CES IP.
./spectrumscale config protocols -e 172.16.2.80,172.16.2.84
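The requirements in the note above (at least two CES IPs, a comma-separated list without spaces, and working DNS lookups) can be checked before the value is passed to the toolkit. This is a minimal sketch under stated assumptions: `count_csv` and `has_space` are hypothetical helpers, and `getent hosts` is used only as one convenient way to test reverse name resolution.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a CES IP list before passing it to
# "./spectrumscale config protocols -e". Helper names are hypothetical.

CES_IPS="172.16.2.80,172.16.2.84"   # the example values used above

# count_csv LIST -> prints the number of comma-separated items
count_csv() {
    printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

# has_space LIST -> succeeds if LIST contains any whitespace
has_space() {
    case "$1" in
        *[[:space:]]*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_space "$CES_IPS"; then
    echo "ERROR: remove spaces from the IP list"
elif [ "$(count_csv "$CES_IPS")" -lt 2 ]; then
    echo "ERROR: at least two CES IPs are required"
else
    echo "OK: $(count_csv "$CES_IPS") CES IPs given"
fi

# Reverse lookup for each IP (only if getent is available on this host)
if command -v getent >/dev/null 2>&1; then
    for ip in $(printf '%s' "$CES_IPS" | tr ',' ' '); do
        if getent hosts "$ip" >/dev/null 2>&1; then
            echo "OK: $ip resolves to a host name"
        else
            echo "WARN: no reverse lookup for $ip"
        fi
    done
fi
```

The script only reports problems; it does not check that the IPs are unused or that a matching subnet exists on each NameNode.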
# Configure the CES Shared root filesystem
./spectrumscale config protocols -f gpfs -m /ibm/gpfs/cessharedroot
# Create the new HDFS Transparency cluster with a unique cluster name and data directory name. This example uses cluster name “cescluster1” and data directory “gpfscluster1”. Note: The cluster name does not support special characters. Do not put spaces between the comma-separated host names.
./spectrumscale config hdfs new -n cescluster1 -nn c902f08x01.gpfs.net,c902f08x03.gpfs.net -dn c902f08x04.gpfs.net,c902f08x13.gpfs.net,c902f08x14.gpfs.net,c902f08x15.gpfs.net -f gpfs -d gpfscluster1
# Check the HDFS configuration
./spectrumscale config hdfs list
# Deploy HDFS
./spectrumscale deploy --precheck
# Check that the -k ACL value for the file system is set to “all” when using the HDFS protocol. If not, set -k using the mmchfs command.
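The ACL check described above can be scripted with the mmlsfs and mmchfs commands. The following is a minimal sketch, assuming the file system name gpfs from this example and the usual three-column mmlsfs output (flag, value, description); `acl_semantics` is a hypothetical helper. The script only prints the mmchfs command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch: read the ACL semantics (-k) of a file system from mmlsfs output.
# acl_semantics is a hypothetical helper; "gpfs" is the example file system.

FS=gpfs
MMFS_BIN=/usr/lpp/mmfs/bin

# acl_semantics: reads "mmlsfs <device> -k" output on stdin and prints
# the value column of the -k row
acl_semantics() {
    awk '$1 == "-k" { print $2 }'
}

if [ -x "$MMFS_BIN/mmlsfs" ]; then
    current="$("$MMFS_BIN/mmlsfs" "$FS" -k | acl_semantics)"
    if [ "$current" = "all" ]; then
        echo "ACL semantics for $FS are already set to all"
    else
        echo "ACL semantics for $FS are '$current'; to change them, run:"
        echo "  $MMFS_BIN/mmchfs $FS -k all"
    fi
else
    echo "mmlsfs not found under $MMFS_BIN; run this on a GPFS cluster node"
fi
```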
# Verifying cluster
- Check that you can write to the GPFS mount point via POSIX
then edit testposix to put some values into the file
- Check HDFS Transparency status for the NameNodes and DataNodes (must be executed on either a DataNode or a NameNode)
/usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs status
- Check HDFS Transparency NameNode status
/usr/lpp/mmfs/bin/mmhealth node show HDFS_Namenode -v -N cesNodes
- Check CES IP address and group names
/usr/lpp/mmfs/bin/mmces address list --full-list
Note: CES adds the “hdfs” prefix to the group name
- Run basic Hadoop commands
/usr/lpp/mmfs/hadoop/bin/hdfs dfs -mkdir -p /user/root
/usr/lpp/mmfs/hadoop/bin/hdfs dfs -ls /user
/usr/lpp/mmfs/hadoop/bin/hdfs dfs -cp /testposix /user/root
/usr/lpp/mmfs/hadoop/bin/hdfs dfs -cat /user/root/testposix
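The POSIX and HDFS checks above can be combined into one round-trip test: write a file through the GPFS mount point and read it back through HDFS. This is a minimal sketch that rests on an assumption not stated in this example, namely that the HDFS root maps to the data directory at /ibm/gpfs/gpfscluster1. Adjust `DATA_ROOT` to your layout. The file name and the helper `posix_path_for` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: write a file via POSIX and read it back via HDFS Transparency.
# ASSUMPTION: the HDFS root "/" corresponds to $DATA_ROOT on the GPFS mount.

HDFS=/usr/lpp/mmfs/hadoop/bin/hdfs
DATA_ROOT=/ibm/gpfs/gpfscluster1    # assumed mount point + data directory
NAME=roundtrip_check.txt            # hypothetical scratch file name

# posix_path_for HDFS_PATH -> prints the assumed POSIX path for an HDFS path
posix_path_for() {
    printf '%s/%s\n' "${DATA_ROOT%/}" "${1#/}"
}

if [ -x "$HDFS" ] && [ -d "$DATA_ROOT" ]; then
    posix_file="$(posix_path_for "/$NAME")"
    echo "written via POSIX" > "$posix_file"
    via_hdfs="$("$HDFS" dfs -cat "/$NAME")"
    if [ "$via_hdfs" = "$(cat "$posix_file")" ]; then
        echo "MATCH: HDFS and POSIX return the same content"
    else
        echo "MISMATCH: check the data directory mapping"
    fi
    rm -f "$posix_file"             # clean up the scratch file
else
    echo "Skipping: $HDFS or $DATA_ROOT is not available on this node"
fi
```

A MISMATCH result most likely means the assumed `DATA_ROOT` does not match the data directory configured for the cluster, rather than a fault in HDFS Transparency itself.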