The IBM Spectrum Scale HDFS Transparency implementation integrates both the NameNode and DataNode services and responds to requests as if it were HDFS running on the IBM Spectrum Scale file system (GPFS).
In HDFS, DataNodes contain the local disks for the file system, so scaling the storage requires adding more DataNodes to the cluster. HDFS Transparency DataNodes can be viewed as the HDFS gateway to the IBM Spectrum Scale storage layer. The storage layer can be scaled separately from the DataNodes when using Elastic Storage Server (ESS).
In this PoC environment, the customer's requirement is to reduce the number of nodes used by HDFS, especially the DataNodes. One issue they face is the high failure rate of the local disks in their Hadoop cluster, which they have to manage. For this reason, IBM proposed ESS with HDFS Transparency as the solution to their problem.
In the customer environment, a new Hortonworks Data Platform (HDP®) cluster with Ambari was instantiated, and a separate HDFS Transparency cluster was created that connects to the ESS.
To ensure a successful deployment at the customer site, an internal PoC environment was created to test setting up Kerberos with open source Apache Hadoop. This article contains the ported and merged instructions executed at the customer site, which has HDP and Ambari installed.
Internal PoC Environment
- OS: Red Hat Enterprise Linux 7.6
- Spectrum Scale: 22.214.171.124
- Open Source Apache Hadoop: 3.1.2
- HDFS transparency: 126.96.36.199
- Kerberos Server: krb5-server-1.15.1-34.el7.x86_64
- ESS GL1S / v188.8.131.52
- 1 NameNode and 2 DataNodes for HDFS Transparency with IBM Spectrum Scale.
The configuration for the customer’s Hadoop environment is not specified in this blog. The customer environment can be viewed as the Hadoop Node/Ambari server as seen in the diagram below.
Figure 1. Customer’s PoC environment
There are several deployment models for integrating IBM Spectrum Scale with Hadoop. The model used in this PoC is a variation of the fourth deployment model described in the IBM Knowledge Center under Hadoop Scale Storage Architecture.
In the IBM Spectrum Scale Knowledge Center, there is an existing Kerberos section describing how to enable Kerberos under HDP®, but it is based on the deployment model in which the IBM Spectrum Scale service is deployed under Ambari.
Because this PoC separates the Hadoop cluster from the IBM Spectrum Scale HDFS Transparency cluster (HDFS Transparency version 3.1.0 and below), a new set of instructions is needed to switch the instantiated Hadoop cluster to IBM Spectrum Scale and to Kerberize HDFS Transparency without going through Ambari.
Note: When an existing Hadoop cluster that uses the HDFS filesystem is switched to the IBM Spectrum Scale filesystem, any data residing in the HDFS filesystem is no longer accessible once the switch is done.
Prerequisites before enabling Kerberos
Before enabling Kerberos with HDFS Transparency, the following steps are required to be completed:
- The Hadoop cluster is up and running. This cluster will be called “native Hadoop system” or “HDFS” or “native HDFS” in this blog.
- The KDC server is up and running and provides Kerberos authentication for all of the Hadoop components including HDFS.
- The ESS is up and running.
- The IBM Spectrum Scale clients required for the HDFS Transparency cluster are up and running and are part of the ESS Scale cluster.
- The native Hadoop system, ESS cluster and the KDC server are able to communicate through the network.
Set up HDFS Transparency
Follow the “Installation and configuration of HDFS transparency” topic in the IBM Knowledge Center to install and configure HDFS Transparency.
Switch HDFS to HDFS Transparency
The existing Hadoop environment is set up to use native HDFS. To switch to IBM Spectrum Scale, point the default filesystem from HDFS to IBM Spectrum Scale.
Hortonworks HDP® configuration files are located under /etc/hadoop/conf.
HDFS Transparency 3.1 configuration files are located under /var/mmfs/hadoop/etc/hadoop.
- gpfs-site.xml -- This file is for HDFS Transparency only
- Change the fs.defaultFS parameter in core-site.xml under the HDP® and HDFS Transparency configuration directories.
If the port number used for HDFS is 9000 and the port number used for HDFS Transparency is 8020, then the port number must be changed to 8020 in the core-site.xml file in the native Hadoop environment to match the HDFS Transparency environment.
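As a minimal sketch, the fs.defaultFS entry would look something like the fragment printed by the script below. The host name namenode-hostname is a placeholder and 8020 is the HDFS Transparency port used in this example; substitute the values from your own cluster. The print_default_fs helper is only an illustration.

```shell
#!/bin/sh
# Placeholder values (assumptions): replace with your HDFS Transparency
# NameNode host name and port.
NAMENODE_HOST="namenode-hostname"
NAMENODE_PORT="8020"

# Print the core-site.xml property that points the default filesystem
# at the HDFS Transparency NameNode.
print_default_fs() {
  cat <<EOF
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://$1:$2</value>
</property>
EOF
}

print_default_fs "$NAMENODE_HOST" "$NAMENODE_PORT"
```

The same value would need to appear in core-site.xml under both /etc/hadoop/conf and /var/mmfs/hadoop/etc/hadoop.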
When making configuration changes under HDFS Transparency, modify the files from the HDFS Transparency NameNode and run the mmhadoopctl command to synchronize the configuration files from the HDFS Transparency NameNode to all the other HDFS Transparency nodes in the cluster.
Refer to Sync HDFS Transparency configurations in IBM Knowledge Center for more information.
When the mmhadoopctl command is executed on the HDFS Transparency NameNode, its output lists the configuration files that are synchronized to the other nodes.
- As root, on one of the HDFS Transparency nodes, start HDFS Transparency using mmhadoopctl command and verify that the NameNode and DataNodes are up and running.
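The synchronize-and-start sequence could be scripted roughly as follows. This is a sketch under some assumptions: the syncconf and start subcommands are the ones used later in this blog, getstate is assumed here as the subcommand for checking the connector state, and the run helper with its DRY_RUN switch is an illustrative addition that only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
CONF_DIR="/var/mmfs/hadoop/etc/hadoop"

# Print each command; execute it only when DRY_RUN=0 (illustrative helper).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Push the configuration from the NameNode to the other nodes,
# start HDFS Transparency, then report the state of the services.
sync_and_start() {
  run mmhadoopctl connector syncconf "$CONF_DIR" &&
  run mmhadoopctl connector start &&
  run mmhadoopctl connector getstate
}

sync_and_start
```

Run it once in the default dry-run mode to review the commands, then as root with DRY_RUN=0 on the HDFS Transparency NameNode.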
Note: How to access IBM Spectrum Scale file system from the Hadoop client will be addressed later in this blog under “Access IBM Spectrum Scale via HDFS Transparency” section.
Enable Kerberos authentication with HDFS Transparency
Prerequisites
- Key Distribution Center (KDC) server is up and running.
- Native Hadoop system is configured to work with Kerberos authentication on the KDC server.
Follow these steps to enable Kerberos on HDFS Transparency:
- Create the HDFS Transparency Principals and KeyTabs on the KDC server.
Note: The kdb5_util create command should have already been executed to initialize the KDC server for the realm.
There are three types of principals required to be created:
- Host principals for the HDFS Transparency nodes
- Service principals for the NameNode and DataNodes specific to the hosts that are running the services
- User principals for the client accessing HDFS Transparency
Note: For services and user principals, the KeyTabs need to be created on the KDC and exported to their respective hosts.
The following table lists sample base services that are to be configured:
Note: The hostname specified for the principals in the examples below should be replaced with the value returned by the “hostname” command from your cluster.
This example shows the principals for the host to be created:
This example shows the principals for Services to be defined:
This example shows the principal to be created for the user who accesses HDFS Transparency from a client node. It is required to authenticate the user on the client node.
Refer to “Setting up the Kerberos clients on the HDFS Transparency nodes” for more information.
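The three kinds of principals described above could be created with kadmin.local on the KDC server along the following lines. This is a sketch, not the exact commands used in the PoC: the realm EXAMPLE.COM and the host names are placeholders, and the nn/, dn/ and HTTP/ service names are assumed from the usual HDP naming that matches the keytab file names used in this blog. The script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Assumed placeholders: realm and host names must match your cluster.
REALM="EXAMPLE.COM"
NAMENODE="namenode-hostname"
DATANODES="datanode-hostname1 datanode-hostname2"
CLIENT_USER="hdfstr"

# Print one kadmin.local command that creates a principal with a random key.
addprinc() {
  printf 'kadmin.local -q "addprinc -randkey %s@%s"\n' "$1" "$REALM"
}

print_principal_commands() {
  # Host principals for every HDFS Transparency node.
  for h in $NAMENODE $DATANODES; do addprinc "host/$h"; done
  # Service principals: NameNode, NameNode HTTP (SPNEGO), and each DataNode.
  addprinc "nn/$NAMENODE"
  addprinc "HTTP/$NAMENODE"
  for h in $DATANODES; do addprinc "dn/$h"; done
  # User principal; with -randkey the user authenticates through a keytab.
  # Drop -randkey to be prompted for a password instead.
  addprinc "$CLIENT_USER"
}

print_principal_commands
```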
Note: These steps are for CES HDFS (HDFS Transparency version 3.1.1). The Kerberos setup itself does not depend on the HDFS Transparency version, but the commands for stopping, synchronizing configuration files, and starting differ between HDFS Transparency versions.
Once the principals are created, create the KeyTabs for each host.
Two KeyTab files are created for the NameNode:
- The service principal for the NameNode service and the host principal are exported to KeyTab "nn.service.keytab"
- The service principal for the NameNode HTTP service and the host principal are exported to KeyTab "spnego.service.keytab"
For each DataNode, the service principal for the DataNode service and the host principal are exported to KeyTab "dn.service.keytab". This ensures that every DataNode has a unique KeyTab, which provides better security.
Once the KeyTab file is created for one DataNode, it should be copied over to the respective host before creating the KeyTab for the next DataNode.
The same KeyTab creation needs to be repeated for the other DataNode, which is “datanode-hostname2” in this example.
Finally, create a KeyTab file for the user, and copy it to the client node.
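The KeyTab exports described above might look like the commands printed by this sketch. The realm, host names, and the user keytab name hdfstr.keytab are placeholders. Because the host principal is exported into more than one keytab, the sketch assumes kadmin.local with the -norandkey option so that an earlier export is not invalidated. Removing the local DataNode keytab after copying it is also an assumption, made because ktadd appends to an existing file.

```shell
#!/bin/sh
# Placeholders (assumptions): realm, host names, and the user keytab name.
REALM="EXAMPLE.COM"
KEYTAB_DIR="/etc/security/keytabs"

# Print a kadmin.local command that exports the given principals into one
# keytab. -norandkey keeps the existing keys unchanged.
ktadd_cmd() {
  keytab="$1"; shift
  princs=""
  for p in "$@"; do princs="$princs $p@$REALM"; done
  printf 'kadmin.local -q "ktadd -norandkey -k %s/%s%s"\n' "$KEYTAB_DIR" "$keytab" "$princs"
}

print_keytab_commands() {
  nn="namenode-hostname"
  ktadd_cmd nn.service.keytab "nn/$nn" "host/$nn"
  ktadd_cmd spnego.service.keytab "HTTP/$nn" "host/$nn"
  for dn in datanode-hostname1 datanode-hostname2; do
    ktadd_cmd dn.service.keytab "dn/$dn" "host/$dn"
    # Copy the keytab to its DataNode, then remove the local copy so the
    # next export starts from an empty file.
    echo "scp $KEYTAB_DIR/dn.service.keytab $dn:$KEYTAB_DIR/"
    echo "rm -f $KEYTAB_DIR/dn.service.keytab"
  done
  ktadd_cmd hdfstr.keytab hdfstr
}

print_keytab_commands
```

After a keytab is in place on its host, klist -kt followed by the keytab path lists the principals it contains.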
- Modify the Kerberos related configuration files on all the HDFS Transparency nodes (NameNode and DataNodes)
Note: The /etc/krb5.conf file needs to be changed to define the information related to the Kerberos realm.
In this example, the realm-related definitions are modified to match the customer’s Hadoop cluster environment.
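A minimal /etc/krb5.conf covering the realm-related sections could be rendered as in the sketch below. The realm EXAMPLE.COM, the KDC host kdc.example.com, and the domain example.com are placeholders, and a real file usually carries more settings copied from the native Hadoop system. The output goes to standard output so that nothing is overwritten.

```shell
#!/bin/sh
# Render a minimal krb5.conf for the given realm, KDC host, and DNS domain.
render_krb5_conf() {
  realm="$1"; kdc="$2"; domain="$3"
  cat <<EOF
[libdefaults]
  default_realm = $realm

[realms]
  $realm = {
    kdc = $kdc
    admin_server = $kdc
  }

[domain_realm]
  .$domain = $realm
  $domain = $realm
EOF
}

# Placeholder values (assumptions); use the ones from the native Hadoop system.
render_krb5_conf EXAMPLE.COM kdc.example.com example.com
```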
The Hadoop client needs a principal name that is different from those of the HDFS Transparency NameNode and DataNodes in order to access files via HDFS Transparency.
For example, hdfstr is the username.
The KeyTab files created above need to be copied to a directory such as “/etc/security/keytabs”.
Refer to step #5, Create principal required for NameNode, in “Setting up the Kerberos clients on the HDFS Transparency nodes” in the IBM Knowledge Center for more information.
- Add Kerberos stanza information into HDFS Transparency configuration files.
Copy the Kerberos stanza information from the native Hadoop system core-site.xml file into /var/mmfs/hadoop/etc/hadoop/core-site.xml on the HDFS Transparency NameNode.
core-site.xml
hadoop.security.auth_to_local (Customer setup)
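The Kerberos stanzas in core-site.xml typically include the standard Hadoop security properties printed by the sketch below. The auth_to_local value DEFAULT is only a placeholder: the real mapping rules are specific to the customer setup and should be copied from the native Hadoop system.

```shell
#!/bin/sh
# Print the Kerberos-related core-site.xml stanzas.
# The auth_to_local value is a placeholder; copy the real rules from the
# core-site.xml of the native Hadoop system.
print_core_site_kerberos() {
  cat <<'EOF'
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>DEFAULT</value>
</property>
EOF
}

print_core_site_kerberos
```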
- Add additional Kerberos stanza information into /var/mmfs/hadoop/etc/hadoop/hdfs-site.xml on the HDFS Transparency NameNode.
Other fields may also be needed, based on the information from the native Hadoop system.
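For hdfs-site.xml, the principal and keytab settings for the NameNode, its HTTP (SPNEGO) endpoint, and the DataNodes are the usual starting point. The sketch below prints them using standard Hadoop property names. The realm is a placeholder, the nn/dn/HTTP service names are assumed from the keytab names used earlier, and _HOST is the token that Hadoop replaces with each node's own host name. The prop helper is illustrative.

```shell
#!/bin/sh
REALM="EXAMPLE.COM"               # placeholder realm (assumption)
KEYTABS="/etc/security/keytabs"

# Print one Hadoop <property> element.
prop() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

print_hdfs_site_kerberos() {
  prop dfs.namenode.kerberos.principal "nn/_HOST@$REALM"
  prop dfs.namenode.keytab.file "$KEYTABS/nn.service.keytab"
  prop dfs.namenode.kerberos.internal.spnego.principal "HTTP/_HOST@$REALM"
  prop dfs.web.authentication.kerberos.principal "HTTP/_HOST@$REALM"
  prop dfs.web.authentication.kerberos.keytab "$KEYTABS/spnego.service.keytab"
  prop dfs.datanode.kerberos.principal "dn/_HOST@$REALM"
  prop dfs.datanode.keytab.file "$KEYTABS/dn.service.keytab"
}

print_hdfs_site_kerberos
```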
- Configure HDFS Transparency to point to the correct IBM Spectrum Scale filesystem.
See “Configure HDFS transparency nodes” on how to modify these configuration files for HDFS Transparency.
Note: For this configuration, gpfs.storage.type is set to shared.
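A gpfs-site.xml fragment for this setup might look like the output of the sketch below. Only gpfs.storage.type=shared comes from this PoC. The gpfs.mnt.dir and gpfs.data.dir entries are assumed to be the properties that name the mount point and the data directory beneath it, and the values /gpfs/fs1 and hadoop are hypothetical; check the IBM Knowledge Center topic above for the authoritative list.

```shell
#!/bin/sh
# Print gpfs-site.xml entries for a shared-storage (ESS) setup.
# $1 = file system mount point, $2 = data directory under the mount point.
print_gpfs_site() {
  cat <<EOF
<property>
  <name>gpfs.mnt.dir</name>
  <value>$1</value>
</property>
<property>
  <name>gpfs.data.dir</name>
  <value>$2</value>
</property>
<property>
  <name>gpfs.storage.type</name>
  <value>shared</value>
</property>
EOF
}

# Hypothetical values; replace with the mount point and directory of your
# IBM Spectrum Scale file system.
print_gpfs_site /gpfs/fs1 hadoop
```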
- After the configurations are modified, synchronize the configuration using mmhadoopctl from the HDFS Transparency NameNode.
- Now the HDFS Transparency cluster is ready to get the Kerberos ticket
Execute “kinit” command on each of the HDFS Transparency nodes to get the Ticket Granting Ticket (TGT).
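On each node, kinit can read the key from the local keytab instead of prompting for a password. The sketch below prints one such command per node. The realm, the host names, and the nn/dn principal names are the same assumptions used for the keytabs earlier.

```shell
#!/bin/sh
REALM="EXAMPLE.COM"               # placeholder realm (assumption)
KEYTAB_DIR="/etc/security/keytabs"

# Print the kinit command for a service (nn or dn) running on a given host.
kinit_cmd() {
  printf 'kinit -kt %s/%s.service.keytab %s/%s@%s\n' "$KEYTAB_DIR" "$1" "$1" "$2" "$REALM"
}

kinit_cmd nn namenode-hostname
kinit_cmd dn datanode-hostname1
kinit_cmd dn datanode-hostname2
```

Each command is meant to be run on the node it names. Running klist afterwards shows the ticket that was granted.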
Note: The operations for synchronizing the configuration files and starting HDFS Transparency were already explained under “Switch HDFS to HDFS Transparency”.
Access IBM Spectrum Scale via HDFS Transparency
In order for a user, hdfstr, on the client node on the Hadoop system to access the IBM Spectrum Scale filesystem, the user needs to execute the kinit command on the client node to get a ticket from the KDC server.
First run su - hdfstr on the Hadoop client, and then run kinit to get the Kerberos ticket.
After user hdfstr has obtained the ticket, the user can access the filesystem.
If the “hdfstr” user does not run the “kinit” command and therefore has no valid ticket credentials, the user cannot access files via HDFS Transparency.
When Kerberos is enabled, any user who is not authenticated by the KDC server will not have access to IBM Spectrum Scale via HDFS Transparency.
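Both cases can be illustrated with a small wrapper run as the client user. This is a sketch: list_root_with_ticket is a hypothetical helper name, klist -s is used to test silently for a valid ticket, and hdfs dfs -ls / stands in for any file system access.

```shell
#!/bin/sh
# Run as the client user (for example after: su - hdfstr).
# Lists the root of the file system only when a valid Kerberos ticket exists.
list_root_with_ticket() {
  # klist -s prints nothing and returns non-zero when the credentials
  # cache is missing or expired.
  if ! klist -s; then
    echo "No valid Kerberos ticket; run kinit first." >&2
    return 1
  fi
  hdfs dfs -ls /
}
```

Calling list_root_with_ticket before kinit returns an error without contacting the NameNode. After kinit it lists the root directory served by HDFS Transparency.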
To enable Kerberos with HDFS Transparency manually, without using Ambari, the Kerberos information from the Kerberized native Hadoop cluster must be set up in the configuration files of the HDFS Transparency nodes, and the principals and KeyTabs for those nodes must be generated. After the HDFS Transparency cluster is Kerberized, users on the Hadoop clients can get a Kerberos ticket and access the IBM Spectrum Scale filesystem.
In short, the steps required for Kerberos configuration are:
- Update /etc/krb5.conf based on the information of the native Hadoop system for all HDFS Transparency nodes
- default_realm name
- [realms] definitions
- [domain_realm] definitions
- Define the users and groups who require access: The user/group need to be the same on native Hadoop system and HDFS Transparency cluster.
For example, hdfstr:hadoop
- The KeyTab files created should be copied to /etc/security/keytabs/ on all the nodes in the HDFS Transparency cluster.
- Update /var/mmfs/hadoop/etc/hadoop/core-site.xml based on the native Hadoop system information on the HDFS Transparency NameNode.
- The fs.defaultFS needs to be set to the HDFS Transparency NameNode definition.
For example, hdfs://namenode-hostname:8020
- Enable Kerberos in the same way as on the native Hadoop system by changing the Kerberos-related parameters in /var/mmfs/hadoop/etc/hadoop/core-site.xml.
- Update /var/mmfs/hadoop/etc/hadoop/hdfs-site.xml based on the native Hadoop system information on the HDFS Transparency NameNode.
Add other fields as needed, based on the native Hadoop system information.
- Update /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml, the IBM Spectrum Scale specific configuration file, on the HDFS Transparency NameNode so that HDFS Transparency can access the IBM Spectrum Scale filesystem properly.
- Synchronize the updated config files from the HDFS Transparency NameNode.
Execute mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/hadoop/ and check that no errors are reported.
Check that all the config files on all the NameNodes/DataNodes are correct.
- Start HDFS Transparency with the mmhadoopctl connector start command.
- On the native Hadoop client, log in as the user authenticated by Kerberos to access the IBM Spectrum Scale filesystem via HDFS Transparency.