Data is growing exponentially. IBM Spectrum Scale and the IBM Elastic Storage Server (ESS) provide an enterprise solution that can ingest and manipulate data seamlessly under one namespace for existing workloads as well as new workloads such as analytics, artificial intelligence, and machine learning. This helps enterprises reduce data copies and their storage footprint.
The Hadoop Distributed File System (referred to as HDFS, native HDFS, or native Hadoop in this article) uses a different data storage mechanism than IBM Spectrum Scale, which is a POSIX file system. Therefore, IBM Spectrum Scale provides integration with Hadoop applications by using the HDFS Transparency connector. HDFS Transparency offers a set of interfaces that allow applications to use the HDFS client to access IBM Spectrum Scale through HDFS RPC requests.
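As a quick illustration, assuming HDFS Transparency is already running and its NameNode service listens on the hypothetical host transparency1.example.com (default port 8020), a standard HDFS client can address IBM Spectrum Scale with the usual commands:

    # Standard HDFS client commands are served by the HDFS Transparency
    # NameNode/DataNode services backed by IBM Spectrum Scale.
    hdfs dfs -ls hdfs://transparency1.example.com:8020/
    hdfs dfs -put localfile.csv hdfs://transparency1.example.com:8020/data/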
Enterprises often start their Hadoop journey with native HDFS, but as data grows they require a solution like IBM Spectrum Scale that can serve traditional applications, Hadoop applications, and ML/DL applications from a single data repository. This article explains how you can migrate data from native HDFS to an IBM Spectrum Scale based shared storage environment.
To migrate data easily from native HDFS to IBM Spectrum Scale based shared storage, you need to install HDFS Transparency onto the IBM Spectrum Scale cluster.
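As a rough sketch of what that installation step involves (the exact package name, configuration path, and commands depend on the HDFS Transparency version, so treat the following as assumptions rather than the definitive procedure):

    # On the IBM Spectrum Scale nodes chosen to run HDFS Transparency:
    # install the HDFS Transparency package, synchronize the Hadoop
    # configuration, and start the connector services.
    yum install gpfs.hdfs-protocol
    mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop
    mmhadoopctl connector start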
Shared storage like the Elastic Storage Server already contains an IBM Spectrum Scale cluster, but HDFS Transparency is not added to the ESS Spectrum Scale cluster itself. To add HDFS Transparency, you need to set up a multi-cluster IBM Spectrum Scale environment.
There are two setup options:
• The 1st option is to add HDFS Transparency onto the IBM Spectrum Scale cluster so that data movement is done through the HDFS protocol between the IBM Spectrum Scale cluster with HDFS Transparency and the Hadoop cluster.
The diagram below shows an ESS configuration where a separate local IBM Spectrum Scale cluster with HDFS Transparency is configured to access the ESS.
For this option, the distcp jobs should run on the Hadoop cluster. The data can be moved from the ESS to the Hadoop cluster or from the Hadoop cluster to the ESS. This scenario usually uses the ESS for data archival: data ingestion is done on the Hadoop cluster, and cold data can be archived from the Hadoop cluster into the ESS to save cost (see the distcp sketch below).
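As a minimal sketch, assuming the native Hadoop NameNode runs on the hypothetical host nn1.example.com and the HDFS Transparency NameNode on the IBM Spectrum Scale cluster runs on the hypothetical host transparency1.example.com (both on the default port 8020), a distcp job submitted from the Hadoop cluster could look like this:

    # Run from the Hadoop cluster: archive cold data from native HDFS
    # into the ESS file system through HDFS Transparency.
    hadoop distcp \
        hdfs://nn1.example.com:8020/warehouse/cold_data \
        hdfs://transparency1.example.com:8020/archive/cold_data

    # Swap the source and target URIs to move data from the ESS back to native HDFS.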
• The 2nd option is to set up a new Hadoop cluster on a local IBM Spectrum Scale cluster that accesses the shared storage so that the storage tiering model can be used. It is recommended to set up the local IBM Spectrum Scale cluster to access the ESS via remote mount. This creates a multi-cluster IBM Spectrum Scale environment in which one IBM ESS can be shared among different groups. The remote mount mode isolates storage management from the local IBM Spectrum Scale cluster, so operations run from the local cluster (for example, mmshutdown -a) do not impact the storage-side Spectrum Scale cluster. Refer to the following picture:
This mode can use the ESS as a fast data ingestion layer (for example, use the ESS GSxS model to leverage fast SSDs for data ingestion). Analysis jobs running on the native Hadoop cluster with native HDFS can read and write data from the IBM Spectrum Scale Hadoop cluster in real time, with only one copy of the data, by leveraging storage tiering capabilities. Therefore, distcp is not needed to copy data from the native Hadoop cluster to the IBM Spectrum Scale shared storage. Storage tiering is for data sharing and access, whereas distcp is for data copy and migration. A sketch of the remote mount setup follows.
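As a minimal sketch of the remote mount configuration, assuming the ESS (storage) cluster is named ess.example.com with contact nodes essio1 and essio2, the local cluster is named local.example.com, and the ESS file system device is ess_fs mounted at /gpfs/ess_fs (all hypothetical names and paths), the multi-cluster setup could look like this:

    # On the local (client) cluster: generate the authentication key pair.
    mmauth genkey new

    # On the ESS (storage) cluster: register the local cluster and grant it
    # access to the file system (copy the local cluster's public key over first).
    mmauth add local.example.com -k /tmp/local_id_rsa.pub
    mmauth grant local.example.com -f ess_fs

    # On the local cluster: define the remote cluster and its file system,
    # then mount it on all local nodes.
    mmremotecluster add ess.example.com -n essio1,essio2 -k /tmp/ess_id_rsa.pub
    mmremotefs add ess_fs -f ess_fs -C ess.example.com -T /gpfs/ess_fs
    mmmount ess_fs -a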
There might be corner cases where you want to start moving data into an ESS that is already available in your data center, but you do not yet have enough machines to create a local IBM Spectrum Scale cluster to host HDFS Transparency. Here are a couple of interim solution options for migrating data from the native Hadoop cluster into the ESS.
There are two interim setup options:
• The 1st option is to configure one node in the Hadoop cluster as the HDFS NFS Gateway and then mount HDFS through the NFS Gateway on any node in the Hadoop cluster. Use the scp command to copy the data from the NFS mount point into the remote ESS file system. Refer to the following picture and the command sketch below:
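As a minimal sketch, assuming the NFS Gateway runs on the hypothetical host nfsgw1.example.com, the data to migrate lives under /user/hive/data in HDFS, and the ESS file system is mounted at /gpfs/ess_fs on a hypothetical ESS client node named essclient1.example.com:

    # On a Hadoop node: mount HDFS exported by the NFS Gateway (NFSv3 only).
    mkdir -p /hdfs_nfs
    mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync nfsgw1.example.com:/ /hdfs_nfs

    # Copy the data from the NFS mount point into the remote ESS file system.
    scp -r /hdfs_nfs/user/hive/data root@essclient1.example.com:/gpfs/ess_fs/ingest/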
• The 2nd option is to use the “hadoop dfs -copyToLocal” command to copy the data from native HDFS into a local directory on any node in the Hadoop cluster. Then use the scp command to copy the data from that node into the remote ESS file system. Refer to the following picture and the command sketch below:
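A minimal sketch, assuming the same hypothetical source directory /user/hive/data, a local staging directory /tmp/staging with enough free space, and the same hypothetical ESS node and mount point as above:

    # On any node in the Hadoop cluster: copy the data out of native HDFS
    # into a local staging directory ("hadoop dfs" is the older, deprecated alias).
    mkdir -p /tmp/staging
    hadoop fs -copyToLocal /user/hive/data /tmp/staging

    # Copy the staged data from that node into the remote ESS file system.
    scp -r /tmp/staging/data root@essclient1.example.com:/gpfs/ess_fs/ingest/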