Big data analytics with Spectrum Scale using remote cluster mount & multi-filesystem support

By Archive User posted Tue November 28, 2017 10:33 AM

This article describes how to get more value for your analytic workloads from your Spectrum Scale storage by leveraging the following two recent capabilities. These capabilities were added in the Spectrum Scale Ambari management pack and HDFS Transparency versions released on October 30, 2017.

  1. Support of remote mounted Spectrum Scale filesystems for Hadoop

  2. Support of multiple Spectrum Scale filesystems for Hadoop

Let’s examine these features and the value they bring over the previous implementation. Although this document refers to ESS throughout for ease of understanding, the concepts generally apply to other shared storage Spectrum Scale clusters as well.

  1. Remote mounted Spectrum Scale filesystems for Hadoop:

    Hadoop clusters can now use a filesystem that is remotely mounted from another Spectrum Scale cluster. With this feature, a remotely mounted Spectrum Scale filesystem can be used to run the Hadoop application stack via the HDFS Transparency protocol. The feature allows greater flexibility, better security and better capacity utilization of existing Spectrum Scale shared storage infrastructure such as the IBM Elastic Storage Server (ESS).

    Let’s compare the options that were available for using an ESS before this feature with the advantages gained from remote mount support. Consider one ESS cluster and one Hadoop cluster. The Hadoop nodes may not be storage rich, so you want to leverage the filesystems on the ESS for Hadoop storage.

    BEFORE the remote mount feature:

    The Spectrum Scale client and the HDFS Transparency components are configured on the Hadoop hosts that are chosen to run Spectrum Scale. There is only one Spectrum Scale cluster, the one on the ESS. The hosts chosen from the Hadoop cluster (via the Ambari GUI) are added (through the mmaddnode command) to the Spectrum Scale cluster on the ESS and access its filesystems as client nodes.

    The downside of this design was that only one Hadoop cluster could access and use the Spectrum Scale cluster on the ESS at a given time.

    AFTER the remote mount support:

    • Multiple Hadoop clusters can leverage common shared Spectrum Scale storage at the same time, allowing better utilization and flexibility. This can be useful for an organization that needs multiple Hadoop clusters catering to different lines of business or geographies, all sharing the same shared storage/ESS infrastructure.

      In this design, there would be n+1 Spectrum Scale clusters, where n is the number of Hadoop clusters that wish to access the Spectrum Scale cluster on the ESS. Unlike before, the Hadoop clusters are not added to (i.e. don’t become a part of) the Spectrum Scale cluster on the ESS. Instead, use one of the following approaches to share the cluster on the ESS at the same time:

      • Each cluster (which is both a Hadoop cluster and a Spectrum Scale cluster) mounts a common filesystem from the ESS and performs IO to one unique sub-directory (e.g. datadir_n) under that mount point. In this approach the entire storage capacity and IO bandwidth of the ESS is available to all the Hadoop clusters.

      • Each cluster (which is both a Hadoop cluster and a Spectrum Scale cluster) mounts a unique filesystem from the ESS and performs IO to the mount point. This may be desirable when there are already multiple filesystems on the ESS and you want to use different filesystems for different Hadoop clusters for better isolation.

      Let’s consider an example in which there are n Hadoop clusters which have mounted one filesystem called /ess_fs1 from the ESS as /ess_fs1. Suppose the Transparency parameters on Hadoop cluster n are set as follows:

      • gpfs.mnt.dir is set to /ess_fs1

      • is set to datadir_n

      • gpfs.mnt.dir is set to remote

      Applications on Hadoop cluster n will then see /ess_fs1/datadir_n as the Hadoop root directory and will perform IO to that directory.
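      The mapping described above, in which each cluster sharing a common mount point gets its own sub-directory as the Hadoop root, can be illustrated with a minimal Python sketch. It only models the path arithmetic and does not call any Spectrum Scale or HDFS Transparency API. The function names hadoop_root and shared_filesystem_layout are hypothetical, and the cluster count of 3 is an arbitrary assumption for illustration.

```python
import posixpath
from typing import Dict


def hadoop_root(mount_dir: str, data_dir: str = "") -> str:
    """Return the directory Hadoop applications would see as their root.

    mount_dir is the filesystem mount point (e.g. /ess_fs1); data_dir is
    the optional per-cluster sub-directory under it (e.g. datadir_n).
    """
    data_dir = data_dir.strip("/")  # keep the sub-directory relative
    return posixpath.join(mount_dir, data_dir) if data_dir else mount_dir


def shared_filesystem_layout(mount_dir: str, cluster_count: int) -> Dict[int, str]:
    """Model the first approach: one common mount, one sub-directory per cluster."""
    return {
        n: hadoop_root(mount_dir, f"datadir_{n}")
        for n in range(1, cluster_count + 1)
    }


# Hypothetical example: three Hadoop clusters sharing /ess_fs1.
layout = shared_filesystem_layout("/ess_fs1", 3)
for n, root in layout.items():
    print(f"Hadoop cluster {n}: Hadoop root = {root}")

# n Hadoop-side clusters plus the cluster on the ESS itself.
print("Spectrum Scale clusters in total:", len(layout) + 1)
```

      In the second approach, where each cluster mounts its own filesystem, the sub-directory is simply omitted and the mount point itself acts as the Hadoop root, which hadoop_root models when data_dir is left empty.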

      Note: An existing Spectrum Scale cluster and an existing remote cluster are required before configuring this feature through Ambari.

    • Better security for your shared storage cluster's administration environment:

      In the earlier design, the Hadoop nodes also became part of the Spectrum Scale cluster on the ESS and had access to the full administration functionality of the ESS. With the remote mount feature, Hadoop nodes are no longer added to the cluster on the ESS. Instead, a separate Spectrum Scale cluster must be created upfront over selected Hadoop nodes. Only the authorized filesystems from the ESS (one or more) are exposed to the authorized Hadoop/Spectrum Scale clusters. This results in better security for the ESS administration environment.

    Details on how to configure this feature can be found here:

  2. Support for multiple Spectrum Scale filesystems for Hadoop:

    Multiple Spectrum Scale filesystems can now be used from within a single Hadoop cluster. The filesystems can be local or remotely mounted. Both filesystems can be accessed from within the same Hadoop cluster and potentially from the same Hadoop application. For example, a single MapReduce Java application could read data from one filesystem and write to another.

    This brings multiple benefits to Spectrum Scale users:

    • Easier migration from shared-nothing architecture (FPO) to shared storage (e.g. ESS):
      Smaller BD&A clusters tend to start with FPO rather than shared storage, because shared storage gives a better TCO only for relatively larger data sets. So customers with modest storage needs typically start with the FPO model, leveraging storage-rich commodity servers. As their storage requirements grow, they might want to migrate to centralized shared storage such as the ESS.

      For a period of time, then, they would have an FPO filesystem and an ESS filesystem coexisting. With this feature, it is possible to use FPO and shared storage filesystems within a single Hadoop cluster, which allows for a smoother migration from FPO to ESS/shared storage. Multi-filesystem support enables customers to read data from and write data to both FPO and ESS from the same Hadoop cluster, and potentially from the same Hadoop application.

    • Single-point Hadoop application view of multiple filesystems from different Spectrum Scale clusters: This provides an alternative to the HDFS Federation feature (Hortonworks does not support federation at the time of this writing).

      For example, if customers have invested in multiple ESS clusters and have data spread across them, one filesystem from each ESS could be remote mounted onto the Hadoop/Spectrum Scale cluster. From the Hadoop application perspective, this provides a single point of view of the filesystems from multiple Spectrum Scale clusters. It also allows separate filesystems to be used for different applications or use cases, e.g. one filesystem for HBase and the other for Hive. Alternatively, use one filesystem for storing active data and another for archived data.
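      As a planning aid for the use cases above, the following Python sketch groups workloads by the filesystem they are assigned to and rejects plans that span more than two filesystems, the limit stated for this management pack release. It is only an illustration under assumptions: the function name plan_filesystems, the workload labels and the mount point /ess_fs2 are hypothetical, and the sketch does not touch any Ambari or Spectrum Scale configuration.

```python
from typing import Dict, List

# Limit stated in this article for the current management pack release.
MAX_FILESYSTEMS = 2


def plan_filesystems(assignments: Dict[str, str]) -> Dict[str, List[str]]:
    """Group workloads by mount point and enforce the two-filesystem limit.

    assignments maps a workload label to the mount point it should use.
    """
    filesystems = sorted(set(assignments.values()))
    if len(filesystems) > MAX_FILESYSTEMS:
        raise ValueError(
            f"{len(filesystems)} filesystems requested, "
            f"but only {MAX_FILESYSTEMS} are supported"
        )
    return {
        fs: sorted(w for w, mount in assignments.items() if mount == fs)
        for fs in filesystems
    }


# Hypothetical plan: HBase on one filesystem, Hive and archived data on another.
plan = plan_filesystems(
    {"HBase": "/ess_fs1", "Hive": "/ess_fs2", "archive": "/ess_fs2"}
)
for fs, workloads in plan.items():
    print(fs, "->", ", ".join(workloads))
```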

    Note: This version of the Spectrum Scale Ambari management pack supports configurations with only two filesystems, as follows. This may be enhanced in the future to support more than two filesystems.



    For details on how to configure this feature, please see: