Deploying IBM Spectrum Scale File System using Apache Ambari framework on Hadoop clusters

By Archive User posted Tue June 27, 2017 12:52 PM

The big data revolution has led to an increased number of data processing and analytics applications. These big data analytics ecosystems require a robust, scalable, enterprise-level file system to store this huge amount of data. The default file system used in the Hadoop ecosystem for storing data is the Hadoop Distributed File System (HDFS).

The IBM Spectrum Scale file system offers an enterprise-class alternative to HDFS for building big data platforms. IBM Spectrum Scale is a high-performing, POSIX-compliant technology that is used in thousands of mission-critical commercial installations worldwide. The IBM Spectrum Scale file system can be deployed independently or with IBM's big data platform, which consists of IBM BigInsights for Apache Hadoop. IBM Spectrum Scale is now also certified with the Hortonworks HDP 2.6 Hadoop distribution.

Apache Ambari is an open source tool for provisioning, managing, and monitoring Hadoop clusters. It has a pluggable architecture in which any service can be added or removed easily, and it provides an easy-to-use Hadoop management web UI backed by its RESTful APIs. Major Hadoop distributions such as Hortonworks and IBM BigInsights use Apache Ambari for creating and managing big data Hadoop clusters.

The Ambari integration package leverages the pluggable architecture of the Ambari server to simplify adding Spectrum Scale as a service to an existing big data cluster. Furthermore, the Ambari Service Addition Wizard makes it easy to configure and install the Spectrum Scale file system as a service.

Apache Ambari 2.4.0 or higher provides a unified way of adding a custom service to Hadoop clusters using a management pack. This allows the same management pack to be added to different Hadoop distributions that use Ambari 2.4.0 or higher. For example, the same IBM Spectrum Scale management pack can be added to IBM BigInsights 4.2.5 and to the Hortonworks HDP 2.6 distribution.

IBM BigInsights is an enterprise-ready platform for the Hadoop ecosystem. It provides Apache Hadoop and its related open source projects as core components, along with several IBM features that add enterprise-class capabilities. It also provides a web management console, development tools such as Eclipse plug-ins and a text analytics workbench, analytics accelerators, visualization tools, and connectors to ingest and integrate data from a variety of data sources.

Hortonworks HDP is the industry's only truly secure, enterprise-ready open source Apache Hadoop distribution based on a centralized architecture (YARN). HDP addresses the complete needs of data at rest, powers real-time customer applications, and delivers robust analytics that accelerate decision making and innovation. With the latest version, HDP 2.6, customers benefit from interactive query in seconds, enhanced data science, enterprise-grade security, and streamlined operations, in the cloud and on premises, to harvest value from their data faster than previously possible.

The Ambari integration package provides a way of integrating the installation and provisioning of the IBM Spectrum Scale file system within an existing Hadoop cluster.

Upon installation, the HDFS Transparency daemons replace the HDFS RPC daemons, such as the DataNode and NameNode, and redirect I/O requests to the IBM Spectrum Scale file system instead of HDFS.

The Ambari integration package allows IBM Spectrum Scale to be added as a service on an existing Hadoop cluster using Ambari. Once the IBM Spectrum Scale service is added, it can be integrated and unintegrated as needed. After unintegration, I/O requests from the Hadoop clients are routed back to HDFS. While the service is integrated, the HDFS service panel reflects the HDFS Transparency daemon status, and the Transparency NameNode and DataNode daemons seamlessly replace the HDFS daemons.

The NameNode Ambari metrics shown on the HDFS service page are emitted by the Transparency NameNode. The file system metadata is stored in the IBM Spectrum Scale file system, so the Transparency NameNode is a stateless daemon. Whenever a block request is made to the NameNode, it fetches the block location from the metadata stored in the IBM Spectrum Scale file system. The Transparency NameNode does not create the fsimage or edit logs.

In native HDFS, the fsimage stores the NameNode inode and other information after the NameNode shuts down, and the edit logs track NameNode runtime operations so that, after a crash, the NameNode can recover its state from the logs and the fsimage. Because the Transparency NameNode creates neither, no merging of edit logs and fsimage is required when IBM Spectrum Scale is integrated, and a Secondary NameNode is not needed. Therefore, in a non-HA setup, the Secondary NameNode component (which merges the edit logs and fsimage and sends the result back to the NameNode) is removed from the HDFS service panel; it is added back when the IBM Spectrum Scale service is unintegrated from the Ambari server. The stateless Transparency NameNode also helps with disaster recovery, since file system recovery does not depend on the edit logs and fsimage.

Spectrum Scale benefits over HDFS:

In-place data analytics. Spectrum Scale is POSIX-compatible and therefore supports a wide range of applications and workloads. With the Spectrum Scale HDFS Transparency Connector, you can analyze file and object data in place, with no data transfer or data movement.

Flexible deployment modes. You can run IBM Spectrum Scale on commercial storage-rich servers, or choose IBM Elastic Storage Server (ESS) to provide a higher-performance, massive storage system for your Hadoop workload. You can also deploy Spectrum Scale on a traditional SAN storage system for HDP.

Spectrum Scale enterprise-class data management features, such as:

POSIX-compliant APIs or the command line

Unified file and object support (NFS, SMB, Object)

FIPS and NIST compliant data encryption

Cold data compression

Disaster Recovery

Snapshot support for point-in-time data captures

Policy-based information lifecycle management capabilities to manage PBs of data

Mature enterprise-level data backup and archive solutions (including tape)

Remote cluster

Seamless secure tiering to Cloud Object stores

HDFS Transparency Connector

The IBM Spectrum Scale HDFS Transparency Connector (part of the IBM Spectrum Scale offering) provides a set of interfaces that allow applications to use the HDFS client to access IBM Spectrum Scale through native HDFS RPC requests. All data transmission and metadata operations in HDFS go through the RPC mechanism and are processed by the NameNode and DataNode services within HDFS. IBM Spectrum Scale HDFS Transparency implements both the NameNode and the DataNode services and responds to requests from the HDFS client. In other words, an HDFS client can access Spectrum Scale seamlessly, just as it did with HDFS.
Figure: Spectrum Scale HDFS Transparency Connector architecture

Key advantages of the Spectrum Scale HDFS Transparency Connector include:

No Spectrum Scale Client is needed on every Hadoop node. HDFS client can access data on Spectrum Scale as it does with HDFS storage.

Full Kerberos support for more Hadoop components (e.g. distcp, webhdfs, and Impala, which calls the HDFS client directly without going through the Hadoop FileSystem interface)

Leverages HDFS client cache

HDFS-compliant APIs and shell interface commands

Application client isolation from storage. Application clients can access data in the IBM Spectrum Scale file system without having a GPFS client installed.

Improved security management by Kerberos authentication and encryption for RPCs

Simplified file system monitoring by Hadoop Metrics2 integration
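
Because HDFS Transparency answers the same RPC requests as native HDFS, the usual HDFS shell commands should work unchanged against Spectrum Scale. The following is a minimal sketch of a round-trip check, assuming the `hdfs` client is on the PATH of the node. The function name `hdfs_roundtrip`, the `HDFS_CMD` override, and the test directory are illustrative and are not part of the integration package.

```shell
#!/bin/sh
# Sketch: write a small file through the HDFS client, read it back, clean up.
# HDFS_CMD is a hypothetical override so the steps can be dry-run with "echo".
hdfs_roundtrip() {
    dir=$1                              # target directory, e.g. /tmp/transparency-check
    cmd="${HDFS_CMD:-hdfs dfs}"         # left unquoted below so "hdfs dfs" splits into two words
    $cmd -mkdir -p "$dir" &&
    $cmd -put -f /etc/hosts "$dir/hosts" &&
    $cmd -cat "$dir/hosts" &&
    $cmd -rm -r -skipTrash "$dir"
}

# Dry run (prints the client arguments instead of executing them):
#   HDFS_CMD=echo hdfs_roundtrip /tmp/transparency-check
# Real run on a cluster node:
#   hdfs_roundtrip /tmp/transparency-check
```

If the round trip succeeds while the Spectrum Scale service is integrated, the file was written to and read from the Spectrum Scale file system through the Transparency daemons.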

Installing the Ambari integration package on an existing Hadoop cluster or a new cluster:

This section provides an overview of adding Spectrum Scale as a service on an existing Hadoop cluster and shows how easily it can be integrated with the current cluster.

Download the integration package or management pack from the wiki and set up the GPFS repository containing the HDFS Transparency connector RPM. The integration package and HDFS Transparency connector can be downloaded from this link (BI 4.2.5 and HDP 2.6).

For BigInsights 4.2.5 and Hortonworks HDP 2.6:

# ./SpectrumScaleIntegrationPackageInstaller-2.4.2.0.bin

For BigInsights 4.2.0 or below, use this integration package:

# ./gpfs.hdfs-transparency.ambari-iop_4.2-1.noarch.bin

Installing this integration package links IBM Spectrum Scale as a service to the existing Hadoop stack. This allows the Ambari GUI wizard to identify IBM Spectrum Scale as a service that can be installed.

Figure: Actions panel for adding the service

Figure: Custom service addition panel in the Ambari Server GUI

Figure: Assignment of GPFS_MASTER to one of the nodes

Figure: Assignment of GPFS_NODE to all the nodes of the cluster

Figure: IBM Spectrum Scale customize service panel, which is used for configuring the file system parameters

Figure: Final review panel of the Spectrum Scale service

Figure: Installation completion of the Spectrum Scale service in Ambari Server

The cluster is now created successfully. This can be verified from the command prompt of one of the nodes by running this command:

# /usr/lpp/mmfs/bin/mmlscluster

GPFS cluster information
========================
  GPFS cluster name:         bigpfs.gpfs.net
  GPFS cluster id:           4605732645497527881
  GPFS UID domain:           bigpfs.gpfs.net
  Remote shell command:      /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp
  Repository type:           CCR

 Node  Daemon node name     IP address   Admin node name      Designation
--------------------------------------------------------------------------
   1   c902f10x09.gpfs.net  172.16.1.91  c902f10x13.gpfs.net  quorum
   2   c902f10x10.gpfs.net  172.16.1.93  c902f10x14.gpfs.net  quorum
   3   c902f10x11.gpfs.net  172.16.1.95  c902f10x15.gpfs.net  
   4   c902f10x12.gpfs.net  172.16.1.97  c902f10x16.gpfs.net  quorum
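
The node table in this output lends itself to scripted checks. As a minimal sketch, assuming the column layout shown above (the helper name `cluster_nodes` is illustrative and is not a Spectrum Scale command), the daemon node names can be extracted and compared against the hosts expected in the cluster:

```shell
#!/bin/sh
# Sketch: print the "Daemon node name" column of mmlscluster output read from stdin.
# Assumes node rows start with a numeric node number, as in the listing above.
cluster_nodes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 4 { print $2 }'
}

# Usage on a cluster node:
#   /usr/lpp/mmfs/bin/mmlscluster | cluster_nodes
```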

The status of all the nodes can also be verified using this command:

# /usr/lpp/mmfs/bin/mmgetstate -a

 Node number  Node name        GPFS state 
------------------------------------------
       1      c902f10x09       active
       2      c902f10x10       active
       3      c902f10x11       active
       4      c902f10x12       active
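
For larger clusters, reading this table by eye becomes tedious. The sketch below, which assumes the three-column layout shown above and uses an illustrative helper name (`inactive_nodes`), lists any node whose state is not `active` and returns a non-zero status if it finds one:

```shell
#!/bin/sh
# Sketch: list nodes whose GPFS state is not "active" in "mmgetstate -a" output (stdin).
# Returns non-zero when at least one such node is found.
inactive_nodes() {
    awk 'BEGIN { bad = 0 }
         $1 ~ /^[0-9]+$/ && NF >= 3 {
             if ($3 != "active") { print $2 ": " $3; bad = 1 }
         }
         END { exit bad }'
}

# Usage on a cluster node:
#   /usr/lpp/mmfs/bin/mmgetstate -a | inactive_nodes && echo "all nodes active"
```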

Figure: IBM Spectrum Scale service panel shown in Ambari Server

The Spectrum Scale service is deployed as a two-component service:
1. GPFS_MASTER
2. GPFS_NODE

GPFS_MASTER creates the file system and adds the GPFS nodes to it. GPFS_NODE components run on the nodes that mount the file system.

After you have successfully added the IBM Spectrum Scale file system, the HDFS service panel displays the status of the Transparency connector daemons, such as the NameNodes and DataNodes.

Figure: HDFS service panel as displayed in Ambari Server after Spectrum Scale integration; the NameNodes and DataNodes are Transparency daemons
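
The same daemon status can also be read outside the web UI through Ambari's REST API. The sketch below only builds the request URL. The host name, cluster name, port 8080 (Ambari's default), and the helper name `ambari_component_url` are assumptions to adjust for your environment.

```shell
#!/bin/sh
# Sketch: build the Ambari REST URL for one component of the HDFS service.
# Arguments are placeholders: Ambari host, cluster name, component name.
ambari_component_url() {
    host=$1; cluster=$2; component=$3
    printf 'http://%s:8080/api/v1/clusters/%s/services/HDFS/components/%s\n' \
        "$host" "$cluster" "$component"
}

# Example (curl prompts for the Ambari admin password):
#   curl -s -u admin "$(ambari_component_url ambari.example.com mycluster NAMENODE)"
```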

Upgrading the IBM Spectrum Scale Service using Ambari Server in BigInsights:

The IBM Spectrum Scale service can also be upgraded using the Ambari Server GUI.

The components can be upgraded independently; the user has to specify the new repository location from which the upgrade is performed.

1. Upgrading Spectrum Scale: This option upgrades IBM Spectrum Scale with the latest RPMs provided in the repository.

2. Upgrading Transparency: This option upgrades the Transparency connector RPM on all the GPFS_NODE nodes.

Figure: Spectrum Scale repo update configuration

Figure: IBM Spectrum Scale service upgrade options

Running TeraSort MapReduce Benchmark with IBM Spectrum Scale

The TeraSort benchmark can be run on the cluster once IBM Spectrum Scale is integrated with the Hadoop cluster. This benchmark is used to test the CPU and memory power of the cluster. The data size can be varied according to the available resources of the cluster.

# hadoop jar /usr/iop/4.2.5.0-0000/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 100000 /tmp/DjTeragen/
WARNING: Use "yarn jar" to launch YARN applications.
17/06/27 08:38:10 INFO impl.TimelineClientImpl: Timeline service address: http://c902f10x10.gpfs.net:8188/ws/v1/timeline/
17/06/27 08:38:10 INFO client.RMProxy: Connecting to ResourceManager at c902f10x10.gpfs.net/172.16.1.93:8050
17/06/27 08:38:11 INFO terasort.TeraSort: Generating 100000 using 2
17/06/27 08:38:11 INFO mapreduce.JobSubmitter: number of splits:2
17/06/27 08:38:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1498566741990_0007
17/06/27 08:38:12 INFO impl.YarnClientImpl: Submitted application application_1498566741990_0007
17/06/27 08:38:12 INFO mapreduce.Job: The url to track the job: http://c902f10x14.gpfs.net:8088/proxy/application_1498566741990_0007/
17/06/27 08:38:12 INFO mapreduce.Job: Running job: job_1498566741990_0007
17/06/27 08:38:17 INFO mapreduce.Job: Job job_1498566741990_0007 running in uber mode : false
17/06/27 08:38:17 INFO mapreduce.Job:  map 0% reduce 0%
17/06/27 08:38:21 INFO mapreduce.Job:  map 50% reduce 0%
17/06/27 08:38:22 INFO mapreduce.Job:  map 100% reduce 0%
17/06/27 08:38:22 INFO mapreduce.Job: Job job_1498566741990_0007 completed successfully
17/06/27 08:38:22 INFO mapreduce.Job: Counters: 31
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=265820
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=164
		HDFS: Number of bytes written=10000000
		HDFS: Number of read operations=8
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=4
	Job Counters 
		Launched map tasks=2
		Other local map tasks=2
		Total time spent by all maps in occupied slots (ms)=4796
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=4796
		Total vcore-milliseconds taken by all map tasks=4796
		Total megabyte-milliseconds taken by all map tasks=17188864
	Map-Reduce Framework
		Map input records=100000
		Map output records=100000
		Input split bytes=164
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=74
		CPU time spent (ms)=2840
		Physical memory (bytes) snapshot=545300480
		Virtual memory (bytes) snapshot=10188140544
		Total committed heap usage (bytes)=580386816
	org.apache.hadoop.examples.terasort.TeraGen$Counters
		CHECKSUM=214574985129000
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=10000000
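
Note that the teragen argument is a row count, not a byte count: each generated row is 100 bytes, which is why 100000 rows produce the 10000000 bytes reported under Bytes Written above. A small helper (the name `teragen_rows` is illustrative) makes it easier to pick a row count for a target data size:

```shell
#!/bin/sh
# Sketch: convert a target data size in bytes to a TeraGen row count (100 bytes per row).
teragen_rows() {
    echo $(( $1 / 100 ))
}

teragen_rows 10000000        # prints 100000, the row count used in the run above
teragen_rows 1000000000      # prints 10000000, i.e. roughly 1 GB of input
```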

The data generated in the teragen stage is stored under the IBM Spectrum Scale mount point. This data is then given as input to terasort.

# hadoop jar /usr/iop/4.2.5.0-0000/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /tmp/DjTeragen/ /tmp/DjTeraoutput
WARNING: Use "yarn jar" to launch YARN applications.
17/06/27 08:45:27 INFO terasort.TeraSort: starting
17/06/27 08:45:28 INFO input.FileInputFormat: Total input paths to process : 2
Spent 147ms computing base-splits.
Spent 2ms computing TeraScheduler splits.
Computing input splits took 150ms
Sampling 2 splits of 2
Making 1 from 100000 sampled records
Computing parititions took 271ms
Spent 423ms computing partitions.
17/06/27 08:45:29 INFO impl.TimelineClientImpl: Timeline service address: http://c902f10x10.gpfs.net:8188/ws/v1/timeline/
17/06/27 08:45:29 INFO client.RMProxy: Connecting to ResourceManager at c902f10x10.gpfs.net/172.16.1.93:8050
17/06/27 08:45:29 INFO mapreduce.JobSubmitter: number of splits:2
17/06/27 08:45:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1498566741990_0008
17/06/27 08:45:30 INFO impl.YarnClientImpl: Submitted application application_1498566741990_0008
17/06/27 08:45:30 INFO mapreduce.Job: The url to track the job: http://c902f10x14.gpfs.net:8088/proxy/application_1498566741990_0008/
17/06/27 08:45:30 INFO mapreduce.Job: Running job: job_1498566741990_0008
17/06/27 08:45:35 INFO mapreduce.Job: Job job_1498566741990_0008 running in uber mode : false
17/06/27 08:45:35 INFO mapreduce.Job:  map 0% reduce 0%
17/06/27 08:45:40 INFO mapreduce.Job:  map 100% reduce 0%
17/06/27 08:45:45 INFO mapreduce.Job:  map 100% reduce 100%
17/06/27 08:45:45 INFO mapreduce.Job: Job job_1498566741990_0008 completed successfully
17/06/27 08:45:45 INFO mapreduce.Job: Counters: 50
	File System Counters
		FILE: Number of bytes read=10400006
		FILE: Number of bytes written=21202869
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=10000214
		HDFS: Number of bytes written=10000000
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=1
		Rack-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=5624
		Total time spent by all reduces in occupied slots (ms)=5330
		Total time spent by all map tasks (ms)=5624
		Total time spent by all reduce tasks (ms)=2665
		Total vcore-milliseconds taken by all map tasks=5624
		Total vcore-milliseconds taken by all reduce tasks=2665
		Total megabyte-milliseconds taken by all map tasks=20156416
		Total megabyte-milliseconds taken by all reduce tasks=19102720
	Map-Reduce Framework
		Map input records=100000
		Map output records=100000
		Map output bytes=10200000
		Map output materialized bytes=10400012
		Input split bytes=214
		Combine input records=0
		Combine output records=0
		Reduce input groups=100000
		Reduce shuffle bytes=10400012
		Reduce input records=100000
		Reduce output records=100000
		Spilled Records=200000
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=129
		CPU time spent (ms)=5770
		Physical memory (bytes) snapshot=4995510272
		Virtual memory (bytes) snapshot=18434174976
		Total committed heap usage (bytes)=5077729280
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=10000000
	File Output Format Counters 
		Bytes Written=10000000
17/06/27 08:45:45 INFO terasort.TeraSort: done
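
The counters at the end of the log give a quick sanity check: a sort should write as many bytes as it read. As a sketch, assuming the console output was saved to a file (the helper name `check_terasort_log` and the log file name are hypothetical), the two counters can be compared as follows. For a full correctness check, the same examples jar also provides a teravalidate program.

```shell
#!/bin/sh
# Sketch: compare "Bytes Read" and "Bytes Written" from a saved TeraSort console log.
# Succeeds only if the job reported success and both counters are present and equal.
check_terasort_log() {
    awk -F= '
        /completed successfully/ { ok = 1 }
        /Bytes Read=/            { r = $2 }
        /Bytes Written=/         { w = $2 }
        END { exit !(ok && r != "" && r == w) }
    ' "$1"
}

# Usage (Hadoop writes its job log to stderr, hence 2>&1):
#   hadoop jar <examples jar> terasort <input> <output> 2>&1 | tee terasort.log
#   check_terasort_log terasort.log && echo "counters match"
```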

Conclusion

IBM Spectrum Scale provides a robust, reliable, enterprise-level alternative to the HDFS file system used in existing Hadoop clusters. The Ambari integration package and management packs provide a simple, easy-to-use method for adding the Spectrum Scale service to existing Hadoop clusters. The HDFS Transparency connector provides a seamless way for existing big data applications and HDFS clients to interact with the IBM Spectrum Scale file system as they would with HDFS. The Ambari server provides a convenient console to manage, monitor, and upgrade the IBM Spectrum Scale service. The Hortonworks HDP 2.6 Hadoop distribution is also now certified to run with the IBM Spectrum Scale file system.

For more detailed instructions, refer to the IBM Knowledge Center (Big data and analytics).

Comments

Archive User

Thu June 29, 2017 06:31 AM

Thanks Jean. I looked up Binfer. That's a tool to transfer large files. I didn't understand the reference here.

Archive User

Wed June 28, 2017 10:40 AM

This was informative. Please allow me to add to this conversation. Have you heard about Binfer? Very easy tool to transfer big data.

Archive User

Wed June 28, 2017 04:26 AM

Very nice blog Deepak. Really a nice piece of work. Lots of technical details about IBM Spectrum Scale on Big Data and Analytics front.

Nathan Falk

Tue June 27, 2017 02:47 PM

Good work, Deepak! Lots of technical details in here about how Spectrum Scale's HDFS transparency service plugs in and what advantages Spectrum Scale can have over traditional HDFS.
