File and Object Storage


Benchmarking IBM Storage Scale HDFS transparency on Hadoop clusters

By Philipp Macher posted 24 days ago



This blog post will focus on the benchmarking of HDFS transparency on Hadoop clusters, as well as the automation of the process. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers. These clusters are used to process big data, running operations on terabytes of information. Hadoop is designed to scale from single servers up to thousands of machines.

Instead of installing the Hadoop components themselves, we use HDP (Hortonworks Data Platform). HDP is built on Hadoop and contains Hadoop features combined with management tools. One of these tools (which will be used later) is called Apache Ambari. It is a tool used to set up, manage, and monitor the whole HDP cluster.

By default, Hadoop uses a file system called Hadoop Distributed File System (HDFS). But instead of HDFS, we will be using IBM Storage Scale (‘Scale’ for short). Scale is a high-performance clustered parallel file system developed by IBM. Contrary to HDFS, Scale is POSIX compatible and offers a global namespace for all your data, allowing parallel access by various data analytics applications and through many protocols like SMB, NFS and OBJECT. POSIX compatibility makes it possible to extend Hadoop with existing operational processes, like backing up directly to tape.

IBM Storage Scale HDFS transparency (‘HDFS transparency’ for short) is developed by IBM to enable the use of Scale as the underlying file system for HDFS. The Hadoop cluster can’t tell the difference: storing and retrieving data works just as it does without Scale.

This post will give an insight into some technologies used for automated benchmarking in this context and how they can be used.



As stated in the introduction, part of this blog will cover the automation of the benchmarking process.

As setting up a Hadoop cluster is a rather long process, it can save a lot of time to run an automation instead of going through the whole setup process over and over by hand. Since we want to test different HDFS transparency versions as well as clusters without transparency, the cluster setup will be executed repeatedly.

Also, running an automation multiple times from the same state with the same parameters should result in exactly the same cluster, which is essential for the comparison of different benchmarks.


For automation, different techniques can be considered. We chose Ansible as a framework to control the whole process. This section will give a brief introduction.

Ansible is, in theory, a declarative (and therefore idempotent) programming language. This means that a program specifies a state the target system has to be in, and Ansible will perform steps to bring the target into this state, if needed. If the target is already in the right state, nothing should happen; therefore, running an Ansible “program” a second time should not change anything. This concept of “repeating does not change anything” is called idempotency. In practice, preserving this idempotency can be hard for the programmer and is therefore sometimes neglected. An example of the effort needed to ensure idempotency is given in the “Ansible Modules” section.
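As an illustration of the concept (a hedged Python sketch, not Ansible code): an idempotent “ensure” operation can be run any number of times with the same end state, while a naive “append” cannot.

```python
def ensure_line(lines, line):
    """Idempotent: adds the line only if it is missing."""
    if line not in lines:
        lines.append(line)
    return lines

def append_line(lines, line):
    """Not idempotent: every run appends another copy."""
    lines.append(line)
    return lines

config = ["PermitRootLogin no"]
ensure_line(config, "PasswordAuthentication no")
ensure_line(config, "PasswordAuthentication no")  # second run changes nothing
# config is now ["PermitRootLogin no", "PasswordAuthentication no"]
```

An Ansible task describing a state behaves like `ensure_line`: rerunning it leaves the system untouched once the state is reached.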

Ansible “programs” are called playbooks and consist of a list of so-called plays. Each of these plays is a list of tasks, and all tasks of the same play will be run sequentially on the same set of hosts. Usually a task defines a state the hosts should be in (package x shall be installed, path y shall be a directory, etc.). Every Ansible file is written in YAML (originally “Yet Another Markup Language”, now officially “YAML Ain’t Markup Language”). Variables are encapsulated in {{ double curly braces }}, as Ansible uses Jinja2 templating.

Ansible is a great choice for automating, but it has a few drawbacks. Some tasks, as well as more complex logic, can be quite complicated to implement in Ansible. Idempotency is an important feature of Ansible; however, it often means considerably more effort for the programmer and quite a different programming scheme than most people are used to.

An example play that ensures a yum package is installed (remember, we only define a state to reach) on our Ambari host:

  - name: Install and configure ambari-agent
    hosts: ambari
    become: yes
    gather_facts: no
    tasks:
      - name: Install ambari-agent
        yum:
          name: ambari-agent
          state: installed

As one might notice, the structure of YAML is quite similar to JSON, while its notation is different.
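To make the similarity concrete, here is roughly the same play expressed as JSON and loaded with Python’s standard library (a sketch; the `yum` module key is our assumption about the full play):

```python
import json

# The ambari-agent play from above, written as JSON instead of YAML.
play_json = """
{
  "name": "Install and configure ambari-agent",
  "hosts": "ambari",
  "become": true,
  "gather_facts": false,
  "tasks": [
    {"name": "Install ambari-agent",
     "yum": {"name": "ambari-agent", "state": "installed"}}
  ]
}
"""

play = json.loads(play_json)
print(play["hosts"])            # -> ambari
print(play["tasks"][0]["yum"])  # -> {'name': 'ambari-agent', 'state': 'installed'}
```

Both notations describe the same nested structure of mappings and lists; YAML merely replaces braces and quotes with indentation.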

Cluster creation

This section will focus on the creation of a Hadoop cluster including the automation of the process using a REST API. As the complete setup consists of a lot of steps, often detailed and platform-specific, this section will focus on the most important parts of this process.

The whole Hadoop cluster in our case is managed from within a VM running Apache Ambari, a tool for the management and deployment of Hadoop clusters. It offers a web UI and an API while collecting metrics of all involved nodes and offering centralized control of the whole Hadoop cluster.

For example, the dashboard of a ready but idle cluster shows all services and their metrics at a glance.

More info about Ambari can be found on the Apache Ambari website.

Choosing a cluster structure

To set up a Hadoop cluster, Ambari needs to distribute the different services used to the available hosts and needs a list of hosts that are supposed to be part of the cluster. On our hardware, 10 worker nodes, two virtual machines (one running Ansible, one running Ambari), and an Enterprise Storage Server (ESS) are available. The ESS is a cluster of storage nodes, which will be used to store all the Hadoop data later on.

Ambari provides a UI to guide users through the process of setting up a cluster. As automation is one of our main goals, using the UI will be avoided during the final setup process. Instead, we will use Ambari’s API to perform the necessary tasks, with each API call executed by Ansible. To select a cluster structure, two files containing the desired cluster structure can be uploaded to Ambari. One file, called the “blueprint”, declares host groups and assigns services and configurations to them. As this file gets very big (over 3000 lines of JSON) and complicated, we avoid creating it dynamically. Instead, we used the UI once to create a cluster configuration and downloaded its blueprint as a reference.

A second file is the cluster template, which is quite simple and assigns host addresses to the host groups defined in the blueprint. This file can be edited easily if needed. We also created it with Ambari’s UI the first time.

Setting up the desired cluster

This subsection will give a more detailed insight into the steps it takes to set up the cluster with the chosen structure, the formatting of disks, and the communication with Ambari’s API.

Setting up disks

Before instructing Ambari to deploy its services, some of the nodes need to have some disks partitioned and mounted, as HDFS needs storage on its nodes. The logic and details regarding the formatting were implemented in a Python script, which gets called from Ansible.

How the disks should be formatted is defined in a variable file, like all Ansible variables used here. To make it easier to locate the right variable, they were split up into different files. Having all the variables organized this way makes it easier to log configurations when needed, as they can easily be read by every playbook and offer a single location where settings can be tweaked, so no user has to read through actual playbooks to change configs. An example config entry used to specify a disk and the file system with which it should be formatted:

      size: 3.7T
      filesystem: ext3
      mount: /disk01

As it is not impossible for the disk names to change, another way is needed to ensure no crucial disk (the one containing the OS, for example) is formatted. Therefore a size check is performed, since the crucial disks are considerably smaller than the ones which have to be formatted. For this check, the expected size for each disk is stated in the config, and the script will abort when a disk has a different size than expected.
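The check boils down to comparing expected sizes from the config against the sizes the system actually reports; a hedged sketch of the logic (the actual script differs, and parsing `lsblk` output is only one way to obtain the real sizes):

```python
def check_disk_sizes(expected, actual):
    """Abort (raise) unless every disk to format has exactly the expected size.

    expected: {"/dev/sdb": "3.7T", ...} taken from the variable file.
    actual:   {"/dev/sda": "446G", "/dev/sdb": "3.7T", ...} as reported by
              the system, e.g. parsed from `lsblk -o NAME,SIZE`.
    (Device names and sizes here are illustrative.)
    """
    for disk, size in expected.items():
        real = actual.get(disk)
        if real != size:
            raise SystemExit(
                f"Refusing to format {disk}: expected {size}, found {real}"
            )

# /dev/sdb matches its expected size and may be formatted;
# any mismatch aborts the whole run before a single disk is touched.
check_disk_sizes({"/dev/sdb": "3.7T"},
                 {"/dev/sda": "446G", "/dev/sdb": "3.7T"})
```

Aborting on the first mismatch is deliberately conservative: a renamed device is cheap to fix, an overwritten OS disk is not.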

Uploading the Blueprint with Ambari’s REST API

To upload the blueprint, Ambari’s API is used. This section introduces the API, ending with the example of uploading a blueprint.

Ambari offers a very detailed REST API, which we will use to control Ambari. Documentation can be found in the ambari-server git repository. As the references sometimes are not that detailed, exploring the API by navigating around is quite helpful. This is easy, as every page contains links to all its subpages. To explore the API in your browser, first authenticate on the Ambari main page. For instance, to see a list of uploaded blueprints, open http://ambari:8080/api/v1/blueprints to get

"href" : "http://ambari:8080/api/v1/blueprints",
"items" : [
"href" : "http://ambari:8080/api/v1/blueprints/mycluster",
"Blueprints" : {
"blueprint_name" : "mycluster"

Here it can be seen that only a single blueprint called “mycluster” has been uploaded. For more details, the path of an item’s page is listed in its “href” attribute.
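Because every response embeds the links to its subpages, walking the API can be scripted; a minimal sketch using only the Python standard library (the response shape matches the listing above):

```python
import json

# A response as returned by /api/v1/blueprints (shape as shown above).
response = json.loads("""
{
  "href": "http://ambari:8080/api/v1/blueprints",
  "items": [
    {"href": "http://ambari:8080/api/v1/blueprints/mycluster",
     "Blueprints": {"blueprint_name": "mycluster"}}
  ]
}
""")

# Collect the detail pages one could visit next.
subpages = [item["href"] for item in response["items"]]
print(subpages)  # -> ['http://ambari:8080/api/v1/blueprints/mycluster']
```

Fetching each of these hrefs in turn (e.g. with urllib and basic auth) would descend one level deeper into the API.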

In total, the Ambari API offers a lot of functionalities, which will be very useful later on, as this makes detailed controlling of the cluster setup possible.

To upload our desired blueprint (and the cluster template in a second step), we will perform a POST request to the endpoint http://ambari:8080/api/v1/blueprints/mycluster (where ‘mycluster’ is the blueprint name; different blueprints can be uploaded under different names):

- name: Create Cluster
  hosts: localhost
  vars_files:
    - var_files/vars.yaml
  tasks:
    - name: Register blueprint
      uri:
        url: "{{ ambari_url }}/api/v1/blueprints/mycluster"
        url_username: admin
        url_password: admin
        force_basic_auth: yes
        headers:
          X-Requested-By: ansible
        status_code:
          - 200
          - 201
          - 202
        method: POST
        body_format: json
        body: "{{ lookup('file', blueprint_native if native else blueprint_scale) }}"

This posts the blueprint as a JSON payload.

This example contains a play (named “Create Cluster”) at the top level, specifying the host on which the tasks will be executed, the files from which variables will be imported, and the tasks that shall be executed. The only task has the name “Register blueprint” and executes the uri module, providing the parameters needed. In the ‘url’ parameter, {{ ambari_url }} is a variable imported from a variable file defined in vars_files. Looking at the ‘body’ parameter, lookup() is an Ansible extension to Jinja2 that reads the contents of a file. The file to read is specified by a Python inline if: it is read from the path stored in {{ blueprint_native }} if {{ native }} is true, or from the path stored in {{ blueprint_scale }} otherwise. As this example shows, Jinja2 is very useful for including some inline logic in Ansible. ‘Native’, in this context, refers to a setup of HDFS without Storage Scale.

After uploading the blueprint, the cluster template will be uploaded as well. The task is very similar and part of the same play:

    - name: Upload clustertemplate and start installation
      uri:
        url: "{{ ambari_url }}/api/v1/clusters/mycluster"
        url_username: admin
        url_password: admin
        force_basic_auth: yes
        headers:
          X-Requested-By: ansible
        status_code:
          - 200
          - 201
          - 202
        method: POST
        body_format: json
        body: "{{ lookup('file', clustertemplate_native if native else clustertemplate_scale) }}"

When the cluster template is uploaded to http://ambari:8080/api/v1/clusters/clustername, the deployment of the cluster is started. The cluster template contains the name of the blueprint to use and the name of the desired cluster.

Ansible modules

The above request to register a blueprint is a good example of a non-idempotent step, as it will simply fail if the blueprint is already present. To ensure idempotency, we would have to download the existing blueprint, compare it to the one we plan to upload, and then decide whether to upload it or not. This check can’t be done smoothly in Ansible itself. It would be possible to do the comparison and skip the upload, but that would add multiple tasks to the playbook for every task performing a similar API request (the cluster template upload, for example).
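The decision logic itself is simple; a hedged Python sketch (function name and shape are ours, not Ambari’s), comparing the blueprint already registered with the one we intend to upload:

```python
import json

def needs_upload(existing_json, local_json):
    """Return True when the blueprint must be (re)uploaded.

    existing_json: body of GET /api/v1/blueprints/mycluster, or None when
                   no blueprint of that name is registered yet.
    local_json:    contents of the local blueprint file.
    A real implementation would first strip fields Ambari adds on its own
    (href, etc.) before comparing.
    """
    if existing_json is None:
        return True
    # Compare parsed structures, so formatting differences don't matter.
    return json.loads(existing_json) != json.loads(local_json)

local = '{"Blueprints": {"blueprint_name": "mycluster"}}'
print(needs_upload(None, local))   # -> True (nothing registered yet)
print(needs_upload(local, local))  # -> False (identical, skip the POST)
```

Wrapping exactly this kind of check into a module keeps the playbook down to one task while making the step idempotent.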

Another example is the following API request, which comes directly after the request above uploading the blueprint and directly before starting the installation by uploading the cluster template:

- name: Add new repository_version
  tags: repo
  ignore_errors: no
  uri:
    url: "{{ ambari_url }}/api/v1/stacks/HDP/versions/3.1/repository_versions/"
    url_username: admin
    url_password: admin
    force_basic_auth: yes
    headers:
      X-Requested-By: ansible
    status_code:
      - 200
      - 201
      - 202
    method: POST
    body_format: json
    body:
      RepositoryVersions:
        display_name: HDP-3.1
        id: 1
        repository_version: 3.1
      operating_systems:
        - OperatingSystems:
            os_type: redhat7
          repositories:
            - Repositories:
                base_url: "{{ hdp_repo_url }}"
                repo_id: HDP-3.1
                repo_name: HDP
            - Repositories:
                base_url: "{{ hdp_gpl_repo_url }}"
                repo_id: HDP-3.1-GPL
                repo_name: HDP-GPL
            - Repositories:
                base_url: "{{ hdp_utils_repo_url }}"
                repo_id: HDP-UTILS-
                repo_name: HDP-UTILS

This request changes a few config entries in Ambari, providing URLs to the different repositories Ambari needs to install HDP. Because communication with Ambari is crucial when operating on Hadoop clusters, these requests occur very often and tend to increase the size of playbooks significantly. A solution to both problems, ensuring idempotency and avoiding large tasks, is the creation of Ansible modules. An Ansible module is a script written in Python and later called from Ansible. Ansible provides all arguments given in a Python dictionary, and the module will exit providing a result dictionary. In the returned dictionary the module can state whether it failed and whether it performed changes on the host system.

Calling such a module in Ansible looks like the following:

      - name: Stop Yarn and HDFS
        ambari_service:
          url: "{{ ambari_url }}/api/v1/clusters/mycluster/services"
          username: admin
          password: admin
          state: stopped
          services:
            - YARN
            - HDFS

This module is called ‘ambari_service’; it gets only the necessary information and performs the API requests. It has default values for some parameters, which further shortens the call. It is good practice to use modules for abstraction when more complicated logic needs to be implemented. Others should be able to understand the playbook when reading it: which steps it will perform and what the results will be. Ensuring the behavior is as intended can be hidden from the playbook. Also, many task details are easier to implement in Python, the language Ansible modules are written in. Documentation on how to write modules can be found on the Ansible website.
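The contract between Ansible and a module is just “arguments in, result dictionary out”; a hedged sketch of what the core of an ambari_service-style module might look like (names and structure are illustrative, and real modules are built on Ansible’s AnsibleModule helper rather than a bare function):

```python
def run_module(params):
    """Core of a hypothetical ambari_service-style module.

    params arrive from Ansible as a dict; the returned dict reports
    'changed' and 'failed', which Ansible uses for its summary and
    which preserves idempotency: no API call when nothing must change.
    """
    wanted = params["state"]                           # e.g. "stopped"
    current = params.get("_current_state", "started")  # stand-in for a GET request

    if current == wanted:
        return {"changed": False, "failed": False}

    # Here the real module would PUT the new state to the Ambari API
    # for every service listed in params["services"].
    return {"changed": True, "failed": False}

print(run_module({"state": "stopped", "_current_state": "stopped"}))
# -> {'changed': False, 'failed': False}
```

Checking the current state before acting is exactly what makes the module idempotent where the raw uri task was not.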

Scale Clusters

One of our targets is the comparison of HDFS with Scale and HDFS without it (which we call a ‘native’ setup for short). Therefore both setups have to be supported. This section describes the extra steps it takes to set up a Scale cluster, compared to a native one.

Setting up Scale includes nearly all the steps of setting up a native cluster. However, a different cluster blueprint is used. Also, no disks on our worker nodes have to be formatted, as the ESS provides all the storage needed. After running the native setup, a Scale-specific playbook is run, which installs the ‘IBM Spectrum Scale service’ (Mpack), providing an additional service to Ambari to control HDFS transparency.

IBM Storage Scale gets installed and deployed, and keys of all Scale nodes are exchanged. To install Scale, the installer is configured with the desired nodes, and the installer node and cluster name are specified:

  - name: Configure and run installer
    hosts: gpfs-installer
    become: yes
    become_user: root
    vars_files:
      - var_files/GPFS_vars.yaml
    tasks:
      - name: Set installer IP
        shell: "{{ installer_script }} setup -s {{ installer_ip }}"

      - name: Set cluster name
        shell: "{{ installer_script }} config gpfs -c {{ gpfs_local_cluster_name }}"

      - name: Set port range
        shell: "{{ installer_script }} config gpfs --ephemeral_port_range 60000-61000"

      - name: Set callhome to disabled
        shell: "{{ installer_script }} callhome disable"

      - name: Add nodes to installer
        include_tasks: add_to_scale_installer.yaml
        loop:
          - "{{ groups['scale'] }}"
          - "{{ groups['ambari'] }}"

      - name: Do install
        shell: "{{ installer_script }} install"

      - name: Do deploy
        shell: "{{ installer_script }} deploy"

After this, the Scale cluster on the ESS is added as a remote mount in our newly created Scale cluster, so all nodes can access the storage of the ESS.

The only thing still missing is HDFS transparency, which is installed in the next step. In the last step, the newly installed service gets added to Ambari. Ambari’s modularity comes in handy here, as this new service can’t be included in the blueprint, because it is not known to Ambari beforehand. Adding the new service to the existing HDP cluster consists of nine API requests.

Cluster performance benchmarks

This section will give an insight into using two benchmarking frameworks, TestDFSIO and SLG.

TestDFSIO is a tool for benchmarking the file system performance of a Hadoop cluster. It performs reads or writes of a selectable number of files of any requested size. It outputs a log file containing statistics, most importantly the given parameters, throughput in mb/sec, the average I/O rate per file, and the standard deviation of the I/O rate. These results can easily be collected across multiple read and write runs. The source code is available in the Hadoop GitHub repository.

In an Ansible playbook, looping over read and writes a variable number of times could look like this:

- name: Run testdfsio
  become: yes
  become_user: hdfs
  shell:
    executable: /bin/bash
    cmd: |-
      hadoop jar {{ testdfsio_jar }} TestDFSIO -{{ item[1] }} \
      -nrFiles {{ testdfsio_files_num }} \
      -fileSize {{ testdfsio_files_size_mb }} \
      &> {{ temp_log_file_dest_path }}/{{ testdfsio_subfolder }}/{{ item[1] }}_{{ testdfs_log_file_name }}
  # Loop over combinations of iteration index and write/read operation
  loop: "{{ range(iterations) | product(['write', 'read']) | list }}"
  when: execute is defined

TestDFSIO arguments are defined as variables in a dedicated variable file. The loop will create a list of pairs: 0, write; 0, read; 1, write; 1, read; etc., where read/write is accessed as item[1] and the index as item[0] (which is not used here, only in the log file name). As execution and evaluation tasks are both imported from the same file, choosing between execution and evaluation is done by defining {{ execute }} or {{ evaluate }}, which is easy to do when including the tasks:

- include_tasks:
    file: performance_tasks/testdfsio.yaml
  vars:
    execute: yes
  when: use_testdfsio
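The pair list produced by the loop in the task above is exactly what Python’s itertools.product yields, since Jinja2’s product filter wraps it:

```python
from itertools import product

# Equivalent of the Jinja2 expression
#   range(iterations) | product(['write', 'read']) | list
iterations = 2
pairs = list(product(range(iterations), ["write", "read"]))
print(pairs)  # -> [(0, 'write'), (0, 'read'), (1, 'write'), (1, 'read')]

# item[0] is the iteration index, item[1] the operation:
for item in pairs:
    log_name = f"{item[1]}_testdfsio_{item[0]}.log"  # hypothetical log file name
```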

The output contains a lot of details, but the most important part is very easy to parse as it consists of key/value entries:

23/10/30 12:31:21 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
23/10/30 12:31:21 INFO fs.TestDFSIO:             Date & time: Mon Oct 30 12:31:21 CET 2023
23/10/30 12:31:21 INFO fs.TestDFSIO:         Number of files: 10
23/10/30 12:31:21 INFO fs.TestDFSIO:  Total MBytes processed: 1000000
23/10/30 12:31:21 INFO fs.TestDFSIO:       Throughput mb/sec: XXX
23/10/30 12:31:21 INFO fs.TestDFSIO: Average IO rate mb/sec: XXX
23/10/30 12:31:21 INFO fs.TestDFSIO: IO rate std deviation: XXX
23/10/30 12:31:21 INFO fs.TestDFSIO: Test exec time sec: XXX
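These key/value lines can be parsed with a few lines of Python; a sketch (the field names follow the sample output above, the regular expression is ours, and the numeric values are illustrative placeholders):

```python
import re

LOG = """\
23/10/30 12:31:21 INFO fs.TestDFSIO:         Number of files: 10
23/10/30 12:31:21 INFO fs.TestDFSIO:  Total MBytes processed: 1000000
23/10/30 12:31:21 INFO fs.TestDFSIO:       Throughput mb/sec: 123.45
"""

def parse_testdfsio(log):
    """Collect the 'key: value' pairs after the fs.TestDFSIO tag."""
    results = {}
    for line in log.splitlines():
        # Lazily match the key up to the first colon followed by whitespace,
        # so times like 12:31:21 inside values are not split.
        m = re.search(r"fs\.TestDFSIO:\s+(.+?):\s+(.+)$", line)
        if m:
            results[m.group(1).strip()] = m.group(2).strip()
    return results

stats = parse_testdfsio(LOG)
print(stats["Throughput mb/sec"])  # -> 123.45
```

Collecting these dictionaries for every read and write run gives a table that is easy to average and compare across cluster configurations.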

Apache Synthetic Load Generator (SLG) is another framework; it measures the performance of the NameNode, the coordination node of HDFS. It generates load on the NameNode and measures response times of different operations, allowing the user to choose between read, write, and list operations, or a mix of them. SLG needs more steps to run, as a test space has to be created and filled with data before the load generation. An example:

mkdir testLoadSpace
$yarn jar $clientjar NNstructureGenerator -outDir testLoadSpace \
    -numOfFiles $numOfFiles
$yarn jar $clientjar NNdataGenerator -inDir testLoadSpace -root /tmp/testLoadSpace

m="####### Running ${desc} ($noThreads threads)" ; echo "${m}" ; echo "${m}" >&2
$yarn jar $clientjar NNloadGenerator -root /tmp/testLoadSpace \
    -numOfThreads $noThreads -readProbability ${rp} \
    -writeProbability ${wp} -elapsedTime 60

First, a folder on the local machine is created, and the SLG structure generator is executed, randomly generating files and subdirectories on the local filesystem. Then, the SLG data generator copies these files from the local folder to an HDFS folder (/tmp/testLoadSpace in this case). After printing some information about the current test, the load generator is executed. The load generator creates a table for all the directories and files in the test space and spawns worker threads, which will read files, write files, and list directories randomly. During the process, operation durations and operation throughput are monitored and printed.
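The read and write probabilities passed above steer which operation each worker thread performs; an illustrative Python sketch of such a weighted choice (our simplification, not SLG’s actual code):

```python
import random

def pick_op(rp, wp, rng=random):
    """Pick an operation: read with probability rp, write with wp,
    list with the remaining probability, mirroring what SLG's
    -readProbability and -writeProbability options suggest."""
    r = rng.random()
    if r < rp:
        return "open"    # read a random file
    if r < rp + wp:
        return "create"  # write and close a random file
    return "list"        # list a random directory

random.seed(0)
ops = [pick_op(0.3, 0.3) for _ in range(1000)]
# Roughly 300 reads, 300 writes, and 400 lists are expected.
```

With rp + wp < 1, the remainder automatically becomes the share of list operations, which is how a "mix" run distributes its load.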

The commands above would produce an output like this:

Printing to testLoadSpace/dirStructure
Printing to testLoadSpace/fileStructure
####### Running mix (800 threads)
Running LoadGenerator against fileSystem: hdfs://node27s:8020
Result of running LoadGenerator against fileSystem: hdfs://node27s:8020
Average open execution time: XXXms
Average list execution time: XXXms
Average deletion execution time: XXXms
Average create execution time: XXXms
Average write_close execution time: XXXms
Average operations per second: XXXops/s

More detailed information on the usage of SLG can be found in the usage guide.

While running and evaluating both frameworks is a key part of measuring cluster performance, it is just as important to log the cluster state before a benchmark, to be able to compare benchmarks that were run under different conditions.


This blog covered technologies needed to perform automated benchmarking of HDFS transparency and HDFS on Hadoop clusters. We talked about the usage of Ansible for the coordination of multiple hosts, took a look at some YAML examples, glimpsed at Jinja2, and covered Ansible modules. We also mentioned that using Ansible modules is a best practice for adding complicated logic to Ansible playbooks. Another practice we saw, which proved useful, was storing variables in a central location in dedicated variable files with different files for different categories.

We also saw how Ambari can be used to create and alter Hadoop clusters, using its REST API instead of its web UI to enable automation and integration into our Ansible playbooks. To show what usage of Ambari’s API can look like, examples were given for creating a cluster and changing some settings. We also covered the main steps of designing and creating such a cluster, including the extra steps to install the Mpack, HDFS transparency, and Scale on the cluster.

In the Cluster performance benchmarks section, we looked at two frameworks, TestDFSIO and SLG, to benchmark file system and NameNode performance. We also showed the commands needed to run both of them and possible outputs created by the frameworks.


Apache, Apache Hadoop, Hadoop, Apache Ambari and Ambari are trademarks of The Apache Software Foundation. Ansible is a registered trademark of RED HAT, INC. All other marks mentioned may be trademarks or registered trademarks of their respective owners.