File and Object Storage

I/O Workflow of Hadoop workloads with IBM Spectrum Scale and HDFS Transparency

By CHINMAYA MISHRA posted Thu November 19, 2020 01:03 AM


Hadoop workloads like MapReduce, Hive, HBase and many others leverage the HDFS client to perform I/O to Spectrum Scale through HDFS Transparency. In this blog, we take a closer look at the workflow of read and write operations performed by the HDFS client in a HDFS transparency environment. This writing is intended for Hadoop solution architects/designers who may be considering to move to a IBM Spectrum Scale based Big Data solution from a native HDFS based environment.

HDFS Transparency with IBM Spectrum Scale supports a variety of shared storage including the IBM Elastic Storage Server (ESS) and other IBM storage platforms. This is explained in detail in IBM documentation:

This blog focuses on shared storage, and IBM ESS is cited as example.

Abbreviations used in this blog:

  • NN    means NameNode
  • DN    means DataNode
  • Scale means “IBM Spectrum Scale”
  • ESS means “IBM Elastic Storage Server”


The benefits of using shared storage for Hadoop over the traditional scale-out storage offered by HDFS can be broadly summarized as the following

  • Compute and storage can be scaled separately/flexibly as business requirements grow.
  • High storage efficiency compared to the typical 3x storage overhead of native HDFS.
  • Enterprise grade storage features can be offered for Hadoop data
  • Advanced data protection against disk/node failure
  • Negligible additional network bandwidth from disk rebuilds as may be the case with HDFS.

Replication considerations

Before getting into the read & write workflow, its important to understand the Replication considerations for HDFS Transparency with shared storage, since it intricately affects the way read & write works.

For native HDFS based scale-out storage, dfs.replication is set to 3 by default. Storage redundancy is achieved by ensuring there are 3 or more replicas of a block. This can also be set by the dfs client. Some teragen jobs would set setReplication(10). As long as there are 10 nodes, the data would be replicated across those nodes.

However, with ESS based shared storage, ESS natively manages the data redundancy. Typically, each block on the Scale filesystem is stripped into 8+2P (8 stripes + 2 parity bits) or 8+3P, so the scale out replication-based redundancy is unnecessary. Transparency provides the following options:

  • provided by in gpfs-site.xml
    • If this is set, HDFS Transparency uses logic for shared storage.
  • gpfs.replica.enforced=gpfs provided by in gpfs-site.xml
    • This is the recommended setting for shared storage & the default setting since HDFS Transparency 3.1.1-1.
    • Scale will manage replication natively, based on filesystem level settings (-r, -R). For shared storage, usually the replication factor within the Scale filesystem is 1 or 2. For example:
[root@localhost ~]# mmlsfs <filesystem-name>  -r -R
flag                value                    description
------------------- ------------------------ -----------------------------------
 -r                 1                         Default number of data replicas
 -R                 2                        Maximum number of data replicas
    • dfs.replication=<N> will be ignored for replication.

  • gpfs.replica.enforced=dfs provided by in gpfs-site.xml
    • The hdfs-site.xml parameter dfs.replication=<number=N> is honoured. N replicas are created for each HDFS block across failure groups (FG). FGs in an ESS/Scale filesystem can be found with the “mmlsdisk” command.
    • This means that the FG count on the ESS/Scale filesystem is a factor to be considered. If dfs.replication 2 but there is only 1 FG, setrep(2) fails.
    • This is the recommended setting for FPO.
    • This setting is NOT recommended for shared storage because it incurs N times more storage overhead for the data being stored.


  • dfs.replication in hdfs-site.xml (provided by HDFS)
    • HDFS Transparency ignores this parameter if gpfs.replica.enforced is set to gpfs
    • This HDFS parameter is also a client side setting, DFS clients still thinks there are N replicas across N datanodes when dfs.replication=<N>. If one of the DN goes down, client can retry with another DN so that the application doesn’t fail.

Based on above considerations, a good strategy for replication & redundancy for HDFS Transparency with shared storage would be:

Avoid using dfs.replication=1 because dfs client thinks that there is only 1 replica and does not retry other DNs if it meets any error.


The rest of the document assumes the above settings for HDFS Transparency:


WRITE Workflow with HDFS Transparency:


Let’s consider a dfs client writing a 1 GB file to native HDFS using the following command:

[root@localhost ~]# hdfs dfs   -put /tmp/1GB-file.csv  /user/hdfs/

In native HDFS, dfs client splits the file into {dfs.blocksize} blocks (128MB by default). When the write is completed, each block is saved as a regular OS file of 128MB file size on a local filesystem (e.g. ext4) on a particular disk & node. The write logic in native HDFS follows the following sequence:

  1. dfs client asks NameNode a block to write, by addBlock RPC
  2. NameNode returns back to dfs client with 3 chosen DNs which will be used to save 3 replicas
  3. dfs client, writes the data to the 1st DN from the list, over RPC.
  4. DataNode chooses one local directory from the list of configured DN directories ( Typically, each directory is usually configured on a local filesystem over a local whole disk. DN writes the data block as a file on the local filesystem.
  5. For replication, 1st DN writes to the 2nd DN, 2nd DN writes to 3rd. This is called pipeline write. it saves the network bandwidth of dfs clients.
  6. After the result is sent back to dfs client, 3 replicas were saved in 3 DNs
  7. dfs client repeat 1) to 6) for the next block until all blocks are written

Note that dfs client processes all the blocks serially, which means until all replicas of one block are written, dfs client won't write the next block. This is because hdfs protocol only supports sequential write

Now let’s examine how the WRITE workflow differs from the native HDFS example when HDFS Transparency & shared storage come into picture:

  1. dfs client asks NameNode a block (128MB) to write, by addBlock RPC
  2. NameNode returns to dfs client with {dfs.replication} number of Let’s see how DNs will be selected if dfs.replication=3
  3. The NN provides to the dfs client a list of DNs and also the block location fetched from the Scale filesystem. Block location equals offset of the Scale file, where the current block should be written to. The offset is calculated from the block number, so 1st block is 0, 2nd block is 128MB, etc.
  4. dfs client sends the Scale file path and file offset to the 1st DN in the DN list.
  5. The DN which has the Scale client, opens the GPFS file, seeks to the offset and writes the data block. Unlike native HDFS, there is no pipeline write.
  6. DN sends back acknowledgement to the dfs client after it receives the data from socket but before writing to disk. This is same logic as native HDFS and can be controlled by application to make it a synchronous write if needed.
  7. Scale client split the HDFS block (128MB) it received for write, into multiple Scale blocks. Assuming block size of the ESS filesystem to be 16MB, each chunk is split into 8 Scale blocks,
  8. Scale takes care of replication based on mmlsfs -r -R setting.
  9. dfs client repeat 1) to 8) for the next block until all blocks are written. Like native HDFS, the blocks will be written serially
  10. If a DN goes down while its writing a block, the dfs client will retry to write the failed block with the next DN in it’s list.

HDFS WRITE workflow with HDFS Transparency
WRITE Workflow with HDFS Transparency

In native HDFS, DNs perform IO to local disks for user data. In Transparency, DNs perform IO to a configured Scale directory [<gpfs.mnt.dir>/<>] by leveraging local Scale client. The DNs act as a conduit to the data flow to the Scale filesystem.

A form of distributed dfs client is the DFSIO program, which is used for benchmarking DFS IO bandwidth. When a DFSIO program is run, multiple mappers are created, one for each file to be read/write. Each mapper manages writing of one file by leveraging the logic mentioned above.

READ Workflow with HDFS Transparency:

Let’s consider a dfs client performing READ of 1 GB file to HDFS Transparency using the following command:

[root@localhost ~]# hdfs dfs -get  /user/hdfs/1GB-file.csv   /tmp

The READ logic in HDFS Transparency follows the following sequence:

  1. dfs client asks NameNode a block (128MB) to read, by getBlockLocation RPC
  2. NameNode returns to info dfs client with {dfs.replication} number of DNs. See the previous example of hdfs -put to see which DNs are selected.
  3. The NN also returns to the client the offset of the Scale file where the data can be read from. The offset is calculated from the block number, so 1st block is 0, 2nd block is 128MB, etc.
  4. The NN also reads ahead the location of 10 (configurable) subsequent blocks and provides it to the dfs client. This mechanism spares the dfs client 10 future RPC calls to the NN for the next blocks.
  5. dfs client sends the Scale file path and file offset to the 1st DN in the DN list.
  6. The DN which has the Scale client, opens the Scale file, seeks to the offset and reads the data.
  7. dfs client repeat 1) to 6) for the next block until all blocks are read into the client buffer.
  8. Note that one dfs client does NOT read the blocks parallelly, rather one after the other. However, for a mapreduce job, different blocks of a file can read in parallel.
  9. After all blocks have been read, dfs client writes the contents to a local file.

Short circuit read & write:

  • Using short circuit feature, dfs client can directly read from the local filesystem, if the dfs client is co-located with a DN. The DN is no longer in the IO path, so generally it tends to provide better performance.
  • The local DN is still required to be up and running, as the DN acts as a facilitator to the dfs client. When a dfs client references a block for read, the DN passes the file reference of the actual file path corresponding to the data block, to the DFS client through a configured shared memory segment. For native HDFS this could be the actual ext4 file path corresponding to a block. With shared storage and Transparency, there are no local disks, the Scale file descriptor details as a file is referenced & passes to the dfs client.


 Thanks you for reading this blog post. Comments are welcome.

Thanks to Xin TW Wang from IBM China Systems & Technology Lab for reviewing of this content.