CDzOS Performance and Tuning : Concurrent BSAM I/O and Striping

By Ed Peters posted Fri July 23, 2021 08:22 AM

  

Overview

IBM Sterling Connect:Direct for z/OS (CDzOS) is a managed file transfer product. It specializes in transferring z/OS datasets from one CDzOS instance (node) to another using SNA (LU0, LU6.2) or TCP/IP. It is a mature product with many optional capabilities (TLS 1.3, zEDC compression, HSAO (FASP) TCP optimization), tuning knobs (INITPARMs), configuration files (Netmap, Secure+ Parmfile), logging files (Statistics), workload management files (TCQ, TCX), and ancillary files (CKPT, etc.).

Tuning each of these is a topic all by itself and is covered in CDzOS documentation. But one area often neglected is the file being transferred. A non-VSAM DASD dataset with poorly optimized attributes (e.g. small block size) can take literally hundreds of times longer to transmit than the same data in a dataset with optimized attributes, simply because of I/O delay. With multiple-gigabyte datasets, it pays to tune the dataset attributes.

An old rule of thumb in computing is to avoid I/O as much as possible, since even the fastest I/O to SSD is much slower than direct memory access. But since I/O cannot be avoided altogether, the next best thing is to make every I/O count as much as possible. That means maximizing the transfer rate by a) making each I/O transfer as much data as possible and b) making the average time per I/O as short as possible. To do the first, optimize the dataset attributes for the device type it resides on. To do the second, maximize the number of concurrent I/Os to the dataset. Using a striped dataset exploits the latter technique so thoroughly that the performance bottleneck is partly, and often completely, moved from DASD I/O to TCP/IP I/O in a CDzOS file transfer.

Making Each I/O Transfer As Much Data As Possible vs. DASD Space Utilization

CDzOS uses the BSAM access method to read and write non-VSAM files. One BSAM READ macro initiates the transfer of one block of data from DASD to memory; likewise WRITE in the opposite direction. With modern DASD (SAN) and the relatively small block sizes typical of device type 3390, a READ or WRITE of an 80-byte block takes about as much time as one of 27,920 bytes. The overhead of the non-data-transfer phases of I/O (access method, interrupt-driven dispatcher preemption, channel delay, and peripheral hardware delay) takes the lion's share of the time. The system-determined block size is usually "half-track blocking" (2 blocks per track) for 3390 sequential datasets. This is because the maximum DASD block size is 32,760 bytes while the track capacity is 56,664 bytes.

Rather than write one 32,760-byte block per track and leave 23,904 bytes unused, system-determined block size will reduce the block size to reduce waste. The maximum half-track block size is 27,998 for the BASIC and LARGE DSNTYPEs, and 27,966 for EXTENDED DSNTYPEs. In either case, 27,920 is the maximum half-track block size for a dataset with LRECL=80. So, a half-track blocked RECFM=FB LRECL=80 BLKSIZE=27920 sequential dataset will have 55,840 usable bytes per track, leaving only 824 unused bytes per track (theoretically). By letting the system determine the block size, you achieve optimum packing of data on disk, and transfer only about 4K per I/O less than the maximum supported by BSAM for device type 3390.
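The half-track arithmetic above can be checked with a few lines of plain Python (the constants are the 3390 figures from the text; actual track capacity also depends on block count, so treat the waste figure as theoretical):

```python
# Half-track blocking arithmetic for a 3390 with LRECL=80.
TRACK_CAPACITY = 56664   # bytes per 3390 track
LRECL = 80
HALF_TRACK_MAX = 27998   # max half-track block size, BASIC/LARGE DSNTYPE

# Largest multiple of LRECL that fits in a half-track block:
blksize = (HALF_TRACK_MAX // LRECL) * LRECL
usable_per_track = 2 * blksize            # two blocks per track
waste = TRACK_CAPACITY - usable_per_track

print(blksize)           # 27920
print(usable_per_track)  # 55840
print(waste)             # 824
```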

If you need the fastest possible I/O and don’t mind wasting the disk space, a block size closer to the maximum can be specified. Transferring 1 GB of data with a block size of 32,720 takes 30,563 I/Os; a block size of 27,920 takes 35,817 I/Os, an increase of 5,254 I/Os or 17%. Thus, all else being equal, the most space-efficient sequential dataset will take 17% longer to read or write than the most time-efficient one. This example is typical: there is usually a tradeoff between maximizing usable bytes per track and maximizing bytes per I/O. System-determined block size favors maximizing space utilization.
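As a quick check of those I/O counts (32,720 is the largest multiple of LRECL=80 that fits under the 32,760-byte maximum):

```python
import math

# I/O count to move 1 GB at two block sizes (figures from the text).
GB = 1_000_000_000
io_fast = math.ceil(GB / 32_720)  # near-maximum block size
io_half = math.ceil(GB / 27_920)  # half-track block size

print(io_fast)              # 30563
print(io_half)              # 35817
print(io_half - io_fast)    # 5254 extra I/Os, about 17% more
```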

Minimizing Average Time Per I/O

Three ways to minimize the average time per I/O are: do more than one I/O concurrently, interleave the I/Os, and spread the I/O across multiple volumes. The QSAM access method chains its I/O: instead of bringing only one block at a time into storage, it bundles a small number of blocks and moves them into storage as a group, saving significant overhead. BSAM can do better or worse than that, depending on how you code your routine. If you only READ or WRITE one block, then wait for the I/O to complete via CHECK, BSAM will be slower than QSAM. But if the number (n) of READ or WRITE macros issued before a CHECK is more than 1, the I/Os can complete concurrently. The greater n is, the greater the concurrency and the time savings.

But it gets even better if you interleave the I/Os by only issuing a CHECK macro when you have to. For example, let’s assume you are reading, and n=5. You could issue 5 READs, then 5 CHECKs, and repeat that until end of file (EOF), and that would be much faster than the n=1 case. But it would be even faster if you did an initial 5 READs, and from then on, did 1 CHECK followed by 1 READ, until EOF. Interleaving lets the system and/or the DASD device decide how to most efficiently deliver the blocks. By increasing n, you give it more leeway to transfer more blocks in one fell swoop. As you probably know, n is just NCP, the number of channel programs. It is a 1-byte binary field in the DCB (DCBNCP) and so has a maximum allowed value of 255. Thus, there can be at most 255 outstanding READs or WRITEs on one DCB.

Before you can do another READ or WRITE on the DCB, you must free up one of the slots by doing a CHECK on the oldest READ or WRITE DECB. The first CHECK will take the usual amount of time, since it must go through the entire process to complete the first I/O. But subsequent CHECKs may take almost no time, since the I/O may have actually completed before that. In those instances, CHECK does little more than the system’s bookkeeping for the I/O completion. The greater the NCP, the greater the chance that any given I/O completed before the CHECK for it is issued. The “System Determined NCP” (SDN) is calculated at OPEN time if DCBNCP is 0. It sets DCBNCP to a recommended value. When the DCBE parameter MULTSDN > 0, it tells the system to multiply the SDN by MULTSDN and use the product (up to 255) instead. If you have the buffer space, an NCP that is as close to 255 as possible and is a multiple of the number of blocks per track is best.
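The interleaved pattern described above (prime the pipeline with NCP READs, then alternate CHECK-oldest with READ-next until EOF) can be modeled in a few lines. This is an illustrative Python sketch of the ordering only, not HLASM and not the actual CDzOS code; the function name and structure are my own:

```python
from collections import deque

def interleaved_read(blocks, ncp):
    """Return the order of READ/CHECK operations for `blocks` total blocks
    with at most `ncp` I/Os outstanding (modeling BSAM's DCBNCP limit)."""
    ops, outstanding = [], deque()
    next_block = 0
    # Prime the pipeline: issue up to NCP READs before the first CHECK.
    while next_block < min(ncp, blocks):
        ops.append(("READ", next_block)); outstanding.append(next_block)
        next_block += 1
    # Steady state: CHECK the oldest outstanding I/O, then issue the next READ.
    while outstanding:
        ops.append(("CHECK", outstanding.popleft()))
        if next_block < blocks:
            ops.append(("READ", next_block)); outstanding.append(next_block)
            next_block += 1
    return ops

ops = interleaved_read(blocks=6, ncp=3)
# READ 0, READ 1, READ 2, CHECK 0, READ 3, CHECK 1, READ 4,
# CHECK 2, READ 5, CHECK 3, CHECK 4, CHECK 5
```

Note that between any CHECK and the CHECK for the next block, the device has up to NCP other transfers it may complete in whatever order is most efficient for it.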

There is also a concept called the “Accumulation Value,” which tells BSAM how many READ or WRITE requests to accumulate before starting them. This is controlled by the DCBE parameter MULTACC. If you don’t use interleaved I/O, a large accumulation of READs/WRITEs is best. With interleaved I/O, a reasonably small value is best, because a large value can defeat interleaved I/O by making the DASD device wait for completion on a less than optimum schedule. CDzOS uses interleaved I/O with a high NCP and no accumulation, allowing the device to maximize its own performance.

Finally, if you spread out the I/O to different devices, even though these days it is often going to the same “box” (e.g. IBM DS8K series), the performance improves because there are multiple independent paths and the box is capable of processing the I/Os simultaneously. So even though a DASD volser is just a logical construct, striping a dataset across multiple volumes still realizes a great increase in the effective transfer rate, and the more stripes the better, up to the number of paths to the box. However, striping does little good unless the program doing the I/O uses concurrent I/O. With an NCP of 1, striping is superfluous. The most efficient user of striped datasets is the program that does interleaved I/O with a high NCP, like CDzOS.

Examples of Transfer Rates

In the table below, each row represents one transfer of data. The attributes in common are:    

Bytes = 9,135,424,000

DSORG = PS

LRECL = 80

BLKSIZE = 27920

Blocks = 327200

z/OS level = 2.4

CPU Hardware = z13 2964-718 with 2584 MSUs

CDzOS is able to generate or consume data without I/O, via its “IOEXIT” feature. In the first row, it is used to show the theoretical maximum throughput for the configuration. Each group of rows after the IOEXIT row is for a different dataset with an increasing number of stripes. Within a group, the first row shows how long it takes ISPF Browse to scroll to the bottom of the dataset. Subsequent rows in the group are CDzOS COPY steps with increasing NCP.
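The Xfer Rate column can be reproduced from the byte count and the elapsed time (MB here means 10^6 bytes); a quick check in Python against two rows of the table:

```python
# Derive the table's Xfer Rate (MB/sec) from bytes transferred and elapsed time.
BYTES = 9_135_424_000  # dataset size from the attributes list

def xfer_rate(elapsed_seconds):
    return BYTES / elapsed_seconds / 1_000_000

print(round(xfer_rate(5.28), 1))   # 1730.2  (the COPY IOEXIT row)
print(round(xfer_rate(23.53), 2))  # 388.25  (8 stripes, NCP=255)
```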

| Operation                 | NCP | Elapsed (seconds) | Xfer Rate (MB/sec) |
|---------------------------|-----|-------------------|--------------------|
| COPY IOEXIT               | n/a | 5.28              | 1730.2             |
| Browse PS-L Stripes=1     | n/a | 139.50            | 65.49              |
| COPY PS-L Stripes=1       | 1   | 151.90            | 60.14              |
| COPY PS-L Stripes=1       | 2   | 101.24            | 90.24              |
| COPY PS-L Stripes=1       | 10  | 81.33             | 112.33             |
| COPY PS-L Stripes=1       | 100 | 80.30             | 113.77             |
| COPY PS-L Stripes=1       | 255 | 83.51             | 109.39             |
| Browse PS-E-V1 Stripes=2  | n/a | 137.20            | 66.58              |
| COPY PS-E-V1 Stripes=2    | 1   | 142.89            | 63.93              |
| COPY PS-E-V1 Stripes=2    | 2   | 104.45            | 87.46              |
| COPY PS-E-V1 Stripes=2    | 10  | 66.08             | 138.25             |
| COPY PS-E-V1 Stripes=2    | 100 | 57.74             | 158.22             |
| COPY PS-E-V1 Stripes=2    | 255 | 53.90             | 169.49             |
| Browse PS-E-V1 Stripes=4  | n/a | 137.70            | 66.34              |
| COPY PS-E-V1 Stripes=4    | 1   | 140.64            | 64.96              |
| COPY PS-E-V1 Stripes=4    | 2   | 94.79             | 96.38              |
| COPY PS-E-V1 Stripes=4    | 10  | 46.24             | 197.57             |
| COPY PS-E-V1 Stripes=4    | 100 | 32.69             | 279.46             |
| COPY PS-E-V1 Stripes=4    | 255 | 34.24             | 266.81             |
| Browse PS-E-V1 Stripes=8  | n/a | 137.50            | 66.44              |
| COPY PS-E-V1 Stripes=8    | 1   | 146.14            | 62.51              |
| COPY PS-E-V1 Stripes=8    | 2   | 100.03            | 91.33              |
| COPY PS-E-V1 Stripes=8    | 10  | 39.66             | 230.34             |
| COPY PS-E-V1 Stripes=8    | 100 | 27.19             | 335.98             |
| COPY PS-E-V1 Stripes=8    | 255 | 23.53             | 388.25             |

 

The absolute numbers change with the system hardware, but the relative improvements remain. A high number of concurrent, interleaved BSAM I/Os maximizes throughput. The cost is increased memory usage to hold all the buffers needed for the simultaneous I/O. But if memory is abundant and time is of the essence, it pays to use striping and highly concurrent interleaved I/O.
