I/O scaling performance for IBM Power servers running Linux

By Murilo Fossa Vicentini posted Wed October 06, 2021 04:51 PM

  


Abstract

The purpose of this blog is to explore the I/O scaling performance capabilities, for both network and storage devices, of an IBM® Power® E980 (machine type and model 9080-M9S) server, which is based on the IBM POWER9 architecture.

 

It does so by providing the performance data for network and storage devices observed in the IBM development labs, along with tuning tips and recommendations for achieving better overall performance. 

Scaling storage devices

For storage devices, the system used for performance data collection was configured as a single logical partition (LPAR) containing all of the system's CPUs, memory, and storage devices. This section describes the configuration of the system used and provides information on the kernel and benchmark tool used. 

Configuration


One LPAR with:

  • 128 dedicated processors (1024 CPUs)
  • 3894 GB of memory
  • Sixteen 3.2 TB NVMe devices (Feature Codes: EC7C, EC7D / CCIN: 594B)
  • Twelve 800 GB NVMe U.2 devices (Feature Code: EC5J / CCIN: 59B4) [Internal SSD drive]
  • Upstream kernel 5.11
  • Benchmark tool: fio version 3.25

 

Test scenarios


This section covers the complete set of performance test scenarios performed for the storage devices. Given the large amount of data generated, only a subset of the results was selected for this blog to illustrate the overall behavior seen in these tests.

 

Workload: Random read and random write for 4 KB block size

 

Parameters:

  • I/O depth: 1, 8, 16, 32, and 64
  • Number of jobs: 1, 8, 16, 32, and 64
  • I/O engines: libaio and io_uring

 

As mentioned previously, the fio tool was used to measure performance with these test settings. Refer to the following sample of the command format used for this benchmark:

fio --direct=1 --refill_buffers --rw=<workload (randread/randwrite)> --ioengine=<I/O engine (libaio/io_uring)> --bs=4k --iodepth=<I/O depth> --runtime=120 --numjobs=<number of jobs> --group_reporting --name=job1 --filename=<first nvme device> --name=job2 --filename=<second nvme device> ...
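
For instance, filling in the placeholders for a hypothetical random read run with the io_uring engine, an I/O depth of 16, and two jobs across two NVMe devices (the device paths are examples only) would look like this:

fio --direct=1 --refill_buffers --rw=randread --ioengine=io_uring --bs=4k --iodepth=16 --runtime=120 --numjobs=2 --group_reporting --name=job1 --filename=/dev/nvme0n1 --name=job2 --filename=/dev/nvme1n1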


Performance


With the system configuration and test set previously mentioned, measurements were taken of how the platform performs as the number of storage I/O devices is scaled up, in the following scenarios:

  • Performance of a single 3.2 TB NVMe device (CCIN: 594B)
  • Performance of eight 3.2 TB NVMe devices (CCIN: 594B)
  • Performance of sixteen 3.2 TB NVMe devices (CCIN: 594B)
  • Performance of sixteen 3.2 TB NVMe devices (CCIN: 594B) and twelve 800 GB NVMe U.2 devices (CCIN: 59B4)

 

For random read operations with 4 KB block size, the following performance numbers were observed:

The maximum expected performance line in the chart is calculated from the maximum performance obtained with a single device (for both the 3.2 TB NVMe and the 800 GB NVMe U.2) multiplied by the number of devices used in the test. The change in slope seen in the chart has two causes: the number of devices added at each step differs, and the last step adds a different device type, with its own performance characteristics.
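
Expressed as a simple formula:

maximum expected IOPS = (number of 3.2 TB NVMe devices × peak IOPS of one 3.2 TB NVMe device) + (number of 800 GB NVMe U.2 devices × peak IOPS of one 800 GB NVMe U.2 device)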

 

As can be seen, performance scales linearly up to the 16-device test, and with the full 28 devices it reaches 97% of the expected target, hitting 28.9 million input/output operations per second (IOPS) for random read operations with a 4 KB block size.

 

For random write operations with 4 KB block size, the following performance numbers were observed:

Write data was collected only with the sixteen 3.2 TB NVMe devices. Because the 800 GB NVMe U.2 devices are read-intensive and the system used to collect the data was on loan, they were skipped to avoid wearing them down.

 

Notice that performance progressed linearly up to the sixteen devices tested, hitting 8.55 million IOPS.

Scaling network devices

The system was divided into two LPARs, each containing half of the system's CPUs, memory, and network I/O devices. The network devices in one LPAR were connected back-to-back (no switch involved) to the network devices in the other LPAR. This section describes the configuration used and provides information about the kernel and benchmark tool used.

Configuration


Two LPARs each with:

  • 64 dedicated processors (512 CPUs)
  • 1908.75 GB of memory
  • Ten dedicated ports from five 2-port 100 Gbps Mellanox devices (Feature Codes: 2CF3, EC66, EC67)
  • One virtual function (VF) at 100% capacity of a single port of a 2-port 100 Gbps Mellanox device (Feature Codes: 2CF3, EC66, EC67) in single root I/O virtualization (SR-IOV) mode
  • Upstream kernel 5.11
  • Benchmark tool: uperf version 1.0.7

 

Test scenarios


This section covers the complete set of performance test scenarios performed for the network devices. Given the large amount of data generated, only a subset of the results was selected for this blog to illustrate the overall behavior seen in these tests.

 

Workload: TCP Stream and Request / Response (RR)

 

Parameters:

  • Connections:
    • 1, 4, 8, 16, and 64 (Stream)
    • 1, 25, 50, 100, and 150 (RR)
  • Message size:
    • 256 bytes, 1 KB, 4 KB, 16 KB, and 64 KB (Stream)
    • 1 byte, 1 KB, 4 KB, 16 KB, and 64 KB (RR)
  • MTU: 1500 and 9000
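
For the MTU 9000 cases, jumbo frames must be configured on the interfaces of both peers. As a sketch (the interface name is a placeholder), this can be done with the ip command:

ip link set dev <interface> mtu 9000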

 

As mentioned earlier, the uperf tool was used to measure the performance with the above settings. Refer to the following sample format of the XML file used by this benchmark for a stream workload:

<?xml version="1.0"?>
<profile name="TCP_STREAM">
  <group nprocs="{number of connections}">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost={first peer IP} protocol=tcp"/>
    </transaction>
    <transaction duration="120">
      <flowop type="write" options="count=16 size={message size}"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
  <group nprocs="${number of connections}">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost={second peer IP} protocol=tcp"/>
    </transaction>
    <transaction duration="120">
      <flowop type="write" options="count=16 size={message size}"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
  ...
</profile>

A sample XML format for an RR workload:

<?xml version="1.0"?>
<profile name="TCP_RR">
  <group nprocs="{number of connections}">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost={first peer IP} protocol=tcp"/>
    </transaction>
    <transaction duration="120">
      <flowop type="write" options="size={message size}"/>
      <flowop type="read"  options="size={message size}"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
  <group nprocs="{number of connections}">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost={second peer IP} protocol=tcp"/>
    </transaction>
    <transaction duration="120">
      <flowop type="write" options="size={message size}"/>
      <flowop type="read"  options="size={message size}"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
  ...
</profile>
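
As a usage sketch (an illustration rather than part of the original test procedure), uperf runs between two peers: one side is started in slave mode and the other runs the profile. The profile file name below is hypothetical:

# On the remote peer, start uperf in slave mode
uperf -s

# On the local peer, run the profile for the desired workload
uperf -m tcp_stream.xml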


Performance


With the system configuration and test set mentioned earlier, measurements were taken of how the platform performs as the number of network I/O devices is scaled, using a bandwidth workload generated by the uperf benchmark tool. The scaling order used in these tests is:

  • One pair of ports - First port from one device in dedicated mode
  • Two pairs of ports - First port from each of two devices in dedicated mode
  • Four pairs of ports - First port from each of four devices in dedicated mode
  • Six pairs of ports - First port from each device in dedicated mode + one VF of the device in SR-IOV mode
  • Eleven pairs of ports - Both ports from each device in dedicated mode + one VF of the device in SR-IOV mode

 

The next chart shows a sample of the performance seen on the 9080-M9S server.
The maximum expected performance line in the chart is calculated from the maximum performance of a single device multiplied by the number of port pairs involved in the test.

 

As can be seen, performance scales properly up to six pairs of ports. However, when moving to 11 pairs of ports by adding the second port of each device, the performance obtained is around 75% of the expected value. 

Tuning / Optimization

This section explores simple tunings and optimizations that can lead to better overall performance or resource utilization on the server.
 

Partition topology


The 9080-M9S server can have up to four nodes, as was the case in the tests performed, so an unoptimized CPU/memory topology in an LPAR can have a considerable impact on I/O device performance.

 

The following is an example of a performance test done with a vanilla distribution kernel on both an unoptimized and an optimized partition.

 

In this example, the network cards are divided between two LPARs: one LPAR (lp1) contains network devices from node 1 and node 3, and the second LPAR (lp2) contains network devices from node 2 and node 4.

 

Dedicated device connections (lp1 <-> lp2):

U78D5.ND1.CSS4675-P1-C3-C1 <-> U78D5.ND2.CSS4587-P1-C1-C1
U78D5.ND1.CSS4675-P1-C5-C1 <-> U78D5.ND2.CSS4587-P1-C3-C1
U78D5.ND1.CSS4675-P1-C7-C1 <-> U78D5.ND2.CSS4587-P1-C5-C1
U78D5.ND3.CSS46B3-P1-C1-C1 <-> U78D5.ND4.CSS449F-P1-C1-C1
U78D5.ND3.CSS46B3-P1-C3-C1 <-> U78D5.ND4.CSS449F-P1-C3-C1


So, in an unoptimized configuration, we can see the following topology in lp1:

# numactl -H
available: 16 nodes (0-15)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 130447 MB
node 0 free: 126791 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 130664 MB
node 1 free: 129546 MB
node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
node 2 size: 99966 MB
node 2 free: 98731 MB
node 3 cpus: 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 130664 MB
node 3 free: 129632 MB
node 4 cpus: 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151
node 4 size: 131176 MB
node 4 free: 130800 MB
node 5 cpus: 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183
node 5 size: 131432 MB
node 5 free: 130978 MB
node 6 cpus: 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
node 6 size: 100750 MB
node 6 free: 100247 MB
node 7 cpus: 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
node 7 size: 131176 MB
node 7 free: 130652 MB
node 8 cpus: 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 8 size: 131169 MB
node 8 free: 130353 MB
node 9 cpus: 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327
node 9 size: 110953 MB
node 9 free: 110482 MB
node 10 cpus: 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351
node 10 size: 101007 MB
node 10 free: 100227 MB
node 11 cpus: 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383
node 11 size: 122233 MB
node 11 free: 121846 MB
node 12 cpus: 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415
node 12 size: 131689 MB
node 12 free: 131175 MB
node 13 cpus: 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447
node 13 size: 125289 MB
node 13 free: 125075 MB
node 14 cpus: 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479
node 14 size: 121940 MB
node 14 free: 121725 MB
node 15 cpus: 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511
node 15 size: 121193 MB
node 15 free: 120958 MB
node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
  0:  10  20  20  20  40  40  40  40  40  40  40  40  40  40  40  40 
  1:  20  10  20  20  40  40  40  40  40  40  40  40  40  40  40  40 
  2:  20  20  10  20  40  40  40  40  40  40  40  40  40  40  40  40 
  3:  20  20  20  10  40  40  40  40  40  40  40  40  40  40  40  40 
  4:  40  40  40  40  10  20  20  20  40  40  40  40  40  40  40  40 
  5:  40  40  40  40  20  10  20  20  40  40  40  40  40  40  40  40 
  6:  40  40  40  40  20  20  10  20  40  40  40  40  40  40  40  40 
  7:  40  40  40  40  20  20  20  10  40  40  40  40  40  40  40  40 
  8:  40  40  40  40  40  40  40  40  10  20  20  20  40  40  40  40 
  9:  40  40  40  40  40  40  40  40  20  10  20  20  40  40  40  40 
 10:  40  40  40  40  40  40  40  40  20  20  10  20  40  40  40  40 
 11:  40  40  40  40  40  40  40  40  20  20  20  10  40  40  40  40 
 12:  40  40  40  40  40  40  40  40  40  40  40  40  10  20  20  20 
 13:  40  40  40  40  40  40  40  40  40  40  40  40  20  10  20  20 
 14:  40  40  40  40  40  40  40  40  40  40  40  40  20  20  10  20 
 15:  40  40  40  40  40  40  40  40  40  40  40  40  20  20  20  10 


A better topology can be achieved in this scenario for lp1, considering how the devices are assigned to the partition. Here is what the LPAR topology looks like after optimization:

# numactl -H
available: 8 nodes (0-3,8-11)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 245784 MB
node 0 free: 241706 MB
node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 246241 MB
node 1 free: 244816 MB
node 2 cpus: 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 2 size: 246465 MB
node 2 free: 244876 MB
node 3 cpus: 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 3 size: 246241 MB
node 3 free: 244952 MB
node 8 cpus: 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319
node 8 size: 221931 MB
node 8 free: 220155 MB
node 9 cpus: 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383
node 9 size: 248019 MB
node 9 free: 247035 MB
node 10 cpus: 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447
node 10 size: 253134 MB
node 10 free: 251482 MB
node 11 cpus: 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511
node 11 size: 243939 MB
node 11 free: 242940 MB
node distances:
node   0   1   2   3   8   9  10  11 
  0:  10  20  20  20  40  40  40  40 
  1:  20  10  20  20  40  40  40  40 
  2:  20  20  10  20  40  40  40  40 
  3:  20  20  20  10  40  40  40  40 
  8:  40  40  40  40  10  20  20  20 
  9:  40  40  40  40  20  10  20  20 
 10:  40  40  40  40  20  20  10  20 
 11:  40  40  40  40  20  20  20  10 


The following charts show the performance obtained when running five pairs of ports. With just this change in the partition, and regardless of MTU, they show how negatively workload bandwidth is affected by an unoptimized partition configuration as the number of connections scales or message sizes increase.

The topology in a partition or system can be adjusted automatically by using the optmem command in the Hardware Management Console (HMC) command line. To start an operation to optimize all partitions in the system, run the following command:

optmem -m <system> -o start -t affinity


You can also select a subset of partitions to prioritize for optimization using the following command:

optmem -m <system> -o start -t affinity -p <lpar1,lpar2>

You can find more information about this HMC command in the reference manual at: https://www.ibm.com/docs/en/power9?topic=commands-optmem
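
As an additional check from within the partition (an assumption on my part, not part of the original procedure), the NUMA node that Linux associates with an adapter can be read from sysfs using the adapter's PCI address (as shown by lspci); a value of -1 means no node affinity is reported:

cat /sys/bus/pci/devices/<pci address>/numa_node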

 

Dynamic DMA windows (DDW)


The following list explains the terminology used in this section:

  • DMA stands for direct memory access. DMA allows an I/O adapter to access a limited amount of memory directly, without involving the CPU for memory transfers. Both the device driver for the adapter and the operating system must recognize and support this.
  • IOMMU stands for I/O memory management unit. IOMMU is responsible for managing the I/O memory addresses, as well as enabling the connection between DMA-capable I/O buses and the main memory.
  • DMA window is a range of addresses the adapter is allowed to access. The DMA window address is mapped to the physical memory using a Translation Control Entry (TCE) table on the IOMMU system.

 

The default DMA window that is allocated is relatively small (2 GB), but the platform can allocate larger DMA windows that the OS can use through Dynamic DMA Window (DDW) operations, which manipulate the DMA window in use by the device.

 

With a wider DMA window, it is possible to map the entire partition memory. This enables a direct translation between the I/O address space and the memory address space, without requiring manipulation of the hypervisor TCE table. On the other hand, if the DMA window cannot cover the whole memory, a translation process must convert from one address space to the other, and this operation consumes CPU cycles.

 

The way Linux® currently operates with DDW, these larger windows are used only if the whole partition memory can be mapped; otherwise, only the default 2 GB DMA window is used. Depending on the maximum memory size of the LPAR, the window may not be wide enough to map the whole partition memory. In this case, the slots can be enabled with Huge Dynamic DMA Window (HDDW) capability using the I/O Adapter Enlarged Capacity setting in the Advanced System Management Interface (ASMI). HDDW-enabled slots are guaranteed to allocate enough DDW capability to map all installed platform memory using 64 KB I/O mappings.

 

Refer to the IBM documentation for more information about the size of these larger DMA windows and about the I/O Adapter Enlarged Capacity setting for the 9080-M9S.

 

The drawback of this feature is an increased memory footprint reserved by the hypervisor in the HDDW scenario.

 

To check whether DDW was enabled properly on Linux, inspect the kernel messages and look for the ibm,create-pe-dma-window Run-Time Abstraction Services (RTAS) call returning successfully (code 0) for the device you are interested in.

# dmesg | grep create-pe-dma-window
nvme 0011:01:00.0: ibm,create-pe-dma-window(54) 10000 8000000 20000011 10 2a returned 0 (liobn = 0x70000011 starting addr = 8000000 0)


To show the performance benefits of this optimization, tests were performed with DDW enabled and with DDW disabled, for both the storage and network devices.
 

DDW benefits on storage devices


On the storage side, performance data was collected for random read operations. The improvements are significant even with a single device.

 

The following chart compares the number of IOPS for random read operations with a 4 KB block size when the devices use DDW versus when they do not (that is, when they use the default DMA window):

The maximum performance of a single 594B device is 1.5 million IOPS with DDW and 560K IOPS without it (2.7 times higher with DDW). When scaling to 28 devices, the maximum performance observed is 28.5 million IOPS with DDW enabled versus around 10 million IOPS with DDW disabled (2.85 times higher with DDW).

 

A broader test covering multiple I/O depth values and numbers of jobs shows that the average benefit is around a twofold improvement, for both the single-adapter and the 28-adapter cases.

DDW benefits on network devices


On the network side, performance data was collected on two types of workloads: a streaming workload and a request / response workload.

 

For a bandwidth test, this tuning showed similar results when using a single device. However, as the number of devices in the system scales, the benefits become more noticeable.

 

The following chart compares bandwidth when the device uses DDW versus when it does not (using the default DMA window).
A broader test covering multiple connection counts and message sizes shows that for a single pair of ports the same maximum performance is obtained (94 Gbps with MTU 1500), but as the number of ports scales up, a benefit appears with DDW enabled. With 11 pairs of ports, the maximum throughput is 30% higher with DDW enabled, and results are on average 25% better.

 

On a request/response (RR) test, even with a single pair of ports, we already see a significant performance change.

 

The following chart compares the number of transactions per second (TPS) the device is capable of with DDW enabled versus DDW disabled (using the default DMA window).

For a single pair of devices, we get up to 1.3 million TPS with DDW enabled and up to 340,000 TPS with DDW disabled (3.8 times higher with DDW). A broader test covering multiple connection counts and message sizes shows that, on average for a single port, DDW delivers 2.3 times higher performance for this workload.
 

Interrupt coalescing


Disclaimer: The optimal setting for this feature is highly workload dependent, so in practice it might be valuable only for well-defined and consistent workloads.

 

Some NVMe devices support a feature called interrupt coalescing, in which interrupts are aggregated based on one of two conditions (whichever is met first): time or threshold.

 

The threshold setting uses the last 8 bits of the feature value and determines the minimum number of completion queue entries to aggregate per interrupt vector before signaling an interrupt to the host. The time setting uses the second-to-last 8 bits and determines the maximum time (in 100-microsecond increments) that a controller may delay an interrupt due to interrupt coalescing. The remaining bits are currently reserved and not in use.

 

To check the current value set for this feature, the following command can be used:

nvme get-feature /dev/<nvme device> -f 8


To set a new value for this feature, the following command can be used:

nvme set-feature /dev/<nvme device> -f 8 -v <value>

A few tests were performed to highlight some of the potential benefits of this feature. In these tests, the aggregation threshold was set to the maximum value (0xFF) so that it would not be the trigger, and the aggregation time was the parameter being modified.
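
For example, following the bit layout described earlier, a value of 767 (0x02FF) sets the aggregation threshold to its maximum (0xFF) and the aggregation time to 2, that is, 200 microseconds (the device name is a placeholder):

nvme set-feature /dev/<nvme device> -f 8 -v 767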

 

This feature showed slight performance benefits depending on the test case, as seen in the following chart:

As expected, the interrupt rate (number of interrupts generated per second for all devices) is greatly reduced, as seen in the following chart:

This kind of interrupt mitigation leads to a more interesting aspect of this feature: reduced CPU utilization, because less interrupt handling is necessary. This means that we can drive much more performance for the same unit of CPU utilization. The following chart illustrates this point by normalizing performance against CPU utilization (IOPS / CPU%) for each test:
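
For instance, a hypothetical run delivering 1,000,000 IOPS at 50% CPU utilization would normalize to 20,000 IOPS per percentage point of CPU utilization.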

So, for scenarios where CPU is a constraint, this feature can potentially help mitigate the CPU utilization of these devices. 

Summary

This blog presented an overview of the I/O scaling performance capabilities of an IBM Power server with both network and storage devices, showing how well it can perform with a large number of devices as well as the current shortcomings observed in the platform.

 

It also presented data on how basic tuning and optimization (such as DDW and partition topology) can improve resource utilization and performance across the most common workloads used with these devices.

 

More in-depth, workload-specific tunings and optimizations (such as interrupt coalescing) can provide further benefits but need to be used cautiously, given that they usually involve some trade-off and can lead to poor results under different conditions. They are therefore best suited for well-defined and consistent workloads.



Contacting the Enterprise Linux on Power Team
Have questions for the Enterprise Linux on Power team or want to learn more? Follow our discussion group on IBM Community Discussions.
