Troubleshooting a DB2 Crash with Entropy Issues

By Greg Sorensen posted Tue August 19, 2025 05:10 PM

  


Introduction

The strength of cryptographic keys, essential for robust data security, hinges on high entropy. Within a DB2 database environment, both database and network connections commonly utilize SSL/TLS for secure communication. Nevertheless, a degradation of entropy at the operating system layer can critically impair GSKit, resulting in disruptions to network connectivity and potentially causing a DB2 crash. A team of DBAs, system admins, and program developers examined the dynamics of this issue. This blog post details the root causes the team identified, such as low entropy, and outlines the diagnostic approach and the resolution strategies that proved effective in a scenario where a DB2 crash occurred due to a GSKit malfunction.

Problem Description

In a very rare case, a DB2 system crashed and halted with the following messages.

<Date>   Instance:<instance>   Node:000
PID:44585(db2agent (<DB>) 0)   TID:2768234240   Appid:<IP>.37370.270505022405
base sys utilities  sqleDoForceDBShutdownFODC Probe:10   Database:<DB>

ADM14001C  An unexpected and critical error has occurred: "ForceDBShutdown".
The instance may have been shutdown as a result. "Automatic" FODC (First
Occurrence Data Capture) has been invoked and diagnostic information has been
recorded in directory
"/<dbpath/sqllib/db2dump/DIAG0000/FODC_ForceDBShutdown_<date>/". 
Please look in this directory for detailed evidence about
what happened and contact IBM support if necessary to diagnose the problem.

<Date>   Instance:<instance>   Node:000
PID:44585(db2agent (<DB>) 0)   TID:2768234240   Appid:<IP>.37370.270505022405
base sys utilities  sqleDoForceDBShutdownFODC Probe:17395   Database:<DB>

ADM7518C  The database manager has shut down the following database because a
severe error has occurred: "<DB_name>     ".

On the surface, this situation appeared quite alarming; however, it generally reflects DB2's normal self-preservation behavior and calls for further investigation. During the actual diagnosis of the failure, we encountered a highly unusual event: a complete collapse of the entire GSKit encryption library. The db2diag.log recorded the following message types:

<DATE>-240 I870310E1818          LEVEL: Error
PID     : 44585                TID : 140149602117376 PROC : db2sysc 0
INSTANCE: <INSTNAME>              NODE : 000            DB   : <DB>
HOSTNAME: <FQDN>
EDUID   : 48                   EDUNAME: db2loggw (<DB>) 0
FUNCTION: DB2 Common, Cryptography, cryptEncryptInit, probe:20
MESSAGE : ECF=0x90000403=-1879047165=ECF_CRYPT_UNEXPECTED_ERROR
          Unexpected cryptographic error
DATA #1 : Hex integer, 8 bytes
0xFFFFFFFFFFFFFFFE
DATA #2 : String, 55 bytes
error:FFFFFFFFFFFFFFFE:lib(255):func(4095):reason(4094)
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
  [0] 0x00007F7732FDD87B /db2_src/db2plug/sqllib/lib64/libdb2osse.so.1 + 0x24087B
  [1] 0x00007F7732FDD6B1 ossLog + 0xA1
  [2] 0x00007F7734A39931 cryptLogICCError + 0x171
  [3] 0x00007F7734A37790 cryptEncryptDecryptInit + 0x290
  [4] 0x00007F7734A342AA cryptDecryptBuffer + 0x3A
  [5] 0x00007F7738FF3866 _Z23sqlexRedeemCipherTicketP17sqlexCipherTicketPhmP5sqlca + 0xF6
  [6] 0x00007F773AAA64E6 _Z31sqlpgEncryptUserDataInNLogPagesPvS_tP17sqlexCipherTicketm + 0x1C6
  [7] 0x00007F773AC68862 _Z15sqlpWriteNPagesP9SQLP_DBCBP9sqeBsuEduP9SQLP_LECBPcmmb + 0x202
  [8] 0x00007F773AC64EBF _Z16sqlpgWriteToDiskP9SQLP_DBCBP9SQLP_LFPBmbbm + 0xD4F
  [9] 0x00007F773AC63E36 _Z13sqlpgPingPongP9SQLP_DBCBP9SQLP_LFPBmbm + 0x76
  [10] 0x00007F773AC6BDCD _ZN11sqpLoggwEdu8sqlpgwlpEmmPK9SQLP_LSN8m + 0xF3D
  [11] 0x00007F773AB956A3 _ZN11sqpLoggwEdu13sqlpLoggwMainEv + 0x16A3
  [12] 0x00007F773AE07C17 _ZN11sqpLoggwEdu6RunEDUEv + 0x27
  [13] 0x00007F773C52748E _ZN9sqzEDUObj9EDUDriverEv + 0x1BE
  [14] 0x00007F773AA3CD2A sqloEDUEntry + 0x57A
  [15] 0x00007F7742694EA5 /lib64/libpthread.so.0 + 0x7EA5
  [16] 0x00007F773200DB0D clone + 0x6D


<DATE>-240 I881028E519           LEVEL: Error
PID     : 44585                TID : 140149602117376 PROC : db2sysc 0
INSTANCE: <INSTNAME>              NODE : 000            DB   : <DB>
HOSTNAME: <FQDN>
EDUID   : 48                   EDUNAME: db2loggw (<DB>) 0
FUNCTION: DB2 UDB, data protection services, sqlpEncrypt, probe:100
MESSAGE : ZRC=0x875C00CD=-2024013619=SQLEX_UNEXPECTED_SYSERR
          "Unexpected System Error"
DATA #1 : String, 30 bytes
Error redeeming cipher ticket!

<DATE>-240 I880112E510           LEVEL: Severe
PID     : 44585                TID : 140149602117376 PROC : db2sysc 0
INSTANCE: <INSTNAME>             NODE : 000            DB   : WEB
HOSTNAME: <FQDN>
EDUID   : 48                   EDUNAME: db2loggw (<DB>) 0
FUNCTION: DB2 UDB, bsu security, sqlexRedeemCipherTicket, probe:1519
MESSAGE : ZRC=0x875C00CD=-2024013619=SQLEX_UNEXPECTED_SYSERR
          "Unexpected System Error"

<DATE>-240 I881548E19008         LEVEL: Error
PID     : 44585                TID : 140149602117376 PROC : db2sysc 0
INSTANCE: <INSTNAME>              NODE : 000            DB   : <DB>
HOSTNAME: <FQDN>
EDUID   : 48                   EDUNAME: db2loggw (WEB) 0
FUNCTION: DB2 UDB, data protection services, sqlpgEncryptUserDataInNLogPages, probe:30
MESSAGE : Encountered error when encrypting log page.
DATA #1 : unsigned integer, 8 bytes
0
DATA #2 : SQLP_LFPB, PD_TYPE_SQLP_LFPB, 1 bytes

    bytecount = 2542
   firstIndex = 138
      PHFlags = 0x0010
      pageLso = 21194762240142 maps to 0000135F010789FD
     CheckSum = 0x319467a1

 Hex Dump of Log Data ------------------------------------
 <Followed by a long set of HexDump values>

Upon analyzing this situation, it became clear that the crash stemmed directly from an inability to encrypt the transaction logs. The encryption failure led to a DB2 crash, which invoked FODC and brought all transactions to a halt. The recovery procedure consists of issuing a db2kill, performing an ipcrm to clear any lingering DB2 IPC resources, and then executing a db2start so the database can undergo crash recovery.
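A minimal sketch of that recovery sequence, run as the instance owner, is shown below. It assumes the db2kill step maps to the db2_kill utility shipped in sqllib/bin, and the IPC IDs are placeholders read from ipcs output; both db2_kill and ipcrm are destructive, so adapt the steps to your environment.

# Hard-stop the instance, clear leftover IPC resources, then restart
db2_kill                          # force-terminates all DB2 processes for the instance
ipcs -a | grep <instance_owner>   # list shared memory, semaphores, and message queues left behind
ipcrm -m <shmid>                  # remove each lingering shared-memory segment...
ipcrm -s <semid>                  # ...each semaphore set...
ipcrm -q <msqid>                  # ...and each message queue
db2start                          # restart the instance; crash recovery runs when the database is activated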

The "Error redeeming cipher ticket!" message and the associated ZRC=0x875C00CD indicate a problem with retrieving or validating the "cipher ticket". A cipher ticket is likely related to the key management and exchange process within the DB2 encryption mechanism.

The message "Encountered error when encrypting log page" from function sqlpgEncryptUserDataInNLogPages, probe:30, confirms that the failure occurred while encrypting a log page.

Based on research into these messages, the following are the likely sources.

Key Management Issues:

The core problem seems to be an inability to access or utilize the necessary encryption keys or related credentials for encrypting log pages. According to IBM, DB2 relies on GSKit (IBM Global Security Kit) for cryptographic requests, so problems with GSKit could prevent access to the required encryption keys.

Keystore Accessibility/Corruption:

There may have been issues accessing the keystore that stores the encryption keys for the DB2 instance. If the keystore was corrupt or unavailable, DB2 would not be able to retrieve the necessary key to encrypt the log page.

Permission Issues:

Improper permissions on the keystore or related files could have prevented DB2 from accessing them and subsequently encrypting the log page.

GSKit Errors:

The ZRC value 0x875C00CD indicates an "Unexpected System Error", which can be caused by various issues, including GSKit errors related to the encryption process.

Low Entropy Environment:

According to IBM, low entropy environments can lead to GSKit self-test failures, potentially impacting encryption operations.

FIPS Compliance:

If the DB2 instance is operating in FIPS-compliant mode, and there were issues related to FIPS-validated cryptography, this could have also contributed to the encryption failures.

Diagnosis

Our troubleshooting process involved a comprehensive examination of all encryption sources. Initial investigations readily dismissed issues concerning GSKit KDB file access, as its permissions and location had been stable for years. We also confirmed the integrity of the keystores for both DB2 encryption and SSL encryption by successfully viewing and verifying them. FIPS compliance mode was ruled out, validated by the absence of STRICT_FIPS in the DB2AUTH variable and the lack of fips=1 within /proc/cmdline on the Linux operating system. The server did not exhibit signs of overload in terms of CPU or memory utilization. Network interfaces were fully functional, showing no indications of dropped packets or CRC errors. Nevertheless, monitoring data from Instana highlighted a sharp increase in s3fs CPU utilization immediately preceding the crash. Our Instana setup tracks critical processes such as DB2, s3fs, DB2 automation scripts, CDC activity, Tivoli Storage Manager backups, various Java processes, and the underlying OS, and it explicitly flagged s3fs for heavy CPU consumption during this period.
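For reference, the checks described above can be approximated with commands along these lines (a sketch; the keystore path and the use of a stash file are assumptions for a typical native-encryption setup):

# Confirm where the instance keystore lives and that it opens cleanly
db2 get dbm cfg | grep -i keystore                       # shows KEYSTORE_TYPE and KEYSTORE_LOCATION
ls -l <keystore_location>                                # verify ownership and permissions
gsk8capicmd_64 -cert -list -db <keystore>.p12 -stashed   # listing certificates proves the keystore is readable

# Rule out FIPS-related causes
db2set -all | grep -i DB2AUTH                            # no STRICT_FIPS value means DB2 is not in FIPS-only mode
grep -o 'fips=1' /proc/cmdline || echo "kernel FIPS mode not enabled"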

This essentially left us with a "Low Entropy Environment" issue.

To diagnose the issue, we meticulously reviewed DB2 audit logs for elevated transaction volumes and consulted /var/log/messages for anomalies occurring concurrently. While a correlation with increased DB2 transactions from particular users was noted, it did not entirely exceed typical operational bounds. The messages file, conversely, provided more revealing insights. Enabling debug logging (-o dbglevel=debug) for s3fs (our cloud object storage filesystem client) exposed an unusually high number of HTTPS operations. Specifically, the s3fs process was generating numerous 'HTTP response code 404' entries, signifying that its underlying curl requests (for HTTPS GET and PUT operations) were failing to locate specific directories on the traversed network paths. Furthermore, we ascertained that rngd was not deployed to address potential low entropy conditions. Examination of /proc/sys/kernel/random/entropy_avail indicated erratic fluctuations in entropy levels, demonstrating insufficient replenishment to sustain demand.
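As an illustration of the s3fs debugging step, a re-mount along the lines of this sketch sends the client's request logging to syslog so the failing 404 responses appear in /var/log/messages (the bucket, mount point, endpoint, and credentials file are placeholders; the dbglevel flag is the one we used, so verify it against your s3fs build):

# Unmount and re-mount the COS bucket with verbose request logging
fusermount -u <mount_point>
s3fs <bucket> <mount_point> \
    -o passwd_file=/etc/passwd-s3fs \
    -o url=https://<cos_endpoint> \
    -o dbglevel=debug -o curldbg     # log each curl request and response to syslog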

We ran this command multiple times over a period of 10 minutes. A healthy system should show values in the thousands (preferably closer to the maximum pool size, which is typically 4096 bits on a standard system).

$ cat /proc/sys/kernel/random/entropy_avail

Values below 200 bits are considered dangerously low and can cause processes to block while they wait for more entropy. Our values were smaller than 3200 bits, which might also lead to issues such as the system hanging while generating certificates.
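A simple way to watch the pool over the ten-minute window is a loop like this sketch (the sampling interval is arbitrary):

# Sample the available entropy every 5 seconds for 10 minutes
for i in $(seq 1 120); do
    printf '%s  %s bits\n' "$(date +%T)" "$(cat /proc/sys/kernel/random/entropy_avail)"
    sleep 5
done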

We further used rngtest, a utility provided by the rng-tools package, to perform tests on /dev/random and assess the quality of its randomness.

Install rngd with sudo yum install -y rng-tools (or equivalent for your distribution).

Then, run tests like:

# cat /dev/random | rngtest -c 100

This will perform 100 FIPS tests and report on the randomness of the data from /dev/random.

The results showed that FIPS testing failed intermittently across the FIPS test types, with failure counts in the low single digits.

Solution

After evaluating all potential sources of encryption-related issues, we identified three key problems to resolve: first, the rngd service needed to be installed; second, the s3fs programs required recompilation and updating to improve stat cache utilization; and third, every directory in the cloud object storage bucket needed to be touched with an additional / appended.

The rngd service is easy to configure. We run Red Hat 8.10, so we installed the rng-tools package and then enabled and started the service.

# yum install -y rng-tools
# systemctl enable rngd
# systemctl start rngd

# systemctl status rngd
● rngd.service - Hardware RNG Entropy Gatherer Daemon
   Loaded: loaded (/usr/lib/systemd/system/rngd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon ......
 Main PID: 2451155 (rngd)
   CGroup: /system.slice/rngd.service
           └─2451155 /usr/sbin/rngd -f --rng-device=/dev/hwrng --fill-watermark=0 -x pkcs11 -x nist -x qrypt -D daemon:daemon

The compilation instructions for s3fs are located at https://github.com/s3fs-fuse/s3fs-fuse/blob/master/COMPILATION.md. Following these guidelines, we updated our build to version v1.95 and subsequently deployed the new binary. This update enabled the use of several new options, including use_path_request_style, enable_noobj_cache, stat_cache_expire=3600, and kernel_cache. The revised methods in this version changed how target URLs for objects are constructed, thereby reducing the occurrence of 404 errors and improving the efficiency of HTTPS communication with object storage.
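For illustration, a mount invocation using those options might look like the sketch below (the bucket name, mount point, endpoint URL, and credentials file are placeholders for our environment):

# Mount the COS bucket with the new caching and path-style options:
#   use_path_request_style - use path-style URLs rather than virtual-host-style URLs
#   enable_noobj_cache     - cache "object does not exist" lookups
#   stat_cache_expire=3600 - keep stat cache entries for an hour
#   kernel_cache           - let the kernel page cache serve repeated reads
s3fs <bucket> <mount_point> \
    -o passwd_file=/etc/passwd-s3fs \
    -o url=https://<cos_endpoint> \
    -o use_path_request_style,enable_noobj_cache,stat_cache_expire=3600,kernel_cache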

Finally, we had to mkdir every directory on COS both with and without a trailing /.

$ cd <Cloud_Object_Base_Mount_Point>
$ find . -type d | while read -r dir; do mkdir "$dir"; mkdir "$dir/"; done

This ensured that all of the objects on the storage have both object types for curl to find. Cloud Object Storage is not a filesystem, and ensuring that both types exist helps with the transactions used to POST and GET when storing data. This further reduces the 404 errors and the network activity needed when using an s3fs-type storage device.

We then checked the entropy state after applying all of these changes, using the same rngtest command, and found that all of the FIPS tests came back with 0 errors.
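A quick sketch of that post-fix verification:

# Entropy should now hold steady near the pool maximum
cat /proc/sys/kernel/random/entropy_avail

# Re-run the FIPS battery; a healthy source reports zero failures
cat /dev/random | rngtest -c 100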

Conclusion

While our specific environment might be unique, the diagnostic approach is universally applicable. Any server experiencing similar problems requires a holistic examination of all encryption processes, both inside and outside DB2. Furthermore, a detailed review of major processes involving network communication is warranted, as they are often contributing factors. This encompasses, but is not limited to, analyzing network monitors, DB2 database encryption, DB2 remote transactions, system network backup utilities, NFS mounts, incoming network port scanners, system monitoring programs, and system log message agents.

The problem of low entropy can be exacerbated in virtualized environments, particularly on VMs that lack a dedicated hardware encryption module. Servers encountering such issues necessitate a comprehensive, system-wide analysis to identify the root causes. Nevertheless, installing the rngd service constitutes a valuable initial step. For Linux VMs operating on KVM/QEMU, the most widely adopted and recommended methodology involves the utilization of VirtIO RNG. Additionally, haveged serves as another daemon whose purpose is to replenish the kernel's entropy pool, thereby ensuring an adequate supply of randomness through /dev/random and /dev/urandom.
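As a sketch, on a RHEL-family guest you can confirm whether a VirtIO RNG device is visible and, if desired, add haveged as a supplementary entropy daemon (haveged typically comes from EPEL rather than the base repositories):

# Inside the guest: check which hardware RNG backends the kernel sees
cat /sys/devices/virtual/misc/hw_random/rng_available
cat /sys/devices/virtual/misc/hw_random/rng_current   # a virtio entry here means VirtIO RNG is active

# Optional supplement: install and enable haveged
yum install -y haveged
systemctl enable --now haveged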

Keeping sufficient entropy available is critical for cryptographic operations and, consequently, vital for DB2 functionality in most production environments. DB2 utilizes SSL/TLS to secure both its database and network connections, making it highly dependent on a healthy entropy pool.

About the Author

Greg Sorensen is a DBA and Systems Admin for IBM Quoting with a Bachelor's Degree in Electrical Engineering from UT Austin. He has worked primarily with AIX, Linux, DB2, and systems programming. He can be reached at gsorense@us.ibm.com.
