
Linux kernel crash-dump mechanism (KDUMP and FADUMP)

By Sachin P Bappalige posted Fri June 14, 2024 06:03 AM

  

Authors:  Praveen Pandey (praveen.pandey@in.ibm.com) ,  Sachin Bappalige (sachin.pb@in.ibm.com)

Abstract 

This blog illustrates ways to use the kdump and fadump features on Linux-based systems. It also covers how to configure kdump and fadump for different Linux distributions and dump targets (disk or network).

Introduction 

In large enterprise deployments, it's crucial to have tools in place for tracing, recording, and analyzing system state for serviceability when problems occur. This is where First Failure Data Capture (FFDC) comes into play: it allows users to gather all the information needed to debug a problem the first time it occurs. FFDC is especially important for the operating system (OS), which plays a central role in the entire system. When a crash or hang occurs, capturing the state of the OS at that moment is the first step towards identifying the root cause of the problem.

In 2006, IBM introduced the kdump mechanism as the go-to mechanism for OS crash dumps in Linux environments. Kdump is a feature of the Linux kernel that creates a crash dump in the event of a kernel crash. When triggered, kdump exports a memory image (also known as a vmcore) that can be analyzed to debug and determine the cause of the crash. Kdump is very reliable because the crash dump is captured from the context of a freshly booted kernel and not from the context of the crashed kernel. Kdump uses kexec to boot into a second kernel whenever the system crashes. This second kernel, often called the crash kernel, boots with very little memory and captures the dump image. In other words, kdump is a kexec-based crash dumping solution.

Alternatively, for IBM POWER systems, there's the Firmware Assisted Dump (fadump) mechanism. Fadump serves as an alternative to kdump, capturing the vmcore file from a fully reset system in which PCI and I/O devices have been reinitialized. This mechanism employs firmware to preserve memory regions during a crash and then uses the kdump userspace scripts to save the vmcore file. Fadump offers enhanced reliability compared to traditional dump types because it reboots the partition and uses a new kernel to dump the data from the previous kernel crash.

KDUMP 

kdump uses kexec, a kernel-to-kernel bootloader that bypasses the firmware. Kexec had already been used for fast reboots, and extending it to boot into a new (minimal) kernel in a reserved memory region, without disturbing the contents of the rest of RAM, was an appealing way to capture the OS state at the time of failure. Kdump therefore uses a two-kernel approach: the production kernel runs as normal, while a minimal kernel resides in a reserved memory area and is booted if the production kernel crashes. Once the minimal kernel has booted, the contents of the (untouched) RAM can be accessed and written out in the ELF format for analysis with tools like crash. The system administrator can decide where the dump is stored and can filter the dump to reduce its size.

Configuring the kdump target: When a kernel crash is captured, the core dump can be stored as a file in a local file system, written directly to a device, or sent over a network using the NFS (Network File System) or SSH (Secure Shell) protocol. Only one of these options can be set at a time. The default option is to store the vmcore file in the /var/crash directory of the local file system.

To reduce the size of the vmcore dump file, kdump allows you to specify an external application (a core collector) to compress the data, and optionally leave out all irrelevant information. Currently, the only fully supported core collector is makedumpfile.

By default, kdump uses the kexec system call to boot into the second kernel (a capture kernel) without rebooting, captures the contents of the crashed kernel’s memory (a crash dump or a vmcore), and saves it into a file. After a successful save, kdump reboots the machine. However, if kdump fails to create a core dump at the specified target location, kdump by default reboots the system without saving the vmcore.

Although kdump is an excellent solution to a critical problem, it has some drawbacks. Once the OS crashes, the system is in an inconsistent state, especially the devices. While utmost care is taken to prevent failures, in some rare cases a rogue DMA or a badly behaving device driver can cause the kdump capture to fail. There is a continued effort to make kdump more robust.

KEXEC 

Kexec plays a crucial role in the operation of kdump by allowing the system to boot into a secondary kernel, known as the kdump kernel, upon a kernel crash. Here's how it works and how it's designed:

Kexec Overview

Kexec is a mechanism in the Linux kernel that allows booting a new kernel from the currently running kernel without going through the entire system boot process. It essentially bypasses the BIOS or bootloader and directly loads a new kernel into memory. This process is significantly faster than a full system reboot.

Kdump Integration

In the context of kdump, kexec is used to preload a special "kdump kernel" and to boot into it when a kernel crash occurs. This kdump kernel is a minimalistic kernel designed specifically for capturing crash dumps. It resides in a reserved area of memory, separate from the primary kernel's memory space.

Crash Detection

When the primary kernel encounters a critical error or crashes, it triggers a kernel panic. The panic path hands control to the kexec mechanism, which then boots the preloaded kdump kernel.

Kdump Kernel Initialization

The kdump kernel is responsible for collecting diagnostic information about the state of the system at the time of the crash. Upon booting, it initializes essential hardware drivers, mounts the necessary filesystems, and sets up communication channels for collecting crash dump data.

Crash Dump Capture

Once the kdump kernel is operational, it proceeds to capture the memory dump. This includes gathering information about the state of the system's memory, including kernel memory, process information, and other relevant data structures. The captured data is then saved to a designated location, typically a disk partition, for later analysis.

Minimalistic Design

The kdump kernel is intentionally kept minimalistic to reduce the risk of encountering the same issue that caused the primary kernel to crash. It includes only essential drivers and functionalities required for crash dump capture, minimizing the chances of encountering complex bugs or errors during the dump process.

After the crash dump is captured, the kdump kernel triggers a system reboot to restore normal operation.

Kexec is an integral component of the Kdump feature, facilitating the rapid booting of a specialized kdump kernel for capturing crash dumps, thereby enabling efficient debugging and analysis of kernel issues.

kdump configurations 

Red Hat, SUSE, Ubuntu and CentOS are all Linux distributions with different characteristics, purposes, and target audiences. Red Hat and SUSE produce powerful but different enterprise Linux environments. All Linux distributions share similar commands, directory structures and general features, but they differ in package manager, distribution-specific commands and pre-installed software. As a result, the kdump configuration files and packages differ between RHEL and SLES.

RHEL-specific configurations:

Installing kdump:

The kdump service is installed and activated by default.

To check whether kdump is installed on your system:

rpm -q kexec-tools

Example:

# rpm -qa | grep kexec-tools

kexec-tools-2.0.27-8.el9.ppc64le

If it is not installed, install it with this command:

dnf install kexec-tools

On RHEL, you can use the kdumpctl utility, which supports the estimate, start, stop, status, restart, reload and rebuild subcommands.

Check related kdumpctl log messages in "/var/log/kdump.log".

Configuring the memory usage

Memory for the kdump kernel is always reserved during system boot, which means that the amount of memory is specified in the system’s boot loader configuration.

To specify the memory reserved for the kdump kernel, set the crashkernel= option to the required value.

You can also set the amount of reserved memory to be variable, depending on the total amount of installed memory. The syntax for variable memory reservation is crashkernel=<range1>:<size1>,<range2>:<size2>.

Example: 

crashkernel=2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G
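To make the range syntax concrete, here is a minimal shell sketch that reports which size a range list would select for a given amount of RAM. The helper names to_mib and crashkernel_for are hypothetical, only the M and G suffixes are handled, and the rule it mirrors is an assumption about how the ranges are read: the first range containing the RAM size is used, with the start inclusive, the end exclusive, and an empty end meaning no upper limit. The kernel performs this parsing itself at boot; the sketch is only for reasoning about a setting before applying it.

```shell
#!/bin/sh
# Hypothetical helper: convert a size such as 512M or 4G to MiB.
# Only the M and G suffixes are handled in this sketch.
to_mib() {
    case "$1" in
        *G) echo $(( ${1%G} * 1024 )) ;;
        *M) echo "${1%M}" ;;
        *)  return 1 ;;
    esac
}

# crashkernel_for <range list> <system RAM in MiB>
# Prints the size of the first range that contains the RAM size and
# returns non-zero when no range matches.
crashkernel_for() {
    ranges=$1
    ram_mib=$2
    old_ifs=$IFS
    IFS=,                       # split the list on commas
    for entry in $ranges; do
        range=${entry%%:*}      # e.g. 4G-16G
        size=${entry#*:}        # e.g. 512M
        start=$(to_mib "${range%%-*}")
        end=${range#*-}         # empty for an open-ended range such as 128G-
        if [ -n "$end" ]; then
            end=$(to_mib "$end")
        fi
        if [ "$ram_mib" -ge "$start" ] && { [ -z "$end" ] || [ "$ram_mib" -lt "$end" ]; }; then
            IFS=$old_ifs
            echo "$size"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# A machine with 32 GiB (32768 MiB) falls in the 16G-64G range.
crashkernel_for "2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G" 32768
```

With the example list above, a RAM size below 2G matches no range, so the function prints nothing and returns a non-zero status.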

To check the crashkernel size, read /proc/cmdline, both in the default case (case 1) and after setting a custom crashkernel size (case 2).

In addition, the kdumpctl estimate command reports the recommended crashkernel size for the machine, which helps to avoid underestimating or overestimating the reservation.


kdump.conf is the configuration file for the kdump kernel crash collection service. kdump.conf provides post-kexec instructions to the kdump kernel. It is stored in the initrd file managed by the kdump service. After modifying this file, restart the kdump service to rebuild the initrd.

To check whether the kdump service is enabled and to start it, use systemctl (for example, systemctl is-enabled kdump and systemctl start kdump).

Configuring the kdump target

When a kernel crash is captured, the core dump can be stored as a file in a local file system, written directly to a device, or sent over a network using the NFS (Network File System) or SSH (Secure Shell) protocol. Only one of these options can be set at a time. The default option is to store the vmcore file in the /var/crash directory of the local file system.

Example: 

To store the vmcore file in /var/crash/ directory of the local file system, edit the /etc/kdump.conf file and specify the path:

path /var/crash

The path /var/crash option specifies the file system path in which kdump saves the vmcore file.

NOTE: Other dump targets can be configured as shown below:

1) To write the dump to a remote directory over SSH

Example:

(a) Edit "/etc/kdump.conf" on RHEL:

ssh root@a.b.c.d

sshkey /root/.ssh/id_rsa

(b) Edit "/etc/sysconfig/kdump" on SLES:

KDUMP_SAVEDIR="ssh://root@a.b.c.d/var/crash"

-----------------------------------

2) To write the dump to a remote directory over NFS

Example:

(a) Edit "/etc/kdump.conf" on RHEL:

nfs a.b.c.d:/kdump_nfs/

(b) Edit "/etc/sysconfig/kdump" on SLES:

KDUMP_SAVEDIR="nfs://a.b.c.d/NFS-kdump_Dir"

3) To write the dump to a different partition

Example :

ext4 UUID=03138356-5e61-4ab3-b58e-27507ac41937

4) To write the dump directly to a device

Example:

raw /dev/sdb1

5) To write the dump to persistent memory (PMEM)

Example :

(a) Edit "/etc/kdump.conf" on RHEL:

path /pmem0

NOTE: This is not a recommended dump target as per the documentation.

(b) Edit "/etc/sysconfig/kdump" on SLES:

KDUMP_SAVEDIR="/pmem0"

            

Configuring the core collector

To reduce the size of the vmcore dump file, kdump allows you to specify an external application (a core collector) that compresses the data and optionally leaves out irrelevant information. Currently, the only fully supported core collector is makedumpfile.

(a) Edit "/etc/kdump.conf" on RHEL:

Example :

 core_collector makedumpfile -l --message-level 1 -d 31

Sample file:

# cat /etc/kdump.conf | grep -v '#' | grep -v '^$'

path /any/path

auto_reset_crashkernel yes

core_collector makedumpfile -E --message-level 1 -d 23

where -l ===>  Compress the dump data by each page using lzo
      -d ===>  Exclude the specified types of pages from the dump
      -E ===>  Create DUMPFILE in the ELF format

--------------------------

Below are the tables for message level (Table 1) and dumplevel (Table 2).

Table 1 : Specify the message types [--message-level ML] with makedumpfile.

Table 2 : Specify the dumplevel [-d DL]. The dumplevel selects the types of pages that are unnecessary for analysis; pages of the specified types are not copied to DUMPFILE. The dumplevel consists of five bits, so there are five base levels, which can be added together:

     1 : Exclude the pages filled with zero.
     2 : Exclude the non-private cache pages.
     4 : Exclude all cache pages.
     8 : Exclude the user process data pages.
    16 : Exclude the free pages.
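Because the dumplevel is a sum of the five base levels, a given -d value can be decoded bit by bit. The following is a minimal sketch; the function name describe_dumplevel is hypothetical and the labels simply repeat the list above. It does not call makedumpfile, it only explains a number.

```shell
#!/bin/sh
# describe_dumplevel <level 0-31>
# Prints one line for each page type that the given dumplevel excludes.
describe_dumplevel() {
    level=$1
    for entry in "1:zero pages" "2:non-private cache pages" "4:all cache pages" \
                 "8:user process data pages" "16:free pages"; do
        bit=${entry%%:*}
        # Test whether this base level is part of the sum.
        if [ $(( $level & $bit )) -ne 0 ]; then
            echo "${entry#*:}"
        fi
    done
}

# -d 23 = 16 + 4 + 2 + 1: everything except user process data pages.
describe_dumplevel 23
```

Decoding 31 lists all five page types, and decoding 23 (the value in the sample file) shows that user process data pages are kept in the dump.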

(b) Edit "/etc/sysconfig/kdump" on SLES:

Allocate Memory for Kdump

Reserve memory for Kdump to use when capturing crash dumps. You can allocate memory by editing the crashkernel parameter in the GRUB configuration file (/etc/default/grub) on SLES.

GRUB_CMDLINE_LINUX_DEFAULT="quiet crashkernel=2048M"

After editing the GRUB configuration file, regenerate the GRUB configuration by running:

grub2-mkconfig -o /boot/grub2/grub.cfg

Note: On RHEL, use the grubby command instead, as shown below:

grubby --args="crashkernel=2048M" --update-kernel=/boot/vmlinuz-`uname -r`

Reboot the machine for the change to take effect.

-------------------------------------------

Blacklisting kernel drivers is a mechanism to prevent them from being loaded and used. Listing drivers in the /etc/sysconfig/kdump file prevents the kdump initramfs from loading the blacklisted modules. This can help avoid out-of-memory (OOM) killer invocations and other crash kernel failures. To blacklist kernel drivers, update the KDUMP_COMMANDLINE_APPEND= variable in the /etc/sysconfig/kdump file and specify one of the following blacklisting options:

rd.driver.blacklist=<modules>

modprobe.blacklist=<modules>

Problems with current kdump approach

1. Kexec-based approach.
2. Kdump kernel is pre-loaded in the reserved memory area.
3. Dependent on crashed kernel to kexec into kdump kernel.
4. Devices are in inconsistent state.
5. DMA may be in progress that may corrupt the kdump kernel.
6. Second reboot required after dump capture

Firmware-Assisted Dump [FADUMP]

A robust mechanism for obtaining a reliable kernel crash dump with assistance from Power firmware. It is an IBM Power-specific feature. This approach does not use kexec; instead, the firmware boots a capture kernel while preserving the contents of memory. Unlike kdump, the system is reset and loaded with a fresh copy of the kernel. In particular, PCI and I/O devices are reinitialized so that they are in a clean, consistent state. This second kernel, often called a capture kernel, boots with very little memory and captures the dump image.

Advantages over kdump are listed below:

1.      Fully reset system

2.      Loaded with fresh copy of the kernel.

3.      PCI and I/O devices are in clean state.

4.      Second reboot is not required.

The first kernel registers sections of memory with the Power firmware for dump preservation during OS initialization. These registered sections are reserved by the first kernel during early boot. When the system crashes, the Power firmware fully resets the system, preserves all system memory contents, and saves the low memory (boot memory, sized as the larger of 5% of system RAM or 256MB) to the previously registered region. It also saves system registers and hardware PTEs. Fadump is supported only on the ppc64 platform. The standard kernel and the capture kernel are one and the same on ppc64.
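The boot memory sizing rule stated above (the larger of 5% of system RAM or 256MB) can be worked through with a short calculation. This is only a sketch of the rule as described in the text: the function name boot_mem_mib is hypothetical, values are in MiB with integer arithmetic, and a real kernel may apply additional alignment or version-specific adjustments that are not modelled here.

```shell
#!/bin/sh
# boot_mem_mib <system RAM in MiB>
# Prints the larger of 5% of system RAM and 256 MiB.
boot_mem_mib() {
    ram_mib=$1
    five_pct=$(( $ram_mib * 5 / 100 ))
    if [ "$five_pct" -gt 256 ]; then
        echo "$five_pct"
    else
        echo 256
    fi
}

boot_mem_mib 4096     # 5% is about 204 MiB, so the 256 MiB floor applies
boot_mem_mib 65536    # 5% of 64 GiB is about 3276 MiB
```

On small partitions the 256MB floor dominates, while on large-memory systems the 5% term determines the size.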

Fadump Operational Flow

Like kdump, fadump also exports the ELF formatted kernel crash dump through /proc/vmcore. Hence existing kdump infrastructure can be used to capture fadump vmcore.

The idea is to keep the functionality transparent to the end user. From the user's perspective, there is no change in the way the kdump init script works. However, unlike kdump, fadump does not preload the kdump kernel and initrd into reserved memory; instead, it always uses the default OS initrd during the second boot after a crash. Hence, for fadump, the kdump initrd is rebuilt and replaces the default initrd. Before the existing default initrd is replaced, a backup of the original is taken for the user's reference. The dracut package has been enhanced to rebuild the default initrd with the vmcore capture steps. The initrd image is rebuilt as per the configuration in the /etc/kdump.conf file.

The control flow of fadump is as follows:

1. Registration 

·        OS registers sections of memory for dump preservation during first kernel.

2. OS Crash 

·        OS terminates abnormally.

·        The firmware moves the registered sections of memory as instructed during dump registration.

·        The partition reboots and provides prior registration data in the device tree.

3. Save dump and continue 

·        The OS saves the preserved memory regions to disk.

·        The OS completes/invalidates the current dump status.

Memory Reservation Map – First kernel 

First kernel initialization proceeds through the steps below:

1)     Check if fadump=on boot option is specified.

2)     Check if firmware supports the feature.

3)     Check if fadump_reserve_memory= boot option is specified

            – If not, then calculate boot memory size.

4)     Reserve the memory required to hold following:

           – CPU State Data

           – HPTE region

           – Boot memory dump (Real Mode Region)

           – ELF core header and fadump crash info structure.

5)     echo 1 to /sys/kernel/fadump/registered for fadump registration (userspace).

6)     Gather crash memory regions information.

7)     Generate ELF core headers.

            – PT_LOAD program headers for crash memory regions.

            – Place holder for CPU crash notes (PT_NOTE).

Memory Reservation Map – Second kernel

Second kernel Initialization (After Crash)

1)     Check if fadump=on boot option is specified.

2)     Check if firmware supports the feature.

3)     Check if dump is active

       – ibm,kernel-dump RTAS property present under device tree

4)     Reserve all the memory except boot memory.

5)     Verify and read CPU State Dump data.

6)     Build ELF CPU notes using CPU State Dump data.

7)     Export ELF core header through /proc/vmcore

       – Capture and save the dump (/proc/vmcore) to disk, that is, copy /proc/vmcore to /var/crash/ (the default dump target).

8)     Export /sys/kernel/fadump_release_mem file.

– echo 1 to /sys/kernel/fadump_release_mem to release reserved memory.

– The reserved memory is released for regular use.

– Kernel re-registers for next kernel dump.

– At this point, the memory reservation map looks as shown in the diagram.

GRUB memory allocation matters for understanding how the bootloader manages memory during boot on systems that use fadump for crash dump analysis. GRUB treats memory differently depending on whether the kernel being booted is the production (regular) kernel or the dump capture kernel (the kernel that boots after a crash). For the production kernel, GRUB considers all memory except the reserved area as available for the boot process (loading the kernel, initrd, and so on). For the dump capture kernel, there is an additional limit, X, which can lie anywhere between 0 and the top of memory; GRUB considers only the memory below X (except the reserved area) as available. X corresponds to the memory reserved for fadump; here, X is 768MB.

(a) The available memory for dump capture kernel is 0-768MB except the reserved area.

(b) The available memory for production kernel is 0 to Top of memory except the reserved area.

   * |---------- Top of memory ----------|
   * |                                   |
   * |             available             |
   * |                                   |
   * |----------     768 MB    ----------|
   * |                                   |
   * |              reserved             |
   * |                                   |
   * |--- 768 MB - runtime min space  ---|
   * |                                   |
   * |             available             |
   * |                                   |
   * |----------      0 MB     ----------|

In the context of fadump, GRUB's memory allocation strategy is significant because it influences how memory is managed during the boot process, including the memory set aside for capturing the crash dump. The amount required depends on the number of CPUs, the size of system memory, and the configuration, such as LPARs with multipath setups or LPARs with different LMB sizes. Ensuring sufficient memory and proper allocation is crucial for fadump, which relies on capturing memory contents during system failures for diagnostic purposes.

The firmware-assisted dump feature uses the sysfs file system to hold its control files and a debugfs file to display the reserved memory regions. The fadump sysfs files are located in the /sys/kernel/fadump directory. Here is the list of files under kernel sysfs:
 
1) /sys/kernel/fadump/enabled
 
    This is used to display the fadump status.
    0 = fadump is disabled
    1 = fadump is enabled
 
    This interface can be used by kdump init scripts to identify if fadump is enabled in the kernel and act accordingly.
 
 2) /sys/kernel/fadump/registered
 
    This is used to display the fadump registration status as well as to control (start/stop) the fadump registration.
    0 = fadump is not registered.
    1 = fadump is registered and ready to handle system crash.
 
To register fadump, run echo 1 > /sys/kernel/fadump/registered; to un-register and stop fadump, run echo 0 > /sys/kernel/fadump/registered. Once fadump is un-registered, a system crash will not be handled and the vmcore will not be captured.
 
3) /sys/kernel/fadump/mem_reserved

    This is used to query the memory reserved by fadump for saving the crash dump.

 4) /sys/kernel/fadump/release_mem

    This file is available only when fadump is active during the second kernel. It is used to release the reserved memory region that is held for saving the crash dump. To release the reserved memory, echo 1 to it.

Example: Verify or set the values in fadump sys control files and debug files
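As a rough illustration of how these control files can be read, the sketch below prints the values of the files listed above when they are present. The function name fadump_status is hypothetical, and the optional directory argument exists only so the function can be tried against a scratch directory on a machine without fadump; by default it reads /sys/kernel/fadump. It only reads files and never writes to them.

```shell
#!/bin/sh
# fadump_status [sysfs directory]
# Prints name=value for each readable fadump control file and returns
# non-zero when the directory does not exist (for example, on a system
# without fadump support).
fadump_status() {
    dir=${1:-/sys/kernel/fadump}
    if [ ! -d "$dir" ]; then
        echo "fadump: not available ($dir not found)"
        return 1
    fi
    for name in enabled registered mem_reserved; do
        if [ -r "$dir/$name" ]; then
            echo "$name=$(cat "$dir/$name")"
        fi
    done
}

# Report the state of the running system; ignore the failure status
# when fadump is not available.
fadump_status || true
```

On a system where fadump is enabled and registered, the enabled and registered lines would both be expected to show 1, per the descriptions above.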

Configuring  Fadump

(a) Configurations on SLES

Step 1: Set KDUMP_FADUMP to yes in /etc/sysconfig/kdump.

Step 2: Launch the YaST tool provided by SUSE by opening a terminal and running the command yast2 kdump.

  • Navigate to System Settings: In the YaST Control Center, navigate to "System" settings. This can typically be found under the "System" section in the YaST Control Center.
  • Select Kernel kdump Settings: Within the "System" settings, choose "Kernel Settings". This option allows you to configure kernel-related settings, including fadump.
  • Enable fadump: Look for an option related to crash dumps or fadump, for example one labeled "Use Firmware-Assisted Dump". Select this option to enable fadump.
  • Configure Memory Allocation: Specify the memory allocation for fadump. This usually involves setting the amount of memory reserved for crash kernel using parameters like crashkernel=<memory size>.
  • Apply Changes: After configuring fadump settings, apply the changes to save the configuration.

  • Reboot the System: Reboot the system to apply the changes made to the kernel settings.
  • Verify Configuration: After the system reboots, verify that fadump is enabled and configured correctly. You can do this by checking the kernel parameters using commands like cat /proc/cmdline or dmesg | grep fadump.
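The check against /proc/cmdline can be scripted. The sketch below extracts the value of a single kernel parameter from a command line string; the function name cmdline_value and the sample command line are hypothetical. When no string is passed, it reads /proc/cmdline of the running system.

```shell
#!/bin/sh
# cmdline_value <parameter> [command line]
# Prints the value of the first "<parameter>=<value>" word found, or
# nothing when the parameter is absent. The parameter name is used as a
# plain word, so it should not contain regular expression characters.
cmdline_value() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    printf '%s\n' "$cmdline" | tr ' ' '\n' | sed -n "s/^${param}=//p" | head -n 1
}

# Hypothetical command line, used here instead of the running system's.
sample="root=/dev/sda2 quiet crashkernel=2048M fadump=on"
echo "fadump=$(cmdline_value fadump "$sample")"
echo "crashkernel=$(cmdline_value crashkernel "$sample")"
```

An empty value for fadump after the reboot would indicate that the parameter did not reach the kernel command line, in which case the GRUB configuration is the first place to look.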

NOTE: Instead of using yast2 as described in step 2 above, you can append crashkernel=<value> (for example, crashkernel=2048M) and fadump=on to the end of the GRUB_CMDLINE_LINUX_DEFAULT entry in /etc/default/grub and run grub2-mkconfig -o /boot/grub2/grub.cfg as the root user. For more details, see step 1 and step 2 below:

Step 1: Pre-check the machine for existing configurations

Step 2: Modify the config files

(b) Configurations on RHEL

Add fadump=on to the list of kernel parameters, either for all installed kernels or only for the running kernel:

grubby --args="fadump=on" --update-kernel=ALL

grubby --args="fadump=on" --update-kernel=/boot/vmlinuz-`uname -r`

You can also use grubby to reserve a custom amount of memory:

grubby --args="crashkernel=2048M" --update-kernel=/boot/vmlinuz-`uname -r`

To disable fadump, remove fadump=on from the list of kernel parameters:

grubby  --remove-args=fadump=on --update-kernel=/boot/vmlinuz-`uname -r`

To test the configuration, reboot the system with kdump enabled, and make sure that the service is running:

systemctl is-active kdump

Then type the following commands at a shell prompt:

echo 1 > /proc/sys/kernel/sysrq

echo c > /proc/sysrq-trigger

This forces the Linux kernel to crash, and the IPaddress-YYYY-MM-DD-HH:MM:SS/vmcore file is copied to the location you have selected in the configuration (that is, to /var/crash/ by default).

From SLES 15 SP6 onwards, SUSE has changed the naming convention of the dump directory to IPaddress-YYYY-MM-DD-HH-MM. As per the kdump-save script on the machine, the vmcore file is saved in the flattened format by default. Details are available in README.txt.

Internally used dump command :
 
DUMP_COMMAND="makedumpfile -F ${FORMAT} ${THREADS} --message-level $MSG_LEVEL -d ${KDUMP_DUMPLEVEL} ${MAKEDUMPFILE_OPTIONS} /proc/vmcore"
where -F ===>  Output the dump data in the flattened format


To determine the cause of the system crash, you can use the crash utility. This utility allows you to interactively analyze a running Linux system as well as a core dump created by kdump. When started, it presents an interactive prompt very similar to the GNU Debugger (GDB). Alternatively, the drgn tool is designed as a library that can be used to build debugging and introspection tools.

Special thanks to Hari Bathini, Sourabh Jain and Mahesh Salgaonkar for sharing their knowledge and expertise, which were instrumental in bringing this much-needed document together!
