Chapter 5. Important changes to external kernel parameters

This chapter provides system administrators with a summary of significant changes in the kernel shipped with Red Hat Enterprise Linux 8.7. These changes could include for example added or updated proc entries, sysctl, and sysfs default values, boot parameters, kernel configuration options, or any noticeable behavior changes.

New kernel parameters

idxd.tc_override = [HW]

With this parameter in the <bool> format you can allow override of default traffic class configuration for the device.

The default value is set to false (0).

kvm.eager_page_split = [KVM,X86]

With this parameter you can control whether or not a KVM proactively splits all huge pages during dirty logging. Eager page splitting reduces interruptions to vCPU execution by eliminating the write-protection faults and Memory Management Unit (MMU) lock contention that is otherwise required to split huge pages lazily.

VM workloads that rarely perform writes or that write only to a small region of VM memory can benefit from disabling eager page splitting to allow huge pages to still be used for reads.

The behavior of eager page splitting depends on whether the KVM_DIRTY_LOG_INITIALLY_SET option is enabled or disabled.

  • If disabled, all huge pages in a memslot are eagerly split when dirty logging is enabled on that memslot.
  • If enabled, eager page splitting is performed during the KVM_CLEAR_DIRTY ioctl() system call, and only for the pages being cleared.

    Eager page splitting currently only supports splitting huge pages mapped by the two dimensional paging (TDP) MMU.

    The default value is set to Y (on).

kvm.nx_huge_pages_recovery_period_ms = [KVM]

With this parameter you can control the time period at which KVM zaps 4 KiB pages back to huge pages.

  • If the value is a non-zero N, KVM zaps a portion of the pages every N milliseconds.
  • If the value is 0, KVM picks a period based on the ratio, such that a page is zapped after 1 hour on average.

    The default value is set to 0.

mmio_stale_data = [X86,INTEL]

With this parameter you can control mitigation for the Processor Memory-mapped I/O (MMIO) Stale Data vulnerabilities.

Processor MMIO Stale Data is a class of vulnerabilities that can expose data after an MMIO operation. Exposed data could originate or end in the same CPU buffers as affected by metadata server (MDS) and Transactional Asynchronous Abort (TAA). Therefore, similar to MDS and TAA, the mitigation is to clear the affected CPU buffers.

The available options are:

  • full: enable mitigation on vulnerable CPUs
  • full,nosmt: enable mitigation and disable SMT on vulnerable CPUs.
  • off: unconditionally disable mitigation

    On MDS or TAA affected machines, mmio_stale_data=off can be prevented by an active MDS or TAA mitigation as these vulnerabilities are mitigated with the same mechanism. Thus, in order to disable this mitigation, you need to specify mds=off and tsx_async_abort=off, too.

    Not specifying this option is equivalent to mmio_stale_data=full.

    For more information, see Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst.

rcutree.rcu_delay_page_cache_fill_msec = [KNL]
With this parameter you can set the page-cache refill delay in milliseconds in response to low-memory conditions. The range of permitted values is 0:100000.
rcuscale.kfree_rcu_test_double = [KNL]
With this parameter you can test the double-argument variant of the kfree_rcu() function. If this parameter has the same value as rcuscale.kfree_rcu_test_single, both the single- and double-argument variants are tested.
rcuscale.kfree_rcu_test_single = [KNL]
With this parameter you can test the single-argument variant of the kfree_rcu() function. If this parameter has the same value as rcuscale.kfree_rcu_test_double, both the single- and double-argument variants are tested.
retbleed = [X86]

With this parameter you can control mitigation of Arbitrary Speculative Code Execution with Return Instructions (RETBleed) vulnerability. The available options are:

  • off: no mitigation
  • auto: automatically select a migitation
  • auto,nosmt: automatically select a mitigation, disabling SMT if necessary for the full mitigation (only on Zen1 and older without STIBP).
  • ibpb: mitigate short speculation windows on basic block boundaries too. Safe, highest performance impact.
  • unret: force enable untrained return thunks, only effective on AMD f15h-f17h based systems.
  • unret,nosmt: like the unret option, will disable SMT when STIBP is not available.

    Selecting the auto option chooses a mitigation method at run time according to the CPU.

    Not specifying this option is equivalent to retbleed=auto.

s390_iommu_aperture = [KNL,S390]

With this parameter you can specify the size of the per device DMA address space accessible through the DMA and IOMMU APIs as a decimal factor of the size of main memory.

  • The default value is set to 1 which means that you can concurrently use as many DMA addresses as physical memory is installed, if supported by hardware, and thus map all of memory at once.
  • With a value of 2 you can map all of memory twice.
  • The value of 0 imposes no restrictions other than those given by hardware at the cost of significant additional memory use for tables.

Updated kernel parameters

acpi_sleep = [HW,ACPI]

Format: { s3_bios, s3_mode, s3_beep, s4_hwsig, s4_nohwsig, old_ordering, nonvs, sci_force_enable, nobl }

  • For more information on s3_bios and s3_mode, see Documentation/power/video.rst.
  • s3_beep is for debugging; it makes the PC’s speaker beep as soon as the kernel real-mode entry point is called.
  • s4_hwsig causes the kernel to check the ACPI hardware signature during resume from hibernation, and gracefully refuse to resume if it has changed. The default behavior is to allow resume and simply warn when the signature changes, unless the s4_hwsig option is enabled.
  • s4_nohwsig prevents ACPI hardware signature from being used, or even warned about, during resume. old_ordering causes the ACPI 1.0 ordering of the _PTS control method, with respect to putting devices into low power states, to be enforced. The ACPI 2.0 ordering of _PTS is used by default.
  • nonvs prevents the kernel from saving and restoring the ACPI NVS memory during suspend, hibernation, and resume.
  • sci_force_enable causes the kernel to set SCI_EN directly on resume from S1/S3. Even though this behavior is contrary to the ACPI specifications, some corrupted systems do not work without it.
  • nobl causes the internal denylist of systems known to behave incorrectly in some ways with respect to system suspend and resume to be ignored. Use this option wisely.

    For more information, see Documentation/power/video.rst.

crashkernel=size[KMG],high = [KNL, X86-64, ARM64]

With this parameter you can allocate physical memory region from top as follows:

  • If the system has more than 4 GB RAM installed, the physical memory region can exceed 4 GB.
  • If the system has less than 4 GB RAM installed, the physical memory region will be allocated below 4 GB, if available.

    This parameter is ignored if the crashkernel=X parameter is specified.

crashkernel=size[KMG],low = [KNL, X86-64]

When you pass crashkernel=X,high, the kernel can allocate a physical memory region above 4 GB. This causes the second kernel crash on systems that require some amount of low memory (for example, swiotlb requires at least 64M+32K low memory) and enough extra low memory to make sure DMA buffers for 32-bit devices are not exhausted. Kernel tries to allocate at least 256 M below 4 GB automatically. With this parameter you can specify the low range under 4 GB for the second kernel instead.

  • 0: disables low allocation. It will be ignored when crashkernel=X,high is not used or memory reserved is below 4 GB.
kvm.nx_huge_pages_recovery_ratio = [KVM]

With this parameter you can control how many 4KiB pages are periodically zapped back to huge pages:

  • 0 disables the recovery
  • N KVM will zap 1/Nth of the 4KiB pages every period.

    The default is set to 60.

module.sig_enforce = norid [S390]
With this parameter you can ignore the RID field and force the use of one PCI domain per PCI function.
rcu_nocbs[=cpu-list] = [KNL]

The optional argument is a CPU list.

In kernels built with CONFIG_RCU_NOCB_CPU=y, you can enable the no-callback CPU mode, which prevents such CPUs callbacks from being invoked in softirq context. Invocation of such CPUs' RCU callbacks will instead be offloaded to rcuox/N kthreads created for that purpose, where x is p for RCU-preempt, s for RCU-sched, and g for the kthreads that mediate grace periods; and N is the CPU number. This reduces OS jitter on the offloaded CPUs, which can be useful for HPC and real-time workloads. It can also improve energy efficiency for asymmetric multiprocessors.

  • If a cpulist is passed as an argument, the specified list of CPUs is set to no-callback mode from boot.
  • If the = sign and the cpulist arguments are omitted, no CPU will be set to no-callback mode from boot but you can toggle the mode at runtime using cpusets.
spectre_v2_user = [X86]

With this parameter you can control mitigation of Spectre variant 2 (indirect branch speculation) vulnerability between user space tasks.

  • auto: kernel selects the mitigation depending on the available CPU features and vulnerability.
  • The default mitigation is set to prctl.
  • Not specifying this option is equivalent to spectre_v2_user=auto.
spec_store_bypass_disable = [X86]

With this parameter you can control whether the Speculative Store Bypass (SSB) optimization to mitigate the SSB vulnerability is used.

  • Not specifying this option is equivalent to spec_store_bypass_disable=auto.
  • The default mitigation is set to prctl.

New sysctl parameters

perf_user_access = [ARM64]

With this parameter you can control user space access for reading performance event counters.

  • When set to 1, user space can read performance monitor counter registers directly.
  • The default is set to 0, which means access disabled.

    For more information, see Documentation/arm64/perf.rst.

force_cgroup_v2_swappiness

With this parameter you can deprecate the per-cgroup swappiness value available only in cgroupsV1. Due to a systemd design choice, most of all system and user processes are run within a cgroup. Furthermore these cgroup swappiness values default to 60. This can lead to undesireable effects where systems swappiness value has little effect on the swap behavior of the system.

If you do want to use the per-cgroup swappiness feature, you can configure the system with force_cgroup_v2_swappiness=1 to have more consistent swappiness behavior across the whole system.

Note that this is a RHEL specific feature.