Chapter 6. Important changes to external kernel parameters

This chapter provides system administrators with a summary of significant changes in the kernel distributed with Red Hat Enterprise Linux 8.2. These changes include added or updated proc entries, sysctl, and sysfs default values, boot parameters, kernel configuration options, or any noticeable behavior changes.

6.1. New kernel parameters

cpuidle.governor = [CPU_IDLE]
Name of the cpuidle governor to use.
deferred_probe_timeout = [KNL]

This is a debugging parameter for setting a timeout in seconds for the deferred probe to give up waiting on dependencies to probe.

Only specific dependencies (subsystems or drivers) that have opted in will be ignored. A timeout of 0 times out at the end of initcalls. This parameter also dumps out the devices that are still on the deferred probe list after retrying.

kvm.nx_huge_pages = [KVM]

This parameter controls the software workaround for the X86_BUG_ITLB_MULTIHIT bug.

The options are:

  • force - Always deploy workaround.
  • off - Never deploy workaround.
  • auto (default) - Deploy workaround based on the presence of X86_BUG_ITLB_MULTIHIT.

If the software workaround is enabled for the host, guests do not need to enable it for nested guests.

kvm.nx_huge_pages_recovery_ratio = [KVM]
This parameter controls how many 4KiB pages are periodically zapped back to huge pages. A value of 0 disables the recovery. Otherwise, if the value is N, Kernel-based Virtual Machine (KVM) zaps 1/Nth of the 4KiB pages every minute. The default is 60.
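
As a rough illustration of the ratio, the following sketch computes how many 4KiB pages one recovery pass would cover. The helper name nx_zap_per_minute and the page counts are hypothetical, and plain integer division is a simplifying assumption; the kernel's exact rounding may differ.

```shell
# Hypothetical helper, not part of any kernel tooling.
# $1 = number of 4KiB pages, $2 = kvm.nx_huge_pages_recovery_ratio
nx_zap_per_minute() {
  if [ "$2" -eq 0 ]; then
    echo 0                # a ratio of 0 disables the recovery
  else
    echo $(( $1 / $2 ))   # 1/Nth of the pages per minute
  fi
}

nx_zap_per_minute 6000 60   # default ratio of 60
nx_zap_per_minute 6000 0    # recovery disabled
```
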
page_alloc.shuffle = [KNL]

Boolean flag to control whether the page allocator should randomize its free lists.

The randomization may be automatically enabled if the kernel detects it is running on a platform with a direct-mapped memory-side cache. This parameter can be used to override/disable that behavior.

The state of the flag can be read from the sysfs pseudo filesystem from the /sys/module/page_alloc/parameters/shuffle file.
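
A minimal sketch of reading the flag from the file named above. The helper read_param is a hypothetical name; it prints "unavailable" when the file cannot be read, for example on a kernel that does not expose this parameter.

```shell
# Hypothetical helper: print a parameter file, or "unavailable" if unreadable.
read_param() {
  if [ -r "$1" ]; then
    cat "$1"
  else
    echo "unavailable"
  fi
}

read_param /sys/module/page_alloc/parameters/shuffle
```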

panic_print =

Bitmask for printing system information when a panic occurs.

You can choose any combination of the following bits:

  • bit 0: print all tasks info
  • bit 1: print system memory info
  • bit 2: print timer info
  • bit 3: print locks info if the CONFIG_LOCKDEP kernel configuration is on
  • bit 4: print the ftrace buffer
  • bit 5: print all printk messages in buffer
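
The bit list above can be combined into a single value with a short helper. This is a sketch only: panic_print_mask is a hypothetical function name, and the chosen bits are an arbitrary example.

```shell
# Hypothetical helper: OR together the bits given as arguments.
# The function body runs in a subshell so its variables stay local.
panic_print_mask() (
  mask=0
  for bit in "$@"; do
    mask=$(( mask | (1 << bit) ))
  done
  echo "$mask"
)

# Tasks (bit 0), memory (bit 1), and timers (bit 2):
echo "panic_print=$(panic_print_mask 0 1 2)"
```
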
rcutree.sysrq_rcu = [KNL]
Commandeer a sysrq key to dump out Tree RCU’s rcu_node tree with an eye towards determining why a new grace period has not yet started.
rcutorture.fwd_progress = [KNL]
Enable Read-copy update (RCU) grace-period forward-progress testing for the types of RCU supporting this notion.
rcutorture.fwd_progress_div = [KNL]
Specify the fraction of a CPU-stall-warning period to do tight-loop forward-progress testing.
rcutorture.fwd_progress_holdoff = [KNL]
Number of seconds to wait between successive forward-progress tests.
rcutorture.fwd_progress_need_resched = [KNL]
Enclose cond_resched() calls within checks for need_resched() during tight-loop forward-progress testing.
tsx = [X86]

This parameter controls the Transactional Synchronization Extensions (TSX) feature in Intel processors that support TSX control.

The options are:

  • on - Enable TSX on the system. Although there are mitigations for all known security vulnerabilities, TSX accelerated several previous speculation-related CVEs. As a result, there may be unknown security risks associated with leaving it enabled.
  • off - Disable TSX on the system. This option takes effect only on newer CPUs that are not vulnerable to Microarchitectural Data Sampling (MDS). In other words, these CPUs have MSR_IA32_ARCH_CAPABILITIES.MDS_NO=1 and get the new IA32_TSX_CTRL Model-specific register (MSR) through a microcode update. This new MSR allows for a reliable deactivation of the TSX functionality.
  • auto - Disable TSX if X86_BUG_TAA is present, otherwise enable TSX on the system.

Not specifying this parameter is equivalent to tsx=off.

For details see the upstream kernel documentation.
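
To see which tsx setting a given command line results in, a sketch such as the following can help. The helper cmdline_value and the sample command line are hypothetical. The helper splits on whitespace only, so it does not handle quoted values that contain spaces, and it assumes that the last occurrence of a parameter wins.

```shell
# Hypothetical helper: print the value of parameter $1 found in the
# command-line string $2, or an empty line if the parameter is absent.
cmdline_value() (
  set -f                      # do not glob-expand words such as '*'
  name=$1
  value=
  for word in $2; do
    case $word in
      "$name"=*) value=${word#"$name"=} ;;
    esac
  done
  printf '%s\n' "$value"
)

sample='BOOT_IMAGE=/vmlinuz ro quiet tsx_async_abort=full'
value=$(cmdline_value tsx "$sample")
# An unspecified parameter is equivalent to tsx=off, as stated above.
echo "effective setting: tsx=${value:-off}"
```

On a running system, the string in /proc/cmdline can be passed in place of the sample.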

tsx_async_abort = [X86,INTEL]

This parameter controls mitigation for the TSX Async Abort (TAA) vulnerability.

Similar to Microarchitectural Data Sampling (MDS), certain CPUs that support Transactional Synchronization Extensions (TSX) are vulnerable to an exploit against CPU internal buffers. The exploit is able to forward information to a disclosure gadget under certain conditions.

In vulnerable processors, the speculatively forwarded data can be used in a cache side channel attack, to access data to which the attacker does not have direct access.

The options are:

  • full - Enable TAA mitigation on vulnerable CPUs if TSX is enabled.
  • full,nosmt - Enable TAA mitigation and disable Simultaneous Multi Threading (SMT) on vulnerable CPUs. If TSX is disabled, SMT is not disabled because the CPU is not vulnerable to cross-thread TAA attacks.
  • off - Unconditionally disable TAA mitigation.

    On MDS-affected machines, the tsx_async_abort=off setting can be overridden by an active MDS mitigation because both vulnerabilities are mitigated with the same mechanism. Therefore, to disable this mitigation, you need to specify the mds=off parameter as well.

    Not specifying this option is equivalent to tsx_async_abort=full. On CPUs which are MDS affected and deploy MDS mitigation, TAA mitigation is not required and does not provide any additional mitigation.

For details see the upstream kernel documentation.
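
Because the TAA and MDS mitigations share one mechanism, it can be useful to check that a command line carries both off settings. The following is a sketch under stated assumptions: both_mitigations_off is a hypothetical helper, the sample command line is invented, and the check does not account for mitigations=off.

```shell
# Hypothetical helper: succeed only if the command line in $1 contains
# both tsx_async_abort=off and mds=off.
both_mitigations_off() (
  set -f
  taa=no
  mds=no
  for word in $1; do
    case $word in
      tsx_async_abort=off) taa=yes ;;
      mds=off)             mds=yes ;;
    esac
  done
  [ "$taa" = yes ] && [ "$mds" = yes ]
)

if both_mitigations_off "ro quiet tsx_async_abort=off"; then
  echo "both parameters are set to off"
else
  echo "tsx_async_abort=off and mds=off are both required"
fi
```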

6.2. Updated kernel parameters

intel_iommu = [DMAR]

Intel IOMMU driver Direct Memory Access Remapping (DMAR).

The options are:

  • sm_on [Default Off] - By default, scalable mode will be disabled even if the hardware advertises that it has support for the scalable mode translation. With this option set, scalable mode will be used on hardware which claims to support it.

isolcpus = [KNL,SMP,ISOL]

This parameter isolates a given set of CPUs from disturbance.

  • managed_irq - A sub-parameter, which prevents the isolated CPUs from being targeted by managed interrupts, which have an interrupt mask containing isolated CPUs. The affinity of managed interrupts is handled by the kernel and cannot be changed via the /proc/irq/* interfaces.

    This isolation is best effort and is only effective if the automatically assigned interrupt mask of a device queue contains both isolated and housekeeping CPUs. If the housekeeping CPUs are online, such interrupts are directed to the housekeeping CPU so that I/O submitted on the housekeeping CPU cannot disturb the isolated CPU.

    If the queue’s affinity mask contains only isolated CPUs, this parameter has no effect on the interrupt routing decision. However, the interrupts are only delivered when the tasks running on those isolated CPUs submit I/O. I/O submitted on the housekeeping CPUs has no influence on those queues.
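
The routing rules described for managed_irq can be modeled as a small decision function. This is only an illustrative model, not kernel code: managed_irq_effect is a hypothetical name, CPU sets are passed as space-separated lists, and the housekeeping CPUs are assumed to be online.

```shell
# Hypothetical model of the managed_irq decision.
# $1 = CPUs in the queue's affinity mask, $2 = isolated CPUs
# (both as single-space-separated lists of CPU numbers).
managed_irq_effect() (
  iso=0
  hk=0
  for cpu in $1; do
    case " $2 " in
      *" $cpu "*) iso=1 ;;   # CPU is in the isolated set
      *)          hk=1  ;;   # any other CPU counts as housekeeping
    esac
  done
  if [ "$iso" = 1 ] && [ "$hk" = 1 ]; then
    echo "directed to housekeeping CPUs"
  elif [ "$iso" = 1 ]; then
    echo "no effect: mask contains only isolated CPUs"
  else
    echo "no effect: mask contains no isolated CPUs"
  fi
)

managed_irq_effect "0 1 2 3" "2 3"
managed_irq_effect "2 3" "2 3"
```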

mds = [X86,INTEL]

The changes to options:

  • off - On TSX Async Abort (TAA)-affected machines, the mds=off setting can be overridden by an active TAA mitigation because both vulnerabilities are mitigated with the same mechanism. Therefore, to disable this mitigation, you need to specify the tsx_async_abort=off kernel parameter as well.

Not specifying this parameter is equivalent to mds=full.

For details see the upstream kernel documentation.

mem_encrypt = [X86-64]

AMD Secure Memory Encryption (SME) control.

For details on when the memory encryption can be activated, see the upstream kernel documentation.

mitigations =

The changes to options:

  • off - Disable all optional CPU mitigations. This improves system performance, but it may also expose users to several CPU vulnerabilities.

    Equivalent to:

    • nopti [X86,PPC]
    • kpti=0 [ARM64]
    • nospectre_v1 [X86,PPC]
    • nobp=0 [S390]
    • nospectre_v2 [X86,PPC,S390,ARM64]
    • spectre_v2_user=off [X86]
    • spec_store_bypass_disable=off [X86,PPC]
    • ssbd=force-off [ARM64]
    • l1tf=off [X86]
    • mds=off [X86]
    • tsx_async_abort=off [X86]
    • kvm.nx_huge_pages=off [X86]

      Exceptions:

      This does not have any effect on kvm.nx_huge_pages when kvm.nx_huge_pages=force.

  • auto,nosmt - Mitigate all CPU vulnerabilities, disabling Simultaneous Multi Threading (SMT) if needed. This option is for users who always want to be fully mitigated, even if it means losing SMT.

    Equivalent to:

    • l1tf=flush,nosmt [X86]
    • mds=full,nosmt [X86]
    • tsx_async_abort=full,nosmt [X86]
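
The equivalence lists above can be expressed as a lookup, restricted here to the entries tagged [X86]. The function name x86_mitigations_equivalent is hypothetical, and the sketch covers only the two options described in this section.

```shell
# Hypothetical helper: print the X86 parameters that a mitigations=
# value stands for, according to the lists above. Other values print
# nothing. Note the exception: kvm.nx_huge_pages=force is not affected
# by mitigations=off.
x86_mitigations_equivalent() {
  case $1 in
    off)
      echo "nopti nospectre_v1 nospectre_v2 spectre_v2_user=off" \
           "spec_store_bypass_disable=off l1tf=off mds=off" \
           "tsx_async_abort=off kvm.nx_huge_pages=off"
      ;;
    auto,nosmt)
      echo "l1tf=flush,nosmt mds=full,nosmt tsx_async_abort=full,nosmt"
      ;;
  esac
}

x86_mitigations_equivalent auto,nosmt
```
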
rcutree.jiffies_till_sched_qs = [KNL]

This parameter sets the required age in jiffies for a given grace period before Read-copy update (RCU) starts soliciting quiescent-state help from the rcu_note_context_switch() and cond_resched() functions. If not specified, the kernel will calculate a value based on the most recent settings of the rcutree.jiffies_till_first_fqs and rcutree.jiffies_till_next_fqs kernel parameters.

This calculated value may be viewed in the rcutree.jiffies_to_sched_qs kernel parameter. Any attempt to set rcutree.jiffies_to_sched_qs will be overwritten.

tsc =

This parameter disables clocksource stability checks for Time Stamp Counter (TSC).

Format: <string>

The options are:

  • reliable [x86] - Marks the TSC clocksource as reliable. This option disables the clocksource verification at runtime, as well as the stability checks done at bootup. The option also enables the high-resolution timer mode on older hardware and in virtualized environments.
  • noirqtime [x86] - Do not use TSC to do Interrupt Request (IRQ) accounting. Use this option to disable IRQ_TIME_ACCOUNTING at run time on any platform where the Read Time-Stamp Counter (RDTSC) instruction is slow and this accounting can add overhead.
  • unstable [x86] - Marks the TSC clocksource as unstable. This option marks the TSC unconditionally unstable at bootup and avoids any further wobbles once the TSC watchdog notices.
  • nowatchdog [x86] - Disables the clocksource watchdog. The option is used in situations with strict latency requirements where interruptions from the clocksource watchdog are not acceptable.

6.3. New /proc/sys/kernel parameters

panic_print

Bitmask for printing system information when a panic occurs.

You can choose any combination of the following bits:

  • bit 0: print all tasks info
  • bit 1: print system memory info
  • bit 2: print timer info
  • bit 3: print locks info if the CONFIG_LOCKDEP kernel configuration item is on
  • bit 4: print ftrace buffer

    For example, to print tasks and memory info on panic, execute:

    # echo 3 > /proc/sys/kernel/panic_print

sched_energy_aware

This parameter enables or disables Energy Aware Scheduling (EAS).

EAS starts automatically on platforms with asymmetric CPU topologies which have an Energy Model available.

If your platform meets the requirements for EAS but you do not want to use it, change this value to 0.

6.4. Updated /proc/sys/kernel parameters

threads-max

This parameter controls the maximum number of threads the fork() function can create.

During initialization, the kernel sets this value in such a way that even if the maximum number of threads is created, the thread structures occupy only a part (1/8th) of the available RAM pages.

The minimum value that can be written to threads-max is 1. The maximum value is given by the constant FUTEX_TID_MASK (0x3fffffff).

If a value outside of this range is written to threads-max, the write fails with the EINVAL error.
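
The accepted range can be checked before writing to threads-max. The following is a minimal sketch: valid_threads_max is a hypothetical helper, and 1073741823 is the decimal form of FUTEX_TID_MASK (0x3fffffff).

```shell
# Hypothetical helper: succeed if $1 is an integer within the range
# that threads-max accepts (1 to 0x3fffffff = 1073741823).
valid_threads_max() {
  case $1 in
    ''|*[!0-9]*)  return 1 ;;   # empty or not a plain decimal number
    ???????????*) return 1 ;;   # 11 or more digits is always too large
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 1073741823 ]
}

for v in 0 1 32768 1073741824; do
  if valid_threads_max "$v"; then
    echo "$v: accepted"
  else
    echo "$v: rejected (EINVAL)"
  fi
done
```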

6.5. Updated /proc/sys/net parameters

bpf_jit_enable

This parameter enables the Berkeley Packet Filter Just-in-Time (BPF JIT) compiler.

BPF is a flexible and efficient infrastructure that allows bytecode to be executed at various hook points. It is used in a number of Linux kernel subsystems such as networking (for example XDP, tc), tracing (for example kprobes, uprobes, tracepoints) and security (for example seccomp).

LLVM has a BPF back-end that can compile restricted C into a sequence of BPF instructions. After a program is loaded through the bpf() system call and passes the verifier in the kernel, the JIT translates these BPF proglets into native CPU instructions.

There are two flavors of JIT. The newer eBPF JIT is currently supported on the following CPU architectures:

  • x86_64
  • arm64
  • ppc64 (both little and big endian)
  • s390x
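
A quick way to compare the running machine against the list above is to match the output of uname -m. This is a sketch only: ebpf_jit_listed_arch is a hypothetical helper, and it assumes the usual uname -m spellings, where arm64 systems report aarch64 and little-endian ppc64 systems report ppc64le.

```shell
# Hypothetical helper: succeed if the machine name in $1 corresponds
# to one of the architectures in the list above.
ebpf_jit_listed_arch() {
  case $1 in
    x86_64|aarch64|arm64|ppc64|ppc64le|s390x) return 0 ;;
    *) return 1 ;;
  esac
}

arch=$(uname -m)
if ebpf_jit_listed_arch "$arch"; then
  echo "$arch: listed as supported by the eBPF JIT"
else
  echo "$arch: not in the list"
fi
```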