The constant logging of "kernel: CMCI storm subsided" errors

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux

Issue

  • What is CMCI storm?
  • Why the CMCI storm error occurs within 5 minute gap?
  • Performance degradation due to CMCI errors in /var/log/messages:

    kernel: CMCI storm subsided: switching to interrupt mode
    kernel: CMCI storm detected: switching to poll mode
    

Resolution

  • Contact your hardware vendor to resolve this issue.
  • If the hardware vendor has their own mechanism for capturing hardware errors, then you can disable features such as EDAC and CMCI that push the hardware reporting into the OS instead of the hardware.
  • To disable CMCI, add "mce=no_cmci" in /boot/grub/grub.cfg

       mce=no_cmci
            Disable CMCI(Corrected Machine Check Interrupt) that
            Intel processor supports.  Usually this disablement is
            not recommended, but it might be handy if your hardware
            is misbehaving.
            Note that you'll get more problems without CMCI than with
            due to the shared banks, i.e. you might get duplicated
            error logs.
    

Root Cause

  • Starting with 45 nm Intel 64 processor on which CPUID reports DisplayFamily_DisplayModel as 06H_1AH,the processor can report information on corrected machine-check errors and deliver a programmable interrupt for software to respond to MC errors, referred to as corrected machine-check error interrupt (CMCI) Intel® 64 and IA-32 Architectures Software Developer’s Manual(Page 509).

  • CMCI is Intel's way of communicating EDAC messages. Intel's hardware can deliver interrupts when the level of errors exceeds a programmable threshold. If the error is persistent, the CPU(s) will then receive a constant influx or storm of interrupts at a high enough rate that it will impact the CPU's ability to do useful work. When that happens, the kernel disables the CMCI mechanism and reverts back to a more classical approach of polling regularly for machine check errors. Once the rate of errors decreases, the kernel re-enables CMCI.

  • The CMCI kernel code switches to polling until the report rate is low again. This 5 minute period is defined in the kernel source file below.

    arch/x86/kernel/cpu/mcheck/mce-internal.h
    
     19 #define INITIAL_CHECK_INTERVAL  5 * 60 /* 5 minutes */
    
    Here 5 x 60s = 300s = 5 minutes
    
    arch/x86/kernel/cpu/mcheck/mce.c
    ---------------------------------
    /*
     * Periodic polling timer for "silent" machine check errors.  If the
     * poller finds an MCE, poll 2x faster.  When the poller finds no more
     * errors, poll 2x slower (up to check_interval seconds).
     */
    static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
    

Diagnostic Steps

  • The CMCI messages appear in 5 minute interval:

    Dec  2 15:25:22 localhost kernel: CMCI storm subsided: switching to interrupt mode
    Dec  2 15:25:22 localhost kernel: CMCI storm detected: switching to poll mode
    [..]
    
    Dec  2 15:30:23 localhost kernel: CMCI storm subsided: switching to interrupt mode
    Dec  2 15:30:23 localhost kernel: CMCI storm detected: switching to poll mode
    [..]
    
    Dec  2 15:35:24 localhost kernel: CMCI storm subsided: switching to interrupt mode
    Dec  2 15:35:29 localhost kernel: CMCI storm detected: switching to poll mode
    
  • The below corrected error can be seen in /var/log/mcelog.

    Hardware event. This is not a software error.
    MCE 0
    CPU 47 BANK 1 TSC d1202c8ecb
    ADDR 7faac180
    TIME 1476507878 Sat Oct 15 01:04:38 2016
    MCG status:
    MCi status:
    Corrected error
    Error enabled
    MCi_ADDR register valid
    MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
    Transaction: Memory read error
    STATUS 940000000000009f MCGSTATUS 0
    MCGCAP 7000c16 APICID 3b SOCKETID 1
    CPUID Vendor Intel Family 6 Model 79
    
  • The elevated CPU usage at kernel level can be seen due to the consistently high rate of interrupts:

    12:00:02 AM     CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle
    12:10:02 AM     all     14.98      0.15     17.86      0.00      0.00      0.00      3.74      0.00      0.00     63.26
    12:10:02 AM       0     32.14      0.09     33.88      0.00      0.00      0.00      7.79      0.00      0.00     26.10
    12:10:02 AM       1      3.16      0.33      5.78      0.01      0.00      0.00      0.54      0.00      0.00     90.17
    12:10:02 AM       2     31.83      0.13     33.80      0.00      0.00      0.00      5.69      0.00      0.00     28.55
    12:10:02 AM       3      1.39      0.31      3.51      0.01      0.00      0.00      0.33      0.00      0.00     94.46
    12:10:02 AM       4     31.14      0.23     33.53      0.00      0.00      0.00      5.90      0.00      0.00     29.20
    12:10:02 AM       5      1.67      0.16      3.08      0.01      0.00      0.00      0.29      0.00      0.00     94.78
    12:10:02 AM       6     31.95      0.13     32.74      0.00      0.00      0.00      6.63      0.00      0.00     28.55
    12:10:02 AM       7      2.18      0.13      3.52      0.01      0.00      0.00      0.27      0.00      0.00     93.89
    12:10:02 AM       8     32.42      0.27     32.86      0.00      0.00      0.00      6.84      0.00      0.00     27.61
    12:10:02 AM       9      2.95      0.13      3.61      0.01      0.00      0.00      0.26      0.00      0.00     93.04
    12:10:02 AM      10     31.73      0.21     32.64      0.00      0.00      0.00      7.68      0.00      0.00     27.74
    12:10:02 AM      11      1.47      0.14      4.07      0.00      0.00      0.00      0.27      0.00      0.00     94.04
    12:10:02 AM      12     31.97      0.14     32.77      0.00      0.00      0.00      7.37      0.00      0.00     27.74
    12:10:02 AM      13      1.40      0.11      3.70      0.00      0.00      0.00      0.28      0.00      0.00     94.50
    12:10:02 AM      14     31.39      0.06     32.05      0.00      0.00      0.00      8.09      0.00      0.00     28.42
    12:10:02 AM      15      1.41      0.12      3.00      0.01      0.00      0.00      0.32      0.00      0.00     95.14
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Comments