Unexplained behaviour of RHEL systems with Skylake CPUs

Environment

Red Hat Enterprise Linux
Synergy 480 Gen10 05/22/2019 BIOS Rev: 2.10
- Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz (fam: 06, model: 55, stepping: 04)
- microcode: sig=0x50654, pf=0x80, revision=0x200005e
ThinkSystem SR950 07/03/2019 BIOS Rev: 1.53(2.34)
- Intel(R) Xeon(R) Platinum 8176M CPU @ 2.10GHz (fam: 06, model: 55, stepping: 04)
- microcode: sig=0x50654, pf=0x80, revision=0x200005e
Eslim SU7-2212 Purley 06/26/2018
- Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz, revision=0x200005e
Inspur Prod: NF5280M5 Vers: 4.1.8 Date: 05/21/2019 BIOS Rev: 5.14
- Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz (fam: 06, model: 55, stepping: 04)
- microcode: CPU0 sig=0x50654, pf=0x80, revision=0x200005e
HPE ProLiant DL580 Gen10/ProLiant DL580 Gen10, BIOS U34 03/09/2020 BIOS Rev: 2.32
- Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz (fam: 06, model: 55, stepping: 04)
- microcode: sig=0x50654, pf=0x80, revision=0x2006906
Intel Corporation S2600BPB/S2600BPB, BIOS SE5C620.86B.00.01.0013.030920180427 03/09/2018
- Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (fam: 06, model: 55, stepping: 04)
- microcode: sig=0x50654, pf=0x80, revision=0x200004d
Lenovo ThinkSystem SR630 -[7X02CTO1WW]-/-[7X02CTO1WW]-, BIOS -[IVE660P-2.72]- 09/28/2020
- Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (fam: 06, model: 55, stepping: 07)
- microcode: sig=0x50657, pf=0x80, revision=0x5002f01
Dell Inc. PowerEdge R740xd/0YNX56, BIOS 2.15.1 06/15/2022
- Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz (fam: 06, model: 55, stepping: 04)
- microcode: sig=0x50654, pf=0x80, revision=0x2006e05
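The sig= value printed by the microcode driver is the CPUID leaf 1 signature, and it encodes the family, model and stepping shown next to it. This short Python sketch applies the standard decoding rules from the Intel SDM (the extended model is folded in for family 06h) to produce the family-model-stepping form used in the Resolution note and in microcode file names. Decoded this way, every host above that reports a signature is 06-55-04, except the SR630 (sig=0x50657), which is 06-55-07.

```python
def decode_cpu_signature(sig: int) -> str:
    """Turn a CPUID leaf 1 EAX signature (the sig= value in the microcode
    log line) into a family-model-stepping string such as "06-55-04"."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # The extended family field is only added for base family 0Fh.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model field is only used for base families 06h and 0Fh.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return f"{family:02x}-{model:02x}-{stepping:02x}"


if __name__ == "__main__":
    # Signatures taken from the microcode lines in the Environment list.
    for sig in (0x50654, 0x50657):
        print(hex(sig), decode_cpu_signature(sig))
    # 0x50654 06-55-04
    # 0x50657 06-55-07
```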

Issue

The kernel crashes with the following messages:

[49082.161931] mm/memory.c:413: bad pmd ffff8a930c69b230(0000000200000000)
[49082.168625] mm/memory.c:413: bad pmd ffff8a931db33770(0000000200000000)
[49082.175326] BUG: unable to handle kernel paging request at 00000afffffd3f60
[49082.182346] IP: [<ffffffff9897ce5c>] __radix_tree_lookup+0x2c/0xf0
[49082.188572] PGD 0 
[49082.190604] Oops: 0000 [#1] SMP 
[49082.193870] Modules linked in: udp_diag unix_diag af_packet_diag netlink_diag tcp_diag inet_diag ebtable_filter ebtables devlink overlay(T) scini(POE) 8021q garp mrp bonding vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_nat_ipv6 nf_nat_ipv4 nf_nat nls_utf8 isofs nf_log_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_raw ip6_tables iptable_raw nf_log_ipv4 nf_log_common xt_LOG ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment xt_multiport xt_conntrack iptable_filter skx_edac nfit libnvdimm intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass pcspkr ses enclosure sg joydev mei_me mei lpc_ich hpilo hpwdt ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter nf_conntrack br_netfilter bridge stp llc ip_tables xfs dm_service_time sd_mod crc_t10dif crct10dif_generic
[49082.304541] CPU: 61 PID: 338389 Comm: crond Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-957.21.3.el7.x86_64 #1
[49082.316079] Hardware name: HPE Synergy 480 Gen10/Synergy 480 Gen10 Compute Module, BIOS I42 05/22/2019
[49082.325434] task: ffff8a93a3772080 ti: ffff8a939d41c000 task.ti: ffff8a939d41c000
[49082.332957] RIP: 0010:[<ffffffff9897ce5c>]  [<ffffffff9897ce5c>] __radix_tree_lookup+0x2c/0xf0
[49082.341627] RSP: 0000:ffff8a939d41fcd0  EFLAGS: 00010246
[49082.346964] RAX: 0000000000000000 RBX: 00000afffffd3f58 RCX: ffff8a939d41fd18
[49082.354136] RDX: 0000000000000000 RSI: 0003fffffeffffff RDI: 00000afffffd3f58
[49082.361307] RBP: ffff8a939d41fd08 R08: ffff8a930bf35e70 R09: 0000000000000000
[49082.368480] R10: 00007fabc99c4b10 R11: 00003ffffffff000 R12: 0003fffffeffffff
[49082.375653] R13: 00000afffffd3f58 R14: ffff8a939d41fd18 R15: 0000000000000000
[49082.382826] FS:  00007fabc99c4840(0000) GS:ffff8ab7afb40000(0000) knlGS:0000000000000000
[49082.390959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[49082.396734] CR2: 00000afffffd3f60 CR3: 000000231fa90000 CR4: 00000000007607e0
[49082.403915] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[49082.411086] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[49082.418258] PKRU: 55555554
[49082.420977] Call Trace:
[49082.423436]  [<ffffffff9897cf42>] radix_tree_lookup_slot+0x22/0x50
[49082.429650]  [<ffffffff987b600e>] __find_get_page+0x1e/0xa0
[49082.435251]  [<ffffffff987b609e>] find_get_page+0xe/0x20
[49082.440592]  [<ffffffff987ff8b7>] lookup_swap_cache+0x37/0x50
[49082.446368]  [<ffffffff987e969d>] handle_pte_fault+0x24d/0xd10
[49082.452233]  [<ffffffff9889e5aa>] ? __posix_lock_file+0x21a/0x550
[49082.458358]  [<ffffffff987ec27d>] handle_mm_fault+0x39d/0x9b0
[49082.464134]  [<ffffffff98d70603>] __do_page_fault+0x203/0x4f0
[49082.469910]  [<ffffffff98d70925>] do_page_fault+0x35/0x90
[49082.475337]  [<ffffffff98d6c768>] page_fault+0x28/0x30
[49082.481174] Code: 48 89 e5 41 57 49 89 d7 41 56 49 89 ce 41 55 49 89 fd 41 54 49 89 f4 53 48 83 ec 10 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 <49> 8b 55 08 48 89 d6 48 89 55 c8 83 e6 03 48 83 fe 01 0f 85 a4 
[49082.502134] RIP  [<ffffffff9897ce5c>] __radix_tree_lookup+0x2c/0xf0
[49082.509106]  RSP <ffff8a939d41fcd0>
[49082.513266] CR2: 00000afffffd3f60

Another kernel crash pattern:

[138468.326624] BUG: unable to handle kernel paging request at 0000000200000018
[138468.334526] IP: [<ffffffff8f96b74c>] _raw_spin_lock+0xc/0x30
[138468.341092] PGD 0 
[138468.343960] Oops: 0002 [#1] SMP 
[138468.348049] Modules linked in: vhost_net vhost macvtap macvlan tun xt_CT xt_mac xt_sctp xt_physdev veth udp_diag unix_diag af_packet_diag netlink_diag tcp_diag inet_diag ebtable_filter ebtables devlink overlay(T) scini(POE) 8021q garp mrp bonding vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_nat_ipv6 nf_nat_ipv4 nf_nat nls_utf8 isofs nf_log_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_raw ip6_tables iptable_raw nf_log_ipv4 nf_log_common xt_LOG ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment xt_multiport xt_conntrack iptable_filter skx_edac nfit libnvdimm intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass pcspkr ses enclosure sg joydev mei_me lpc_ich mei hpilo hpwdt ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter nf_conntrack br_netfilter
[138468.473671] CPU: 82 PID: 0 Comm: swapper/82 Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-957.21.3.el7.x86_64 #1
[138468.486873] Hardware name: HPE Synergy 480 Gen10/Synergy 480 Gen10 Compute Module, BIOS I42 05/22/2019
[138468.497110] task: ffff8db2036a30c0 ti: ffff8db2036b0000 task.ti: ffff8db2036b0000
[138468.505517] RIP: 0010:[<ffffffff8f96b74c>]  [<ffffffff8f96b74c>] _raw_spin_lock+0xc/0x30
[138468.514552] RSP: 0018:ffff8dd52fd83960  EFLAGS: 00010246
[138468.520780] RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
[138468.528841] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000200000018
[138468.536896] RBP: ffff8dd52fd839a8 R08: ffff8db121ed5a00 R09: ffff8d85a1b95050
[138468.544949] R10: ffff8dd52fd83898 R11: 000000000000e513 R12: 000000000000010c
[138468.553003] R13: 0000000000000052 R14: ffff8d85a1b94ec0 R15: 0000000200000018
[138468.561055] FS:  0000000000000000(0000) GS:ffff8dd52fd80000(0000) knlGS:0000000000000000
[138468.570074] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[138468.576733] CR2: 0000000200000018 CR3: 0000001456610000 CR4: 00000000007627e0
[138468.584775] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[138468.592808] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[138468.600830] PKRU: 00000000
[138468.604422] Call Trace:
[138468.607709]  <IRQ> 
[138468.609737]  [<ffffffffc05bdc27>] ? ovs_flow_stats_update+0x57/0x160 [openvswitch]
[138468.619078]  [<ffffffffc05c4e04>] ? ovs_flow_tbl_lookup_stats+0x84/0xc0 [openvswitch]
[138468.627759]  [<ffffffffc05bc7cf>] ovs_dp_process_packet+0x6f/0x120 [openvswitch]
[138468.635999]  [<ffffffffc05c7334>] ? ovs_ct_update_key+0xc4/0x150 [openvswitch]
[138468.644053]  [<ffffffffc05c6193>] ovs_vport_receive+0x73/0xd0 [openvswitch]
[138468.651864]  [<ffffffffc0519e1b>] ? apic_timer_fn+0x1b/0x70 [kvm]
[138468.658771]  [<ffffffffc05c6c3e>] netdev_frame_hook+0xde/0x180 [openvswitch]
[138468.666629]  [<ffffffff8f83a04a>] __netif_receive_skb_core+0x1fa/0xa10
[138468.673956]  [<ffffffff8f8b7eb3>] ? udp4_gro_receive+0x1c3/0x2a0
[138468.680753]  [<ffffffff8f419a00>] ? kmalloc_order_trace+0x10/0xa0
[138468.687628]  [<ffffffff8f83a878>] __netif_receive_skb+0x18/0x60
[138468.694317]  [<ffffffff8f83a900>] netif_receive_skb_internal+0x40/0xc0
[138468.701608]  [<ffffffff8f83b588>] napi_gro_receive+0xd8/0x100
[138468.708124]  [<ffffffffc13095e0>] bnx2x_rx_int+0xa70/0x1950 [bnx2x]
[138468.715145]  [<ffffffff8f2db978>] ? __enqueue_entity+0x78/0x80
[138468.721722]  [<ffffffff8f83854d>] ? __dev_kfree_skb_any+0x3d/0x50
[138468.728572]  [<ffffffffc1308898>] ? bnx2x_free_tx_pkt+0x218/0x300 [bnx2x]
[138468.736119]  [<ffffffffc130c40d>] bnx2x_poll+0x1dd/0x260 [bnx2x]
[138468.742863]  [<ffffffff8f83af1f>] net_rx_action+0x26f/0x390
[138468.749159]  [<ffffffff8f2a1075>] __do_softirq+0xf5/0x280
[138468.755263]  [<ffffffff8f97932c>] call_softirq+0x1c/0x30
[138468.761262]  [<ffffffff8f22e675>] do_softirq+0x65/0xa0
[138468.767072]  [<ffffffff8f2a13f5>] irq_exit+0x105/0x110
[138468.772864]  [<ffffffff8f97a606>] do_IRQ+0x56/0xf0
[138468.778293]  [<ffffffff8f96c362>] common_interrupt+0x162/0x162
[138468.784763]  <EOI> 
[138468.786788]  [<ffffffff8f7aef07>] ? cpuidle_enter_state+0x57/0xd0
[138468.794172]  [<ffffffff8f7af05e>] cpuidle_idle_call+0xde/0x230
[138468.800596]  [<ffffffff8f2366de>] arch_cpu_idle+0xe/0xc0
[138468.806482]  [<ffffffff8f2fc6da>] cpu_startup_entry+0x14a/0x1e0
[138468.812971]  [<ffffffff8f258047>] start_secondary+0x1f7/0x270
[138468.819272]  [<ffffffff8f2000d5>] start_cpu+0x5/0x14
[138468.824777] Code: 5d c3 0f 1f 44 00 00 85 d2 74 e4 0f 1f 40 00 eb ed 66 0f 1f 44 00 00 b8 01 00 00 00 5d c3 90 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 20 1b ff ff 5d 
[138468.845377] RIP  [<ffffffff8f96b74c>] _raw_spin_lock+0xc/0x30
[138468.851699]  RSP <ffff8dd52fd83960>
[138468.855741] CR2: 0000000200000018
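In both Oops reports the byte in angle brackets on the Code: line is the first byte of the faulting instruction, so the fault address can be checked by hand. This Python sketch decodes only the two encodings that appear here (optional LOCK and REX prefixes, opcode 8B for MOV and 0F B1 for CMPXCHG, ModRM without SIB or disp32, per the Intel SDM opcode tables); it is not a general x86 decoder. It recovers mov 0x8(%r13),%rdx for the __radix_tree_lookup crash, matching the dis output in the vmcore analysis, and lock cmpxchg %edx,(%rdi) for _raw_spin_lock. In both cases the base register plus the displacement equals CR2.

```python
# 64-bit register names in ModRM/REX encoding order.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_mem_operand(code: bytes):
    """Return (mnemonic, base_register, displacement) for the memory operand.

    Handles only MOV r64, r/m64 (8B /r) and CMPXCHG r/m32, r32 (0F B1 /r),
    with an optional LOCK (F0) and REX (40-4F) prefix, ModRM mod 00 or 01,
    and no SIB byte.
    """
    i = 0
    prefix = ""
    if code[i] == 0xF0:                 # LOCK prefix
        prefix = "lock "
        i += 1
    rex_b = 0
    if 0x40 <= code[i] <= 0x4F:         # REX prefix; REX.B extends ModRM.rm
        rex_b = code[i] & 0x1
        i += 1
    if code[i] == 0x8B:
        mnemonic = "mov"
        i += 1
    elif code[i:i + 2] == b"\x0f\xb1":
        mnemonic = "cmpxchg"
        i += 2
    else:
        raise ValueError("opcode not handled by this sketch")
    modrm = code[i]
    i += 1
    mod, rm = modrm >> 6, modrm & 0x7
    if mod > 1 or rm == 0x4 or (mod == 0 and rm == 0x5):
        raise ValueError("register, SIB, RIP-relative or disp32 form not handled")
    disp = 0
    if mod == 1:                        # signed 8-bit displacement follows ModRM
        disp = int.from_bytes(code[i:i + 1], "little", signed=True)
    return prefix + mnemonic, REG64[(rex_b << 3) | rm], disp


def fault_address(code: bytes, regs: dict) -> int:
    """Effective address the instruction dereferences, given register values."""
    _, base, disp = decode_mem_operand(code)
    return regs[base] + disp


if __name__ == "__main__":
    # Bytes from "<49> 8b 55 08" and "<f0> 0f b1 17"; registers from the dumps.
    first = fault_address(bytes.fromhex("498b5508"), {"r13": 0x00000AFFFFFD3F58})
    second = fault_address(bytes.fromhex("f00fb117"), {"rdi": 0x0000000200000018})
    print(hex(first), hex(second))
    # 0xafffffd3f60 0x200000018
```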

Another pattern:

[2593265.569184] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 13.125 msecs
[2593265.569817] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 23.748 msecs
[2593265.572319] INFO: NMI handler (ghes_notify_nmi) took too long to run: 1781.055 msecs
[2593265.572333] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1804.183 msecs
[2593265.572940] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1804.185 msecs
[2593265.576056] sched: RT throttling activated
[2593265.577932] hrtimer: interrupt took 1794008697 ns
[2593265.582328] perf: interrupt took too long (14058867 > 180000), lowering kernel.perf_event_max_sample_rate to 1000
[2593283.992584] NMI watchdog: Watchdog detected hard LOCKUP on cpu 51
[2593283.992589] NMI watchdog: Watchdog detected hard LOCKUP on cpu 41
[2593284.002545] INFO: NMI handler (ghes_notify_nmi) took too long to run: 2121.158 msecs
[2593285.027143] Hardware name: Lenovo ThinkSystem SR950 -[7X13CTO1WW]-/-[7X13CTO1WW]-, BIOS -[PSE122R-1.53]- 07/03/2019

Another pattern:

[618088.801338] BUG: unable to handle kernel paging request at 000000000009a000
[618088.820849] IP: [<          (null)>]           (null)
[618088.840388] PGD 0 
[618088.859907] Oops: 0010 [#1] SMP 
[618089.155718] CPU: 70 PID: 0 Comm: swapper/70 Kdump: loaded Tainted: G               ------------ T 3.10.0-1127.18.2.el7.x86_64 #1
[618089.195815] Hardware name: HPE ProLiant DL580 Gen10/ProLiant DL580 Gen10, BIOS U34 03/09/2020
[618089.216874] task: ffff8dfe2dfa1070 ti: ffff8dfe2dfac000 task.ti: ffff8dfe2dfac000
[618089.236460] RIP: 9a00:[<0000000000000000>]  [<          (null)>]           (null)
[618089.256158] RSP: 0018:ffff8dfe2dfafe20  EFLAGS: 00010046
[618089.275241] RAX: 0000000000000020 RBX: 0000000000000008 RCX: 0000000000000001
[618089.294351] RDX: 0000000000000000 RSI: ffff8dfe2dfaffd8 RDI: 000000225c610000
[618089.313103] RBP: ffff8dfe2dfafe50 R08: 000000000029d92e R09: 0000000000000018
[618089.332016] R10: 00000000000bbc17 R11: 7fffffffffffffff R12: 0000000000000003
[618089.350261] R13: 0000000000000020 R14: ffff8dfe2dfaffd8 R15: ffffffffaf2de7a0
[618089.368246] FS:  0000000000000000(0000) GS:ffff8e098fd80000(0000) knlGS:0000000000000000
[618089.388737] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[618089.406451] CR2: 000000000009a000 CR3: 000000225c610000 CR4: 00000000007607e0
[618089.424016] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[618089.441143] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[618089.457922] PKRU: 00000000
[618089.474314] Call Trace:
[618089.490603]  [<ffffffffaebc61d5>] cpuidle_enter_state+0x45/0xd0
[618089.506844]  [<ffffffffaebc633e>] cpuidle_idle_call+0xde/0x230
[618089.522743]  [<ffffffffae637c8e>] arch_cpu_idle+0xe/0xc0
[618089.538260]  [<ffffffffae701c7a>] cpu_startup_entry+0x14a/0x1e0
[618089.553428]  [<ffffffffae65a727>] start_secondary+0x1f7/0x270
[618089.568330]  [<ffffffffae6000d5>] start_cpu+0x5/0x14
[618089.585610] Code:  Bad RIP value.
[618089.600157] RIP  [<          (null)>]           (null)
[618089.614387]  RSP <ffff8dfe2dfafe20>
[618089.628178] CR2: 000000000009a000
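The value after Oops: is the hardware page-fault error code, printed in hexadecimal. This small Python sketch of its bit layout (taken from the Intel SDM description of the page-fault exception) reads the three codes in this article as follows: 0000 is a kernel-mode read of a non-present page (__radix_tree_lookup); 0002 is a kernel-mode write to a non-present page (_raw_spin_lock, consistent with a lock acquisition); 0010 is an instruction fetch from a non-present page. That last code suggests the CPU branched to 0x9a000 (the CR2 value) rather than dereferencing a bad data pointer.

```python
# x86 page-fault error code bits (Intel SDM Vol. 3A, Interrupt 14).
# Each entry: (mask, meaning when set, meaning when clear or None).
PF_ERROR_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write", "read"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
    (0x20, "protection key", None),
]


def decode_oops_code(code: int) -> list:
    """Describe the error code printed in hex after 'Oops:'."""
    meanings = []
    for mask, when_set, when_clear in PF_ERROR_BITS:
        if code & mask:
            meanings.append(when_set)
        elif when_clear is not None:
            meanings.append(when_clear)
    return meanings


if __name__ == "__main__":
    for text in ("0000", "0002", "0010"):   # the three Oops codes in this article
        print(text, decode_oops_code(int(text, 16)))
    # 0000 ['page not present', 'read', 'kernel mode']
    # 0002 ['page not present', 'write', 'kernel mode']
    # 0010 ['page not present', 'read', 'kernel mode', 'instruction fetch']
```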

Resolution

Contact the hardware vendor for assistance with a BIOS update.
One customer has confirmed that the issue was resolved after the CPU was replaced and the microcode was updated to revision 0x2000065.

Before CPU replacement:
 DMI: HPE Synergy 480 Gen10/Synergy 480 Gen10 Compute Module, BIOS I42 05/22/2019
 smpboot: CPU0: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz (fam: 06, model: 55, stepping: 04)
 microcode: sig=0x50654, pf=0x80, revision=0x200005e   96 logical processors (48 CPU cores)

After CPU replacement:
 DMI: HPE Synergy 480 Gen10/Synergy 480 Gen10 Compute Module, BIOS I42 11/13/2019
 smpboot: CPU0: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz (fam: 06, model: 55, stepping: 04)
 microcode: sig=0x50654, pf=0x80, revision=0x2000065   96 logical processors (48 CPU cores)
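To compare other hosts against the before and after revisions, the microcode fields can be extracted from boot-log lines like those above. This is a minimal Python sketch that assumes the line format quoted in this article (including the optional CPU0 prefix seen on the Inspur host) and assumes that revisions for one signature can be compared numerically.

```python
import re

# Format of the microcode driver lines quoted in this article, e.g.
# "microcode: sig=0x50654, pf=0x80, revision=0x200005e" or
# "microcode: CPU0 sig=0x50654, pf=0x80, revision=0x200005e".
MICROCODE_RE = re.compile(
    r"microcode: (?:CPU\d+ )?sig=(0x[0-9a-f]+), pf=(0x[0-9a-f]+), revision=(0x[0-9a-f]+)"
)


def parse_microcode(line: str):
    """Return (sig, pf, revision) as integers, or None if the line does not match."""
    match = MICROCODE_RE.search(line)
    if match is None:
        return None
    return tuple(int(field, 16) for field in match.groups())


if __name__ == "__main__":
    before = parse_microcode("microcode: sig=0x50654, pf=0x80, revision=0x200005e")
    after = parse_microcode("microcode: sig=0x50654, pf=0x80, revision=0x2000065")
    print(f"revision {before[2]:#x} -> {after[2]:#x}, newer: {after[2] > before[2]}")
    # revision 0x200005e -> 0x2000065, newer: True
```

A newer revision on its own does not show that a host is unaffected: the Environment list includes hosts already running 0x2006906 and 0x2006e05, and in the confirmed case the CPU was also replaced.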

Note: If you update the microcode through a newer microcode_ctl package from RHEL, be aware of the known issues with a later microcode version for CPU 06-55-04.

Root Cause

Intel® Xeon® Processor Scalable Family Specification Update, September 2022
- <https://cdrdv2-public.intel.com/613537/613537_Intel%C2%AE%20Xeon%C2%AE%20Processor%20Scalable%20Family%20Specification%20Update_Rev032US.pdf>

SKX8.          Intel® CAT/CDP Might Not Restrict Cacheline Allocation Under Certain Conditions 
               (Intel® Xeon® Processor Scalable Family)
  Problem:     Under certain microarchitectural conditions involving heavy memory traffic, cache lines
               might fill outside the allocated L3 capacity bitmask (CBM) associated with the current
               Class of Service (CLOS).
  Implication: Cache Allocation Technology/Code and Data Prioritization (CAT/CDP) might see performance
               side effects and a reduction in the effectiveness of the CAT feature for certain classes 
               of applications, including cache-sensitive workloads than seen on previous platforms.
  Workaround:  None identified. Contact your Intel representative for details of possible mitigations. 
  Status:      No Fix.

SKX18.         Intel® CAT Might Not Restrict Cacheline Allocation Under Certain Conditions
               (Intel® Xeon® Processor Scalable Family)
  Problem:     Under certain micro-architectural conditions involving heavy memory traffic, cachelines
               might fill outside the allocated L3 capacity bit-mask (CBM) associated with the current 
               Class of Service (CLOS).
  Implication: CAT might appear less effective at protecting certain classes of applications, including
               cache-sensitive workloads than on previous platforms.
  Workaround:  None identified. Contact your Intel representative for details of possible mitigations.
  Status:      No Fix.

SKX40.         Masked Bytes in a Vector Masked Store Instructions May Cause Write Back of a Cache Line
  Problem:     Vector masked store instructions to WB (write-back) memory-type that cross cache lines
               may lead to CPU writing back cached data even for cache lines where all of the bytes are
               masked.
  Implication: The processor may generate writes of un-modified data. This can affect Memory Mapped I/O
               (MMIO) or non-coherent agents in the following ways:
                1. For MMIO range that is mapped as WB memory type, this erratum may lead to Machine
                   Check Exception (MCE) due to writing back data into the MMIO space. This applies
                   only to cross page vector masked stores where one of the pages is in MMIO range.
                2. If the CPU cached data is stale, for example in the case of memory written directly
                   by a non-coherent agent (agent that uses non-coherent writes), this erratum may lead
                   to writing back stale cached data even if these bytes are masked.
  Workaround:  Platforms should not map MMIO memory space or non-coherent device memory space as WB
               memory. If WB is used for MMIO range, software or VMM should not map such MMIO page
               adjacent to a regular WB page (adjacent on the linear address space, before or after the
               I/O page). Memory that may be written by non-coherent agents should be separated by at
               least 64 bytes from regular memory used for other purposes (on the linear address space).
  Status:      No Fix.

SKX59.         Vector Masked Store Instructions May Cause Write Back of Cache Line Where Bytes Are
               Masked
  Problem:     Vector masked store instructions to write-back (WB) memory-type that cross cache lines
               may lead to CPU writing back cached data even for cache lines where all of the bytes are
               masked. This can affect Memory Mapped I/O (MMIO) or non-coherent agents in the following
               ways:
                1. For MMIO range that is mapped as WB memory type, this erratum may lead to Machine
                   Check Exception (MCE) due to writing back data into the MMIO space. This applies only
                   to cross page vector masked stores where one of the pages is in MMIO range.
                2. If the CPU cached data is stale, for example in the case of memory written directly
                   by a non-coherent agent (agent that uses non-coherent writes), this erratum may lead
                   to writing back stale cached data even if these bytes are masked.
  Implication: CPU may generate writes into MMIO space which lead to MCE, or may write stale data into
               memory also written by non-coherent agents.
  Workaround:  It is recommended not to map MMIO range as WB. If WB is used for MMIO range, OS or VMM
               should not map such MMIO page adjacent to a regular WB page (adjacent on the linear
               address space, before or after the I/O page). Memory that may be written by non-coherent
               agents should be separated by at least 64 bytes from regular memory used for other 
               purposes (on the linear address space).
  Status:      No Fix.

SKX61.         MOVNTDQA From WC Memory May Pass Earlier Locked Instructions
  Problem:     An execution of (V)MOVNTDQA (streaming load instruction) that loads from Write Combining
               (WC) memory may appear to pass an earlier locked instruction to a different cache line.
  Implication: Software that expects a lock to fence subsequent (V)MOVNTDQA instructions may not operate
               properly.
  Workaround:  Software should not rely on a locked instruction to fence subsequent executions of
               MOVNTDQA. Software should insert an MFENCE instruction if it needs to preserve order
               between streaming loads and other memory operations.
  Status:      No Fix.

SKX68.         A Spurious APIC Timer Interrupt May Occur After Timed MWAIT
  Problem:     Due to this erratum, a Timed MWAIT that completes for a reason other than the Timestamp
               Counter reaching the target value may be followed by a spurious APIC timer interrupt.
               This erratum can occur only if the APIC timer is in TSC-deadline mode and only if the
               mask bit is clear in the LVT Timer Register.
  Implication: Spurious APIC timer interrupts may occur when the APIC timer is in TSC-deadline mode.
  Workaround:  TSC-deadline timer interrupt service routines should detect and deal with spurious
               interrupts.
  Status:      No Fix.

SKX72.         Processor May Hang When Executing Code In an HLE Transaction Region
  Problem:     Under certain conditions, if the processor acquires an HLE (Hardware Lock Elision) lock
               via the XACQUIRE instruction in the Host Physical Address range between 40000000H and
               403FFFFFH, it may hang with an internal timeout error (MCACOD 0400H) logged into
               IA32_MCi_STATUS.
  Implication: Due to this erratum, the processor may hang after acquiring a lock via XACQUIRE.
  Workaround:  BIOS can reserve the host physical address range 40000000H-403FFFFFH (e.g. map
               it as UC/MMIO). Alternatively, the VMM (Virtual Machine Monitor) can reserve that
               address range so no guest can use it. In non-virtualized systems, the OS can reserve
               that memory space.
  Status:      No fix.
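The SKX72 workaround depends on keeping 40000000H-403FFFFFH out of normal use. On a bare-metal Linux host, one hedged way to check what the firmware and kernel did is to look for System RAM entries in /proc/iomem that overlap that range. The Python sketch below parses that file's "start-end : name" lines; the sample input is hypothetical, and the check only reflects the host's view, not what a VMM exposes to guests.

```python
import re

# Host physical address range named in SKX72 (inclusive).
HLE_START, HLE_END = 0x40000000, 0x403FFFFF

IOMEM_LINE = re.compile(r"\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.+)$")


def ram_overlapping_hle_range(iomem_text: str) -> list:
    """Return (start, end) for every 'System RAM' entry in /proc/iomem text
    that overlaps 40000000H-403FFFFFH. An empty list means the kernel does
    not treat any part of that range as ordinary RAM."""
    hits = []
    for line in iomem_text.splitlines():
        match = IOMEM_LINE.match(line)
        if match is None or match.group(3).strip() != "System RAM":
            continue
        start, end = int(match.group(1), 16), int(match.group(2), 16)
        if start <= HLE_END and end >= HLE_START:   # iomem ranges are inclusive
            hits.append((start, end))
    return hits


if __name__ == "__main__":
    # Hypothetical excerpt in /proc/iomem format.
    SAMPLE_IOMEM = (
        "00100000-3fffffff : System RAM\n"
        "40000000-403fffff : Reserved\n"
        "40400000-5fffffff : System RAM\n"
    )
    print(ram_overlapping_hle_range(SAMPLE_IOMEM) or "no System RAM in SKX72 range")
```

On a real host, pass the contents of /proc/iomem read as root. Unprivileged readers see all-zero addresses, which this function would wrongly report as no overlap.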

SKX73.         IDI_MISC Performance Monitoring Events May be Inaccurate
  Problem:     The IDI_MISC.WB_UPGRADE and IDI_MISC.WB_DOWNGRADE performance monitoring events (Event
               FEH; UMask 02H and 04H) counts cache lines evicted from the L2 cache. Due to this 
               erratum, the per logical processor count may be incorrect when both logical processors 
               on the same physical core are active. The aggregate count of both logical processors is
               not affected by this erratum.
  Implication: IDI_MISC performance monitoring events may be inaccurate.
  Workaround:  None identified.
  Status:      No fix.

SKX85.         Intel® PT ToPA Tables Read From Non-Cacheable Memory During an Intel® TSX Transaction May
               Lead to Processor Hang
  Problem:     If an Intel® PT (Processor Trace) ToPA (Table of Physical Addresses) table is placed in
               UC (Uncacheable) or USWC (Uncacheable Speculative Write Combining) memory, and a ToPA
               output region is filled during an Intel® TSX (Transactional Synchronization Extensions) transaction,
               the resulting ToPA table read may cause a processor hang.
  Implication: Placing Intel® PT ToPA tables in non-cacheable memory when Intel® TSX is in use may lead
               to a processor hang.
  Workaround:  None identified. Intel® PT ToPA tables should be located in WB memory if Intel® TSX is in
               use.
  Status:      No fix.

SKX88.         Using Intel® TSX Instructions May Lead to Unpredictable System Behavior
  Problem:     Under complex microarchitectural conditions, software using Intel® Transactional
               Synchronization Extensions (Intel® TSX) may result in unpredictable system behavior.
               Intel has only seen this under synthetic testing conditions. Intel is not aware of any
               commercially available software exhibiting this behavior.
  Implication: Due to this erratum, unpredictable system behavior may occur.
  Workaround:  It is possible for BIOS to contain a workaround for this erratum.
  Status:      No fix.

SKX91.         Performance in an 8sg System May Be Lower Than Expected
  Problem:     In 8sg (8-socket glueless) systems, certain workloads may generate a significant
               stream of accesses to remote nodes, leading to unexpected congestion in the processor's
               snoop responses.
  Implication: Due to this erratum, 8sg system performance may be lower than expected.
  Workaround:  A BIOS code change has been identified and may be implemented as a workaround for this
               erratum.
  Status:      No fix.

SKX92.         Memory May Continue to Throttle after MEMHOT# De-assertion
  Problem:     When MEMHOT# is asserted by an external agent, the CPU may continue to throttle memory
               after MEMHOT# de-assertion.
  Implication: When this erratum occurs, memory throttling occurs even after de-assertion of MEMHOT#.
  Workaround:  It is possible for the BIOS to contain a workaround for this erratum.
  Status:      No fix.

SKX93.         Unexpected Uncorrected Machine Check Errors May Be Reported
  Problem:     In rare micro-architectural conditions, the processor may report unexpected machine
               check errors. When this erratum occurs, IA32_MC0_STATUS (MSR 401H) will have the valid
               bit set (bit 63), the uncorrected error bit set (bit 61), a model specific error code
               of 03H (bits [31:16]) and an MCA error code of 05H (bits [15:0]).
  Implication: Due to this erratum, software may observe unexpected machine check exceptions.
  Workaround:  It is possible for the BIOS to contain a workaround for this erratum.
  Status:      No fix.
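SKX93 is defined by specific IA32_MC0_STATUS fields, so a logged bank 0 status value can be checked mechanically. This minimal Python sketch uses only the bit positions given in the erratum text and ignores all other status bits. The status value would come from a bank 0 machine-check record (for example as logged by mcelog or rasdaemon); the value in the example is constructed, not taken from these systems.

```python
def matches_skx93(mc0_status: int) -> bool:
    """Check an IA32_MC0_STATUS (MSR 401H) value against the SKX93 signature."""
    valid = (mc0_status >> 63) & 1                 # VAL, bit 63
    uncorrected = (mc0_status >> 61) & 1           # UC, bit 61
    model_specific = (mc0_status >> 16) & 0xFFFF   # model-specific code, bits 31:16
    mca_error_code = mc0_status & 0xFFFF           # MCA error code, bits 15:0
    return bool(valid and uncorrected
                and model_specific == 0x03 and mca_error_code == 0x05)


if __name__ == "__main__":
    # Constructed example: VAL | UC | model-specific 03H | MCA code 05H.
    example = (1 << 63) | (1 << 61) | (0x03 << 16) | 0x05
    print(hex(example), matches_skx93(example))
    # 0xa000000000030005 True
```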

SKX100.        A Pending Fixed Interrupt May Be Dispatched Before an Interrupt of The Same Priority
               Completes
  Problem:     Resuming from C6 Sleep-State, with Fixed Interrupts of the same priority queued (in
               the corresponding bits of the IRR and ISR APIC registers), the processor may dispatch
               the second interrupt (from the IRR bit) before the first interrupt has completed and
               written to the EOI register, causing the first interrupt to never complete.
  Implication: Due to this erratum, Software may behave unexpectedly when an earlier call to an
               Interrupt Handler routine is overridden with another call (to the same Interrupt
               Handler) instead of completing its execution.
  Workaround:  None identified.
  Status:      No fix.

SKX102.        Processor May Behave Unpredictably on Complex Sequence of Conditions Which Involve
               Branches That Cross 64 Byte Boundaries
  Problem:     Under complex micro-architectural conditions involving branch instructions bytes that
               span multiple 64 byte boundaries (cross cache line), unpredictable system behavior may
               occur.
  Implication: When this erratum occurs, the system may behave unpredictably.
  Workaround:  It is possible for BIOS to contain a workaround for this erratum.
  Status:      No fix.

SKX103.        Executing Some Instructions May Cause Unpredictable Behavior
  Problem:     Under complex micro-architectural conditions, executing an X87, AVX, or integer divide
               instruction may result in unpredictable system behavior.
  Implication: When this erratum occurs, the system may behave unpredictably. Intel has not observed
               this erratum with any commercially available software.
  Workaround:  It is possible for the BIOS to contain a workaround for this erratum.
  Status:      No fix.

SKX114.        A Fixed Interrupt May Be Lost When a Core Exits C6
  Problem:     Under complex micro-architectural conditions, when performance throttling happens during
               a core C6 exit, a fixed interrupt may be lost.
  Implication: Due to this erratum, a fixed interrupt may be lost when internal throttling happens
               during a core C6 exit.  Intel has only observed this erratum in synthetic test
               conditions.
  Workaround:  None identified.
  Status:      No fix.

Diagnostic Steps

There are three crash patterns here:

 - RIP: _raw_spin_lock
 - RIP: __radix_tree_lookup
 - Hard Lockup

[13441.913186] perf: interrupt took too long (2508 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[25076.171245] perf: interrupt took too long (3143 > 3135), lowering kernel.perf_event_max_sample_rate to 63000
[46270.175313] perf: interrupt took too long (3931 > 3928), lowering kernel.perf_event_max_sample_rate to 50000
[138468.326624] BUG: unable to handle kernel paging request at 0000000200000018
[138468.334526] IP: [<ffffffff8f96b74c>] _raw_spin_lock+0xc/0x30

[33261.892297] BUG: unable to handle kernel paging request at 0000000200000018
[33261.900058] IP: [<ffffffffbeb6b74c>] _raw_spin_lock+0xc/0x30

[49082.161931] mm/memory.c:413: bad pmd ffff8a930c69b230(0000000200000000)
[49082.168625] mm/memory.c:413: bad pmd ffff8a931db33770(0000000200000000)
[49082.175326] BUG: unable to handle kernel paging request at 00000afffffd3f60
[49082.182346] IP: [<ffffffff9897ce5c>] __radix_tree_lookup+0x2c/0xf0
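Both bad pmd entries hold 0000000200000000, and the same value appears as RBX in the first _raw_spin_lock register dump in the Issue section, where RDI (and therefore CR2) is RBX + 0x18. This Python sketch uses the x86-64 page-table flag values and a simplified form of the kernel's pmd_bad() test (the real macro masks with the architecture's full PTE flag mask, not just the low 12 bits). It shows that the value is a single set bit (bit 33) with none of the flags a page-table entry needs. The second value in the example is a hypothetical well-formed entry for comparison.

```python
PAGE_FLAGS_MASK = 0xFFF       # low 12 bits of a page-table entry hold the flags
PAGE_USER = 0x004
# _KERNPG_TABLE on x86: PRESENT | RW | ACCESSED | DIRTY
KERNPG_TABLE = 0x001 | 0x002 | 0x020 | 0x040


def describe_pmd(value: int) -> dict:
    """Summarize a raw pmd value: its flag bits, which bits are set, and
    whether a simplified pmd_bad() check would reject it."""
    flags = value & PAGE_FLAGS_MASK
    return {
        "flags": flags,
        "set_bits": [bit for bit in range(64) if (value >> bit) & 1],
        # Simplified pmd_bad(): ignoring USER, the flags must equal _KERNPG_TABLE.
        "bad": (flags & ~PAGE_USER) != KERNPG_TABLE,
    }


if __name__ == "__main__":
    print(describe_pmd(0x0000000200000000))          # value from the bad pmd messages
    # {'flags': 0, 'set_bits': [33], 'bad': True}
    print(describe_pmd(0x0000000123456067)["bad"])   # hypothetical well-formed entry
    # False
```

The recurrence of one specific value in unrelated places (a page table and the lock address used by openvswitch) is worth checking for when comparing further vmcores from these hosts.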

[  598.259136] perf: interrupt took too long (2526 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[  605.546725] INFO: NMI handler (ghes_notify_nmi) took too long to run: 41.849 msecs
[  606.163857] INFO: NMI handler (ghes_notify_nmi) took too long to run: 59.590 msecs
[  606.163884] perf: interrupt took too long (3446 > 3157), lowering kernel.perf_event_max_sample_rate to 58000
[  619.945543] perf: interrupt took too long (4420 > 4307), lowering kernel.perf_event_max_sample_rate to 45000
[  621.379484] perf: interrupt took too long (182586 > 5525), lowering kernel.perf_event_max_sample_rate to 1000
[  633.744911] sched: RT throttling activated
[  637.023061] perf: interrupt took too long (354826 > 228232), lowering kernel.perf_event_max_sample_rate to 1000
[  647.439715] usb 1-1.1: USB disconnect, device number 3
[  669.391826] NMI watchdog: Watchdog detected hard LOCKUP on cpu 28
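Each perf: interrupt took too long line reports the running average duration of the perf NMI handler (first number) against the currently allowed duration (second number). This Python sketch mirrors the arithmetic of perf_sample_event_took() in the upstream kernel's kernel/events/core.c, assuming HZ=1000 and the default kernel.perf_cpu_time_max_percent of 25. Under those assumptions it reproduces every rate in the excerpts above, and each computed allowed time matches the threshold printed on the following line (for example, 182586 > 5525 yields 228232, which then appears in 354826 > 228232).

```python
HZ = 1000                    # assumption: CONFIG_HZ=1000 (RHEL 7 x86_64)
CPU_TIME_MAX_PERCENT = 25    # assumption: default kernel.perf_cpu_time_max_percent
TICK_NSEC = 1_000_000_000 // HZ


def lowered_sample_rate(avg_len_ns: int) -> tuple:
    """Given the average NMI duration that exceeded the allowed time (the first
    number in the log line), return (new allowed ns, new max sample rate)."""
    # The new allowed duration is the observed average plus 25%.
    allowed_ns = avg_len_ns + avg_len_ns // 4
    # Per-tick budget for perf NMIs (250000 ns with the assumptions above).
    budget_ns = (TICK_NSEC // 100) * CPU_TIME_MAX_PERCENT
    samples_per_tick = budget_ns // allowed_ns if allowed_ns < budget_ns else 1
    return allowed_ns, samples_per_tick * HZ


if __name__ == "__main__":
    for observed in (2508, 3143, 3931, 182586):
        print(observed, lowered_sample_rate(observed))
    # 2508 (3135, 79000)
    # 3143 (3928, 63000)
    # 3931 (4913, 50000)
    # 182586 (228232, 1000)
```

The drop to 1000 samples per second follows directly from average handler durations of roughly 0.18 to 0.35 ms. The rate reduction is therefore a symptom of NMI handling being delayed, not a cause of the lockups.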

vmcore analysis:

      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-957.21.3.el7.x86_64/vmlinux
        CPUS: 96
        DATE: Tue Sep 22 11:20:01 GMT 2020
      UPTIME: 13:39:57
LOAD AVERAGE: 0.15, 0.05, 0.06
       TASKS: 1380
     RELEASE: 3.10.0-957.21.3.el7.x86_64
      MEMORY: 575.7 GB
       PANIC: "BUG: unable to handle kernel paging request at 00000afffffd3f60"
         PID: 338389
     COMMAND: "crond"
        TASK: ffff8a93a3772080  [THREAD_INFO: ffff8a939d41c000]
         CPU: 61
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 338389  TASK: ffff8a93a3772080  CPU: 61  COMMAND: "crond"
 #0 [ffff8a939d41f960] machine_kexec at ffffffff98663934
 #1 [ffff8a939d41f9c0] __crash_kexec at ffffffff9871d162
 #2 [ffff8a939d41fa90] crash_kexec at ffffffff9871d250
 #3 [ffff8a939d41faa8] oops_end at ffffffff98d6d778
 #4 [ffff8a939d41fad0] no_context at ffffffff98d5bdbe
 #5 [ffff8a939d41fb20] __bad_area_nosemaphore at ffffffff98d5be55
 #6 [ffff8a939d41fb70] bad_area_nosemaphore at ffffffff98d5bfc6
 #7 [ffff8a939d41fb80] __do_page_fault at ffffffff98d706d0
 #8 [ffff8a939d41fbf0] do_page_fault at ffffffff98d70925
 #9 [ffff8a939d41fc20] page_fault at ffffffff98d6c768
    [exception RIP: __radix_tree_lookup+44]
    RIP: ffffffff9897ce5c  RSP: ffff8a939d41fcd0  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: 00000afffffd3f58  RCX: ffff8a939d41fd18
    RDX: 0000000000000000  RSI: 0003fffffeffffff  RDI: 00000afffffd3f58
    RBP: ffff8a939d41fd08   R8: ffff8a930bf35e70   R9: 0000000000000000
    R10: 00007fabc99c4b10  R11: 00003ffffffff000  R12: 0003fffffeffffff
    R13: 00000afffffd3f58  R14: ffff8a939d41fd18  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
#10 [ffff8a939d41fd10] radix_tree_lookup_slot at ffffffff9897cf42
#11 [ffff8a939d41fd30] __find_get_page at ffffffff987b600e
#12 [ffff8a939d41fd50] find_get_page at ffffffff987b609e
#13 [ffff8a939d41fd60] lookup_swap_cache at ffffffff987ff8b7
#14 [ffff8a939d41fd70] handle_pte_fault at ffffffff987e969d
#15 [ffff8a939d41fe08] handle_mm_fault at ffffffff987ec27d
#16 [ffff8a939d41feb0] __do_page_fault at ffffffff98d70603
#17 [ffff8a939d41ff20] do_page_fault at ffffffff98d70925
#18 [ffff8a939d41ff50] page_fault at ffffffff98d6c768
    RIP: 00007fabc97ce960  RSP: 00007ffe48eec478  RFLAGS: 00010206
    RAX: 000000005f695141  RBX: 000055ee5dc56e22  RCX: 000055ee5de9a300
    RDX: 00007fabc938cf20  RSI: 0000000000000170  RDI: 0000000000000001
    RBP: 000055ee5f624f50   R8: 00007fabc871b260   R9: 00000000000060df
    R10: 00007fabc99c4b10  R11: 0000000000000246  R12: 000055ee5de9a300
    R13: 000055ee5de598c0  R14: 000055ee5f624f80  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

crash> dis -rl ffffffff9897ce5c | tail
0xffffffff9897ce40 <__radix_tree_lookup+16>:    mov    %rdi,%r13
0xffffffff9897ce43 <__radix_tree_lookup+19>:    push   %r12
0xffffffff9897ce45 <__radix_tree_lookup+21>:    mov    %rsi,%r12
0xffffffff9897ce48 <__radix_tree_lookup+24>:    push   %rbx
0xffffffff9897ce49 <__radix_tree_lookup+25>:    sub    $0x10,%rsp
0xffffffff9897ce4d <__radix_tree_lookup+29>:    mov    %gs:0x28,%rax
0xffffffff9897ce56 <__radix_tree_lookup+38>:    mov    %rax,-0x30(%rbp)
0xffffffff9897ce5a <__radix_tree_lookup+42>:    xor    %eax,%eax
/usr/src/debug/kernel-3.10.0-957.21.3.el7/linux-3.10.0-957.21.3.el7.x86_64/lib/radix-tree.c: 453
0xffffffff9897ce5c <__radix_tree_lookup+44>:    mov    0x8(%r13),%rdx

R13: 00000afffffd3f58: invalid address

 450 static unsigned radix_tree_load_root(struct radix_tree_root *root,
 451                 struct radix_tree_node **nodep, unsigned long *maxindex)
 452 {
 453         struct radix_tree_node *node = rcu_dereference_raw(root->rnode);

/usr/src/debug/kernel-3.10.0-957.21.3.el7/linux-3.10.0-957.21.3.el7.x86_64/mm/memory.c: 3433
0xffffffff987e948a <handle_pte_fault+58>:       mov    0x10(%rdi),%r12

3429 static int handle_pte_fault(struct vm_fault *vmf)
3430 {
3431         struct vm_area_struct *vma = vmf->vma;
3432         struct mm_struct *mm = vma->vm_mm;
3433         unsigned long address = (unsigned long)vmf->virtual_address;

crash> vm_fault ffff8a939d41fe20
struct vm_fault {
  flags = 680, 
  pgoff = 22, 
  virtual_address = 0x7fabc97ce960, <<
  page = 0x0, 
  cow_page = 0x0, 
  orig_pte = {
    pte = 0
  }, 
  pmd = 0xffff8a930c69b258, 
  vma = 0xffff8a93a66cae58,
  gfp_mask = 131290, 
  pte = 0xffff8a930bf35e70,
  pud = 0xffff8a92af3a4578
}

crond took a user-space page fault at 0x7fabc97ce960, the same address as its user-mode RIP in frame #18.

crash> vtop -c 338389 -u 00007fabc97ce960
VIRTUAL     PHYSICAL        
7fabc97ce960  (not mapped)

   PGD: 231fa907f8 => 80000022af3a4067
   PUD: 22af3a4578 => 230c69b067
   PMD: 230c69b258 => 230bf35067
   PTE: 230bf35e70 => 200000000

   PTE     vtop: cannot determine swap location

Another vmcore:

crash> sys
      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-957.21.3.el7.x86_64/vmlinux
        CPUS: 96
        DATE: Sat Aug 29 08:07:23 2020
      UPTIME: 1 days, 14:33:13
LOAD AVERAGE: 0.61, 0.91, 1.11
       TASKS: 1510
     RELEASE: 3.10.0-957.21.3.el7.x86_64
     MACHINE: x86_64  (2700 Mhz)
      MEMORY: 575.7 GB
       PANIC: "BUG: unable to handle kernel paging request at 0000000200000018"

        DMI_BIOS_VENDOR: HPE
       DMI_BIOS_VERSION: I42
          DMI_BIOS_DATE: 05/22/2019
         DMI_SYS_VENDOR: HPE
       DMI_PRODUCT_NAME: Synergy 480 Gen10

Task states at the time of the panic:

crash> ps -S
  RU: 96
  IN: 1412
  WA: 2

The backtrace shows the exception RIP in _raw_spin_lock, which was reached from the Open vSwitch receive path (bnx2x NAPI poll in softirq context).

crash> bt
PID: 0      TASK: ffff8db2036a30c0  CPU: 82  COMMAND: "swapper/82"
 #0 [ffff8dd52fd835f0] machine_kexec at ffffffff8f263934
 #1 [ffff8dd52fd83650] __crash_kexec at ffffffff8f31d162
 #2 [ffff8dd52fd83720] crash_kexec at ffffffff8f31d250
 #3 [ffff8dd52fd83738] oops_end at ffffffff8f96d778
 #4 [ffff8dd52fd83760] no_context at ffffffff8f95bdbe
 #5 [ffff8dd52fd837b0] __bad_area_nosemaphore at ffffffff8f95be55
 #6 [ffff8dd52fd83800] bad_area_nosemaphore at ffffffff8f95bfc6
 #7 [ffff8dd52fd83810] __do_page_fault at ffffffff8f9706d0
 #8 [ffff8dd52fd83880] do_page_fault at ffffffff8f970925
 #9 [ffff8dd52fd838b0] page_fault at ffffffff8f96c768
    [exception RIP: _raw_spin_lock+12]
    RIP: ffffffff8f96b74c  RSP: ffff8dd52fd83960  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: 0000000200000000  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: 0000000200000018
    RBP: ffff8dd52fd839a8   R8: ffff8db121ed5a00   R9: ffff8d85a1b95050
    R10: ffff8dd52fd83898  R11: 000000000000e513  R12: 000000000000010c
    R13: 0000000000000052  R14: ffff8d85a1b94ec0  R15: 0000000200000018
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff8dd52fd83960] ovs_flow_stats_update at ffffffffc05bdc27 [openvswitch]
#11 [ffff8dd52fd839b0] ovs_dp_process_packet at ffffffffc05bc7cf [openvswitch]
#12 [ffff8dd52fd83a28] ovs_vport_receive at ffffffffc05c6193 [openvswitch]
#13 [ffff8dd52fd83c28] netdev_frame_hook at ffffffffc05c6c3e [openvswitch]
#14 [ffff8dd52fd83c50] __netif_receive_skb_core at ffffffff8f83a04a
#15 [ffff8dd52fd83cc8] __netif_receive_skb at ffffffff8f83a878
#16 [ffff8dd52fd83ce8] netif_receive_skb_internal at ffffffff8f83a900
#17 [ffff8dd52fd83d18] napi_gro_receive at ffffffff8f83b588
#18 [ffff8dd52fd83d40] bnx2x_rx_int at ffffffffc13095e0 [bnx2x]
#19 [ffff8dd52fd83e48] bnx2x_poll at ffffffffc130c40d [bnx2x]
#20 [ffff8dd52fd83e78] net_rx_action at ffffffff8f83af1f
#21 [ffff8dd52fd83ef8] __do_softirq at ffffffff8f2a1075
#22 [ffff8dd52fd83f68] call_softirq at ffffffff8f97932c
#23 [ffff8dd52fd83f80] do_softirq at ffffffff8f22e675
#24 [ffff8dd52fd83fa0] irq_exit at ffffffff8f2a13f5
#25 [ffff8dd52fd83fb8] do_IRQ at ffffffff8f97a606
--- <IRQ stack> ---
#26 [ffff8db2036b3db8] ret_from_intr at ffffffff8f96c362
    [exception RIP: cpuidle_enter_state+87]
    RIP: ffffffff8f7aef07  RSP: ffff8db2036b3e60  RFLAGS: 00000202
    RAX: 00007e3b57c7b902  RBX: 0000000000000000  RCX: 0000000000000018
    RDX: 0000000225c17d03  RSI: ffff8db2036b3fd8  RDI: 00007e3b57c7b902
    RBP: ffff8db2036b3e88   R8: 0000000000000053   R9: 0000000000000018
    R10: 0000000000000157  R11: 00007e855a52d4c0  R12: ffff8db2036b3e30
    R13: 0000000000000086  R14: ffff8dd52fd95960  R15: ffff8dd52fd959a0
    ORIG_RAX: ffffffffffffff28  CS: 0010  SS: 0018
#27 [ffff8db2036b3e90] cpuidle_idle_call at ffffffff8f7af05e
#28 [ffff8db2036b3ed0] arch_cpu_idle at ffffffff8f2366de
#29 [ffff8db2036b3ee0] cpu_startup_entry at ffffffff8f2fc6da
#30 [ffff8db2036b3f28] start_secondary at ffffffff8f258047
#31 [ffff8db2036b3f50] start_cpu at ffffffff8f2000d5

The disassembly shows that the invalid value in RDI was being dereferenced.

crash> dis -rl ffffffff8f96b74c | tail
/usr/src/debug/kernel-3.10.0-957.21.3.el7/linux-3.10.0-957.21.3.el7.x86_64/kernel/spinlock.c: 150
0xffffffff8f96b740 <_raw_spin_lock>:    nopl   0x0(%rax,%rax,1) [FTRACE NOP]
/usr/src/debug/kernel-3.10.0-957.21.3.el7/linux-3.10.0-957.21.3.el7.x86_64/arch/x86/include/asm/atomic.h: 205
0xffffffff8f96b745 <_raw_spin_lock+5>:  xor    %eax,%eax
0xffffffff8f96b747 <_raw_spin_lock+7>:  mov    $0x1,%edx
0xffffffff8f96b74c <_raw_spin_lock+12>: lock cmpxchg %edx,(%rdi)

RDI: 0000000200000018: invalid

 255 /* Must be called with rcu_read_lock. */
 256 void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
 ...
 287         ovs_flow_stats_update(flow, key->tp.flags, skb);
              _______/
             /
 71 void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
 72                            const struct sk_buff *skb)
 73 {                        
 74         struct flow_stats *stats;
 75         unsigned int cpu = smp_processor_id();
 76         int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);
 77 
 78         stats = rcu_dereference(flow->stats[cpu]);
 79 
 80         /* Check if already have CPU-specific stats. */
 81         if (likely(stats)) {
 82                 spin_lock(&stats->lock);

crash> kmem ffff8d85a1b94ec0
CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
ffff8db12ae36000     2016        419      5520    345    32k  sw_flow
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  fffff9343086e400  ffff8d85a1b90000     0     16          5    11
  FREE / [ALLOCATED]
  [ffff8d85a1b94ec0]

      PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
fffff9343086e500 1c21b94000                0        0  0 2fffff00008000 tail

The sw_flow object's contents are partially zeroed out.

crash> dis -rl ffffffffc05bdc27
/usr/src/debug/kernel-3.10.0-957.21.3.el7/linux-3.10.0-957.21.3.el7.x86_64/net/openvswitch/flow.c: 73
0xffffffffc05bdbd0 <ovs_flow_stats_update>: nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffc05bdbd5 <ovs_flow_stats_update+5>:   push   %rbp
0xffffffffc05bdbd6 <ovs_flow_stats_update+6>:   mov    %rsp,%rbp
0xffffffffc05bdbd9 <ovs_flow_stats_update+9>:   push   %r15
0xffffffffc05bdbdb <ovs_flow_stats_update+11>:  push   %r14
0xffffffffc05bdbdd <ovs_flow_stats_update+13>:  mov    %rdi,%r14
0xffffffffc05bdbe0 <ovs_flow_stats_update+16>:  push   %r13
0xffffffffc05bdbe2 <ovs_flow_stats_update+18>:  push   %r12
0xffffffffc05bdbe4 <ovs_flow_stats_update+20>:  push   %rbx
0xffffffffc05bdbe5 <ovs_flow_stats_update+21>:  sub    $0x18,%rsp
/usr/src/debug/kernel-3.10.0-957.21.3.el7/linux-3.10.0-957.21.3.el7.x86_64/net/openvswitch/flow.c: 76
0xffffffffc05bdbe9 <ovs_flow_stats_update+25>:  movzwl 0xa2(%rdx),%r12d
/usr/src/debug/kernel-3.10.0-957.21.3.el7/linux-3.10.0-957.21.3.el7.x86_64/net/openvswitch/flow.c: 73
0xffffffffc05bdbf1 <ovs_flow_stats_update+33>:  mov    %esi,-0x2c(%rbp)
/usr/src/debug/kernel-3.10.0-957.21.3.el7/linux-3.10.0-957.21.3.el7.x86_64/net/openvswitch/flow.c: 75
0xffffffffc05bdbf4 <ovs_flow_stats_update+36>:  mov    %gs:0x3fa50420(%rip),%r13d        # 0xe01c
0xffffffffc05bdbfc <ovs_flow_stats_update+44>:  mov    %r13d,%eax
0xffffffffc05bdbff <ovs_flow_stats_update+47>:  lea    (%rdi,%rax,8),%r15
/usr/src/debug/kernel-3.10.0-957.21.3.el7/linux-3.10.0-957.21.3.el7.x86_64/net/openvswitch/flow.c: 76
0xffffffffc05bdc03 <ovs_flow_stats_update+51>:  shr    $0xa,%r12d
/usr/src/debug/kernel-3.10.0-957.21.3.el7/linux-3.10.0-957.21.3.el7.x86_64/net/openvswitch/flow.c: 78
0xffffffffc05bdc07 <ovs_flow_stats_update+55>:  mov    0x4e0(%r15),%rbx
/usr/src/debug/kernel-3.10.0-957.21.3.el7/linux-3.10.0-957.21.3.el7.x86_64/net/openvswitch/flow.c: 76
0xffffffffc05bdc0e <ovs_flow_stats_update+62>:  and    $0x4,%r12d
0xffffffffc05bdc12 <ovs_flow_stats_update+66>:  add    0x68(%rdx),%r12d
/usr/src/debug/kernel-3.10.0-957.21.3.el7/linux-3.10.0-957.21.3.el7.x86_64/net/openvswitch/flow.c: 81
0xffffffffc05bdc16 <ovs_flow_stats_update+70>:  test   %rbx,%rbx
0xffffffffc05bdc19 <ovs_flow_stats_update+73>:  je     0xffffffffc05bdc82 <ovs_flow_stats_update+178>
/usr/src/debug/kernel-3.10.0-957.21.3.el7/linux-3.10.0-957.21.3.el7.x86_64/include/linux/spinlock.h: 337
0xffffffffc05bdc1b <ovs_flow_stats_update+75>:  lea    0x18(%rbx),%r15
0xffffffffc05bdc1f <ovs_flow_stats_update+79>:  mov    %r15,%rdi
0xffffffffc05bdc22 <ovs_flow_stats_update+82>:  callq  0xffffffff8f96b740 <_raw_spin_lock>

RBX: 0000000200000000
R15: 0000000200000018

crash> sw_flow ffff8d85a1b94ec0
struct sw_flow {
  rcu = {
    next = 0x0, 
    func = 0x0
  }, 
  flow_table = {
    node = {{
        next = 0x0, 
        pprev = 0x0
      }, {
        next = 0x0, 
        pprev = 0xffff8dd527f21d88
      }}, 
    hash = 1001454895
  }, 
  ufid_table = {
    node = {{
        next = 0x0, 
        pprev = 0xffff8db0b4e3bfb0
      }, {
        next = 0x0, 
        pprev = 0x0
      }}, 
    hash = 756155237
  }, 
...
  mask = 0xffff8d8d2d347000, 
  sf_acts = 0xffff8d8d27aa2ac0, 
  stats = 0xffff8d85a1b953a0
}


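Most of the object is corrupted. Apart from a few pointer and hash fields it is zeroed, and the value 0x0000000200000000 recurs every 0x40 bytes. This is the same value that appears in the "bad pmd" messages and in the PTE of the first vmcore.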

crash> rd ffff8d85a1b94ec0 200
ffff8d85a1b94ec0:  0000000000000000 0000000000000000   ................
ffff8d85a1b94ed0:  0000000000000000 0000000000000000   ................
ffff8d85a1b94ee0:  0000000000000000 ffff8dd527f21d88   ...........'....
ffff8d85a1b94ef0:  000000003bb0fd2f 0000000000000000   /..;............
ffff8d85a1b94f00:  ffff8db0b4e3bfb0 0000000000000000   ................
ffff8d85a1b94f10:  0000000000000000 000000002d120365   ........e..-....
...
ffff8d85a1b951b0:  0000000200000000 0000000000000000   ................
ffff8d85a1b951c0:  0000000000000000 0000000000000000   ................
ffff8d85a1b951d0:  0000000000000000 0000000000000000   ................
ffff8d85a1b951e0:  0000000000000000 0000000000000000   ................
ffff8d85a1b951f0:  0000000200000000 0000000000000000   ................
ffff8d85a1b95200:  0000000000000000 0000000000000000   ................
ffff8d85a1b95210:  0000000000000000 0000000000000000   ................
ffff8d85a1b95220:  0000000000000000 0000000000000000   ................
ffff8d85a1b95230:  0000000200000000 0000000000000000   ................
ffff8d85a1b95240:  0000000000000000 0000000000000000   ................
ffff8d85a1b95250:  0000000000000000 0000000000000000   ................
ffff8d85a1b95260:  0000000000000000 0000000000000000   ................
ffff8d85a1b95270:  0000000200000000 0000000000000000   ................
ffff8d85a1b95280:  0000000000000000 0000000000000000   ................
ffff8d85a1b95290:  0000000000000000 0000000000000000   ................
ffff8d85a1b952a0:  0000000000000000 0000000000000000   ................
ffff8d85a1b952b0:  0000000200000000 0000000000000000   ................
ffff8d85a1b952c0:  0000000000000000 0000000000000000   ................
ffff8d85a1b952d0:  0000000000000000 0000000000000000   ................
ffff8d85a1b952e0:  0000000000000000 0000000000000000   ................
ffff8d85a1b952f0:  0000000200000000 0000000000000000   ................
ffff8d85a1b95300:  0000000000000000 0000000000000000   ................
ffff8d85a1b95310:  0000000000000000 0000000000000000   ................
ffff8d85a1b95320:  0000000000000000 0000000000000000   ................
ffff8d85a1b95330:  0000000200000000 0000000000000000   ................
ffff8d85a1b95340:  0000000000000000 0000000000000000   ................
ffff8d85a1b95350:  0000000000000000 0000000000000000   ................
ffff8d85a1b95360:  0000000000000000 0000000000000000   ................
ffff8d85a1b95370:  0000000200000000 0000000000000000   ................
ffff8d85a1b95380:  0000000000000000 0000000000000000   ................
ffff8d85a1b95390:  ffff8d8d2d347000 ffff8d8d27aa2ac0   .p4-.....*.'....
ffff8d85a1b953a0:  ffff8d8cb9f2dc00 0000000000000000   ................
ffff8d85a1b953b0:  0000000200000000 0000000000000000   ................

The third-party module 'scini' (DellEMC ScaleIO) is in use. It taints the kernel as proprietary (P), out-of-tree (O) and unsigned (E).

crash> mod -t
NAME     TAINTS
overlay  T
scini    POE

crash> mod|grep scini
ffffffffc08fa000  scini                         803568  (not loaded)  [CONFIG_KALLSYMS]

crash> module.version ffffffffc08fa000
  version = 0xffff8df92e244c00 "DellEMC ScaleIO Version: R2_6.11000.113"

Another pattern

Backtrace of the panic task:

crash> bt
PID: 0      TASK: ffff8dfe2dfa1070  CPU: 70  COMMAND: "swapper/70"
 #0 [ffff8dfe2dfafab0] machine_kexec at ffffffffae666254
 #1 [ffff8dfe2dfafb10] __crash_kexec at ffffffffae722ef2
 #2 [ffff8dfe2dfafbe0] crash_kexec at ffffffffae722fe0
 #3 [ffff8dfe2dfafbf8] oops_end at ffffffffaed8a798
 #4 [ffff8dfe2dfafc20] no_context at ffffffffae675d74
 #5 [ffff8dfe2dfafc70] __bad_area_nosemaphore at ffffffffae676042
 #6 [ffff8dfe2dfafcc0] bad_area_nosemaphore at ffffffffae676164
 #7 [ffff8dfe2dfafcd0] __do_page_fault at ffffffffaed8d750
 #8 [ffff8dfe2dfafd40] do_page_fault at ffffffffaed8d975
 #9 [ffff8dfe2dfafd70] page_fault at ffffffffaed89778
#10 [ffff8dfe2dfafe58] cpuidle_enter_state at ffffffffaebc61d5
#11 [ffff8dfe2dfafe90] cpuidle_idle_call at ffffffffaebc633e
#12 [ffff8dfe2dfafed0] arch_cpu_idle at ffffffffae637c8e
#13 [ffff8dfe2dfafee0] cpu_startup_entry at ffffffffae701c7a
#14 [ffff8dfe2dfaff28] start_secondary at ffffffffae65a727
#15 [ffff8dfe2dfaff50] start_cpu at ffffffffae6000d5

crash> whatis cpuidle_enter_state
int cpuidle_enter_state(struct cpuidle_device *, struct cpuidle_driver *, int);

crash> dis -r ffffffffaebc633e | tail | grep mov
0xffffffffaebc6325 <cpuidle_idle_call+197>: mov    0x4(%rbx),%eax
0xffffffffaebc6328 <cpuidle_idle_call+200>: mov    %eax,-0x2c(%rbp)
0xffffffffaebc6330 <cpuidle_idle_call+208>: mov    %r12d,%edx
0xffffffffaebc6333 <cpuidle_idle_call+211>: mov    %r14,%rsi
0xffffffffaebc6336 <cpuidle_idle_call+214>: mov    %rbx,%rdi
0xffffffffaebc633e <cpuidle_idle_call+222>: mov    0x4(%rbx),%r12d

crash> dis -r ffffffffaebc61d5 | head -15 | grep push
0xffffffffaebc6195 <cpuidle_enter_state+5>: push   %rbp
0xffffffffaebc619c <cpuidle_enter_state+12>:    push   %r15
0xffffffffaebc619e <cpuidle_enter_state+14>:    push   %r14
0xffffffffaebc61a7 <cpuidle_enter_state+23>:    push   %r13
0xffffffffaebc61ad <cpuidle_enter_state+29>:    push   %r12
0xffffffffaebc61b7 <cpuidle_enter_state+39>:    push   %rbx

crash> bt -f|grep cpuidle_idle -B4
    ffff8dfe2dfafe60: ffffbeba0ff83890 0000000000000003 %rbx and %r12
    ffff8dfe2dfafe70: 0000000000000003 ffffffffaf2de680 %r13 and %r14
    ffff8dfe2dfafe80: ffff8dfe2dfac000 ffff8dfe2dfafec8 %r15 and %rbp
    ffff8dfe2dfafe90: ffffffffaebc633e                  Return address
#11 [ffff8dfe2dfafe90] cpuidle_idle_call at ffffffffaebc633e

crash> cpuidle_device.cpu,state_count ffffbeba0ff83890
  cpu = 70
  state_count = 4

crash> cpuidle_driver.name,state_count ffffffffaf2de680
  name = 0xffffffffaf0b46d5 "intel_idle"
  state_count = 4

index = 0x03 (the saved %r12, passed in %edx).

CPU 70 was entering idle state 3, which is C6 (named C6-SKX by intel_idle).

crash> cpuidle_driver.states[3] ffffffffaf2de680
  states[3] =   {
    name = "C6-SKX\000\000\000\000\000\000\000\000\000",
    desc = "MWAIT 0x20\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
    flags = 536936448,
    exit_latency = 133,
    power_usage = 0,
    target_residency = 600,
    disabled = false,
    enter = 0xffffffffaed88060,
    enter_dead = 0x0
  },

Another pattern

[ 1591.421146] perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 1797.992688] perf: interrupt took too long (3161 > 3133), lowering kernel.perf_event_max_sample_rate to 63000
[ 2076.376223] perf: interrupt took too long (3980 > 3951), lowering kernel.perf_event_max_sample_rate to 50000
[ 2561.353779] perf: interrupt took too long (4981 > 4975), lowering kernel.perf_event_max_sample_rate to 40000
[79636.755011] perf: interrupt took too long (6252 > 6226), lowering kernel.perf_event_max_sample_rate to 31000
[240767.457308] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[240767.465243] IP: [<ffffffffbaf2bd59>] audit_copy_inode+0x29/0xb0
[240767.471262] PGD 80000017d0263067 PUD 180ccbe067 PMD 0
[240767.476544] Oops: 0000 [#1] SMP
[240767.479899] Modules linked in: mmfs26(OE) mmfslinux(OE) tracedev(OE) nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache iptable_filter ib_isert iscsi_target_mod target_core_mod ib_ucm dm_mirror dm_region_hash vfat dm_log fat dm_mod intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper rpcrdma cryptd iTCO_wdt sunrpc iTCO_vendor_support rdma_ucm ib_uverbs ib_iser rdma_cm iw_cm opa_vnic libiscsi ib_umad ib_ipoib pcspkr mei_me joydev i2c_i801 nfit scsi_transport_iscsi ib_cm ipmi_si sg shpchp wmi acpi_power_meter libnvdimm ipmi_devintf ipmi_msghandler mei acpi_pad lpc_ich acpi_cpufreq binfmt_misc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic hfi1(OE) drm_kms_helper
[240767.552082]  syscopyarea sysfillrect sysimgblt ixgbe fb_sys_fops rdmavt(OE) ttm ahci ib_core crct10dif_pclmul crct10dif_common crc32c_intel libahci mdio drm ptp libata i2c_algo_bit pps_core i2c_core dca
[240767.569242] CPU: 23 PID: 293049 Comm: SU2_CFD Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.14.4.el7.x86_64 #1
[240767.580937] Hardware name: Intel Corporation S2600BPB/S2600BPB, BIOS SE5C620.86B.00.01.0013.030920180427 03/09/2018
[240767.591423] task: ffff89fc9d300fd0 ti: ffff89fd3be7c000 task.ti: ffff89fd3be7c000
[240767.598973] RIP: 0010:[<ffffffffbaf2bd59>]  [<ffffffffbaf2bd59>] audit_copy_inode+0x29/0xb0
[240767.607411] RSP: 0018:ffff89fd3be7fc48  EFLAGS: 00010246
[240767.612798] RAX: 0000000000000000 RBX: ffff89fc9a775060 RCX: 0000000000000000
[240767.619998] RDX: 0000000000000000 RSI: ffff89fc9a77509c RDI: ffff89fc9a775060
[240767.627199] RBP: ffff89fd3be7fc78 R08: 0000000000000000 R09: 0000000000000000
[240767.634402] R10: ffff8a0883cf1c80 R11: ffff8a0883cf1c80 R12: ffff8a0883cf1c80
[240767.641605] R13: ffff89fc9a775060 R14: 00000000000066a6 R15: 0000000000000000
[240767.648809] FS:  00002b99cdcea6c0(0000) GS:ffff8a0d8bf40000(0000) knlGS:0000000000000000
[240767.656962] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[240767.662781] CR2: 0000000000000040 CR3: 0000000cda3ec000 CR4: 00000000005607e0
[240767.669981] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[240767.677185] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[240767.684384] PKRU: 55555554
[240767.687178] Call Trace:
[240767.689714]  [<ffffffffbaed1fcb>] ? wake_up_q+0x5b/0x80
[240767.695015]  [<ffffffffbaf329ea>] __audit_inode+0x18a/0x3c0
[240767.700661]  [<ffffffffbb02fba3>] do_last+0xd13/0x12c0
[240767.705875]  [<ffffffffbb1569b2>] ? __rb_insert_augmented+0x92/0x1f0
[240767.712296]  [<ffffffffbb030227>] path_openat+0xd7/0x640
[240767.717684]  [<ffffffffbb031dbd>] do_filp_open+0x4d/0xb0
[240767.723070]  [<ffffffffbb03f167>] ? __alloc_fd+0x47/0x170
[240767.728544]  [<ffffffffbb01e0d7>] do_sys_open+0x137/0x240
[240767.734016]  [<ffffffffbb01e1fe>] SyS_open+0x1e/0x20
[240767.739058]  [<ffffffffbb52579b>] system_call_fastpath+0x22/0x27
[240767.745134] Code: 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 48 8d 77 3c 53 48 89 fb 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 e8 31 c0 <48> 8b 42 40 48 89 47 20 48 8b 42 28 8b 40 10 89 47 28 0f b7 02
[240767.765529] RIP  [<ffffffffbaf2bd59>] audit_copy_inode+0x29/0xb0
[240767.771632]  RSP <ffff89fd3be7fc48>
[240767.775203] CR2: 0000000000000040

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
