Soft lockup detected on a large NUMA system under heavy memory usage

Solution Verified

Environment

  • Red Hat Enterprise Linux 7.1
  • Red Hat Enterprise Linux 6.6
  • Large NUMA systems (for example, systems with 12TB of memory and 288 CPUs where the NUMA factor is lower than or equal to 30)
  • Red Hat Enterprise OpenStack Platform (RHELOSP) 5
  • Red Hat Enterprise OpenStack Platform (RHELOSP) 6
  • Red Hat Enterprise OpenStack Platform (RHELOSP) 7
  • Red Hat OpenStack Platform (RHOSP) 8
  • Red Hat OpenStack Platform (RHOSP) 9
  • Red Hat OpenStack Platform (RHOSP) 10
  • Red Hat OpenStack Platform (RHOSP) 11

Issue

  • Systems with a NUMA factor lower than or equal to 30 may hang under high load.
  • A soft lockup is detected under heavy memory pressure on a large NUMA system.

Resolution

Root Cause

  • Under high load, the page cache is not distributed evenly across the NUMA nodes when files are read into the page cache on a small subset of the CPUs. When free memory is fragmented or running short, the kernel tries to compact memory to obtain 2MB pages for transparent hugepages (THP).
  • For more about THP, refer to How to use, monitor, and disable transparent hugepages in Red Hat Enterprise Linux 6 and 7?
  • This results in total memory exhaustion on some NUMA nodes while other nodes have no memory in use.
  • With the default value of /proc/sys/vm/zone_reclaim_mode, CPUs running on the memory-exhausted nodes skip over to the next node with available memory to attempt the allocation.
  • The boot-time initialization code that sets /proc/sys/vm/zone_reclaim_mode was changed upstream, and the change was backported to RHEL 7.1 and RHEL 6.6.
  • For RHEL 6.6 or later and 7.0, the change raised the threshold: /proc/sys/vm/zone_reclaim_mode is set to 1 only when the largest NUMA factor is higher than 30, whereas the threshold was 20 before the modification.
  • Systems such as a 12TB machine with 288 CPUs (Haswell) may experience the problem and require tuning of /proc/sys/vm/zone_reclaim_mode.
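
The threshold check described above can be sketched in Python. This is a minimal illustration, not the kernel's implementation. It assumes that the "NUMA factor" in this article corresponds to the node distances exposed in /sys/devices/system/node/nodeN/distance; the article does not confirm that. The function names and the RECLAIM_THRESHOLD constant are hypothetical and chosen for this example only.

```python
from pathlib import Path

# Threshold named in the article: 30 after the change, 20 before it.
RECLAIM_THRESHOLD = 30


def parse_distance_row(text):
    """Parse one nodeN/distance file, e.g. '10 21 31' -> [10, 21, 31]."""
    return [int(tok) for tok in text.split()]


def largest_numa_distance(rows):
    """Return the largest distance in any row, or None if no rows were found."""
    values = [d for row in rows for d in row]
    return max(values) if values else None


def expected_zone_reclaim_mode(max_distance, threshold=RECLAIM_THRESHOLD):
    """Boot-time default as the article describes it: 1 only above the threshold."""
    if max_distance is not None and max_distance > threshold:
        return 1
    return 0


def read_system(sysfs="/sys/devices/system/node",
                procfile="/proc/sys/vm/zone_reclaim_mode"):
    """Read the distance table and the current setting; tolerate missing files."""
    paths = sorted(Path(sysfs).glob("node[0-9]*/distance"))
    rows = [parse_distance_row(p.read_text()) for p in paths]
    proc = Path(procfile)
    current = int(proc.read_text().strip()) if proc.exists() else None
    return rows, current


if __name__ == "__main__":
    rows, current = read_system()
    worst = largest_numa_distance(rows)
    print("largest NUMA distance:", worst)
    print("expected boot-time default:", expected_zone_reclaim_mode(worst))
    print("current zone_reclaim_mode:", current)
```

On a system where the largest distance is 30 or lower, the sketch reports an expected default of 0, which matches the affected configuration described in the Environment section. Passing threshold=20 shows what the older boot-time code would have chosen for the same distance table.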

Diagnostic Steps

  • Look for "BUG: soft lockup" messages in syslog, as in the following example.
kernel: BUG: soft lockup - CPU#102 stuck for 22s! [forkoff:235364]
kernel: Modules linked in: fuse btrfs zlib_deflate raid6_pq xor msdos ext4
mbcache jbd2 binfmt_misc xt_CHECKSUM iptable_mangle ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4
nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT iptable_filter ip_tables
tun bridge stp llc dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt
iTCO_vendor_support vfat fat intel_powerclamp coretemp intel_rapl kvm_intel
kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel
lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich
i2c_i801 mfd_core shpchp ipmi_si ipmi_msghandler tpm_infineon nls_utf8 isofs
loop uinput xfs libcrc32c sd_mod crc_t10dif crct10dif_common mgag200
syscopyarea sysfillrect sysimgblt drm_kms_helper igb qla2xxx e1000e
kernel: ttm dca ptp scsi_transport_fc drm i2c_algo_bit pps_core scsi_tgt
megaraid_sas i2c_core
kernel: CPU: 102 PID: 235364 Comm: forkoff Not tainted 3.10.0-229.el7.x86_64
#1
kernel:
kernel: task: ffff911d25eea220 ti: ffff927835154000 task.ti: ffff927835154000
kernel: RIP: 0010:[<ffffffff811798ef>]  [<ffffffff811798ef>]
isolate_freepages_block+0xaf/0x380
kernel: RSP: 0000:ffff927835157860  EFLAGS: 00000286
kernel: RAX: 00000000ffffffff RBX: 00000014e4b60000 RCX: ffff927835157aa8
kernel: RDX: 0000000053345c00 RSI: 0000000053345a00 RDI: ffff927835157a50
kernel: RBP: ffff9278351578f8 R08: 0000000000000000 R09: ffff8e007ffda000
kernel: R10: 00000014cd168000 R11: 0000000060080000 R12: 0000000000000301
kernel: R13: ffff927835157850 R14: 0000000000000000 R15: 00007f4501f64000
kernel: FS:  00007f4f4d540740(0000) GS:ffff90fa7f9a0000(0000)
knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f4690400000 CR3: 000009baa7060000 CR4: 00000000001407e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Stack:
kernel: ffffea2a02311040 0000000060080000 0000000000000094 ffff927835157948
kernel: 0000000053345a00 ffff927835157a50 000000004808ec38 00ff8e0000000000
kernel: ffff927835157a90 ffff927835157aa8 ffff927835157a50 ffff8e007ffad000
kernel: Call Trace:
kernel: [<ffffffff81179d8f>] compaction_alloc+0x1cf/0x240
kernel: [<ffffffff811b15ce>] migrate_pages+0xce/0x610
kernel: [<ffffffff81179bc0>] ? isolate_freepages_block+0x380/0x380
kernel: [<ffffffff8117abb9>] compact_zone+0x299/0x400
kernel: [<ffffffff8117adbc>] compact_zone_order+0x9c/0xf0
kernel: [<ffffffff8117b171>] try_to_compact_pages+0x121/0x1a0
kernel: [<ffffffff815ff336>] __alloc_pages_direct_compact+0xac/0x196
kernel: [<ffffffff81160758>] __alloc_pages_nodemask+0x788/0xb90
kernel: [<ffffffff810b11c0>] ? task_numa_fault+0x8d0/0xbb0
kernel: [<ffffffff811a24aa>] alloc_pages_vma+0x9a/0x140
kernel: [<ffffffff811b674b>] do_huge_pmd_anonymous_page+0x10b/0x410
kernel: [<ffffffff81182334>] handle_mm_fault+0x184/0xd60
kernel: [<ffffffff8160f1e6>] __do_page_fault+0x156/0x520
kernel: [<ffffffff8118a945>] ? change_protection+0x65/0xa0
kernel: [<ffffffff811a0dbb>] ? change_prot_numa+0x1b/0x40
kernel: [<ffffffff810adb86>] ? task_numa_work+0x266/0x300
kernel: [<ffffffff8160f5ca>] do_page_fault+0x1a/0x70
kernel: [<ffffffff81013b0c>] ? do_notify_resume+0x9c/0xb0
kernel: [<ffffffff8160b808>] page_fault+0x28/0x30
kernel: Code: 89 ee 48 89 4d b0 41 89 c5 eb 1d 90 49 83 c7 01 48 83 c3 40 4d
39 fc 0f 86 07 01 00 00 41 83 c5 01 4d 85 f6 4c 0f 44 f3 8b 43 18 <83> f8 80
75 dc 48 8b 45 b8 0f b6 55 c0 48 8d 75 c8 4c 8b 45 b0
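
Searching a large syslog for this pattern can be automated. The following Python sketch extracts the CPU, duration, and task from "BUG: soft lockup" lines and lists the functions in the call trace. The function names are hypothetical, and the heuristic in looks_like_thp_compaction (a compaction function together with do_huge_pmd_anonymous_page in the same trace) is an assumption derived from the example above, not an official diagnostic rule.

```python
import re

# Matches e.g. "BUG: soft lockup - CPU#102 stuck for 22s! [forkoff:235364]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+?):(?P<pid>\d+)\]"
)

# Matches stack frames such as "[<ffffffff81179d8f>] compaction_alloc+0x1cf/0x240".
# A leading "? " marks a frame the kernel considers unreliable.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(?:\? )?(?P<func>[\w.]+)\+0x")


def find_soft_lockups(log_text):
    """Return one dict per soft lockup message found in the log text."""
    return [
        {
            "cpu": int(m.group("cpu")),
            "secs": int(m.group("secs")),
            "comm": m.group("comm"),
            "pid": int(m.group("pid")),
        }
        for m in LOCKUP_RE.finditer(log_text)
    ]


def trace_functions(log_text):
    """Return the function names of all stack frames, in order of appearance."""
    return [m.group("func") for m in FRAME_RE.finditer(log_text)]


def looks_like_thp_compaction(funcs):
    """Heuristic (assumption): compaction reached from a THP page fault."""
    has_compaction = any("compact" in f for f in funcs)
    return has_compaction and "do_huge_pmd_anonymous_page" in funcs


if __name__ == "__main__":
    import sys

    text = sys.stdin.read()
    for hit in find_soft_lockups(text):
        print(hit)
    print("THP compaction in trace:", looks_like_thp_compaction(trace_functions(text)))
```

The script reads the log from standard input, so it can be fed the contents of a syslog file. Because the whole input is treated as one trace, it is best run on the excerpt surrounding a single lockup rather than on an entire log.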


1 Comment

Hello. Helpful article, but could you explain what you mean by "numa factor"? Do you mean numa_distance, i.e. the distance between two NUMA nodes?