Page allocation failures when running NVIDIA GPU-related commands

Solution Unverified - Updated -

Issue

  • Page allocation failures occur when running NVIDIA GPU-related commands.
  • NVIDIA proprietary kernel modules appear in the call traces of the page allocation failures.
  • Messages like the following appear in the logs:
Apr 14 23:02:00 localhost kernel: UVM GPU5 BH: page allocation failure: order:1, mode:0x2050d0
Apr 14 23:02:00 localhost kernel: CPU: 64 PID: 133632 Comm: UVM GPU5 BH Tainted: P           OE  ------------   3.10.0-1062.18.1.el7.x86_64 #1
Apr 14 23:02:00 localhost kernel: Hardware name: Cray Inc. SYS-4029GP-TVRT/X11DGO-T, BIOS 3.3 03/11/2020
Apr 14 23:02:00 localhost kernel: Call Trace:
Apr 14 23:02:00 localhost kernel: [<ffffffff8d57b416>] dump_stack+0x19/0x1b
Apr 14 23:02:00 localhost kernel: [<ffffffff8cfc3fc0>] warn_alloc_failed+0x110/0x180
Apr 14 23:02:00 localhost kernel: [<ffffffff8d57698a>] __alloc_pages_slowpath+0x6bb/0x729
Apr 14 23:02:00 localhost kernel: [<ffffffff8cfc8636>] __alloc_pages_nodemask+0x436/0x450
Apr 14 23:02:00 localhost kernel: [<ffffffff8d016c58>] alloc_pages_current+0x98/0x110
Apr 14 23:02:00 localhost kernel: [<ffffffff8d024fed>] new_slab+0x44d/0x4e0
Apr 14 23:02:00 localhost kernel: [<ffffffff8d02544c>] ___slab_alloc+0x3cc/0x520
Apr 14 23:02:00 localhost kernel: [<ffffffffc16c8c4a>] ? alloc_internal+0x6a/0x80 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16c1a51>] ? pick_chunk+0x51/0x70 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16c1aa0>] ? try_claim_chunk+0x30/0x80 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16c8c4a>] ? alloc_internal+0x6a/0x80 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffff8d577dbc>] __slab_alloc+0x40/0x5c
Apr 14 23:02:00 localhost kernel: [<ffffffff8d026150>] __kmalloc+0x1c0/0x230
Apr 14 23:02:00 localhost kernel: [<ffffffffc16c8c4a>] alloc_internal+0x6a/0x80 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16c8e55>] __uvm_kvmalloc_zero+0x25/0x60 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16bda90>] allocate_directory_with_location+0x90/0x130 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16becc8>] uvm_page_tree_get_ptes_async+0x2a8/0x560 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16a2604>] ? block_mark_memory_used+0x64/0x70 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16a6cf2>] ? block_copy_resident_pages_between+0x952/0xfb0 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16bef92>] uvm_page_tree_get_ptes+0x12/0x30 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16a2ce9>] block_alloc_ptes_with_retry+0x339/0x4a0 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16a292b>] ? block_gpu_supports_2m.part.26+0x2b/0x40 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16a326d>] block_alloc_ptes_new_state+0x2d/0x90 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16aca31>] uvm_va_block_map+0x421/0xf00 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16c3c12>] ? uvm_tracker_add_tracker_safe+0x12/0x90 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16a7868>] ? block_copy_resident_pages+0x418/0x950 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16cd552>] ? uvm_pmm_gpu_mark_root_chunk_used+0x12/0x20 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16b14ea>] uvm_va_block_service_locked+0x77a/0xf90 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16b6898>] service_batch_managed_faults_in_block_locked+0x778/0x980 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffff8cee9f4e>] ? pick_next_task_fair+0x61e/0x870
Apr 14 23:02:00 localhost kernel: [<ffffffffc16b72d4>] service_fault_batch+0x424/0x660 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16b8175>] uvm_gpu_service_replayable_faults+0x125/0xb10 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16c270f>] ? thread_context_non_interrupt_add+0x10f/0x200 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16936d4>] replayable_faults_isr_bottom_half+0x44/0x60 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc16937b4>] replayable_faults_isr_bottom_half_entry+0x54/0xb0 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffff8cecc303>] ? down_interruptible+0x33/0x60
Apr 14 23:02:00 localhost kernel: [<ffffffffc16818f1>] _main_loop+0x91/0x190 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffffc1681860>] ? nvstatusToString+0x50/0x50 [nvidia_uvm]
Apr 14 23:02:00 localhost kernel: [<ffffffff8cec6321>] kthread+0xd1/0xe0
Apr 14 23:02:00 localhost kernel: [<ffffffff8cec6250>] ? insert_kthread_work+0x40/0x40
Apr 14 23:02:00 localhost kernel: [<ffffffff8d58dd1d>] ret_from_fork_nospec_begin+0x7/0x21
Apr 14 23:02:00 localhost kernel: [<ffffffff8cec6250>] ? insert_kthread_work+0x40/0x40
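
The key fields in the trace above can be pulled out with a short script. This is a minimal sketch, not a supported tool: the function names `parse_failure` and `modules_in_trace` are hypothetical, and the byte size assumes 4 KiB base pages (the x86_64 default). An order-N request asks for 2^N physically contiguous pages, so `order:1` here is a request for 8 KiB of contiguous memory.

```python
import re
from collections import Counter

PAGE_SIZE = 4096  # assumption: 4 KiB base pages (x86_64 default)

# Matches the first line of a failure report, e.g.
# "... kernel: UVM GPU5 BH: page allocation failure: order:1, mode:0x2050d0"
FAILURE_RE = re.compile(
    r"kernel: (?P<comm>.+?): page allocation failure: "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)

# Matches a trailing "[module_name]" on a call-trace line.
MODULE_RE = re.compile(r"\[([A-Za-z0-9_]+)\]\s*$")


def parse_failure(line):
    """Return the task name, order, mode and request size, or None."""
    m = FAILURE_RE.search(line)
    if m is None:
        return None
    order = int(m.group("order"))
    return {
        "comm": m.group("comm"),
        "order": order,
        "mode": int(m.group("mode"), 16),
        "bytes": PAGE_SIZE << order,  # 2**order contiguous pages
    }


def modules_in_trace(lines):
    """Count how many call-trace frames belong to each kernel module."""
    counts = Counter()
    for line in lines:
        m = MODULE_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


log = [
    "Apr 14 23:02:00 localhost kernel: UVM GPU5 BH: page allocation failure: order:1, mode:0x2050d0",
    "Apr 14 23:02:00 localhost kernel: [<ffffffff8d026150>] __kmalloc+0x1c0/0x230",
    "Apr 14 23:02:00 localhost kernel: [<ffffffffc16c8c4a>] alloc_internal+0x6a/0x80 [nvidia_uvm]",
]
info = parse_failure(log[0])
print(info["comm"], info["order"], info["bytes"])
print(dict(modules_in_trace(log)))
```

Feeding the full log to `modules_in_trace` shows which modules own the frames between `__kmalloc` and the kernel thread entry point; in the trace above, all module-tagged frames belong to `nvidia_uvm`.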

Environment

  • Red Hat Enterprise Linux
  • NVIDIA GPU
  • NVIDIA proprietary kernel modules (nvidia, nvidia_uvm, nvidia_modeset, nvidia_drm)
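
To check whether a system matches this environment, one option is to look for the listed modules in `/proc/modules`, where the first whitespace-separated field of each line is the module name. The following is a rough sketch under that assumption; the helper names `loaded_modules` and `report` are hypothetical.

```python
import os

# Module names taken from the Environment list above.
EXPECTED = ("nvidia", "nvidia_uvm", "nvidia_modeset", "nvidia_drm")


def loaded_modules(proc_modules_text):
    """Return the set of module names in /proc/modules-formatted text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def report(proc_modules_text, expected=EXPECTED):
    """Map each expected module name to True if it is loaded."""
    loaded = loaded_modules(proc_modules_text)
    return {name: name in loaded for name in expected}


# Only attempt the live check where /proc/modules exists (Linux).
if os.path.exists("/proc/modules"):
    with open("/proc/modules") as f:
        for name, present in report(f.read()).items():
            print(f"{name}: {'loaded' if present else 'not loaded'}")
```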
