  • Server froze, and I believe bad cgroup memory allocation caused it. How can I prove it?

    Good morning. I have a server that froze.

    While looking for the cause, I found messages related to cgroups and memory. How can I confirm that the server crashed because of cgroup memory misallocation?

    lab$ sar -f /var/log/sa/sa30

    11:10:01 AM all 9.39 0.00 10.06 11.71 0.00 68.85
    11:20:02 AM all 16.80 0.00 8.95 5.05 0.00 69.20
    11:30:01 AM all 4.14 0.00 5.88 7.12 0.00 82.85
    Average: all 5.89 0.00 5.73 5.70 0.00 82.68
    12:28:37 PM LINUX RESTART
    12:39:30 PM LINUX RESTART
    12:40:01 PM CPU %user %nice %system %iowait %steal %idle
    12:50:01 PM all 1.66 0.01 3.55 0.22 0.00 94.56
    Average: all 1.66 0.01 3.55 0.22 0.00 94.56
    12:57:58 PM LINUX RESTART
    01:50:01 PM CPU %user %nice %system %iowait %steal %idle
    02:00:01 PM all 1.56 0.00 1.83 0.07 0.00 96.54
    02:10:01 PM all 0.72 0.00 1.01 0.06 0.00 98.22
    02:20:01 PM all 2.11 0.00 1.16 0.07 0.00 96.66
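
To line up the freeze with the restarts, it can help to pull out the last sample sar recorded before each `LINUX RESTART` marker. The following is a minimal sketch, not a definitive tool: the function name `last_sample_before_restart` is hypothetical, and it assumes 12-hour timestamps (time, then AM/PM, then `all`) as in the output above.

```shell
# Hypothetical helper: read `sar` text output on stdin and, for each
# "LINUX RESTART" marker, print the last data sample recorded before it.
# Assumes rows of the form: HH:MM:SS AM|PM all <values...>
last_sample_before_restart() {
    awk '
        /LINUX RESTART/ {
            # Restart lines look like: 12:28:37 PM LINUX RESTART
            print "restart at " $1 " " $2 ", last sample at " (last == "" ? "unknown" : last)
            last = ""   # forget samples from the previous boot
            next
        }
        # Data rows carry "all" in the third field; "Average:" rows do not.
        $3 == "all" { last = $1 " " $2 }
    '
}
```

It could be fed with `sar -f /var/log/sa/sa30 | last_sample_before_restart`. In the excerpt above, the last sample before the first restart is 11:30:01 AM, almost an hour before the 12:28:37 PM restart, which narrows the window in which the server stopped recording. Running `sar -r -f /var/log/sa/sa30` for the same file shows memory utilization rather than CPU, which is closer to the question being asked.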

    lab log $ lspci | grep ERROR
    7f:14.2 System peripheral: Intel Corporation Haswell-E Integrated Memory Controller 0 Channel 0 ERROR Registers (rev 02)
    7f:14.3 System peripheral: Intel Corporation Haswell-E Integrated Memory Controller 0 Channel 1 ERROR Registers (rev 02)
    7f:17.2 System peripheral: Intel Corporation Haswell-E Integrated Memory Controller 1 Channel 0 ERROR Registers (rev 02)
    7f:17.3 System peripheral: Intel Corporation Haswell-E Integrated Memory Controller 1 Channel 1 ERROR Registers (rev 02)
    ff:14.2 System peripheral: Intel Corporation Haswell-E Integrated Memory Controller 0 Channel 0 ERROR Registers (rev 02)
    ff:14.3 System peripheral: Intel Corporation Haswell-E Integrated Memory Controller 0 Channel 1 ERROR Registers (rev 02)
    ff:17.2 System peripheral: Intel Corporation Haswell-E Integrated Memory Controller 1 Channel 0 ERROR Registers (rev 02)
    ff:17.3 System peripheral: Intel Corporation Haswell-E Integrated Memory Controller 1 Channel 1 ERROR Registers (rev 02)

    lab$ ls -lha dmesg
    -rw-r--r-- 1 root root 121K Jan 30 12:57 dmesg

    lab$ cat dmesg |egrep -i "Memory|error|fail"

    Reserving 145MB of memory at 48MB for crashkernel (System RAM: 264192MB)
    PM: Registered nosave memory: 000000000009c000 - 00000000000a0000
    PM: Registered nosave memory: 00000000000a0000 - 00000000000e0000
    PM: Registered nosave memory: 00000000000e0000 - 0000000000100000
    PM: Registered nosave memory: 000000007a289000 - 000000007af0b000
    PM: Registered nosave memory: 000000007af0b000 - 000000007b93b000
    PM: Registered nosave memory: 000000007b93b000 - 000000007bab4000
    PM: Registered nosave memory: 000000007bae9000 - 000000007baff000
    PM: Registered nosave memory: 000000007bb00000 - 0000000090000000
    PM: Registered nosave memory: 0000000090000000 - 00000000feda8000
    PM: Registered nosave memory: 00000000feda8000 - 00000000fedac000
    PM: Registered nosave memory: 00000000fedac000 - 00000000ff310000
    PM: Registered nosave memory: 00000000ff310000 - 0000000100000000
    Memory: 264373124k/270532608k available (5325k kernel code, 2193048k absent, 3966436k reserved, 7013k data, 1276k init)
    please try 'cgroup_disable=memory' option if you don't want memory cgroups
    Initializing cgroup subsys memory
    Freeing initrd memory: 16711k freed
    ipmi_si ipmi_si.0: Could not enable interrupts, failed set, using polled mode.
    ERST: Error Record Serialization Table (ERST) support is initialized.
    Non-volatile memory driver v1.3
    crash memory driver: version 1.1
    Freeing unused kernel memory: 1276k freed
    Freeing unused kernel memory: 800k freed
    Freeing unused kernel memory: 1588k freed
    megaraid_sas 0000:03:00.0: Controller type: MR,Memory size is: 1024MB
    ACPI Error: No handler for Region [SYSI] (ffff884053edf2b8) [IPMI] (20090903/evregion-319)
    ACPI Error: Region IPMI(7) has no handler (20090903/exfldio-295)
    ACPI Error (psparse-0537): Method parse/execution failed [_SB_.PMI0.GHL] (Node ffff8820538b41a0), AE_NOT_EXIST
    ACPI Error (psparse-0537): Method parse/execution failed [_SB_.PMI0._PMC] (Node ffff8820538b41f0), AE_NOT_EXIST
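
A broad filter such as `Memory|error|fail` mostly catches routine boot messages, and `lspci | grep ERROR` matches only because the device names contain "ERROR Registers". A narrower search for known OOM-killer, hardware-error, and hung-task messages may give clearer evidence either way. The sketch below is an assumption-laden starting point: the function name `scan_kernel_log` is hypothetical, the patterns are a partial list, and every hit still needs to be read in context (for example, EDAC driver banners at boot are informational).

```shell
# Hypothetical helper: count lines in a saved kernel log that match
# well-known OOM-killer, hardware-error, and hang messages.
scan_kernel_log() {
    log=$1
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the function usable under "set -e".
    # Global and memory-cgroup OOM killer messages.
    oom=$(grep -Eci 'out of memory|oom-killer|killed as a result of limit' "$log" || true)
    # Machine-check and EDAC (memory controller) reports.
    hw=$(grep -Eci 'machine check|mce:|edac|hardware error' "$log" || true)
    # Hung-task watchdog and CPU lockup reports.
    hang=$(grep -Eci 'blocked for more than|soft lockup|hard lockup' "$log" || true)
    echo "oom=$oom hardware=$hw hang=$hang"
}
```

A possible invocation is `scan_kernel_log /var/log/messages`, since the `dmesg` file shown above carries the timestamp of the last restart (12:57) and so probably holds only messages from that boot, not from before the freeze. A non-zero `oom` count that includes "Memory cgroup out of memory" lines would support the cgroup theory; zero OOM hits with hardware or hang hits would point elsewhere.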
