21.6. Checking for Hardware Errors

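Red Hat Enterprise Linux 7 introduced the new hardware event report mechanism (HERM). This mechanism gathers system-reported memory errors, as well as errors reported by the error detection and correction (EDAC) mechanism for dual in-line memory modules (DIMMs), and reports them to user space. The user-space daemon rasdaemon catches and handles all reliability, availability, and serviceability (RAS) error events that come from the kernel tracing mechanism, and logs them. The functions previously provided by edac-utils are now replaced by rasdaemon.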

To install rasdaemon, enter the following command as root:

~]# yum install rasdaemon

Start the service as follows:

~]# systemctl start rasdaemon

To make the service run at system start, enter the following command:

~]# systemctl enable rasdaemon
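
After starting and enabling the service, it can be useful to confirm from a script that rasdaemon is actually running. The following is a minimal Python sketch, not part of the rasdaemon package: the function name is_service_active and its run parameter are hypothetical, introduced here for illustration. It relies only on the exit status of systemctl is-active --quiet, which is 0 when the unit is active.

```python
import subprocess


def is_service_active(unit="rasdaemon", run=subprocess.run):
    """Return True if systemd reports the given unit as active.

    The `run` parameter is a hypothetical hook for this sketch so that the
    command runner can be swapped out; by default it is subprocess.run.
    """
    # `systemctl is-active --quiet UNIT` prints nothing and exits with
    # status 0 only when the unit is active.
    result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0
```

Calling is_service_active() on a system where the service was started as shown above should return True; any other unit state results in False.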

The ras-mc-ctl utility provides a means to work with EDAC drivers. Enter the following command to see a list of command options:

~]$ ras-mc-ctl --help
Usage: ras-mc-ctl [OPTIONS...]
 --quiet      Quiet operation.
 --mainboard    Print mainboard vendor and model for this hardware.
 --status      Print status of EDAC drivers.
output truncated

To view a summary of memory controller events, run as root:

~]# ras-mc-ctl --summary
Memory controller events summary:
    Corrected on DIMM Label(s): 'CPU_SrcID#0_Ha#0_Chan#0_DIMM#0' location: 0:0:0:-1 errors: 1

No PCIe AER errors.

No Extlog errors.
MCE records summary:
    1 MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error errors
    2 No Error errors
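
For monitoring purposes, the per-DIMM counts in this summary can be extracted with a short script. The Python sketch below is only an illustration: the names SUMMARY_LINE and parse_dimm_summary are hypothetical, and the pattern is modeled solely on the single "Corrected on DIMM Label(s)" line in the sample output above, so other output variants may need a different pattern.

```python
import re

# Pattern modeled on the DIMM line in the sample `ras-mc-ctl --summary`
# output; this layout is an assumption based on that one example.
SUMMARY_LINE = re.compile(
    r"^\s*(?P<kind>\w+) on DIMM Label\(s\): '(?P<label>[^']*)' "
    r"location: (?P<location>\S+) errors: (?P<count>\d+)\s*$"
)


def parse_dimm_summary(text):
    """Return one dict per DIMM summary line found in the given text."""
    rows = []
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            row = match.groupdict()
            row["count"] = int(row["count"])  # error count as an integer
            rows.append(row)
    return rows


# Sample input copied from the output shown above.
SAMPLE_SUMMARY = """Memory controller events summary:
    Corrected on DIMM Label(s): 'CPU_SrcID#0_Ha#0_Chan#0_DIMM#0' location: 0:0:0:-1 errors: 1
"""
```

With the sample text, parse_dimm_summary(SAMPLE_SUMMARY) yields a single entry with the DIMM label, its location string, and an error count of 1. Lines that do not match the pattern, such as the section heading, are skipped.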

To view a list of errors reported by the memory controller, run as root:

~]# ras-mc-ctl --errors
Memory controller events:
1 3172-02-17 00:47:01 -0500 1 Corrected error(s): memory read error at CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 location: 0:0:0:-1, addr 65928, grain 7, syndrome 0 area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0

No PCIe AER errors.

No Extlog errors.

MCE events:
1 3171-11-09 06:20:21 -0500 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error, n_errors=1, mcgcap=0x01000c16, status=0x8c00004000010090, addr=0x1018893000, misc=0x15020a086, walltime=0x57e96780, cpuid=0x00050663, bank=0x00000007
2 3205-06-22 00:13:41 -0400 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x01000c16, status=0x9400000000000000, addr=0x0000abcd, walltime=0x57e967ea, cpuid=0x00050663, bank=0x00000001
3 3205-06-22 00:13:41 -0400 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x01000c16, status=0x9400000000000000, addr=0x00001234, walltime=0x57e967ea, cpu=0x00000001, cpuid=0x00050663, apicid=0x00000002, bank=0x00000002
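
Each MCE event line carries a series of name=value fields, most of them hexadecimal. The following Python sketch shows one way to turn such a line into a dictionary of integers for further processing. The names FIELD and parse_mce_fields are hypothetical, and the field layout is assumed from the sample lines above rather than from a documented format.

```python
import re

# Matches name=value pairs such as "bank=0x00000007" or "n_errors=1".
FIELD = re.compile(r"(\w+)=(0x[0-9a-fA-F]+|\d+)")


def parse_mce_fields(line):
    """Return {name: int} for every numeric name=value pair in the line."""
    fields = {}
    for name, value in FIELD.findall(line):
        # Values prefixed with 0x are hexadecimal; others are decimal.
        base = 16 if value.startswith("0x") else 10
        fields[name] = int(value, base)
    return fields


# First MCE event line copied from the output shown above.
SAMPLE_EVENT = (
    "1 3171-11-09 06:20:21 -0500 error: MEMORY CONTROLLER RD_CHANNEL0_ERR "
    "Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error, "
    "n_errors=1, mcgcap=0x01000c16, status=0x8c00004000010090, "
    "addr=0x1018893000, misc=0x15020a086, walltime=0x57e96780, "
    "cpuid=0x00050663, bank=0x00000007"
)
```

For the sample line, parse_mce_fields(SAMPLE_EVENT)["bank"] evaluates to 7 and the n_errors field to 1. Text without an equals sign, such as the timestamp and the error description, is ignored.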

These commands are also described in the ras-mc-ctl(8) manual page.