Can Red Hat Enterprise Linux monitor hardware faults through syslog in Linux?
Environment
Red Hat Enterprise Linux (RHEL) 6
Red Hat Enterprise Linux (RHEL) 7
Issue
Can Red Hat Enterprise Linux monitor hardware faults through syslog?
Is hardware fault management through syslog possible?
Resolution
The following options are available:
- Error Detection and Correction (EDAC) and Machine Check Exception (MCE) events can be monitored using the mcelogd daemon, which logs events to syslog when run with the --syslog option. For more information, please see:
  - Error Detection and Correction (EDAC) Support available in Red Hat Enterprise Linux
  - What is mcelog and how can I install it?
  - Is it necessary to have both EDAC and MCE error reporting modules loaded in the kernel?
  - What does the message "kernel: Machine check events logged" mean?
  - man mcelog

    X86 CPUs report errors detected by the CPU as machine check events (MCEs). These can be data corruption detected in the CPU caches, in main memory by an integrated memory controller, data transfer errors on the front side bus or CPU interconnect or other internal errors. Possible causes can be cosmic radiation, instable power supplies, cooling problems, broken hardware, or bad luck.

    Most errors can be corrected by the CPU by internal error correction mechanisms. Uncorrected errors cause machine check exceptions which may panic the machine.

    When a corrected error happens the x86 kernel writes a record describing the MCE into a internal ring buffer available through the /dev/mcelog device. mcelog retrieves errors from /dev/mcelog, decodes them into a human readable format and prints them on the standard output or optionally into the system log.

    <snip>

    When the --syslog option is specified redirect output to system log. The --syslog-error option causes the normal machine checks to be logged as LOG_ERR (implies --syslog). Normally only fatal errors or high level remarks are logged with error level. High level one line summaries of specific errors are also logged to the syslog by default unless mcelog operates in --ascii mode.
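Once machine check events reach the system log, they can be picked out with a simple scan. The following is a minimal sketch, not part of mcelog: it assumes a Python 3 interpreter is available and that the log is readable at /var/log/messages (adjust the path for your syslog configuration). The function name find_mce_lines and the MARKERS list are illustrative; the "mcelog:" marker in particular is an assumption about how mcelog tags its syslog lines and should be checked against your own logs.

```python
import sys

# Substrings treated as signs of a machine check event. The first is the
# kernel message quoted in this article; the second ASSUMES mcelog tags its
# syslog lines with its own name (verify against your logs).
MARKERS = ("Machine check events logged", "mcelog:")


def find_mce_lines(lines):
    """Return the lines that contain any of the MARKERS substrings."""
    return [line.rstrip("\n") for line in lines
            if any(marker in line for marker in MARKERS)]


if __name__ == "__main__":
    # Assumed log location; pass a different path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    with open(path, errors="replace") as log:
        for hit in find_mce_lines(log):
            print(hit)
```

Run as root (or as a user allowed to read the log), the script prints every matching line, which can then be fed to whatever alerting mechanism is already in place.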
- System Event Log (SEL) events can be monitored using the ipmievd daemon. The following description of the daemon is from man ipmievd:

  ipmievd is a daemon which will listen for events from the BMC that are being sent to the SEL and also log those messages to syslog. It is able to run in one of two modes: either using the Event Message Buffer and asynchronous event notification from the OpenIPMI kernel driver or actively polling the contents of the SEL for new events. Upon receipt of an event via either mechanism it will be logged to syslog with the LOG_LOCAL4 facility.

  It is based on the ipmitool utility and shares the same IPMI interface support and session setup options. Please see the ipmitool manpage for more information on supported IPMI interfaces.
- Intelligent Platform Management Interface (IPMI) can also be used to query hardware sensors via the ipmitool utility. This makes it possible to monitor these sensors manually and then trigger a log message via the logger utility. The following output of ipmitool sensor shows what data is available:

  System Temp   | 52.000 | degrees C | ok     | -9.000  | -7.000 | -5.000 | 75.000 | 77.000 | 79.000
  CPU Temp      | 64.000 | degrees C | ok     | -11.000 | -8.000 | -5.000 | 85.000 | 90.000 | 95.000
  CPU FAN       | na     | RPM       | na     | na      | na     | na     | na     | na     | na
  SYS FAN       | na     | RPM       | na     | na      | na     | na     | na     | na     | na
  CPU Vcore     | 1.168  | Volts     | ok     | 0.640   | 0.664  | 0.688  | 1.344  | 1.408  | 1.472
  Vnbcore       | 1.056  | Volts     | ok     | 0.808   | 0.824  | 0.840  | 1.160  | 1.176  | 1.192
  +3.3VCC       | 3.312  | Volts     | ok     | 2.816   | 2.880  | 2.944  | 3.584  | 3.648  | 3.712
  VDIMM         | 1.848  | Volts     | ok     | 1.448   | 1.480  | 1.512  | 1.960  | 1.992  | 2.024
  +5 V          | 5.088  | Volts     | ok     | 4.096   | 4.320  | 4.576  | 5.344  | 5.600  | 5.632
  +12 V         | 12.160 | Volts     | ok     | 10.368  | 10.496 | 10.752 | 12.928 | 13.056 | 13.312
  +3.3VSB       | 3.312  | Volts     | ok     | 2.816   | 2.880  | 2.944  | 3.584  | 3.648  | 3.712
  VBAT          | 2.864  | Volts     | ok     | 2.560   | 2.624  | 2.688  | 3.328  | 3.392  | 3.456
  Chassis Intru | 0x0    | discrete  | 0x0000 | na      | na     | na     | na     | na     | na
  PS Status     | 0x1    | discrete  | 0x01ff | na      | na     | na     | na     | na     | na
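The manual sensor check described above could be automated along the following lines. This is a minimal sketch under stated assumptions, not a supported tool: it assumes a Python 3 interpreter, that ipmitool is installed and can reach the BMC, and that any analog sensor whose status column is something other than "ok" deserves a log entry. It uses Python's standard syslog module in place of the logger utility. The function names parse_sensors and find_faults are illustrative.

```python
import subprocess
import syslog


def parse_sensors(text):
    """Parse `ipmitool sensor` output into (name, reading, unit, status) tuples."""
    sensors = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue  # skip blank or unexpected lines
        sensors.append(tuple(fields[:4]))
    return sensors


def find_faults(sensors):
    """Return analog sensors that have a reading and a status other than "ok".

    ASSUMPTION: a non-"ok" status on a sensor with a reading indicates a
    threshold violation. Discrete sensors and sensors with no reading ("na")
    are skipped because their status column is not a simple ok/not-ok flag.
    """
    faults = []
    for name, reading, unit, status in sensors:
        if unit == "discrete" or reading == "na":
            continue
        if status != "ok":
            faults.append((name, reading, unit, status))
    return faults


if __name__ == "__main__":
    # Querying the BMC normally requires root privileges.
    output = subprocess.check_output(["ipmitool", "sensor"],
                                     universal_newlines=True)
    for name, reading, unit, status in find_faults(parse_sensors(output)):
        syslog.syslog(syslog.LOG_WARNING,
                      "IPMI sensor %s reads %s %s (status %s)"
                      % (name, reading, unit, status))
```

Scheduled from cron at a suitable interval, a script like this writes one warning per out-of-range sensor to syslog on each run. With the sample output above, no sensor would be reported, since every sensor with a reading shows a status of "ok".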
- Hardware vendor-specific monitoring is another option, if it is available for your hardware. Please contact your hardware vendor for more information.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.