Can Red Hat Enterprise Linux monitor hardware faults through syslog?


Environment

Red Hat Enterprise Linux (RHEL) 6
Red Hat Enterprise Linux (RHEL) 7

Issue

Can Red Hat Enterprise Linux monitor hardware faults through syslog?
Is hardware fault management through syslog possible?

Resolution

The following options are available:

  1. Error Detection and Correction (EDAC) and Machine Check Exception (MCE) events can be monitored using the mcelogd daemon, which logs events to syslog when run with the --syslog option. For more information, see:

    • Error Detection and Correction (EDAC) Support available in Red Hat Enterprise Linux
    • What is mcelog and how can I install it?
    • Is it necessary to have both EDAC and MCE error reporting modules loaded in the kernel?
    • What does the message "kernel: Machine check events logged" mean?
    • man mcelog

      X86 CPUs report errors detected by the CPU as machine check events
      (MCEs). These can be data corruption detected in the CPU caches, in
      main memory by an integrated memory controller, data transfer errors on
      the front side bus or CPU interconnect or other internal errors.
      Possible causes can be cosmic radiation, instable power supplies,
      cooling problems, broken hardware, or bad luck.

      Most errors can be corrected by the CPU by internal error correction
      mechanisms. Uncorrected errors cause machine check exceptions which may
      panic the machine.

      When a corrected error happens the x86 kernel writes a record
      describing the MCE into a internal ring buffer available through the
      /dev/mcelog device. mcelog retrieves errors from /dev/mcelog, decodes
      them into a human readable format and prints them on the standard
      output or optionally into the system log.

      <snip>

      When the --syslog option is specified redirect output to system log.
      The --syslog-error option causes the normal machine checks to be logged
      as LOG_ERR (implies --syslog). Normally only fatal errors or high
      level remarks are logged with error level. High level one line
      summaries of specific errors are also logged to the syslog by default
      unless mcelog operates in --ascii mode.
      
  2. The System Event Log (SEL) can be monitored using the ipmievd daemon. Here is some information on the daemon from man ipmievd:

    ipmievd is a daemon which will listen for events from the BMC that are
    being sent to the SEL and also log those messages to syslog. It is
    able to run in one of two modes: either using the Event Message Buffer
    and asynchronous event notification from the OpenIPMI kernel driver or
    actively polling the contents of the SEL for new events. Upon receipt
    of an event via either mechanism it will be logged to syslog with the
    LOG_LOCAL4 facility.

    It is based on the ipmitool utility and shares the same IPMI interface
    support and session setup options. Please see the ipmitool manpage for
    more information on supported IPMI interfaces.
    
  3. The Intelligent Platform Management Interface (IPMI) can also be used to query hardware sensors via the ipmitool utility. This makes it possible to check these sensors manually or from a script and then trigger a log message via the logger utility. Here is sample output of ipmitool sensor showing what data is available:

    System Temp      | 52.000     | degrees C  | ok    | -9.000    | -7.000    | -5.000    | 75.000    | 77.000    | 79.000    
    CPU Temp         | 64.000     | degrees C  | ok    | -11.000   | -8.000    | -5.000    | 85.000    | 90.000    | 95.000    
    CPU FAN          | na         | RPM        | na    | na        | na        | na        | na        | na        | na        
    SYS FAN          | na         | RPM        | na    | na        | na        | na        | na        | na        | na        
    CPU Vcore        | 1.168      | Volts      | ok    | 0.640     | 0.664     | 0.688     | 1.344     | 1.408     | 1.472     
    Vnbcore          | 1.056      | Volts      | ok    | 0.808     | 0.824     | 0.840     | 1.160     | 1.176     | 1.192     
    +3.3VCC          | 3.312      | Volts      | ok    | 2.816     | 2.880     | 2.944     | 3.584     | 3.648     | 3.712     
    VDIMM            | 1.848      | Volts      | ok    | 1.448     | 1.480     | 1.512     | 1.960     | 1.992     | 2.024     
    +5 V             | 5.088      | Volts      | ok    | 4.096     | 4.320     | 4.576     | 5.344     | 5.600     | 5.632     
    +12 V            | 12.160     | Volts      | ok    | 10.368    | 10.496    | 10.752    | 12.928    | 13.056    | 13.312    
    +3.3VSB          | 3.312      | Volts      | ok    | 2.816     | 2.880     | 2.944     | 3.584     | 3.648     | 3.712     
    VBAT             | 2.864      | Volts      | ok    | 2.560     | 2.624     | 2.688     | 3.328     | 3.392     | 3.456     
    Chassis Intru    | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na        
    PS Status        | 0x1        | discrete   | 0x01ff| na        | na        | na        | na        | na        | na
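
The approach in option 3 could be scripted roughly as follows. This is a minimal sketch rather than a supported tool, and it rests on several assumptions: Python 2.7 or later is available, the first four columns of ipmitool sensor output are name, reading, unit and status (as in the sample above), any threshold sensor whose status is neither "ok" nor "na" is worth logging, and discrete sensors are skipped because their status is a bit mask. The function names, the ipmi-sensors tag and the daemon.warning priority are choices made for this example.

```python
import subprocess

# Assumption: "ok" means within thresholds and "na" means no reading available;
# any other status (for example "nc", "cr" or "nr") is treated as a fault.
HEALTHY_STATES = ("ok", "na")


def parse_sensor_output(text):
    """Parse `ipmitool sensor` output into (name, reading, unit, status) tuples."""
    sensors = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4 or not fields[0]:
            continue  # skip blank or malformed lines
        sensors.append(tuple(fields[:4]))
    return sensors


def find_faults(sensors):
    """Return the threshold sensors whose status is not considered healthy."""
    return [
        sensor for sensor in sensors
        if sensor[2] != "discrete" and sensor[3] not in HEALTHY_STATES
    ]


def log_faults(faults, tag="ipmi-sensors"):
    """Send one syslog message per faulty sensor through the logger utility."""
    for name, reading, unit, status in faults:
        message = "sensor %s status=%s reading=%s %s" % (name, status, reading, unit)
        subprocess.check_call(["logger", "-t", tag, "-p", "daemon.warning", message])


def check_sensors():
    """Query the sensors once and log every one that is out of range."""
    output = subprocess.check_output(["ipmitool", "sensor"], universal_newlines=True)
    log_faults(find_faults(parse_sensor_output(output)))
```

Calling check_sensors() periodically, for example from a cron job, would produce one syslog entry per out-of-range sensor on each run. The script needs enough privilege to run ipmitool against the local BMC.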
    
  4. Vendor-specific hardware monitoring tools are another option, where available. Please contact your hardware vendor for more information.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
