RHEL 7.9: The AER recovery process is triggered unexpectedly by PCIe correctable errors and the network goes down

Solution Verified - Updated -

Issue

This is a regression: the issue does not occur on RHEL 7.8. After updating to RHEL 7.9, the AER recovery process is triggered unexpectedly by PCIe correctable errors, and the network goes down.

RHEL 7.9 /var/log/messages:

kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
kernel: {1}[Hardware Error]: It has been corrected by h/w and requires no further action
kernel: {1}[Hardware Error]: event severity: corrected <--- PCIe correctable errors
kernel: {1}[Hardware Error]:  Error 0, type: corrected
kernel: {1}[Hardware Error]:   section_type: PCIe error
kernel: {1}[Hardware Error]:   port_type: 4, root port
kernel: {1}[Hardware Error]:   version: 1.16
kernel: {1}[Hardware Error]:   command: 0x0010, status: 0x0546
kernel: {1}[Hardware Error]:   device_id: 0000:03:00.1
kernel: {1}[Hardware Error]:   slot: 0
kernel: {1}[Hardware Error]:   secondary_bus: 0x00
kernel: {1}[Hardware Error]:   vendor_id: 0x10df, device_id: 0x0720
kernel: {1}[Hardware Error]:   class_code: 000200
kernel: be2net 0000:03:00.1: aer_status: 0x00000001, aer_mask: 0x000031c0
kernel: Receiver Error
kernel: be2net 0000:03:00.1: aer_layer=Physical Layer, aer_agent=Receiver ID
kernel: be2net 0000:03:00.0: EEH error detected <--- The AER recovery process
kernel: be2net 0000:03:00.0: eno1: Link down
kernel: be2net 0000:03:00.0: did not receive flush compl
kernel: be2net 0000:03:00.0: did not receive flush compl
kernel: be2net 0000:03:00.0: did not receive flush compl
kernel: be2net 0000:03:00.0: did not receive flush compl
kernel: be2net 0000:03:00.1: EEH error detected
kernel: be2net 0000:03:00.1: eno2: Link down
kernel: be2net 0000:03:00.1: did not receive flush compl
kernel: be2net 0000:03:00.1: did not receive flush compl
kernel: be2net 0000:03:00.1: did not receive flush compl
kernel: be2net 0000:03:00.1: did not receive flush compl
kernel: be2net 0000:03:00.0: EEH reset
kernel: be2net 0000:03:00.0: Waiting for FW to be ready after EEH reset
kernel: be2net 0000:03:00.1: EEH reset
kernel: be2net 0000:03:00.1: Waiting for FW to be ready after EEH reset
kernel: be2net 0000:03:00.0: EEH resume
kernel: be2net 0000:03:00.0: FW config: function_mode=0x2010802, function_caps=0x7
kernel: be2net 0000:03:00.0: Using profile 0x10
kernel: be2net 0000:03:00.0: Max: txqs 70, rxqs 64, rss 64, eqs 32, vfs 0
kernel: be2net 0000:03:00.0: Max: uc-macs 126, mc-macs 64, vlans 64
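
The aer_status and aer_mask values in the excerpt above can be decoded by hand. The following is a minimal sketch, assuming the standard bit layout of the PCIe AER Correctable Error Status and Mask registers; the names CORRECTABLE_BITS, decode and unmasked are illustrative and do not come from the kernel or this article.

```python
# Bit positions of the PCIe AER Correctable Error Status/Mask registers
# (assumed standard layout; the table name is hypothetical).
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode(value):
    """Return the names of all correctable-error bits set in value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def unmasked(status, mask):
    """Return the errors that are reported in status and not masked."""
    return decode(status & ~mask)


if __name__ == "__main__":
    # Values taken from the RHEL 7.9 log line for 0000:03:00.1.
    print("status:  ", decode(0x00000001))
    print("mask:    ", decode(0x000031C0))
    print("unmasked:", unmasked(0x00000001, 0x000031C0))
```

Under that assumption, aer_status 0x00000001 is the Receiver Error bit, which agrees with the "Receiver Error" line the kernel prints, and aer_mask 0x000031c0 does not mask that bit.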

RHEL 7.8 /var/log/messages:

kernel: {7010}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
kernel: {7010}[Hardware Error]: It has been corrected by h/w and requires no further action
kernel: {7010}[Hardware Error]: event severity: corrected <--- PCIe correctable errors
kernel: {7010}[Hardware Error]:  Error 0, type: corrected
kernel: {7010}[Hardware Error]:   section_type: PCIe error
kernel: {7010}[Hardware Error]:   port_type: 4, root port
kernel: {7010}[Hardware Error]:   version: 1.16
kernel: {7010}[Hardware Error]:   command: 0x0010, status: 0x0546
kernel: {7010}[Hardware Error]:   device_id: 0000:03:00.1
kernel: {7010}[Hardware Error]:   slot: 0
kernel: {7010}[Hardware Error]:   secondary_bus: 0x00
kernel: {7010}[Hardware Error]:   vendor_id: 0x10df, device_id: 0x0720
kernel: {7010}[Hardware Error]:   class_code: 000200
systemd: Created slice User Slice of user1.

In RHEL 7.8, the same correctable error is logged but the AER recovery process is not triggered.
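
To check whether a system shows the same pattern as the RHEL 7.9 excerpt, the log can be scanned for a corrected hardware error record that is followed shortly by a driver recovery message. This is a rough sketch, not a diagnostic tool: the function name, the 50-line window, and the reliance on the "EEH error detected" string (as printed by be2net in the excerpt) are assumptions, and other drivers may log recovery differently.

```python
import re

# Patterns modelled on the log excerpts in this article (assumptions).
CORRECTED = re.compile(r"\[Hardware Error\]: event severity: corrected")
RECOVERY = re.compile(
    r"kernel: (\S+) "
    r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): EEH error detected"
)


def recovery_after_corrected(lines, window=50):
    """Return (driver, device) pairs whose recovery message appears
    within `window` lines after a corrected hardware error record."""
    hits = []
    last_corrected = None
    for i, line in enumerate(lines):
        if CORRECTED.search(line):
            last_corrected = i
            continue
        m = RECOVERY.search(line)
        if m and last_corrected is not None and i - last_corrected <= window:
            hits.append((m.group(1), m.group(2)))
    return hits


if __name__ == "__main__":
    import sys
    # Usage: python scan.py /var/log/messages
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as fh:
            for driver, device in recovery_after_corrected(fh):
                print(driver, device)
```

Run against the RHEL 7.9 excerpt this reports both be2net functions; run against the RHEL 7.8 excerpt it reports nothing.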

Environment

  • Red Hat Enterprise Linux (RHEL) 7.9
  • kernels
    • first affected kernel: 3.10.0-1160.el7
    • last affected kernel: 3.10.0-1160.36.1.el7
  • AER recovery process
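
The kernel range listed above can be compared against a running kernel. The sketch below assumes standard RHEL 7 release strings of the form 3.10.0-<build>.el7[.<arch>]; the helper name is_affected is hypothetical, and release strings that do not match that form (for example real-time kernel builds) are reported as not affected rather than guessed at.

```python
import platform
import re

# Range taken from the Environment list in this article.
FIRST_AFFECTED = (1160,)        # 3.10.0-1160.el7
LAST_AFFECTED = (1160, 36, 1)   # 3.10.0-1160.36.1.el7

_RELEASE = re.compile(r"^3\.10\.0-(\d+(?:\.\d+)*)\.el7")


def is_affected(release):
    """Return True if a RHEL 7 kernel release string falls within the
    affected range; unrecognised strings return False."""
    m = _RELEASE.match(release)
    if not m:
        return False
    build = tuple(int(part) for part in m.group(1).split("."))
    return FIRST_AFFECTED <= build <= LAST_AFFECTED


if __name__ == "__main__":
    release = platform.release()  # same string as `uname -r`
    print(release, "affected" if is_affected(release) else "not in listed range")
```

Builds are compared as integer tuples, so 1160.9.1 sorts before 1160.36.1, which a plain string comparison would get wrong.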
