opa-fm service crashes and fails to start on Intel Omni-Path fabric

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux (RHEL) 7.3
  • Intel Corporation Omni-Path HFI Silicon 100 Series

Issue

  • The following messages are logged when the opa-fm service is started, ending with the Subnet Manager process being killed by SIGABRT:
Jul 27 15:25:57 localhost fm0_sm[11451]: [main]: Subnet Manager starting up. LogLevel: 2 LogMode: 0
Jul 27 15:25:57 localhost fm0_sm[11451]: [main]: SM: Disabling CoreDumps: CoreDumpLimit: 0
Jul 27 15:25:57 localhost fm0_sm[11451]: [main]: SM: Using dynamic packet lifetime values 14, 15, 15, 16, 16, 16, 16, 17, 17
Jul 27 15:25:57 localhost fm0_sm[11451]: [main]: SM: GidPrefix=0xfe80000000000000, Key=0x0000000000000000, MKey=0x0000000000000000 : protect_level=1 : lease=0 seconds, dbsync interval=900 seconds
Jul 27 15:25:57 localhost fm0_sm[11451]: [main]: SM: Size Limits: EndNodePorts=9216 Nodes=11904 Ports=76416 Links=55296
Jul 27 15:25:57 localhost fm0_sm[11451]: [main]: SM: Memory: Pool=262144K SA Resp=38208K
Jul 27 15:25:57 localhost fm0_sm[11451]: PROGR[main]: SM: sm_assign_base_sls: 2 active VF(s), 0 standby VF(s) requires 1 SLs and 1 SCs for operation
Jul 27 15:25:57 localhost fm0_sm[11451]: PROGR[main]: SM: : VF Bandwidth Allocations :
Jul 27 15:25:57 localhost fm0_sm[11451]: PROGR[main]: SM: [VF:Default] : Sharing 100% BW among remaining VFs
Jul 27 15:25:57 localhost fm0_sm[11451]: PROGR[main]: SM: [VF:Admin] : Sharing 100% BW among remaining VFs
Jul 27 15:25:57 localhost fm0_sm[11451]: PROGR[main]: SM: [VF:Default] : Base SL:0 Base SC:0 NumScs:1 QOS:0 HP:0
Jul 27 15:25:57 localhost fm0_sm[11451]: PROGR[main]: SM: [VF:Admin] : Base SL:0 Base SC:0 NumScs:1 QOS:0 HP:0
Jul 27 15:25:58 localhost fm0_sm[11451]: oib_utils ERROR: [11451] oib_get_portguid: No hfi names found, no port GUID to find.
Jul 27 15:25:58 localhost fm0_sm[11451]: ERROR[main]: APP: ib_init_devport: Failed to bind to device 1, port 1; status: 5
Jul 27 15:25:58 localhost fm0_sm[11451]: ; MSG:NOTICE|SM:Default SM:port 1|COND:#7 SM shutdown|DETAIL:sm_main: Failed to bind to device; terminating
Jul 27 15:25:58 localhost fm0_sm[11451]: FATAL[main]: SM: sm_main: sm_main: Failed to bind to device; terminating
Jul 27 15:25:58 localhost FATAL:[11451]: sm_main: Failed to bind to device; terminating
Jul 27 15:25:58 localhost abrt-hook-ccpp: Process 11451 (sm) of user 0 killed by SIGABRT - dumping core
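
To check whether another system shows the same failure, the log can be searched for the two messages that precede the abort. The following is a minimal sketch, not a supported tool: the function name `matching_lines` and the choice of these two substrings as the failure signature are assumptions drawn only from the excerpt above.

```python
# Substrings copied from the log excerpt above. Treating them as the
# signature of this failure is an assumption made for illustration.
SIGNATURES = (
    "oib_get_portguid: No hfi names found",
    "Failed to bind to device",
)


def matching_lines(lines):
    """Return the log lines that contain any of the signature substrings."""
    return [ln.rstrip("\n") for ln in lines if any(s in ln for s in SIGNATURES)]


if __name__ == "__main__":
    sample = [
        "Jul 27 15:25:57 localhost fm0_sm[11451]: [main]: SM: Memory: Pool=262144K SA Resp=38208K",
        "Jul 27 15:25:58 localhost fm0_sm[11451]: oib_utils ERROR: [11451] oib_get_portguid: No hfi names found, no port GUID to find.",
        "Jul 27 15:25:58 localhost fm0_sm[11451]: ERROR[main]: APP: ib_init_devport: Failed to bind to device 1, port 1; status: 5",
    ]
    for line in matching_lines(sample):
        print(line)
```

The function accepts any iterable of lines, so an open log file object can be passed in directly. Where the system log is stored depends on the logging configuration of the host.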

Resolution

  • Update all opa packages to the same version.

Root Cause

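  • A version mismatch between opa-fm and the other opa packages caused a compatibility issue. In the package list under Diagnostic Steps, opa-fm is at 10.1.0.0-145.el7 while the other opa packages are at 10.1.0.0-127.el7.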

Diagnostic Steps

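  • Check the versions of the opa packages in the installed-rpms file. In the listing below, opa-fm (marked <<<) is at release 145 while the other opa packages are at release 127: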
opa-address-resolution-10.1.0.0-127.el7.x86_64              Tue Jul 25 16:38:15 2017
opa-basic-tools-10.1.0.0-127.el7.x86_64                     Tue Jul 25 16:37:31 2017
opa-fastfabric-10.1.0.0-127.el7.x86_64                      Tue Jul 25 16:38:14 2017
opa-fm-10.1.0.0-145.el7.x86_64                              Tue Jul 25 16:37:39 2017       <<<
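
The comparison above can be automated for longer package lists. The following is a minimal sketch, assuming each line starts with a `name-version-release.arch` string as in the listing. The function name `find_mismatched`, the `opa-` prefix filter, and the rule that the most common version-release is the expected one are all assumptions made for illustration.

```python
import re
from collections import Counter

# Matches "name-version-release.arch" at the start of a line, followed by
# whitespace or end of line. Version and release may not contain hyphens.
NVRA = re.compile(
    r"^(?P<name>\S+)-(?P<version>[^-\s]+)-(?P<release>[^-\s]+)\.(?P<arch>[^.\s]+)(?:\s|$)"
)


def find_mismatched(lines, prefix="opa-"):
    """Return (majority_version_release, outliers) for packages starting with prefix.

    outliers maps package name -> version-release for every package whose
    version-release differs from the most common one.
    """
    found = {}
    for line in lines:
        m = NVRA.match(line.strip())
        if m and m.group("name").startswith(prefix):
            found[m.group("name")] = f"{m.group('version')}-{m.group('release')}"
    if not found:
        return None, {}
    majority, _count = Counter(found.values()).most_common(1)[0]
    outliers = {name: vr for name, vr in found.items() if vr != majority}
    return majority, outliers


if __name__ == "__main__":
    sample = [
        "opa-address-resolution-10.1.0.0-127.el7.x86_64              Tue Jul 25 16:38:15 2017",
        "opa-basic-tools-10.1.0.0-127.el7.x86_64                     Tue Jul 25 16:37:31 2017",
        "opa-fastfabric-10.1.0.0-127.el7.x86_64                      Tue Jul 25 16:38:14 2017",
        "opa-fm-10.1.0.0-145.el7.x86_64                              Tue Jul 25 16:37:39 2017       <<<",
    ]
    majority, outliers = find_mismatched(sample)
    print("majority:", majority)
    for name, vr in outliers.items():
        print("mismatch:", name, vr)
```

Run against the four lines shown above, the sketch reports opa-fm at 10.1.0.0-145.el7 as the only package that differs from the majority of 10.1.0.0-127.el7. The majority rule is only a heuristic: which version is correct still has to be decided by the administrator.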

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
