Some systems with large numbers of logical CPUs hang intermittently during boot.


Environment

  • Red Hat Enterprise Linux 5.7 and older
  • The issue has been reported on the following Westmere-EX (WSM-EX) based processors:

    E7-8867L – WSM-EX 2.13GHz 10C 105W
    E7-2870 – WSM-EX 2.40GHz 10C 130W
    E7-2860 – WSM-EX 2.26GHz 10C 130W
    E7-2850 – WSM-EX 2.00GHz 10C 130W
    E7-4870 – WSM-EX 2.40GHz 10C 130W

Issue

  • Some systems with a large number of CPUs (either physical, or logical via Hyper-Threading) hang intermittently during boot. Dell has reported this problem on some of its Westmere-EX based PowerEdge systems. However, the issue may not be restricted to Dell systems and could affect other vendors' hardware.

Resolution

  • The fix is to enforce a write memory barrier across the CPUs and to disable interrupts on the CPU that issues the IPI before the IPI is issued. The IPI handler then disables interrupts on each of the other CPUs.
  • This fix will be available in a future RHEL 5 kernel. Please contact your support representative for additional details.
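
The ordering described above can be sketched as a small userspace model. This is not the kernel patch: struct model_cpu, model_send_ipis(), model_ipi_handler() and model_pending_count() are hypothetical names, a C11 release fence stands in for the kernel write memory barrier, and a boolean flag stands in for the interrupt state. The relative order of the interrupt disable and the barrier is also an assumption; the text above only says that both happen before the IPI is issued.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model only: none of these names exist in the kernel. */
struct model_cpu {
    bool irqs_enabled;  /* modeled interrupt-enable flag */
    bool ipi_pending;   /* set once an IPI has been "sent" to this CPU */
};

/* Modeled count of CPUs that have not yet run the IPI handler. */
static atomic_int model_pending;

/* Sender: disable local interrupts and publish state before any IPI goes out. */
void model_send_ipis(struct model_cpu *cpus, int ncpus, int self)
{
    assert(self >= 0 && self < ncpus);

    cpus[self].irqs_enabled = false;            /* stand-in for disabling local interrupts */
    atomic_store_explicit(&model_pending, ncpus - 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);  /* stand-in for the write memory barrier */

    for (int i = 0; i < ncpus; i++) {
        if (i != self)
            cpus[i].ipi_pending = true;         /* stand-in for issuing the IPI */
    }
}

/* Receiver: the handler disables interrupts on its own CPU, then checks in. */
void model_ipi_handler(struct model_cpu *cpu)
{
    cpu->irqs_enabled = false;
    cpu->ipi_pending = false;
    atomic_fetch_sub_explicit(&model_pending, 1, memory_order_acq_rel);
}

/* Number of CPUs the sender is still waiting for. */
int model_pending_count(void)
{
    return atomic_load_explicit(&model_pending, memory_order_acquire);
}
```

In this model the sender can no longer be interrupted once it has started sending IPIs, so it cannot end up spinning on a lock held by a CPU that is itself parked in the IPI handler.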

Root Cause

  • When set_mtrr() is sending an IPI to each of the other 79 CPUs, an interrupt occurs on the CPU that is running set_mtrr().

    The handler for this interrupt gets stuck in __rcu_process_callbacks(), waiting for the rcp->lock spinlock. That lock is held by one of the other CPUs, which was interrupted by the IPI right after it acquired the lock.

    This is a deadlock. The sequence of events is:

    CPU B acquires the rcp->lock spinlock.
    CPU A sends the IPI to all the other CPUs without disabling interrupts for itself.
    CPU B is interrupted by the IPI while it still holds the spinlock. The IPI handler waits for the gate count (a critical variable) to change before it can exit.
    CPU A is interrupted by another interrupt, whose handler needs the spinlock that CPU B acquired before it was interrupted by the IPI.

    CPU B is waiting for CPU A to change the gate count, and CPU A is waiting for CPU B to release the spinlock.
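
The circular wait above can be illustrated with a toy wait-for graph. This is a userspace sketch, not kernel code: has_wait_cycle() and set_mtrr_scenario_deadlocks() are hypothetical helpers, CPU A and CPU B are modeled as indices 0 and 1, and the only thing the model captures is which CPU is blocked on which.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_CPU (-1)

/* waits_on[i] is the index of the CPU that CPU i is blocked on, or NO_CPU. */
bool has_wait_cycle(const int *waits_on, size_t ncpus)
{
    for (size_t start = 0; start < ncpus; start++) {
        int cur = waits_on[start];

        /* Follow at most ncpus edges; arriving back at start means a cycle. */
        for (size_t hops = 0; hops < ncpus && cur != NO_CPU; hops++) {
            assert(cur >= 0 && (size_t)cur < ncpus);
            if ((size_t)cur == start)
                return true;
            cur = waits_on[cur];
        }
    }
    return false;
}

/* Build the wait-for edges for the scenario described in the text. */
bool set_mtrr_scenario_deadlocks(bool sender_irqs_disabled)
{
    enum { CPU_A = 0, CPU_B = 1 };
    int waits_on[2] = { NO_CPU, NO_CPU };

    /*
     * CPU B took rcp->lock and was then interrupted by the IPI. Its IPI
     * handler spins until CPU A changes the gate count.
     */
    waits_on[CPU_B] = CPU_A;

    /*
     * Only if CPU A left its interrupts enabled can it be interrupted and
     * end up spinning on rcp->lock, which CPU B still holds.
     */
    if (!sender_irqs_disabled)
        waits_on[CPU_A] = CPU_B;

    return has_wait_cycle(waits_on, 2);
}
```

Calling set_mtrr_scenario_deadlocks(false) models the original behavior and reports a cycle. Passing true models a sender that has disabled its own interrupts, which removes the edge from CPU A to CPU B and breaks the cycle.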
    

