Vertica cluster hangs when running on RHEL

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux 5
  • Red Hat Enterprise Linux 6
  • Vertica cluster
  • Oracle

Issue

A 3-node cluster hangs once the load average reaches 200, approximately 7 hours after the application starts running.

  • In the 3-node Vertica cluster, 2 nodes hang; only the master node is not affected.

  • The load average of the server climbs to 200 and then the system hangs; it is no longer possible to ssh into the server.

  • Many processes are in D state (uninterruptible sleep), creating a deadlock situation.

Resolution

  • Monitoring the server for D-state processes with a command such as the following may provide an early warning. The filter also matches zombie (Z) processes.
# ps auwwx | gawk '$8 ~ /^[DZ]/'
  • Check whether these processes are all stuck on the same processor:
# ps -eo psr,stat,cmd | gawk '$2 ~ /^[DZ]/'
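
The per-processor check can be summarized as a count of D-state processes per CPU. The following is a minimal sketch, not part of the original analysis; count_dstate_by_cpu is a hypothetical helper name, and it assumes input in the two-column "PSR STAT" form printed by ps -eo psr,stat.

```shell
#!/bin/sh
# Sketch: count D-state (uninterruptible sleep) processes per CPU.
# Input (stdin): lines of "PSR STAT", as printed by `ps -eo psr,stat`.
# Output: one "cpu count" line per CPU with at least one D-state process,
# sorted by CPU number.
# count_dstate_by_cpu is a hypothetical helper name, not part of any tool.
count_dstate_by_cpu() {
    # The header line ("PSR STAT") is skipped naturally: "STAT" does not
    # start with "D".
    awk '$2 ~ /^D/ { n[$1]++ }
         END { for (cpu in n) print cpu, n[cpu] }' | sort -n
}
```

On a live system this could be fed with ps -eo psr,stat | count_dstate_by_cpu. A single CPU accumulating most of the D-state processes would be worth a closer look.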

Core dump analysis revealed the following hardware-related issue:

crash> runq|grep "CPU 18" -A4
CPU 18 RUNQUEUE: ffff880061756700
CURRENT: PID: 0 TASK: ffff8820293cf500 COMMAND: "swapper"
RT PRIO_ARRAY: ffff880061756888
[ 0] PID: 78 TASK: ffff8810290e0040 COMMAND: "watchdog/18"
[ 0] PID: 75 TASK: ffff8810290d8080 COMMAND: "migration/18"

crash> px runqueues|grep 18
[18]: ffff880061756700

crash> rq.clock ffff880061756700
clock = 36054601117214011
crash> pd ((struct task_struct *)0xffff8810290e0040)->sched_info.last_arrival
$6 = 36049749176669338

crash> p/d (36054601117214011-36049749176669338)/1000000000/60
$11 = 80

As the output above shows, the watchdog/18 thread last ran on CPU 18 about 80 minutes before the current runqueue clock value, so CPU 18 has not run for 80 minutes. This points to a hardware issue, since CPU 18 is not responding.
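
The 80-minute figure is the difference between the two timestamps from the crash session, converted from nanoseconds to minutes. The same arithmetic can be reproduced outside crash; this is a minimal sketch in which stall_minutes is a hypothetical helper name, and both values are assumed to be nanosecond timestamps, as implied by the division by 1000000000 in the session.

```shell
#!/bin/sh
# Sketch: whole minutes between a runqueue clock value and a task's
# sched_info.last_arrival value.
# Assumption: both arguments are nanosecond timestamps.
# stall_minutes is a hypothetical helper name. Requires a shell with
# 64-bit integer arithmetic.
stall_minutes() {
    rq_clock=$1
    last_arrival=$2
    # ns -> s (divide by 10^9), then s -> min (divide by 60); integer division.
    echo $(( ($rq_clock - $last_arrival) / 1000000000 / 60 ))
}

# Values taken from the crash session above.
stall_minutes 36054601117214011 36049749176669338   # prints 80
```

The raw difference is 4851940544673 ns, roughly 4852 seconds or just under 81 minutes; integer division truncates this to 80.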

  • Replacing the faulty CPU resolved the issue.

Root Cause

Faulty CPU
