atomic-openshift-node constantly crashing

Solution In Progress - Updated

Environment

  • OpenShift Enterprise 3.1

Issue

  • We rebooted one VM. Since the reboot, the atomic-openshift-node service no longer starts.

  • We are experiencing this behavior in our OSE environment in the AWS cloud.

This occurs every time we run systemctl start atomic-openshift-node.

Log entries from journalctl:

Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: I0105 16:19:21.057773    3094 multitenant.go:82] Output of adding table=4,tcp,nw_dst=172.30.x.x,tp_dst=8080,priority=200,reg0=12,actions=ou
Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: I0105 16:19:21.060531    3094 multitenant.go:82] Output of adding table=4,tcp,nw_dst=172.30.x.x,tp_dst=8080,priority=200,reg0=12,actions=out
Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: I0105 16:19:21.063159    3094 multitenant.go:82] Output of adding table=4,tcp,nw_dst=172.30.x.x,tp_dst=27017,priority=200,reg0=12,actions=ou
Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: I0105 16:19:21.065915    3094 multitenant.go:82] Output of adding table=4,tcp,nw_dst=172.30.x.x,tp_dst=3306,priority=200,reg0=12,actions=ou
Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: panic: runtime error: index out of range
Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: goroutine 70 [running]:
Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: runtime.gopanic(0x2623fe0, 0xc20802a000)
Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: /usr/lib/golang/src/runtime/panic.go:425 +0x2a3 fp=0xc208ecd300 sp=0xc208ecd298

It looks like there is a problem with some iptables rules. Is this a known problem?

How can we delete everything from a node?

The master says that on this node are still some pods with state terminating:
[root@master01 ~]# oadm manage-node node03.ose.com --list-pods

Listing matched pods on node: node03.ose.com

NAME                         READY     STATUS        RESTARTS   AGE
logging-fluentd-2-9aa1i      0/1       Terminating   0          7d
hawkular-cassandra-1-8nltx   0/1       Terminating   0          3h
hawkular-cassandra-1-9uc6e   0/1       Terminating   0          5h
hawkular-cassandra-1-zgess   0/1       Terminating   0          7h
hawkular-metrics-30avv       0/1       Terminating   0          5h
hawkular-metrics-wpcui       0/1       Terminating   0          4h
heapster-h1egj               0/1       Terminating   0          4h

How can we delete or stop those hanging pods?
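Pods stuck in Terminating can usually be force-deleted from the master once the node is reachable again. A sketch using pod names from the listing above; the -n namespace arguments (logging, openshift-infra) are assumptions about where these pods live and must be adjusted:

```shell
# Force-delete a pod stuck in Terminating (grace period of 0 skips the
# normal graceful shutdown, which cannot complete while the node is down).
oc delete pod logging-fluentd-2-9aa1i -n logging --grace-period=0
oc delete pod hawkular-cassandra-1-8nltx -n openshift-infra --grace-period=0
```

Force deletion only removes the API object; if the node is still down, any containers left behind on it must be cleaned up when it comes back.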

Resolution

The following steps fixed the problem:
- Stop master01
- Stop etcd01
- Start etcd01
- Start master01
- Start node03
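The restart sequence above can be sketched as follows, assuming "stop"/"start" refers to the systemd services rather than rebooting the hosts, and that etcd and the master run on the separate hosts named in this article:

```shell
# Restart etcd and the master in order, then bring the node back.
ssh master01 'systemctl stop atomic-openshift-master'
ssh etcd01   'systemctl stop etcd'
ssh etcd01   'systemctl start etcd'
ssh master01 'systemctl start atomic-openshift-master'
ssh node03   'systemctl start atomic-openshift-node'
```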

It looks like the master became unstable because node03 did not respond for some time during the reboot. For future reboots, always evacuate the node first.
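Evacuating a node before a planned reboot can be done with oadm manage-node (a sketch using the OSE 3.1 syntax and the node name from this article):

```shell
# Stop scheduling new pods onto the node, then move existing pods off it.
oadm manage-node node03.ose.com --schedulable=false
oadm manage-node node03.ose.com --evacuate

# ... reboot the node ...

# Allow scheduling on the node again once it is back.
oadm manage-node node03.ose.com --schedulable=true
```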

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
