atomic-openshift-node constantly crashing
Environment
- OpenShift Enterprise 3.1
Issue
- We rebooted one VM. Since the reboot, the atomic-openshift-node service no longer runs.
- We are experiencing this behavior in our OSE environment in the AWS cloud.
- The crash occurs every time we run systemctl start atomic-openshift-node.
Log entries from journalctl:
Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: I0105 16:19:21.057773 3094 multitenant.go:82] Output of adding table=4,tcp,nw_dst=172.30.x.x,tp_dst=8080,priority=200,reg0=12,actions=ou
Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: I0105 16:19:21.060531 3094 multitenant.go:82] Output of adding table=4,tcp,nw_dst=172.30.x.x,tp_dst=8080,priority=200,reg0=12,actions=out
Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: I0105 16:19:21.063159 3094 multitenant.go:82] Output of adding table=4,tcp,nw_dst=172.30.x.x,tp_dst=27017,priority=200,reg0=12,actions=ou
Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: I0105 16:19:21.065915 3094 multitenant.go:82] Output of adding table=4,tcp,nw_dst=172.30.x.x,tp_dst=3306,priority=200,reg0=12,actions=ou
Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: panic: runtime error: index out of range
Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: goroutine 70 [running]:
Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: runtime.gopanic(0x2623fe0, 0xc20802a000)
Jan 05 16:19:21 ip-192-168-x-x.eu-central-1.compute.internal atomic-openshift-node[3094]: /usr/lib/golang/src/runtime/panic.go:425 +0x2a3 fp=0xc208ecd300 sp=0xc208ecd298
It looks like there is a problem with some iptables rules. Is this a known problem?
How can we delete everything from a node?
The master reports that some pods on this node are still in state Terminating:
[root@master01 ~]# oadm manage-node node03.ose.com --list-pods
Listing matched pods on node: node03.ose.com
NAME READY STATUS RESTARTS AGE
logging-fluentd-2-9aa1i 0/1 Terminating 0 7d
hawkular-cassandra-1-8nltx 0/1 Terminating 0 3h
hawkular-cassandra-1-9uc6e 0/1 Terminating 0 5h
hawkular-cassandra-1-zgess 0/1 Terminating 0 7h
hawkular-metrics-30avv 0/1 Terminating 0 5h
hawkular-metrics-wpcui 0/1 Terminating 0 4h
heapster-h1egj 0/1 Terminating 0 4h
How can we delete / stop those hanging pods?
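One way to clear pods stuck in Terminating, assuming the node they ran on is down and their containers are truly gone, is a force delete with a zero grace period. A sketch using two of the pod names from the listing above (the namespaces `logging` and `openshift-infra` are the usual homes for these components, but verify them in your environment first):

```shell
# Force-delete pods stuck in Terminating (OSE 3.1 oc syntax).
# WARNING: only do this when the node is confirmed down; otherwise the
# containers may keep running without the master knowing about them.
oc delete pod logging-fluentd-2-9aa1i --grace-period=0 -n logging
oc delete pod heapster-h1egj --grace-period=0 -n openshift-infra
```

Repeat for the remaining hawkular-cassandra and hawkular-metrics pods in the listing.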
Resolution
The following steps fixed the problem:
- Stop master01
- Stop etcd01
- Start etcd01
- Start master01
- Start node03
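The resolution does not state whether the whole VMs or only the services were restarted; one interpretation is restarting the relevant services on each host in the same order, which would look roughly like this (service names as shipped with OSE 3.1):

```shell
# On master01: stop the master service
systemctl stop atomic-openshift-master

# On etcd01: restart etcd
systemctl stop etcd
systemctl start etcd

# On master01: start the master again
systemctl start atomic-openshift-master

# On node03: start the node service
systemctl start atomic-openshift-node
```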
It looks like the master became unstable because node03 did not respond for some time during the reboot. For future reboots, always evacuate the node first.
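For reference, evacuating a node before a planned reboot uses the same `oadm manage-node` command shown earlier; a sketch with the node name from this environment:

```shell
# Mark the node unschedulable so no new pods land on it
oadm manage-node node03.ose.com --schedulable=false

# Move the pods off the node; --force would also remove pods
# not backed by a replication controller
oadm manage-node node03.ose.com --evacuate

# After the reboot, allow scheduling again
oadm manage-node node03.ose.com --schedulable=true
```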
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
