DHCP Master Node Recovery


I am working in a lab that uses IPI to deploy various lab and POC spaces. A few times now I have forgotten, post deployment, to convert the masters' DHCP leases from short-term to long-term static leases... or I have run an upgrade POC, which swaps out the master node IPs, and again failed to pin the new master IPs as long-term leases.

As such, the cluster boots up with 3 x masters and 3 x workers, but no VIPs bind for the API or *.apps.
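The quickest confirmation is to look for the VIP directly on the masters (the address below is a placeholder for your actual api VIP):

# On each master: is keepalived holding the API VIP on any interface?
# 172.16.100.5 is a placeholder; substitute your cluster's api VIP
ip -4 addr show | grep -F "172.16.100.5" || echo "API VIP not bound on this node"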

Shell access to the master nodes shows no service listening on 6443 (netstat | grep 6443), so the API service never starts.
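If netstat isn't available on the node, ss from iproute answers the same question:

# On a master node: confirm nothing is listening on the API port
ss -tln | grep 6443 || echo "nothing listening on 6443"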

I then noted that the "kube-apiserver" container keeps restarting, so the cluster never re-establishes itself.

#

[root@os01-6mn8m-master-0 ~]# crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
e42105258820d 6a22e7a5f22d51be5714fcf5f7b8a10f16770f888c70970b35aba26258c1a653 10 seconds ago Running kube-apiserver-check-endpoints 625 6b73a0f56986c kube-apiserver-os01-6mn8m-master-0
c1ca5cf198b77 b485a5c42c63481a6502863144a40fd8b298f72308de60e5d6d5d0d774674407 10 seconds ago Running kube-apiserver 507 6b73a0f56986c kube-apiserver-os01-6mn8m-master-0
98852c0b63595 f936c10ab7a7621cc8dd226a2d58b7e80831e8afd0be500565ef5bb7a846820f About a minute ago Running keepalived 717 af6d4480b460b keepalived-os01-6mn8m-master-0
bcdec1c8811e8 7e67d27148585ca0ed0b0dec0040e4a44d67edd68a5764c1db5feff5b21516e5 3 minutes ago Running keepalived-monitor 4 af6d4480b460b keepalived-os01-6mn8m-master-0
b80016a9bef99 6a22e7a5f22d51be5714fcf5f7b8a10f16770f888c70970b35aba26258c1a653 3 minutes ago Running kube-apiserver-insecure-readyz 4 6b73a0f56986c kube-apiserver-os01-6mn8m-master-0
a1269075fbb31 3d60197d715417c170333ff191920f1eb696e05e7f066c47ad3290d634ecd488 3 minutes ago Running kube-scheduler-recovery-controller 4 5a11b12d367f4 openshift-kube-scheduler-os01-6mn8m-master-0
3084a430e744d 6a22e7a5f22d51be5714fcf5f7b8a10f16770f888c70970b35aba26258c1a653 3 minutes ago Running kube-apiserver-cert-regeneration-controller 4 6b73a0f56986c kube-apiserver-os01-6mn8m-master-0
cb8b9a946dfc2 6a22e7a5f22d51be5714fcf5f7b8a10f16770f888c70970b35aba26258c1a653 3 minutes ago Running kube-apiserver-cert-syncer 315 6b73a0f56986c kube-apiserver-os01-6mn8m-master-0
0084f400e5a26 3d60197d715417c170333ff191920f1eb696e05e7f066c47ad3290d634ecd488 3 minutes ago Running kube-scheduler-cert-syncer 315 5a11b12d367f4 openshift-kube-scheduler-os01-6mn8m-master-0
6c4cd7b278a0c 7e67d27148585ca0ed0b0dec0040e4a44d67edd68a5764c1db5feff5b21516e5 3 minutes ago Running coredns-monitor 4 44f392c5948b7 coredns-os01-6mn8m-master-0
2e1e87b097387 358222c662f024140d083d8ac1193eacb6d9e69ce51a186095a4f93d9167645e 3 minutes ago Running kube-controller-manager-recovery-controller 4 b224ce9e4ab70 kube-controller-manager-os01-6mn8m-master-0
9ecc1c29cd3d4 7cf0c5d0ec021f13854727d5bbc10d545248b8e8bd7bc0975671b9025f44b361 3 minutes ago Running coredns 5 44f392c5948b7 coredns-os01-6mn8m-master-0
88a2a0d53c0ab b485a5c42c63481a6502863144a40fd8b298f72308de60e5d6d5d0d774674407 3 minutes ago Running kube-scheduler 5 5a11b12d367f4 openshift-kube-scheduler-os01-6mn8m-master-0
db0097dc3d74c 358222c662f024140d083d8ac1193eacb6d9e69ce51a186095a4f93d9167645e 3 minutes ago Running kube-controller-manager-cert-syncer 314 b224ce9e4ab70 kube-controller-manager-os01-6mn8m-master-0
4d6e5ddf32ca2 b485a5c42c63481a6502863144a40fd8b298f72308de60e5d6d5d0d774674407 3 minutes ago Running kube-controller-manager 19 b224ce9e4ab70 kube-controller-manager-os01-6mn8m-master-0
[root@os01-6mn8m-master-0 ~]# crictl logs -f c1ca5cf198b77
flock: getting lock took 0.000007 seconds
Copying system trust bundle ...
I0508 15:56:35.850522 1 loader.go:374] Config loaded from file: /etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig
Copying termination logs to "/var/log/kube-apiserver/termination.log"
I0508 15:56:35.854707 1 main.go:161] Touching termination lock file "/var/log/kube-apiserver/.terminating"

}. Err: connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused"
W0508 15:56:39.410905 15 logging.go:59] [core] [Channel #1 SubChannel #4] grpc: addrConn.createTransport failed to connect to {
  "Addr": "172.16.100.163:2379",
  "ServerName": "172.16.100.163",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null

#
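So kube-apiserver dies because it cannot reach etcd on 2379, over either localhost or the node addresses. Confirming etcd itself is down is straightforward from the node:

# etcd should be listening on 2379 on every master; here it is not
ss -tln | grep 2379 || echo "etcd is not listening on 2379"
crictl ps --name etcd   # likewise, no Running etcd container shows up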

New master node IPs:
172.16.100.163
172.16.100.169
172.16.100.159

Old master node IPs (my best guess, scraped from logs like the above):
172.16.100.163
172.16.100.150
172.16.100.143
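A blunt way to see where the old addresses are still baked into a master's on-disk configuration (paths vary by OpenShift version, so treat this as a starting point, not a procedure):

# On a master node: list files under /etc/kubernetes that still
# reference the old master IPs
grep -Frl -e "172.16.100.150" -e "172.16.100.143" /etc/kubernetes/ 2>/dev/null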

I have to imagine I am by no means the first person who has had to remediate a cluster like this. Is there a guide for doing so? Other Kubernetes distributions do not seem to have this issue, so I am also wondering whether there is a post-deployment step for IPI that could lean more on DDNS for the "machine network", so clusters could avoid this entirely; otherwise, after every upgrade, the operations team has to remediate the long-term DHCP leases and clean out the old ones.
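The sort of thing I have in mind is each master (re)registering itself in DNS at boot; the names, key path, and DNS server below are all hypothetical:

# Hypothetical DDNS registration for one master via BIND's nsupdate
nsupdate -k /etc/rndc.key <<'EOF'
server 172.16.100.1
zone os01.example.lab
update delete master-0.os01.example.lab. A
update add master-0.os01.example.lab. 300 A 172.16.100.163
send
EOF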

Questions:
1) How can I fix the cluster (versus redeploying)?
2) How can I remediate an IPI build so as to avoid this "pet" maintenance cycle on more production-class clusters?
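
P.S. For completeness, the lease-pinning step I keep forgetting amounts to something like this in ISC dhcpd (the MAC addresses are placeholders for the masters' real interface MACs):

# /etc/dhcp/dhcpd.conf -- fixed reservations for the master nodes
host os01-master-0 { hardware ethernet 52:54:00:aa:bb:01; fixed-address 172.16.100.163; }
host os01-master-1 { hardware ethernet 52:54:00:aa:bb:02; fixed-address 172.16.100.169; }
host os01-master-2 { hardware ethernet 52:54:00:aa:bb:03; fixed-address 172.16.100.159; }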

Responses