OCP 4 - egress behaviour when a node reboots (e.g. during a cluster update)
Hey folks,
our new OCP cluster has been in production for a couple of weeks, and we have now performed our first update (4.7.9 -> 4.7.21). Unfortunately, we had a short service outage while the worker nodes were rebooting.
The failed service (two replicas on different nodes) needs a constant JDBC connection to an external database, so we have already implemented a corresponding liveness check, which successfully restarts the pod.
The problem is that this connection gets interrupted, and there is still a time period between the first failed connection and the pod restart initiated by the liveness check.
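To put a rough number on that gap, here is a minimal sketch. It assumes the usual Kubernetes probe semantics (the container is restarted after failureThreshold consecutive failed probes, one probe every periodSeconds, each probe allowed up to timeoutSeconds). The function name and the example values are hypothetical and not our real settings; the estimate ignores the termination grace period and the container start time.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rough worst-case number of seconds between the first
# failed connection and the kubelet deciding to restart the container.
# Arguments: periodSeconds failureThreshold timeoutSeconds
liveness_detection_window() {
  local period="$1" threshold="$2" timeout="$3"
  # Worst case: the connection drops right after a successful probe, so
  # "threshold" full periods pass, plus the timeout of the last failing probe.
  echo $(( period * threshold + timeout ))
}

# Example with assumed values (periodSeconds=10, failureThreshold=3, timeoutSeconds=1)
liveness_detection_window 10 3 1
```

With these assumed values the service could be without a working connection for about half a minute before the restart even begins.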
We use project-based egressIPs, annotated to our four worker nodes, for outgoing traffic ( oc patch netnamespace mynamespace --type=merge -p '{"egressIPs": ["A.B.C.D"]}' )
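For anyone who wants to look at the same objects, this is a sketch of how the egress state could be inspected during a reboot. It assumes the OpenShift SDN network plugin (NetNamespace and HostSubnet resources); show_egress_state is a hypothetical helper name, not an oc subcommand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the egress-related state for one project.
# Assumes OpenShift SDN and a logged-in oc client.
show_egress_state() {
  local ns="$1"
  # Egress IPs requested for the project
  oc get netnamespace "$ns" -o jsonpath='{.egressIPs}{"\n"}'
  # Per-node view; the EGRESS IPS column shows which node hosts which IP
  oc get hostsubnet
  # Current status of the "network" cluster operator
  oc get clusteroperator network
}

# Usage (run repeatedly while a worker node is rebooting):
#   show_egress_state mynamespace
```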
My assumptions:
1. An egressIP can only exist on one node
2. After or while the egressIP floats from one node to another, the clusteroperator "network" is processing.
3. There is a time period during which the egressIP can't be used by the assigned namespaces
My questions:
1. Are my assumptions right?
2. What can we do to prevent this situation? We want to update our cluster without any service downtime.
Maybe the problem lies somewhere completely different; if so, please let me know.
Thank you!
Kind regards,
Sascha