on-disk validation fails on file content mismatch during MCO upgrade in OpenShift 4
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4
Issue
-
MCO
(Machine Config Operator) upgrade fails when on-disk validation against current renderedMachineConfig
:message: 'Node node-worker-xfgj is reporting: "unexpected on-disk state validating against rendered-worker-03dbc49f09784223c1b11a53db9d92eb"'
message: 'Node node-worker-xfgj is reporting: "failed to drain node : node-worker-xfgj after 1 hour", Node node-worker-xfgj is reporting: "unexpected on-disk state validating against rendered-worker-03dbc49f09784223c1b11a53db9d92eb: content mismatch for file \"/usr/local/bin/configure-ovs.sh\""'
-
An error like below is repeating in the
Machine Config Daemon
's Pod logs:daemon.go:1350] content mismatch for file /etc/chrony.conf:
Resolution
Check the node and the MachineConfig
causing the issue as explained in the Diagnostic Steps section.
Get the encoding of the source
field of the file in the MachineConfig
(change the [node_name]
and [rendered-worker_name]
accordingly, and the [escaped_path]
needs to be escaped like for example \/etc\/chrony.conf
):
$ oc get node [node_name] -o yaml | grep desiredConfig
machineconfiguration.openshift.io/desiredConfig: [rendered-worker_name]
$ oc get mc [rendered-worker_name] -o yaml | grep -B5 "path: [escaped_path]" | grep source | tail -n 1 | cut -d"," -f1
source: data:text/plain;charset=utf-8;base64
To get the encoded string from the source
field:
$ oc get mc [rendered-worker_name] -o yaml | grep -B5 "path: [escaped_path]" | grep source | tail -n 1 | cut -d"," -f2
c2VydmVyIDEwLjE2Ny42MC4zMSBpYnVyc3QKc2VydmVyIDEwLjE2Mi42MC4zMSBpYnVyc3QgCmRyaWZ0ZmlsZSAvdmFyL2xpYi9jaHJvbnkvZHJpZnQKbWFrZXN0ZXAgMS4wIDMKcnRjc3luYwpsb2dkaXIgL3Zhci9sb2cvY2hyb255Cg==
Note: If the
source
field is inbase64
(contains the stringdata:text/plain;charset=utf-8;base64
at the beginning), the commandbase64 -d
can be used for decoding the content. If thesource
field is url encode (contains onlydata:
), it can be decoded with a function likefunction urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }
.
Note: The followingpython3
commands can be also used to decode it:python3 -c "import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))" [string to decode]
forbase64
, and ` python3 -c "import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))" [string_to_decode] for url encode.
To decode the source
field to a file:
### For `base64`
$ oc get mc [rendered-worker_name] -o yaml | grep -B5 "path: [escaped_path]" | grep source | tail -n 1 | cut -d"," -f2 | base64 -d > [file_name]
### For url encode
$ function urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; } ; urldecode "$(oc get mc [rendered-worker_name] -o yaml | grep -B5 "path: [escaped_path]" | grep source | tail -n 1 | cut -d"," -f2)" > [file_name]
Example to correct the /etc/chrony.conf
file on-disk from a base64
encoded field:
# oc debug node/[node_name] -- chroot /host /bin/bash -c " echo -e c2VydmVyIDEwLjE2Ny42MC4zMSBpYnVyc3QKc2VydmVyIDEwLjE2Mi42MC4zMSBpYnVyc3QgCmRyaWZ0ZmlsZSAvdmFyL2xpYi9jaHJvbnkvZHJpZnQKbWFrZXN0ZXAgMS4wIDMKcnRjc3luYwpsb2dkaXIgL3Zhci9sb2cvY2hyb255Cg== | base64 -d > /etc/chrony.conf ; cat /etc/chrony.conf "
Starting pod/worker-<nodename>-debug ...
To use host binaries, run `chroot /host`
server 10.167.60.31 iburst
server 10.162.60.31 iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
logdir /var/log/chrony
Removing debug pod ...
IMPORTANT: Be sure that above command is entered as a single line with no line breaks. The final command in the sequence is only to confirm the contents of the file on the node.
Once the correct file content is in place on the affected node, the MCO upgrade should progress to completion.
Root Cause
Generally, the issue is that the contents of the file on the node have been manually changed, and so is not registered by the Machine Config Operator (MCO
); once an upgrade is in flight, the MCO
will check the image on the node to ensure that its configuration matches that of the rendered MachineConfig
in the definition of that node. Should there be any variance, including extraneous hidden characters or whitespace, the upgrade will fail on the on-disk validation.
Diagnostic Steps
Check for the failing MachineConfigPools
:
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
[...]
worker rendered-worker-03dbc49f09784223c1b11a53db9d92eb False True True 11 5 5 2 67d
[...]
$ oc get mcp worker -o yaml
[...]
message: 'Node node-worker-xfgj is reporting: "failed
to drain node : node-worker-xfgj after 1 hour", Node
node-worker-xfgj is reporting: "unexpected on-disk
state validating against rendered-worker-03dbc49f09784223c1b11a53db9d92eb: content
mismatch for file \"/usr/local/bin/configure-ovs.sh\""'
reason: 2 nodes are reporting degraded status on sync
status: "True"
type: NodeDegraded
[...]
Check the machine-config-daemon
pod in the failing node (change [node_name]
with the name of the node as shown in the MCP):
$ oc get pods -n openshift-machine-config-operator -o wide | grep [node_name]
[...]
The machine-config-daemon
log will contain the following error, showing a content mismatch on a file (in this instance /etc/chrony.conf
):
$ oc logs -n openshift-machine-config-operator -c machine-config-daemon [machine-config-daemon_pod_name]
[...]
2020-08-10T21:19:47.033347705Z I0810 21:19:47.033307 3841697 rpm-ostree.go:366] Running captured: journalctl --list-boots
2020-08-10T21:19:47.048652655Z I0810 21:19:47.048614 3841697 daemon.go:785] journalctl --list-boots:
2020-08-10T21:19:47.048652655Z 0 79eb94b96a9149098067de11d2a9c147 Sun 2020-08-02 10:35:55 UTC—Mon 2020-08-10 21:19:47 UTC
2020-08-10T21:19:47.048652655Z I0810 21:19:47.048633 3841697 daemon.go:528] Starting MachineConfigDaemon
2020-08-10T21:19:47.048856095Z I0810 21:19:47.048770 3841697 daemon.go:535] Enabling Kubelet Healthz Monitor
2020-08-10T21:19:57.491720171Z I0810 21:19:57.491669 3841697 daemon.go:724] Current+desired config: rendered-worker-03dbc49f09784223c1b11a53db9d92eb
2020-08-10T21:19:57.500706376Z I0810 21:19:57.500650 3841697 daemon.go:958] Validating against current config rendered-worker-03dbc49f09784223c1b11a53db9d92eb
2020-08-10T21:19:57.500948198Z E0810 21:19:57.500910 3841697 daemon.go:1350] content mismatch for file /etc/chrony.conf:
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z A: # Use public servers from the pool.ntp.org project.
2020-08-10T21:19:57.500948198Z # Please consider joining the pool (http://www.pool.ntp.org/join.html).
2020-08-10T21:19:57.500948198Z pool 2.rhel.pool.ntp.org iburst
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Record the rate at which the system clock gains/losses time.
2020-08-10T21:19:57.500948198Z driftfile /var/lib/chrony/drift
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Allow the system clock to be stepped in the first three updates
2020-08-10T21:19:57.500948198Z # if its offset is larger than 1 second.
2020-08-10T21:19:57.500948198Z makestep 1.0 3
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Enable kernel synchronization of the real-time clock (RTC).
2020-08-10T21:19:57.500948198Z rtcsync
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Enable hardware timestamping on all interfaces that support it.
2020-08-10T21:19:57.500948198Z #hwtimestamp *
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Increase the minimum number of selectable sources required to adjust
2020-08-10T21:19:57.500948198Z # the system clock.
2020-08-10T21:19:57.500948198Z #minsources 2
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Allow NTP client access from local network.
2020-08-10T21:19:57.500948198Z #allow 192.168.0.0/16
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Serve time even if not synchronized to a time source.
2020-08-10T21:19:57.500948198Z #local stratum 10
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Specify file containing keys for NTP authentication.
2020-08-10T21:19:57.500948198Z keyfile /etc/chrony.keys
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Get TAI-UTC offset and leap seconds from the system tz database.
2020-08-10T21:19:57.500948198Z leapsectz right/UTC
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Specify directory for log files.
2020-08-10T21:19:57.500948198Z logdir /var/log/chrony
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Select which information is logged.
2020-08-10T21:19:57.500948198Z #log measurements statistics tracking
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z B: server 10.167.60.31 iburst
2020-08-10T21:19:57.500948198Z server 10.162.60.31 iburst
2020-08-10T21:19:57.500948198Z driftfile /var/lib/chrony/drift
2020-08-10T21:19:57.500948198Z makestep 1.0 3
2020-08-10T21:19:57.500948198Z rtcsync
2020-08-10T21:19:57.500948198Z logdir /var/log/chrony
2020-08-10T21:19:57.500948198Z
[...]
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
Comments