on-disk validation fails on file content mismatch during MCO upgrade in OpenShift 4
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4
Issue
-
MCO(Machine Config Operator) upgrade fails when on-disk validation against current renderedMachineConfig:message: 'Node node-worker-xfgj is reporting: "unexpected on-disk state validating against rendered-worker-03dbc49f09784223c1b11a53db9d92eb"'message: 'Node node-worker-xfgj is reporting: "failed to drain node : node-worker-xfgj after 1 hour", Node node-worker-xfgj is reporting: "unexpected on-disk state validating against rendered-worker-03dbc49f09784223c1b11a53db9d92eb: content mismatch for file \"/usr/local/bin/configure-ovs.sh\""' -
An error like below is repeating in the
Machine Config Daemon's Pod logs:daemon.go:1350] content mismatch for file /etc/chrony.conf:
Resolution
Important: If the mismatch file was modified by the user, use the following steps to fix the mismatch file. However, if the mismatch file was not modified by the user, consider following this KCS to force update to the desired state.
Check the node and the MachineConfig causing the issue, as explained in the Diagnostic Steps section.
-
Get the encoding of the
sourcefield of the file in theMachineConfig
Set theFILE_PATHandNODEvariables as desired.$ FILE_PATH=[/path/and/filename] $ NODE=[nodename] $ CURRENTCONFIG=$(oc get nodes ${NODE} -o json | \ jq -r '.metadata.annotations."machineconfiguration.openshift.io/currentConfig"') $ oc get mc ${CURRENTCONFIG} -o json | \ jq -r --arg file_path ${FILE_PATH} '.spec.config.storage.files[] | select(.path == $file_path) | .contents.source' | \ cut -d',' -f1 data:text/plain;charset=utf-8;base64Note:
- If the
sourcefield is inbase64(contains the stringdata:text/plain;charset=utf-8;base64at the beginning), the commandbase64 -dcan be used for decoding the content. - If the
sourcefield isurl encoded(contains onlydata:), it can be decoded with a function likefunction urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }
- If the
-
Recreate the file using one of the following options:
-
Using the currentconfig file already available on the node
- base64 source
$ oc debug node/${NODE} -- chroot /host bash -c "jq -r --arg file_path ${FILE_PATH} '.spec.config.storage.files[] | select(.path == \$file_path) | .contents.source' /etc/machine-config-daemon/currentconfig | cut -d',' -f2- | base64 -d" > ${FILE_PATH}- url encoded source
$ oc debug node/${NODE} sh-5.1# chroot /host sh-5.1# FILE_PATH=[Same/as/set/earlier] sh-5.1# function urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; } ; urldecode "$(jq -r --arg file_path ${FILE_PATH} '.spec.config.storage.files[] | select(.path == $file_path) | .contents.source' /etc/machine-config-daemon/currentconfig | cut -d',' -f2-)" > ${FILE_PATH} -
Using the current config available in the MC
To get the encoded string from thesourcefield from the MC1. :$ oc get mc ${CURRENTCONFIG} -o json | jq -r --arg file_path ${FILE_PATH} '.spec.config.storage.files[] | select(.path == $file_path) | .contents.source' | cut -d',' -f2- c2VydmVyIDEwLjE2Ny42MC4zMSBpYnVyc3QKc2VydmVyIDEwLjE2Mi42MC4zMSBpYnVyc3QgCmRyaWZ0ZmlsZSAvdmFyL2xpYi9jaHJvbnkvZHJpZnQKbWFrZXN0ZXAgMS4wIDMKcnRjc3luYwpsb2dkaXIgL3Zhci9sb2cvY2hyb255Cg==Note: The following
python3commands can also be used to decode it:python3 -c "import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))" [string to decode]forbase64, and ` python3 -c "import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))" [string_to_decode] for url encode.To decode the
sourcefield to a file:### For `base64` $ oc get mc ${CURRENTCONFIG} -o json | jq -r --arg file_path ${FILE_PATH} '.spec.config.storage.files[] | select(.path == $file_path) | .contents.source' | cut -d',' -f2- | base64 -d > [file_name] ### For url encoding $ function urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; } ; urldecode "$(oc get mc ${CURRENTCONFIG} -o json | jq -r --arg file_path ${FILE_PATH} '.spec.config.storage.files[] | select(.path == $file_path) | .contents.source' | cut -d',' -f2-)" > [file_name]Example to correct the
/etc/chrony.conffile on-disk from abase64encoded field:# oc debug node/${NODE} -- chroot /host /bin/bash -c " echo -e c2VydmVyIDEwLjE2Ny42MC4zMSBpYnVyc3QKc2VydmVyIDEwLjE2Mi42MC4zMSBpYnVyc3QgCmRyaWZ0ZmlsZSAvdmFyL2xpYi9jaHJvbnkvZHJpZnQKbWFrZXN0ZXAgMS4wIDMKcnRjc3luYwpsb2dkaXIgL3Zhci9sb2cvY2hyb255Cg== | base64 -d > /etc/chrony.conf ; cat /etc/chrony.conf " Starting pod/worker-<nodename>-debug ... To use host binaries, run `chroot /host` server 10.167.60.31 iburst server 10.162.60.31 iburst driftfile /var/lib/chrony/drift makestep 1.0 3 rtcsync logdir /var/log/chrony Removing debug pod ...IMPORTANT: Be sure that the above command is entered as a single line with no line breaks. The final command in the sequence is only to confirm the contents of the file on the node.
-
Once the correct file content is in place on the affected node, the MCO upgrade should progress to completion.
Root Cause
Generally, the issue is that the contents of the file on the node have been manually changed, and so is not registered by the Machine Config Operator (MCO); The MCO will keep checking the image on the node to ensure that its configuration matches that of the rendered MachineConfig in the definition of that node. Should there be any variance, including extraneous hidden characters or whitespace, the upgrade will fail on the on-disk validation.
Diagnostic Steps
Check for the failing MachineConfigPools:
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
[...]
worker rendered-worker-03dbc49f09784223c1b11a53db9d92eb False True True 11 5 5 2 67d
[...]
$ oc get mcp worker -o yaml
[...]
message: 'Node node-worker-xfgj is reporting: "failed
to drain node : node-worker-xfgj after 1 hour", Node
node-worker-xfgj is reporting: "unexpected on-disk
state validating against rendered-worker-03dbc49f09784223c1b11a53db9d92eb: content
mismatch for file \"/usr/local/bin/configure-ovs.sh\""'
reason: 2 nodes are reporting degraded status on sync
status: "True"
type: NodeDegraded
[...]
Check the machine-config-daemon pod in the failing node (change [node_name] with the name of the node as shown in the MCP):
$ oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon -o wide | grep [node_name]
[...]
The machine-config-daemon log will contain the following error, showing a content mismatch on a file (in this instance /etc/chrony.conf):
$ oc logs -n openshift-machine-config-operator -c machine-config-daemon [machine-config-daemon_pod_name]
[...]
2020-08-10T21:19:47.033347705Z I0810 21:19:47.033307 3841697 rpm-ostree.go:366] Running captured: journalctl --list-boots
2020-08-10T21:19:47.048652655Z I0810 21:19:47.048614 3841697 daemon.go:785] journalctl --list-boots:
2020-08-10T21:19:47.048652655Z 0 79eb94b96a9149098067de11d2a9c147 Sun 2020-08-02 10:35:55 UTC—Mon 2020-08-10 21:19:47 UTC
2020-08-10T21:19:47.048652655Z I0810 21:19:47.048633 3841697 daemon.go:528] Starting MachineConfigDaemon
2020-08-10T21:19:47.048856095Z I0810 21:19:47.048770 3841697 daemon.go:535] Enabling Kubelet Healthz Monitor
2020-08-10T21:19:57.491720171Z I0810 21:19:57.491669 3841697 daemon.go:724] Current+desired config: rendered-worker-03dbc49f09784223c1b11a53db9d92eb
2020-08-10T21:19:57.500706376Z I0810 21:19:57.500650 3841697 daemon.go:958] Validating against current config rendered-worker-03dbc49f09784223c1b11a53db9d92eb
2020-08-10T21:19:57.500948198Z E0810 21:19:57.500910 3841697 daemon.go:1350] content mismatch for file /etc/chrony.conf:
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z A: # Use public servers from the pool.ntp.org project.
2020-08-10T21:19:57.500948198Z # Please consider joining the pool (http://www.pool.ntp.org/join.html).
2020-08-10T21:19:57.500948198Z pool 2.rhel.pool.ntp.org iburst
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Record the rate at which the system clock gains/losses time.
2020-08-10T21:19:57.500948198Z driftfile /var/lib/chrony/drift
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Allow the system clock to be stepped in the first three updates
2020-08-10T21:19:57.500948198Z # if its offset is larger than 1 second.
2020-08-10T21:19:57.500948198Z makestep 1.0 3
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Enable kernel synchronization of the real-time clock (RTC).
2020-08-10T21:19:57.500948198Z rtcsync
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Enable hardware timestamping on all interfaces that support it.
2020-08-10T21:19:57.500948198Z #hwtimestamp *
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Increase the minimum number of selectable sources required to adjust
2020-08-10T21:19:57.500948198Z # the system clock.
2020-08-10T21:19:57.500948198Z #minsources 2
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Allow NTP client access from local network.
2020-08-10T21:19:57.500948198Z #allow 192.168.0.0/16
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Serve time even if not synchronized to a time source.
2020-08-10T21:19:57.500948198Z #local stratum 10
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Specify file containing keys for NTP authentication.
2020-08-10T21:19:57.500948198Z keyfile /etc/chrony.keys
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Get TAI-UTC offset and leap seconds from the system tz database.
2020-08-10T21:19:57.500948198Z leapsectz right/UTC
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Specify directory for log files.
2020-08-10T21:19:57.500948198Z logdir /var/log/chrony
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z # Select which information is logged.
2020-08-10T21:19:57.500948198Z #log measurements statistics tracking
2020-08-10T21:19:57.500948198Z
2020-08-10T21:19:57.500948198Z B: server 10.167.60.31 iburst
2020-08-10T21:19:57.500948198Z server 10.162.60.31 iburst
2020-08-10T21:19:57.500948198Z driftfile /var/lib/chrony/drift
2020-08-10T21:19:57.500948198Z makestep 1.0 3
2020-08-10T21:19:57.500948198Z rtcsync
2020-08-10T21:19:57.500948198Z logdir /var/log/chrony
2020-08-10T21:19:57.500948198Z
[...]
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
Comments