on-disk validation fails on file content mismatch during MCO upgrade in OpenShift 4

Solution Verified - Updated -

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4

Issue

  • MCO (Machine Config Operator) upgrade fails when on-disk validation against current rendered MachineConfig:

    message: 'Node node-worker-xfgj is reporting: "unexpected on-disk state validating
      against rendered-worker-03dbc49f09784223c1b11a53db9d92eb"'
    
    message: 'Node node-worker-xfgj is reporting: "failed
      to drain node : node-worker-xfgj after 1 hour", Node
      node-worker-xfgj is reporting: "unexpected on-disk
      state validating against rendered-worker-03dbc49f09784223c1b11a53db9d92eb: content
      mismatch for file \"/usr/local/bin/configure-ovs.sh\""'
    
  • An error like below is repeating in the Machine Config Daemon's Pod logs:

    daemon.go:1350] content mismatch for file /etc/chrony.conf:
    

Resolution

Check the node and the MachineConfig causing the issue as explained in the Diagnostic Steps section.

Get the encoding of the source field of the file in the MachineConfig (change the [node_name] and [rendered-worker_name] accordingly, and the [escaped_path] needs to be escaped like for example \/etc\/chrony.conf):

$ oc get node [node_name] -o yaml | grep desiredConfig
    machineconfiguration.openshift.io/desiredConfig: [rendered-worker_name]

$ oc get mc [rendered-worker_name] -o yaml | grep -B5 "path: [escaped_path]" | grep source | tail -n 1 | cut -d"," -f1
          source: data:text/plain;charset=utf-8;base64

To get the encoded string from the source field:

$ oc get mc [rendered-worker_name] -o yaml | grep -B5 "path: [escaped_path]" | grep source | tail -n 1 | cut -d"," -f2

c2VydmVyIDEwLjE2Ny42MC4zMSBpYnVyc3QKc2VydmVyIDEwLjE2Mi42MC4zMSBpYnVyc3QgCmRyaWZ0ZmlsZSAvdmFyL2xpYi9jaHJvbnkvZHJpZnQKbWFrZXN0ZXAgMS4wIDMKcnRjc3luYwpsb2dkaXIgL3Zhci9sb2cvY2hyb255Cg==

Note: If the source field is in base64 (contains the string data:text/plain;charset=utf-8;base64 at the beginning), the command base64 -d can be used for decoding the content. If the source field is url encode (contains only data:), it can be decoded with a function like function urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }.
Note: The following python3 commands can be also used to decode it: python3 -c "import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))" [string to decode] for base64, and ` python3 -c "import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))" [string_to_decode] for url encode.

To decode the source field to a file:

### For `base64`
$ oc get mc [rendered-worker_name] -o yaml | grep -B5 "path: [escaped_path]" | grep source | tail -n 1 | cut -d"," -f2 | base64 -d > [file_name]

### For url encode
$ function urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; } ;  urldecode "$(oc get mc [rendered-worker_name] -o yaml | grep -B5 "path: [escaped_path]" | grep source | tail -n 1 | cut -d"," -f2)" > [file_name]

Example to correct the /etc/chrony.conf file on-disk from a base64 encoded field:

# oc debug node/[node_name] -- chroot /host /bin/bash -c " echo -e c2VydmVyIDEwLjE2Ny42MC4zMSBpYnVyc3QKc2VydmVyIDEwLjE2Mi42MC4zMSBpYnVyc3QgCmRyaWZ0ZmlsZSAvdmFyL2xpYi9jaHJvbnkvZHJpZnQKbWFrZXN0ZXAgMS4wIDMKcnRjc3luYwpsb2dkaXIgL3Zhci9sb2cvY2hyb255Cg== | base64 -d > /etc/chrony.conf ; cat /etc/chrony.conf "
Starting pod/worker-<nodename>-debug ...
To use host binaries, run `chroot /host`
server 10.167.60.31 iburst
server 10.162.60.31 iburst 
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
logdir /var/log/chrony

Removing debug pod ...

IMPORTANT: Be sure that above command is entered as a single line with no line breaks. The final command in the sequence is only to confirm the contents of the file on the node.

Once the correct file content is in place on the affected node, the MCO upgrade should progress to completion.

Root Cause

Generally, the issue is that the contents of the file on the node have been manually changed, and so is not registered by the Machine Config Operator (MCO); once an upgrade is in flight, the MCO will check the image on the node to ensure that its configuration matches that of the rendered MachineConfig in the definition of that node. Should there be any variance, including extraneous hidden characters or whitespace, the upgrade will fail on the on-disk validation.

Diagnostic Steps

Check for the failing MachineConfigPools:

$ oc get mcp
NAME      CONFIG                                              UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
[...]
worker     rendered-worker-03dbc49f09784223c1b11a53db9d92eb     False     True       True       11             5                   5                     2                       67d
[...]

$ oc get mcp worker -o yaml
[...]
    message: 'Node node-worker-xfgj is reporting: "failed
      to drain node : node-worker-xfgj after 1 hour", Node
      node-worker-xfgj is reporting: "unexpected on-disk
      state validating against rendered-worker-03dbc49f09784223c1b11a53db9d92eb: content
      mismatch for file \"/usr/local/bin/configure-ovs.sh\""'
    reason: 2 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
[...]

Check the machine-config-daemon pod in the failing node (change [node_name] with the name of the node as shown in the MCP):

$ oc get pods -n openshift-machine-config-operator -o wide | grep [node_name]
[...]

The machine-config-daemon log will contain the following error, showing a content mismatch on a file (in this instance /etc/chrony.conf):

$ oc logs -n openshift-machine-config-operator -c machine-config-daemon [machine-config-daemon_pod_name]
[...]
2020-08-10T21:19:47.033347705Z I0810 21:19:47.033307 3841697 rpm-ostree.go:366] Running captured: journalctl --list-boots
2020-08-10T21:19:47.048652655Z I0810 21:19:47.048614 3841697 daemon.go:785] journalctl --list-boots:
2020-08-10T21:19:47.048652655Z  0 79eb94b96a9149098067de11d2a9c147 Sun 2020-08-02 10:35:55 UTC—Mon 2020-08-10 21:19:47 UTC
2020-08-10T21:19:47.048652655Z I0810 21:19:47.048633 3841697 daemon.go:528] Starting MachineConfigDaemon
2020-08-10T21:19:47.048856095Z I0810 21:19:47.048770 3841697 daemon.go:535] Enabling Kubelet Healthz Monitor
2020-08-10T21:19:57.491720171Z I0810 21:19:57.491669 3841697 daemon.go:724] Current+desired config: rendered-worker-03dbc49f09784223c1b11a53db9d92eb
2020-08-10T21:19:57.500706376Z I0810 21:19:57.500650 3841697 daemon.go:958] Validating against current config rendered-worker-03dbc49f09784223c1b11a53db9d92eb
2020-08-10T21:19:57.500948198Z E0810 21:19:57.500910 3841697 daemon.go:1350] content mismatch for file /etc/chrony.conf: 
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z A: # Use public servers from the pool.ntp.org project.
2020-08-10T21:19:57.500948198Z # Please consider joining the pool (http://www.pool.ntp.org/join.html).
2020-08-10T21:19:57.500948198Z pool 2.rhel.pool.ntp.org iburst
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Record the rate at which the system clock gains/losses time.
2020-08-10T21:19:57.500948198Z driftfile /var/lib/chrony/drift
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Allow the system clock to be stepped in the first three updates
2020-08-10T21:19:57.500948198Z # if its offset is larger than 1 second.
2020-08-10T21:19:57.500948198Z makestep 1.0 3
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Enable kernel synchronization of the real-time clock (RTC).
2020-08-10T21:19:57.500948198Z rtcsync
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Enable hardware timestamping on all interfaces that support it.
2020-08-10T21:19:57.500948198Z #hwtimestamp *
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Increase the minimum number of selectable sources required to adjust
2020-08-10T21:19:57.500948198Z # the system clock.
2020-08-10T21:19:57.500948198Z #minsources 2
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Allow NTP client access from local network.
2020-08-10T21:19:57.500948198Z #allow 192.168.0.0/16
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Serve time even if not synchronized to a time source.
2020-08-10T21:19:57.500948198Z #local stratum 10
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Specify file containing keys for NTP authentication.
2020-08-10T21:19:57.500948198Z keyfile /etc/chrony.keys
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Get TAI-UTC offset and leap seconds from the system tz database.
2020-08-10T21:19:57.500948198Z leapsectz right/UTC
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Specify directory for log files.
2020-08-10T21:19:57.500948198Z logdir /var/log/chrony
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Select which information is logged.
2020-08-10T21:19:57.500948198Z #log measurements statistics tracking
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z B: server 10.167.60.31 iburst
2020-08-10T21:19:57.500948198Z server 10.162.60.31 iburst 
2020-08-10T21:19:57.500948198Z driftfile /var/lib/chrony/drift
2020-08-10T21:19:57.500948198Z makestep 1.0 3
2020-08-10T21:19:57.500948198Z rtcsync
2020-08-10T21:19:57.500948198Z logdir /var/log/chrony
2020-08-10T21:19:57.500948198Z
[...]

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Comments