on-disk validation fails on file content mismatch during MCO upgrade in OpenShift 4

Solution Verified - Updated -

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4

Issue

  • MCO (Machine Config Operator) upgrade fails when on-disk validation against current rendered MachineConfig:

    message: 'Node node-worker-xfgj is reporting: "unexpected on-disk state validating
      against rendered-worker-03dbc49f09784223c1b11a53db9d92eb"'
    
    message: 'Node node-worker-xfgj is reporting: "failed
      to drain node : node-worker-xfgj after 1 hour", Node
      node-worker-xfgj is reporting: "unexpected on-disk
      state validating against rendered-worker-03dbc49f09784223c1b11a53db9d92eb: content
      mismatch for file \"/usr/local/bin/configure-ovs.sh\""'
    
  • An error like below is repeating in the Machine Config Daemon's Pod logs:

    daemon.go:1350] content mismatch for file /etc/chrony.conf:
    

Resolution

Important: If the mismatch file was modified by the user, use the following steps to fix the mismatch file. However, if the mismatch file was not modified by the user, consider following this KCS to force update to the desired state.

Check the node and the MachineConfig causing the issue, as explained in the Diagnostic Steps section.

  1. Get the encoding of the source field of the file in the MachineConfig
    Set the FILE_PATH and NODE variables as desired.

    $ FILE_PATH=[/path/and/filename]
    $ NODE=[nodename]
    $ CURRENTCONFIG=$(oc get nodes ${NODE} -o json | \
      jq -r '.metadata.annotations."machineconfiguration.openshift.io/currentConfig"')
    $ oc get mc ${CURRENTCONFIG} -o json | \
      jq -r --arg file_path ${FILE_PATH} '.spec.config.storage.files[] | select(.path == $file_path) | .contents.source' | \
      cut -d',' -f1
    data:text/plain;charset=utf-8;base64
    

    Note:

    • If the source field is in base64 (contains the string data:text/plain;charset=utf-8;base64 at the beginning), the command base64 -d can be used for decoding the content.
    • If the source field is url encoded (contains only data:), it can be decoded with a function like function urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }
  2. Recreate the file using one of the following options:

    • Using the currentconfig file already available on the node

      • base64 source
      $ oc debug node/${NODE} -- chroot /host bash -c "jq -r --arg file_path ${FILE_PATH} '.spec.config.storage.files[] | select(.path == \$file_path)  | .contents.source' /etc/machine-config-daemon/currentconfig | cut -d',' -f2- | base64 -d" > ${FILE_PATH}
      
      • url encoded source
      $ oc debug node/${NODE}
      sh-5.1# chroot /host
      sh-5.1# FILE_PATH=[Same/as/set/earlier]
      sh-5.1# function urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; } ; urldecode "$(jq -r --arg file_path ${FILE_PATH} '.spec.config.storage.files[] | select(.path == $file_path)  | .contents.source' /etc/machine-config-daemon/currentconfig | cut -d',' -f2-)" > ${FILE_PATH}
      
    • Using the current config available in the MC
      To get the encoded string from the source field from the MC1. :

      $ oc get mc ${CURRENTCONFIG} -o json | jq -r --arg file_path ${FILE_PATH} '.spec.config.storage.files[] | select(.path == $file_path) | .contents.source' | cut -d',' -f2-
      c2VydmVyIDEwLjE2Ny42MC4zMSBpYnVyc3QKc2VydmVyIDEwLjE2Mi42MC4zMSBpYnVyc3QgCmRyaWZ0ZmlsZSAvdmFyL2xpYi9jaHJvbnkvZHJpZnQKbWFrZXN0ZXAgMS4wIDMKcnRjc3luYwpsb2dkaXIgL3Zhci9sb2cvY2hyb255Cg== 
      

      Note: The following python3 commands can also be used to decode it: python3 -c "import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))" [string to decode] for base64, and ` python3 -c "import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))" [string_to_decode] for url encode.

      To decode the source field to a file:

      ### For `base64` 
      $ oc get mc ${CURRENTCONFIG} -o json | jq -r --arg file_path ${FILE_PATH} '.spec.config.storage.files[] | select(.path == $file_path) | .contents.source' | cut -d',' -f2- | base64 -d > [file_name]
      
      ### For url encoding 
      $ function urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; } ;  urldecode "$(oc get mc ${CURRENTCONFIG} -o json | jq -r --arg file_path ${FILE_PATH} '.spec.config.storage.files[] | select(.path == $file_path) | .contents.source' | cut -d',' -f2-)" > [file_name] 
      

      Example to correct the /etc/chrony.conf file on-disk from a base64 encoded field:

      # oc debug node/${NODE} -- chroot /host /bin/bash -c " echo -e c2VydmVyIDEwLjE2Ny42MC4zMSBpYnVyc3QKc2VydmVyIDEwLjE2Mi42MC4zMSBpYnVyc3QgCmRyaWZ0ZmlsZSAvdmFyL2xpYi9jaHJvbnkvZHJpZnQKbWFrZXN0ZXAgMS4wIDMKcnRjc3luYwpsb2dkaXIgL3Zhci9sb2cvY2hyb255Cg== | base64 -d > /etc/chrony.conf ; cat /etc/chrony.conf " 
      Starting pod/worker-<nodename>-debug ... 
      To use host binaries, run `chroot /host` 
      server 10.167.60.31 iburst 
      server 10.162.60.31 iburst 
      driftfile /var/lib/chrony/drift 
      makestep 1.0 3 
      rtcsync 
      logdir /var/log/chrony
      
      Removing debug pod ... 
      

      IMPORTANT: Be sure that the above command is entered as a single line with no line breaks. The final command in the sequence is only to confirm the contents of the file on the node.

Once the correct file content is in place on the affected node, the MCO upgrade should progress to completion.

Root Cause

Generally, the issue is that the contents of the file on the node have been manually changed, and so is not registered by the Machine Config Operator (MCO); The MCO will keep checking the image on the node to ensure that its configuration matches that of the rendered MachineConfig in the definition of that node. Should there be any variance, including extraneous hidden characters or whitespace, the upgrade will fail on the on-disk validation.

Diagnostic Steps

Check for the failing MachineConfigPools:

$ oc get mcp
NAME      CONFIG                                              UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
[...]
worker     rendered-worker-03dbc49f09784223c1b11a53db9d92eb     False     True       True       11             5                   5                     2                       67d
[...]

$ oc get mcp worker -o yaml
[...]
    message: 'Node node-worker-xfgj is reporting: "failed
      to drain node : node-worker-xfgj after 1 hour", Node
      node-worker-xfgj is reporting: "unexpected on-disk
      state validating against rendered-worker-03dbc49f09784223c1b11a53db9d92eb: content
      mismatch for file \"/usr/local/bin/configure-ovs.sh\""'
    reason: 2 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
[...]

Check the machine-config-daemon pod in the failing node (change [node_name] with the name of the node as shown in the MCP):

$ oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon -o wide | grep [node_name]
[...]

The machine-config-daemon log will contain the following error, showing a content mismatch on a file (in this instance /etc/chrony.conf):

$ oc logs -n openshift-machine-config-operator -c machine-config-daemon [machine-config-daemon_pod_name]
[...]
2020-08-10T21:19:47.033347705Z I0810 21:19:47.033307 3841697 rpm-ostree.go:366] Running captured: journalctl --list-boots
2020-08-10T21:19:47.048652655Z I0810 21:19:47.048614 3841697 daemon.go:785] journalctl --list-boots:
2020-08-10T21:19:47.048652655Z  0 79eb94b96a9149098067de11d2a9c147 Sun 2020-08-02 10:35:55 UTC—Mon 2020-08-10 21:19:47 UTC
2020-08-10T21:19:47.048652655Z I0810 21:19:47.048633 3841697 daemon.go:528] Starting MachineConfigDaemon
2020-08-10T21:19:47.048856095Z I0810 21:19:47.048770 3841697 daemon.go:535] Enabling Kubelet Healthz Monitor
2020-08-10T21:19:57.491720171Z I0810 21:19:57.491669 3841697 daemon.go:724] Current+desired config: rendered-worker-03dbc49f09784223c1b11a53db9d92eb
2020-08-10T21:19:57.500706376Z I0810 21:19:57.500650 3841697 daemon.go:958] Validating against current config rendered-worker-03dbc49f09784223c1b11a53db9d92eb
2020-08-10T21:19:57.500948198Z E0810 21:19:57.500910 3841697 daemon.go:1350] content mismatch for file /etc/chrony.conf: 
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z A: # Use public servers from the pool.ntp.org project.
2020-08-10T21:19:57.500948198Z # Please consider joining the pool (http://www.pool.ntp.org/join.html).
2020-08-10T21:19:57.500948198Z pool 2.rhel.pool.ntp.org iburst
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Record the rate at which the system clock gains/losses time.
2020-08-10T21:19:57.500948198Z driftfile /var/lib/chrony/drift
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Allow the system clock to be stepped in the first three updates
2020-08-10T21:19:57.500948198Z # if its offset is larger than 1 second.
2020-08-10T21:19:57.500948198Z makestep 1.0 3
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Enable kernel synchronization of the real-time clock (RTC).
2020-08-10T21:19:57.500948198Z rtcsync
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Enable hardware timestamping on all interfaces that support it.
2020-08-10T21:19:57.500948198Z #hwtimestamp *
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Increase the minimum number of selectable sources required to adjust
2020-08-10T21:19:57.500948198Z # the system clock.
2020-08-10T21:19:57.500948198Z #minsources 2
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Allow NTP client access from local network.
2020-08-10T21:19:57.500948198Z #allow 192.168.0.0/16
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Serve time even if not synchronized to a time source.
2020-08-10T21:19:57.500948198Z #local stratum 10
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Specify file containing keys for NTP authentication.
2020-08-10T21:19:57.500948198Z keyfile /etc/chrony.keys
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Get TAI-UTC offset and leap seconds from the system tz database.
2020-08-10T21:19:57.500948198Z leapsectz right/UTC
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Specify directory for log files.
2020-08-10T21:19:57.500948198Z logdir /var/log/chrony
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z # Select which information is logged.
2020-08-10T21:19:57.500948198Z #log measurements statistics tracking
2020-08-10T21:19:57.500948198Z 
2020-08-10T21:19:57.500948198Z B: server 10.167.60.31 iburst
2020-08-10T21:19:57.500948198Z server 10.162.60.31 iburst 
2020-08-10T21:19:57.500948198Z driftfile /var/lib/chrony/drift
2020-08-10T21:19:57.500948198Z makestep 1.0 3
2020-08-10T21:19:57.500948198Z rtcsync
2020-08-10T21:19:57.500948198Z logdir /var/log/chrony
2020-08-10T21:19:57.500948198Z
[...]

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Comments