mdsd pods throwing error could not write forward header

Solution Verified - Updated -

Environment

  • Azure Red Hat OpenShift (ARO)
    • 4.x

Issue

  • The mdsd pods in the openshift-azure-logging namespace have multiple restarts and are logging error messages.
$ oc get pods -n openshift-azure-logging

NAME        READY  STATUS   RESTARTS  AGE
mdsd-xxxx  2/2    Running  3         298d
mdsd-xxxx  2/2    Running  4         272d
mdsd-xxxx  1/2    Running  19762     298d
  • Error message:
[2022/08/31 12:23:29] [error] [output:forward:forward.0] could not write forward header
[2022/08/31 12:23:29] [error] [output:forward:forward.0] could not write forward header
[2022/08/31 12:23:29] [error] [output:forward:forward.0] could not write forward header
[2022/08/31 12:23:29] [error] [output:forward:forward.0] could not write forward header
[2022/08/31 12:23:29] [ warn] [engine] chunk '1-xxxxx.xxxxx.flb' cannot be retried: task_id=6, input=tail.1 > output=forward.0
[2022/08/31 12:23:29] [ warn] [engine] failed to flush chunk '1-xxxxx.xxxxx.flb', retry in 9 seconds: task_id=1, input=systemd.0 > output=forward.0 (out_id=0)
[2022/08/31 12:23:29] [ warn] [engine] failed to flush chunk '1-xxxxx.xxxxx.flb', retry in 9 seconds: task_id=4, input=tail.1 > output=forward.0 (out_id=0)
[2022/08/31 12:23:29] [ warn] [engine] failed to flush chunk '1-xxxxx.xxxxx.flb', retry in 9 seconds: task_id=5, input=tail.2 > output=forward.0 (out_id=0)
  • The mdsd pod logs also indicate problems, most likely caused by old certificates or authorization issues between MDSD and Geneva.
{"Message":"Unauthorized","Code":"Forbidden","StackTrace":"","Details":null}
2022-08-29T02:49:18.7588860Z: MdsRestInterface::QueryGcsAccountInfo() failed
2022-08-29T02:49:18.7589320Z: LoadGcsKey() returned false; next reload(minutes): 1
2022-08-29T02:50:10.2042960Z: Blob write failed due to storage exception: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.. Http status code: 403. Extended info: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
  • PUCM updates were failing on the cluster because of a modification to NSG or RP permissions:
network.SubnetsClient#CreateOrUpdate: Failure sending request: StatusCode=403 -- Original Error: Code="LinkedAuthorizationFailed" Message="The client 'xxxxxx-xxxx-xxxx-xxxx-xxxxxx' with object id 'xxxxxx-xxxx-xxxx-xxxx-xxxxxx' has permission to perform action 'Microsoft.Network/virtualNetworks/subnets/write' on scope '/subscriptions/xxxxxx-xxxx-xxxx-xxxx-xxxxxx/resourceGroups/xxxxxxxx/providers/Microsoft.Network/virtualNetworks/xxxx/subnets/xxxx'; however, it does not have permission to perform action 'Microsoft.Network/networkSecurityGroups/join/action' on the linked scope(s) '/subscriptionsxxxxxx-xxxx-xxxx-xxxx-xxxxxx/resourceGroups/xxxxxxx/providers/Microsoft.Network/networkSecurityGroups/xxxxxxxxx' or the linked scope(s) are invalid."

Resolution

  • Grant the client named in the error message join permissions (Microsoft.Network/networkSecurityGroups/join/action) on the required Network Security Group (NSG).
  • Since end users retain full administration rights over cluster resources and groups, it is impossible to anticipate every configuration that could be applied, and some configurations may prevent normal cluster maintenance tasks.
  • The support agreement states that end users should avoid placing policies within their subscription or management group that hinder SREs from performing regular maintenance on the Azure Red Hat OpenShift cluster.
  • In situations like these, all parties involved should collaborate to correct any misconfiguration that hinders maintenance tasks, so that normal cluster operations continue uninterrupted.
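One way to grant the missing permission is with the Azure CLI. The following is a minimal sketch, not the only valid approach: it assumes the built-in Network Contributor role is acceptable in your environment, and the function names nsg_scope and grant_nsg_access, as well as every ID shown, are hypothetical placeholders to be replaced with the values from the LinkedAuthorizationFailed message.

```shell
# Sketch only: all IDs are placeholders, not values from this article.

# nsg_scope is a hypothetical helper that assembles the NSG resource ID.
nsg_scope() {
  # $1 = subscription ID, $2 = resource group, $3 = NSG name
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Network/networkSecurityGroups/%s\n' "$1" "$2" "$3"
}

# grant_nsg_access is a hypothetical wrapper around `az role assignment create`.
grant_nsg_access() {
  # $1 = client (object) ID from the error message, $2 = NSG resource ID
  az role assignment create --assignee "$1" --role "Network Contributor" --scope "$2"
}

# Example call (requires a logged-in Azure CLI session with rights to assign roles):
# grant_nsg_access "<client-object-id>" \
#   "$(nsg_scope "<subscription-id>" "<resource-group>" "<nsg-name>")"
```

Scoping the assignment to the NSG itself, rather than to the whole resource group or subscription, keeps the grant as narrow as the error message requires.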

Root Cause

  • The NSG was created by the end user, and the service principal (SP) associated with the cluster did not have the necessary Network Contributor permissions over it.
  • During PUCM, the SRE checks that service endpoints are enabled for storage account access and enables them if they are not.
  • This action implicitly triggers the Subnet CreateOrUpdate operation, which in turn invokes the Microsoft.Network/networkSecurityGroups/join/action operation.
  • Even though this action is idempotent when the NSG is already attached to the subnet, the Network Contributor permissions are still required to execute it.
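To confirm whether the client currently holds the role on the NSG, the role assignments can be listed with the Azure CLI. This is a hedged sketch: list_roles_on_scope and has_network_contributor are hypothetical helper names, the IDs are placeholders, and a custom role that contains the join/action permission would also satisfy the requirement without appearing as Network Contributor.

```shell
# Sketch only: all IDs are placeholders, not values from this article.

# Print the role names assigned to a client on a scope, one per line.
# --include-inherited also reports assignments made on parent scopes
# (for example the resource group or subscription).
list_roles_on_scope() {
  # $1 = client (object) ID, $2 = NSG resource ID
  az role assignment list --assignee "$1" --scope "$2" --include-inherited \
    --query "[].roleDefinitionName" --output tsv
}

# Read role names on stdin; succeed only if one line is exactly
# "Network Contributor".
has_network_contributor() {
  grep -qx 'Network Contributor'
}

# Example call (requires a logged-in Azure CLI session):
# list_roles_on_scope "<client-object-id>" "<nsg-resource-id>" \
#   | has_network_contributor && echo "Network Contributor is assigned"
```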

Diagnostic Steps

  • Check the pods in the openshift-azure-logging namespace.
$ oc get pods -n openshift-azure-logging
  • Check the events and pod logs of the mdsd pods.
$ oc get events -n openshift-azure-logging
$ oc logs mdsd-xxxxx -n openshift-azure-logging
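The output of the commands above can be narrowed down with standard shell tools. The sketch below assumes the column layout shown in the Issue section (RESTARTS in the fourth column); high_restart_pods and count_forward_errors are hypothetical helper names, and the threshold of 100 is an arbitrary example value.

```shell
# Sketch only: helper names and the threshold are illustrative.

# Read `oc get pods` output on stdin and print the names of pods whose
# restart count exceeds the threshold given as $1 (default 100).
high_restart_pods() {
  awk -v t="${1:-100}" 'NR > 1 && $4 + 0 > t + 0 { print $1 }'
}

# Read pod logs on stdin and print how many lines contain the forward
# header error. grep exits non-zero when the count is 0.
count_forward_errors() {
  grep -c 'could not write forward header'
}

# Example calls:
# oc get pods -n openshift-azure-logging | high_restart_pods 100
# oc logs mdsd-xxxxx -n openshift-azure-logging --all-containers | count_forward_errors
```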

