Prometheus cannot scrape Fluentd collector metrics in a dual-stack cluster

Solution Verified - Updated -

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4.12+
  • Red Hat OpenShift Logging (RHOL)
    • 5.8+
  • Fluentd

Issue

  • Prometheus cannot scrape Fluentd collector metrics in a dual-stack cluster.
  • The "CollectorNodeDown" alert fires continuously for all Fluentd collectors in a dual-stack cluster.

Resolution

  • The issue has been reported to the engineering team as a bug and is tracked in LOG-5106.
  • The following workarounds are available:

    • Workaround 1:

    • Workaround 2:

      • Downgrade RHOL to version 5.7, as the issue is encountered only in version 5.8.
    • Workaround 3:

      This workaround requires setting the ClusterLogging resource to the Unmanaged state. Before proceeding, read the considerations for the Unmanaged state in the documentation section "Unsupported configurations".

      • Step 1: Set the clusterlogging instance to Unmanaged:
      $ oc -n openshift-logging patch clusterlogging/instance -p '{"spec":{"managementState": "Unmanaged"}}' --type=merge
      
      • Step 2: Take a backup of the collector-config configmap:
      $ oc -n openshift-logging get cm collector-config -oyaml > collector-config.bkp
      
      • Step 3: In the collector-config configmap, change the line bind "#{ENV['PROM_BIND_IP']}" to bind "0.0.0.0":
      $ oc edit cm collector-config -n openshift-logging
      
      • Step 4: Save the configmap and restart the collector pods by deleting them (note the -l label selector):
      $ oc delete pods -l component=collector -n openshift-logging
      
  • For more information, please open a new support case with Red Hat Support.
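The edit in Step 3 of Workaround 3 can also be expressed as a text filter, which makes the change easier to review before it is applied. The following is a minimal sketch, not part of the original solution: the function name rewrite_bind is hypothetical, and it assumes the line in the configmap reads exactly bind "#{ENV['PROM_BIND_IP']}" (Ruby-style interpolation, no space after the #).

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not from the original article):
# reads configuration text on stdin and replaces the Prometheus bind line
# with a fixed IPv4 wildcard address. All other lines pass through unchanged.
rewrite_bind() {
  sed "s/bind \"#{ENV\['PROM_BIND_IP']}\"/bind \"0.0.0.0\"/"
}

# Demonstration on a sample line; no cluster access is needed for this part.
printf '%s\n' "  bind \"#{ENV['PROM_BIND_IP']}\"" | rewrite_bind
```

With the clusterlogging instance already Unmanaged and the backup from Step 2 in place, one possible way to use the filter is to pipe the output of oc -n openshift-logging get cm collector-config -o yaml through rewrite_bind and into oc -n openshift-logging replace -f -. Comparing the filtered output against the backup with diff first is advisable, since the sketch matches only that one exact line.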

Diagnostic Steps

  • Check with ss whether the collector pods are listening on port 24231 over IPv6:

    $ oc -n openshift-logging get pods -l component=collector -o=custom-columns=:metadata.name --no-headers | xargs -r -I {} oc -n openshift-logging exec {} -c collector -- bash -c 'echo -n {}:; ss -6 -lt | grep 24231;'
    collector-6pd7j:LISTEN 0      4096            [::]:24231         [::]:*
    collector-9wv7l:LISTEN 0      4096            [::]:24231         [::]:*
    collector-czc5r:LISTEN 0      4096            [::]:24231         [::]:*
    
  • Check with ss whether the collector pods are listening on port 24231 over IPv4. grep exits with code 1 when nothing matches, so the output below means that no IPv4 listener exists:

    $ oc -n openshift-logging get pods -l component=collector -o=custom-columns=:metadata.name --no-headers | xargs -r -I {} oc -n openshift-logging exec {} -c collector -- bash -c 'echo -n {}:; ss -4 -lt | grep 24231;'
    collector-6pd7j:command terminated with exit code 1
    collector-9wv7l:command terminated with exit code 1
    collector-czc5r:command terminated with exit code 1
    
  • Log collection itself works; the alert fires because Prometheus cannot connect to the collector metrics endpoint:

    $ oc project openshift-monitoring
    $ oc rsh prometheus-k8s-0
    sh-4.4$ curl -kv https://x.x.x.x:24231/metrics
    Trying x.x.x.x...
    TCP_NODELAY set
    connect to x.x.x.x port 24231 failed: Connection refused
    Failed to connect to x.x.x.x port 24231: Connection refused
    Closing connection 0
    curl: (7) Failed to connect to x.x.x.x port 24231: Connection refused 
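
The IPv4 and IPv6 listener checks above can be combined into one loop that prints an explicit result per pod and address family. This is a sketch under stated assumptions rather than a supported tool: the function name check_collector_listeners is hypothetical, and the use of -o name and of ss -n (numeric output) are choices made here, not taken from the original commands.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not from the original article):
# for every collector pod, report whether port 24231 has a TCP listener
# on IPv4 and on IPv6. Requires a logged-in oc client when invoked.
check_collector_listeners() {
  local ns="openshift-logging" port="24231" pod fam
  # "-o name" prints entries such as pod/collector-6pd7j, which oc exec accepts.
  for pod in $(oc -n "$ns" get pods -l component=collector -o name); do
    for fam in 4 6; do
      # ss -4 or -6 restricts output to one address family; -l listening, -n numeric, -t TCP.
      if oc -n "$ns" exec "$pod" -c collector -- ss "-$fam" -lnt | grep -q ":$port "; then
        echo "$pod IPv$fam: listening"
      else
        echo "$pod IPv$fam: not listening"
      fi
    done
  done
}
```

Calling check_collector_listeners on an affected cluster would be expected to report "not listening" for IPv4 and "listening" for IPv6 on each collector pod, which matches the ss output shown in the diagnostic steps.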
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
