How to generate a sosreport within nodes without SSH in OCP 4


Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Red Hat Enterprise Linux CoreOS (RHCOS)
  • sosreport

Issue

  • What is the recommended way for generating a sosreport in Red Hat OpenShift Container Platform?
  • By default, it may not be possible to connect to OpenShift 4 nodes via SSH from outside the cluster, but sosreport (or other machine binaries) may need to be run for troubleshooting purposes.

Resolution

By design, OpenShift 4 nodes are immutable and rely on ClusterOperators to apply changes. This means that accessing the underlying nodes directly over SSH is not the recommended procedure. Additionally, any node accessed over SSH is annotated as accessed (see the Diagnostic Steps section).

Note: This solution relies on the oc debug node/<node_name> command. Under specific circumstances this command may fail, for example if kubelet is not running properly on the target node. In that case, consider the alternatives in the section Other ways to generate a sosreport in OpenShift 4.

Generating a sosreport with oc debug node command

The following example shows how to debug node "node-1":

  • First, display the list of nodes in the cluster:

    $ oc get nodes
    NAME      STATUS   ROLES    AGE    VERSION
    node-1    Ready    master   119d   v1.14.6+8e46c0036
    node-2    Ready    worker   119d   v1.14.6+8e46c0036
    [...]
    
  • Then, create a debug session with oc debug node/<node_name> (in this example, oc debug node/node-1). The debug session spawns a pod using the tools image from the release payload (which does not contain sos):

    $ oc debug node/node-1
    Starting pod/node-1-debug ...
    To use host binaries, run `chroot /host`
    Pod IP: 10.0.0.11
    If you don't see a command prompt, try pressing enter.
    sh-4.4# cat /etc/redhat-release 
    Red Hat Enterprise Linux Server release 7.7 (Maipo)
    sh-4.4#
    
  • Once in the debug session, use chroot to change the apparent root directory to that of the underlying host:

    sh-4.4# chroot /host bash
    [root@node /]#  cat /etc/redhat-release 
    Red Hat Enterprise Linux CoreOS release 4.12
    [root@node /]# 
    

    Note: in disconnected environments, the registry.redhat.io/rhel9/support-tools image must be mirrored. If the image is already available to the nodes, create a /root/.toolboxrc file on the node as follows before running toolbox (replace the REGISTRY value with the URL of your registry, and the IMAGE value with the image name in the custom registry):

    [root@node ~]# vi /root/.toolboxrc
    REGISTRY=[custom-private-registry.example.com:5000]
    IMAGE=rhel9/support-tools
    
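    The same file can also be created non-interactively with a heredoc. This is only a sketch: the registry URL below is a placeholder, and it assumes the command is run as root inside the chroot (where $HOME is /root):

    ```shell
    # Create the toolbox configuration in one step. The REGISTRY value is a
    # placeholder for your mirror; adjust IMAGE if the mirrored name differs.
    cat > "$HOME/.toolboxrc" <<'EOF'
    REGISTRY=custom-private-registry.example.com:5000
    IMAGE=rhel9/support-tools
    EOF
    ```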
  • Apply any Proxy variables to the current session, if applicable:

    [root@node /]# export HTTP_PROXY=http://<username>:<pswd>@<ip>:<port>
    [root@node /]# export HTTPS_PROXY=http://<username>:<pswd>@<ip>:<port>
    
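    When the same proxy serves both protocols, the URL can be assembled once and exported for both variables. The values below are placeholders, not real credentials:

    ```shell
    # Placeholder proxy details; substitute your real username, password,
    # address and port before use.
    PROXY_USER=user PROXY_PASS=secret PROXY_HOST=10.0.0.5 PROXY_PORT=3128
    PROXY_URL="http://${PROXY_USER}:${PROXY_PASS}@${PROXY_HOST}:${PROXY_PORT}"
    # Export both variables from the single assembled URL.
    export HTTP_PROXY="$PROXY_URL" HTTPS_PROXY="$PROXY_URL"
    ```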
  • Now, run the toolbox command to start a special container with all the necessary binaries:

    [root@node /]# toolbox
    Trying to pull registry.redhat.io/rhel9/support-tools...Getting image source signatures
    Copying blob fd8daf2668d1 done
    Copying blob 1457434f891b done
    Copying blob cb3c77f9bdd8 done
    Copying config 517597590f done
    Writing manifest to image destination
    Storing signatures
    517597590ff4236b0e5e3efce75d88b2b238c19a58903f59a018fc4a40cd6cce
    Spawning a container 'toolbox-' with image 'registry.redhat.io/rhel9/support-tools'
    Detected RUN label in the container image. Using that as the default...
    command: podman run -it --name toolbox- --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=toolbox- -e IMAGE=registry.redhat.io/rhel9/support-tools:latest -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host registry.redhat.io/rhel9/support-tools:latest
    [root@node /]#
    

    Note: If running toolbox yields the message Container 'toolbox-' already exists. Trying to start..., it is strongly recommended to remove the existing toolbox container with podman rm toolbox-. This ensures that a fresh toolbox container is spawned, which avoids issues with sosreport plugins.

  • Again, apply any proxy variables to the current session, if applicable, because this is a different shell session:

    [root@node /]# export HTTP_PROXY=http://<username>:<pswd>@<ip>:<port>
    [root@node /]# export HTTPS_PROXY=http://<username>:<pswd>@<ip>:<port>
    
  • Finally, run sos report (remove the --all-logs parameter if the generated sosreport is too large):

    [root@node /]# sos report -e openshift -k crio.all=on -k crio.logs=on -k podman.all=on -k podman.logs=on --all-logs

    sosreport (version 4.5.1)

    This command will collect diagnostic and configuration information from
    this Red Hat CoreOS system.

    An archive containing the collected information will be generated in
    /host/var/tmp/sos.idipawos and may be provided to a Red Hat support
    representative.

    Any information provided to Red Hat will be treated in accordance with
    the published support policies at:

            Distribution Website : https://www.redhat.com/
            Commercial Support   : https://www.access.redhat.com/

    The generated archive may contain data considered sensitive and its
    content should be reviewed by the originating organization before being
    passed to any third party.

    No changes will be made to system configuration.

    Press ENTER to continue, or CTRL-C to quit.

    Optionally, please enter the case id that you are generating this report for []: 01234567

     Setting up archive ...
     Setting up plugins ...
    [...]
     Running plugins. Please wait ...

      Finishing plugins              [Running: networking]
      Finished running plugins
    Creating compressed archive...

    Your sosreport has been generated and saved in:
        /host/var/tmp/sosreport-node-1.tar.xz

     Size   26.23MiB
     Owner  root
     sha256 64dc2efa6f25c16f1bae9d596f291d899b875a16e0a945bc973387a3fb84382d

    Please send this file to your support representative.

    [root@node /]#

    Note: if any of the plugins time out, or not all the information is collected, it may be necessary to add the --plugin-timeout=600 parameter to increase the plugin timeout.
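The steps above can also be driven from outside the cluster in a single command. The helper below is only a sketch: it assembles (but does not execute) the one-shot command line, assumes a toolbox version that accepts a command to run inside the container, and uses a placeholder node name and case id:

```shell
# Build a one-shot command that runs sos report on a node via 'oc debug'.
# --batch answers the sos prompts automatically; --case-id attaches the case
# number that would otherwise be entered interactively.
sos_cmd() {
  local node="$1" case_id="$2"
  printf 'oc debug node/%s -- chroot /host toolbox sos report --batch --case-id %s -e openshift -k crio.all=on -k crio.logs=on -k podman.all=on -k podman.logs=on' \
    "$node" "$case_id"
}

# Print the command for review; copy-paste (or pipe to sh) to execute it.
sos_cmd node-1 01234567
```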

What options are available to copy/share the generated sosreport?

Refer to How to provide an sosreport from a RHEL CoreOS OpenShift 4 node.

Other ways to generate a sosreport in OpenShift 4

  • It is possible to log into the node directly via SSH and take a sosreport. Check How to generate a sos report in Red Hat Enterprise Linux CoreOS in OpenShift 4 with SSH access to nodes for the instructions.

  • If accessing the nodes via SSH from outside the cluster is not possible, launch oc debug node/<node_name> against a different, working node, create a file containing the same private key used for the installation, and SSH into the failing node from there, for example:

    $ oc debug node/node-2
    Starting pod/node-2-debug ...
    To use host binaries, run `chroot /host`
    Pod IP: 10.0.0.12
    If you don't see a command prompt, try pressing enter.
    sh-4.4# vim key
    sh-4.4# chmod 400 key
    sh-4.4# ssh -i key -l core node-3
    Red Hat Enterprise Linux CoreOS 43.81.202003191953.0
    Part of OpenShift 4.3, RHCOS is a Kubernetes native operating system
    managed by the Machine Config Operator (`clusteroperator/machine-config`).
    WARNING: Direct SSH access to machines is not recommended; instead,
    [...]
    [core@node-3 ~]$
    

Finally, if all suggestions fail, it is possible to use a simpler script version of sosreport: Sosreport fails. What data should I provide in its place?.

Root Cause

By design, OpenShift 4 nodes are immutable and rely on ClusterOperators to apply the changes.

Diagnostic Steps

  • How to check if your nodes were externally accessed by SSH:

    $ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" - "}{.metadata.annotations.machineconfiguration\.openshift\.io/ssh}{"\n"}{end}'
    node-1 - accessed
    node-2 -
    node-3 - accessed
    
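    To extract only the nodes flagged as accessed, the command's output can be filtered with awk. In this sketch, printf supplies sample lines in place of live oc output:

    ```shell
    # Each line has the form "<node> - [accessed]"; print the node name
    # whenever the third field is set to "accessed".
    printf 'node-1 - accessed\nnode-2 -\nnode-3 - accessed\n' \
      | awk '$3 == "accessed" {print $1}'
    # prints: node-1 and node-3, one per line
    ```

    In a live cluster, replace the printf with the oc get nodes jsonpath command shown above.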
  • In order to have sosreport available, it is necessary to use the toolbox container, but some early versions failed to download the necessary registry.redhat.io/rhel8/support-tools image (even when manually providing the registry.redhat.io credentials):

    sh-4.4# toolbox
    Trying to pull registry.redhat.io/rhel8/support-tools...Failed
    error pulling image "registry.redhat.io/rhel8/support-tools": unable to pull registry.redhat.io/rhel8/support-tools: unable to pull image: Error determining manifest MIME type for docker://registry.redhat.io/rhel8/support-tools:latest: unable to retrieve auth token: invalid username/password
    Would you like to authenticate to registry: 'registry.redhat.io' and try again? [y/N] y
    Authenticating with existing credentials...
    Existing credentials are invalid, please enter valid username and password
    Username: rhn-<username>
    Password: ******
    Login Succeeded!
    Trying to pull registry.redhat.io/rhel8/support-tools...Failed
    error pulling image "registry.redhat.io/rhel8/support-tools": unable to pull registry.redhat.io/rhel8/support-tools: unable to pull image: Error determining manifest MIME type for docker://registry.redhat.io/rhel8/support-tools:latest: unable to retrieve auth token: invalid username/password
    
  • In order to solve this, it is possible to simply pull the image first on the node as follows (only needed once per node):

    sh-4.4# podman login registry.redhat.io
    Authenticating with existing credentials...
    Existing credentials are invalid, please enter valid username and password
    Username: rhn-<username>
    Password: ******
    Login Succeeded!
    sh-4.4# 
    sh-4.4# podman pull registry.redhat.io/rhel8/support-tools
    Trying to pull registry.redhat.io/rhel8/support-tools...Getting image source signatures
    Copying blob e61d8721e62e: 0 B / 67.75 MiB [-----------------------------------]
    Copying blob e61d8721e62e: 8.71 MiB / 67.75 MiB [===>--------------------------]
    Copying blob e61d8721e62e: 67.65 MiB / 67.75 MiB [=============================]
    Copying blob e61d8721e62e: 67.75 MiB / 67.75 MiB [=========================] 20s
    Copying blob c585fd5093c6: 1.47 KiB / 1.47 KiB [===========================] 20s
    Copying blob 77392c39ffcb: 8.67 MiB / 8.67 MiB [===========================] 20s
    Copying config 23a6cff4874d: 4.36 KiB / 4.36 KiB [==========================] 0s
    Writing manifest to image destination
    Storing signatures
    23a6cff4874d03f84c7a787557b693afd58a1fb1f1123d5c9d254f785771c8fa
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
