Chapter 4. Operational Management

With the successful deployment of OpenShift, the following section demonstrates how to confirm proper functionality of the Red Hat OpenShift Container Platform installation.

Note

The following subsections are from OpenShift Documentation - Diagnostics Tool site. For the latest version of this section, reference the link directly.

4.1. oc adm diagnostics

The oc adm diagnostics command runs a series of checks for error conditions in the host or cluster. Specifically, it:

  • Verifies that the default registry and router are running and correctly configured.
  • Checks ClusterRoleBindings and ClusterRoles for consistency with base policy.
  • Checks that all of the client configuration contexts are valid and can be connected to.
  • Checks that SkyDNS is working properly and the pods have SDN connectivity.
  • Validates master and node configuration on the host.
  • Checks that nodes are running and available.
  • Analyzes host logs for known errors.
  • Checks that systemd units are configured as expected for the host.

4.2. Using the Diagnostics Tool

Red Hat OpenShift Container Platform may be deployed in numerous scenarios including:

  • built from source
  • included within a VM image
  • as a container image
  • via enterprise RPMs

Each method implies a different configuration and environment. The diagnostics were included within openshift binary to minimize environment assumptions and provide the ability to run the diagnostics tool within an Red Hat OpenShift Container Platform server or client.

To use the diagnostics tool, preferably on a master host and as cluster administrator, run a sudo user:

$ sudo oc adm diagnostics

The above command runs all available diagnostis skipping any that do not apply to the environment.

The diagnostics tool has the ability to run one or multiple specific diagnostics via name or as an enabler to address issues within the Red Hat OpenShift Container Platform environment. For example:

$ sudo oc adm diagnostics <name1> <name2>

The options provided by the diagnostics tool require working configuration files. For example, the NodeConfigCheck does not run unless a node configuration is readily available.

Diagnostics verifies that the configuration files reside in their standard locations unless specified with flags (respectively, --config, --master-config, and --node-config)

The standard locations are listed below:

  • Client:

    • As indicated by the $KUBECONFIG environment variable variable
    • ~/.kube/config file
  • Master:

    • /etc/origin/master/master-config.yaml
  • Node:

    • /etc/origin/node/node-config.yaml

If a configuration file is not found or specified, related diagnostics are skipped.

Available diagnostics include:

Diagnostic NamePurpose

AggregatedLogging

Check the aggregated logging integration for proper configuration and operation.

AnalyzeLogs

Check systemd service logs for problems. Does not require a configuration file to check against.

ClusterRegistry

Check that the cluster has a working Docker registry for builds and image streams.

ClusterRoleBindings

Check that the default cluster role bindings are present and contain the expected subjects according to base policy.

ClusterRoles

Check that cluster roles are present and contain the expected permissions according to base policy.

ClusterRouter

Check for a working default router in the cluster.

ConfigContexts

Check that each context in the client configuration is complete and has connectivity to its API server.

DiagnosticPod

Creates a pod that runs diagnostics from an application standpoint, which checks that DNS within the pod is working as expected and the credentials for the default service account authenticate correctly to the master API.

EtcdWriteVolume

Check the volume of writes against etcd for a time period and classify them by operation and key. This diagnostic only runs if specifically requested, because it does not run as quickly as other diagnostics and can increase load on etcd.

MasterConfigCheck

Check this particular hosts master configuration file for problems.

MasterNode

Check that the master node running on this host is running a node to verify that it is a member of the cluster SDN.

MetricsApiProxy

Check that the integrated Heapster metrics can be reached via the cluster API proxy.

NetworkCheck

Create diagnostic pods on multiple nodes to diagnose common network issues from an application standpoint. For example, this checks that pods can connect to services, other pods, and the external network.

If there are any errors, this diagnostic stores results and retrieved files in a local directory (/tmp/openshift/, by default) for further analysis. The directory can be specified with the --network-logdir flag.

NodeConfigCheck

Checks this particular hosts node configuration file for problems.

NodeDefinitions

Check that the nodes defined in the master API are ready and can schedule pods.

RouteCertificateValidation

Check all route certificates for those that might be rejected by extended validation.

ServiceExternalIPs

Check for existing services that specify external IPs, which are disallowed according to master configuration.

UnitStatus

Check systemd status for units on this host related to Red Hat OpenShift Container Platform. Does not require a configuration file to check against.

4.3. Running Diagnostics in a Server Environment

An Ansible-deployed cluster provides additional diagnostic benefits for nodes within Red Hat OpenShift Container Platform cluster due to:

  • Standard location for both master and node configuration files
  • Systemd units are created and configured for managing the nodes in a cluster
  • All components log to journald.

Standard location of the configuration files placed by an Ansible-deployed cluster ensures that running sudo oc adm diagnostics works without any flags. In the event, the standard location of the configuration files is not used, options flags as those listed in the example below may be used.

$ sudo oc adm diagnostics --master-config=<file_path> --node-config=<file_path>

For proper usage of the log diagnostic, systemd units and log entries within journald are required. If log entries are not using the above method, log diagnostics won’t work as expected and are intentionally skipped.

4.4. Running Diagnostics in a Client Environment

The diagnostics runs using as much access as the existing user running the diagnostic has available. The diagnostic may run as an ordinary user, a cluster-admin user or cluster-admin user.

A client with ordinary access should be able to diagnose its connection to the master and run a diagnostic pod. If multiple users or masters are configured, connections are tested for all, but the diagnostic pod only runs against the current user, server, or project.

A client with cluster-admin access available (for any user, but only the current master) should be able to diagnose the status of the infrastructure such as nodes, registry, and router. In each case, running sudo oc adm diagnostics searches for the standard client configuration file location and uses it if available.

4.5. Ansible-based Health Checks

Additional diagnostic health checks are available through the Ansible-based tooling used to install and manage Red Hat OpenShift Container Platform clusters. The reports provide common deployment problems for the current Red Hat OpenShift Container Platform installation.

These checks can be run either using the ansible-playbook command (the same method used during Advanced Installation) or as a containerized version of openshift-ansible. For the ansible-playbook method, the checks are provided by the atomic-openshift-utils RPM package.

For the containerized method, the openshift3/ose-ansible container image is distributed via the Red Hat Container Registry.

Example usage for each method are provided in subsequent sections.

The following health checks are a set of diagnostic tasks that are meant to be run against the Ansible inventory file for a deployed Red Hat OpenShift Container Platform cluster using the provided health.yml playbook.

Warning

Due to potential changes the health check playbooks could make to the environment, the playbooks should only be run against clusters that have been deployed using Ansible with the same inventory file used during deployment. The changes consist of installing dependencies in order to gather required information. In some circumstances, additional system components (i.e. docker or networking configurations) may be altered if their current state differs from the configuration in the inventory file. These health checks should only be run if the administrator does not expect the inventory file to make any changes to the existing cluster configuration.

Table 4.1. Diagnostic Health Checks

Check NamePurpose

etcd_imagedata_size

This check measures the total size of Red Hat OpenShift Container Platform image data in an etcd cluster. The check fails if the calculated size exceeds a user-defined limit. If no limit is specified, this check fails if the size of image data amounts to 50% or more of the currently used space in the etcd cluster.

A failure from this check indicates that a significant amount of space in etcd is being taken up by Red Hat OpenShift Container Platform image data, which can eventually result in etcd cluster crashing.

A user-defined limit may be set by passing the etcd_max_image_data_size_bytes variable. For example, setting etcd_max_image_data_size_bytes=40000000000 causes the check to fail if the total size of image data stored in etcd exceeds 40 GB.

etcd_traffic

This check detects higher-than-normal traffic on an etcd host. The check fails if a journalctl log entry with an etcd sync duration warning is found.

For further information on improving etcd performance, see Recommended Practices for Red Hat OpenShift Container Platform etcd Hosts and the Red Hat Knowledgebase.

etcd_volume

This check ensures that the volume usage for an etcd cluster is below a maximum user-specified threshold. If no maximum threshold value is specified, it is defaulted to 90% of the total volume size.

A user-defined limit may be set by passing the etcd_device_usage_threshold_percent variable.

docker_storage

Only runs on hosts that depend on the docker daemon (nodes and containerized installations). Checks that docker's total usage does not exceed a user-defined limit. If no user-defined limit is set, docker's maximum usage threshold defaults to 90% of the total size available.

The threshold limit for total percent usage can be set with a variable in the inventory file, for example max_thinpool_data_usage_percent=90.

This also checks that docker's storage is using a supported configuration.

curator, elasticsearch, fluentd, kibana

This set of checks verifies that Curator, Kibana, Elasticsearch, and Fluentd pods have been deployed and are in a running state, and that a connection can be established between the control host and the exposed Kibana URL. These checks run only if the openshift_logging_install_logging inventory variable is set to true. Ensure that they are executed in a deployment where cluster logging has been enabled.

logging_index_time

This check detects higher than normal time delays between log creation and log aggregation by Elasticsearch in a logging stack deployment. It fails if a new log entry cannot be queried through Elasticsearch within a timeout (by default, 30 seconds). The check only runs if logging is enabled.

A user-defined timeout may be set by passing the openshift_check_logging_index_timeout_seconds variable. For example, setting openshift_check_logging_index_timeout_seconds=45 causes the check to fail if a newly-created log entry is not able to be queried via Elasticsearch after 45 seconds.

Note

A similar set of checks meant to run as part of the installation process can be found in Configuring Cluster Pre-install Checks. Another set of checks for checking certificate expiration can be found in Redeploying Certificates.

4.5.1. Running Health Checks via ansible-playbook

The openshift-ansible health checks are executed using the ansible-playbook command and requires specifying the cluster’s inventory file and the health.yml playbook:

# ansible-playbook -i <inventory_file> \
    /usr/share/ansible/openshift-ansible/playbooks/openshift-checks/health.yml

In order to set variables in the command line, include the -e flag with any desired variables in key=value format. For example:

# ansible-playbook -i <inventory_file> \
    /usr/share/ansible/openshift-ansible/playbooks/openshift-checks/health.yml
    -e openshift_check_logging_index_timeout_seconds=45
    -e etcd_max_image_data_size_bytes=40000000000

To disable specific checks, include the variable openshift_disable_check with a comma-delimited list of check names in the inventory file prior to running the playbook. For example:

openshift_disable_check=etcd_traffic,etcd_volume

Alternatively, set any checks to disable as variables with -e openshift_disable_check=<check1>,<check2> when running the ansible-playbook command.

4.6. Running Health Checks via Docker CLI

The openshift-ansible playbooks may run in a Docker container avoiding the requirement for installing and configuring Ansible, on any host that can run the ose-ansible image via the Docker CLI.

This is accomplished by specifying the cluster’s inventory file and the health.yml playbook when running the following docker run command as a non-root user that has privileges to run containers:

# docker run -u `id -u` \ 1
    -v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \ 2
    -v /etc/ansible/hosts:/tmp/inventory:ro \ 3
    -e INVENTORY_FILE=/tmp/inventory \
    -e PLAYBOOK_FILE=playbooks/openshift-checks/health.yml \ 4
    -e OPTS="-v -e openshift_check_logging_index_timeout_seconds=45 -e etcd_max_image_data_size_bytes=40000000000" \ 5
    openshift3/ose-ansible
1
These options make the container run with the same UID as the current user, which is required for permissions so that the SSH key can be read inside the container (SSH private keys are expected to be readable only by their owner).
2
Mount SSH keys as a volume under /opt/app-root/src/.ssh under normal usage when running the container as a non-root user.
3
Change /etc/ansible/hosts to the location of the cluster’s inventory file, if different. This file is bind-mounted to /tmp/inventory, which is used according to the INVENTORY_FILE environment variable in the container.
4
The PLAYBOOK_FILE environment variable is set to the location of the health.yml playbook relative to /usr/share/ansible/openshift-ansible inside the container.
5
Set any variables desired for a single run with the -e key=value format.

In the above command, the SSH key is mounted with the :Z flag so that the container can read the SSH key from its restricted SELinux context. This ensures the original SSH key file is relabeled similarly to system_u:object_r:container_file_t:s0:c113,c247. For more details about :Z, see the docker-run(1) man page.

It is important to note these volume mount specifications because it could have unexpected consequences. For example, if one mounts (and therefore relabels) the $HOME/.ssh directory, sshd becomes unable to access the public keys to allow remote login. To avoid altering the original file labels, mounting a copy of the SSH key (or directory) is recommended.

It is plausible to want to mount an entire .ssh directory for various reasons. For example, this enables the ability to use an SSH configuration to match keys with hosts or modify other connection parameters. It could also allow a user to provide a known_hosts file and have SSH validate host keys, which is disabled by the default configuration and can be re-enabled with an environment variable by adding -e ANSIBLE_HOST_KEY_CHECKING=True to the docker command line.