
Chapter 4. Operational Management

With OpenShift Container Platform successfully deployed, this chapter demonstrates how to confirm that the Red Hat OpenShift Container Platform environment is functioning properly.

4.1. SSH configuration

Optionally, to simplify connecting to the VMs, the following SSH configuration file can be placed on the workstation from which the SSH commands will be run:

$ cat /home/<user>/.ssh/config

Host bastion
     HostName                 <resourcegroup>b.<region>.cloudapp.azure.com
     User                     <user>
     StrictHostKeyChecking    no
     ProxyCommand             none
     CheckHostIP              no
     ForwardAgent             yes
     IdentityFile             /home/<user>/.ssh/id_rsa

Host master? infranode? node??
     ProxyCommand             ssh <user>@bastion -W %h:%p
     user                     <user>
     IdentityFile             /home/<user>/.ssh/id_rsa

With this configuration in place, connecting to any VM requires only its hostname:

$ ssh infranode3

4.2. Gathering hostnames

With all of the steps that occur during the installation of OpenShift Container Platform, it is possible to lose track of the names of the instances in the recently deployed environment. One option to get these hostnames is to browse to the Azure Resource Group dashboard and select Overview. The filter shows all instances relating to the reference architecture deployment.
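
The same information can also be gathered from the command line. The following is a minimal sketch using the classic Azure CLI (the Node.js azure command used elsewhere in this chapter); it assumes the CLI is already logged in and switched to Resource Manager mode:

$ azure config mode arm
$ azure vm list <resourcegroup>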

To help facilitate the steps in Chapter 4, Operational Management, the following hostnames are used.

  • master1
  • master2
  • master3
  • infranode1
  • infranode2
  • infranode3
  • node01
  • node02
  • node03

4.3. Running Diagnostics

To run diagnostics, SSH into the first master node (master1) via the bastion host using the admin user specified in the template:

$ ssh <user>@<resourcegroup>b.<region>.cloudapp.azure.com
$ ssh <user>@master1
$ sudo -i

Connectivity to the first master node (master1.<region>.cloudapp.azure.com) as the root user should now be established. Run the diagnostics that are included as part of the OpenShift Container Platform installation:

# oadm diagnostics
[Note] Determining if client configuration exists for client/cluster diagnostics
Info:  Successfully read a client config file at '/root/.kube/config'
Info:  Using context for cluster-admin access: 'default/sysdeseng-westus-cloudapp-azure-com:8443/system:admin'
[Note] Performing systemd discovery

[Note] Running diagnostic: ConfigContexts[default/sysdeseng-westus-cloudapp-azure-com:8443/system:admin]
       Description: Validate client config context is complete and has connectivity

Info:  The current client config context is 'default/sysdeseng-westus-cloudapp-azure-com:8443/system:admin':
       The server URL is 'https://sysdeseng.westus.cloudapp.azure.com:8443'
       The user authentication is 'system:admin/sysdeseng-westus-cloudapp-azure-com:8443'
       The current project is 'default'
       Successfully requested project list; has access to project(s):
         [default gsw kube-system logging management-infra openshift openshift-infra]

[Note] Running diagnostic: DiagnosticPod
       Description: Create a pod to run diagnostics from the application standpoint

       [Note] Running diagnostic: PodCheckDns
              Description: Check that DNS within a pod works as expected

       [Note] Summary of diagnostics execution (version v3.6.5.5):
       [Note] Warnings seen: 0
       [Note] Errors seen: 0

[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint

       [Note] Running diagnostic: CheckExternalNetwork
              Description: Check that external network is accessible within a pod

       [Note] Running diagnostic: CheckNodeNetwork
              Description: Check that pods in the cluster can access its own node.

       [Note] Running diagnostic: CheckPodNetwork
              Description: Check pod to pod communication in the cluster. In case of ovs-subnet network plugin, all pods should be able to communicate with each other and in case of multitenant network plugin, pods in non-global projects should be isolated and pods in global projects should be able to access any pod in the cluster and vice versa.

       [Note] Running diagnostic: CheckServiceNetwork
              Description: Check pod to service communication in the cluster. In case of ovs-subnet network plugin, all pods should be able to communicate with all services and in case of multitenant network plugin, services in non-global projects should be isolated and pods in global projects should be able to access any service in the cluster.

       [Note] Running diagnostic: CollectNetworkInfo
              Description: Collect network information in the cluster.

       [Note] Summary of diagnostics execution (version v3.6.5.5):
       [Note] Warnings seen: 0


       [Note] Running diagnostic: CheckNodeNetwork
              Description: Check that pods in the cluster can access its own node.

       [Note] Running diagnostic: CheckPodNetwork
              Description: Check pod to pod communication in the cluster. In case of ovs-subnet network plugin, all pods should be able to communicate with each other and in case of multitenant network plugin, pods in non-global projects should be isolated and pods in global projects should be able to access any pod in the cluster and vice versa.

       [Note] Running diagnostic: CheckServiceNetwork
              Description: Check pod to service communication in the cluster. In case of ovs-subnet network plugin, all pods should be able to communicate with all services and in case of multitenant network plugin, services in non-global projects should be isolated and pods in global projects should be able to access any service in the cluster.

       [Note] Running diagnostic: CollectNetworkInfo
              Description: Collect network information in the cluster.

       [Note] Summary of diagnostics execution (version v3.6.5.5):
       [Note] Warnings seen: 0


       [Note] Running diagnostic: CheckNodeNetwork
              Description: Check that pods in the cluster can access its own node.

       [Note] Running diagnostic: CheckPodNetwork
              Description: Check pod to pod communication in the cluster. In case of ovs-subnet network plugin, all pods should be able to communicate with each other and in case of multitenant network plugin, pods in non-global projects should be isolated and pods in global projects should be able to access any pod in the cluster and vice versa.

       [Note] Running diagnostic: CheckServiceNetwork
              Description: Check pod to service communication in the cluster. In case of ovs-subnet network plugin, all pods should be able to communicate with all services and in case of multitenant network plugin, services in non-global projects should be isolated and pods in global projects should be able to access any service in the cluster.

       [Note] Running diagnostic: CollectNetworkInfo
              Description: Collect network information in the cluster.

       [Note] Summary of diagnostics execution (version v3.6.5.5):
       [Note] Warnings seen: 0

[Note] Skipping diagnostic: AggregatedLogging
       Description: Check aggregated logging integration for proper configuration
       Because: No LoggingPublicURL is defined in the master configuration

[Note] Running diagnostic: ClusterRegistry
       Description: Check that there is a working Docker registry

[Note] Running diagnostic: ClusterRoleBindings
       Description: Check that the default ClusterRoleBindings are present and contain the expected subjects

Info:  clusterrolebinding/cluster-readers has more subjects than expected.

       Use the oadm policy reconcile-cluster-role-bindings command to update the role binding to remove extra subjects.

Info:  clusterrolebinding/cluster-readers has extra subject {ServiceAccount management-infra management-admin    }.
Info:  clusterrolebinding/cluster-readers has extra subject {ServiceAccount default router    }.

Info:  clusterrolebinding/self-provisioners has more subjects than expected.

       Use the oadm policy reconcile-cluster-role-bindings command to update the role binding to remove extra subjects.

Info:  clusterrolebinding/self-provisioners has extra subject {ServiceAccount management-infra management-admin    }.

[Note] Running diagnostic: ClusterRoles
       Description: Check that the default ClusterRoles are present and contain the expected permissions

[Note] Running diagnostic: ClusterRouterName
       Description: Check there is a working router

[Note] Running diagnostic: MasterNode
       Description: Check if master is also running node (for Open vSwitch)

WARN:  [DClu3004 from diagnostic MasterNode@openshift/origin/pkg/diagnostics/cluster/master_node.go:164]
       Unable to find a node matching the cluster server IP.
       This may indicate the master is not also running a node, and is unable
       to proxy to pods over the Open vSwitch SDN.

[Note] Skipping diagnostic: MetricsApiProxy
       Description: Check the integrated heapster metrics can be reached via the API proxy
       Because: The heapster service does not exist in the openshift-infra project at this time,
       so it is not available for the Horizontal Pod Autoscaler to use as a source of metrics.

[Note] Running diagnostic: NodeDefinitions
       Description: Check node records on master

WARN:  [DClu0003 from diagnostic NodeDefinition@openshift/origin/pkg/diagnostics/cluster/node_definitions.go:112]
       Node master1 is ready but is marked Unschedulable.
       This is usually set manually for administrative reasons.
       An administrator can mark the node schedulable with:
           oadm manage-node master1 --schedulable=true

       While in this state, pods should not be scheduled to deploy on the node.
       Existing pods will continue to run until completed or evacuated (see
       other options for 'oadm manage-node').

WARN:  [DClu0003 from diagnostic NodeDefinition@openshift/origin/pkg/diagnostics/cluster/node_definitions.go:112]
       Node master2 is ready but is marked Unschedulable.
       This is usually set manually for administrative reasons.
       An administrator can mark the node schedulable with:
           oadm manage-node master2 --schedulable=true

       While in this state, pods should not be scheduled to deploy on the node.
       Existing pods will continue to run until completed or evacuated (see
       other options for 'oadm manage-node').

WARN:  [DClu0003 from diagnostic NodeDefinition@openshift/origin/pkg/diagnostics/cluster/node_definitions.go:112]
       Node master3 is ready but is marked Unschedulable.
       This is usually set manually for administrative reasons.
       An administrator can mark the node schedulable with:
           oadm manage-node master3 --schedulable=true

       While in this state, pods should not be scheduled to deploy on the node.
       Existing pods will continue to run until completed or evacuated (see
       other options for 'oadm manage-node').

[Note] Running diagnostic: ServiceExternalIPs
       Description: Check for existing services with ExternalIPs that are disallowed by master config

[Note] Running diagnostic: AnalyzeLogs
       Description: Check for recent problems in systemd service logs

Info:  Checking journalctl logs for 'atomic-openshift-node' service
Info:  Checking journalctl logs for 'docker' service

[Note] Running diagnostic: MasterConfigCheck
       Description: Check the master config file

WARN:  [DH0005 from diagnostic MasterConfigCheck@openshift/origin/pkg/diagnostics/host/check_master_config.go:52]
       Validation of master config file '/etc/origin/master/master-config.yaml' warned:
       assetConfig.loggingPublicURL: Invalid value: "": required to view aggregated container logs in the console
       assetConfig.metricsPublicURL: Invalid value: "": required to view cluster metrics in the console
       auditConfig.auditFilePath: Required value: audit can now be logged to a separate file

[Note] Running diagnostic: NodeConfigCheck
       Description: Check the node config file

Info:  Found a node config file: /etc/origin/node/node-config.yaml

[Note] Running diagnostic: UnitStatus
       Description: Check status for related systemd units

[Note] Summary of diagnostics execution (version v3.6.5.5):
[Note] Warnings seen: 5
[Note] Errors seen: 0
Note

The warnings will not cause issues in the environment.

Based on the results of the diagnostics, actions can be taken to alleviate any issues.
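
For example, the informational messages about extra subjects in the cluster-readers and self-provisioners role bindings reference the oadm policy reconcile-cluster-role-bindings command. The following is a hedged sketch of reviewing, without applying, what that command would change; in this reference architecture the extra subjects are expected, so applying the change is optional:

# oadm policy reconcile-cluster-role-bindings

Appending --confirm applies the changes the command reports.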

4.4. Checking the Health of etcd

This section focuses on the etcd cluster. It describes the different commands to ensure the cluster is healthy. The internal DNS names of the nodes running etcd must be used.

SSH into the first master node (master1) as before:

$ ssh <user>@<resourcegroup>b.<region>.cloudapp.azure.com
$ ssh <user>@master1
$ sudo -i

Using the hostnames of the etcd members (as reported by the hostname command on each master), issue the etcdctl command to confirm that the cluster is healthy:

# etcdctl --endpoints https://master1:2379,https://master2:2379,https://master3:2379 --ca-file /etc/etcd/ca.crt --cert-file=/etc/origin/master/master.etcd-client.crt --key-file=/etc/origin/master/master.etcd-client.key cluster-health
member 82c895b7b0de4330 is healthy: got healthy result from https://10.0.0.4:2379
member c8e7ac98bb93fe8c is healthy: got healthy result from https://10.0.0.5:2379
member f7bbfc4285f239ba is healthy: got healthy result from https://10.0.0.6:2379
Note

In this configuration the etcd services are distributed among the OpenShift Container Platform master nodes.
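
In addition to cluster-health, the etcd membership can be listed with the same connection options; a minimal sketch, reusing the certificate paths from the command above:

# etcdctl --endpoints https://master1:2379,https://master2:2379,https://master3:2379 --ca-file /etc/etcd/ca.crt --cert-file=/etc/origin/master/master.etcd-client.crt --key-file=/etc/origin/master/master.etcd-client.key member list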

4.5. Default Node Selector

As explained in the Nodes section, node labels are an important part of the OpenShift Container Platform environment. In the reference architecture installation, the default node selector is set to role=app in /etc/origin/master/master-config.yaml on all of the master nodes. This configuration parameter is set on all masters during the installation of OpenShift Container Platform.

SSH into the first master node (master1) to verify the defaultNodeSelector is defined.

$ ssh <user>@<resourcegroup>b.<region>.cloudapp.azure.com
$ ssh <user>@master1
$ sudo -i
# vi /etc/origin/master/master-config.yaml
... [OUTPUT ABBREVIATED] ...
projectConfig:
  defaultNodeSelector: "role=app"
  projectRequestMessage: ""
  projectRequestTemplate: ""
... [OUTPUT ABBREVIATED] ...
Note

If any changes are made to the master configuration, the master API service must be restarted or the configuration change will not take effect. Any changes and the subsequent restart must be performed on all masters.
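
As an illustration only, on an HA deployment such as this reference architecture the restart on each master would look similar to the following sketch; it assumes the native HA master services (atomic-openshift-master-api and atomic-openshift-master-controllers) are in use:

# systemctl restart atomic-openshift-master-api
# systemctl restart atomic-openshift-master-controllers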

4.6. Management of Maximum Pod Size

Quotas are set on ephemeral volumes within pods to prevent a pod from becoming too large and impacting the node. There are three places where sizing restrictions should be set. When persistent volume claims are not used, a pod can grow as large as the underlying filesystem allows. The required modifications are applied automatically during installation.

OpenShift Volume Quota

At launch time a script creates an XFS partition on the block device, adds an entry to /etc/fstab, and mounts the volume with the gquota option. If gquota is not set, the OpenShift Container Platform node will not start with the perFSGroup parameter defined below. This disk and configuration is applied to the master, infrastructure, and application nodes.

SSH into the first infrastructure node (infranode1) to verify the entry exists within /etc/fstab:

$ ssh <user>@<resourcegroup>b.<region>.cloudapp.azure.com
$ ssh <user>@infranode1
$ grep "/var/lib/origin/openshift.local.volumes" /etc/fstab
/dev/sdc1 /var/lib/origin/openshift.local.volumes xfs gquota 0 0
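
To additionally confirm that group quotas are active on the mounted volume, the xfs_quota expert mode report can be consulted; a minimal sketch, assuming the mount point shown above:

$ sudo xfs_quota -x -c 'report -g' /var/lib/origin/openshift.local.volumes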

OpenShift Emptydir Quota

During installation a value for perFSGroup is set within the node configuration. The perFSGroup setting restricts the ephemeral emptyDir volume from growing larger than 512Mi. This emptyDir quota is done on the master, infrastructure, and application nodes.

SSH into the first infrastructure node (infranode1) to verify /etc/origin/node/node-config.yaml matches the information below.

$ ssh <user>@<resourcegroup>b.<region>.cloudapp.azure.com
$ ssh <user>@infranode1
$ sudo grep -B2 perFSGroup /etc/origin/node/node-config.yaml
volumeConfig:
  localQuota:
     perFSGroup: 512Mi

Docker Storage Setup

The /etc/sysconfig/docker-storage-setup file is created at launch time by the bash script on every node. This file tells the Docker service to use a specific volume group for containers. Docker storage setup is performed on all master, infrastructure, and application nodes.

SSH into the first infrastructure node (infranode1) to verify /etc/sysconfig/docker-storage-setup matches the information below.

$ ssh <user>@<resourcegroup>b.<region>.cloudapp.azure.com
$ ssh <user>@infranode1
$ cat /etc/sysconfig/docker-storage-setup
DEVS=/dev/sdd
VG=docker_vol
DATA_SIZE=95%VG
STORAGE_DRIVER=overlay2
CONTAINER_ROOT_LV_NAME=dockerlv
CONTAINER_ROOT_LV_MOUNT_PATH=/var/lib/docker
CONTAINER_ROOT_LV_SIZE=100%FREE

4.7. Yum Repositories

In the Required Channels section, the repositories required for a successful OpenShift Container Platform installation were defined. All systems except for the bastion host should have the same repositories configured. To verify that the subscriptions match those defined in Required Channels, perform the following. The repositories below are enabled by the rhsm-repos playbook during the installation. The installation will be unsuccessful if the repositories are missing from the system.

SSH into the first infrastructure node (infranode1) and verify the command output matches the information below.

$ ssh <user>@<resourcegroup>b.<region>.cloudapp.azure.com
$ ssh <user>@infranode1
$ yum repolist
Loaded plugins: langpacks, product-id, search-disabled-repos
repo id                                  repo name                                                     status
rhel-7-fast-datapath-rpms/7Server/x86_64 Red Hat Enterprise Linux Fast Datapath (RHEL 7 Server) (RPMs) 27
rhel-7-server-extras-rpms/x86_64         Red Hat Enterprise Linux 7 Server - Extras (RPMs)             461+4
rhel-7-server-ose-3.6-rpms/x86_64        Red Hat OpenShift Container Platform 3.6 (RPMs)               437+30
rhel-7-server-rpms/7Server/x86_64        Red Hat Enterprise Linux 7 Server (RPMs)                      14.285
repolist: 15.210
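
The enabled repositories can also be cross-checked against the attached subscription; a minimal sketch using subscription-manager:

$ sudo subscription-manager repos --list-enabled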

4.8. Console Access

This section covers logging into the OpenShift Container Platform management console via the GUI and the CLI. After logging in via one of these methods, applications can be deployed and managed.

4.8.1. Log into GUI console and deploy an application

Perform the following steps from the local workstation.

Open a browser and access the OpenShift Container Platform web console at https://<resourcegroupname>.<region>.cloudapp.azure.com/console. The resourcegroupname was provided in the ARM template, and region is the Microsoft Azure region selected during the install. When logging into the OpenShift Container Platform web console, use the user login and password specified during the launch of the ARM template.

Once logged in, deploy an example application:

  • Click on the [New Project] button
  • Provide a "Name" and click [Create]
  • Next, deploy the jenkins-ephemeral instant app by clicking the corresponding box.
  • Accept the defaults and click [Create]. Instructions along with a URL will be provided for how to access the application on the next screen.
  • Click [Continue to Overview] and bring up the management page for the application.
  • Click on the link provided as the route and access the application to confirm functionality.

4.8.2. Log into CLI and Deploy an Application

Perform the following steps from the local workstation.

Install the oc CLI by visiting the public URL of the OpenShift Container Platform deployment, for example https://resourcegroupname.region.cloudapp.azure.com/console/command-line, and clicking latest release. When directed to https://access.redhat.com, log in with valid Red Hat customer credentials and download the client relevant to the workstation operating system. Follow the instructions located on the documentation site for getting started with the CLI.

A token is required to log in to OpenShift Container Platform. The token is presented on the https://resourcegroupname.region.cloudapp.azure.com/console/command-line page. Click the Show token hyperlink and perform the following on the workstation on which the oc client was installed.

$ oc login https://resourcegroupname.region.cloudapp.azure.com --token=fEAjn7LnZE6v5SOocCSRVmUWGBNIIEKbjD9h-Fv7p09
Note

The oc command also supports logging in with a username and password combination. See the oc help login output for more information.
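
A hedged sketch of the username and password variant, assuming the credentials specified during the launch of the ARM template:

$ oc login https://resourcegroupname.region.cloudapp.azure.com -u <user> -p <password>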

After the oc client is configured, create a new project and deploy an application; in this case, a sample PHP application (CakePHP):

$ oc new-project test-app
$ oc new-app https://github.com/openshift/cakephp-ex.git --name=php
--> Found image 2997627 (7 days old) in image stream "php" in project "openshift" under tag "5.6" for "php"

    Apache 2.4 with PHP 5.6
    -----------------------
    Platform for building and running PHP 5.6 applications

    Tags: builder, php, php56, rh-php56

    * The source repository appears to match: php
    * A source build using source code from https://github.com/openshift/cakephp-ex.git will be created
      * The resulting image will be pushed to image stream "php:latest"
    * This image will be deployed in deployment config "php"
    * Port 8080/tcp will be load balanced by service "php"
      * Other containers can access this service through the hostname "php"

--> Creating resources with label app=php ...
    imagestream "php" created
    buildconfig "php" created
    deploymentconfig "php" created
    service "php" created
--> Success
    Build scheduled, use 'oc logs -f bc/php' to track its progress.
    Run 'oc status' to view your app.

$ oc expose service php
route "php" exposed

Display the status of the application.

$ oc status
In project test-app on server https://resourcegroupname.region.cloudapp.azure.com

http://test-app.apps.13.93.162.100.nip.io to pod port 8080-tcp (svc/php)
  dc/php deploys istag/php:latest <- bc/php builds https://github.com/openshift/cakephp-ex.git with openshift/php:5.6
    deployment #1 deployed about a minute ago - 1 pod

Access the application by browsing to the URL provided by oc status. The CakePHP application should now be visible.

4.9. Explore the Environment

4.9.1. List Nodes and Set Permissions

$ oc get nodes --show-labels
NAME          STATUS                     AGE
infranode1    Ready                      16d
infranode2    Ready                      16d
infranode3    Ready                      16d
master1       Ready,SchedulingDisabled   16d
master2       Ready,SchedulingDisabled   16d
master3       Ready,SchedulingDisabled   16d
node01        Ready                      16d
node02        Ready                      16d
node03        Ready                      16d

Running this command as a regular user should fail.

$ oc get nodes --show-labels
Error from server: User "nonadmin" cannot list all nodes in the cluster

The command fails because the regular user does not have the cluster-level permissions required to list nodes.

Note

For more information about roles and permissions, see the Authorization documentation.
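
If a regular user genuinely requires read access to cluster-scoped resources such as nodes, a cluster-admin can grant the cluster-reader role; a hedged sketch using the nonadmin user from the example above:

$ oadm policy add-cluster-role-to-user cluster-reader nonadmin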

4.9.2. List Router and Registry

List the router and registry pods by changing to the default project.

Note

Perform the following steps from the local workstation.

$ oc project default
$ oc get all
NAME                         REVISION        DESIRED       CURRENT   TRIGGERED BY
dc/docker-registry           1               1             1         config
dc/router                    1               2             2         config
NAME                         DESIRED         CURRENT       AGE
rc/docker-registry-1         1               1             10m
rc/router-1                  2               2             10m
NAME                         CLUSTER-IP      EXTERNAL-IP   PORT(S)                   AGE
svc/docker-registry          172.30.243.63   <none>        5000/TCP                  10m
svc/kubernetes               172.30.0.1      <none>        443/TCP,53/UDP,53/TCP     20m
svc/router                   172.30.224.41   <none>        80/TCP,443/TCP,1936/TCP   10m
NAME                         READY           STATUS        RESTARTS                  AGE
po/docker-registry-1-2a1ho   1/1             Running       0                         8m
po/router-1-1g84e            1/1             Running       0                         8m
po/router-1-t84cy            1/1             Running       0                         8m

Observe the output of oc get all.

4.9.3. Explore the Docker Registry

The OpenShift Container Platform Ansible playbooks configure three infrastructure nodes, with a single registry deployment running on them. To understand the configuration and mapping process of the registry pods, the oc describe command is used. oc describe details how the registry is configured and mapped to Azure Blob Storage using the REGISTRY_STORAGE_* environment variables.

Note

Perform the following steps from the local workstation.

$ oc describe dc/docker-registry
... [OUTPUT ABBREVIATED] ...
Environment Variables:
  REGISTRY_HTTP_ADDR:					:5000
  REGISTRY_HTTP_NET:					tcp
  REGISTRY_HTTP_SECRET:					7H7ihSNi2k/lqR0i5iINHtx+ItA2cGnpccBAz2URT5c=
  REGISTRY_MIDDLEWARE_REPOSITORY_OPENSHIFT_ENFORCEQUOTA:	false
  REGISTRY_HTTP_TLS_KEY:					/etc/secrets/registry.key
  REGISTRY_HTTP_TLS_CERTIFICATE:				/etc/secrets/registry.crt
  REGISTRY_STORAGE:						azure
  REGISTRY_STORAGE_AZURE_ACCOUNTKEY:			DUo2VfsnPwGl+4yEmye0iSQuHVrPCVmj7D+oIsYVlmaNJXS4YkZoXODvOfx3luLL6qb4j+1YhV8Nr/slKE9+IQ==
  REGISTRY_STORAGE_AZURE_ACCOUNTNAME:			sareg<resourcegroup>
  REGISTRY_STORAGE_AZURE_CONTAINER:				registry
... [OUTPUT ABBREVIATED] ...

To confirm that the Docker images are being stored in Azure Blob Storage properly, save the REGISTRY_STORAGE_AZURE_ACCOUNTKEY value from the previous command output and run the following command on the host where the Azure CLI Node.js package was installed:

$ azure storage blob list registry --account-name=sareg<resourcegroup> --account-key=<account_key>
info:    Executing command storage blob list
+ Getting blobs in container registry
data:    Name                                                                                                                                                              Blob Type   Length    Content Type              Last Modified                  Snapshot Time
data:    ----------------------------------------------------------------------------------------------------------------------------------------------------------------  ----------  --------  ------------------------  -----------------------------  -------------
data:    /docker/registry/v2/blobs/sha256/31/313a6203b84e37d24fe7e43185f9c8b12b727574a1bc98bf464faf78dc8e9689/data                                                         AppendBlob  9624      application/octet-stream  Tue, 23 May 2017 15:44:24 GMT
data:    /docker/registry/v2/blobs/sha256/4c/4c1fa39c5cda68c387cfc7dd32207af1a25b2413c266c464580001c97939cce0/data                                                         AppendBlob  43515975  application/octet-stream  Tue, 23 May 2017 15:43:45 GMT
... [OUTPUT ABBREVIATED] ...
info:    storage blob list command OK

4.9.4. Explore Docker Storage

This section will explore the Docker storage on an infrastructure node.

The steps below can be performed on any node, but for this example the first infrastructure node (infranode1) is used.

The output below verifies that Docker storage is using the devicemapper driver (shown in the Storage Driver section) and the proper LVM volume group:

$ ssh <user>@<resourcegroup>b.<region>.cloudapp.azure.com
$ ssh <user>@infranode1
$ sudo -i
# docker info
Containers: 2
 Running: 2
 Paused: 0
 Stopped: 0
Images: 4
Server Version: 1.10.3
Storage Driver: devicemapper
 Pool Name: docker--vol-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 3.221 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 1.221 GB
 Data Space Total: 25.5 GB
 Data Space Available: 24.28 GB
 Metadata Space Used: 307.2 kB
 Metadata Space Total: 29.36 MB
 Metadata Space Available: 29.05 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2016-06-09)
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: bridge null host
 Authorization: rhel-push-plugin
Kernel Version: 3.10.0-327.10.1.el7.x86_64
Operating System: Employee SKU
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 2
Total Memory: 7.389 GiB
Name: ip-10-20-3-46.azure.internal
ID: XDCD:7NAA:N2S5:AMYW:EF33:P2WM:NF5M:XOLN:JHAD:SIHC:IZXP:MOT3
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Registries: registry.access.redhat.com (secure), docker.io (secure)
# vgs
  VG        #PV #LV #SN Attr   VSize   VFree
  docker-vg   1   1   0 wz--n- 128,00g 76,80g

If the storage were configured in loopback mode, the output would list the loopback file. As the output above does not contain the word loopback, the Docker daemon is operating in the optimal way.

Note

For more information about the Docker storage requirements, see the Configuring Docker Storage documentation.

4.9.5. Explore the Microsoft Azure Load Balancers

As mentioned earlier in the document, two Azure Load Balancers have been created. The purpose of this section is to encourage exploration of the load balancers that were created.

Note

Perform the following steps from the Azure web console.

On the main Microsoft Azure dashboard, click the [Resource Groups] icon. Select the resource group that corresponds to the OpenShift Container Platform deployment, then find the [Load Balancers] within the resource group. Select the AppLB load balancer and, on the [Description] page, note the [Port Configuration] and how it is configured; this load balancer carries the OpenShift Container Platform application traffic. There should be three master instances running with a [Status] of Ok. Next, check the [Health Check] tab and the options that were configured. Further details of the configuration can be viewed by exploring the ARM templates to see exactly what was configured.
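
The load balancers can also be listed from the CLI; a minimal sketch using the classic Azure CLI in Resource Manager mode, with the resource group name as an assumed argument:

$ azure network lb list <resourcegroup>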

4.9.6. Explore the Microsoft Azure Resource Group

As mentioned earlier in the document, an Azure Resource Group was created. The purpose of this section is to encourage exploration of the resource group that was created.

Note

Perform the following steps from the Azure web console.

On the main Microsoft Azure console, click on [Resource Groups]. Next, on the left hand navigation panel, select [Your Resource Groups]. Select the recently created resource group and explore the [Summary] tabs. Next, on the right hand navigation panel, explore the [Virtual Machines], [Storage Accounts], [Load Balancers], and [Networks] tabs. Further detail about the configuration can be seen by exploring the Ansible playbooks and ARM JSON files to see exactly what was configured.
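
A hedged CLI equivalent for reviewing the resource group is shown below; it assumes the classic Azure CLI in Resource Manager mode:

$ azure group show <resourcegroup>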

4.10. Testing Failure

In this section, reactions to failure are explored. After a successful installation and completion of the smoke tests noted above, failure testing can be performed.

4.10.1. Generate a Master Outage

Note

Perform the following steps from the Azure web console and the OpenShift public URL.

Log into the Microsoft Azure console. On the dashboard, click on the [Resource Group] web service and then click [Overview]. Locate the running master2 instance, select it, right click and change the state to stopped.

Ensure the console can still be accessed by opening a browser and accessing https://resourcegroupname.region.cloudapp.azure.com. At this point, the cluster is in a degraded state because only 2/3 master nodes are running, but complete functionality remains.
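
The outage can also be generated from the CLI instead of the web console; a hedged sketch using the classic Azure CLI:

$ azure vm stop <resourcegroup> master2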

4.10.2. Observe the Behavior of etcd with a Failed Master Node

SSH into the first master node (master1) from the bastion. Using the hostnames of the etcd members (as reported by the hostname command), issue the etcdctl command to check the health of the cluster:

$ ssh <user>@<resourcegroup>b.<region>.cloudapp.azure.com
$ ssh <user>@master1
$ sudo -i
# etcdctl --endpoints https://master1:2379,https://master2:2379,https://master3:2379 --ca-file /etc/etcd/ca.crt --cert-file=/etc/origin/master/master.etcd-client.crt --key-file=/etc/origin/master/master.etcd-client.key cluster-health
failed to check the health of member 82c895b7b0de4330 on https://10.20.2.251:2379: Get https://10.20.1.251:2379/health: dial tcp 10.20.1.251:2379: i/o timeout
member 82c895b7b0de4330 is unreachable: [https://10.20.1.251:2379] are all unreachable
member c8e7ac98bb93fe8c is healthy: got healthy result from https://10.20.3.74:2379
member f7bbfc4285f239ba is healthy: got healthy result from https://10.20.1.106:2379
cluster is healthy

Notice how one member of the etcd cluster is now unreachable. Restart master2 by following the same steps in the Azure web console as noted above.
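
If the CLI was used to stop the instance, a hedged equivalent for restarting it:

$ azure vm start <resourcegroup> master2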

4.10.3. Generate an Infrastructure Node outage

This section shows what to expect when an infrastructure node fails or is brought down intentionally.

4.10.3.1. Confirm Application Accessibility

Note

Perform the following steps from the browser on a local workstation.

Before bringing down an infrastructure node, check behavior and ensure things are working as expected. The goal of testing an infrastructure node outage is to see how the OpenShift Container Platform routers and registries behave. Confirm the simple application deployed earlier is still functional. If it is not, deploy a new version. Access the application to confirm connectivity. As a reminder, to find the information required to ensure the application is still running: list the projects, change to the project in which the application is deployed, get the status of the application (which includes the URL), and access the application via that URL.

$ oc get projects
NAME               DISPLAY NAME   STATUS
openshift                         Active
openshift-infra                   Active
ttester                           Active
test-app1                         Active
default                           Active
management-infra                  Active

$ oc project test-app1
Now using project "test-app1" on server "https://resourcegroupname.region.cloudapp.azure.com".

$ oc status
In project test-app1 on server https://resourcegroupname.region.cloudapp.azure.com

http://test-app1.apps.13.93.162.100.nip.io to pod port 8080-tcp (svc/php-prod)
  dc/php-prod deploys istag/php-prod:latest <-
    bc/php-prod builds https://github.com/openshift/cakephp-ex.git with openshift/php:5.6
    deployment #1 deployed 27 minutes ago - 1 pod

Open a browser and ensure the application is still accessible.
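
The same check can be scripted from the workstation; a minimal sketch using curl against the route reported by oc status, where a successful HTTP response indicates the route and application are reachable:

$ curl -I http://test-app1.apps.13.93.162.100.nip.io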

4.10.3.2. Confirm Registry Functionality

This section is another step to take before initiating the outage of the infrastructure node, to ensure that the registry is functioning properly. The goal is to push an image to the OpenShift Container Platform registry.

Note

Perform the following steps from a CLI on a local workstation and ensure that the oc client has been configured as explained before.

Important

In order to be able to push images to the registry, the docker configuration on the workstation will be modified to trust the docker registry certificate.

Get the name of the docker-registry pod:

$ oc get pods -n default | grep docker-registry
docker-registry-4-9r033    1/1       Running   0          2h

Get the registry certificate and save it:

$ oc exec docker-registry-4-9r033 cat /etc/secrets/registry.crt >> /tmp/my-docker-registry-certificate.crt

Capture the registry route:

$ oc get route docker-registry -n default
NAME              HOST/PORT                                      PATH      SERVICES          PORT      TERMINATION   WILDCARD
docker-registry   docker-registry-default.13.64.245.134.nip.io             docker-registry   <all>     passthrough   None

Create the proper directory in /etc/docker/certs.d/ for the registry:

$ sudo mkdir -p /etc/docker/certs.d/docker-registry-default.13.64.245.134.nip.io

Move the certificate to the directory previously created and restart the docker service on the workstation:

$ sudo mv /tmp/my-docker-registry-certificate.crt /etc/docker/certs.d/docker-registry-default.13.64.245.134.nip.io/ca.crt
$ sudo systemctl restart docker

A token is required to log in to the Docker registry.

$ oc whoami -t
feAeAgL139uFFF_72bcJlboTv7gi_bo373kf1byaAT8

Pull a new Docker image to use for a test push.

$ docker pull fedora/apache
$ docker images | grep fedora/apache
docker.io/fedora/apache  latest  c786010769a8  3 months ago  396.4 MB

Tag the Docker image with the registry hostname:

$ docker tag docker.io/fedora/apache docker-registry-default.13.64.245.134.nip.io/openshift/prodapache

Check the images and ensure the newly tagged image is available.

$ docker images | grep openshift/prodapache
docker-registry-default.13.64.245.134.nip.io/openshift/prodapache   latest              c786010769a8        3 months ago        396.4 MB

Issue a Docker login.

$ docker login -u $(oc whoami) -e <email> -p $(oc whoami -t) docker-registry-default.13.64.245.134.nip.io
Login Succeeded
Note

The email address does not need to be valid, and the flag will be deprecated in future versions of the Docker CLI.

Push the image to the OpenShift Container Platform registry:

$ docker push docker-registry-default.13.64.245.134.nip.io/openshift/prodapache
The push refers to a repository [docker-registry-default.13.64.245.134.nip.io/openshift/prodapache]
3a85ee80fd6c: Pushed
5b0548b012ca: Pushed
a89856341b3d: Pushed
a839f63448f5: Pushed
e4f86288aaf7: Pushed
latest: digest: sha256:e2a15a809ce2fe1a692b2728bd07f58fbf06429a79143b96b5f3e3ba0d1ce6b5 size: 7536

4.10.3.3. Get Location of Registry

Note

Perform the following steps from the CLI of a local workstation.

Change to the default OpenShift Container Platform project and check the registry pod location:

$ oc get pods -o wide -n default
NAME                       READY     STATUS    RESTARTS   AGE       IP           NODE
docker-registry-4-9r033    1/1       Running   0          2h        10.128.6.5   infranode3
registry-console-1-zwzsl   1/1       Running   0          5d        10.131.4.2   infranode2
router-1-09x4g             1/1       Running   0          5d        10.0.2.5     infranode2
router-1-6135c             1/1       Running   0          5d        10.0.2.4     infranode1
router-1-l2562             1/1       Running   0          5d        10.0.2.6     infranode3

4.10.3.4. Initiate the Failure and Confirm Functionality

Note

Perform the following steps from the Azure web console and a browser.

Log into the Azure web console. On the dashboard, click on the [Resource Group]. Locate the running instance where the registry pod is running (infranode3 in the previous example), select it, right click, and change the state to stopped. Wait a minute or two for the registry pod to migrate over to a different infrastructure node. Check the registry location and confirm that it moved to a different infrastructure node:

$ oc get pods -o wide -n default | grep docker-registry
docker-registry-4-kd40f    1/1       Running   0          1m        10.130.4.3   infranode1

Follow the procedures above to ensure a Docker image can still be pushed to the registry now that infranode3 is down.

4.11. Metrics exploration

Red Hat OpenShift Container Platform metrics components enable additional features in the Red Hat OpenShift Container Platform web interface. If the environment was deployed with metrics enabled, a new tab named "Metrics" appears in the pod section, showing CPU, memory, and network usage data over a period of time:

Metrics details
Note

If metrics do not show, check whether the hawkular certificate has been trusted. Visit the metrics route using the browser, accept the self-signed certificate warning, and refresh the metrics tab to check whether metrics are shown. Future revisions of this reference architecture document will include how to create proper certificates to avoid trusting self-signed certificates.

Using the CLI, a cluster-admin can also observe the usage of the pods and nodes with the following commands:

$ oc adm top pod --heapster-namespace="openshift-infra" --heapster-scheme="https" --all-namespaces
NAMESPACE         NAME                                   CPU(cores)   MEMORY(bytes)
openshift-infra   hawkular-cassandra-1-h9mrq             161m         1423Mi
logging           logging-fluentd-g5jqw                  8m           92Mi
logging           logging-es-ops-b44n3gav-1-zkl3r        19m          861Mi
... [OUTPUT ABBREVIATED] ...
$ oc adm top node --heapster-namespace="openshift-infra" --heapster-scheme="https"
NAME         CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%
infranode3   372m         9%        4657Mi          33%
master3      68m          1%        1923Mi          13%
node02       43m          1%        1437Mi          5%
... [OUTPUT ABBREVIATED] ...

4.11.1. Using the Horizontal Pod Autoscaler

To use the HorizontalPodAutoscaler feature, the metrics components must be deployed, and limits must be configured on the pod so that the target utilization percentage at which the pod will be scaled can be set.

The following commands show how to create a new project, deploy an example pod, and set some limits:

$ oc new-project autoscaletest
Now using project "autoscaletest" on server "https://myocp.eastus2.cloudapp.azure.com:8443".
... [OUTPUT ABBREVIATED] ...

$ oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
--> Found Docker image d9c9735 (10 days old) from Docker Hub for "centos/ruby-22-centos7"
... [OUTPUT ABBREVIATED] ...

$ oc patch dc/ruby-ex -p '{"spec":{"template":{"spec":{"containers":[{"name":"ruby-ex","resources":{"limits":{"cpu":"80m"}}}]}}}}'
"ruby-ex" patched

$ oc get pods
NAME              READY     STATUS      RESTARTS   AGE
ruby-ex-1-210l9   1/1       Running     0          2m
ruby-ex-1-build   0/1       Completed   0          4m

$ oc describe pod ruby-ex-1-210l9
Name:			ruby-ex-1-210l9
... [OUTPUT ABBREVIATED] ...
    Limits:
      cpu:	80m
    Requests:
      cpu:		80m

Once the pod is running, create the autoscaler:

$ oc autoscale dc/ruby-ex --min 1 --max 10 --cpu-percent=50
deploymentconfig "ruby-ex" autoscaled
$ oc get horizontalpodautoscaler
NAME      REFERENCE                  TARGET    CURRENT   MINPODS   MAXPODS   AGE
ruby-ex   DeploymentConfig/ruby-ex   50%       0%        1         10        53s

Access the pod and generate some CPU load, for example:

$ oc rsh ruby-ex-1-210l9

sh-4.2$ while true; do echo "cpu hog" >> mytempfile; rm -f mytempfile; done

Observe the events and the running pods; after a while, a new replica is created:

$ oc get events -w
LASTSEEN                        FIRSTSEEN                       COUNT     NAME      KIND                      SUBOBJECT   TYPE      REASON                    SOURCE                         MESSAGE
2017-07-13 13:28:35 +0000 UTC   2017-07-13 13:26:30 +0000 UTC   7         ruby-ex   HorizontalPodAutoscaler               Normal    DesiredReplicasComputed   {horizontal-pod-autoscaler }   Computed the desired num of replicas: 0 (avgCPUutil: 0, current replicas: 1)
2017-07-13 13:29:05 +0000 UTC   2017-07-13 13:29:05 +0000 UTC   1         ruby-ex   HorizontalPodAutoscaler             Normal    DesiredReplicasComputed   {horizontal-pod-autoscaler }   Computed the desired num of replicas: 2 (avgCPUutil: 67, current replicas: 1)
2017-07-13 13:29:05 +0000 UTC   2017-07-13 13:29:05 +0000 UTC   1         ruby-ex   DeploymentConfig             Normal    ReplicationControllerScaled   {deploymentconfig-controller }   Scaled replication controller "ruby-ex-1" from 1 to 2
2017-07-13 13:29:05 +0000 UTC   2017-07-13 13:29:05 +0000 UTC   1         ruby-ex   HorizontalPodAutoscaler             Normal    SuccessfulRescale   {horizontal-pod-autoscaler }   New size: 2; reason: CPU utilization above target
2017-07-13 13:29:05 +0000 UTC   2017-07-13 13:29:05 +0000 UTC   1         ruby-ex-1-zwmxd   Pod                 Normal    Scheduled   {default-scheduler }   Successfully assigned ruby-ex-1-zwmxd to node02

$ oc get pods
NAME              READY     STATUS      RESTARTS   AGE
ruby-ex-1-210l9   1/1       Running     0          8m
ruby-ex-1-build   0/1       Completed   0          9m
ruby-ex-1-zwmxd   1/1       Running     0          58s

After canceling the CPU hog command, the events will show how the deploymentconfig returns to a single replica:

$ oc get events -w
LASTSEEN                        FIRSTSEEN                       COUNT     NAME      KIND                      SUBOBJECT   TYPE      REASON                    SOURCE                         MESSAGE
2017-07-13 13:34:05 +0000 UTC   2017-07-13 13:34:05 +0000 UTC   1         ruby-ex   HorizontalPodAutoscaler             Normal    SuccessfulRescale   {horizontal-pod-autoscaler }   New size: 1; reason: All metrics below target
2017-07-13 13:34:05 +0000 UTC   2017-07-13 13:34:05 +0000 UTC   1         ruby-ex   DeploymentConfig             Normal    ReplicationControllerScaled   {deploymentconfig-controller }   Scaled replication controller "ruby-ex-1" from 2 to 1
2017-07-13 13:34:05 +0000 UTC   2017-07-13 13:34:05 +0000 UTC   1         ruby-ex-1   ReplicationController             Normal    SuccessfulDelete   {replication-controller }   Deleted pod: ruby-ex-1-zwmxd

4.12. Logging exploration

Red Hat OpenShift Container Platform aggregated logging components enable additional features in the Red Hat OpenShift Container Platform web interface. If the environment was deployed with logging enabled, a new link named "View Archive" appears in the pod logs section; it redirects to the Kibana web interface, where users can view pod logs, create queries, apply filters, and so on.

Logging example
Note

For more information about Kibana, see the Kibana documentation.

If the "opslogging" cluster has been deployed, there is a route named "kibana-ops" in the "logging" project where cluster-admin users can browse infrastructure logs.

OPS logging example
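
The Kibana routes can also be retrieved from the CLI by a cluster-admin; a minimal sketch:

$ oc get routes -n logging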