Chapter 12. Operational Tools (monitoring, logging & alarms)

Monitoring a traditional hardware-based EPC mostly involved Element Managers (EMs) provided by the EPC vendor, along with a set of generic monitoring tools and techniques, including SNMP monitoring (using snmpget and snmpwalk), SNMP traps and Syslog. With the advent of vEPC, however, the solution is multi-layered and much more complex to monitor. In addition to monitoring the vEPC at the application layer (using an EM or otherwise), we now have to monitor:

  • Server Hardware: Power supply, Temperature, Fans, Disk, Fabric errors, etc.
  • Host OS: Memory, CPU, Disk and I/O errors
  • OpenStack: Service daemons, Instance reachability, Volumes, Hypervisor metrics, Nova/Compute metrics, Tenant metrics, Message Queues, Keystone Tokens and Notifications

Note that the VNFM, if one is present, also contributes to lifecycle management of the VNFs by restarting VNFs and starting new VNFs when scaling is required.

12.1. Logging

OpenStack provides numerous log files for each component, and monitoring them is an important operational activity. As an example, let us look at /var/log/nova/nova-compute.log:

2016-12-09 17:51:51.025 44510 INFO nova.compute.resource_tracker [req-336c8469-3c29-4b55-8917-36db3303bb72 - - - - -] Auditing locally available compute resources for node overcloud-compute-0.localdomain

2016-12-09 17:51:51.307 44510 INFO nova.compute.resource_tracker [req-336c8469-3c29-4b55-8917-36db3303bb72 - - - - -] Total usable vcpus: 48, total allocated vcpus: 0

2016-12-09 17:51:51.307 44510 INFO nova.compute.resource_tracker [req-336c8469-3c29-4b55-8917-36db3303bb72 - - - - -] Final resource view: name=overcloud-compute-0.localdomain phys_ram=130950MB used_ram=2048MB phys_disk=372GB used_disk=0GB total_vcpus=48 used_vcpus=0 pci_stats=[]

2016-12-09 17:51:51.325 44510 INFO nova.compute.resource_tracker [req-336c8469-3c29-4b55-8917-36db3303bb72 - - - - -] Compute_service record updated for overcloud-compute-0.localdomain:overcloud-compute-0.localdomain

If the compute node is healthy and able to communicate with the controller node, we should see periodic “Auditing locally available compute resources for node…” log messages, each followed by a report that provides a snapshot of the resources available on that compute node. If the nova service is not healthy on that compute node, or communication between the controller node and this compute node has failed, these updates stop appearing in the nova-compute.log file. This can also be observed on the OpenStack Horizon dashboard, where VMs are no longer scheduled on compute nodes that the controller node considers “Out of service”.
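The check described above can be automated by looking for the most recent audit message and raising an alarm when it is too old. The following Python sketch illustrates the idea against the log format shown earlier. The function names, the regular expression and the five-minute threshold are assumptions made for illustration; the threshold in particular should be tuned to the periodic task interval configured in the deployment.

```python
import re
from datetime import datetime, timedelta

# Matches the resource audit lines shown in the nova-compute.log excerpt.
AUDIT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \d+ INFO "
    r"nova\.compute\.resource_tracker \[[^\]]*\] "
    r"Auditing locally available compute resources for node (?P<node>\S+)"
)


def last_audit(lines):
    """Return (timestamp, node) of the most recent audit message, or None."""
    latest = None
    for line in lines:
        match = AUDIT_RE.match(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            if latest is None or ts > latest[0]:
                latest = (ts, match.group("node"))
    return latest


def is_stale(lines, now, max_age=timedelta(minutes=5)):
    """True if no audit message was seen, or the newest one is older than max_age.

    max_age is an illustrative assumption, not a value taken from OpenStack.
    """
    latest = last_audit(lines)
    return latest is None or now - latest[0] > max_age


SAMPLE = [
    "2016-12-09 17:51:51.025 44510 INFO nova.compute.resource_tracker "
    "[req-336c8469-3c29-4b55-8917-36db3303bb72 - - - - -] Auditing locally "
    "available compute resources for node overcloud-compute-0.localdomain",
]

if __name__ == "__main__":
    # Evaluate the sample line against a fixed "current" time.
    print(is_stale(SAMPLE, now=datetime(2016, 12, 9, 18, 0, 0)))
```

In practice the lines would be read from /var/log/nova/nova-compute.log on each compute node and `now` would be the current time; a stale result would then feed whatever alarm mechanism the operator already uses.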

A complete list of log files can be found on the Red Hat customer portal under each release. For Red Hat OpenStack Platform 10, it can be found at https://access.redhat.com/documentation/en/red-hat-openstack-platform/10/paged/logging-monitoring-and-troubleshooting-guide/.

12.2. Monitoring

OpenStack also provides various KPIs that should be monitored apart from the log files.

Monitoring the OpenStack metrics and events listed below gives mobile operators insight into the performance, availability and overall health of OpenStack.

The most useful metrics are the following:

  • hypervisor_load
  • running_vms
  • free_disk_gb
  • free_ram_mb
  • queue memory
  • queue consumers
  • consumer_utilisation

Details and a more complete listing of OpenStack KPIs can be found at https://www.datadoghq.com/blog/openstack-monitoring-nova/?ref=wikipedia.
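As a rough illustration of how such metrics can drive alarms, the following Python sketch compares one sample of metric values against per-metric limits. The threshold values, the `check_metrics` function and the sample dictionary are all hypothetical; how the sample is collected (for example, from the OpenStack APIs or a monitoring agent) is left open here.

```python
# Hypothetical alert thresholds; the values are illustrative assumptions only.
# "min" means alarm when the metric falls below the limit,
# "max" means alarm when the metric rises above the limit.
THRESHOLDS = {
    "free_ram_mb": ("min", 8192),
    "free_disk_gb": ("min", 50),
    "hypervisor_load": ("max", 40.0),
    "running_vms": ("max", 20),
}


def check_metrics(sample, thresholds=THRESHOLDS):
    """Return a list of alarm strings for metrics that violate their limit.

    Metrics missing from the sample are skipped rather than treated as errors.
    """
    alarms = []
    for name, (kind, limit) in sorted(thresholds.items()):
        if name not in sample:
            continue
        value = sample[name]
        if kind == "min" and value < limit:
            alarms.append(f"{name}={value} is below minimum {limit}")
        elif kind == "max" and value > limit:
            alarms.append(f"{name}={value} is above maximum {limit}")
    return alarms


if __name__ == "__main__":
    # A made-up sample for one compute node.
    sample = {
        "free_ram_mb": 4096,
        "free_disk_gb": 372,
        "hypervisor_load": 1.5,
        "running_vms": 3,
    }
    for alarm in check_metrics(sample):
        print(alarm)
```

Keeping the limits in a single table makes it easy to adjust them per deployment without touching the checking logic.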