CloudForms Management Engine (CFME): Understanding logrotate - identifying and correcting problems


Environment

  • Enterprise Virtualization Manager Version 4 (EVM)
  • Enterprise Virtualization Manager Version 5 (EVM)
  • Red Hat CloudForms Management Engine (CFME) 5.1
  • Red Hat CloudForms Management Engine (CFME) 5.2
  • CloudForms (CF) 2.0
  • CloudForms (CF) 3.0
  • CloudForms (CF) 3.1

Issue

  • Log rotate (logrotate) is not working in one or more CFME appliances and/or the filesystem containing the CFME log files is full or fills repeatedly
  • Logs on server are not rotating properly
  • /var/www/miq/vmdb/log is at 100% usage
  • Why are the logs on the CloudForms appliance not rotating properly in CloudForms 2.0?

Resolution

A logrotate failure or a full log filesystem is a symptom of a deeper problem, so fixing the symptom does not generally repair the system. Before fixing the symptom, capture the full contents of the /var/www/miq/vmdb/log directory so that diagnosis remains possible afterward. The relief provided by fixing the symptom is likely to be only temporary.

Fixing the symptom
Navigate to the /var/www/miq/vmdb/log directory and delete the largest *.log file within that directory.

Understanding what is causing the symptom
Determine which types of log lines are most frequent, and whether they are caused by an error in the environment or by factors internal to CFME. See the Diagnostic Steps section below for how to dig deeper into the root cause of each instance.
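One way to find the most frequent log line types is to tally lines by severity and by the first token of the payload. The following Python sketch assumes the standard log line layout described under Diagnostic Steps. The function name count_line_types and the choice of the first payload token as the "type" are illustrative assumptions, not part of CFME.

```python
import re
from collections import Counter

# Assumed layout of a standard line (see Diagnostic Steps):
# [----] I, [<timestamp> #<pid>:<thread>]  INFO -- : <payload>
LINE_RE = re.compile(r"^\[----\] ([A-Z]), \[\S+ #\d+:\w+\]\s+\w+ -- : (\S+)")


def count_line_types(lines):
    """Count standard log lines by (severity letter, first payload token).

    Lines that do not match the standard layout are ignored.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Calling count_line_types on an open log file and then most_common(10) on the result lists the ten most frequent line types, which is usually enough to tell whether one source is flooding the log.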

If external factors are the cause, identify the external cause as closely as possible (e.g., event storms being emitted from one or more providers) and provide the information to the customer so that the appropriate customer staff can be engaged to correct the problem.

If internal factors are the cause (e.g., an incorrectly coded provisioning or other automation process that is looping and emitting too many log lines), help the customer identify the looping component and contact the author of the misbehaving customization to correct the problem. Alternatively, if the problem is due to looping in a process that can be identified uniquely (e.g., a single pid with high CPU utilization), have the customer kill that process using their favorite console access facility.

Root Cause

Some unexpected behavior in the appliance or in the environment being monitored is causing more log lines to be created than the allocated filesystem can reliably handle.

Diagnostic Steps

Collect the contents of /var/www/miq/vmdb/log directory from the appliance (or one of the appliances) encountering the issue.

When the log set is provided, review it to see whether there is anything unusual in the sizes of any of the .log files or the .gz files created by logrotate.

The files created by logrotate (generally *.gz) have date stamps that can be used to identify the last time logrotate was working correctly.
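As an illustration, the most recent rotation date can be read from the rotated file names. This Python sketch assumes that rotated files carry a -YYYYMMDD suffix (for example evm.log-20140217.gz). That naming is an assumption about the logrotate configuration, and the function name latest_rotation is hypothetical. If the appliance names its files differently, the file modification times serve the same purpose.

```python
import datetime
import re

# Assumption: rotated files end in -YYYYMMDD, optionally followed by .gz
DATE_SUFFIX_RE = re.compile(r"-(\d{8})(?:\.gz)?$")


def latest_rotation(file_names):
    """Return the newest date found in rotated file names, or None."""
    dates = []
    for name in file_names:
        match = DATE_SUFFIX_RE.search(name)
        if not match:
            continue
        try:
            parsed = datetime.datetime.strptime(match.group(1), "%Y%m%d")
        except ValueError:
            continue  # eight digits that are not a valid calendar date
        dates.append(parsed.date())
    return max(dates) if dates else None
```

Passing the result of os.listdir on the collected log directory gives the date of the last successful rotation, or None if no dated files are present.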

If one of the *.log files is larger than one half of the available free space on the filesystem on which logging is occurring, focus your analysis on that log. A file larger than the available free space is sufficient to cause logrotate to fail.

If any one .log file is obviously larger than all of the others, it is a good candidate to examine closely.
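The size checks above can be sketched in Python when run on the appliance itself, where the free space of the log filesystem can be queried. The function names survey_logs and suspects are illustrative. The half-of-free-space threshold is taken from the guidance above.

```python
import os
import shutil


def survey_logs(log_dir):
    """Return (free_bytes, [(size, name), ...]) for *.log files, largest first."""
    free_bytes = shutil.disk_usage(log_dir).free
    sizes = []
    for entry in os.scandir(log_dir):
        if entry.is_file() and entry.name.endswith(".log"):
            sizes.append((entry.stat().st_size, entry.name))
    sizes.sort(reverse=True)
    return free_bytes, sizes


def suspects(free_bytes, sizes):
    """Names of logs larger than one half of the available free space."""
    return [name for size, name in sizes if size > free_bytes / 2]


if __name__ == "__main__":
    log_dir = "/var/www/miq/vmdb/log"
    if os.path.isdir(log_dir):
        free_bytes, sizes = survey_logs(log_dir)
        print("free bytes:", free_bytes)
        for size, name in sizes:
            print(size, name)
        print("suspects:", suspects(free_bytes, sizes))
```

When working from a collected log set on another machine, the free space reported locally is not meaningful. In that case pass the appliance's free space to suspects by hand, or simply read the sorted size list.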

As you look through the most likely suspects, navigate to the end of each file and move backwards to see if log lines from any one process (pid) are dominating the log at the end of the logs.

All CFME log lines have a standard format with two general sections: the preamble and the payload. In each standard (e.g., non-error) log line, the two sections are separated by the unique field -- : .

A random sample follows:

[----] I, [2014-02-17T08:25:31.127580 #19603:11b0804]  INFO -- : MIQ(VcRefresher.refresh) EMS: [Virtual Center (192.168.252.14)], id: [440000000000001] Refreshing targets for EMS... 

For initial diagnosis, we will not examine the payload, but first focus on the preamble.

The preamble fields are as follows and are separated by blanks:
1- The literal [----]. If a log line does not start with this string, it is not a standard log line.
2- A two-character field containing a short indicator of the type of log line. The last character of the field is a comma (","). Valid values are "W" (Warning), "I" (Info), "F" (Fatal), "E" (Error), "D" (Debug). In the sample above this field is "I,"
3- A fixed-length field containing the UTC date and time of the log line. This field starts with a left square bracket ("["). In the sample above the value is "[2014-02-17T08:25:31.127580"
4- A variable-length field that begins with the hash sign ("#") and terminates with the right square bracket ("]"). This field is composed of two key elements separated by a colon (":"). The first element is the process id (pid) of the process responsible for creating the log line ("19603" in the example above), and the second element is the id of the thread within that process that created the log line ("11b0804" in the example above).
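The preamble layout described above can be expressed as a regular expression. The following Python sketch is one possible reading of that layout, checked only against the sample line shown earlier. The names LogLine and parse_line are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class LogLine(NamedTuple):
    severity: str   # field 2 without the trailing comma, e.g. "I"
    timestamp: str  # field 3 without the leading bracket
    pid: int        # first element of field 4
    thread_id: str  # second element of field 4
    payload: str    # everything after the "-- : " separator


LINE_RE = re.compile(
    r"^\[----\] (?P<severity>[A-Z]), "
    r"\[(?P<timestamp>\S+) "
    r"#(?P<pid>\d+):(?P<thread_id>\w+)\]"
    r"\s+\w+ -- : (?P<payload>.*)$"
)


def parse_line(line: str) -> Optional[LogLine]:
    """Parse one standard log line; return None for non-standard lines."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    return LogLine(
        severity=match.group("severity"),
        timestamp=match.group("timestamp"),
        pid=int(match.group("pid")),
        thread_id=match.group("thread_id"),
        payload=match.group("payload"),
    )
```

Lines for which parse_line returns None (stack traces, continuation lines, and similar) are not standard log lines and can be skipped during the initial diagnosis.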

For diagnostic purposes, we will focus on the pid of each log line. If one pid seems to dominate the largest of the .log files, it is likely that this pid has encountered an issue that needs to be corrected. It may be possible to determine the nature of the problem by looking at the payload of these log lines. For example, if they contain "AUTOMATION" or some other obvious identifier, that suggests where the problem may be. If the payload contains the word "ERROR", that suggests a problem that needs to be researched further.
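A minimal Python sketch of the pid check follows. It reads only the end of a large log and reports which pids account for the most lines. The function names and the default of reading the last 5 MiB are arbitrary choices made for this example.

```python
import os
import re
from collections import Counter

# Captures the pid from the preamble of a standard log line.
PID_RE = re.compile(r"^\[----\] [A-Z], \[\S+ #(\d+):")


def tail_lines(path, max_bytes=5 * 1024 * 1024):
    """Return the lines found in the last max_bytes of the file."""
    with open(path, "rb") as fh:
        size = fh.seek(0, os.SEEK_END)
        start = max(0, size - max_bytes)
        fh.seek(start)
        data = fh.read()
    lines = data.decode("utf-8", errors="replace").splitlines()
    if start > 0 and lines:
        lines = lines[1:]  # the first line may be cut off, so drop it
    return lines


def dominant_pids(lines, top=5):
    """Return [(pid, line_count, share_of_standard_lines), ...], busiest first."""
    counts = Counter()
    for line in lines:
        match = PID_RE.match(line)
        if match:
            counts[int(match.group(1))] += 1
    total = sum(counts.values())
    return [(pid, n, n / total) for pid, n in counts.most_common(top)]
```

For example, dominant_pids(tail_lines("evm.log")) lists the busiest pids at the end of that file. A single pid with a share close to 1.0 is the one to examine first.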

It is possible, and even likely, that the log lines at the end of a file being examined are substantially in the past, especially if the filesystem has filled. Their timestamps simply reflect the last time that log lines could be added to the file. This is another symptom rather than an indication of the problem itself, so do not treat it as "the problem".

If the actions above do not present a reasonable path forward, request collaboration from the cloud SBR.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
