pmlogcheck 100% cpu after 8.0 -> 8.3 update

After updating today from 8.0 to 8.3, I now have pmlogcheck running at 100% of one core. It will occasionally die, be replaced by pmlogrewrite and then xz, and then run again at 100%.

Is this normal? If not, what do I look for to fix or disable?


P.S. The update process I used was 'yum update' followed by 'reboot'.
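For anyone else hitting this and wanting to confirm which process is actually burning the CPU at any given moment, a plain ps snapshot sorted by CPU usage does the job (nothing PCP-specific, just standard procps options):

```shell
# One-shot snapshot of the top CPU consumers, including their nice
# value (NI), so you can see whether pmlogcheck/pmlogrewrite/xz are
# running at normal priority (NI 0) or have been deprioritized.
ps -eo pid,ni,pcpu,etime,comm --sort=-pcpu | head -n 10
```

If pmlogcheck is the culprit, it (or pmlogrewrite/xz, as they cycle) will sit at the top of that list, and the NI column shows they run at normal priority.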


Answering my own question here: I disabled the pmlogger_check service. Looking at its status, I now have this:

# systemctl status pmlogger_check.service 
● pmlogger_check.service - Check pmlogger instances are running
   Loaded: loaded (/usr/lib/systemd/system/pmlogger_check.service; disabled; vendor preset: disable>
   Active: inactive (dead) since Sun 2021-01-17 10:56:45 EST; 14min ago
     Docs: man:pmlogger_check(1)
  Process: 10637 ExecStart=/usr/libexec/pcp/bin/pmlogger_check $PMLOGGER_CHECK_PARAMS (code=exited,>
 Main PID: 10637 (code=exited, status=0/SUCCESS)

Jan 17 10:55:03 logrus systemd[1]: Starting Check pmlogger instances are running...
Jan 17 10:56:45 logrus systemd[1]: pmlogger_check.service: Succeeded.
Jan 17 10:56:45 logrus systemd[1]: Started Check pmlogger instances are running.

Listing the directory /var/log/pcp/pmlogger/logrus shows the various logging data files are being written and compressed. Looking at the error logs does not show anything related to pmlogger.

It seems that after updating to the pcp version in 8.3, it rewrites all the damned logfiles. I've just had a server chew 100% CPU on a core for 32 minutes while it went through rewriting all of its garbage.

It had better be a one-off after the update. This behaviour is completely unacceptable; it is the sort of thing that triggers alerts in an enterprise environment and gets people woken up at 3am.

The pmlogrewrite and pmlogcheck processes were not even "nice'd" to a low priority, so this also has the potential to slow down production service startup and cause failures. We have several JBoss services that take a long time to start up, and if this log rewriting slows things down even a bit, they could easily exceed the 5-minute systemd service startup timeout, leading systemd to conclude that startup failed, and so on.
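If you would rather keep the maintenance jobs but stop them from competing with service startup, a systemd drop-in that lowers their priority seems like the obvious fix. This is my own sketch, not something pcp ships; the directives are standard systemd ones, and I am assuming the pmlogrewrite/xz children inherit the scheduling settings from the service:

```shell
# Run pmlogger_check (and the pmlogrewrite/xz children it spawns)
# at idle CPU and I/O priority so it cannot starve service startup.
sudo mkdir -p /etc/systemd/system/pmlogger_check.service.d
sudo tee /etc/systemd/system/pmlogger_check.service.d/priority.conf <<'EOF'
[Service]
Nice=19
CPUSchedulingPolicy=idle
IOSchedulingClass=idle
EOF
sudo systemctl daemon-reload
```

With that in place the rewriting still happens after an update, it just loses every scheduling contest against real workloads.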

Very poor behaviour.