Unreliable status of containerized Collectd agent

Solution In Progress

Issue

  • The data reported by collectd is inconsistent on several nodes.

  • The collectd log on one of the affected nodes shows repeated errors:

[root@overcloud-controller-0 collectd]# tail collectd.log
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [7]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [2]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [2]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [2]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [7]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [7]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [2]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [2]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [2]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [7]: write to wake-up fd 8 failed: Broken pipe
  • The health status of the collectd container changes to unhealthy:
[root@overcloud-controller-0 collectd]# docker ps
CONTAINER ID        IMAGE                                                                                                         COMMAND                  CREATED             STATUS                  PORTS               NAMES
e752df371daa        satellite.localdomain:5000/is_linux-production-openstack-osp13_containers-collectd:13.0-137.1608222731   "dumb-init --singl..."   6 days ago          Up 6 days (unhealthy)                       collectd
  • After the collectd container is restarted manually, it initially reports that the plugins loaded successfully:
[root@overcloud-controller-0 collectd]# docker restart collectd
collectd
[root@overcloud-controller-0 collectd]# tail -20 collectd.log
[2021-01-22 15:25:28] plugin_load: plugin "cpu" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "df" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "disk" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "ethstat" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "hugepages" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "interface" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "load" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "memory" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "processes" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "sysevent" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "tcpconns" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "unixsock" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "uptime" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "write_http" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "write_kafka" successfully loaded.
[2021-01-22 15:25:28] unixsock plugin: Successfully deleted socket file "/var/run/collectd-socket".
[2021-01-22 15:25:28] Initialization complete, entering read-loop.
[2021-01-22 15:25:28] tcpconns plugin: Reading from netlink succeeded. Will use the netlink method from now on.
[2021-01-22 15:25:28] write_kafka plugin: created KAFKA handle : rdkafka#producer-1
[2021-01-22 15:25:28] write_kafka plugin: handle created for topic : nfv-collectd-events
  • However, after a few minutes, the same errors appear again.
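
To see how quickly the errors return after a restart, it can help to count the "write to wake-up fd ... failed" lines per minute rather than reading the tail of the log. The following is a minimal sketch, not part of the original report: the function name count_wakeup_failures is hypothetical, and it assumes every log line starts with a timestamp in the "[YYYY-MM-DD HH:MM:SS]" form shown in the excerpts above.

```shell
# Hypothetical helper: count wake-up fd write failures per minute.
# Reads a collectd log on stdin and prints "<count> <YYYY-MM-DD HH:MM>"
# for each minute in which at least one failure was logged.
# Assumes each line begins with "[YYYY-MM-DD HH:MM:SS]".
count_wakeup_failures() {
    awk '/write to wake-up fd [0-9]+ failed/ {
             # Characters 2-17 hold the date plus hour and minute.
             minute = substr($0, 2, 16)
             count[minute]++
         }
         END { for (m in count) print count[m], m }' | sort -k2
}
```

It could be run as "count_wakeup_failures < collectd.log" from the directory that holds the log. If the installed Docker version supports health checks in inspect output, "docker inspect --format '{{json .State.Health}}' collectd" should also show the recent health check results behind the unhealthy status.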

Environment

  • Red Hat OpenStack Platform 13.0 (RHOSP)
