Unreliable status of containerized Collectd agent
Issue
- The data reported by collectd was inconsistent on several nodes.
- The collectd log on one of the affected nodes shows repeated errors:
[root@overcloud-controller-0 collectd]# tail collectd.log
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [7]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [2]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [2]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [2]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [7]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [7]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [2]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [2]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [2]: write to wake-up fd 8 failed: Broken pipe
[2021-01-22 15:23:43] [thrd:app]: nfv-collectd-events [7]: write to wake-up fd 8 failed: Broken pipe
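The repeated errors above can be tallied to see how often they occur. The following is a minimal sketch, assuming the log lines keep the "[YYYY-MM-DD HH:MM:SS]" prefix shown in this article; the function name count_broken_pipe is hypothetical and is not part of collectd.

```shell
# Count "Broken pipe" wake-up failures per minute in a collectd log.
# Assumption: every line starts with a "[YYYY-MM-DD HH:MM:SS]" timestamp,
# as in the log excerpt from this article.
count_broken_pipe() {
    # $1: path to the log file, for example collectd.log
    grep -F 'failed: Broken pipe' "$1" \
        | awk '{ print substr($1, 2) " " substr($2, 1, 5) }' \
        | sort \
        | uniq -c
}
```

Running count_broken_pipe collectd.log prints one line per minute with the number of failures logged in that minute, which may help to show whether the errors return after a restart.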
- The container health status becomes unhealthy:
[root@overcloud-controller-0 collectd]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e752df371daa satellite.localdomain:5000/is_linux-production-openstack-osp13_containers-collectd:13.0-137.1608222731 "dumb-init --singl..." 6 days ago Up 6 days (unhealthy) collectd
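To pick out unhealthy containers without reading the full docker ps table, the status column can be filtered. The following is a minimal sketch; the function name list_unhealthy is hypothetical, and it assumes container names contain no spaces and that the status text contains "(unhealthy)" as in the output above.

```shell
# Print the names of containers whose status contains "(unhealthy)".
# Reads "<name> <status...>" lines on stdin, as produced by:
#   docker ps --format '{{.Names}} {{.Status}}'
list_unhealthy() {
    awk '/\(unhealthy\)/ { print $1 }'
}
```

On a node, this could be used as docker ps --format '{{.Names}} {{.Status}}' | list_unhealthy. For a single container, docker inspect --format '{{.State.Health.Status}}' collectd reports only the health status.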
- We then tried restarting the collectd container manually. At first, it reported that the plugins loaded successfully:
[root@overcloud-controller-0 collectd]# docker restart collectd
collectd
[root@overcloud-controller-0 collectd]# tail -20 collectd.log
[2021-01-22 15:25:28] plugin_load: plugin "cpu" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "df" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "disk" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "ethstat" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "hugepages" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "interface" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "load" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "memory" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "processes" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "sysevent" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "tcpconns" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "unixsock" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "uptime" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "write_http" successfully loaded.
[2021-01-22 15:25:28] plugin_load: plugin "write_kafka" successfully loaded.
[2021-01-22 15:25:28] unixsock plugin: Successfully deleted socket file "/var/run/collectd-socket".
[2021-01-22 15:25:28] Initialization complete, entering read-loop.
[2021-01-22 15:25:28] tcpconns plugin: Reading from netlink succeeded. Will use the netlink method from now on.
[2021-01-22 15:25:28] write_kafka plugin: created KAFKA handle : rdkafka#producer-1
[2021-01-22 15:25:28] write_kafka plugin: handle created for topic : nfv-collectd-events
- However, after a few minutes, it threw the same errors as before.
Environment
- Red Hat OpenStack Platform 13.0 (RHOSP)