After upgrade to v5.4.2, collector pods keep restarting with OOMKilled
Environment
- Red Hat OpenShift Service on AWS
- OpenShift Container Platform 4.9+
- Logging 5.4.2
Issue
- After upgrading to v5.4.2, collector pods keep restarting with OOMKilled if CloudWatch forwarding is configured.
- After upgrading to v5.4.2, collector pod memory usage spikes well above its previous level if CloudWatch forwarding is configured. A quick way to check whether a cluster forwards to CloudWatch follows this list.
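As a quick check for whether a cluster is affected, you can look for a CloudWatch output in the forwarder spec. This is a minimal sketch assuming the default ClusterLogForwarder instance name ("instance") in the openshift-logging namespace:

# Print any CloudWatch outputs defined in the ClusterLogForwarder
$ oc get clusterlogforwarder instance -n openshift-logging -o yaml | grep -B1 -A3 'type: cloudwatch'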
Resolution
- Upgrade the Logging subsystem to v5.4.3 or later to fix the issue. One way to check the installed version and approve the update is sketched below.
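A minimal sketch, assuming the Logging operator is managed by OLM in the openshift-logging namespace; resource names, channels, and the approval mode may differ in your cluster:

# Check the currently installed Logging version
$ oc get csv -n openshift-logging

# If the Subscription uses Manual approval, approve the pending
# InstallPlan that carries the 5.4.3+ update (name is cluster-specific)
$ oc get installplan -n openshift-logging
$ oc patch installplan <installplan-name> -n openshift-logging --type merge -p '{"spec":{"approved":true}}'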
Root Cause
- As of v5.4.2, the collector (fluentd) ran with chunk backup enabled: the "disable_chunk_backup" buffer option was no longer set, so fluentd writes chunks it cannot deliver to backup files. Handling the backup file I/O and the buffers around it drives up memory usage. Chunk backup is disabled again in v5.4.3, which fixes the issue.
- This typically affects environments with a CloudWatch output configured in ClusterLogForwarder: CloudWatch enforces a hard limit of 256 KB per log event, so every event above that limit is treated as unrecoverable and written to a backup chunk while chunk backup is enabled. A way to check whether "disable_chunk_backup" is set in a running collector is sketched after this list.
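A minimal sketch for inspecting the rendered fluentd configuration inside a running collector pod; it assumes the config is mounted under /etc/fluent/, the pods carry the component=collector label, and the container is named collector, any of which may vary by release:

# Pick any collector pod
$ oc get pods -n openshift-logging -l component=collector

# Look for disable_chunk_backup in the rendered fluentd config;
# no match means chunk backup is enabled and bad chunks are
# written out to the backup directory
$ oc exec -n openshift-logging <collector-pod> -c collector -- grep -r disable_chunk_backup /etc/fluent/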
Diagnostic Steps
- With a CloudWatch forwarding configuration, and while chunk backup is enabled, the collector pod logs the following messages for any log event larger than 256 KB. These messages indicate that oversized events are being written to chunk backup files, which is what drives the increased memory usage. One way to check for them is sketched after the log excerpt.
[warn]: got unrecoverable error in primary and no secondary error_class=Fluent::Plugin::CloudwatchLogsOutput::TooLargeEventError error="Log event in /dxpf/cont/rosa/audit.audit is discarded because it is too large: 12345 bytes exceeds limit of 262144"
:
[warn]: bad chunk is moved to /tmp/fluent/backup/worker0/object_xxxxx.log
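The following commands confirm the symptom; they assume collector pods labeled component=collector and a container named collector, both of which may vary by release:

# Check for restarts and OOMKilled terminations
$ oc get pods -n openshift-logging -l component=collector
$ oc describe pod <collector-pod> -n openshift-logging | grep -A3 'Last State'

# Search the collector logs for the warnings shown above
$ oc logs <collector-pod> -n openshift-logging -c collector | grep -E 'TooLargeEventError|bad chunk'

# Compare current memory usage against the pod's limits
$ oc adm top pods -n openshift-logging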