Seeing occurrences of high read I/O on docker dedicated device. Need some help with RCA

Solution In Progress - Updated 2024-06-14T16:27:23+00:00

Environment

OpenShift Container Platform 3.4.0
Various applications, including kibana-proxy

Issue

We are seeing incidents of high read I/O (150 MB/s or more) on the dedicated docker storage device across different nodes in different environments. Troubleshooting points to a number of different pods, so no single pod is consistently responsible. The incidents occur frequently, but not at specific times.

Resolution

In later OpenShift versions, the kibana-proxy memory limit was increased from 96M to 256M to avoid OOM conditions.
Other applications should be evaluated for their memory usage and requirements.
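
One rough way to evaluate an application is to compare its peak memory usage with its configured limit. The following Python sketch is an illustration only, not part of the original analysis. It assumes the peak and the limit are already known in bytes (on a cgroup v1 node they would typically come from memory.max_usage_in_bytes and memory.limit_in_bytes). The function name, the 0.9 threshold and the 90 MiB peak are hypothetical, and 96M and 256M are treated as MiB for simplicity.

```python
MIB = 1024 * 1024


def near_limit(peak_bytes, limit_bytes, threshold=0.9):
    """Return True if peak usage is at or above `threshold` of the limit.

    The 0.9 default is an arbitrary illustrative cut-off, not a
    recommended value.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return peak_bytes / limit_bytes >= threshold


# Hypothetical peak usage for a container; not a measured value.
peak = 90 * MIB

# The two kibana-proxy limits mentioned above, assumed to be MiB.
flag_old = near_limit(peak, 96 * MIB)
flag_new = near_limit(peak, 256 * MIB)

print("near 96M limit: ", flag_old)
print("near 256M limit:", flag_new)
```

A container that is flagged under its current limit is a candidate for a closer look at its actual memory requirements before the limit is changed.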

Root Cause

This appears to be a symptom of containers reaching their memory limit.

Analysis of the VMCore captured when kibana-proxy reached this state shows the following.

The process was gradually allocating 'anonymous memory' (data and stack). As the memory allocation got close to the cgroup's limit, the process began to reclaim pages that contain cached file data. Those were the only kind of pages that could be reclaimed because - due to lack of a swap device - it was impossible to swap out and reclaim modified (dirty) 'anonymous memory' pages. After a while the process got to a point where it was only possible to allocate more pages to data and stack if cached pages from the '/usr/bin/node' binary image were reclaimed aggressively. Eventually the number of pages from the '/usr/bin/node' binary image that could be kept cached in memory got so small that the likelihood of a page fault during instruction fetch increased significantly. In order to handle the page fault it was necessary to reclaim pages that contain cached file data - probably from the '/usr/bin/node' binary image again - so the process ended up thrashing while trying to fault in pages from the binary image.

This would explain the simultaneous increase in page faults and disk read I/O.
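
The pattern described above (anonymous memory close to the limit, very little file cache left, and a rising major page fault count) can in principle be observed from the container's cgroup counters. The following Python sketch is an illustration under stated assumptions, not the method used in the original analysis. It assumes cgroup v1 memory.stat content, which exposes the `rss`, `cache` and `pgmajfault` fields, sampled twice. The function names and all sample values are hypothetical.

```python
def parse_memory_stat(text):
    """Parse cgroup v1 memory.stat text ("name value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def thrash_indicators(before, after, interval_s, limit_bytes):
    """Summarise two memory.stat samples taken interval_s seconds apart."""
    return {
        # Share of the limit used by anonymous memory (data and stack).
        "anon_fraction_of_limit": after["rss"] / limit_bytes,
        # Share of the limit still available to cached file data.
        "cache_fraction_of_limit": after["cache"] / limit_bytes,
        # Major faults need disk reads, so this rate tracks read I/O.
        "major_faults_per_s":
            (after["pgmajfault"] - before["pgmajfault"]) / interval_s,
    }


# Made-up samples for a container with a 96 MiB limit, taken 10 s apart.
LIMIT = 96 * 1024 * 1024
sample_before = parse_memory_stat(
    "cache 2097152\nrss 96468992\npgmajfault 1000\n"
)
sample_after = parse_memory_stat(
    "cache 1048576\nrss 98566144\npgmajfault 31000\n"
)

result = thrash_indicators(sample_before, sample_after, 10, LIMIT)
for name, value in result.items():
    print(name, round(value, 3))
```

In these made-up samples, anonymous memory occupies almost the whole limit while the cache share shrinks and major faults climb, which is the combination the root cause describes.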

Diagnostic Steps

A VMCore was captured while the kibana-proxy container was at its memory limit, before it was OOM-killed.

