Ceph Release 2.5 (10.2.10-28.el7cp) with VMs under high iowait (>20%)

Dear community,

I have a cloud solution running on a relatively old Ceph release, and I am seeing high iowait on any VM that does even a moderate amount of disk writes.

The solution consists of many servers whose local disks are shared with the guests over Ceph 2.5.

Some examples (iostat -x output captured inside one of the VMs, at 2-second intervals) are:

03/05/21 14:52:04
avg-cpu: %user %nice %system %iowait %steal %idle
2.95 0.00 1.80 31.19 0.13 63.93

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 12.50 0.00 1.00 0.00 190.00 380.00 0.89 1.00 0.00 1.00 889.50 88.95
vda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
vda2 0.00 12.50 0.00 1.00 0.00 190.00 380.00 0.00 1.00 0.00 1.00 1.00 0.10
vda3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
vdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
vdc 0.00 11.50 0.00 2.50 0.00 56.00 44.80 0.01 2.40 0.00 2.40 0.60 0.15
scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

03/05/21 14:52:06
avg-cpu: %user %nice %system %iowait %steal %idle
0.64 0.00 1.03 24.20 0.13 74.00

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.50 0.00 10.50 0.00 44.00 8.38 0.53 0.00 0.00 0.00 50.05 52.55
vda1 0.00 0.00 0.00 3.50 0.00 14.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
vda2 0.00 0.50 0.00 6.50 0.00 28.00 8.62 0.00 0.00 0.00 0.00 0.00 0.00
vda3 0.00 0.00 0.00 0.50 0.00 2.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
vdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
vdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

03/05/21 14:52:08
avg-cpu: %user %nice %system %iowait %steal %idle
1.29 0.00 1.16 29.50 0.39 67.66

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 28.50 0.00 92.50 0.00 1398.00 30.23 0.17 108.51 0.00 108.51 1.21 11.20
vda1 0.00 1.50 0.00 1.50 0.00 12.00 16.00 0.00 1880.67 0.00 1880.67 0.33 0.05
vda2 0.00 0.00 0.00 0.50 0.00 2.00 8.00 0.00 2406.00 0.00 2406.00 0.00 0.00
vda3 0.00 27.00 0.00 68.00 0.00 1384.00 40.71 0.07 80.25 0.00 80.25 0.24 1.65
vdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
vdc 0.00 0.00 0.00 0.50 0.00 2.00 8.00 0.00 2481.00 0.00 2481.00 0.00 0.00
scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

03/05/21 14:52:10
avg-cpu: %user %nice %system %iowait %steal %idle
5.87 0.00 3.79 4.44 0.65 85.25

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 6.50 0.00 359.00 0.00 3576.00 19.92 0.61 3.14 0.00 3.14 0.04 1.30
vda1 0.00 0.00 0.00 0.50 0.00 2.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
vda2 0.00 4.50 0.00 1.00 0.00 22.00 44.00 0.00 3.00 0.00 3.00 0.50 0.05
vda3 0.00 2.00 0.00 300.00 0.00 3552.00 23.68 0.61 3.18 0.00 3.18 0.04 1.25
vdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
vdc 0.00 3.50 0.00 1.00 0.00 18.00 36.00 0.00 2.00 0.00 2.00 0.00 0.00
scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Notice that even with a fairly low number of writes per second, the iowait oscillates a lot and reaches high values.
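
To take the guest filesystem and application out of the equation, I am thinking of measuring write latency directly with fio from inside the VM. The job name, file path, size and runtime below are just placeholders I picked, not anything prescribed:

# run inside the VM: 4k sequential writes, O_DIRECT, queue depth 1
fio --name=guest-write-test --filename=/var/tmp/fio-test.bin --size=256M \
    --bs=4k --rw=write --ioengine=libaio --iodepth=1 --direct=1 \
    --runtime=30 --time_based --group_reporting

If the completion latencies reported by fio show the same spikes as the await values above, that would confirm the slowness is below the guest rather than in the workload itself.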

On the compute nodes we apparently do not see high iowait on the physical disks, which suggests the delay is somewhere in the Ceph layer itself, for example in its own caching.
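
On the Ceph side, the first things I was planning to look at are the per-OSD commit/apply latencies and the OSD perf counters. The osd.0 id below is only an example:

# cluster-wide health and per-OSD latency overview (run on a monitor/admin node)
ceph -s
ceph health detail
ceph osd perf            # fs_commit_latency / fs_apply_latency per OSD

# detailed counters for a single OSD (run on the node hosting that OSD)
ceph daemon osd.0 perf dump

If the commit/apply latencies spike at the same moments the guests show high iowait, that would point at the OSD side (journal, disks or network) rather than at the clients.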

Any suggestions on how to troubleshoot this condition on Ceph?
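
In case it is relevant, this is the kind of client-side RBD cache configuration I intend to double-check on the compute nodes. The values shown are just the usual defaults for illustration, not necessarily what is deployed here:

# /etc/ceph/ceph.conf on the compute (client) nodes
[client]
    rbd cache = true
    rbd cache writethrough until flush = true
    rbd cache size = 33554432          # 32 MB, default
    rbd cache max dirty = 25165824     # 24 MB, default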

Responses