OSDs flap and report slow requests when applying a test load in a Ceph cluster
Issue
- While load testing a Ceph cluster with COSBench, the OSDs time out or log "wrongly marked me down" messages.
- The tests are run against the pools used for RGW.
- The benchmark is executed on the RGW pool, which consists of 8 TB SATA disks.
- This started recently; the only change has been the addition of objects to the pools, and not a dramatic increase at that.
- The slowness is not evident when the files are copied to a new folder on the same disk, but the problem appears when the files are copied along with their extended attributes (xattrs).
- Some baseline testing:
# dd if=/dev/zero of=here bs=4M count=256 oflag=direct && dd if=here of=/dev/null bs=4M count=256 && rm -f here
256+0 records in
256+0 records out
1073741824 bytes (1.1 GB) copied, 7.54084 s, 142 MB/s
256+0 records in
256+0 records out
1073741824 bytes (1.1 GB) copied, 5.52354 s, 194 MB/s
# dd if=/dev/zero of=here bs=4M count=256 oflag=direct && dd if=here of=/dev/null bs=4M count=256 && rm -f here
256+0 records in
256+0 records out
1073741824 bytes (1.1 GB) copied, 7.38176 s, 145 MB/s
256+0 records in
256+0 records out
1073741824 bytes (1.1 GB) copied, 5.21124 s, 206 MB/s
# dd if=/dev/zero of=here bs=4M count=256 oflag=direct && dd if=here of=/dev/null bs=4M count=256 && rm -f here
256+0 records in
256+0 records out
1073741824 bytes (1.1 GB) copied, 7.49258 s, 143 MB/s
256+0 records in
256+0 records out
1073741824 bytes (1.1 GB) copied, 5.1737 s, 208 MB/s
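The xattr observation above can be checked outside Ceph with a quick copy comparison on the same disk. This is a hedged sketch, not part of the original report: the paths are placeholders, `setfattr` comes from the `attr` package on RHEL, and the xattr step is skipped if the filesystem does not support `user.*` attributes.

```shell
# Compare a plain copy against an xattr-preserving copy of the same file.
dir=$(mktemp -d)
dd if=/dev/zero of="$dir/src" bs=1M count=16 2>/dev/null

# Attach a test xattr; ignore failure on filesystems without user.* support.
setfattr -n user.demo -v test "$dir/src" 2>/dev/null || true

time cp "$dir/src" "$dir/plain"                   # data only
time cp --preserve=xattr "$dir/src" "$dir/xcopy"  # data + extended attributes
```

If the xattr-preserving copy is consistently slower, that points at xattr reads on the source filesystem rather than at raw disk throughput, which matches the symptom described above.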
Environment
- Red Hat Ceph Storage 1.3