How to Use 'fio' to Check Etcd Disk Performance in OCP
Issue
- etcd is sensitive to disk latency, so it is often necessary to verify that etcd can write to its backing storage fast enough for production workloads.
- etcd alerts in the web console, or frequent error messages such as the following, may suggest that writes are taking too long:
2020-10-21T09:56:00.246667768Z 2020-10-21 09:56:00.246542 W | etcdserver: read-only range request "key:\"/kubernetes.io/serviceaccounts/openshift-kube-scheduler/localhost-recovery-client\" " with result "range_response_count:1 size:407" took too long (113.372697ms) to execute
- The etcd performance documentation suggests that, for production workloads, the p99 of wal_fsync_duration_seconds should be less than 10ms to confirm that the disk is reasonably fast.
- Depending on the severity of the disk speed issues, the impact can range from frequent alerting to overall cluster instability.
- For more general information regarding infrastructure requirements, please see etcd backend performance requirements.
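To make the 10ms p99 guideline for wal_fsync_duration_seconds concrete, the following is a minimal sketch in Python of how a set of fsync latency measurements could be checked against it. The sample values, the function names, and the nearest-rank percentile method are assumptions made for illustration; they are not taken from etcd, fio, or OpenShift tooling.

```python
import math

# Guideline from the etcd performance documentation referenced above:
# the p99 of wal_fsync_duration_seconds should be below 10 ms.
FSYNC_P99_LIMIT_SECONDS = 0.010


def percentile(samples, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # 1-based nearest rank: smallest rank covering pct percent of samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def disk_fast_enough(fsync_seconds, limit=FSYNC_P99_LIMIT_SECONDS):
    """Return True if the p99 fsync latency is below the limit."""
    return percentile(fsync_seconds, 99) < limit


# Hypothetical fsync latencies in seconds (illustrative values only).
samples = [0.002] * 98 + [0.004, 0.030]
print(percentile(samples, 99), disk_fast_enough(samples))
```

A single slow outlier does not fail the check here, because p99 ignores the slowest 1% of samples. A disk whose slow fsyncs make up more than 1% of samples would fail it.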
Environment
- Red Hat OpenShift Container Platform (RHOCP, OCP)
- 3.11
- 4