How to Use 'fio' to Check Etcd Disk Performance in OCP

Solution Verified

Issue

  • etcd is sensitive to disk latency, so it is often necessary to confirm that its writes to backing storage are fast enough for production workloads.
  • etcd alerts in the web console, or frequent error messages such as the following, may suggest that disk operations are taking too long:

    2020-10-21T09:56:00.246667768Z 2020-10-21 09:56:00.246542 W | etcdserver: read-only range request "key:\"/kubernetes.io/serviceaccounts/openshift-kube-scheduler/localhost-recovery-client\" " with result  "range_response_count:1 size:407" took too long (113.372697ms) to execute
    
  • The etcd performance documentation suggests that, for production workloads, the 99th percentile (p99) of wal_fsync_duration_seconds should be less than 10 ms, which indicates that the disk is reasonably fast.

  • Depending on the severity of the disk speed issues, the impact can range from frequent alerting to overall cluster instability.
  • For more general information regarding infrastructure requirements, please see etcd backend performance requirements.
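
As a rough illustration of the two signals described above, the following Python sketch extracts the reported duration from a "took too long" warning and checks a set of fsync latency samples against the 10 ms p99 guideline. It is a minimal sketch, not part of etcd or OpenShift tooling: the function names are hypothetical, the regular expression assumes durations are logged in milliseconds as in the sample message, and the nearest-rank percentile method is an assumption about how p99 is computed.

```python
import re

# Matches the duration in etcd "took too long" warnings, e.g. "(113.372697ms)".
# Assumption: the duration is reported in milliseconds, as in the sample above.
TOOK_TOO_LONG = re.compile(r"took too long \((\d+(?:\.\d+)?)ms\)")

# Guideline quoted above: p99 of WAL fsync duration should be under 10 ms.
P99_FSYNC_LIMIT_MS = 10.0


def slow_request_ms(log_line):
    """Return the reported duration in ms, or None if the line does not match."""
    match = TOOK_TOO_LONG.search(log_line)
    return float(match.group(1)) if match else None


def p99(samples_ms):
    """Nearest-rank 99th percentile of a non-empty list of durations (ms)."""
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.99 * n), which avoids floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def fsync_fast_enough(samples_ms):
    """True when the p99 of the fsync samples is below the 10 ms guideline."""
    return p99(samples_ms) < P99_FSYNC_LIMIT_MS
```

Here `slow_request_ms` can be applied to each line of an etcd container log to collect slow-request durations, and `fsync_fast_enough` can be given latency samples from any benchmark or metrics source that reports fsync times in milliseconds.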

Environment

  • Red Hat OpenShift Container Platform (RHOCP, OCP)
    • 3.11
    • 4
