A 'ceph pg scrub' resulted in a read error on one of the OSDs. What does it mean?
Environment
- Inktank Ceph Enterprise 1.1
Issue
- A 'ceph pg scrub <pg.id>' resulted in a read error on one of the OSDs.
- A subsequent 'ceph health detail' shows scrub errors:
# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 40.17ae is active+clean+inconsistent, acting [145,107,101]
1 scrub errors
- A look into the logs of the acting OSDs (145, 107, 101) shows a lossy connection:
# cat ceph-osd.101.log | grep 40.17ae
2015-04-27 16:47:29.635290 7f60d84bc700 0 -- AA.BB.CC.DD:6800/4027434 submit_message osd_sub_op(unknown.0.0:0 40.17ae 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) v11 remote, EE.FF.GG.HH:6808/10555, failed lossy con, dropping message 0x17ac1800
2015-04-27 18:30:39.344752 7f60d8cbd700 0 -- AA.BB.CC.DD:6820/5027434 submit_message osd_sub_op(unknown.0.0:0 40.17ae 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) v11 remote, EE.FF.GG.HH:6808/10555, failed lossy con, dropping message 0x2ed83800
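- Note: the remote peer in the 'failed lossy con' messages can be identified by matching its address against the OSD map; the address used here is the placeholder from the log lines above:
# ceph osd dump | grep 'EE.FF.GG.HH:6808'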
Resolution
- From the OSD logs, the scrub error appears to have been caused by a hardware problem.
- Running a hardware check, along with using 'smartd', is suggested. To learn more about smartd, see https://access.redhat.com/solutions/1456.
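- As a starting point for that check, the drive backing the suspect OSD can be inspected with smartctl; the device name /dev/sdX below is only a placeholder and must be replaced with the actual device for that OSD:
# smartctl -H /dev/sdX
# smartctl -a /dev/sdX
The kernel log on the OSD host is also worth checking for I/O errors:
# dmesg | grep -i 'I/O error'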
- To bring the Ceph cluster back to an active, clean state as quickly as possible, the affected OSD can be marked out so that its placement groups (PGs) are remapped to other OSDs.
- To mark the OSD out, execute:
# ceph osd out {osd.number}
- Once the cluster recovers and reaches a clean state, run the scrub again to confirm that the data is indeed correct.
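- For illustration, if osd.101 (one of the acting OSDs above) turns out to be the one on the failing disk, the sequence would look like this; the OSD ID is only an assumption based on the log excerpt above:
# ceph osd out 101
# ceph -w
Watch 'ceph -w' (or 'ceph health') until the cluster reports a clean state again, then re-run the scrub and re-check the health:
# ceph pg scrub 40.17ae
# ceph health detail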
Root Cause
- Based on the scrub output and the OSD logs, a hardware error appears to be involved here.