When a hard disk is removed to simulate a real-world failure, the device label and the lsscsi identifier change upon reinsertion.
Issue
- When a hard disk is removed to simulate a real-world failure, the device label and the lsscsi identifier change upon reinsertion.
- Steps to reproduce (containerized Ceph): remove a drive, wait 10 minutes, then replace the drive.
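For reference, the drive pull can also be approximated in software by removing and rescanning the SCSI device through sysfs. This is only a minimal sketch: /dev/sde and SCSI host host1 are assumptions, so substitute the device and host that actually back the OSD under test.
# Assumption: /dev/sde is the disk backing the OSD under test, attached to host1.
echo 1 > /sys/block/sde/device/delete            # remove the SCSI device (simulated pull)
sleep 600                                        # wait ~10 minutes, as in the test above
echo "- - -" > /sys/class/scsi_host/host1/scan   # rescan the host to rediscover the disk
lsscsi                                           # check which HCTL the disk came back with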
[stack@undercloud ~]$ ssh heat-admin@10.10.10.10
Warning: Permanently added '10.10.10.10' (ECDSA) to the list of known hosts.
Last login: Fri Aug 23 14:08:41 2019 from 10.10.10.1
[heat-admin@overcloud-controller-0 ~]$ ceph -w
  cluster:
    id:     b5c2048c-9cab-11e9-8001-525400e2af01
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum overcloud-controller-0,overcloud-controller-1,overcloud-controller-2
    mgr: overcloud-controller-2(active), standbys: overcloud-controller-0, overcloud-controller-1
    osd: 24 osds: 24 up, 24 in
    rgw: 3 daemons active

  data:
    pools:   10 pools, 928 pgs
    objects: 7.21k objects, 16.6GiB
    usage:   68.5GiB used, 34.9TiB / 34.9TiB avail
    pgs:     928 active+clean

  io:
    client:  340B/s wr, 0op/s rd, 0op/s wr
2019-08-23 14:16:58.654605 mon.overcloud-controller-0 [INF] osd.13 failed (root=default,host=overcloudszq-cephstorage-0) (connection refused reported by osd.9)
2019-08-23 14:16:58.885966 mon.overcloud-controller-0 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-08-23 14:16:59.914237 mon.overcloud-controller-0 [WRN] Health check failed: Reduced data availability: 3 pgs inactive, 32 pgs peering (PG_AVAILABILITY)
2019-08-23 14:17:01.903343 mon.overcloud-controller-0 [WRN] Health check failed: Degraded data redundancy: 422/21630 objects degraded (1.951%), 44 pgs degraded (PG_DEGRADED)
2019-08-23 14:17:06.763916 mon.overcloud-controller-0 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 5 pgs inactive, 57 pgs peering)
2019-08-23 14:17:11.832355 mon.overcloud-controller-0 [WRN] Health check update: Degraded data redundancy: 682/21630 objects degraded (3.153%), 89 pgs degraded (PG_DEGRADED)
2019-08-23 14:17:59.994652 mon.overcloud-controller-0 [WRN] Health check update: Degraded data redundancy: 682/21630 objects degraded (3.153%), 89 pgs degraded, 117 pgs undersized (PG_DEGRADED)
2019-08-23 14:27:01.896159 mon.overcloud-controller-0 [INF] Marking osd.13 out (has been down for 602 seconds)
2019-08-23 14:27:01.896497 mon.overcloud-controller-0 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-08-23 14:27:05.926034 mon.overcloud-controller-0 [WRN] Health check update: Degraded data redundancy: 538/21630 objects degraded (2.487%), 52 pgs degraded, 67 pgs undersized (PG_DEGRADED)
2019-08-23 14:27:10.996651 mon.overcloud-controller-0 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 229/21630 objects degraded (1.059%), 6 pgs degraded)
2019-08-23 14:27:10.996686 mon.overcloud-controller-0 [INF] Cluster is now healthy
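Once the disk is back, the device currently backing osd.13 can be cross-checked from the monitor and on the storage node. This is a rough sketch: the container name ceph-osd-13 is an assumption, since OSD containers may be named differently in a given deployment, and ceph-volume only reports LVM-based OSDs.
# From a monitor node: the device metadata Ceph recorded for osd.13
ceph osd metadata 13 | grep -E 'devices|dev_node'
# On the Ceph storage node: list OSD devices as ceph-volume sees them
# (container name ceph-osd-13 is an assumption; adjust to your naming scheme)
docker exec ceph-osd-13 ceph-volume lvm list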
- Expected result: upon reinsertion of the drive, or insertion of a new drive, the OSD ID, the device label, and the lsscsi HCTL placement remain the same.
- The problem can be summarized as follows. We have a given set of disks already deployed:
[root@overcloud-cephstorage-0 heat-admin]# lsscsi
[0:0:0:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sda
[0:0:1:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sdb
[1:0:0:0] disk LENOVO-X HUSMM1620ASS20 K4CC /dev/sdc
[1:0:1:0] disk LENOVO-X HUSMM1620ASS20 K4CC /dev/sdd
[1:0:2:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sde
[1:0:3:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sdf
[1:0:5:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sdh
[1:0:6:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sdi
[1:0:7:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sdj
[1:0:11:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sdg
However, if we remove [1:0:2:0] and insert it back into the same physical slot, it comes back as [1:0:12:0].
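The HCTL shown by lsscsi and the /dev/sdX node are assigned at discovery time and are not guaranteed to be stable across a hot-remove and re-insert, which is consistent with the jump from [1:0:2:0] to [1:0:12:0] above. The persistent links under /dev/disk survive the re-enumeration and can be used to correlate the physical slot with the disk. A minimal sketch, with /dev/sde used only as an example device:
ls -l /dev/disk/by-path/ | grep -w sde          # path/slot-based name, stable per physical location
ls -l /dev/disk/by-id/ | grep -w sde            # WWN/serial-based name, stable per physical disk
udevadm info --query=symlink --name=/dev/sde    # all udev symlinks for the device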
Environment
- Red Hat OpenStack Platform 13.0 (RHOSP)