When a hard disk is removed to simulate a real-world failure, the device label and the lsscsi identifier change upon reinsertion.
Issue
- When a hard disk is removed to simulate a real-world failure, the device label and the lsscsi identifier change upon reinsertion.
- Steps to reproduce (containerized Ceph): remove a drive, wait 10 minutes, then replace the drive.
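For reference, the drive pull can also be approximated in software by removing and rescanning the SCSI device through sysfs. This is only a minimal sketch: /dev/sde and SCSI host host1 are assumptions, so substitute the device and host that actually back the OSD under test.
# Assumption: /dev/sde is the disk backing the OSD under test, attached to host1.
echo 1 > /sys/block/sde/device/delete            # remove the SCSI device (simulated pull)
sleep 600                                        # wait ~10 minutes, as in the test above
echo "- - -" > /sys/class/scsi_host/host1/scan   # rescan the host to rediscover the disk
lsscsi                                           # check which HCTL the disk came back with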
[stack@undercloud ~]$ ssh heat-admin@10.10.10.10
Warning: Permanently added '10.10.10.10' (ECDSA) to the list of known hosts.
Last login: Fri Aug 23 14:08:41 2019 from 10.10.10.1
[heat-admin@overcloud-controller-0 ~]$ ceph -w
  cluster:
    id:     b5c2048c-9cab-11e9-8001-525400e2af01
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum overcloud-controller-0,overcloud-controller-1,overcloud-controller-2
    mgr: overcloud-controller-2(active), standbys: overcloud-controller-0, overcloud-controller-1
    osd: 24 osds: 24 up, 24 in
    rgw: 3 daemons active

  data:
    pools:   10 pools, 928 pgs
    objects: 7.21k objects, 16.6GiB
    usage:   68.5GiB used, 34.9TiB / 34.9TiB avail
    pgs:     928 active+clean

  io:
    client:  340B/s wr, 0op/s rd, 0op/s wr
2019-08-23 14:16:58.654605 mon.overcloud-controller-0 [INF] osd.13 failed (root=default,host=overcloudszq-cephstorage-0) (connection refused reported by osd.9)
2019-08-23 14:16:58.885966 mon.overcloud-controller-0 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-08-23 14:16:59.914237 mon.overcloud-controller-0 [WRN] Health check failed: Reduced data availability: 3 pgs inactive, 32 pgs peering (PG_AVAILABILITY)
2019-08-23 14:17:01.903343 mon.overcloud-controller-0 [WRN] Health check failed: Degraded data redundancy: 422/21630 objects degraded (1.951%), 44 pgs degraded (PG_DEGRADED)
2019-08-23 14:17:06.763916 mon.overcloud-controller-0 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 5 pgs inactive, 57 pgs peering)
2019-08-23 14:17:11.832355 mon.overcloud-controller-0 [WRN] Health check update: Degraded data redundancy: 682/21630 objects degraded (3.153%), 89 pgs degraded (PG_DEGRADED)
2019-08-23 14:17:59.994652 mon.overcloud-controller-0 [WRN] Health check update: Degraded data redundancy: 682/21630 objects degraded (3.153%), 89 pgs degraded, 117 pgs undersized (PG_DEGRADED)
2019-08-23 14:27:01.896159 mon.overcloud-controller-0 [INF] Marking osd.13 out (has been down for 602 seconds)
2019-08-23 14:27:01.896497 mon.overcloud-controller-0 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-08-23 14:27:05.926034 mon.overcloud-controller-0 [WRN] Health check update: Degraded data redundancy: 538/21630 objects degraded (2.487%), 52 pgs degraded, 67 pgs undersized (PG_DEGRADED)
2019-08-23 14:27:10.996651 mon.overcloud-controller-0 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 229/21630 objects degraded (1.059%), 6 pgs degraded)
2019-08-23 14:27:10.996686 mon.overcloud-controller-0 [INF] Cluster is now healthy
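Once the disk is back, the device currently backing osd.13 can be cross-checked from the monitor and on the storage node. This is a rough sketch: the container name ceph-osd-13 is an assumption, since OSD containers may be named differently in a given deployment, and ceph-volume only reports LVM-based OSDs.
# From a monitor node: the device metadata Ceph recorded for osd.13
ceph osd metadata 13 | grep -E 'devices|dev_node'
# On the Ceph storage node: list OSD devices as ceph-volume sees them
# (container name ceph-osd-13 is an assumption; adjust to your naming scheme)
docker exec ceph-osd-13 ceph-volume lvm list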
- Expected result: upon reinsertion of the drive, or insertion of a new drive, the OSD ID, the device label, and the lsscsi HCTL placement remain the same.
- The problem can be summarized as follows. We have a given set of disks already deployed:
[root@overcloud-cephstorage-0 heat-admin]# lsscsi
[0:0:0:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sda
[0:0:1:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sdb
[1:0:0:0] disk LENOVO-X HUSMM1620ASS20 K4CC /dev/sdc
[1:0:1:0] disk LENOVO-X HUSMM1620ASS20 K4CC /dev/sdd
[1:0:2:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sde
[1:0:3:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sdf
[1:0:5:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sdh
[1:0:6:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sdi
[1:0:7:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sdj
[1:0:11:0] disk LENOVO-X HUSMM1616ASS20 K4CC /dev/sdg
However, if we remove [1:0:2:0] and insert it back into the same physical slot, it comes back as [1:0:12:0].
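The HCTL shown by lsscsi and the /dev/sdX node are assigned at discovery time and are not guaranteed to be stable across a hot-remove and re-insert, which is consistent with the jump from [1:0:2:0] to [1:0:12:0] above. The persistent links under /dev/disk survive the re-enumeration and can be used to correlate the physical slot with the disk. A minimal sketch, with /dev/sde used only as an example device:
ls -l /dev/disk/by-path/ | grep -w sde          # path/slot-based name, stable per physical location
ls -l /dev/disk/by-id/ | grep -w sde            # WWN/serial-based name, stable per physical disk
udevadm info --query=symlink --name=/dev/sde    # all udev symlinks for the device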
Environment
- Red Hat OpenStack Platform 13.0 (RHOSP)