Ceph full, nova services reported as DOWN and instances cannot be created or deleted in Red Hat OpenStack Platform
Issue
Ceph full, nova services reported as DOWN and instances cannot be deleted in Red Hat OpenStack Platform
It's impossible to SSH into running instances. The port is open, but SSH times out on connection attempts.
All compute nodes are reported as down. Restarting the compute services brings them back up, but they go back into the down state after a few minutes.
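On OSP 8 the restart described above would typically be done as follows; the systemd unit name openstack-nova-compute is assumed from a standard RHEL 7 director deployment:

```shell
# Hedged sketch: bounce nova-compute on one compute node and check whether
# it reports as up again. Unit name assumed (OSP 8 on RHEL 7).
restart_and_check() {
    sudo systemctl restart openstack-nova-compute
    sudo systemctl status openstack-nova-compute --no-pager
    # From the undercloud, re-check the service state afterwards:
    # nova service-list --binary nova-compute
}
```

Running restart_and_check on an affected compute node reproduces the behavior described above: the service comes back up, then drops to down again within minutes.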
[stack@undercloud ~]$ nova service-list
+-----+------------------+------------------------------------+-------------+---------+-------+----------------------------+-----------------+
| Id | Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
+-----+------------------+------------------------------------+-------------+---------+-------+----------------------------+-----------------+
| 2 | nova-scheduler | overcloud-controller-0.localdomain | internal | enabled | up | 2016-11-23T18:28:02.000000 | - |
| 5 | nova-scheduler | overcloud-controller-1.localdomain | internal | enabled | up | 2016-11-23T18:28:02.000000 | - |
| 8 | nova-scheduler | overcloud-controller-2.localdomain | internal | enabled | up | 2016-11-23T18:27:59.000000 | - |
| 11 | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up | 2016-11-23T18:27:59.000000 | - |
| 14 | nova-consoleauth | overcloud-controller-1.localdomain | internal | enabled | up | 2016-11-23T18:28:03.000000 | - |
| 17 | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up | 2016-11-23T18:28:02.000000 | - |
| 20 | nova-conductor | overcloud-controller-1.localdomain | internal | enabled | up | 2016-11-23T18:28:04.000000 | - |
| 29 | nova-conductor | overcloud-controller-2.localdomain | internal | enabled | up | 2016-11-23T18:27:59.000000 | - |
| 56 | nova-conductor | overcloud-controller-0.localdomain | internal | enabled | up | 2016-11-23T18:28:01.000000 | - |
| 101 | nova-compute | overcloud-compute-2.localdomain | nova | enabled | down | 2016-11-23T17:25:43.000000 | - |
| 104 | nova-compute | overcloud-compute-1.localdomain | nova | enabled | down | 2016-11-23T09:54:44.000000 | - |
| 107 | nova-compute | overcloud-compute-4.localdomain | nova | enabled | down | 2016-11-23T10:29:06.000000 | - |
| 110 | nova-compute | overcloud-compute-3.localdomain | nova | enabled | down | 2016-11-23T09:56:24.000000 | - |
| 113 | nova-compute | overcloud-compute-0.localdomain | nova | enabled | down | 2016-11-23T09:54:44.000000 | - |
| 116 | nova-compute | overcloud-compute-5.localdomain | nova | enabled | down | 2016-11-23T10:12:36.000000 | - |
| 119 | nova-cert | overcloud-controller-2.localdomain | internal | enabled | down | 2016-11-10T18:19:38.000000 | - |
| 122 | nova-cert | overcloud-controller-1.localdomain | internal | enabled | down | 2016-11-10T17:05:24.000000 | - |
| 125 | nova-cert | overcloud-controller-0.localdomain | internal | enabled | down | 2016-11-10T19:22:07.000000 | - |
| 128 | nova-compute | overcloud-controller-0.localdomain | controllers | enabled | down | 2016-09-27T16:02:56.000000 | - |
| 131 | nova-compute | overcloud-controller-1.localdomain | controllers | enabled | down | 2016-09-27T16:03:22.000000 | - |
| 134 | nova-compute | overcloud-controller-2.localdomain | controllers | enabled | down | 2016-09-27T16:03:34.000000 | - |
+-----+------------------+------------------------------------+-------------+---------+-------+----------------------------+-----------------+
Deleting instances does not work. Instances disappear from Horizon, but upon further review they are still present in the database and in virsh list.
MariaDB [nova]> select id, hostname, vm_state, task_state, host, uuid, created_at from instances where vm_state="error";
+-------+----------------------------------------------------+----------+------------+------+--------------------------------------+---------------------+
| id | hostname | vm_state | task_state | host | uuid | created_at |
+-------+----------------------------------------------------+----------+------------+------+--------------------------------------+---------------------+
| 5534 | instance1 | error | deleting | NULL | b852df3a-ea18-421f-b8ad-14acfc5b09a2 | 2016-10-04 14:51:41 |
| 5978 | instance2 | error | deleting | NULL | 4e2ad13c-0576-44a9-92cd-eff06fb7ebc2 | 2016-10-17 15:32:49 |
| 5981 | instance2 | error | deleting | NULL | 0946c0bf-750b-4ba5-8432-9bb665d7beee | 2016-10-17 15:38:53 |
| 9544 | instance3 | error | deleting | NULL | c86ee274-51a8-4579-8b28-80a7cfdae7d1 | 2016-11-10 15:27:51 |
(...)
| 20744 | instance4 | error | deleting | NULL | 7d25a576-4bec-4b33-87de-916ada35ad3a | 2016-11-23 11:47:28 |
| 20747 | instance4 | error | deleting | NULL | d4315daa-33c3-4ec1-be89-54c58ad9e255 | 2016-11-23 11:47:41 |
| 20750 | instance4 | error | deleting | NULL | 3e7f3ea3-d8f5-4ad1-93ae-1a5236054c8b | 2016-11-23 11:48:02 |
| 20753 | instance4 | error | deleting | NULL | e2409adf-1dce-40e9-aa90-8c5cf4708a65 | 2016-11-23 11:48:39 |
| 20756 | cirros | error | deleting | NULL | e9d95def-6527-46ab-b056-69928119f9a0 | 2016-11-23 14:22:37 |
| 20759 | instance5 | error | deleting | NULL | c59be23f-a525-4310-819c-3687abd58f7c | 2016-11-23 14:27:02 |
| 20762 | instance6 | error | deleting | NULL | ebaf802a-33f8-45bc-ad0a-99d57ce2e1fd | 2016-11-23 14:31:20 |
+-------+----------------------------------------------------+----------+------------+------+--------------------------------------+---------------------+
virsh list shows instances that should have been deleted:
[root@overcloud-compute-2 instances]# virsh list
Id Name State
----------------------------------------------------
(...)
535 <instance that should have been deleted> running
(...)
The instance's ephemeral disk still exists in the Ceph vms pool:
rbd -p vms ls | grep <instance UUID>
The following actions did not help:
* Restarting the nova-compute service on one compute node to see if its state in service-list would return to up. It did, but eventually went back into the down state again.
* Setting running_deleted_instance_action=reap and restarting the nova-compute process, but only 1 VM was removed.
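The reap setting tried above lives in /etc/nova/nova.conf on each compute node; a minimal sketch (the poll interval is shown with its documented default of 1800 seconds):

```ini
[DEFAULT]
# Reap instances that are still running on the hypervisor but are marked
# deleted in the Nova database.
running_deleted_instance_action = reap
# Interval (seconds) between runs of the cleanup periodic task.
running_deleted_instance_poll_interval = 1800
```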
Ceph health reports HEALTH_ERR:
[root@overcloud-controller-1 glance]# ceph -s
cluster <cluster uuid>
health HEALTH_ERR
1 full osd(s)
11 near full osd(s)
monmap e1: 3 mons at {overcloud-controller-0=10.0.1.9:6789/0,overcloud-controller-1=10.0.1.7:6789/0,overcloud-controller-2=10.0.1.8:6789/0}
election epoch 52, quorum 0,1,2 overcloud-controller-1,overcloud-controller-2,overcloud-controller-0
osdmap e1175: 24 osds: 24 up, 24 in
flags full
pgmap v6152415: 2764 pgs, 4 pools, 8411 GB data, 1179 kobjects
16823 GB used, 3286 GB / 20110 GB avail
2764 active+clean
ceph osd df shows very unbalanced OSDs:
[root@overcloud-cephstorage-1 heat-admin]# ceph osd df
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR
0 0.81999 1.00000 837G 692G 145G 82.64 0.99
1 0.81999 1.00000 837G 782G 56896M 93.37 1.12
4 0.81999 1.00000 837G 600G 237G 71.69 0.86
5 0.81999 1.00000 837G 796G 42830M 95.01 1.14
7 0.81999 1.00000 837G 683G 154G 81.52 0.97
8 0.81999 1.00000 837G 725G 112G 86.53 1.03
9 0.81999 1.00000 837G 721G 116G 86.07 1.03
12 0.81999 1.00000 837G 616G 221G 73.61 0.88
15 0.81999 1.00000 837G 604G 233G 72.09 0.86
16 0.81999 1.00000 837G 645G 192G 77.08 0.92
18 0.81999 1.00000 837G 780G 58719M 93.16 1.11
21 0.81999 1.00000 837G 763G 76305M 91.11 1.09
2 0.81999 1.00000 837G 682G 155G 81.48 0.97
3 0.81999 1.00000 837G 780G 59266M 93.09 1.11
6 0.81999 1.00000 837G 755G 84685M 90.13 1.08
10 0.81999 1.00000 837G 772G 66920M 92.20 1.10
11 0.81999 1.00000 837G 732G 105G 87.45 1.05
13 0.81999 1.00000 837G 653G 184G 77.94 0.93
14 0.81999 1.00000 837G 636G 201G 75.95 0.91
17 0.81999 1.00000 837G 653G 184G 77.93 0.93
19 0.81999 1.00000 837G 588G 249G 70.19 0.84
20 0.81999 1.00000 837G 726G 111G 86.75 1.04
22 0.81999 1.00000 837G 677G 160G 80.87 0.97
23 0.81999 1.00000 837G 753G 86865M 89.88 1.07
TOTAL 20110G 16823G 3286G 83.66
MIN/MAX VAR: 0.84/1.14 STDDEV: 7.62
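The %USE column above can be scanned mechanically to flag the OSDs driving the HEALTH_ERR; a minimal sketch, assuming the field layout of the ceph osd df output shown (ID in field 1, %USE in field 7):

```shell
# Flag OSDs whose utilization exceeds a threshold, reading `ceph osd df`
# style lines on stdin. Field positions assumed from the listing above.
flag_full_osds() {
    awk -v t="$1" 'NF >= 8 && $7 + 0 > t { printf "osd.%s at %s%%\n", $1, $7 }'
}

# Sample lines copied from the output above:
printf '%s\n' \
    ' 1 0.81999 1.00000 837G 782G 56896M 93.37 1.12' \
    ' 5 0.81999 1.00000 837G 796G 42830M 95.01 1.14' \
    ' 4 0.81999 1.00000 837G 600G 237G 71.69 0.86' |
flag_full_osds 90
# prints:
# osd.1 at 93.37%
# osd.5 at 95.01%
```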
ceph health detail
[root@overcloud-controller-1 glance]# ceph health detail
HEALTH_ERR 1 full osd(s); 11 near full osd(s)
osd.5 is full at 95%
osd.1 is near full at 93%
osd.3 is near full at 93%
osd.6 is near full at 90%
osd.8 is near full at 86%
osd.9 is near full at 86%
osd.10 is near full at 92%
osd.11 is near full at 87%
osd.18 is near full at 93%
osd.20 is near full at 86%
osd.21 is near full at 91%
osd.23 is near full at 89%
Portion of the ceph-mon log on overcloud-controller-2:
2016-11-23 08:15:53.413999 7f8cf5472700 0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:16:53.414482 7f8cf5472700 0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:17:53.414980 7f8cf5472700 0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:18:53.415453 7f8cf5472700 0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:19:53.416069 7f8cf5472700 0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:20:53.416681 7f8cf5472700 0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:21:53.417234 7f8cf5472700 0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:22:53.417781 7f8cf5472700 0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:23:53.418272 7f8cf5472700 0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:24:53.418715 7f8cf5472700 0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:25:53.419289 7f8cf5472700 0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:26:33.684350 7f8cf4c71700 0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:26:33.684405 7f8cf4c71700 0 log_channel(audit) log [DBG] : from='client.? 10.0.1.15:0/1048911' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:26:35.598944 7f8cf4c71700 0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:26:35.598978 7f8cf4c71700 0 log_channel(audit) log [DBG] : from='client.? 10.0.1.11:0/1010805' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:26:43.148044 7f8cf4c71700 0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:26:43.148088 7f8cf4c71700 0 log_channel(audit) log [DBG] : from='client.? 10.0.1.13:0/1021028' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:26:45.479157 7f8cf4c71700 0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:26:45.479227 7f8cf4c71700 0 log_channel(audit) log [DBG] : from='client.? 10.0.1.13:0/1021506' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:26:52.926522 7f8cf4c71700 0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:26:52.926613 7f8cf4c71700 0 log_channel(audit) log [DBG] : from='client.? 10.0.1.10:0/1015390' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:26:53.021632 7f8cf4c71700 0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:26:53.021702 7f8cf4c71700 0 log_channel(audit) log [DBG] : from='client.? 10.0.1.10:0/1015400' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:26:53.419838 7f8cf5472700 0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:27:01.171036 7f8cf4c71700 0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:27:01.171082 7f8cf4c71700 0 log_channel(audit) log [DBG] : from='client.? 10.0.1.14:0/1012087' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:27:05.209061 7f8cf4c71700 0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:27:05.209112 7f8cf4c71700 0 log_channel(audit) log [DBG] : from='client.? 10.0.1.15:0/1002216' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:27:05.382266 7f8cf4c71700 0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:27:05.382309 7f8cf4c71700 0 log_channel(audit) log [DBG] : from='client.? 10.0.1.10:0/1016853' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
It's impossible to SSH into running instances. The port is open, but SSH times out on connection attempts. Using local port forwarding and VNC to log into the instance directly, the instance appears to freeze when trying to log in.
# Command for VNC port redirection
ssh root@10.10.181.86 -L 5999:192.0.2.6:5900
Console log for a running cirros instance on a compute node:
cat /var/lib/nova/instances/<UUID>/console.log
[1469760.315500] INFO: task syslogd:107 blocked for more than 120 seconds.
[1469760.316195] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1469760.317083] ffff88001d0119f8 0000000000000086 00000000335a3528 ffffffff81c17540
[1469760.318021] ffff88001d011fd8 ffff88001d011fd8 ffff88001d011fd8 0000000000013700
[1469760.318938] ffff88001d02dbc0 ffff88001d0516f0 ffff88001d0119c8 ffff88001fa13fc0
[1469760.319870] Call Trace:
[1469760.320205] [<ffffffff81117f50>] ? __lock_page+0x70/0x70
[1469760.320799] [<ffffffff8165badf>] schedule+0x3f/0x60
[1469760.321340] [<ffffffff8165bb8f>] io_schedule+0x8f/0xd0
[1469760.321918] [<ffffffff81117f5e>] sleep_on_page+0xe/0x20
[1469760.322486] [<ffffffff8165c39f>] __wait_on_bit+0x5f/0x90
[1469760.323069] [<ffffffff811180c8>] wait_on_page_bit+0x78/0x80
[1469760.323682] [<ffffffff8108b920>] ? autoremove_wake_function+0x40/0x40
[1469760.324360] [<ffffffff81118ee2>] grab_cache_page_write_begin+0x92/0xe0
[1469760.325058] [<ffffffff8116354f>] ? kmem_cache_free+0x2f/0x110
[1469760.325689] [<ffffffff811f9ed0>] ext3_write_begin+0x80/0x270
[1469760.326297] [<ffffffff8111839a>] generic_perform_write+0xca/0x210
[1469760.326963] [<ffffffff81137928>] ? bdi_wakeup_thread_delayed+0x38/0x40
[1469760.327656] [<ffffffff8111853d>] generic_file_buffered_write+0x5d/0x90
[1469760.328340] [<ffffffff81119ea9>] __generic_file_aio_write+0x229/0x440
[1469760.329033] [<ffffffff81198b00>] ? mntput_no_expire+0x30/0xf0
[1469760.329682] [<ffffffff81187687>] ? do_last+0x1d7/0x730
[1469760.330254] [<ffffffff8111a132>] generic_file_aio_write+0x72/0xe0
[1469760.330944] [<ffffffff81178e0a>] do_sync_write+0xda/0x120
[1469760.331532] [<ffffffff812d9588>] ? apparmor_file_permission+0x18/0x20
[1469760.332220] [<ffffffff8129ec6c>] ? security_file_permission+0x2c/0xb0
[1469760.332909] [<ffffffff811793b1>] ? rw_verify_area+0x61/0xf0
[1469760.337730] [<ffffffff81179713>] vfs_write+0xb3/0x180
[1469760.338290] [<ffffffff81179a3a>] sys_write+0x4a/0x90
[1469760.338854] [<ffffffff81666002>] system_call_fastpath+0x16/0x1b
Environment
Red Hat OpenStack Platform 8.0