Ceph full, nova services reported as DOWN and instances cannot be created or deleted in Red Hat OpenStack Platform

Solution Verified

Issue

Ceph full, nova services reported as DOWN and instances cannot be deleted in Red Hat OpenStack Platform

It is impossible to SSH into running instances: the port is open, but SSH times out on connection attempts.

All nova-compute services are reported as down. Restarting the compute services brings them back up briefly, but they go down again after a few minutes.

[stack@undercloud ~]$ nova service-list
+-----+------------------+------------------------------------+-------------+---------+-------+----------------------------+-----------------+
| Id  | Binary           | Host                               | Zone        | Status  | State | Updated_at                 | Disabled Reason |
+-----+------------------+------------------------------------+-------------+---------+-------+----------------------------+-----------------+
| 2   | nova-scheduler   | overcloud-controller-0.localdomain | internal    | enabled | up    | 2016-11-23T18:28:02.000000 | -               |
| 5   | nova-scheduler   | overcloud-controller-1.localdomain | internal    | enabled | up    | 2016-11-23T18:28:02.000000 | -               |
| 8   | nova-scheduler   | overcloud-controller-2.localdomain | internal    | enabled | up    | 2016-11-23T18:27:59.000000 | -               |
| 11  | nova-consoleauth | overcloud-controller-2.localdomain | internal    | enabled | up    | 2016-11-23T18:27:59.000000 | -               |
| 14  | nova-consoleauth | overcloud-controller-1.localdomain | internal    | enabled | up    | 2016-11-23T18:28:03.000000 | -               |
| 17  | nova-consoleauth | overcloud-controller-0.localdomain | internal    | enabled | up    | 2016-11-23T18:28:02.000000 | -               |
| 20  | nova-conductor   | overcloud-controller-1.localdomain | internal    | enabled | up    | 2016-11-23T18:28:04.000000 | -               |
| 29  | nova-conductor   | overcloud-controller-2.localdomain | internal    | enabled | up    | 2016-11-23T18:27:59.000000 | -               |
| 56  | nova-conductor   | overcloud-controller-0.localdomain | internal    | enabled | up    | 2016-11-23T18:28:01.000000 | -               |
| 101 | nova-compute     | overcloud-compute-2.localdomain    | nova        | enabled | down  | 2016-11-23T17:25:43.000000 | -               |
| 104 | nova-compute     | overcloud-compute-1.localdomain    | nova        | enabled | down  | 2016-11-23T09:54:44.000000 | -               |
| 107 | nova-compute     | overcloud-compute-4.localdomain    | nova        | enabled | down  | 2016-11-23T10:29:06.000000 | -               |
| 110 | nova-compute     | overcloud-compute-3.localdomain    | nova        | enabled | down  | 2016-11-23T09:56:24.000000 | -               |
| 113 | nova-compute     | overcloud-compute-0.localdomain    | nova        | enabled | down  | 2016-11-23T09:54:44.000000 | -               |
| 116 | nova-compute     | overcloud-compute-5.localdomain    | nova        | enabled | down  | 2016-11-23T10:12:36.000000 | -               |
| 119 | nova-cert        | overcloud-controller-2.localdomain | internal    | enabled | down  | 2016-11-10T18:19:38.000000 | -               |
| 122 | nova-cert        | overcloud-controller-1.localdomain | internal    | enabled | down  | 2016-11-10T17:05:24.000000 | -               |
| 125 | nova-cert        | overcloud-controller-0.localdomain | internal    | enabled | down  | 2016-11-10T19:22:07.000000 | -               |
| 128 | nova-compute     | overcloud-controller-0.localdomain | controllers | enabled | down  | 2016-09-27T16:02:56.000000 | -               |
| 131 | nova-compute     | overcloud-controller-1.localdomain | controllers | enabled | down  | 2016-09-27T16:03:22.000000 | -               |
| 134 | nova-compute     | overcloud-controller-2.localdomain | controllers | enabled | down  | 2016-09-27T16:03:34.000000 | -               |
+-----+------------------+------------------------------------+-------------+---------+-------+----------------------------+-----------------+

Deleting instances does not work. Instances disappear from Horizon, but upon further review they are still present in the database and in the output of virsh list.

MariaDB [nova]> select id, hostname, vm_state, task_state, host, uuid, created_at from instances where vm_state="error";
+-------+----------------------------------------------------+----------+------------+------+--------------------------------------+---------------------+
| id    | hostname                                           | vm_state | task_state | host | uuid                                 | created_at          |
+-------+----------------------------------------------------+----------+------------+------+--------------------------------------+---------------------+
|  5534 | instance1                                       | error    | deleting   | NULL | b852df3a-ea18-421f-b8ad-14acfc5b09a2 | 2016-10-04 14:51:41 |
|  5978 | instance2                                            | error    | deleting   | NULL | 4e2ad13c-0576-44a9-92cd-eff06fb7ebc2 | 2016-10-17 15:32:49 |
|  5981 | instance2                                            | error    | deleting   | NULL | 0946c0bf-750b-4ba5-8432-9bb665d7beee | 2016-10-17 15:38:53 |
|  9544 | instance3                                                | error    | deleting   | NULL | c86ee274-51a8-4579-8b28-80a7cfdae7d1 | 2016-11-10 15:27:51 |
(...)
| 20744 | instance4 | error    | deleting   | NULL | 7d25a576-4bec-4b33-87de-916ada35ad3a | 2016-11-23 11:47:28 |
| 20747 | instance4 | error    | deleting   | NULL | d4315daa-33c3-4ec1-be89-54c58ad9e255 | 2016-11-23 11:47:41 |
| 20750 | instance4 | error    | deleting   | NULL | 3e7f3ea3-d8f5-4ad1-93ae-1a5236054c8b | 2016-11-23 11:48:02 |
| 20753 | instance4 | error    | deleting   | NULL | e2409adf-1dce-40e9-aa90-8c5cf4708a65 | 2016-11-23 11:48:39 |
| 20756 | cirros                                             | error    | deleting   | NULL | e9d95def-6527-46ab-b056-69928119f9a0 | 2016-11-23 14:22:37 |
| 20759 | instance5 | error    | deleting   | NULL | c59be23f-a525-4310-819c-3687abd58f7c | 2016-11-23 14:27:02 |
| 20762 | instance6 | error    | deleting   | NULL | ebaf802a-33f8-45bc-ad0a-99d57ce2e1fd | 2016-11-23 14:31:20 |
+-------+----------------------------------------------------+----------+------------+------+--------------------------------------+---------------------+
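Once the Ceph cluster accepts writes again, instances stuck in the error/deleting state can usually be removed by resetting their state and re-issuing the delete. A hedged sketch (the UUIDs come from the query above):

```
# Reset each stuck instance back to the error state so the API will
# accept another delete request, then retry the delete
for uuid in b852df3a-ea18-421f-b8ad-14acfc5b09a2 \
            4e2ad13c-0576-44a9-92cd-eff06fb7ebc2; do
    nova reset-state "$uuid"    # sets vm_state to error
    nova delete "$uuid"
done
```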

virsh list shows instances which should have been deleted

[root@overcloud-compute-2 instances]# virsh list
 Id    Name                           State
----------------------------------------------------
(...)
 535   <instance that should have been deleted>              running
(...)

The instance's ephemeral disk still exists in Ceph

rbd ls --pool vms | grep <instance UUID>
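Pool-level consumption can be confirmed with ceph df; in a default RHOSP layout the vms pool holds the nova ephemeral disks (a sketch, pool names per the default configuration):

```
# Per-pool object counts and usage; the vms pool backs nova ephemeral
# disks in the default RHOSP/RBD configuration
ceph df
```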

The following actions did not help:
* restarting the nova-compute service on one compute node to see if its state in nova service-list would return to normal. It briefly went back up, but eventually returned to the down state.
* setting running_deleted_instance_action=reap and restarting the nova-compute process: only 1 VM was removed.
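The reap setting mentioned above lives in nova.conf on each compute node. A minimal sketch (the poll interval value is an example, not a recommendation):

```ini
# /etc/nova/nova.conf on each compute node
[DEFAULT]
# Destroy guests that are still running on the hypervisor but are
# already marked deleted in the nova database
running_deleted_instance_action = reap
# How often (in seconds) the periodic task looks for such guests
running_deleted_instance_poll_interval = 1800
```

After editing, restart the compute service (openstack-nova-compute on RHOSP 8) for the change to take effect. Note that while the Ceph cluster is full, the reap may still fail, since deleting RBD images requires a writeable cluster.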

Ceph health reports as HEALTH_ERR

[root@overcloud-controller-1 glance]# ceph -s
    cluster <cluster uuid>
     health HEALTH_ERR
            1 full osd(s)
            11 near full osd(s)
     monmap e1: 3 mons at {overcloud-controller-0=10.0.1.9:6789/0,overcloud-controller-1=10.0.1.7:6789/0,overcloud-controller-2=10.0.1.8:6789/0}
            election epoch 52, quorum 0,1,2 overcloud-controller-1,overcloud-controller-2,overcloud-controller-0
     osdmap e1175: 24 osds: 24 up, 24 in
            flags full
      pgmap v6152415: 2764 pgs, 4 pools, 8411 GB data, 1179 kobjects
            16823 GB used, 3286 GB / 20110 GB avail
                2764 active+clean

ceph osd df shows very unbalanced OSDs

[root@overcloud-cephstorage-1 heat-admin]# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  
 0 0.81999  1.00000   837G   692G   145G 82.64 0.99 
 1 0.81999  1.00000   837G   782G 56896M 93.37 1.12 
 4 0.81999  1.00000   837G   600G   237G 71.69 0.86 
 5 0.81999  1.00000   837G   796G 42830M 95.01 1.14 
 7 0.81999  1.00000   837G   683G   154G 81.52 0.97 
 8 0.81999  1.00000   837G   725G   112G 86.53 1.03 
 9 0.81999  1.00000   837G   721G   116G 86.07 1.03 
12 0.81999  1.00000   837G   616G   221G 73.61 0.88 
15 0.81999  1.00000   837G   604G   233G 72.09 0.86 
16 0.81999  1.00000   837G   645G   192G 77.08 0.92 
18 0.81999  1.00000   837G   780G 58719M 93.16 1.11 
21 0.81999  1.00000   837G   763G 76305M 91.11 1.09 
 2 0.81999  1.00000   837G   682G   155G 81.48 0.97 
 3 0.81999  1.00000   837G   780G 59266M 93.09 1.11 
 6 0.81999  1.00000   837G   755G 84685M 90.13 1.08 
10 0.81999  1.00000   837G   772G 66920M 92.20 1.10 
11 0.81999  1.00000   837G   732G   105G 87.45 1.05 
13 0.81999  1.00000   837G   653G   184G 77.94 0.93 
14 0.81999  1.00000   837G   636G   201G 75.95 0.91 
17 0.81999  1.00000   837G   653G   184G 77.93 0.93 
19 0.81999  1.00000   837G   588G   249G 70.19 0.84 
20 0.81999  1.00000   837G   726G   111G 86.75 1.04 
22 0.81999  1.00000   837G   677G   160G 80.87 0.97 
23 0.81999  1.00000   837G   753G 86865M 89.88 1.07 
              TOTAL 20110G 16823G  3286G 83.66      
MIN/MAX VAR: 0.84/1.14  STDDEV: 7.62
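The %USE column above can be checked against the default thresholds mechanically. A hedged sketch that parses ceph osd df output and flags OSDs over the default near-full (85%) and full (95%) ratios (%USE is the seventh column in the format shown above):

```shell
# Flag OSDs over the default near-full (85%) and full (95%) ratios.
# %USE is column 7 in the `ceph osd df` output format shown above;
# header, TOTAL, and MIN/MAX lines are skipped by the numeric-ID filter.
ceph osd df | awk '$1 ~ /^[0-9]+$/ && NF >= 8 {
    use = $7 + 0
    if (use >= 95)      printf "osd.%s is full at %.0f%%\n", $1, use
    else if (use >= 85) printf "osd.%s is near full at %.0f%%\n", $1, use
}'
```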

ceph health detail

[root@overcloud-controller-1 glance]# ceph health detail
HEALTH_ERR 1 full osd(s); 11 near full osd(s)
 osd.5 is full at 95%
 osd.1 is near full at 93%
 osd.3 is near full at 93%
 osd.6 is near full at 90%
 osd.8 is near full at 86%
 osd.9 is near full at 86%
 osd.10 is near full at 92%
 osd.11 is near full at 87%
 osd.18 is near full at 93%
 osd.20 is near full at 86%
 osd.21 is near full at 91%
 osd.23 is near full at 89%
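To get the cluster writeable again long enough to free space, the full ratio can be raised slightly and the unbalanced OSDs reweighted. A hedged sketch using pre-Luminous syntax (the values are examples, not recommendations; revert the ratio once space is freed):

```
# Temporarily raise the cluster-wide full ratio so deletes can proceed
ceph pg set_full_ratio 0.97

# Reweight OSDs whose utilization exceeds 110% of the cluster average
ceph osd reweight-by-utilization 110

# Watch rebalancing and usage
ceph -s
ceph osd df

# Once usage is back under control, restore the default full ratio
ceph pg set_full_ratio 0.95
```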

Portion of the ceph-mon log on controller-2

2016-11-23 08:15:53.413999 7f8cf5472700  0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:16:53.414482 7f8cf5472700  0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:17:53.414980 7f8cf5472700  0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:18:53.415453 7f8cf5472700  0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:19:53.416069 7f8cf5472700  0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:20:53.416681 7f8cf5472700  0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:21:53.417234 7f8cf5472700  0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:22:53.417781 7f8cf5472700  0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:23:53.418272 7f8cf5472700  0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:24:53.418715 7f8cf5472700  0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:25:53.419289 7f8cf5472700  0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:26:33.684350 7f8cf4c71700  0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:26:33.684405 7f8cf4c71700  0 log_channel(audit) log [DBG] : from='client.? 10.0.1.15:0/1048911' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:26:35.598944 7f8cf4c71700  0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:26:35.598978 7f8cf4c71700  0 log_channel(audit) log [DBG] : from='client.? 10.0.1.11:0/1010805' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:26:43.148044 7f8cf4c71700  0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:26:43.148088 7f8cf4c71700  0 log_channel(audit) log [DBG] : from='client.? 10.0.1.13:0/1021028' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:26:45.479157 7f8cf4c71700  0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:26:45.479227 7f8cf4c71700  0 log_channel(audit) log [DBG] : from='client.? 10.0.1.13:0/1021506' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:26:52.926522 7f8cf4c71700  0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:26:52.926613 7f8cf4c71700  0 log_channel(audit) log [DBG] : from='client.? 10.0.1.10:0/1015390' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:26:53.021632 7f8cf4c71700  0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:26:53.021702 7f8cf4c71700  0 log_channel(audit) log [DBG] : from='client.? 10.0.1.10:0/1015400' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:26:53.419838 7f8cf5472700  0 mon.overcloud-controller-2@1(peon).data_health(34) update_stats avail 52% total 279 GB, used 133 GB, avail 146 GB
2016-11-23 08:27:01.171036 7f8cf4c71700  0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:27:01.171082 7f8cf4c71700  0 log_channel(audit) log [DBG] : from='client.? 10.0.1.14:0/1012087' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:27:05.209061 7f8cf4c71700  0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:27:05.209112 7f8cf4c71700  0 log_channel(audit) log [DBG] : from='client.? 10.0.1.15:0/1002216' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch
2016-11-23 08:27:05.382266 7f8cf4c71700  0 mon.overcloud-controller-2@1(peon) e1 handle_command mon_command({"prefix": "mon dump", "format": "json"} v 0) v1
2016-11-23 08:27:05.382309 7f8cf4c71700  0 log_channel(audit) log [DBG] : from='client.? 10.0.1.10:0/1016853' entity='client.openstack' cmd=[{"prefix": "mon dump", "format": "json"}]: dispatch

Using local port redirection and VNC to log into the instance directly, the instance appears to freeze when trying to log in.

# Command for VNC port redirection
ssh root@10.10.181.86 -L 5999:192.0.2.6:5900 

The console log for a running cirros instance on a compute node shows hung tasks blocked on I/O

cat /var/lib/nova/instances/<UUID>/console.log
[1469760.315500] INFO: task syslogd:107 blocked for more than 120 seconds.
[1469760.316195] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1469760.317083]  ffff88001d0119f8 0000000000000086 00000000335a3528 ffffffff81c17540
[1469760.318021]  ffff88001d011fd8 ffff88001d011fd8 ffff88001d011fd8 0000000000013700
[1469760.318938]  ffff88001d02dbc0 ffff88001d0516f0 ffff88001d0119c8 ffff88001fa13fc0
[1469760.319870] Call Trace:
[1469760.320205]  [<ffffffff81117f50>] ? __lock_page+0x70/0x70
[1469760.320799]  [<ffffffff8165badf>] schedule+0x3f/0x60
[1469760.321340]  [<ffffffff8165bb8f>] io_schedule+0x8f/0xd0
[1469760.321918]  [<ffffffff81117f5e>] sleep_on_page+0xe/0x20
[1469760.322486]  [<ffffffff8165c39f>] __wait_on_bit+0x5f/0x90
[1469760.323069]  [<ffffffff811180c8>] wait_on_page_bit+0x78/0x80
[1469760.323682]  [<ffffffff8108b920>] ? autoremove_wake_function+0x40/0x40
[1469760.324360]  [<ffffffff81118ee2>] grab_cache_page_write_begin+0x92/0xe0
[1469760.325058]  [<ffffffff8116354f>] ? kmem_cache_free+0x2f/0x110
[1469760.325689]  [<ffffffff811f9ed0>] ext3_write_begin+0x80/0x270
[1469760.326297]  [<ffffffff8111839a>] generic_perform_write+0xca/0x210
[1469760.326963]  [<ffffffff81137928>] ? bdi_wakeup_thread_delayed+0x38/0x40
[1469760.327656]  [<ffffffff8111853d>] generic_file_buffered_write+0x5d/0x90
[1469760.328340]  [<ffffffff81119ea9>] __generic_file_aio_write+0x229/0x440
[1469760.329033]  [<ffffffff81198b00>] ? mntput_no_expire+0x30/0xf0
[1469760.329682]  [<ffffffff81187687>] ? do_last+0x1d7/0x730
[1469760.330254]  [<ffffffff8111a132>] generic_file_aio_write+0x72/0xe0
[1469760.330944]  [<ffffffff81178e0a>] do_sync_write+0xda/0x120
[1469760.331532]  [<ffffffff812d9588>] ? apparmor_file_permission+0x18/0x20
[1469760.332220]  [<ffffffff8129ec6c>] ? security_file_permission+0x2c/0xb0
[1469760.332909]  [<ffffffff811793b1>] ? rw_verify_area+0x61/0xf0
[1469760.337730]  [<ffffffff81179713>] vfs_write+0xb3/0x180
[1469760.338290]  [<ffffffff81179a3a>] sys_write+0x4a/0x90
[1469760.338854]  [<ffffffff81666002>] system_call_fastpath+0x16/0x1b

Environment

Red Hat OpenStack Platform 8.0
