OpenStack reliability test fails
Hi everyone,
I installed a multi-node OpenStack using the RDO Packstack method, and the installation went smoothly. My setup is one controller node and three compute nodes, each on a separate physical PC. I launched one instance, and OpenStack chose compute 1 as its host. When I disconnected compute 1 from the network, the controller did not notice that compute 1 was down until about 1 minute had passed; only then did compute 1 show as down in the Horizon dashboard. The instance stayed down, and I could not ping or access it. My questions are:
1- Why does it take so long (about 1 minute) for the controller to discover that a compute node is down or unreachable?
2- Why is the instance still down? As far as I know, when a compute node is faulty the system automatically moves its workloads from the faulty node to other compute nodes, so in our case the system should have moved the instance from the faulty compute 1 to compute 2 or compute 3.
Thanks and best regards
Responses
1- Why does it take so long (about 1 minute) for the controller to discover that a compute node is down or unreachable?
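The default service_down_time in /etc/nova/nova.conf is 60 seconds, and the default report_interval is 10 seconds. The compute service is only considered down once no status report has been received from it for service_down_time, which is why it takes about a minute.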
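To make the timing concrete, here is a minimal sketch of the liveness check described above. It is a simplified model, not Nova's actual code: the function names are made up for illustration, and it assumes a service counts as up as long as its last status report is no older than service_down_time.

```python
from datetime import datetime, timedelta

REPORT_INTERVAL = 10    # seconds, the report_interval default quoted above
SERVICE_DOWN_TIME = 60  # seconds, the service_down_time default quoted above


def is_service_up(last_report, now, service_down_time=SERVICE_DOWN_TIME):
    """Hypothetical helper: up if the last report is at most service_down_time old."""
    return (now - last_report) <= timedelta(seconds=service_down_time)


def detection_window(report_interval=REPORT_INTERVAL,
                     service_down_time=SERVICE_DOWN_TIME):
    """Return (shortest, longest) delay in seconds between a failure and
    the service being shown as down, under this simplified model."""
    # Shortest: the node fails just before it would have sent its next report,
    # so the last report is already report_interval seconds old.
    # Longest: the node fails right after sending a report.
    return (service_down_time - report_interval, service_down_time)


if __name__ == "__main__":
    t0 = datetime(2020, 1, 1, 12, 0, 0)
    print(is_service_up(t0, t0 + timedelta(seconds=45)))  # still within the window
    print(is_service_up(t0, t0 + timedelta(seconds=61)))  # past service_down_time
    print(detection_window())
```

With the defaults, this model puts the delay between roughly 50 and 60 seconds, which matches the "about 1 minute" you observed. Lowering both values would shorten detection, at the cost of more false alarms on a busy or congested network.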
2- Why is the instance still down? As far as I know, when a compute node is faulty the system automatically moves its workloads from the faulty node to other compute nodes, so in our case the system should have moved the instance from the faulty compute 1 to compute 2 or compute 3.
This is expected: instances are not moved automatically to a different compute node. You can use the evacuation functionality to evacuate the instances from the down compute node to a different one. Before doing so, you need to make sure that the unreachable compute node is really down. If you started the instances from shared storage and only disconnected the compute node from the network, the instances are still running on that node against the shared storage, which remains accessible. After evacuation you would then have two copies of each instance (one on the old compute node and one on the new one) accessing the same instance disk on shared storage. As a result, the file system on the instance disk would be corrupted.
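The precondition above can be sketched as a small guard that refuses to build an evacuation command unless the host is both reported down and confirmed powered off. This is only an illustration: the function name and its arguments are hypothetical, how you confirm the power state (for example via IPMI or a physical check) is up to you, and the sketch only assembles a `nova evacuate <server> [<host>]` command line without running it.

```python
def evacuation_command(server_id, service_state, host_confirmed_off,
                       target_host=None):
    """Hypothetical guard: return the evacuate command as an argument list,
    or raise if evacuating would risk two copies of the instance running."""
    if service_state != "down":
        # Nova still sees the compute service as up, so evacuation does not apply.
        raise RuntimeError("compute service is not reported as down")
    if not host_confirmed_off:
        # The host may only be cut off from the network and still be running
        # the instance against shared storage.
        raise RuntimeError("host is not confirmed powered off; fence it first")
    cmd = ["nova", "evacuate", server_id]
    if target_host:
        cmd.append(target_host)  # optional: let the scheduler pick if omitted
    return cmd


if __name__ == "__main__":
    # "my-instance" and "compute2" are placeholder names.
    print(evacuation_command("my-instance", "down", True, "compute2"))
```

In your test, compute 1 was only unplugged from the network, so the second check would fail, and that is exactly the case where evacuating could corrupt the instance disk.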
Automatic instance evacuation can be enabled as described in [1].
[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/high_availability_for_compute_instances/index
