Compute nodes are stuck with clean wait status in Red Hat OpenStack Platform


Environment

Issue

  • Compute nodes are stuck in the clean wait state during overcloud node introspection. After some time they move to the clean failed state.

  • Baremetal nodes are stuck in clean wait after the provide action is run on them.

Resolution

If overcloud nodes are being cleaned by the undercloud

  • Disable cleaning, restart ironic, and re-introspect the nodes.

    Edit /etc/ironic/ironic.conf to disable cleaning:
    [conductor]
    automated_clean = false
    
  • Restart ironic:

    # systemctl restart openstack-ironic-api.service
    # systemctl restart openstack-ironic-conductor.service
    # systemctl restart openstack-ironic-inspector-dnsmasq.service
    # systemctl restart openstack-ironic-inspector.service
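
To confirm that the setting is active in the [conductor] section (and not commented out or placed in another section), the file can be checked before or after the restart. The following is a minimal sketch using awk. The name get_ini_value is a hypothetical helper, not part of ironic or crudini, and the sketch assumes a simple ini layout with one key = value pair per line.

```shell
#!/usr/bin/env bash
# Sketch: print the value of KEY inside [SECTION] of an ini-style file.
# get_ini_value is a hypothetical helper name used only for illustration.
get_ini_value() {
    local file="$1" section="$2" key="$3"
    awk -v section="[$section]" -v key="$key" '
        /^[ \t]*\[/ {
            # Section header: remember whether it is the one we want.
            line = $0
            gsub(/^[ \t]+|[ \t]+$/, "", line)
            in_section = (line == section)
            next
        }
        in_section && /^[ \t]*[^#; \t]/ {
            # Active (uncommented) line inside the wanted section.
            eq = index($0, "=")
            if (eq == 0) next
            name = substr($0, 1, eq - 1)
            value = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) found = value   # last occurrence wins
        }
        END { if (found != "") print found }
    ' "$file"
}

# Usage on the undercloud (not executed here):
#   get_ini_value /etc/ironic/ironic.conf conductor automated_clean
```

If the helper prints nothing, the option is either commented out or missing from that section, and ironic falls back to its built-in default.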
    

If baremetal nodes are being cleaned by ironic in the overcloud

Run the following on all controllers to lower clean_callback_timeout so that nodes fail after one minute. Increase the value if it is too short for nodes that do call back:

crudini --set /var/lib/config-data/puppet-generated/ironic/etc/ironic/ironic.conf conductor clean_callback_timeout 60
docker ps | awk '/ironic/ {print $NF}' | xargs docker restart

Or disable node cleaning altogether:

crudini --set /var/lib/config-data/puppet-generated/ironic/etc/ironic/ironic.conf conductor automated_clean false
docker ps | awk '/ironic/ {print $NF}' | xargs docker restart
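
The awk filter in the restart pipeline matches the string ironic anywhere on a `docker ps` line, including the image column, so it may be worth previewing which containers it selects before restarting them. The following is a small sketch of such a dry run. The name ironic_container_names is a hypothetical helper, and it assumes the container name is the last column of `docker ps` output, as the pipeline above already does.

```shell
#!/usr/bin/env bash
# Sketch: read `docker ps` output on stdin and print the names of the
# containers that the /ironic/ filter would select.
# ironic_container_names is a hypothetical helper name.
ironic_container_names() {
    # Skip the header row; $NF is the last field of each row (NAMES).
    awk 'NR > 1 && /ironic/ {print $NF}'
}

# Usage on a controller (not executed here):
#   docker ps | ironic_container_names                        # preview only
#   docker ps | ironic_container_names | xargs docker restart # then restart
```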

Root Cause

  • Cleaning was enabled manually; it is disabled by default. Cleaning takes a long time (the default timeout is 1800 seconds), and introspection was failing because of this timeout.

  • From the ironic.conf file:

     # Enables or disables automated cleaning. Automated cleaning
     # is a configurable set of steps, such as erasing disk drives,
     # that are performed on the node to ensure it is in a baseline
     # state and ready to be deployed to. This is done after
     # instance deletion as well as during the transition from a
     # "manageable" to "available" state. When enabled, the
     # particular steps performed to clean a node depend on which
     # driver that node is managed by; see the individual driver's
     # documentation for details. NOTE: The introduction of the
     # cleaning operation causes instance deletion to take
     # significantly longer. In an environment where all tenants
     # are trusted (eg, because there is only one tenant), this
     # option could be safely disabled. (boolean value)
    
     # Timeout (seconds) to wait for a callback from the ramdisk
     # doing the cleaning. If the timeout is reached the node will
     # be put in the "clean failed" provision state. Set to 0 to
     # disable timeout. (integer value)
     #clean_callback_timeout = 1800
    

Diagnostic Steps

  • Bulk introspection failed with the following error:

    Introspection for UUID 9700540f-6964-4c7c-86d0-2f84899f1bb5 finished successfully.
    Introspection completed.
    Started Mistral Workflow tripleo.baremetal.v1.provide_manageable_nodes. Execution ID: 4d8626bd-b4a5-4812-a56e-6d5839c46212
    Waiting for messages on queue '8056d8c6-2177-4b23-a6d2-702dfa0f897c' with no timeout.
    socket is already closed.
    
  • Introspection of a single node also failed, with the following error:

    # openstack overcloud node introspect 6b669a27-e2ec-4164-b775-d14dd6ef5d41 --provide
    Started Mistral Workflow tripleo.baremetal.v1.introspect. Execution ID: 62b37499-d6fa-40dd-954c-36d59d95afcd
    Waiting for introspection to finish...
    Waiting for messages on queue '7b9d075c-8a9a-4316-a443-355a9e316f54' with no timeout.
    Successfully introspected all nodes.
    Introspection completed.
    Started Mistral Workflow tripleo.baremetal.v1.provide. Execution ID: 2ae43f27-bb71-4c43-a2c7-86f2d58a4a10
    Waiting for messages on queue '7b9d075c-8a9a-4316-a443-355a9e316f54' with no timeout.
    Failed to set nodes to available state:  IronicAction.node.set_power_state failed: <class 'ironicclient.common.apiclient.exceptions.BadRequest'>: The requested action "power off" can not be performed on node "6b669a27-e2ec-4164-b775-d14dd6ef5d41" while it is in state "clean wait".
    [stack@ ~]$
    
  • The introspection status shows True in the Finished column for every node:

    $ openstack baremetal introspection bulk status
    +--------------------------------------+----------+-------+
    | Node UUID                            | Finished | Error |
    +--------------------------------------+----------+-------+
    | d62e7834-35f3-4c0e-8d58-b67324d30838 | True     | None  |
    | 9cd445b4-52d2-4d0a-b0ed-502ff8cbe0b3 | True     | None  |
    | d7bb62e9-12e0-4e40-88bc-09f9d1e12522 | True     | None  |
    | 6b669a27-e2ec-4164-b775-d14dd6ef5d41 | True     | None  |
    | 08f084fe-ba43-48a6-b041-759952c9000e | True     | None  |
    | 136c4835-b313-4644-9f29-6a0caaa6354f | True     | None  |
    | 156973c6-9cef-4cc7-89ce-65dfe2266bbe | True     | None  |
    | 9700540f-6964-4c7c-86d0-2f84899f1bb5 | True     | None  |
    | 76dd228a-d58a-4c54-a67a-2bb8c0de1fc1 | True     | None  |
    | da952e34-0514-4241-a2e7-fd7476de65fa | True     | None  |
    | 28611a1b-2dcf-46eb-a2b6-26baac85aef7 | True     | None  |
    | 9b45c9fa-51cb-4fa9-bffb-1cd06fc92547 | True     | None  |
    +--------------------------------------+----------+-------+
    
  • We cleaned up one compute node with a fresh RHEL 7 installation and ran introspection again, without success. The failure occurs when the power state is changed to power off after introspection:

    [stack@l ~]$ openstack overcloud node introspect 08f084fe-ba43-48a6-b041-759952c9000e --provide
    Started Mistral Workflow tripleo.baremetal.v1.introspect. Execution ID: 87094e34-5568-4fc1-9d79-b6166da7d541
    Waiting for introspection to finish...
    Waiting for messages on queue 'da4040fd-7921-46a4-9e20-eb5caf72dc51' with no timeout.
    Successfully introspected all nodes.
    Introspection completed.
    Started Mistral Workflow tripleo.baremetal.v1.provide. Execution ID: bf76e148-977f-4998-9ee8-03af3ba45ad5
    Waiting for messages on queue 'da4040fd-7921-46a4-9e20-eb5caf72dc51' with no timeout.
    Failed to set nodes to available state:  IronicAction.node.set_power_state failed: <class 'ironicclient.common.apiclient.exceptions.BadRequest'>: The requested action "power off" can not be performed on node "08f084fe-ba43-48a6-b041-759952c9000e" while it is in state "clean wait"
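
Because the introspection status reports success even when the provide step fails, the ironic provision state is the more useful thing to check. The following is a minimal sketch that filters a node listing down to the nodes in clean wait or clean failed. The name stuck_clean_nodes is a hypothetical helper. The usage line assumes that the installed baremetal client accepts `-f value` and the column names UUID and "Provisioning State"; verify this against your client version.

```shell
#!/usr/bin/env bash
# Sketch: read lines of "<uuid> <provision state>" on stdin and print the
# UUIDs of nodes whose state is "clean wait" or "clean failed".
# stuck_clean_nodes is a hypothetical helper name.
stuck_clean_nodes() {
    awk '{
        uuid = $1
        state = $0
        sub(/^[^ \t]+[ \t]+/, "", state)   # drop the UUID, keep the two-word state
        sub(/[ \t]+$/, "", state)          # trim trailing whitespace
        if (state == "clean wait" || state == "clean failed") print uuid
    }'
}

# Usage on the undercloud (not executed here; the flags are an assumption):
#   openstack baremetal node list -f value -c UUID -c "Provisioning State" \
#       | stuck_clean_nodes
```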
    

