OpenStack FFU to 17.1 upgrade fails to look up user ceph-admin

Solution Verified - Updated -

Environment

  • Red Hat OpenStack Platform 17.1

Issue

The command 'openstack overcloud external-upgrade run' in step 6.2.6 of the OpenStack upgrade to 17.1 fails because the ceph-admin user was not found on one or more servers.

Resolution

Re-run the previous two upgrade steps, adding the Ansible groups for the remaining Ceph services to the --limit option, and do not skip the validations.

  1. Run the command from step 6.2.4 as below (which creates the ceph-admin user)
ANSIBLE_LOG_PATH=/home/stack/cephadm_enable_user_key.log \
ANSIBLE_HOST_KEY_CHECKING=false \
ansible-playbook -i /home/stack/overcloud-deploy/<stack>/config-download/<stack>/tripleo-ansible-inventory.yaml \
  -b -e ansible_python_interpreter=/usr/libexec/platform-python /usr/share/ansible/tripleo-playbooks/ceph-admin-user-playbook.yml \
  -e tripleo_admin_user=ceph-admin \
  -e distribute_private_key=true \
  --limit Undercloud,ceph_mon,ceph_mgr,ceph_rgw,ceph_mds,ceph_nfs,ceph_grafana,ceph_osd

Note that the ceph_rgw, ceph_mds, ceph_nfs, and ceph_grafana Ansible groups are included in the --limit option.

  2. Run the command from step 6.2.5 as below (which upgrades Ceph)
openstack overcloud upgrade run \
--stack <stack> \
--skip-tags ceph_ansible_remote_tmp \
--tags setup_packages --limit Undercloud,ceph_mon,ceph_mgr,ceph_rgw,ceph_mds,ceph_nfs,ceph_grafana \
--playbook /home/stack/overcloud-deploy/<stack>/config-download/<stack>/upgrade_steps_playbook.yaml 2>&1

Note that ceph_health and opendev-validation are not listed in --skip-tags, so these validations are not skipped.

  3. Run step 6.2.6 again.
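
Before running step 6.2.6 again, it can help to confirm that the ceph-admin account now exists on each node that hosts a Ceph service. The following is a minimal sketch, not part of the documented procedure: has_user is a hypothetical helper name, and the script is assumed to be run locally on each node (for example, over SSH).

```shell
#!/usr/bin/env bash
# Sketch only: has_user is a hypothetical helper, not part of TripleO tooling.
has_user() {
  # id resolves the name through the system passwd database, so a missing
  # account makes it fail in the same situation as the getpwnam() error.
  id -u "$1" >/dev/null 2>&1
}

# Report whether the account created by the playbook in step 6.2.4 is present.
if has_user ceph-admin; then
  echo "ceph-admin present on $(uname -n)"
else
  echo "ceph-admin missing on $(uname -n)"
fi
```

Any node that reports the user as missing probably belongs to a group that was left out of the --limit option in step 6.2.4.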

Root Cause

The playbook that creates the ceph-admin user was limited to the ceph_osd, ceph_mon, and Undercloud Ansible groups. Servers in the ceph_rgw group did not get the user because they were excluded from the --limit option.

By default, the ceph_mon and ceph_rgw groups both refer to the OpenStack controller nodes. In environments where the Ceph services are distributed across different roles, however, some hosts might be excluded from the limit, for example when ceph_mon is not on the controllers but ceph_rgw is.

Diagnostic Steps

  1. Look for the fatal error in the output of the command in step 6.2.6:
99999999:2024-01-01 22:22:22,222 p=333333 u=root n=ansible | fatal: [controller01]: FAILED! => {"changed": false, "msg": "Failed to lookup user ceph-admin: \"getpwnam(): name not found: 'ceph-admin'\""}
  2. Check the OpenStack templates (roles data) or the Ceph configuration to see where the Ceph services are deployed. If ceph_rgw is in a different role than the other Ceph services, this solution applies.
  3. Inspect the tripleo-ansible-inventory.yaml file in the config-download directory for groups matching ceph_ and include them in the --limit option.
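
The inventory inspection in the last step can be sketched as a small shell function. This is an illustration under assumptions: ceph_limit is a hypothetical helper name, and it assumes that the inventory lists each group as a top-level (unindented) YAML key. Treat the output as a candidate value to review, not as the exact --limit for every step.

```shell
#!/usr/bin/env bash
# Sketch only: ceph_limit is a hypothetical helper, not part of TripleO.
# Assumes inventory groups appear as top-level (unindented) YAML keys.
ceph_limit() {
  local inventory="$1"
  local found
  # Collect the unique group names that start with ceph_ and join them with commas.
  found=$(grep -oE '^ceph_[A-Za-z0-9_]+' "$inventory" | sort -u | paste -sd, -)
  # Keep Undercloud first, as in the commands from steps 6.2.4 and 6.2.5.
  printf 'Undercloud%s\n' "${found:+,$found}"
}

# Example call; replace <stack> with the stack name:
# ceph_limit "/home/stack/overcloud-deploy/<stack>/config-download/<stack>/tripleo-ansible-inventory.yaml"
```

Compare the printed list with the --limit values used in steps 6.2.4 and 6.2.5 to spot any ceph_ group that was left out.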
