Updating overcloud fails when adding storage, and then pcs commands show "free(): invalid next size (normal)" message.

Solution Verified - Updated 2024-06-03T16:37:53+00:00

Environment

Red Hat OpenStack Platform 16
Red Hat Enterprise Linux Server 8 and 9 (with the Red Hat High Availability Add-On)
Pacemaker versions earlier than 2.1.5
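
To check whether a controller node falls into this environment, the installed Pacemaker version can be compared against 2.1.5. The following is a minimal sketch, not an official check: `version_lt` is a hypothetical helper name, and it relies on `sort -V` from GNU coreutils. It compares only the upstream version string, so a result older than 2.1.5 still needs to be checked against the fixed package releases listed under Resolution, because some older builds carry the backported fix.

```shell
# Hypothetical helper: returns 0 when version $1 sorts strictly before version $2.
version_lt() {
    [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Query the installed version only when rpm and the pacemaker package are present.
if command -v rpm >/dev/null 2>&1 && rpm -q pacemaker >/dev/null 2>&1; then
    installed=$(rpm -q --queryformat '%{VERSION}' pacemaker)
    if version_lt "$installed" 2.1.5; then
        echo "pacemaker $installed is older than 2.1.5; compare the full package release with the errata"
    else
        echo "pacemaker $installed is 2.1.5 or later"
    fi
fi
```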

Issue

Overcloud deployment fails when an administrator adds entries to CinderVolumeOptVolumes.
When they then run pcs commands, "free(): invalid next size (normal)" is displayed.
The HA cluster on the controller nodes does not work correctly.

Resolution

Red Hat Enterprise Linux 8

The issue (RHEL-14119) has been resolved with the errata RHBA-2023:7527 with the following package(s): pacemaker-2.0.5-9.el8_4.8 on RHEL 8.4.0.z or later.
The issue (RHEL-14120) has been resolved with the errata RHBA-2023:7406 with the following package(s): pacemaker-2.1.2-4.el8_6.8 on RHEL 8.6.0.z or later.
The issue (bugzilla bug: 2122352) has been resolved with the errata RHBA-2023:2818 with the following package(s): pacemaker-2.1.5-8.el8 on RHEL 8.8 or later.

Red Hat Enterprise Linux 9

The issue (bugzilla bug: 2122353) has been resolved with the errata RHBA-2023:2150 with the following package(s): pacemaker-2.1.5-7.el9 on RHEL 9.2 or later.

Workaround

Stop the pacemaker service on each controller node.
```
# systemctl stop pacemaker.service
```
If systemctl stop fails on a controller node, disable the service and reboot that node instead.
```
# systemctl disable pacemaker.service
# systemctl reboot
```

Edit /var/lib/pacemaker/cib/cib.xml with an editor such as vi on each controller node.

# vi /var/lib/pacemaker/cib/cib.xml

  Increment the epoch="XXX" value in the top line.
  Remove some of the <storage-mapping id=...> lines to reduce the storage mapping data.
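
The epoch change can also be scripted instead of being made by hand. The following is a minimal sketch under stated assumptions: `bump_epoch` is a hypothetical helper name, the `epoch` attribute is assumed to sit on the first line of the file as described above, and `sed -i` is the GNU sed form. It matches only `epoch` preceded by whitespace, so `admin_epoch` is left untouched. It does not remove any storage-mapping lines; which ones to remove remains a manual decision.

```shell
# Hypothetical helper: increment the epoch="N" attribute on line 1 of a CIB file.
# The leading-whitespace match keeps admin_epoch="N" unchanged.
bump_epoch() {
    file=$1
    # Read the current epoch value from the first line.
    current=$(sed -n '1s/.*[[:space:]]epoch="\([0-9][0-9]*\)".*/\1/p' "$file")
    if [ -z "$current" ]; then
        echo "no epoch attribute found on line 1 of $file" >&2
        return 1
    fi
    next=$((current + 1))
    # Rewrite only the first line, in place.
    sed -i "1s/\([[:space:]]\)epoch=\"$current\"/\1epoch=\"$next\"/" "$file"
}

# Example use on a controller node, with pacemaker stopped
# (commented out so the sketch has no side effects):
# bump_epoch /var/lib/pacemaker/cib/cib.xml
```

Keeping a copy of the original cib.xml outside the cib directory before running such a helper makes it possible to roll back if the edit goes wrong.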

Remove cib.last and cib.xml.sig on each controller node.

# rm /var/lib/pacemaker/cib/cib.last
# rm /var/lib/pacemaker/cib/cib.xml.sig

Start the pacemaker service on each controller node and check the result.
```
# systemctl start pacemaker.service
# pcs status --full
# pcs config
```
If you disabled the service in the first step, re-enable it on the affected controller node(s).
```
# systemctl enable pacemaker.service
```
Run the openstack overcloud deploy command again.

Root Cause

This issue is caused by an upstream Pacemaker bug.
Pacemaker versions before 2.1.5 allocate a fixed 4-KB buffer for all mount points (storage-mapping items) in cib.xml. This limit is hard-coded.
If the storage mappings (storage-mapping) of openstack-cinder-volume exceed this limit at deployment, the deployment fails.
To avoid this issue, the "CinderVolumeOptVolumes:"(*1) values must fit within the 4-KB buffer when the overcloud is deployed.
```
(*1)
CinderVolumeOptVolumes:
- /etc/cinder/xxx:ro
- /etc/cinder/yyy:ro
...
- /etc/cinder/zzz:ro
```
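
To get a rough idea of how much of the 4-KB buffer a set of CinderVolumeOptVolumes entries would use, their lengths can be summed before deploying. The following is a sketch under stated assumptions, not Pacemaker's actual accounting: `mapping_bytes` is a hypothetical helper name, each entry is assumed to be passed to the container as ` -v <entry>` (as seen in the core dump under Diagnostic Steps), and 4 KB is taken as 4096 bytes. The real buffer also holds the default mounts and other arguments, so the usable space is smaller than this estimate suggests.

```shell
# Hypothetical helper: sum the bytes each volume entry adds as " -v <entry>".
mapping_bytes() {
    total=0
    for entry in "$@"; do
        # 4 extra bytes per entry: the leading space, "-v", and the space before the entry.
        total=$((total + ${#entry} + 4))
    done
    echo "$total"
}

# Example with the placeholder entries from the listing above.
total=$(mapping_bytes /etc/cinder/xxx:ro /etc/cinder/yyy:ro /etc/cinder/zzz:ro)
if [ "$total" -ge 4096 ]; then
    echo "extra volume arguments alone need $total bytes and exceed the 4096-byte limit"
else
    echo "extra volume arguments need about $total bytes, before the default mounts are counted"
fi
```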

Diagnostic Steps

When an administrator tried to add storage to the cluster, the deployment failed as follows.

$ openstack overcloud deploy ...(options) 
...
xxx xx xx:xx:xx puppet-user: error: Could not connect to controller: Transport endpoint is not connected
xxx xx xx:xx:xx puppet-user: Error: /Stage[main]/Tripleo::Fencing/Pacemaker::Stonith::Level[stonith-1-xxxx]/Pcmk_stonith_level[stonith-level-1-$(/usr/sbin/crm_node -n)-stonith-fence_kdump-xxxxxxxxxxxx_stonith-fence_compute-fence-nova]: Could not evaluate: pcs -f  stonith level | sed -n \"/^Target: $(/usr/sbin/crm_node -n)$/,/^Target:/{/^Target: $(/usr/sbin/crm_node -n)$/b;/^Target:/b;p}\" | grep -e \"Level[[:space:]]*1[[:space:]]*-[[:space:]]*stonith-fence_kdump-xxxx,stonith-fence_compute-fence-nova\" failed: . Too many tries\n
xxx xx xx:xx:xx puppet-user: Error: /Stage[main]/Tripleo::Fencing/Pacemaker::Stonith::Level[stonith-2-xxxx]/Pcmk_stonith_level[stonith-level-2-$(/usr/sbin/crm_node -n)-stonith-fence_ipmilan-xxxx_stonith-fence_compute-fence-nova]: Could not evaluate: pcs -f  stonith level | sed -n \"/^Target: $(/usr/sbin/crm_node -n)
...
[2023-xx-xx xx:xx:xx.xxx] 2023-xx-xx xx:xx:xx.xxx | xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx |      FATAL | Wait for puppet host configuration to finish | $HOSTNAME | error={"ansible_job_id": "xxxxx.xxxxx", "attempts": 360, "changed": false, "failed_when_result": true, "finished": 0, "started": 1}

When they executed a pcs command, "Error: error running crm_mon, is pacemaker running?" was displayed, followed by the free() abort message.

$ sudo pcs status --full
Error: error running crm_mon, is pacemaker running?
free(): invalid next size (normal)
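
The same check can be scripted across controller nodes by searching the command output for the abort message. The following is a minimal sketch: `has_heap_error` is a hypothetical helper name, and stderr is merged into stdout in the usage line on the assumption that the message may be printed on either stream.

```shell
# Hypothetical helper: succeeds when stdin contains the glibc abort message.
has_heap_error() {
    grep -qF 'free(): invalid next size'
}

# Example use on a controller node
# (commented out so the sketch has no side effects):
# sudo pcs status --full 2>&1 | has_heap_error && echo "pcs hit the free() abort"
```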

The core dump showed that the process aborted because the storage mapping data overflowed a buffer.

#1  0x0000ZZZZZZZZZZZZ in __GI_abort () at abort.c:79
#2  0x0000ZZZZZZZZZZZZ in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xAAAAAAAAAAAA "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x0000ZZZZZZZZZZZZ in malloc_printerr (str=str@entry=0xBBBBBBBBBBBB "free(): invalid next size (normal)") at malloc.c:5374
#4  0x0000ZZZZZZZZZZZZ in _int_free (av=0xCCCCCCCCCCCC <main_arena>, p=0xDDDDDDDDDDDD, have_lock=<optimized out>) at malloc.c:4334

(gdb) x/64s 0xDDDDDDDDDDDD
...
0xZZZZZZZZZZZZ: " -e PCMK_stderr=1 --net=host -e PCMK_remote_port=3121 -v /etc/hosts:/etc/hosts:ro -v /etc/localtime:/etc/localtime:ro -v /etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro -v /etc/pki/ca-trust"...
0xZZZZZZZZZZZZ: "/source/anchors:/etc/pki/ca-trust/source/anchors:ro -v /etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro -v /etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust"...
0xZZZZZZZZZZZZ: ".crt:ro -v /etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro -v /dev/log:/dev/log:rw -v /etc/puppet:/etc/puppet:ro -v /var/lib/config-data/puppet-generated/cinder:/var/lib/kolla/config_files/src:ro -v /v"...
...
...             (Long arguments related to storage mapping points were seen here.)
...
0xZZZZZZZZZZZZ: "/kolla/config_files/cinder_volume.json:/var/lib/kolla/config_files/config.json:ro -v /etc/iscsi:/var/lib/kolla/config_files/src-iscsid:ro -v /etc/ceph:/var/lib/kolla/config_files/src-ceph:ro -v /lib/m"...
0xZZZZZZZZZZZZ: "odules:/lib/modules:ro -v /dev/:/dev/:rw -v /run/:/run/:rw -v /sys:/sys:rw -v /var/lib/cinder:/" 
0xZZZZZZZZZZZZ: ""

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
