Updating the overcloud fails when adding storage, and pcs commands then show the "free(): invalid next size (normal)" message.

Environment

  • Red Hat OpenStack Platform 16
  • Red Hat Enterprise Linux Server 8 and 9 (with the Red Hat High Availability Add-On)
  • Pacemaker versions earlier than 2.1.5


Issue

  • openstack overcloud deploy fails when an administrator adds entries to CinderVolumeOptVolumes.
  • Running pcs commands displays "free(): invalid next size (normal)", and the high-availability cluster on the controller nodes no longer works correctly.


Resolution

Red Hat Enterprise Linux 8

  • The issue (RHEL-14119) has been resolved by erratum RHBA-2023:7527, which provides pacemaker-2.0.5-9.el8_4.8 for RHEL 8.4.0.z or later.
  • The issue (RHEL-14120) has been resolved by erratum RHBA-2023:7406, which provides pacemaker-2.1.2-4.el8_6.8 for RHEL 8.6.0.z or later.
  • The issue (Bugzilla bug 2122352) has been resolved by erratum RHBA-2023:2818, which provides pacemaker-2.1.5-8.el8 for RHEL 8.8 or later.

Red Hat Enterprise Linux 9

  • The issue (Bugzilla bug 2122353) has been resolved by erratum RHBA-2023:2150, which provides pacemaker-2.1.5-7.el9 for RHEL 9.2 or later.
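To confirm whether a controller already runs a fixed build, the installed Pacemaker version can be compared with the fixed build listed above for the same RHEL stream. The following is a minimal shell sketch, not a Red Hat tool: the function name pcmk_is_fixed is hypothetical, and it assumes that GNU sort -V orders two VERSION-RELEASE strings from the same stream the way rpm does, which is close to, but not the same as, rpm's own version comparison.

```shell
#!/bin/bash
# Hypothetical helper: exit 0 if INSTALLED is the same as or newer than FIXED.
# Assumption: both arguments are VERSION-RELEASE strings from the same RHEL
# stream, so GNU "sort -V" orders them as rpm would. This is not rpmvercmp.
pcmk_is_fixed() {
    local installed=$1 fixed=$2
    # The older of the two sorts first; if that is FIXED, INSTALLED >= FIXED.
    [ "$(printf '%s\n%s\n' "$installed" "$fixed" | sort -V | head -n 1)" = "$fixed" ]
}
```

On a RHEL 8.6.0.z controller, for example, run pcmk_is_fixed "$(rpm -q --qf '%{VERSION}-%{RELEASE}' pacemaker)" 2.1.2-4.el8_6.8; a non-zero exit status means the fixed package is not installed yet.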


Workaround

  1. Stop the pacemaker service on each controller node.

    # systemctl stop pacemaker.service

    If systemctl stop fails, disable the service and reboot the affected controller node(s) instead.

    # systemctl disable pacemaker.service
    # systemctl reboot
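  2. Edit /var/lib/pacemaker/cib/cib.xml with a text editor such as vi on each controller node.

    # vi /var/lib/pacemaker/cib/cib.xml
      Increment the epoch="XXX" value in the first line (the <cib> element).
      Remove <storage-mapping id=...> lines that are not required, so that the remaining storage mappings fit within the 4-KB buffer described under Root Cause.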
  3. Remove cib.last and cib.xml.sig on each controller node. cib.xml.sig holds a digest of cib.xml, which no longer matches after the manual edit.

    # rm /var/lib/pacemaker/cib/cib.last
    # rm /var/lib/pacemaker/cib/cib.xml.sig
  4. Start the pacemaker service on each controller node and check the result.

    # systemctl start pacemaker.service
    # pcs status --full
    # pcs config

    If you disabled the service in Step 1, re-enable it on the controller node(s).

    # systemctl enable pacemaker.service
  5. Run the openstack overcloud deploy command again, keeping the CinderVolumeOptVolumes values within the 4-KB limit described under Root Cause.
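If several controllers need the same change, Steps 2 and 3 could be scripted instead of done by hand. The sketch below is hypothetical (edit_cib is not a Pacemaker or pcs command) and assumes that pacemaker is already stopped on the node (Step 1), that the <cib> element with its epoch attribute is on a single line, and that each <storage-mapping> element sits on its own line starting with its id attribute. It keeps cib.xml.bak as a rollback copy.

```shell
#!/bin/bash
# Hypothetical helper for Steps 2 and 3.
# Usage: edit_cib CIB_DIR MAPPING_ID [MAPPING_ID ...]
# Assumptions: pacemaker is stopped, <cib ... epoch="N" ...> is on one line,
# and each <storage-mapping id="..." .../> element is on its own line.
edit_cib() {
    local dir=$1; shift
    local cib="$dir/cib.xml"

    # Keep a rollback copy (overwritten on every run).
    cp -p "$cib" "$cib.bak" || return 1

    # Writing through ">" truncates the existing file, so its owner and
    # permissions are preserved.
    awk -v ids="$*" '
        BEGIN { n_ids = split(ids, drop, " ") }
        # Increment epoch on the <cib> element. The leading space keeps the
        # pattern from matching inside admin_epoch="...".
        /<cib[ >]/ && match($0, / epoch="[0-9]+"/) {
            e = substr($0, RSTART + 8, RLENGTH - 9) + 1
            $0 = substr($0, 1, RSTART - 1) " epoch=\"" e "\"" substr($0, RSTART + RLENGTH)
        }
        {
            # Drop storage-mapping lines whose id matches exactly.
            for (i = 1; i <= n_ids; i++)
                if (index($0, "<storage-mapping id=\"" drop[i] "\"")) next
            print
        }
    ' "$cib.bak" > "$cib" || return 1

    # Step 3: remove the files that no longer match the edited cib.xml.
    rm -f "$dir/cib.last" "$dir/cib.xml.sig"
}
```

For example, edit_cib /var/lib/pacemaker/cib MAPPING_ID_1 MAPPING_ID_2 removes two mappings in a single run. Running it once, with the same IDs, on every controller keeps the epoch and content identical across nodes.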

Root Cause

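  • These symptoms are caused by an upstream Pacemaker bug.
  • Pacemaker releases earlier than 2.1.5 build the container run options for a bundle, including one "-v" argument per mount point (storage-mapping element in cib.xml), in a fixed 4-KB buffer. This size is hard-coded.
  • If the storage mappings of the openstack-cinder-volume bundle exceed this limit, Pacemaker writes past the end of the buffer and corrupts the heap, so the deployment fails and pcs commands report "free(): invalid next size (normal)".
    To avoid this issue, keep the "CinderVolumeOptVolumes:"(*1) values short enough to fit within the 4-KB buffer when the overcloud is deployed, for example: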

    - /etc/cinder/xxx:ro
    - /etc/cinder/yyy:ro
    - /etc/cinder/zzz:ro
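To judge how close the cinder-volume bundle is to the limit, the text contributed by its storage mappings can be added up from cib.xml. This is a rough, hypothetical estimate (mapping_bytes is not an existing tool): it counts " -v SOURCE:TARGET[:OPTIONS]" per storage-mapping element of one bundle, ignores mappings that use source-dir-root, and assumes one element per line. The real buffer also holds the other container options (-e, --net and so on, as seen in the core dump under Diagnostic Steps), so the result is a lower bound, and a value approaching 4096 already signals a problem.

```shell
#!/bin/bash
# Hypothetical estimate of the " -v SRC:DST[:OPTS]" text generated for the
# storage mappings of one bundle. Lower bound only: the real 4-KB buffer also
# holds -e, --net and other container run options.
# Assumes one <storage-mapping .../> element per line; source-dir-root is not handled.
# Usage: mapping_bytes CIB_FILE BUNDLE_ID
mapping_bytes() {
    local cib=$1 bundle=$2
    awk -v id="$bundle" '
        # Return the value of attribute NAME on LINE, or "" if it is absent.
        function attr(line, name,    pat) {
            pat = " " name "=\"[^\"]*\""
            if (!match(line, pat)) return ""
            return substr(line, RSTART + length(name) + 3, RLENGTH - length(name) - 4)
        }
        index($0, "<bundle id=\"" id "\"") { inside = 1 }
        inside && /<storage-mapping / {
            s = " -v " attr($0, "source-dir") ":" attr($0, "target-dir")
            opts = attr($0, "options")
            if (opts != "") s = s ":" opts
            total += length(s)
        }
        inside && /<\/bundle>/ { inside = 0 }
        END { print total + 0 }
    ' "$cib"
}
```

Assuming the bundle id is openstack-cinder-volume as named above, the check would be mapping_bytes /var/lib/pacemaker/cib/cib.xml openstack-cinder-volume; the command only reads the file.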

Diagnostic Steps

  • When the administrator tried to add storage to the cluster, the overcloud deployment failed as follows:

    $ openstack overcloud deploy ...(options) 
    xxx xx xx:xx:xx puppet-user: error: Could not connect to controller: Transport endpoint is not connected
    xxx xx xx:xx:xx puppet-user: Error: /Stage[main]/Tripleo::Fencing/Pacemaker::Stonith::Level[stonith-1-xxxx]/Pcmk_stonith_level[stonith-level-1-$(/usr/sbin/crm_node -n)-stonith-fence_kdump-xxxxxxxxxxxx_stonith-fence_compute-fence-nova]: Could not evaluate: pcs -f  stonith level | sed -n \"/^Target: $(/usr/sbin/crm_node -n)$/,/^Target:/{/^Target: $(/usr/sbin/crm_node -n)$/b;/^Target:/b;p}\" | grep -e \"Level[[:space:]]*1[[:space:]]*-[[:space:]]*stonith-fence_kdump-xxxx,stonith-fence_compute-fence-nova\" failed: . Too many tries\n
    xxx xx xx:xx:xx puppet-user: Error: /Stage[main]/Tripleo::Fencing/Pacemaker::Stonith::Level[stonith-2-xxxx]/Pcmk_stonith_level[stonith-level-2-$(/usr/sbin/crm_node -n)-stonith-fence_ipmilan-xxxx_stonith-fence_compute-fence-nova]: Could not evaluate: pcs -f  stonith level | sed -n \"/^Target: $(/usr/sbin/crm_node -n)
    [2023-xx-xx xx:xx:xx.xxx] 2023-xx-xx xx:xx:xx.xxx | xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx |      FATAL | Wait for puppet host configuration to finish | $HOSTNAME | error={"ansible_job_id": "xxxxx.xxxxx", "attempts": 360, "changed": false, "failed_when_result": true, "finished": 0, "started": 1}
  • Running the pcs status command displayed "Error: error running crm_mon, is pacemaker running?" followed by "free(): invalid next size (normal)":

    $ sudo pcs status --full
    Error: error running crm_mon, is pacemaker running?
    free(): invalid next size (normal)
  • Inspecting the core dump showed that the process aborted because the storage-mapping data had overflowed its buffer:

    #1  0x0000ZZZZZZZZZZZZ in __GI_abort () at abort.c:79
    #2  0x0000ZZZZZZZZZZZZ in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xAAAAAAAAAAAA "%s\n") at ../sysdeps/posix/libc_fatal.c:181
    #3  0x0000ZZZZZZZZZZZZ in malloc_printerr (str=str@entry=0xBBBBBBBBBBBB "free(): invalid next size (normal)") at malloc.c:5374
    #4  0x0000ZZZZZZZZZZZZ in _int_free (av=0xCCCCCCCCCCCC <main_arena>, p=0xDDDDDDDDDDDD, have_lock=<optimized out>) at malloc.c:4334
    (gdb) x/64s 0xDDDDDDDDDDDD
    0xZZZZZZZZZZZZ: " -e PCMK_stderr=1 --net=host -e PCMK_remote_port=3121 -v /etc/hosts:/etc/hosts:ro -v /etc/localtime:/etc/localtime:ro -v /etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro -v /etc/pki/ca-trust"...
    0xZZZZZZZZZZZZ: "/source/anchors:/etc/pki/ca-trust/source/anchors:ro -v /etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro -v /etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust"...
    0xZZZZZZZZZZZZ: ".crt:ro -v /etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro -v /dev/log:/dev/log:rw -v /etc/puppet:/etc/puppet:ro -v /var/lib/config-data/puppet-generated/cinder:/var/lib/kolla/config_files/src:ro -v /v"...
    ...             (Long arguments related to storage mapping points were seen here.)
    0xZZZZZZZZZZZZ: "/kolla/config_files/cinder_volume.json:/var/lib/kolla/config_files/config.json:ro -v /etc/iscsi:/var/lib/kolla/config_files/src-iscsid:ro -v /etc/ceph:/var/lib/kolla/config_files/src-ceph:ro -v /lib/m"...
    0xZZZZZZZZZZZZ: "odules:/lib/modules:ro -v /dev/:/dev/:rw -v /run/:/run/:rw -v /sys:/sys:rw -v /var/lib/cinder:/" 

