Overcloud update fails when adding storage, and pcs commands then show the "free(): invalid next size (normal)" message

Solution Verified

Environment

  • Red Hat OpenStack Platform 16
  • Red Hat Enterprise Linux Server 8 and 9 (with the Red Hat High Availability Add-On)
  • Pacemaker earlier than version 2.1.5

Issue

  • Overcloud deployment fails when an administrator adds entries to CinderVolumeOptVolumes.
  • Running pcs commands afterwards prints "free(): invalid next size (normal)".
    The high-availability cluster on the controller nodes does not work correctly.

Resolution

Red Hat Enterprise Linux 8

  • The issue (RHEL-14119) is resolved by erratum RHBA-2023:7527 in package pacemaker-2.0.5-9.el8_4.8 on RHEL 8.4.0.z or later.
  • The issue (RHEL-14120) is resolved by erratum RHBA-2023:7406 in package pacemaker-2.1.2-4.el8_6.8 on RHEL 8.6.0.z or later.
  • The issue (Bugzilla bug 2122352) is resolved by erratum RHBA-2023:2818 in package pacemaker-2.1.5-8.el8 on RHEL 8.8 or later.

Red Hat Enterprise Linux 9

  • The issue (Bugzilla bug 2122353) is resolved by erratum RHBA-2023:2150 in package pacemaker-2.1.5-7.el9 on RHEL 9.2 or later.
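
To check whether a controller node already runs a fixed build, the installed version-release can be compared with the fixed package listed above for its RHEL stream. The following is a minimal sketch, not a supported tool: vr_at_least is a hypothetical helper, and it relies on GNU sort -V, which only approximates RPM's version ordering.

```shell
#!/bin/sh
# Hypothetical helper (not part of rpm or pcs).
# Succeeds if version-release $1 sorts at or after $2.
# Assumption: GNU "sort -V" ordering is close enough to RPM ordering
# for pacemaker version-release strings like the ones in this article.
vr_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example usage on a RHEL 8.6.0.z controller node
# (fixed build taken from the list above):
#   installed=$(rpm -q --queryformat '%{VERSION}-%{RELEASE}' pacemaker)
#   if vr_at_least "$installed" "2.1.2-4.el8_6.8"; then
#       echo "fixed build installed"
#   else
#       echo "affected build: $installed"
#   fi
```

The second argument must be the fixed build for the node's own stream, because the fix was backported to builds whose upstream version is lower than 2.1.5.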

Workaround

  1. Stop the pacemaker service on each controller node.

    # systemctl stop pacemaker.service
    

    If systemctl stop fails, disable the service and reboot the affected controller node(s) instead.

    # systemctl disable pacemaker.service
    # systemctl reboot
    
  2. Edit /var/lib/pacemaker/cib/cib.xml with an editor such as vi on each controller node.

    # vi /var/lib/pacemaker/cib/cib.xml
    
      Increment the epoch="XXX" value on the first line.
      Remove some of the <storage-mapping id=...> lines so that the remaining mappings fit within the 4-KB limit described in Root Cause.
    
  3. Remove cib.last and cib.xml.sig on each controller node.

    # rm /var/lib/pacemaker/cib/cib.last
    # rm /var/lib/pacemaker/cib/cib.xml.sig 
    
  4. Start the pacemaker service and check the result on each controller node.

    # systemctl start pacemaker.service
    # pcs status --full
    # pcs config
    

    If you disabled the service in Step 1, re-enable it on the controller node(s).

    # systemctl enable pacemaker.service
    
  5. Run the openstack overcloud deploy command again.
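
Before editing cib.xml in Step 2, it can help to record the current epoch and the number of storage-mapping entries, ideally against a backup copy. The helpers below are a minimal read-only sketch. cib_epoch and count_storage_mappings are hypothetical names, and the code assumes that the epoch attribute sits on the line of the opening <cib ...> tag and that each <storage-mapping ...> element is on its own line.

```shell
#!/bin/sh
# Hypothetical read-only helpers; they never modify the file they inspect.

# Print the epoch value from the opening <cib ...> tag.
# The whitespace required before "epoch" keeps admin_epoch from matching.
cib_epoch() {
    sed -n 's/.*<cib[^>]*[[:space:]]epoch="\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1
}

# Count <storage-mapping ...> elements (assumes one element per line).
count_storage_mappings() {
    grep -c '<storage-mapping ' "$1"
}

# Example usage (the backup path is an assumption):
#   cp -a /var/lib/pacemaker/cib/cib.xml /root/cib.xml.bak
#   cib_epoch /root/cib.xml.bak
#   count_storage_mappings /root/cib.xml.bak
```

Running the same two helpers after the edit gives a quick check that the epoch increased and that the number of mappings went down.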

Root Cause

  • This issue is caused by an upstream Pacemaker bug.
  • Pacemaker before v2.1.5 allocates a 4-KB buffer for all mount points (storage-mapping items) in cib.xml. This limit is hard-coded.
  • If the storage mappings (storage-mapping) of openstack-cinder-volume exceed this limit, the deployment fails.
    To avoid this issue, the "CinderVolumeOptVolumes:"(*1) values must fit within the 4-KB buffer when the overcloud is deployed.

    (*1)
    CinderVolumeOptVolumes:
    - /etc/cinder/xxx:ro
    - /etc/cinder/yyy:ro
    ...
    - /etc/cinder/zzz:ro
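
As a rough illustration of how quickly the limit can be reached, the sketch below sums the length of the " -v source:target:options" arguments (the form visible in the core dump under Diagnostic Steps) for a list of mappings read from standard input. estimate_mount_args_len is a hypothetical helper. The result is only a lower bound, because the memory dump suggests the same buffer also holds other container options, and Pacemaker's exact accounting is not reproduced here.

```shell
#!/bin/sh
# Hypothetical helper: reads one "source:target:options" mapping per line
# from standard input and prints the total length of the corresponding
# " -v source:target:options" arguments.
# Assumption: this mirrors the argument form seen in the core dump; it is
# not Pacemaker's actual accounting.
estimate_mount_args_len() {
    awk '{ total += length(" -v " $0) } END { print total + 0 }'
}

# Example usage:
#   printf '%s\n' '/etc/hosts:/etc/hosts:ro' '/dev/log:/dev/log:rw' | estimate_mount_args_len
```

Comparing the printed total with 4096 gives a quick sense of how much headroom remains before new entries are added.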
    

Diagnostic Steps

  • When an administrator tried to add storage to the cluster, the deployment failed as shown below.

    $ openstack overcloud deploy ...(options) 
    ...
    xxx xx xx:xx:xx puppet-user: error: Could not connect to controller: Transport endpoint is not connected
    xxx xx xx:xx:xx puppet-user: Error: /Stage[main]/Tripleo::Fencing/Pacemaker::Stonith::Level[stonith-1-xxxx]/Pcmk_stonith_level[stonith-level-1-$(/usr/sbin/crm_node -n)-stonith-fence_kdump-xxxxxxxxxxxx_stonith-fence_compute-fence-nova]: Could not evaluate: pcs -f  stonith level | sed -n \"/^Target: $(/usr/sbin/crm_node -n)$/,/^Target:/{/^Target: $(/usr/sbin/crm_node -n)$/b;/^Target:/b;p}\" | grep -e \"Level[[:space:]]*1[[:space:]]*-[[:space:]]*stonith-fence_kdump-xxxx,stonith-fence_compute-fence-nova\" failed: . Too many tries\n
    xxx xx xx:xx:xx puppet-user: Error: /Stage[main]/Tripleo::Fencing/Pacemaker::Stonith::Level[stonith-2-xxxx]/Pcmk_stonith_level[stonith-level-2-$(/usr/sbin/crm_node -n)-stonith-fence_ipmilan-xxxx_stonith-fence_compute-fence-nova]: Could not evaluate: pcs -f  stonith level | sed -n \"/^Target: $(/usr/sbin/crm_node -n)
    ...
    [2023-xx-xx xx:xx:xx.xxx] 2023-xx-xx xx:xx:xx.xxx | xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx |      FATAL | Wait for puppet host configuration to finish | $HOSTNAME | error={"ansible_job_id": "xxxxx.xxxxx", "attempts": 360, "changed": false, "failed_when_result": true, "finished": 0, "started": 1}
    
  • When they ran a pcs command, "Error: error running crm_mon, is pacemaker running?" was displayed, followed by "free(): invalid next size (normal)".

    $ sudo pcs status --full
    Error: error running crm_mon, is pacemaker running?
    free(): invalid next size (normal)
    
  • The core dump shows that the process aborted because of a buffer overflow caused by the storage mapping data.

    #1  0x0000ZZZZZZZZZZZZ in __GI_abort () at abort.c:79
    #2  0x0000ZZZZZZZZZZZZ in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xAAAAAAAAAAAA "%s\n") at ../sysdeps/posix/libc_fatal.c:181
    #3  0x0000ZZZZZZZZZZZZ in malloc_printerr (str=str@entry=0xBBBBBBBBBBBB "free(): invalid next size (normal)") at malloc.c:5374
    #4  0x0000ZZZZZZZZZZZZ in _int_free (av=0xCCCCCCCCCCCC <main_arena>, p=0xDDDDDDDDDDDD, have_lock=<optimized out>) at malloc.c:4334
    
    (gdb) x/64s 0xDDDDDDDDDDDD
    ...
    0xZZZZZZZZZZZZ: " -e PCMK_stderr=1 --net=host -e PCMK_remote_port=3121 -v /etc/hosts:/etc/hosts:ro -v /etc/localtime:/etc/localtime:ro -v /etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro -v /etc/pki/ca-trust"...
    0xZZZZZZZZZZZZ: "/source/anchors:/etc/pki/ca-trust/source/anchors:ro -v /etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro -v /etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust"...
    0xZZZZZZZZZZZZ: ".crt:ro -v /etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro -v /dev/log:/dev/log:rw -v /etc/puppet:/etc/puppet:ro -v /var/lib/config-data/puppet-generated/cinder:/var/lib/kolla/config_files/src:ro -v /v"...
    ...
    ...             (Long arguments related to storage mapping points were seen here.)
    ...
    0xZZZZZZZZZZZZ: "/kolla/config_files/cinder_volume.json:/var/lib/kolla/config_files/config.json:ro -v /etc/iscsi:/var/lib/kolla/config_files/src-iscsid:ro -v /etc/ceph:/var/lib/kolla/config_files/src-ceph:ro -v /lib/m"...
    0xZZZZZZZZZZZZ: "odules:/lib/modules:ro -v /dev/:/dev/:rw -v /run/:/run/:rw -v /sys:/sys:rw -v /var/lib/cinder:/" 
    0xZZZZZZZZZZZZ: ""
    

