Noobaa-DB PVC mount fails: RBD CSI NodeStageVolume error “operation with the given Volume ID already exists” on ODF 4.17 with Multus
Description
Environment
Product: Red Hat OpenShift Data Foundation 4.17 (StorageCluster reports version 4.17.18)
OpenShift: 4.17.x (single cluster, homelab)
Platform: VMs (three storage workers: workload-0/1/2, two infra, three masters)
Storage: ODF internal mode using LocalVolumeSet + local-block-sc (6 local block PVs, RBD and CephFS SCs created successfully)
Network: OVN-Kubernetes with Multus, ODF configured with macvlan NADs:
odf-public-net (master bond-data.220)
odf-cluster-net (master bond-data.221)
These VLANs are created via NMState NNCPs on the storage nodes (bond-data, bond-data.220, bond-data.221 etc.).
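For completeness, the NNCP-managed interfaces can be spot-checked directly on a storage node (a diagnostic sketch using the interface names above; output will vary per node):

```shell
# Inspect the bond VLAN subinterfaces from a host chroot on workload-1
oc debug node/workload-1 -- chroot /host sh -c '
  ip -d link show bond-data.220
  ip -d link show bond-data.221
  nmstatectl show bond-data.220
'
```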
Symptom
After a fresh ODF deployment, Ceph cluster, mons, OSDs, and CSI pods are all Running and appear healthy.
Only the noobaa-db-pg-0 pod stays stuck in ContainerCreating on node workload-1.
Its PVC db-noobaa-db-pg-0 is bound to PV pvc-685babc7-20f6-45da-a000-e8ec404cafc9 (SC ocs-storagecluster-ceph-rbd).
VolumeAttachment shows the volume attached to workload-1.
Pod events for noobaa-db-pg-0:
Normal Scheduled default-scheduler Successfully assigned openshift-storage/noobaa-db-pg-0 to workload-1
Normal SuccessfulAttachVolume AttachVolume.Attach succeeded for volume "pvc-685babc7-20f6-45da-a000-e8ec404cafc9"
Warning FailedMount kubelet MountVolume.MountDevice failed for volume "pvc-685babc7-20f6-45da-a000-e8ec404cafc9":
rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning FailedMount (repeated) MountVolume.MountDevice failed:
rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-000000000000000b-b7e6ea81-3642-4527-8d35-775c6a0ea89a already exists
Relevant csi-rbdplugin node plugin logs on workload-1 (csi-rbdplugin-8898k, container csi-rbdplugin):
I... GRPC call: /csi.v1.Node/NodeStageVolume
I... GRPC request: { ... "staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com/d324cab8.../globalmount",
"volume_id":"0001-0011-openshift-storage-000000000000000b-b7e6ea81-3642-4527-8d35-775c6a0ea89a",
"volume_context":{ "clusterID":"openshift-storage", "imageName":"csi-vol-b7e6ea81-3642-4527-8d35-775c6a0ea89a", ... } }
... (no success or normal failure logged for this first call) ...
E... nodeserver.go:320 ... NodeStageVolume ... an operation with the given Volume ID 0001-0011-openshift-storage-000000000000000b-b7e6ea81-3642-4527-8d35-775c6a0ea89a already exists
E... utils.go:240 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID ... already exists
... (repeats for subsequent NodeStageVolume calls)
So it looks like the first NodeStageVolume call for this volume hung and never completed. Since ceph-csi serializes operations per volume ID, every retry is then rejected with Aborted: "operation with the given Volume ID already exists" for as long as the first call is still in flight.
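A NodeStageVolume that hangs like this for RBD usually means the kernel rbd map or the subsequent filesystem mount is blocked below the CSI layer, typically waiting on the mons/OSDs. One way to confirm on workload-1 (a diagnostic sketch; device names and log contents will differ):

```shell
oc debug node/workload-1 -- chroot /host sh -c "
  # Any rbd devices already mapped on this node?
  rbd device list
  # Kernel-side rbd/libceph errors (e.g. connect timeouts to mons)?
  dmesg | grep -iE 'rbd|libceph' | tail -n 50
  # Processes stuck in uninterruptible sleep (a hung map/mount shows up here)
  ps -eo pid,stat,wchan:32,cmd | awk '\$2 ~ /D/'
"
```

If `dmesg` shows libceph timeouts or there is a process stuck in D state on an rbd map, the problem is kernel-client connectivity to Ceph rather than CSI state.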
What I have already tried
Full ODF teardown and redeploy (twice)
Deleted NooBaa (after setting spec.cleanupPolicy.allowNoobaaDeletion: true).
Deleted CephBlockPools, CephFilesystem, CephObjectStore, CephCluster, StorageCluster, StorageSystem, and ODF SCs.
Deleted remaining ODF PVC/PVs and cleaned up local PVs.
Uninstalled and then reinstalled the ODF operator.
Re‑applied the same StorageCluster (multus public/cluster nets; same local-block devices).
After redeploy, a new PVC/PV/VolumeID is created for NooBaa DB, but the same pattern appears: attach succeeds, NodeStageVolume hangs, and we get "operation with the given Volume ID already exists".
Node‑side cleanup on workload-1
Used oc debug node/workload-1 and chroot /host.
Verified staging path exists:
/var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com/d324cab8.../globalmount
Checked it was not mounted (mount | grep d324cab8 and findmnt showed nothing).
Removed the staging directory:
rm -rf /var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com/d324cab8...
Restarted the CSI nodeplugin pod on workload-1:
oc -n openshift-storage delete pod csi-rbdplugin-8898k
Deleted and let the operator recreate noobaa-db-pg-0.
The first NodeStageVolume call still hangs and all retries still return “operation with the given Volume ID already exists”.
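Note that ceph-csi's per-volume lock is in-memory, so restarting the csi-rbdplugin pod does clear it; the fact that the retry immediately hangs again points below the CSI layer. One extra cleanup step worth checking is a stale kernel rbd mapping left behind by the hung call (a hedged sketch; the image name is the one from volume_context above, and the unmap is destructive if anything is still mounted):

```shell
oc debug node/workload-1 -- chroot /host sh -c '
  # Is the NooBaa DB image still mapped on this node?
  rbd device list | grep csi-vol-b7e6ea81 || echo "image not mapped"
  # If it is mapped but unused, force-unmap it (only when nothing is mounted!)
  # rbd device unmap -o force /dev/rbd0
'
```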
Networking / Multus configuration validation
I found Bugzilla 2282543 (“Noobaa-DB Failed mount on cluster with multus”), where the root cause was an incorrect NAD master (br-ex).
In my case, NADs are:
odf-public-net:
{ "type": "macvlan", "mode": "bridge", "master": "bond-data.220", "ipam": { "type": "whereabouts", "range": "192.168.210.0/24", ... } }
odf-cluster-net:
{ "type": "macvlan", "mode": "bridge", "master": "bond-data.221", "ipam": { "type": "whereabouts", "range": "192.168.211.0/24", ... } }
NNCPs configure bond-data, bond-data.220, bond-data.221 correctly on the storage nodes.
So the “master is br-ex” misconfig from that BZ does not apply here.
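One more Multus-specific angle: since ODF removed the CSI holder pods, the kernel rbd client runs in the host network namespace, so the node itself (not just pods with the NAD attached) must be able to reach the mons on the ODF public network. If the host has no address/route on 192.168.210.0/24, NodeStageVolume would hang exactly like this. A quick check (a sketch; the mon IP is hypothetical, substitute one from `ceph mon dump` in the toolbox):

```shell
oc debug node/workload-1 -- chroot /host sh -c '
  # Does the host itself have an address/route on the ODF nets?
  ip -br addr | grep -E "192\.168\.21[01]" || echo "no host IP on ODF nets"
  ip route | grep -E "192\.168\.21[01]" || echo "no host route to ODF nets"
  # Can the host reach a mon on the public network? (hypothetical mon IP)
  # curl -sv --connect-timeout 5 telnet://192.168.210.10:3300
'
```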
Cluster state
Ceph cluster otherwise looks healthy (mons, mgr, OSDs, RGW all Running).
Only the NooBaa DB PVC is affected; other ODF components start correctly.
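To narrow down whether this is specific to the NooBaa DB volume or affects any RBD staging on workload-1, a scratch RBD PVC pinned to that node can be tried (a sketch; the resource names are made up):

```shell
oc apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-stage-test
  namespace: openshift-storage
spec:
  accessModes: [ReadWriteOnce]
  resources: { requests: { storage: 1Gi } }
  storageClassName: ocs-storagecluster-ceph-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: rbd-stage-test
  namespace: openshift-storage
spec:
  nodeName: workload-1
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
    volumeMounts: [{ name: data, mountPath: /data }]
  volumes:
  - name: data
    persistentVolumeClaim: { claimName: rbd-stage-test }
EOF
oc -n openshift-storage get pod rbd-stage-test -w
```

If this pod also sticks in ContainerCreating with the same Aborted error, the problem is node/network-level RBD staging, not anything NooBaa-specific.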
Questions for Red Hat
Is this NodeStageVolume “operation with the given Volume ID already exists” for the NooBaa DB RBD PVC a known issue in ODF 4.17.x with multus?
Are there additional cleanup steps recommended on the node (e.g. for rbd map devices, CSI state, kubelet) beyond removing the staging dir and restarting the csi-rbdplugin pod?
Are there specific Ceph or Ceph‑CSI bugs in this ODF/OCP combination that match this pattern, and if so is there a fix (updated ceph‑csi/ODF image or configuration)?
Any recommended workaround to unblock NooBaa DB (for example, recreating the volume/image in a supported way, or adjusting CSI/NooBaa configuration), short of disabling NooBaa entirely?
I can provide a must‑gather, oc debug node output, and full csi-rbdplugin + kubelet logs from workload-1 if needed.
Responses