Cannot gain SPM role in RHV Data Center, getting BlockStorageDomainMasterFSCKError
Environment
- Red Hat Virtualization 4.x
Issue
- Data Center is non-operational; cannot gain the SPM role in the RHV Data Center, getting a BlockStorageDomainMasterFSCKError message in vdsm.log:
2020-11-12 00:45:25,937+0300 ERROR (tasks/2) [storage.StoragePool] Unexpected error (sp:389)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 341, in startSpm
    self.masterDomain.mountMaster()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1531, in mountMaster
    raise se.BlockStorageDomainMasterFSCKError(masterfsdev, rc)
BlockStorageDomainMasterFSCKError: BlockSD master file system FSCK error: 'masterfsdev=/dev/cc22f21f-9dab-4450-bc73-b3935dd9bc43/master, rc=8'
2020-11-12 00:45:25,937+0300 ERROR (tasks/2) [storage.StoragePool] failed: BlockSD master file system FSCK error: 'masterfsdev=/dev/cc22f21f-9dab-4450-bc73-b3935dd9bc43/master, rc=8' (sp:390)
2020-11-12 00:45:26,001+0300 ERROR (tasks/2) [storage.TaskManager.Task] (Task='2ce6a969-d6f6-40e8-9b0d-cc8177ea2682') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 341, in startSpm
    self.masterDomain.mountMaster()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1531, in mountMaster
    raise se.BlockStorageDomainMasterFSCKError(masterfsdev, rc)
BlockStorageDomainMasterFSCKError: BlockSD master file system FSCK error: 'masterfsdev=/dev/cc22f21f-9dab-4450-bc73-b3935dd9bc43/master, rc=8'
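The rc value in the error can help gauge what went wrong. As a rough sketch, assuming rc is the exit status of fsck (a bit mask, as described in the fsck(8) man page), a small helper could decode it. The function name decode_fsck_rc is hypothetical and is not part of vdsm.

```shell
#!/bin/sh
# Hypothetical helper (not part of vdsm): decode an fsck exit status.
# Assumption: the rc in the vdsm error is fsck's exit status.
# Bit meanings follow the fsck(8) man page.
decode_fsck_rc() {
    rc=$1
    if [ "$rc" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    # The status is a bit mask, so several conditions can be set at once.
    for bit in 1 2 4 8 16 32 128; do
        if [ $(( rc & bit )) -ne 0 ]; then
            case $bit in
                1)   echo "filesystem errors corrected" ;;
                2)   echo "system should be rebooted" ;;
                4)   echo "filesystem errors left uncorrected" ;;
                8)   echo "operational error" ;;
                16)  echo "usage or syntax error" ;;
                32)  echo "checking canceled by user request" ;;
                128) echo "shared-library error" ;;
            esac
        fi
    done
}

decode_fsck_rc 8    # prints: operational error
```

Under that assumption, rc=8 in the log above would correspond to an operational error reported by fsck.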
Resolution
- The resolution is to manually run fsck on the master metadata.
- On the Admin Portal, put all the hosts in the affected Data Center, except the one running HostedEngine, into Maintenance. Note that VMs might still be running on these hosts; those VMs must be stopped as well. If you are unsure whether any VMs are still up on the hypervisors, you can use the following command to check:
# ps aux|grep qemu-kvm
If no VMs are running, no qemu-kvm processes should show up (other than the grep command itself). If any process shows up, check the name parameter in the output and log in to the VM of the same name to power it off gracefully.
- In a hosted engine setup, set the global HA maintenance mode.
# hosted-engine --set-maintenance --mode=global
- Choose the host running HostedEngine as the host you are going to work on. If not a hosted engine setup, choose any host.
- Shut down all the other hosts (run poweroff on each). This is important: it clears their caches and ensures that this VG is seen and updated on only one server, so that no new inconsistencies are created.
- Power off the HostedEngine VM.
- On the last active host, run the following commands:
# systemctl stop vdsmd supervdsmd
# systemctl status vdsmd supervdsmd # to confirm they are stopped
# fsck -y /dev/<dev_uuid>/master
- After all the errors are fixed, you can start the vdsmd and supervdsmd services and activate the host on the Admin Portal.
- Once the host has gained the SPM role and the Data Center is active again, you can start and activate the other hosts. Note that the host might spend some time in the "Contending" state, depending on the amount of corruption that was present on the device(s). To check the progress, you can tail vdsm's log file:
# tail -f /var/log/vdsm/vdsm.log
- If anything in the above procedure goes wrong, please stop, run log collector and contact Red Hat Support.
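Two of the checks in the procedure above (no VMs left running, vdsmd and supervdsmd stopped before fsck) can be wrapped in small shell helpers. This is only a sketch: the function names running_vm_names and guarded_fsck are hypothetical, and the name parsing assumes VM names without spaces or commas.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative only.

# Print the VM name from every qemu-kvm command line read from stdin.
# Handles both "-name guest=NAME,..." and "-name NAME".
# Assumption: VM names contain no spaces or commas.
running_vm_names() {
    grep 'qemu-kvm' \
        | grep -oE -e '-name [^ ]+' \
        | sed -E -e 's/^-name (guest=)?//' -e 's/,.*$//'
}

# Run fsck on the given device only if vdsmd and supervdsmd are inactive.
guarded_fsck() {
    dev=$1
    for svc in vdsmd supervdsmd; do
        if systemctl is-active --quiet "$svc"; then
            echo "$svc is still active; stop it before running fsck" >&2
            return 1
        fi
    done
    fsck -y "$dev"
}

# Usage (as root, on the last active host):
#   ps -eo args | running_vm_names     # should print nothing
#   guarded_fsck /dev/<dev_uuid>/master
```

Reading the process list from stdin keeps the name parsing separate from ps, and the guard refuses to run fsck while either service is still active.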
Root Cause
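- The master metadata has an ext3 file system on it, which contains VM OVF files and information about the tasks currently running on the storage. When a host takes the SPM role it mounts this file system (mountMaster in the traceback), and the error is raised when the file system check performed at that point fails.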
Diagnostic Steps
- Check /var/log/vdsm/vdsm.log on the host that contends to become SPM.
- The latest, repeating error will be about a master file system error.
- This error also contains the device the master file system is located on (the masterfsdev value).
- This is the device you should run fsck on manually.
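The device lookup described above can be sketched as a short shell function that pulls the masterfsdev value out of the log. The function name master_devices_from_log is hypothetical; the parsing relies only on the message format shown in the Issue section.

```shell
#!/bin/sh
# Hypothetical helper: list each distinct master device named in
# "masterfsdev=..." error messages read from stdin.
master_devices_from_log() {
    grep -oE 'masterfsdev=[^,]+' \
        | cut -d= -f2 \
        | sort -u
}

# Usage:
#   master_devices_from_log < /var/log/vdsm/vdsm.log
```

If more than one device is printed, the log covers more than one master file system, and it is worth confirming which one belongs to the affected Data Center before running fsck.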