Storage - Manual steps required after failure
Hey,
I have two hosts with power management enabled.
Both are connected to a SAN via Fibre Channel.
For testing purposes I simulated a SAN connection failure (I brought down the switch ports for one host).
All VMs on the affected host get paused.
The host enters the non-operational state.
However, no failover (i.e. migration or restart of the VMs) happens.
They just stay paused until I manually unpause them.
Why is it handled that way? Is this the intended way to handle this kind of fault in general?
Is it possible to customize certain actions for certain failures?
Responses
Hi,
I have a problem with Fibre Channel storage: after approximately one week, the storage went into a non-responsive state for no apparent reason.
I have a Fibre Channel data center, and I added the storage directly on the LUN, without creating a VG.
PS: I have an Emulex LPe12002-M8 Fibre Channel adapter on my host.
When the error occurs, the following message appears in the events:
Failed to Reconstruct Master Domain for Data Center FC_TOTEM
and in the engine.log file I have this one:
2013-07-29 10:24:07,122 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-39) [78bfefe1] hostFromVds::selectedVds - tuc, spmStatus Unknown_Pool, storage pool FC_TOTEM
I have already tried running a dd command over the entire LUN, but the result is the same.
Any idea what causes this status change?
Cordially
Anthony
> I have a Fibre Channel data center, and I added the storage directly on the LUN, without creating a VG.
There should be a data storage domain to hold the boot disk image for the VM. LUNs without a VG (direct LUNs) are possible as additional LUNs. Are you talking about direct LUNs?
> Failed to Reconstruct Master Domain for Data Center FC_TOTEM
There are many possible reasons for this. We need to look at engine.log and vdsm.log to understand why it happens. I recommend that you open a case with Red Hat, or find the exact error in the logs and post it here.
Hi,
I'm not talking about direct LUNs, but about the Fibre Channel master data domain in RHEV 3.2, how I added the Fibre Channel storage, and how I cleaned it before adding it. The problem is that when I try to re-activate the storage after it has gone into the non-responsive state, the following messages appear in the Events tab in the admin portal:
Failed to activate Storage Domain STORAGE_VM_FC (Data Center FC_TOTEM) by admin@internal
Failed to Reconstruct Master Domain for Data Center FC_TOTEM.
I also have these messages in the engine.log file after the re-activation failed:
2013-07-30 09:45:05,400 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-46) [2f40bda5] Command ConnectStoragePoolVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=f76622ea-2f04-454e-81ba-58fc9d681fe8, msdUUID=acbbf754-0154-4518-96af-41cf4eb3130d'
2013-07-30 09:45:05,400 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-46) [2f40bda5] Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=f76622ea-2f04-454e-81ba-58fc9d681fe8, msdUUID=acbbf754-0154-4518-96af-41cf4eb3130d'
2013-07-30 09:45:05,658 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-46) [2f40bda5] Command SpmStatusVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-07-30 09:45:05,658 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-46) [2f40bda5] Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-07-30 09:45:05,704 ERROR [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (pool-4-thread-46) [2f40bda5] Command org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.irsbroker.IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-07-30 09:45:05,716 WARN [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-4-thread-48) [6af13bee] CanDoAction of action ReconstructMasterDomain failed. Reasons:VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_STORAGE_DOMAIN_STATUS_ILLEGAL2,$status Locked
2013-07-30 09:45:15,966 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-2) [25416916] Command ConnectStoragePoolVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=f76622ea-2f04-454e-81ba-58fc9d681fe8, msdUUID=acbbf754-0154-4518-96af-41cf4eb3130d'
2013-07-30 09:45:15,967 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-2) [25416916] Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=f76622ea-2f04-454e-81ba-58fc9d681fe8, msdUUID=acbbf754-0154-4518-96af-41cf4eb3130d'
2013-07-30 09:45:16,223 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-2) [25416916] Command SpmStatusVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-07-30 09:45:16,223 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-2) [25416916] Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-07-30 09:45:16,328 INFO [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-4-thread-45) [75829c4e] Running command: ReconstructMasterDomainCommand internal: true. Entities affected : ID: acbbf754-0154-4518-96af-41cf4eb3130d Type: Storage
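For anyone triaging a similar log, here is a minimal Python sketch that groups the ERROR lines above by their final cause and pulls out the spUUID/msdUUID pair. The regular expressions are assumptions based only on the line layout shown in this post, and `summarize_errors` is a hypothetical helper name, not part of RHEV or oVirt.

```python
import re
from collections import Counter

# Assumed engine.log layout, taken from the lines pasted above:
# "<date> <time>,<ms> <LEVEL> [<logger>] (<thread>) [<correlation id>] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+)\s+\[(?P<logger>[^\]]+)\] "
    r"\((?P<thread>[^)]+)\) \[(?P<corr>[^\]]*)\] (?P<msg>.*)$"
)
UUID_RE = re.compile(r"(spUUID|msdUUID)=([0-9a-f-]{36})")
MARKER = "IRSNoMasterDomainException: "


def summarize_errors(lines):
    """Count ERROR lines by final cause and collect the UUIDs they mention."""
    causes = Counter()
    uuids = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") != "ERROR":
            continue
        msg = m.group("msg")
        # The text after the last exception name is the actual cause.
        cause = msg.rsplit(MARKER, 1)[-1]
        # Keep only the part before the first ": " so identical causes group.
        causes[cause.split(": ")[0]] += 1
        for key, value in UUID_RE.findall(msg):
            uuids[key] = value
    return causes, uuids


# Example usage (the path is an assumption; adjust it for your engine host):
# with open("/var/log/ovirt-engine/engine.log") as fh:
#     causes, uuids = summarize_errors(fh)
#     print(causes.most_common(), uuids)
```

Run against the excerpt above, this should make it easier to see that only two distinct causes repeat ("Cannot find master domain" and "Error validating master storage domain"), both pointing at the same msdUUID.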
I followed the steps given in the following procedure:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.2/pdf/Installation_Guide/Red_Hat_Enterprise_Virtualization-3.2-Installation_Guide-en-US.pdf
Especially the "Adding FCP Storage" part, which describes how to add Fibre Channel storage.
It worked correctly at the beginning, but after a while the messages described above appeared.
Cordially
Anthony
Regarding the first command, "multipath -ll", I get the same result on both machines in the Fibre Channel cluster:
Server 1:
]# multipath -ll
200004c7f0dad0001 dm-1 NEC,iStorage 2000
size=532G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 8:0:1:1 sdf 8:80 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 8:0:0:1 sdd 8:48 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 9:0:0:1 sdh 8:112 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 9:0:1:1 sdj 8:144 active ready running
200004c7f0dad0000 dm-0 NEC,iStorage 2000
size=133G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 8:0:1:0 sde 8:64 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 8:0:0:0 sdc 8:32 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 9:0:0:0 sdg 8:96 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 9:0:1:0 sdi 8:128 active ready running
1Dell_Internal_Dual_SD_0123456789AB dm-3 Dell,Internal Dual SD
size=1.9G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 0:0:0:0 sdb 8:16 active ready running
Server 2:
multipath -ll
200004c7f0dad0001 dm-0 NEC,iStorage 2000
size=532G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 0:0:1:1 sdd 8:48 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 0:0:0:1 sdb 8:16 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 1:0:1:1 sdj 8:144 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 1:0:0:1 sdh 8:112 active ready running
200004c7f0dad0000 dm-1 NEC,iStorage 2000
size=133G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 0:0:1:0 sdc 8:32 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 1:0:0:0 sdg 8:96 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 1:0:1:0 sdi 8:128 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 0:0:0:0 sda 8:0 active ready running
36a4badb0177bfc0016c5032b3165723c dm-3 DELL,PERC 6/i
size=837G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 2:2:1:0 sdf 8:80 active ready running
It's a good point, but it seems that I don't have an active/active status.
PS: The LUN that I used as a master storage domain is 200004c7f0dad0001 dm-0 NEC,iStorage 2000
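To compare the two hosts without reading the trees by eye, the output can be parsed with a short Python sketch like the one below. It assumes the exact "multipath -ll" layout pasted above (no aliases, one header line per map), and `parse_multipath` and `path_report` are hypothetical helper names. In the output above each path sits in its own path group with only one group active, which is what one would expect from a failover-style path grouping policy rather than active/active; that reading is an interpretation, not something the output states directly.

```python
import re

# Map header, e.g. "200004c7f0dad0001 dm-1 NEC,iStorage 2000"
MAP_RE = re.compile(r"^(?P<wwid>\S+) (?P<dm>dm-\d+) (?P<vendor>.+)$")
# Path group line, e.g. "|-+- policy='round-robin 0' prio=1 status=active"
GROUP_RE = re.compile(r"-\+- policy='[^']*' prio=\d+ status=(?P<status>\w+)")
# Path line, e.g. "| `- 8:0:1:1 sdf 8:80 active ready running"
PATH_RE = re.compile(
    r"[|`]- (?P<hctl>\d+:\d+:\d+:\d+) (?P<dev>\w+) \d+:\d+\s+(?P<state>.+)$"
)


def parse_multipath(text):
    """Return {wwid: {"dm": ..., "groups": [{"status": ..., "paths": [...]}]}}."""
    maps = {}
    current = None
    for line in text.splitlines():
        path = PATH_RE.search(line)
        group = GROUP_RE.search(line)
        header = MAP_RE.match(line.strip())
        if path and current is not None and current["groups"]:
            current["groups"][-1]["paths"].append({
                "dev": path.group("dev"),
                "hctl": path.group("hctl"),
                "state": path.group("state").split(),
            })
        elif group and current is not None:
            current["groups"].append({"status": group.group("status"), "paths": []})
        elif header:
            current = {"dm": header.group("dm"), "groups": []}
            maps[header.group("wwid")] = current
    return maps


def path_report(maps):
    """Summarize path and path-group health for each multipath map."""
    report = {}
    for wwid, info in maps.items():
        paths = [p for g in info["groups"] for p in g["paths"]]
        healthy = [p for p in paths if p["state"] == ["active", "ready", "running"]]
        report[wwid] = {
            "paths": len(paths),
            "healthy": len(healthy),
            "groups": len(info["groups"]),
            "active_groups": sum(g["status"] == "active" for g in info["groups"]),
        }
    return report


# Example usage on a host (needs root; shown commented out on purpose):
# import subprocess
# out = subprocess.run(["multipath", "-ll"], capture_output=True, text=True).stdout
# print(path_report(parse_multipath(out)))
```

If "healthy" equals "paths" on both hosts, as it appears to here, path failure is less likely to be the cause and the logs deserve the closer look.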
About the "vgs" command, I have the following result on each machine:
Server 1:
vgs
VG                                   #PV #LV #SN Attr   VSize   VFree
93aacc10-ea6b-4b96-aa70-93c9031824be   1   6   0 wz--n- 132,62g 128,75g
VG0                                    1   7   0 wz--n- 135,81g  85,03g
Server 2:
vgs
VG                                   #PV #LV #SN Attr   VSize   VFree
93aacc10-ea6b-4b96-aa70-93c9031824be   1   6   0 wz--n- 132,62g 128,75g
VG0                                    1   7   0 wz--n- 135,19g  76,69g
And as shown in the log message:
"Cannot find master domain: 'spUUID=f76622ea-2f04-454e-81ba-58fc9d681fe8, msdUUID=acbbf754-0154-4518-96af-41cf4eb3130d'"
I can't find a VG named "acbbf754-0154-4518-96af-41cf4eb3130d".
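The comparison above can be scripted. The following is a minimal Python sketch, assuming (as this post does) that a block storage domain shows up on the host as an LVM volume group named after the domain UUID. `vg_names` and `missing_domains` are hypothetical helper names, and the parsing assumes the default "vgs" column layout shown above.

```python
def vg_names(vgs_output):
    """Extract volume group names from plain `vgs` output."""
    names = []
    for line in vgs_output.splitlines():
        fields = line.split()
        # Skip blank lines, the echoed command, and the header row.
        if len(fields) < 7 or fields[0] == "VG":
            continue
        names.append(fields[0])
    return names


def missing_domains(expected_uuids, vgs_output):
    """Return the UUIDs that have no volume group of the same name."""
    present = set(vg_names(vgs_output))
    return [uuid for uuid in expected_uuids if uuid not in present]


# Example usage on a host (needs root; shown commented out on purpose):
# import subprocess
# out = subprocess.run(["vgs"], capture_output=True, text=True).stdout
# print(missing_domains(["acbbf754-0154-4518-96af-41cf4eb3130d"], out))
```

With the output from either server, the msdUUID from the log would be reported as missing, while 93aacc10-ea6b-4b96-aa70-93c9031824be would be found.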
Cordially
Anthony
