Storage - Manual steps required after failure

Hey,

I have two hosts. Power management is enabled.
Both are connected to a SAN via Fibre Channel.

For testing purposes I simulated a SAN connection failure (I brought down the switch ports for one host).
All VMs on the affected host get paused.
The host enters the non-operational state.

However, no failover (i.e. migration or restarting of the VMs) happens.
They just stay paused until I manually unpause them.
Why is it handled that way? Is this the intended way to handle this kind of fault in general?
Is it possible to customize certain actions for certain failures?

Responses

Hi Pär. Sorry you haven't seen a response to this yet. I'll see if I can find someone to help you out with this issue.

Automatic unpausing of VMs paused due to a storage failure is still a work in progress for rhevm-3.3. At this time they need to be manually unpaused.

Hi,

I have a problem with Fibre Channel storage: after approximately one week, the storage went into the non-responsive state for no apparent reason.

I have a Fibre Channel data center, and I added the storage directly on the LUN without creating a VG.

PS: I have an Emulex LPe12002-M8 Fibre Channel adapter in my host.

When the error occurs, I have the following message in the event:

Failed to Reconstruct Master Domain for Data Center FC_TOTEM

and in the engine.log file I have this one:

2013-07-29 10:24:07,122 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-39) [78bfefe1] hostFromVds::selectedVds - tuc, spmStatus Unknown_Pool, storage pool FC_TOTEM

I have already tried running a dd command over the entire LUN, but the result is the same.

Any idea why the status changes?

Cordially

Anthony

> I have a fibre channel datacenter, and the way that I used to add a storage is directly on the LUN without creating a VG.

There should be a data storage domain to hold the boot disk image for the VM. LUNs without a VG (direct LUNs) are possible as additional LUNs. Are you speaking about direct LUNs?

> Failed to Reconstruct Master Domain for Data Center FC_TOTEM

There are a lot of possible reasons for this. We need to look at engine.log and vdsm.log to understand why it happens. I recommend you open a case with Red Hat, or find the exact error in the logs and post it here.

Hi,

I'm not speaking about direct LUNs, but about the Fibre Channel master data domain in RHEV 3.2, how I added the fibre storage, and how I cleaned it before adding it. The problem is that when I try to re-activate the storage after it has gone into the non-responsive state, the following messages appear in the Events tab in the admin portal:

Failed to activate Storage Domain STORAGE_VM_FC (Data Center FC_TOTEM) by admin@internal

Failed to Reconstruct Master Domain for Data Center FC_TOTEM.

I also have these messages in the engine.log file after the re-activation operation has failed:

2013-07-30 09:45:05,400 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-46) [2f40bda5] Command ConnectStoragePoolVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=f76622ea-2f04-454e-81ba-58fc9d681fe8, msdUUID=acbbf754-0154-4518-96af-41cf4eb3130d'
2013-07-30 09:45:05,400 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-46) [2f40bda5] Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=f76622ea-2f04-454e-81ba-58fc9d681fe8, msdUUID=acbbf754-0154-4518-96af-41cf4eb3130d'
2013-07-30 09:45:05,658 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-46) [2f40bda5] Command SpmStatusVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-07-30 09:45:05,658 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-46) [2f40bda5] Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-07-30 09:45:05,704 ERROR [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (pool-4-thread-46) [2f40bda5] Command org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.irsbroker.IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-07-30 09:45:05,716 WARN  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-4-thread-48) [6af13bee] CanDoAction of action ReconstructMasterDomain failed. Reasons:VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_STORAGE_DOMAIN_STATUS_ILLEGAL2,$status Locked
2013-07-30 09:45:15,966 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-2) [25416916] Command ConnectStoragePoolVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=f76622ea-2f04-454e-81ba-58fc9d681fe8, msdUUID=acbbf754-0154-4518-96af-41cf4eb3130d'
2013-07-30 09:45:15,967 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-2) [25416916] Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=f76622ea-2f04-454e-81ba-58fc9d681fe8, msdUUID=acbbf754-0154-4518-96af-41cf4eb3130d'
2013-07-30 09:45:16,223 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-2) [25416916] Command SpmStatusVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-07-30 09:45:16,223 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-2) [25416916] Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-07-30 09:45:16,328 INFO  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-4-thread-45) [75829c4e] Running command: ReconstructMasterDomainCommand internal: true. Entities affected :  ID: acbbf754-0154-4518-96af-41cf4eb3130d Type: Storage
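
The two UUIDs in the "Cannot find master domain" message identify the storage pool (spUUID) and the master storage domain (msdUUID). As a minimal sketch, assuming engine.log lines in the format shown above, the following Python pulls those pairs out so they can be compared with what the hosts actually see. The function name master_domain_ids is hypothetical and not part of any RHEV tooling.

```python
import re

# Matches the pool and master-domain UUIDs in a "Cannot find master domain" message.
MASTER_RE = re.compile(
    r"spUUID=(?P<sp>[0-9a-f-]{36}), msdUUID=(?P<msd>[0-9a-f-]{36})"
)


def master_domain_ids(log_lines):
    """Return the set of (spUUID, msdUUID) pairs found in the given log lines."""
    found = set()
    for line in log_lines:
        match = MASTER_RE.search(line)
        if match:
            found.add((match.group("sp"), match.group("msd")))
    return found


# Two lines shortened from the engine.log excerpt in this thread.
sample = [
    "2013-07-30 09:45:05,400 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] "
    "(pool-4-thread-46) [2f40bda5] Exception: IRSGenericException: IRSErrorException: "
    "IRSNoMasterDomainException: Cannot find master domain: "
    "'spUUID=f76622ea-2f04-454e-81ba-58fc9d681fe8, msdUUID=acbbf754-0154-4518-96af-41cf4eb3130d'",
    "2013-07-30 09:45:05,658 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] "
    "(pool-4-thread-46) [2f40bda5] Exception: IRSGenericException: IRSErrorException: "
    "IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)",
]

for sp_uuid, msd_uuid in sorted(master_domain_ids(sample)):
    print("pool:", sp_uuid, "master domain:", msd_uuid)
```

To run it against a real log, pass an open file object instead of the sample list, for example `master_domain_ids(open("engine.log"))`; the path is whatever applies on your engine host.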

I followed the steps given in the following procedure:

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.2/pdf/Installation_Guide/Red_Hat_Enterprise_Virtualization-3.2-Installation_Guide-en-US.pdf

In particular the "Adding FCP Storage" part, which describes how to add fibre storage.

It works correctly at the beginning, but after a while the messages described above appear.

Cordially

Anthony

Things to check:

- Run "multipath -ll" on the hypervisor and verify that the LUN used to create the master storage domain is showing as active-active.

- Run the "vgs" command on the hypervisor and verify that a VG named "acbbf754-0154-4518-96af-41cf4eb3130d" exists on top of the LUN.

Regarding the first command, "multipath -ll", I get the same result on both of the machines in the fibre cluster:

Server1:

]# multipath -ll
200004c7f0dad0001 dm-1 NEC,iStorage 2000
size=532G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 8:0:1:1 sdf 8:80  active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 8:0:0:1 sdd 8:48  active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 9:0:0:1 sdh 8:112 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 9:0:1:1 sdj 8:144 active ready running
200004c7f0dad0000 dm-0 NEC,iStorage 2000
size=133G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 8:0:1:0 sde 8:64  active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 8:0:0:0 sdc 8:32  active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 9:0:0:0 sdg 8:96  active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 9:0:1:0 sdi 8:128 active ready running
1Dell_Internal_Dual_SD_0123456789AB dm-3 Dell,Internal Dual SD
size=1.9G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 0:0:0:0 sdb 8:16  active ready running

 

Server 2:

multipath -ll
200004c7f0dad0001 dm-0 NEC,iStorage 2000
size=532G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 0:0:1:1 sdd 8:48  active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 0:0:0:1 sdb 8:16  active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 1:0:1:1 sdj 8:144 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 1:0:0:1 sdh 8:112 active ready running
200004c7f0dad0000 dm-1 NEC,iStorage 2000
size=133G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 0:0:1:0 sdc 8:32  active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 1:0:0:0 sdg 8:96  active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 1:0:1:0 sdi 8:128 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 0:0:0:0 sda 8:0   active ready running
36a4badb0177bfc0016c5032b3165723c dm-3 DELL,PERC 6/i
size=837G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 2:2:1:0 sdf 8:80  active ready running

 

It's a good point, but it seems that I don't have an active-active status.

PS: The LUN that I used as a master storage domain is 200004c7f0dad0001 dm-0 NEC,iStorage 2000
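
To make the path layout easier to read, here is a minimal sketch that summarizes "multipath -ll" output per device. It assumes the text layout shown above: a device line containing " dm-", path-group lines containing "policy=" and "status=", and path lines ending in three state words. The function name summarize and the regular expressions are illustrative assumptions, not part of the multipath tools.

```python
import re

# A path line looks like: "| `- 8:0:1:1 sdf 8:80  active ready running"
PATH_RE = re.compile(r"(\d+:\d+:\d+:\d+)\s+(sd[a-z]+)\s+\d+:\d+\s+(\w+)\s+(\w+)\s+(\w+)")
GROUP_RE = re.compile(r"status=(\w+)")


def summarize(multipath_output):
    """Map each device WWID to its path-group statuses and per-path states."""
    devices = {}
    current = None
    for line in multipath_output.splitlines():
        if line and not line[0].isspace() and line[0] not in "|`" and " dm-" in line:
            # Device header line: the first field is the WWID (or alias).
            current = line.split()[0]
            devices[current] = {"groups": [], "paths": []}
        elif current and "policy=" in line:
            match = GROUP_RE.search(line)
            if match:
                devices[current]["groups"].append(match.group(1))
        elif current:
            match = PATH_RE.search(line)
            if match:
                # (device node, then the three state words as printed)
                devices[current]["paths"].append(match.groups()[1:])
    return devices


# First device from the Server1 output in this thread.
SAMPLE = """\
200004c7f0dad0001 dm-1 NEC,iStorage 2000
size=532G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 8:0:1:1 sdf 8:80  active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 8:0:0:1 sdd 8:48  active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 9:0:0:1 sdh 8:112 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 9:0:1:1 sdj 8:144 active ready running
"""

for wwid, info in summarize(SAMPLE).items():
    print(wwid, "groups:", info["groups"])
    for path in info["paths"]:
        print("  ", *path)
```

For the sample, this lists four path groups (one "active", three "enabled") with one path each, and every path reports "active ready running". In other words, each path sits in its own group rather than all four sharing one group.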

About the "vgs" command, I have the following result on each machine:

Server 1:

 vgs
  VG                                   #PV #LV #SN Attr   VSize   VFree  
  93aacc10-ea6b-4b96-aa70-93c9031824be   1   6   0 wz--n- 132,62g 128,75g
  VG0                                    1   7   0 wz--n- 135,81g  85,03g

 

Server 2:

 vgs
  VG                                   #PV #LV #SN Attr   VSize   VFree  
  93aacc10-ea6b-4b96-aa70-93c9031824be   1   6   0 wz--n- 132,62g 128,75g
  VG0                                    1   7   0 wz--n- 135,19g  76,69g

 

And as shown in the log message:

"Cannot find master domain: 'spUUID=f76622ea-2f04-454e-81ba-58fc9d681fe8, msdUUID=acbbf754-0154-4518-96af-41cf4eb3130d'"

I can't find the vg named "acbbf754-0154-4518-96af-41cf4eb3130d"
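
That comparison can also be scripted. As a minimal sketch, assuming the default "vgs" column layout shown above (VG name in the first column, one header row), the following checks whether a VG with the master domain UUID is listed. The helper name vg_names is hypothetical.

```python
def vg_names(vgs_output):
    """Return the VG names from plain `vgs` output (first column, header skipped)."""
    names = []
    for line in vgs_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "VG":
            continue  # skip blank lines and the header row
        names.append(fields[0])
    return names


# msdUUID taken from the "Cannot find master domain" log message.
MSD_UUID = "acbbf754-0154-4518-96af-41cf4eb3130d"

# Output copied from the first `vgs` run in this thread.
SAMPLE_VGS = """\
  VG                                   #PV #LV #SN Attr   VSize   VFree
  93aacc10-ea6b-4b96-aa70-93c9031824be   1   6   0 wz--n- 132,62g 128,75g
  VG0                                    1   7   0 wz--n- 135,81g  85,03g
"""

print("VGs seen:", vg_names(SAMPLE_VGS))
print("master domain VG present:", MSD_UUID in vg_names(SAMPLE_VGS))
```

For the sample, the last line prints False: the only UUID-named VG visible is 93aacc10-ea6b-4b96-aa70-93c9031824be, not the one the engine is looking for.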

Cordially

Anthony
