RHEV 3.0: moving a hypervisor to maintenance mode shuts down its iSCSI sessions


Hi,

 

We are running RHEV 3.0 managing two RHEV-H 6.2 hypervisors in an iSCSI data center. When I moved one hypervisor to maintenance mode, all of its iSCSI sessions were disconnected, which caused the storage domains to lose all their underlying devices on that host.

Can someone help me understand why the iSCSI sessions are dropped when a host goes into maintenance mode?

 

Thanks,

Inbaraj

Responses

A host in maintenance mode is a host you want to be able to take down for service. This means all the VMs will be live migrated away from this host and all storage connections will be disconnected.

 

Storage domains should not be affected, because they are being accessed by the other hypervisors. If the host that was put in maintenance mode was SPM, a new SPM will be elected automatically (this takes a few seconds, so SDs might appear to be down, but this does not really affect running VMs)
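For example, you can see this from the host's shell by listing the active sessions before and after the move (just a rough sketch; run as root on the hypervisor):

# while the host is Up, the sessions to the data-domain targets are listed here
iscsiadm -m session

# move the host to maintenance mode in RHEV-M, then run it again;
# it should now report that there are no active sessions
iscsiadm -m session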

 

Thanks Dan, 

 

But after moving the hypervisor to maintenance mode, I ran an update. The update failed with the message below:

host1 installation failed. The required action is taking longer than allowed by configuration. Verify host networking and storage settings.

 

But when I manually re-discovered and logged in to the iSCSI sessions at the time of the update failure, the host automatically rebooted and came up with the updated version. So is this because the master storage domain's disk was not available during the update, due to the iSCSI sessions being dropped?
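(For reference, bringing the sessions back by hand is roughly along these lines; the portal address below is a placeholder, not my real storage value.)

# re-discover the targets exported by the storage portal (placeholder IP)
iscsiadm -m discovery -t sendtargets -p 10.0.0.10:3260

# log back in to all discovered targets
iscsiadm -m node -L all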

 

-Inbaraj

Your hypervisors boot from iSCSI?

No, Dan. This is a local-boot installation.

 

The scenario is:

I am using the software iSCSI initiator. When I move the hypervisor to maintenance mode, all my iSCSI sessions are dropped, and when I tried the upgrade it failed with this message:

 

host1 installation failed. The required action is taking longer than allowed by configuration. Verify host networking and storage settings.

 

When I checked the hypervisor, all the sessions were dropped and all LV and VG commands failed to execute (they just hang).
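(Side note for anyone reproducing this: the state at that point can be checked without relying on the hung LVM tools; this is only a sketch.)

# confirm that the iSCSI sessions are really gone
iscsiadm -m session

# list the device-mapper devices that may still reference the dropped LUNs
dmsetup info -c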

At this point, when I created the sessions manually, the hypervisor that was being updated rebooted and came up with the upgraded (latest) version.

I am puzzled: had the install already happened, and was it just waiting on storage (the master disk) to update a few parameters? Once storage was up again (after creating the iSCSI sessions manually), it could update them and then reboot.

 

Please guide.

 

One more observation: with FC I don't see any loss of LUN paths, and the upgrade succeeds.

 

-Inbaraj

Hi Dan,

I am facing exactly the same issue as Inbaraj. It looks like a potential issue with RHEV.

Also, I wanted to know if you have any write-up of what happens when you move a hypervisor into maintenance mode.

 

Thanks,

-Ranjan

Hey  Inbaraj,

 

I suspect the RHEV-H 'reboot' operation failed at the time of the upgrade.

 

Are you sure you saw the upgraded version of RHEV-H before the reboot, or only when you manually executed the iSCSI logins?

 

My guess is that this is the scenario where the 'upgrade' reboot failed.

 

How did you check that the upgraded version was the one you were running after the error?
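For example, something like the following from the node's shell would show which build is actually running (assuming you have shell access; the exact package names may differ on RHEV-H):

# base OS / RHEV-H build string
cat /etc/redhat-release

# version of vdsm shipped with the running image
rpm -q vdsm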

 

Please let us know more details on this.

 

Also, logging out of all the sessions when the host enters 'maintenance mode' is expected behaviour.

 

--Humble

Hey Ranjan,

 

Please see my reply to Inbaraj ..

 

--Humble

 

Thanks Humble.

 

I hope you can help me here, as I am stuck with this bug. This time I have all the logs ready. Please look into them and guide me on how to proceed.

 

Here are the steps that I have followed.

 

-created an iSCSI data center with a two-node cluster

-created a master storage domain (iSCSI)

-moved node1 to maintenance mode

-started updating the hypervisor from the RHEV Manager

-update failed

 

Please let me know if you need any other details.

 

Here is the log link:

http://www.yourfilelink.com/get.php?fid=815548&dv=1

 

 

-Inbaraj

So now the update of a host in maintenance mode failed, right? Nothing to do with iSCSI... Maybe we should start a separate thread for this?

So now the update of a host in maintenance mode failed, right?

-yes. 

 

Nothing to do with iSCSI... 

- I don't see this behaviour with FC. As Humble said, and also from what I see in the logs, the iSCSI sessions are terminated when we push the server to maintenance mode; this is expected by design, as per you guys.

At this time, I see the whole LVM layer hanging. During the update, I see the "vdsm update" script running internally, and I see hung_task timeouts for the lvs commands.

 

 

Here is the snippet:

 

Jul 18 10:43:59 ibmx3550-210-89 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 18 10:43:59 ibmx3550-210-89 kernel: lvs           D 0000000000000006     0 13890  13881 0x00000080
Jul 18 10:43:59 ibmx3550-210-89 kernel: ffff8808789a1b18 0000000000000086 0000000000000000 ffff88087620abe0
Jul 18 10:43:59 ibmx3550-210-89 kernel: ffff88087620abe0 ffff880869b24c00 0000000000000001 000000000000000c
Jul 18 10:43:59 ibmx3550-210-89 kernel: ffff88086997fab8 ffff8808789a1fd8 000000000000f4e8 ffff88086997fab8
Jul 18 10:43:59 ibmx3550-210-89 kernel: Call Trace:
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff814ed293>] io_schedule+0x73/0xc0
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff811b19de>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff811b1ece>] __blockdev_direct_IO+0x5e/0xd0
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff811ae7d0>] ? blkdev_get_blocks+0x0/0xc0
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff811af637>] blkdev_direct_IO+0x57/0x60
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff811ae7d0>] ? blkdev_get_blocks+0x0/0xc0
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff8111288b>] generic_file_aio_read+0x6bb/0x700
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff81213851>] ? avc_has_perm+0x71/0x90
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff811763ca>] do_sync_read+0xfa/0x140
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff811aec0c>] ? block_ioctl+0x3c/0x40
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff811892f2>] ? vfs_ioctl+0x22/0xa0
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff81218e3b>] ? selinux_file_permission+0xfb/0x150
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff8120c1e6>] ? security_file_permission+0x16/0x20
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff81176dc5>] vfs_read+0xb5/0x1a0
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff810d4692>] ? audit_syscall_entry+0x272/0x2a0
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff81176f01>] sys_read+0x51/0x90
Jul 18 10:43:59 ibmx3550-210-89 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Jul 18 10:45:59 ibmx3550-210-89 kernel: INFO: task lvs:13890 blocked for more than 120 seconds.
=========================================
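(Side note: while the hang is in progress, the blocked tasks can also be listed directly instead of waiting for the hung_task messages; this is only a sketch, and the SysRq dump needs /proc/sys/kernel/sysrq enabled.)

# list processes stuck in uninterruptible sleep (state D)
ps -eo pid,stat,wchan:30,args | awk '$2 ~ /D/'

# ask the kernel to dump all blocked tasks into the kernel log (SysRq 'w')
echo w > /proc/sysrq-trigger
dmesg | tail -n 100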

 


Hi Inbaraj,

I checked your logs. Looks like there is a problem with vdsm.

 

Dan, Humble? Can you confirm?

 

-Ranjan

OK, looks like this is a case that requires deeper investigation than a simple forum can provide. Please open a support case for this, and provide a full log collector output.
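On RHEV 3.0 the collector can be run on the RHEV-M server roughly like this (assuming the rhevm-log-collector package is installed; if I remember correctly it prompts for the admin credentials):

# run on the RHEV-M server; bundles the engine and hypervisor logs
# into a single archive that can be attached to the support case
rhevm-log-collector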

Hi Inbaraj,

 

Thanks for more information on this.

 

Did you get a chance to check the various LVM commands on your system 'before' the vdsm/RHEV update and 'after' the host moved to maintenance mode? As you said, it seems the LVM layer is hung, which causes the further issues.

 

To check further, we need the status of the LVM commands and other outputs such as "multipath -ll", etc.
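Something along these lines, captured once while the host is still up and once after it enters maintenance mode, would help (only a sketch; the output file names are placeholders):

# capture the storage state into files (placeholder names; adjust as needed)
iscsiadm -m session -P 3 > iscsi-sessions.txt 2>&1
multipath -ll > multipath.txt 2>&1
pvs -v > pvs.txt 2>&1
vgs -v > vgs.txt 2>&1
lvs -v > lvs.txt 2>&1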

 

It is better to open a support case as Dan mentioned..

 

I still have to catch up on the logs, but it is better to move this to a support case and work on it there.

 

--Humble

Hi Ranjan,

 

I still have to catch up with the logs and vdsm source code for further confirmation..

 

Even though I see little chance of a vdsm bug, I cannot be 100% sure.

 

AFAIK, I haven't seen this issue reported by any other customers.

 

Are you also seeing the same behaviour (LVM commands hanging) in your setup?

 

>> Looks like there is a problem with vdsm.

 

Did you notice anything in the logs that makes you think it is a bug in vdsm? If yes, please let me know, so that I can pay more attention to this.

 

--Humble