Virtio-Win block driver with Windows Server 2012 and libvirt

Hello -

I have a Windows Server 2012 virtual machine running as a guest on a RHEL host. This one is in a RHEL KVM environment, managed with virt-manager. RHEV is not installed anywhere at this site. Looking at the Windows Event Viewer, I see occasional warnings with text similar to this:

The IO operation at logical block address 40028 for Disk 2 was retried

Although block number 40028 is commonly referenced, other block numbers range all over the place.  I see an average of maybe a couple of these warnings every day.  There were around 10 on 2/23, none 2/24, 2 on 2/25, 2 so far today. 
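To check whether the retries really cluster on one block, the warnings could be tallied by disk and block address. This is only a minimal sketch: it assumes the warnings have been exported as plain text, one message per line, in the wording quoted above. The export format, the function name `tally_retries`, and the second block number in the sample are made up for illustration.

```python
import re
from collections import Counter

# Matches the warning text quoted above. How the events are exported
# to text (one message per line) is an assumption of this sketch.
RETRY_RE = re.compile(
    r"The IO operation at logical block address (\S+) for Disk (\d+) was retried"
)


def tally_retries(lines):
    """Count retried-IO warnings per (disk number, block address) pair."""
    counts = Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            block, disk = match.group(1), int(match.group(2))
            counts[(disk, block)] += 1
    return counts


# Illustrative input only: block 911344 is invented, not taken from a real log.
sample = [
    "The IO operation at logical block address 40028 for Disk 2 was retried",
    "The IO operation at logical block address 40028 for Disk 2 was retried",
    "The IO operation at logical block address 911344 for Disk 2 was retried",
    "The driver detected that the device has its write cache enabled.",
]

for (disk, block), count in tally_retries(sample).most_common():
    print(f"Disk {disk}, block {block}: {count}")
```

If one block dominates the tally over several days, that would point toward something specific to that region of the logical volume rather than general latency.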

In the old days, before we had software pretending to be hardware, this generally meant a disk may have had some bad blocks.  But now ... ?

Here are the layers at this site:

From the Windows VM point of view, looking at Driver Details for the Red Hat VirtIO SCSI Disk Device, I see two files named disk.sys and partmgr.sys.  These must be standard Microsoft files. 

The viostor.sys driver in C:\Windows\System32 has a modify date of 7/6/2012, File version 60.63.103.2600.  Looking on the RHEL host, rpm -qa | grep virtio shows virtio-win-1.5.3-1.el6_3.noarch.  So I think I have the latest viostor driver inside that VM. 

The host is running RHEL 6.3.  It's a Dell PowerEdge R515 with a PERC H700i RAID controller and five 600 GB 15K RPM SAS drives.  These are RAIDed into one RAID 1 (mirror) set and a 3 drive RAID 5 set.  All storage is local on this host - no SAN or other shared storage here.  This host also has another VM running a firewall based on Fedora.

The virtual disks for this VM are LVM logical volumes. 

I have a couple of other Windows 2012 virtual machines in a RHEV environment at a different site with different hardware and these are not logging similar warnings. 

So my question - should I be concerned about these warnings and how do I find out the cause?  Why does this Windows driver need to retry IO operations sometimes?

Oh yes - from the host point of view -

dmesg | grep sda
dmesg | grep sdb, and
dmesg | grep /dev/mapper

all return nothing. 

thanks

- Greg Scott

Responses

An IO operation can be retried if the first attempt timed out or failed. That can happen because of high storage latency and isn't too uncommon; you would probably see it on systems with very high IO loads and slow disk subsystems. If this is not the case for you, I suspect you are simply seeing minor time drift: while the IO request was in flight, the guest clock drifted a few ms, and that appeared to the disk driver as if the timeout for the request had elapsed.

In any case, there are still no 2012 virtio_blk drivers available, but once they are released, and if you keep seeing these errors, please open a support case, so we can look deeper into the issue. 

Ok.  Thanks Dan.

- Greg

Without really investigating, all I've written is just a well-informed guess, of course. So if you do experience the same issues once win2012 is supported and has a designated set of drivers, please do open a case. It is important for us to catch these things, even if they are minor and not dangerous.

Thanks Dan -

I was out of town yesterday and didn't get a chance to reply here. There are a couple of discussions about NTP and the paravirtualized clock and that got me thinking - I wonder how we would test the theory that clock adjustments inside my VM make it "think" the IOs time out? What if I temporarily turn off NTP for a day or two on both the host and VM? If the theory holds, the clock would not adjust and I would not see any of those retry warnings in the Windows Event Viewer. Then turn NTP back on and let's see if the warnings come back. Thoughts?

Also - did I read there are newer paravirtualized block drivers with RHEL 6.4?

thanks

- Greg

I doubt millisecond delays are what NTP fixes. Last time I saw this issue, I used a C program that would print out the exact millisecond every 10ms, and watched the output for inconsistencies. Imagine how much text browsing that would take.
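A rough way to avoid reading that output by eye is to let a script flag the inconsistencies. The sketch below is not the C program mentioned above. It takes a different approach: it samples the wall clock and a monotonic clock side by side roughly every 10 ms and reports any sample where the two disagree by more than a tolerance. The function names, the 5 ms tolerance, and the sample count are assumptions for illustration. It only catches the wall clock stepping relative to the monotonic clock; if both clocks stall together, it will not notice. On Windows the sleep granularity may also be coarser than 10 ms.

```python
import time


def find_jumps(samples, tolerance=0.005):
    """Return (index, skew) pairs where wall and monotonic deltas disagree.

    samples is a list of (monotonic_seconds, wall_seconds) readings.
    skew is wall-clock delta minus monotonic delta, in seconds.
    """
    jumps = []
    for i in range(1, len(samples)):
        mono_delta = samples[i][0] - samples[i - 1][0]
        wall_delta = samples[i][1] - samples[i - 1][1]
        skew = wall_delta - mono_delta
        if abs(skew) > tolerance:
            jumps.append((i, skew))
    return jumps


def sample_clocks(count=100, interval=0.010):
    """Read both clocks `count` times, sleeping `interval` seconds between reads."""
    readings = []
    for _ in range(count):
        readings.append((time.monotonic(), time.time()))
        time.sleep(interval)
    return readings


if __name__ == "__main__":
    for index, skew in find_jumps(sample_clocks()):
        print(f"sample {index}: wall clock moved {skew * 1000:+.1f} ms "
              "relative to the monotonic clock")
```

Left running in the guest with a much larger sample count, the timestamps of any reported jumps could then be compared against the times of the retry warnings in the Event Viewer.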

So like I said, for now, I think it is safe to ignore those messages, and once 2012 drivers are available, we'd have to open a case to investigate. BTW, I assume this doesn't happen on 2008/2008r2 VMs?

> BTW, I assume this doesn't happen on 2008/2008r2 VMs?

Hmm... Well, let me take a look.  I have a couple of them at different sites in a RHEV 3.0 environment.  I haven't noticed any, but I wasn't looking for it either.  I'll report back what I find.

- Greg

As promised, here is a quick survey of virtual disk warnings on a few other Windows VMs in libvirt/RHEL 6.3, RHEV 3.0, and RHEV 3.1 environments.  Nobody else had the retry warnings, but a few other warnings were common, especially the one about enabling write caching. 

Win 2008 R2, RHEV 3.0

Disk - The driver detected that the device \Device\Harddisk0\DR0 has its write cache enabled. Data corruption may occur.

Viostor -  Reset to device, \Device\RaidPort0, was issued.

 

Win2008R2, libvirt/RHEL6.3

Disk - The driver detected that the device \Device\Harddisk0\DR0 has its write cache enabled. Data corruption may occur.

Win2012, RHEV 3.1

The driver detected that the device \Device\Harddisk0\DR0 has its write cache enabled. Data corruption may occur.

 

OK, a few things:

1. Would you be able to try with RHEL6.4 and latest drivers?

2. You mentioned you have additional 2012 guests that do not show the original issue. Do they have the same storage subsystem?

3. About the "cache enabled" warning, this is just a reminder, nothing to worry about. BTW, is there a chance there is a floppy or CD attached?

4. About the "reset to device" error: if it was happening with a SAN, I would probably start by blaming the SAN. A reset can be triggered by a timeout. Every SRB (SCSI Request Block) issued to the miniport driver carries a timeout value as a parameter. Even though the default timeout value is 30 sec, we have seen this problem on NFS, but never with local storage.

 

Next steps:

1. For #4, I suggest you open a support case and upload the log collector output as well as a dump of the Windows system logs from the affected guests. Once you do that, I'd also like to see the case (post the case number here, so I can join in as well).

I think the VM referenced in #4, the Win2008R2 guest in RHEV 3.0 that logged the device reset, uses an EqualLogic iSCSI SAN for backend storage. I'll have to look into that one a little more, and I've been meaning to upgrade it to RHEV 3.1 anyway. This might have been just a one-time event and I don't have any user complaints, so let's not get too aggressive on it.

On the original post about Windows 2012 and the timeouts, I think I can schedule the update.  I'm in a training class all week so give me a little while to set it up. 

thanks

- Greg