Virtio-Win block driver with Windows Server 2012 and libvirt
Hello -
I have a Windows Server 2012 virtual machine running as a guest on a RHEL host. This one is in a RHEL KVM environment, managed with virt-manager. RHEV is not installed anywhere at this site. Looking at the Windows Event Viewer, I see occasional warnings with text similar to this:
The IO operation at logical block address 40028 for Disk 2 was retried
Although block number 40028 is commonly referenced, other block numbers range all over the place. I see an average of maybe a couple of these warnings every day. There were around 10 on 2/23, none 2/24, 2 on 2/25, 2 so far today.
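To check whether the retries cluster on particular blocks or days, the warnings can be tallied from an exported copy of the event log. This is only a minimal Python sketch. It assumes the events were exported to text with one message per line, each starting with a date in M/D/YYYY form. That export layout and the name `tally_retries` are assumptions for illustration, not something Event Viewer produces by default.

```python
import re
from collections import Counter

# Matches the warning text quoted above. The leading date is an ASSUMED
# export layout (one event per line, date first); adjust to the real export.
RETRY_RE = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}/\d{4})\b.*?"
    r"The IO operation at logical block address (?P<lba>\w+) "
    r"for Disk (?P<disk>\d+) was retried"
)


def tally_retries(lines):
    """Count retry warnings per day and per (disk, LBA) pair."""
    per_day = Counter()
    per_block = Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match is None:
            continue  # not a retry warning, skip it
        per_day[match.group("date")] += 1
        # The LBA is kept as text because it may be printed in decimal or hex.
        per_block[(int(match.group("disk")), match.group("lba"))] += 1
    return per_day, per_block
```

Calling `per_block.most_common(5)` on the result would show whether a block such as 40028 really dominates or whether the retries are spread evenly.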
In the old days, before we had software pretending to be hardware, this generally meant a disk may have had some bad blocks. But now ... ?
Here are the layers at this site:
From the Windows VM point of view, looking at Driver Details for the Red Hat VirtIO SCSI Disk Device, I see two files named disk.sys and partmgr.sys. These must be standard Microsoft files.
The viostor.sys driver in C:\Windows\System32 has a modify date of 7/6/2012, File version 60.63.103.2600. Looking on the RHEL host, rpm -qa | grep virtio shows virtio-win-1.5.3-1.el6_3.noarch. So I think I have the latest viostor driver inside that VM.
The host is running RHEL 6.3. It's a Dell PowerEdge R515 with a PERC H700i RAID controller and five 600 GB 15K RPM SAS drives. These are RAIDed into one RAID 1 (mirror) set and a 3 drive RAID 5 set. All storage is local on this host - no SAN or other shared storage here. This host also has another VM running a firewall based on Fedora.
The virtual disks for this VM are LVM logical volumes.
I have a couple of other Windows 2012 virtual machines in a RHEV environment at a different site with different hardware and these are not logging similar warnings.
So my question - should I be concerned about these warnings and how do I find out the cause? Why does this Windows driver need to retry IO operations sometimes?
Oh yes - from the host point of view -
dmesg | grep sda
dmesg | grep sdb, and
dmesg | grep /dev/mapper
all return nothing.
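The three checks above can also be done in one pass over the kernel log. A small sketch follows, assuming `dmesg` is on the PATH and readable by the current user; the helper names `read_dmesg` and `matching_lines` are made up for this example.

```python
import subprocess


def read_dmesg():
    """Return the kernel ring buffer as text (may need root on some hosts)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout


def matching_lines(log_text, needles):
    """Return the log lines that mention any of the given device names."""
    return [
        line
        for line in log_text.splitlines()
        if any(needle in line for needle in needles)
    ]


# Example use on the host (not run automatically here):
#   hits = matching_lines(read_dmesg(), ["sda", "sdb", "/dev/mapper"])
#   print("\n".join(hits) or "no matching kernel messages")
```

An empty result matches what the separate grep commands show: the host kernel log has no messages mentioning those devices.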
thanks
- Greg Scott
Responses
An IO operation can be retried if the first attempt timed out or failed. That can happen because of high storage latency and isn't too uncommon; you would probably see it on systems with very high IO loads and slow disk subsystems. If that is not the case for you, I suspect you are simply seeing minor time drift: after the IO request was made, the guest clock drifted a few ms, and to the disk driver it looked as if the timeout for the request had elapsed.
In any case, there are still no 2012 virtio_blk drivers available, but once they are released, and if you keep seeing these errors, please open a support case, so we can look deeper into the issue.
Without really investigating, all I've written is just a well-informed guess, of course. So if you do experience the same issues once win2012 is supported and has a designated set of drivers, please do open a case. It is important for us to catch these things, even if they are minor and not dangerous.
I doubt millisecond delays are what NTP fixes. Last time I saw this issue, I used a C program that printed the exact millisecond every 10 ms, and watched the output for inconsistencies. Imagine how much text browsing that would take.
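For anyone who wants to repeat that kind of check without reading pages of output, here is a rough Python sketch of the same idea. It is not the original C program. It samples the clock about every 10 ms and reports only the samples whose spacing looks wrong. The function names and the 5 ms tolerance are assumptions, and sleep granularity differs between operating systems, so the tolerance would need tuning inside the guest.

```python
import time


def find_gaps(timestamps, interval=0.010, tolerance=0.005):
    """Return (index, delta) for consecutive samples whose spacing differs
    from `interval` by more than `tolerance` seconds (jumps or backsteps)."""
    gaps = []
    for i in range(1, len(timestamps)):
        delta = timestamps[i] - timestamps[i - 1]
        if abs(delta - interval) > tolerance:
            gaps.append((i, delta))
    return gaps


def sample_clock(count, interval=0.010, clock=time.time, sleep=time.sleep):
    """Read `clock` `count` times, sleeping `interval` seconds between reads."""
    samples = []
    for _ in range(count):
        samples.append(clock())
        sleep(interval)
    return samples


# Example use inside the guest (about 10 seconds of sampling):
#   for index, delta in find_gaps(sample_clock(1000)):
#       print(f"sample {index}: {delta * 1000:.1f} ms since previous sample")
```

The wall clock (`time.time`) is sampled on purpose, since the point is to catch the guest's clock jumping rather than to measure elapsed time accurately.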
So like I said, for now, I think it is safe to ignore those messages, and once 2012 drivers are available, we'd have to open a case to investigate. BTW, I assume this doesn't happen on 2008/2008r2 VMs?
OK, a few things:
1. Would you be able to try with RHEL6.4 and latest drivers?
2. You mentioned you have additional 2012 guests that do not present the original issue. Do they have the same storage subsystem?
3. About the "cache enabled" warning, this is just a reminder, nothing to worry about. BTW, is there a chance there is a floppy or CD attached?
4. About the "reset to device" error: if it was happening with a SAN, I would probably start blaming the SAN. A reset can be triggered by a timeout. Every SRB (SCSI Request Block) issued to the miniport driver carries a timeout value as a parameter. Even though the default timeout value is 30 sec, we have seen this problem on NFS, but never with local storage.
Next steps:
1. For #4, I suggest you open a support case and upload the log collector output as well as a dump of the Windows system logs from the affected guests. Once you do that, I'd also like to see the case (post a case number here, so I can move in as well).