RHEL 5 virtual machine hangs when doing NFS file copy over virtio-net network interface
Environment
- RHEV hypervisor
- RHEL 5.11 32-bit virtual machine
- kernels
2.6.18-398.el5.i686
and2.6.18-400.1.1.el5.i686
virtio-net
network interface- RHEL 5 VM as NFS Client
Issue
- When I try to write files on a remote NFS filesystem using the VirtIO device, the system hangs up. Doing the same procedure using
e1000
driver, the problem doesn't occur. - New installation of RHEL 5.11 32-bit i686 virtual machine hangs when doing NFS file copy over
virtio-net
network interface
Resolution
Workaround
Use the e1000
virtual interface type, instead of the virtio-net
interface type.
Diagnostic Steps
Time source is kvm-clock
:
$ grep -i time var/log/dmesg
ACPI: PM-Timer IO Port: 0xb008
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
Using local APIC timer interrupts.
WARNING calibrate_APIC_clock: the APIC timer calibration may be wrong.
Real Time Clock Driver v1.12ac
Time: kvm-clock clocksource has been installed.
Tried different clocksources but no difference:
clocksource=acpi_pm
clocksource=pit
clocksource=tsc
Captured vmcore. Logs show both IDE hard drive and network appear to have hung:
hdc: drive_cmd: status=0x41 { DriveReady Error }
hdc: drive_cmd: error=0x04 { AbortedCommand }
ide: failed opcode was: 0xec
nfs: server 10.0.0.144 not responding, still trying
nfs: server 10.0.0.144 not responding, still trying
nfs: server 10.0.0.144 not responding, still trying
INFO: task cp:3652 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cp D 0000008F 1716 3652 3613 (NOTLB)
f79e7dc4 00200086 a636f013 0000008f c041d52b a9124215 0000008f 00000009
f7c6d190 a6428707 0000008f 000b96f4 00000000 f7c6d29c ca00c580 f7c39200
00000184 00000000 f79556cc f79e7e00 0006b0b2 c042e4c3 f79556cc ffffffff
Call Trace:
[<c041d52b>] pvclock_clocksource_read+0x101/0x117
[<c042e4c3>] getnstimeofday+0x30/0xb6
[<c0628148>] io_schedule+0x36/0x59
[<c045bdfa>] sync_page+0x38/0x3b
[<c062831f>] __wait_on_bit+0x33/0x58
[<c045bdc2>] sync_page+0x0/0x3b
[<c045be82>] wait_on_page_bit+0x5b/0x62
[<c0438448>] wake_bit_function+0x0/0x3c
[<c045c816>] wait_on_page_writeback_range+0x4d/0xf6
[<c045cb90>] filemap_write_and_wait+0x1f/0x29
[<f8ded274>] nfs_setattr+0xac/0x111 [nfs]
[<c045ba2f>] file_read_actor+0x0/0xd5
[<c045cfe3>] generic_file_aio_read+0x3d/0x44
[<c047928f>] do_sync_read+0xb6/0xf1
[<c042b2c6>] current_fs_time+0x4a/0x54
[<c04923b5>] notify_change+0x117/0x2ae
[<c047902a>] sys_fchmod+0xa4/0xc2
[<c0404f9b>] syscall_call+0x7/0xb
=======================
Kernel panic - not syncing: hung_task: blocked tasks
IDE Disk is virtual CD-ROm drive, but removing this shows no change.
Using NFS rsize=32768,wsize=32768
shows no change.
vmcore analysis:
- panic context is
khungtaskd
, the other CPU core is idle:
crash> bt -a
PID: 191 TASK: f7f11450 CPU: 0 COMMAND: "khungtaskd"
#0 [f7c76f50] crash_kexec at c0447179
#1 [f7c76f94] panic at c0426641
#2 [f7c76fb0] watchdog at c0451a75
#3 [f7c76fcc] kthread at c0438358
#4 [f7c76fe4] kernel_thread_helper at c0405cd5
PID: 0 TASK: f7c01050 CPU: 1 COMMAND: "swapper"
#0 [f7c9ff08] crash_nmi_callback at c041b3f0
#1 [f7c9ff54] do_nmi at c0406895
#2 [f7c9ff7c] nmi at c0405bad
EAX: 00000000 EBX: 00000001 ECX: c0403c24 EDX: f7c9f000 EBP: 00000000
DS: 007b ESI: 0000e020 ES: 007b EDI: 00000000
CS: 0060 EIP: c0403c55 ERR: 00000000 EFLAGS: 00000246
#3 [f7c9ffb0] default_idle at c0403c55
#4 [f7c9ffb4] cpu_idle at c0403d1a
Change context to the hung task:
crash> ps | grep cp
2538 2462 0 ca3e8690 UN 0.0 4308 700 cp
crash> set 2538
PID: 2538
COMMAND: "cp"
TASK: ca3e8690 [THREAD_INFO: f7a36000]
CPU: 0
STATE: TASK_UNINTERRUPTIBLE
However the backtrace shows the process is now at a different point of execution:
crash> bt
PID: 2538 TASK: ca3e8690 CPU: 0 COMMAND: "cp"
#0 [f7a36e60] schedule at c0627b52
#1 [f7a36ed8] io_schedule at c06281f3
#2 [f7a36ee4] nfs_wait_bit_uninterruptible at f8df08a4 [nfs]
#3 [f7a36ee8] __wait_on_bit at c06283cc
#4 [f7a36f00] out_of_line_wait_on_bit at c0628451
#5 [f7a36f38] nfs_wait_on_request at f8df0898 [nfs]
#6 [f7a36f44] nfs_sync_inode_wait at f8df4c94 [nfs]
#7 [f7a36f88] nfs_do_fsync at f8deb4d3 [nfs]
#8 [f7a36f98] filp_close at c0477c92
#9 [f7a36fa8] sys_close at c0478f07
#10 [f7a36fb8] system_call at c0404f94
EAX: ffffffda EBX: 00000004 ECX: bfbd7858 EDX: 080593a4
DS: 007b ESI: 00000003 ES: 007b EDI: 00000000
SS: 007b ESP: bfbd4388 EBP: bfbd7508
CS: 0073 EIP: 00651402 ERR: 00000006 EFLAGS: 00000246
The hung task timeout may simply be too short.
The log and core contain io_schedule
executing at different points, though the core is executing at a point earlier than the log, so that function may be looping.
Increasing the hung task timeout may relieve the issue:
sysctl -w kernel.hung_task_timeout_secs=240
If not, using the e1000
interface provides a workaround, use this.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
Comments