System crashes with fsf_ops_open [cdr] symbol on backtrace
Environment
- Red Hat Enterprise Linux, Red Hat Gluster Storage
- CommVault cdr module installed and loaded
- Sophos Antivirus (savd), with its talpa modules (talpa_vfshook, talpa_linux) installed and loaded
Issue
- Gluster nodes keep rebooting. One of the nodes crashes as soon as the "gluster" command is run. The following crash may be seen in the kernel log:
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/block/dm-14/dev
CPU 4
Modules linked in: ip6_tables ebtable_nat ebtables fuse cdr(P)(U) softdog talpa_vfshook(U) talpa_pedconnector(U) talpa_pedevice(U) talpa_vcdevice(U) talpa_core(U) talpa_linux(U
) talpa_syscallhook
net eth0: eth0: tq[2] error 0x80000003
vmxnet3 0000:0b:00.0: eth0: resetting
(U) vsock(U) nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_LOG xt_limit iptable_filter ip_tables xfs exportfs dm_multipath vhost_net macvtap macvlan tun ppdev parp
ort_pc parport sg microcode serio_raw vmxnet3 vmware_balloon vmci(U) i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif vmw_pvscsi sr_mod cdrom pata_acpi ata_generic
ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 3028, comm: glusterfsd Tainted: P -- ------------ 2.6.32-573.8.1.el6.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[<ffffffff8105e3fa>] [<ffffffff8105e3fa>] task_rq_lock+0x4a/0xa0
RSP: 0018:ffff88102639fa28 EFLAGS: 00010086
RAX: 3d0000f000250000 RBX: 00000000000159c0 RCX: 0000000000000000
RDX: 0000000000000086 RSI: ffff88102639fa80 RDI: ffffffffa03c2a74
RBP: ffff88102639fa48 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa03c2a74
R13: ffff88102639fa80 R14: 00000000000159c0 R15: 0000000000000003
FS: 00007f45effff700(0000) GS:ffff880062280000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f460537d000 CR3: 0000001023f05000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process glusterfsd (pid: 3028, threadinfo ffff88102639c000, task ffff88101ff76040)
Stack:
ffffffffa03c2a74 0000000025febec8 0000000000000000 0000000000000004
<d> ffff88102639fab8 ffffffff81066f0c 00007f460537d000 ffff881021d10180
<d> ffff880f957b5e08 ffff88101ff76040 ffff88102639fb98 0000000000000086
Call Trace:
[<ffffffffa03c2a74>] ? fsf_ops_open+0x64/0x110 [cdr]
[<ffffffff81066f0c>] try_to_wake_up+0x3c/0x3e0
[<ffffffff810672c2>] default_wake_function+0x12/0x20
[<ffffffff81059939>] __wake_up_common+0x59/0x90
[<ffffffff81059983>] __wake_up_locked+0x13/0x20
[<ffffffff811db98b>] ep_poll_callback+0x7b/0xf0
[<ffffffff81059939>] __wake_up_common+0x59/0x90
[<ffffffff8105e0d3>] __wake_up_sync_key+0x53/0x80
[<ffffffff81459bc4>] sock_def_readable+0x44/0x80
[<ffffffff81505e6b>] unix_stream_sendmsg+0x20b/0x4a0
[<ffffffff8145860b>] sock_aio_write+0x19b/0x1c0
[<ffffffff81137b74>] ? __pagevec_free+0x44/0x90
[<ffffffff811917ba>] do_sync_write+0xfa/0x140
[<ffffffff8103adfe>] ? physflat_send_IPI_mask+0xe/0x10
[<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81158150>] ? unmap_region+0x110/0x130
[<ffffffff811b4580>] ? mntput_no_expire+0x30/0x110
[<ffffffff812316c6>] ? security_file_permission+0x16/0x20
[<ffffffff81191b84>] vfs_write+0x184/0x1a0
[<ffffffff81192fa6>] ? fget_light_pos+0x16/0x50
[<ffffffff811925f1>] sys_write+0x51/0xb0
[<ffffffff811a9fc1>] ? sys_poll+0x71/0x100
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Code: 00 48 c7 c3 c0 59 01 00 49 89 fc 49 89 f5 9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 49 89 55 00 49 8b 44 24 08 49 89 de <8b> 40 18 4c 03 34 c5 a0 e5 c0 81 4c 89 f7 e8 d3 d5 4d 00 49 8b
RIP [<ffffffff8105e3fa>] task_rq_lock+0x4a/0xa0
RSP <ffff88102639fa28>
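When triaging an oops like the one above, it can help to pull out only the frames that belong to loadable modules, since those carry a trailing [module] tag. The following is a minimal sketch, not part of any Red Hat tooling; the function name module_frames and the regular expression are illustrative assumptions based on the log format shown above.

```python
import re

# Matches call-trace lines such as:
#   [<ffffffffa03c2a74>] ? fsf_ops_open+0x64/0x110 [cdr]
# The "?" marks a frame the kernel unwinder considers unreliable.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def module_frames(oops_text):
    """Return (module, symbol, offset, unreliable) for module-owned frames."""
    frames = []
    for line in oops_text.splitlines():
        m = FRAME_RE.search(line)
        if m and m.group("module"):
            frames.append((
                m.group("module"),
                m.group("sym"),
                int(m.group("off"), 16),
                bool(m.group("unreliable")),
            ))
    return frames


# Three lines copied from the call trace above.
sample = """\
Call Trace:
[<ffffffffa03c2a74>] ? fsf_ops_open+0x64/0x110 [cdr]
[<ffffffff81066f0c>] try_to_wake_up+0x3c/0x3e0
[<ffffffff810672c2>] default_wake_function+0x12/0x20
"""

found = module_frames(sample)
print(found)
```

For the sample, the only module-owned frame is fsf_ops_open from cdr, at offset 0x64 (100 decimal), which is the same offset that shows up later as fsf_ops_open+100 in the corrupted memory.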
Resolution
- Disabling Sophos Antivirus (savd) was reported to resolve the problem. You may also want to contact the vendor of the antivirus, cdr, talpa_vfshook, and talpa_linux modules to report the issue and request support (the root cause and diagnostic steps below show the issue in these modules).
Root Cause
- The problem is a runaway recursion that corrupts memory in the Linux kernel, causing unpredictable behaviour and random system crashes. The runaway recursion is caused by recursive calls between the cdr and talpa_vfshook modules.
Diagnostic Steps
- Capture a vmcore from a system which crashed. On a system affected by the issue, one initial clue is that many kernel slabs are corrupted with the same pattern, in a sequential virtual address range:
crash> kmem -s
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
ffff8810261b3180 cdr_ilock_item 32 0 0 0 4k
ffff881026333100 pid_2 128 0 0 0 4k
ffff88101ed830c0 fuse_request 424 0 0 0 4k
...
kmem: xfs_buf: full list: slab: ffff881025f71000 bad prev pointer: 0
kmem: xfs_buf: full list: slab: ffff881025f71000 bad inuse counter: 2688320780
kmem: xfs_buf: full list: slab: ffff881025f71000 bad inuse counter: 2688320780
ffff881025552f40 xfs_buf 384 1868 1880 188 4k
...
ffff881028fd1ac0 TCP 1728 292 308 77 8k
kmem: eventpoll_pwq: full list: slab: ffff881025f81000 bad prev pointer: 0
kmem: eventpoll_pwq: full list: slab: ffff881025f81000 bad inuse counter: 2688320780
kmem: eventpoll_pwq: full list: slab: ffff881025f81000 bad inuse counter: 2688320780
ffff881028fa1a80 eventpoll_pwq 72 0 53 1 4k
...
kmem: sysfs_dir_cache: full list: slab: ffff881025f4d000 bad prev pointer: 0
kmem: sysfs_dir_cache: full list: slab: ffff881025f4d000 bad inuse counter: 2688320780
kmem: sysfs_dir_cache: full list: slab: ffff881025f4d000 bad inuse counter: 2688320780
ffff8810298f1040 sysfs_dir_cache 144 947 1026 38 4k
...
kmem: task_struct: full list: slab: ffff881025fb8000 bad prev pointer: 0
kmem: task_struct: full list: slab: ffff881025fb8000 bad inuse counter: 2688320780
kmem: task_struct: full list: slab: ffff881025fb8000 bad inuse counter: 2688320780
...
- It appears a lot of slabs got corrupted; common to the corruption is the inuse counter == 2688320780.
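To confirm that the same bogus counter really is shared across caches, the kmem warnings can be grouped by value. This is a rough sketch that assumes the output of "kmem -s" was saved to text; the helper name bad_inuse_by_value and the regular expression are illustrative and only cover the "bad inuse counter" message format shown above.

```python
import re
from collections import defaultdict

# Matches warnings such as:
#   kmem: xfs_buf: full list: slab: ffff881025f71000 bad inuse counter: 2688320780
WARN_RE = re.compile(
    r"kmem: (?P<cache>\S+): \w+ list: slab: (?P<slab>[0-9a-f]+) "
    r"bad inuse counter: (?P<value>\d+)"
)


def bad_inuse_by_value(kmem_output):
    """Group (cache, slab address) pairs by the bad inuse value reported."""
    groups = defaultdict(set)
    for line in kmem_output.splitlines():
        m = WARN_RE.search(line)
        if m:
            groups[int(m.group("value"))].add((m.group("cache"), m.group("slab")))
    return dict(groups)


# Lines copied from the kmem output above.
sample = """\
kmem: xfs_buf: full list: slab: ffff881025f71000 bad prev pointer: 0
kmem: xfs_buf: full list: slab: ffff881025f71000 bad inuse counter: 2688320780
kmem: eventpoll_pwq: full list: slab: ffff881025f81000 bad inuse counter: 2688320780
"""

result = bad_inuse_by_value(sample)
for value, slabs in result.items():
    print(value, hex(value), sorted(slabs))
```

A single key covering several unrelated caches is a hint that one writer sprayed the same pattern over a range of memory, rather than each cache being damaged independently.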
- If we take a look at one of the slabs, for example:
crash> slab ffff881025fb8000
struct slab {
list = {
next = 0xffff8810205d16c0,
prev = 0x0
},
colouroff = 18446612201692672392,
s_mem = 0xffff881025fb8088,
inuse = 2688320780, <---- bad counter
free = 4294967295,
nodeid = 0
}
crash> slab -x ffff881025fb8000
struct slab {
list = {
next = 0xffff8810205d16c0,
prev = 0x0
},
colouroff = 0xffff881027473188,
s_mem = 0xffff881025fb8088,
inuse = 0xa03c850c, <---- bad counter in hexadecimal format
free = 0xffffffff,
nodeid = 0x0
}
crash> rd ffff881025fb8000 6
ffff881025fb8000: ffff8810205d16c0 0000000000000000 ..] ............
ffff881025fb8010: ffff881027473188 ffff881025fb8088 .1G'.......%....
ffff881025fb8020: ffffffffa03c850c 502073616c750000 ..<.......ulas P <---- 0xa03c850c is part of a module address/symbol
crash> rd -s ffff881025fb8000 6
ffff881025fb8000: ffff8810205d16c0 0000000000000000
ffff881025fb8010: ffff881027473188 ffff881025fb8088
ffff881025fb8020: cxfs_ops_open+204 502073616c750000 <------- 0xa03c850c is part of symbol cxfs_ops_open+204
- Thus, for some reason, the slab was overwritten with what is possibly a return address from the function cxfs_ops_open:
crash> sym cxfs_ops_open
ffffffffa03c8440 (t) cxfs_ops_open [cdr]
- This is part of the proprietary cdr module which was installed on the system.
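The arithmetic behind the steps above can be checked quickly. The sketch below only reuses values printed by the crash commands in this article; the variable names are illustrative. It shows that the 64-bit word read at slab offset 0x20 splits into the 32-bit inuse and free fields, and that the word is cxfs_ops_open plus 204 bytes.

```python
# Values copied from the crash output above.
INUSE_DECIMAL = 2688320780              # slab.inuse as printed by "slab"
WORD_AT_SLAB_0X20 = 0xffffffffa03c850c  # third row of "rd ffff881025fb8000 6"
CXFS_OPS_OPEN = 0xffffffffa03c8440      # from "sym cxfs_ops_open"

# The word at offset 0x20 covers two adjacent 32-bit fields
# (little-endian): inuse in the low half, free in the high half.
low32 = WORD_AT_SLAB_0X20 & 0xffffffff
high32 = WORD_AT_SLAB_0X20 >> 32

# Distance of the stray value from the start of the function.
offset = WORD_AT_SLAB_0X20 - CXFS_OPS_OPEN

print(hex(INUSE_DECIMAL))      # 0xa03c850c
print(low32 == INUSE_DECIMAL)  # True
print(hex(high32))             # 0xffffffff, i.e. free = 4294967295
print(offset)                  # 204, matching cxfs_ops_open+204
```

So the "bad counter" is not a random number: it is the lower half of a kernel text address inside the cdr module, and the free field next to it holds the upper half.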
- In the vmcore, other CPUs may have other processes/kernel code crashing as well:
PID: 27 TASK: ffff88102904f520 CPU: 0 COMMAND: "events/0"
#3 [ffff88102905bd20] invalid_op at ffffffff8100c01b
[exception RIP: vmxnet3_quiesce_dev+821] <---- BUG_ON on vmxnet driver
PID: 3075 TASK: ffff881025fb8040 CPU: 1 COMMAND: "<C0><FB>G"^P<88><FF><FF><D4>I<EF> ^P<88><FF>" <--- corrupted thread_info ?
[exception RIP: oops_begin+116] <---- oops/crash
PID: 0 TASK: ffff881029bbcab0 CPU: 3 COMMAND: "swapper"
[exception RIP: oops_begin+116] <---- oops/crash
[exception RIP: hrtimer_interrupt+234]
- Which is not surprising given the memory corruption seen in the kmem output above.
- A sequence of symbols can always be seen in the corrupted memory range:
crash> rd -s ffff881025fb8000 40
ffff881025fb8000: ffff8810205d16c0 0000000000000000
ffff881025fb8010: ffff881027473188 ffff881025fb8088
ffff881025fb8020: cxfs_ops_open+204 502073616c750000 <---- cxfs_ops_open+204, symbol from cdr module
ffff881025fb8030: 0000000200000001 616b636fffffffff
ffff881025fb8040: 0000000000000000 ffff881025ff0000
ffff881025fb8050: 0040216000000002 ffff881025fb8088
ffff881025fb8060: ffff881027473188 ffff88102163c000
ffff881025fb8070: 0000000000000000 ffff88102247fbc0
ffff881025fb8080: ffff881027473188 ffff881025fb80c8
ffff881025fb8090: fsf_ops_open+100 0000000000000000 <---- fsf_ops_open+100, symbol from cdr module
ffff881025fb80a0: 0000000062235a68 ffff881020ef49c0
ffff881025fb80b0: ffff881027473188 ffff88102247fbc0
ffff881025fb80c0: ffff881020ef49d4 ffff881025fb8118
ffff881025fb80d0: talpaOpen+271 0000000000000000 <---- symbol from talpa_vfshook module
ffff881025fb80e0: 0000000000000000 0000000000000007
ffff881025fb80f0: ffff88102247fbc0 ffff88102163c000
ffff881025fb8100: ffff8810205d16c0 0000000000000000
ffff881025fb8110: ffff881027473188 ffff881025fb8188
ffff881025fb8120: cxfs_ops_open+204 00000000000000bf <---- ...
ffff881025fb8130: 000000007fea7a23 0000000000000000
- The crashes are the result of something one of these modules (cdr/talpa) is doing. The sequence of symbols in memory above consists of return addresses of functions from the talpa_vfshook and cdr modules, which means something in these modules entered a runaway recursion, overwriting and corrupting a big chunk of memory. The task in the runaway recursion can be identified as the one running on one of the CPUs, and it will usually have its command name corrupted, e.g.:
crash> set -c 1
PID: 3075
COMMAND: "<C0><FB>G"^P<88><FF><FF><D4>I<EF> ^P<88><FF>" <---- command name corrupted, as runaway recursion runs over thread_info structure
TASK: ffff881025fb8040 [THREAD_INFO: ffff881025ff0000]
CPU: 1
STATE: TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE|TASK_STOPPED|TASK_TRACED|TASK_WAKEKILL (ACTIVE)
crash> bt -S 0xffff881025ff2000
PID: 3075 TASK: ffff881025fb8040 CPU: 1 COMMAND: "<C0><FB>G"^P<88><FF><FF><D4>I<EF> ^P<88><FF>"
#0 [ffff881025ff2000] schedule at ffffffff81538500
#1 [ffff881025ff2020] cxfs_ops_open at ffffffffa03c850c [cdr]
#2 [ffff881025ff2090] fsf_ops_open at ffffffffa03c2a74 [cdr]
#3 [ffff881025ff20d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#4 [ffff881025ff2120] cxfs_ops_open at ffffffffa03c850c [cdr]
#5 [ffff881025ff2190] fsf_ops_open at ffffffffa03c2a74 [cdr]
#6 [ffff881025ff21d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#7 [ffff881025ff2220] cxfs_ops_open at ffffffffa03c850c [cdr]
#8 [ffff881025ff2290] fsf_ops_open at ffffffffa03c2a74 [cdr]
#9 [ffff881025ff22d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#10 [ffff881025ff2320] cxfs_ops_open at ffffffffa03c850c [cdr]
#11 [ffff881025ff2390] fsf_ops_open at ffffffffa03c2a74 [cdr]
#12 [ffff881025ff23d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#13 [ffff881025ff2420] cxfs_ops_open at ffffffffa03c850c [cdr]
#14 [ffff881025ff2490] fsf_ops_open at ffffffffa03c2a74 [cdr]
#15 [ffff881025ff24d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#16 [ffff881025ff2520] cxfs_ops_open at ffffffffa03c850c [cdr]
#17 [ffff881025ff2590] fsf_ops_open at ffffffffa03c2a74 [cdr]
#18 [ffff881025ff25d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#19 [ffff881025ff2620] cxfs_ops_open at ffffffffa03c850c [cdr]
#20 [ffff881025ff2690] fsf_ops_open at ffffffffa03c2a74 [cdr]
#21 [ffff881025ff26d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#22 [ffff881025ff2720] cxfs_ops_open at ffffffffa03c850c [cdr]
#23 [ffff881025ff2790] fsf_ops_open at ffffffffa03c2a74 [cdr]
#24 [ffff881025ff27d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#25 [ffff881025ff2820] cxfs_ops_open at ffffffffa03c850c [cdr]
#26 [ffff881025ff2890] fsf_ops_open at ffffffffa03c2a74 [cdr]
#27 [ffff881025ff28d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#28 [ffff881025ff2920] cxfs_ops_open at ffffffffa03c850c [cdr]
#29 [ffff881025ff2990] fsf_ops_open at ffffffffa03c2a74 [cdr]
#30 [ffff881025ff29d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#31 [ffff881025ff2a20] cxfs_ops_open at ffffffffa03c850c [cdr]
#32 [ffff881025ff2a90] fsf_ops_open at ffffffffa03c2a74 [cdr]
#33 [ffff881025ff2ad0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#34 [ffff881025ff2b20] cxfs_ops_open at ffffffffa03c850c [cdr]
#35 [ffff881025ff2b90] fsf_ops_open at ffffffffa03c2a74 [cdr]
#36 [ffff881025ff2bd0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#37 [ffff881025ff2c20] cxfs_ops_open at ffffffffa03c850c [cdr]
#38 [ffff881025ff2c90] fsf_ops_open at ffffffffa03c2a74 [cdr]
#39 [ffff881025ff2cd0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#40 [ffff881025ff2d20] cxfs_ops_open at ffffffffa03c850c [cdr]
#41 [ffff881025ff2d90] fsf_ops_open at ffffffffa03c2a74 [cdr]
#42 [ffff881025ff2dd0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#43 [ffff881025ff2e20] cxfs_ops_open at ffffffffa03c850c [cdr]
#44 [ffff881025ff2e90] fsf_ops_open at ffffffffa03c2a74 [cdr]
#45 [ffff881025ff2ed0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#46 [ffff881025ff2f20] cxfs_ops_open at ffffffffa03c850c [cdr]
#47 [ffff881025ff2f90] fsf_ops_open at ffffffffa03c2a74 [cdr]
#48 [ffff881025ff2fd0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#49 [ffff881025ff3020] cxfs_ops_open at ffffffffa03c850c [cdr]
#50 [ffff881025ff3090] fsf_ops_open at ffffffffa03c2a74 [cdr]
#51 [ffff881025ff30d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#52 [ffff881025ff3120] cxfs_ops_open at ffffffffa03c850c [cdr]
#53 [ffff881025ff3190] fsf_ops_open at ffffffffa03c2a74 [cdr]
#54 [ffff881025ff31d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#55 [ffff881025ff3220] cxfs_ops_open at ffffffffa03c850c [cdr]
#56 [ffff881025ff3290] fsf_ops_open at ffffffffa03c2a74 [cdr]
#57 [ffff881025ff32d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#58 [ffff881025ff3320] cxfs_ops_open at ffffffffa03c850c [cdr]
#59 [ffff881025ff3390] fsf_ops_open at ffffffffa03c2a74 [cdr]
#60 [ffff881025ff33d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#61 [ffff881025ff3420] cxfs_ops_open at ffffffffa03c850c [cdr]
#62 [ffff881025ff3490] fsf_ops_open at ffffffffa03c2a74 [cdr]
#63 [ffff881025ff34d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#64 [ffff881025ff3520] cxfs_ops_open at ffffffffa03c850c [cdr]
#65 [ffff881025ff3590] fsf_ops_open at ffffffffa03c2a74 [cdr]
#66 [ffff881025ff35d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#67 [ffff881025ff3620] cxfs_ops_open at ffffffffa03c850c [cdr]
#68 [ffff881025ff3690] fsf_ops_open at ffffffffa03c2a74 [cdr]
#69 [ffff881025ff36d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#70 [ffff881025ff3720] cxfs_ops_open at ffffffffa03c850c [cdr]
#71 [ffff881025ff3790] fsf_ops_open at ffffffffa03c2a74 [cdr]
#72 [ffff881025ff37d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#73 [ffff881025ff3820] cxfs_ops_open at ffffffffa03c850c [cdr]
#74 [ffff881025ff3890] fsf_ops_open at ffffffffa03c2a74 [cdr]
#75 [ffff881025ff38d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#76 [ffff881025ff3920] cxfs_ops_open at ffffffffa03c850c [cdr]
#77 [ffff881025ff3990] fsf_ops_open at ffffffffa03c2a74 [cdr]
#78 [ffff881025ff39d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#79 [ffff881025ff3a20] __dentry_open at ffffffff8118eaa2
#80 [ffff881025ff3a80] dentry_open at ffffffff8118ed52
#81 [ffff881025ff3ab0] openDentry at ffffffffa0368df3 [talpa_linux]
#82 [ffff881025ff3ae0] examineFile at ffffffffa037ac8b [talpa_core]
#83 [ffff881025ff3b90] examineFileInfo at ffffffffa0374a43 [talpa_core]
#84 [ffff881025ff3be0] talpaOpen at ffffffffa039df06 [talpa_vfshook]
#85 [ffff881025ff3c30] cxfs_ops_open at ffffffffa03c850c [cdr]
#86 [ffff881025ff3ca0] fsf_ops_open at ffffffffa03c2a74 [cdr]
#87 [ffff881025ff3ce0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#88 [ffff881025ff3d30] __dentry_open at ffffffff8118eaa2
#89 [ffff881025ff3d90] nameidata_to_filp at ffffffff8118ee14
#90 [ffff881025ff3db0] do_filp_open at ffffffff811a4c80
#91 [ffff881025ff3f20] do_sys_open at ffffffff8118e847
#92 [ffff881025ff3f70] sys_open at ffffffff8118e950
#93 [ffff881025ff3f80] system_call_fastpath at ffffffff8100b0d2
RIP: 00007f098975101d RSP: 00007f09657f92f0 RFLAGS: 00000293
RAX: 0000000000000002 RBX: ffffffff8100b0d2 RCX: 00007f09657f9cd8
RDX: 00000000000001a4 RSI: 0000000000000042 RDI: 00007f09657f9cb0
RBP: 00007f09657f9c70 R8: 0000000000000002 R9: 0000000000000002
R10: 0000000000000000 R11: 0000000000000297 R12: ffffffff8118e950
R13: ffff881025ff3f78 R14: 00007f09657fadb0 R15: 00007f09657facc3
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
- So the root cause, as shown above, is a runaway recursion between the third-party cdr and talpa modules (cdr, talpa_vfshook, talpa_linux).
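The repeating pattern in the backtrace can also be measured rather than eyeballed. The following sketch assumes the "bt -S" output was saved as text; the function name frame_strides is illustrative. For each (symbol, return address) pair it records the stack distance between consecutive occurrences. The 8 KiB kernel stack size used at the end is an assumption for this x86_64 RHEL 6 kernel and is not taken from the vmcore.

```python
import re

# Matches crash "bt" frame lines such as:
#   #1 [ffff881025ff2020] cxfs_ops_open at ffffffffa03c850c [cdr]
FRAME_RE = re.compile(r"#\d+\s+\[([0-9a-f]+)\]\s+(\w+) at ([0-9a-f]+)")


def frame_strides(bt_text):
    """Map (symbol, return address) to the set of stack distances
    between consecutive occurrences of that pair."""
    last_seen = {}
    strides = {}
    for line in bt_text.splitlines():
        m = FRAME_RE.search(line)
        if not m:
            continue
        stack_addr = int(m.group(1), 16)
        key = (m.group(2), m.group(3))
        if key in last_seen:
            strides.setdefault(key, set()).add(stack_addr - last_seen[key])
        last_seen[key] = stack_addr
    return strides


# Frames #0 to #7 copied from the backtrace above.
sample = """\
#0 [ffff881025ff2000] schedule at ffffffff81538500
#1 [ffff881025ff2020] cxfs_ops_open at ffffffffa03c850c [cdr]
#2 [ffff881025ff2090] fsf_ops_open at ffffffffa03c2a74 [cdr]
#3 [ffff881025ff20d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#4 [ffff881025ff2120] cxfs_ops_open at ffffffffa03c850c [cdr]
#5 [ffff881025ff2190] fsf_ops_open at ffffffffa03c2a74 [cdr]
#6 [ffff881025ff21d0] talpaOpen at ffffffffa039de8f [talpa_vfshook]
#7 [ffff881025ff2220] cxfs_ops_open at ffffffffa03c850c [cdr]
"""

result = frame_strides(sample)
for (sym, ret), deltas in result.items():
    print(sym, ret, sorted(hex(d) for d in deltas))

# Assumption: 8 KiB kernel stack per task on this kernel.
ASSUMED_STACK_SIZE = 8192
CYCLE_BYTES = 0x100  # one cxfs_ops_open -> fsf_ops_open -> talpaOpen cycle
cycles_to_fill = ASSUMED_STACK_SIZE // CYCLE_BYTES
print(cycles_to_fill)  # 32
```

Each cycle of cxfs_ops_open, fsf_ops_open and talpaOpen consumes 0x100 (256) bytes of stack. In the backtrace above, frames #1 to #78 form 26 such cycles, which accounts for 0x1a00 bytes between ffff881025ff2020 and ffff881025ff3a20. Under the 8 KiB assumption, about 32 cycles are enough to consume an entire stack, so an unbounded recursion of this kind quickly writes past it. When run over the full backtrace, the non-recursive segment around frames #79 to #87 will add other distances to the sets.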
