How to find the root cause of a VM hang/crash
Hi All,
Recently, our Linux VMs have been going into a hung state.
I don't see any abnormal entries in /var/log/messages.
VMware analysed the issue and says there is nothing abnormal on their end either.
Please help me set up proactive measures to capture or identify what caused the VM to hang or crash.
If kdump is the way to capture this, can you please tell me which parameters I have to set in sysctl to capture the logs needed to find the root cause?
Logs:
Feb 16 13:46:41 RHELserver01 vgptool[21770]: [UnixLogger:ERROR] [ERROR extension.cpp:259] Error was 2: No such file or directory
Feb 16 13:46:41 RHELserver01 vgptool[21770]: [UnixLogger:ERROR] [ERROR vtlalic.cpp:177] CSE: License file '/var/opt/quest/vgp/gpt/8463B4F7-364B-4498-A327-EFC36732AD92/Machine/VGP/VTLA/Licensing/VAS_license_171-35680' failed installation#012 Error was License file has expired
Feb 16 20:59:34 RHELserver01 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Feb 16 20:59:34 RHELserver01 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1855" x-info="http://www.rsyslog.com"] start
Feb 16 20:59:34 RHELserver01 kernel: Initializing cgroup subsys cpuset
Feb 16 20:59:34 RHELserver01 kernel: Initializing cgroup subsys cpu
Feb 16 20:59:34 RHELserver01 kernel: Linux version 2.6.32-279.el6.x86_64 (mockbuild@x86-008.build.bos.redhat.com) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Wed Jun 13 18:24:36 EDT 2012
Feb 16 20:59:34 RHELserver01 kernel: Command line: ro root=/dev/mapper/VolGroup00-LogVol00 rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto rd_LVM_LV=VolGroup00/LogVol01 rd_LVM_LV=VolGroup00/LogVol00 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet
SAR output:
Linux 2.6.32-279.el6.x86_64 (RHELserver01)  2014-02-16  x86_64  (8 CPU)
00:00:01  CPU   %usr  %nice   %sys  %iowait  %steal   %irq  %soft  %guest   %idle
15:00:01    6   1.13   0.00   0.24     0.00    0.00   0.00   0.01    0.00   98.62
15:00:01    7   0.07   0.00   0.04     0.01    0.00   0.00   0.00    0.00   99.88
Average:  all   2.09   0.00   0.31     0.04    0.00   0.00   0.01    0.00   97.55
Average:    0   0.86   0.00   0.19     0.11    0.00   0.00   0.01    0.00   98.83
Average:    1   4.08   0.00   0.82     0.05    0.00   0.00   0.04    0.00   95.02
Responses
I had bookmarked a link about kdump for "dark" days like this; perhaps it helps:
https://access.redhat.com/site/solutions/6038
Depending on the exact state of the issue, perhaps engaging RH Support will speed the resolution of your issue.
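On the sysctl question specifically, here is a rough sketch of the kind of settings people use so that a hang turns into a panic that kdump can capture. This is my own assumption of a reasonable starting set, not something taken from the linked solution. The function names render_hang_sysctl and has_crashkernel are made up for illustration, and you should confirm that each key exists on your kernel (sysctl -a) before relying on it. The script only prints the settings; it does not apply them.

```shell
#!/bin/sh
# Sketch only: prints candidate sysctl settings; nothing is applied.

# Print settings that turn common "hang" conditions into a kernel panic,
# so that kdump (if configured and running) gets a chance to write a vmcore.
render_hang_sysctl() {
    cat <<'EOF'
# Enable the magic SysRq key (allows a manual crash with sysrq-c)
kernel.sysrq = 1
# Panic on a kernel oops instead of trying to continue
kernel.panic_on_oops = 1
# Panic when a task stays blocked past the hung-task timeout
kernel.hung_task_panic = 1
# Panic when a CPU soft lockup is detected
kernel.softlockup_panic = 1
# Panic on an unknown NMI (only useful if someone can send the guest an NMI)
kernel.unknown_nmi_panic = 1
EOF
}

# Succeed if the given kernel command line reserves crash kernel memory.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

render_hang_sysctl

# Report on the running kernel's command line when it is readable.
if [ -r /proc/cmdline ]; then
    if has_crashkernel "$(cat /proc/cmdline)"; then
        echo "# crashkernel= is present on the kernel command line"
    else
        echo "# crashkernel= is missing; kdump cannot reserve memory"
    fi
fi
```

If the output looks right for your environment, the lines would go into /etc/sysctl.conf and be loaded with sysctl -p. Also check that the kdump service is actually enabled and running (service kdump status, chkconfig kdump on). Be aware that hung_task_panic and softlockup_panic will crash a system that might otherwise have recovered, so try them on a test VM first.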
Hope this helps,
Hello Manikandan Palani,
Are you using Quest authentication on this server? I noticed that the logs you posted seem to complain about an expired license:
'/var/opt/quest/vgp/gpt/8463B4F7-364B-4498-A327-EFC36732AD92/Machine/VGP/VTLA/Licensing/VAS_license_171-35680' failed installation#012 Error was License file has expired
Does the file below exist on the system (the path is taken from the output you posted)? If so, do you have a more current license file? (Does it have the proper permissions, owner, and SELinux context, if you are using SELinux?)
'/var/opt/quest/vgp/gpt/8463B4F7-364B-4498-A327-EFC36732AD92/Machine/VGP/VTLA/Licensing/VAS_license_171-35680'
The product was at one time known as VAS (Vintela Authentication Services); it then changed to Quest, and Quest was later bought by Dell.
Was this server ever joined to the domain with VAS previously? If so, run:
[root@yoursystem ~] # vastool flush ; service vasd restart
Did you install vasgrp, vasclnt and the other Quest RPMs associated with them?
NOTE: before joining, make sure your time source is correct or the join will fail!
[root@yoursystem ~] # service ntpd stop
[root@yoursystem ~] # for i in {1..10};do ntpdate -b name_of_valid_time_server.fully.qualified.domain.name;done
[root@yoursystem ~] # service ntpd start
There is a specific vastool join command, and on rare occasions you have to specify the domain controller (see this link too):
# /opt/quest/bin/vastool -u <domain-admin-user> join <domain-name>
# /opt/quest/bin/vastool -u administrator join -f acme.com
There is also a vasjoin.sh script.
I've never seen an expired license error with VAS (one customer I support has Quest/VAS), at least with the version we are using.
On rare occasions I've seen servers that needed their network restarted, followed by a vastool flush and a restart of vasd, but given the license issue in the output you posted, this may be something else.
If it were not for the license complaint in your log files, I'd recommend verifying that time synchronization is set up well. Do you have ntp properly set up (with servers and peers, see ntp.org)? Perhaps temporarily shut off ntpd, do a time sync, and then restart ntpd. (Is that server's time correct compared with other servers?)
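To make "is the time correct" a bit more concrete, here is a small sketch that checks a measured clock offset against a tolerance. The 300-second limit is the usual Kerberos default for clock skew; whether your VAS/Quest setup uses a different value is something I'm assuming, not something I know. The function name skew_ok is made up for illustration, and getting the offset itself (for example from the output of ntpdate -q, which only queries and does not set the clock) is left to you.

```shell
#!/bin/sh
# Sketch only: compares a clock offset (in seconds) against a tolerance.

# Usual Kerberos default for the maximum allowed clock skew (assumed here).
MAX_SKEW_SECONDS=300

# skew_ok OFFSET [MAX]
# Succeeds if the absolute value of OFFSET is within MAX seconds.
# OFFSET may be negative or fractional, e.g. -0.042.
skew_ok() {
    awk -v off="$1" -v max="${2:-$MAX_SKEW_SECONDS}" 'BEGIN {
        off += 0; max += 0          # force numeric comparison
        if (off < 0) off = -off     # absolute value
        if (off <= max) exit 0
        exit 1
    }'
}
```

For example, "skew_ok -12.3 && echo within tolerance" would print the message, while an offset of 301 seconds would fail the default check.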
Do you have a more current license file? Are any other systems having this issue? If you have support with Dell (they bought Quest), you might want to see what they have to say... https://support.software.dell.com/authentication-services/4.1/
If the system is busy attempting to do something with Kerberos (via VAS/Quest), it will keep nagging VAS (aka Quest), which can cause consternation on the system.
Hope this helps,
Remmele
