Red Hat Insights Blog
Most critical physical systems use multiple network interfaces bonded together to provide redundancy and, depending on the workload, greater network throughput. Bonding can be configured for either purpose depending on the mode specified in the bonding configuration file. Bonding is also easy to misconfigure: the options are case sensitive, so a single miscapitalized value can break the configuration, and a misreading of the documentation can leave you with an incorrect or suboptimal bonding mode. The Red Hat Insights team has identified a number of misconfigurations that can leave your system without the redundancy you expect, or that will degrade network performance when you most need it. We have bundled all of these rules into one Network Bonding topic so that you can easily see whether Insights has detected an issue and, if so, what steps you should take to remediate it. We’ll keep adding rules to the topic as we discover new issues related to network bonding.
Here is the list of rules initially included in the Network Bonding topic:
- Decreased network performance when GRO is not enabled on all parts of bonded interfaces
- Verify EtherChannel configuration
- Upgrade initscripts package
- Bonding might activate incorrect interface
- Bonding negotiation issue
- VLAN tagging failover issue on bonded interface
- Unexpected behavior with bad syntax in bond config
- Decreased network performance when not using a correct LACP hash in network bonding
- Monitoring disabled for network bond
- Failure to generate a vmcore over the network when missing bonding option parameter in kdump configuration file
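For context, bonding on RHEL 6 and 7 is typically configured through an ifcfg file for the bond master, with the mode and link-monitoring options carried in BONDING_OPTS. The fragment below is an illustrative sketch only; the device name and option values are examples, not recommendations, and note that BONDING_OPTS values are case sensitive:

```
# /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative example)
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BOOTPROTO=none
ONBOOT=yes
# "mode=active-backup" provides redundancy; "miimon=100" enables link monitoring
BONDING_OPTS="mode=active-backup miimon=100"
```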
You may have noticed that the interface for Red Hat Insights underwent some changes recently. Our developers have been hard at work to provide a richer, more streamlined experience based on your feedback and recently released some new features. Here is a detailed list of recent Insights UI improvements.
- Introducing Topics - Topics are a new way to present groups of actionable intelligence, providing Insights with additional categories such as SAP, Oracle, kdump, and networking.
- Redesigned Overview - Our overview page provides a glimpse into your infrastructure health, upcoming plans, and system registration. In addition, customers now receive a curated feed from Red Hat product and security teams, late-breaking vulnerabilities, and other Red Hat Insights news.
- Provide Direct Feedback - We truly value your input, and many enhancements are directly related to customer feedback. The ability to quickly and easily provide feedback is now integrated within the interface, giving customers a direct line for feedback regarding any Insights features, suggested improvements, or rules.
- Additional Views On Inventory - It is now possible to select between card or table views for enhanced sorting and filtering of systems, deployments, and assets in the inventory.
- Enhanced Actions Page - In an effort to continue providing quick access to the most critical information in your infrastructure, we have enhanced our Actions interface. New charts provide infrastructure-wide visibility to identify risk based on severity levels.
- Notifications Icon - Receive alerts about systems not checking in, suggested or upcoming plans, or other critical Insights information from the new notifications icon. Click to view, drill down, or dismiss alerts.
If you would like to see the latest features in development, take a look at Insights Beta.

Posted: 2017-01-24T16:21:13+00:00
Red Hat Insights is all about making sure your systems are running as smoothly as possible. Not just for Red Hat applications, but also for your other enterprise apps. We’ve begun developing rules tailored to large enterprise applications that could use the fine-tuning expertise that Red Hat provides. We’ve nailed down the optimal settings required by SAP apps, and now Insights can let you know if those are in place on your systems.
We’ve introduced SAP related rules for alerting you to system configurations which are not up to the specs recommended by either Red Hat or SAP. Having these enterprise apps on systems tailored to their specific needs can be greatly beneficial for the system and more importantly to the clients that have to use them.
We want the apps you entrust to Red Hat Enterprise Linux to be as effective and efficient as you need them to be. These new rules will help you accomplish that goal.
- SAP application incompatibility with installed RHEL Version. SAP applications will encounter compatibility errors when not running on RHEL for SAP. Reference: Overview of Red Hat Enterprise Linux for SAP Business Applications subscription.
- Decreased application performance when not running sap-netweaver tuned profile with SAP applications. Enable the sap-netweaver tuned profile to optimize hosts for SAP applications. Reference: Overview of Red Hat Enterprise Linux for SAP Business Applications subscription.
- Decreased SAP application performance when using incorrect kernel parameters. When SAP's kernel parameter recommendations are not followed, SAP applications will experience decreased performance. Reference: Red Hat Enterprise Linux 6.x: Installation and Upgrade - SAP Note.
- Decreased SAP application performance when file handle limits do not meet SAP requirements. Current file handle limits do not meet the application requirements as defined by SAP, resulting in decreased SAP application performance. Reference: Red Hat Enterprise Linux 7.x: Installation and Upgrade - SAP Note.
- Time discrepancy in SAP applications when not running ntp on SAP servers. SAP strongly recommends running an ntp service on systems running SAP. Reference: Red Hat Enterprise Linux 7.x: Installation and Upgrade - SAP Note.
- Database inconsistencies when UUIDD not running with SAP applications. SAP applications require UUIDD to be installed and running in order to prevent UUIDs from being reused in the application; when UUIDD is not running, database inconsistencies can occur. Reference: Linux UUID solutions - SAP Note.

Posted: 2017-01-06T13:54:30+00:00
The basic timekeeping standard for almost all of the world's local time zones is Coordinated Universal Time (UTC). UTC is derived from International Atomic Time (TAI) and Universal Time (UT1), also known as mean solar time because it tracks the time it takes for the Earth to rotate once on its axis. Because the Earth's rotation varies slightly over time and its mean rotation speed is slowly decreasing, a deviation accumulates between UTC and UT1. When this deviation approaches 0.9 seconds, a leap second is inserted into the UTC time scale, which adjusts UTC to actual Earth rotation.
Leap seconds correct this discontinuity in civil time: the correction is not applied gradually but as a one-second step. Leap seconds present a challenge to computer system timekeeping because standard UNIX time is defined as a set number of seconds since 00:00:00 UTC on 1 January 1970, excluding leap seconds. A system clock cannot recognize 23:59:60 because every minute has only 60 seconds and every day has only 86,400 seconds.
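The exclusion of leap seconds from UNIX time can be demonstrated directly. The Python sketch below (dates chosen for illustration) shows that a simple day count times 86,400 reproduces the POSIX timestamp exactly, and that the inserted second 23:59:60 cannot even be represented:

```python
from datetime import datetime, timezone

# Unix time counts exactly 86400 seconds per day and skips leap seconds,
# so day-count * 86400 reproduces the POSIX timestamp exactly.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
moment = datetime(2017, 1, 1, tzinfo=timezone.utc)
days = (moment - epoch).days
print(days * 86400, int(moment.timestamp()))  # identical, despite real leap seconds

# The inserted second, labeled 23:59:60, cannot be represented at all:
try:
    datetime(2016, 12, 31, 23, 59, 60, tzinfo=timezone.utc)
    representable = True
except ValueError:
    representable = False
print("23:59:60 representable?", representable)
```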
To help you avoid downtime during the leap second insertion on December 31, 2016, Red Hat Insights has recently released a set of rules to detect various leap second issues. Check out the rules below.
- System clock changes instantly in leap second event in NTP system with slew mode configured [ntpd_slew_mode]. Previous versions of NTP (ntp-4.2.6p5-1 and greater) incorrectly changed the system clock instantly during a leap second event, despite configuring ntpd with -x. In certain applications, this could lead to a variety of system clock related problems such as incorrect event sorting or triggering. Reference: Does Red Hat plan to release a ntp package including xleap.patch (Important for slew mode i.e. with -x ntp configuration)?
- System hangs or has high CPU usage during leap second event in NTP systems [leapsec_system_hard_hang]. NTP systems can hang or encounter high CPU usage when a leap second event occurs. References: Systems hang due to leap-second livelock; High CPU usage after inserting leap second.
- System clock inaccurate in leap second event in non-NTP systems following TAI timescale [tzdata_need_upgrade]. Non-NTP RHEL hosts following the TAI timescale can be configured to report time corrected for leap seconds by updating the tzdata package to the latest available version and then using the appropriate 'right' timezone files. Reference: Leap Second queries related to tzdata.
- Ntpd service continues to announce upcoming leap second to clients following leap second insertion [ntpd_not_reset_leap_status]. Due to a known bug in the NTP package, the ntpd service does not reset the leap status after finishing a leap second insertion on the NTP server and continues announcing an upcoming leap second to its clients. Reference: The ntpd leap status is not reset after inserting a leap second.
- Chronyd service in leap smear NTP server has ~50% chance of crash when configured with smoothtime directive and 1-second polling interval [leap_smear_chronyd_crash]. When chronyd is configured with the smoothtime directive and the smoothing process is updated with an extremely small offset, numerical errors in floating-point operations can prevent it from selecting a direction in which the offset should be smoothed out, causing an assertion failure. Reference: Chronyd crashes when performing server leap smear.
To see if you have systems affected by these new rules, check out the Stability category here.

Posted: 2016-12-01T07:50:45+00:00
A system crash can be one of the most frustrating issues that administrators encounter in their day to day work. Crashes often strike without warning, require hard reboots, and can kill processes uncleanly, leaving locked files in place that an admin must go back and manually clean up. These kinds of interruptions can take anywhere from a few minutes to a few hours to overcome. That’s time you could be spending engineering new solutions that change the world, or at the very least, drinking some coffee and catching up on your email.
Insights has your back when it comes to these day-wrecking errors. The Red Hat Knowledgebase is full of incidents that we have diagnosed and tracked down to specific causes. We use this data within Insights to catch these errors before they happen. We see the problems coming so you can take action to fix them before they pull you away from the creative work you’re doing.
- Kernel panic when scanning network for LUNs in bnx2fc driver. When scanning the network for Logical Unit Numbers (LUNs), a kernel panic can occur due to a race condition in the bnx2fc driver. A kernel panic can also occur from scanning for an FCoE-served LUN after an initial LUN scan. Reference: Why "bnx2fc: byte_count = 62 != scsi_bufflen = 0," messages appears while accessing FCOE provisioned LUNS in RHEL6?
- Kernel panic when using TSO in bnx2x driver. When booting a system with TCP segmentation offload (TSO) enabled and the boot parameter '[intel|amd]_iommu=on', a kernel panic occurs during transmission of tunneled TSO packets. Reference: kernel crashes with instruction pointer at "rb_erase+0x1fb/0x390" if [intel|amd]_iommu=on and TSO with bnx2x driver are enabled.
- Kernel panic when using an Emulex network interface with GRO enabled in be2net driver. A kernel panic can occur when systems with an Emulex network interface are using the be2net driver with GRO set to "on." Reference: Kernel panic when using the be2net driver occurs when networking is started.
- Kernel panic after 200+ days of uptime when using the TSC clock source. Early RHEL kernels were susceptible to a system panic after 208.5 days of uptime when the system was using the TSC clock source. Reference: Red Hat Enterprise Linux systems using the TSC clock source reboots or panics when 'sched_clock()' overflows after an uptime of 208.5 days.
- Kernel panic after 200+ days of uptime on certain Xeon CPUs. Intel Xeon E5, E5 v2, and E7 v2 CPUs running certain Red Hat Enterprise Linux kernels are susceptible to a bug that can lead to a system panic based on accumulated uptime. Reference: Servers with Intel® Xeon® Processor E5, Intel® Xeon® Processor E5 v2, or Intel® Xeon® Processor E7 v2 and certain versions of Red Hat Enterprise Linux 6 kernels become unresponsive/hung or incur a kernel panic.
- Kernel panic when using HP ProLiant Gen8 servers with older iLO4 firmware versions. HP ProLiant Gen8-series servers with Integrated Lights-Out 4 (iLO 4) firmware versions 1.30, 1.32, 1.40, and 1.50 have a bug that, when combined with the HP watchdog driver (hpwdt), can cause system panics. Reference: Why does the system crash with HP NMI Watchdog [hpwdt]?
Register your machines now and spend less time fire-fighting crashed systems.

Posted: 2016-11-11T17:54:37+00:00
The only thing worse than a crash is not knowing why it happened. Insights can make sure kdump is there for you.
Recovery is by far the most important first step to take after a system goes down. However, after your systems have recovered, you'll want to perform some level of root cause analysis in order to understand why the crash happened and how to prevent future similar events. This type of analysis is impossible to perform without access to pre-crash system information.
Several weeks ago we published a blog post entitled Disaster Recovery, which outlined how many systems would be unable to properly generate a vmcore at the point of failure. You don’t want to be in this situation, which is why Red Hat Insights has developed rules to make sure that kdump is fully functional. Check out the rules below, which detect common misconfigurations of kdump.
- Failure to generate a vmcore when kdump and HP ASR are both enabled. There is a conflict between the HP Advanced Server Recovery Daemon (ASRD) and the kdump utility which can result in a system restart if vmcore collection exceeds the ASR timeout. Reference: How can I disable Automatic System Recovery (ASR) on HP systems.
- Failure to generate a vmcore when tab characters exist in kdump configuration file. When tab characters exist in the kdump configuration file, kdump fails to generate a vmcore. Reference: Why does kdump not work and I see "msh: can't execute 'makedumpfile': No such file or directory" in the kdump output?
- Failure to generate a vmcore over the network when missing bonding option parameter in kdump configuration file. When there is no bonding option parameter specified in kdump.conf and kdump is configured to dump vmcores over the network, vmcore generation will fail. Reference: kdump doesn't accept module options from ifcfg-* files.
- Failure to generate a vmcore when systems are configured with 250 or more LUNs. When kdump is enabled and the system is configured with 250 or more LUNs, kdump fails to generate a vmcore. Reference: RHEL6: kdump on a system with many LUNs never finishes, or shows "soft lockups".
- Failure to generate a vmcore when inline comments exist in kdump configuration file. When inline comments exist in the kdump configuration file, kdump fails to generate a vmcore. Reference: Red Hat Enterprise Linux kdump not creating a vmcore file when inline comments are used in /etc/kdump.conf.
- Failure to generate a vmcore when Intel IOMMU extensions are enabled. When Intel IOMMU extensions are enabled in the kernel, kdump fails to generate a vmcore. Reference: Cannot collect a vmcore with kdump while Intel IOMMU is enabled.
- Failure to generate a vmcore when using older P410i/P220i controller firmware versions on HP BL460C G7 systems. When using older P410i/P220i controller firmware versions on HP BL460C G7 systems, kdump fails to generate a vmcore. Reference: Why Kdump fails/ hangs on "HP BL460c G7" using "P220i/P410i" controller?
- Failure to generate a vmcore when kdump is configured to save to local storage using HPSA or CCISS drivers. When kdump is configured to save the core dump to local storage using HPSA or CCISS drivers, it fails to generate a vmcore. Reference: Why does kdump fail on HP system using the 'hpsa' driver for storage in Red Hat Enterprise Linux 6?
- Failure to generate a vmcore on CCISS targets when firmware version is prior to v5.06. When kdump is configured to save the core dump to a CCISS target, it fails if the firmware version is prior to v5.06. Reference: kdump failed to dump core file to cciss target running firmware versions lower than v5.06.
- Failure to generate a vmcore when saving to an SSH/NFS resource. When kdump is configured to save a core dump to an SSH/NFS resource using a static IP and no DNS setting, it fails to generate a vmcore. Reference: Kdump over network requires DNS1 variable set when static IP address is used.
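As a sketch of how misconfigurations like tab characters and inline comments can be caught mechanically, the Python fragment below lints kdump.conf-style text. The sample content is invented for illustration, and this is a simplification of what Insights actually checks:

```python
def lint_kdump_conf(text):
    """Flag lines containing tab characters or inline comments, both of which
    are known to break vmcore generation."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if "\t" in line:
            problems.append((lineno, "tab character"))
        if stripped and not stripped.startswith("#") and "#" in stripped:
            problems.append((lineno, "inline comment"))
    return problems

# Invented sample content for illustration
sample = "path /var/crash\ncore_collector makedumpfile -l  # compress\nnet\t192.168.0.1\n"
print(lint_kdump_conf(sample))  # -> [(2, 'inline comment'), (3, 'tab character')]
```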
Register your machines now and always have the necessary information to diagnose a crashed system.

Posted: 2016-11-04T19:46:23+00:00
Stability is one of the most important topics in IT. Even a system with “five 9s” availability (up 99.999% of the time) still carries a chance of disaster. And when disaster strikes, the most important action for an IT team is to perform proper Root Cause Analysis (RCA). Luckily, Red Hat Enterprise Linux includes a feature to help with failed systems.
kdump is a feature of the Linux kernel used to assist with crashed systems. kdump works by booting a second Linux kernel while the main kernel is hung, crashed, or otherwise inoperable. The second kernel dumps the main memory into a file called a vmcore that can later be retrieved and used for RCA.
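At crash time, kdump's behavior is controlled by /etc/kdump.conf (along with the crashkernel= memory reservation on the kernel command line). A minimal illustrative fragment follows; the path and collector options shown are common example values, not a recommendation for any particular system:

```
# /etc/kdump.conf (illustrative example)
# Save vmcores to local disk; makedumpfile's "-d 31" filters out pages
# (zero, cache, free, user) that are rarely needed for root cause analysis.
path /var/crash
core_collector makedumpfile -l --message-level 1 -d 31
```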
Red Hat Insights
The Red Hat Insights team knows that RHEL customers care deeply about properly generating a vmcore at the time of failure. The worst part of a crash is knowing another one is right around the corner, with more downtime and lost revenue. We have tracked a handful of statistics around kdump to better understand its adoption and use among our customers. One notable statistic is the percent of unique systems since 2009 that have enabled kdump.
kdump is a complex program that depends on many pieces of a system to function correctly. This gives kdump a significant margin for errors, bugs, and other issues that can cause kdump to fail. Red Hat Insights has created many rules around kdump to facilitate the proper generation of vmcores.
We have back tested a subset of our kdump rules against historical data to review how many systems would be unable to properly generate a vmcore at the point of failure.
Surprisingly, in previous months we have seen as many as one fifth of all systems unable to properly generate a vmcore. Luckily the percentage of problematic systems decreases over time, but we still expect percentages to hover around 5% (remember, this was tested against only a subset of Red Hat Insights kdump rules, and is therefore an underestimate).
All systems running and testing production programs should have kdump properly configured. Not doing this places unnecessary risk on any company. Red Hat Insights engineers continually review support tickets, historical data, and previous support solutions to better identify what prevents, or could prevent, kdump from generating a vmcore. This gives our customers peace of mind that they will have a vmcore properly generated at the time of failure.
Get started now with Red Hat Insights here.

Posted: 2016-10-14T03:25:04+00:00
Early in my career I was responsible for maintaining build machines for multiple software engineering teams. Those build machines not only built the actual binaries for the product but they also served up critical services leveraged by engineering teams across the company. Whenever we encountered networking issues with those machines, I distinctly remember opening my email inbox and being inundated with emails from coworkers complaining about problems connecting to those services. I had to immediately derail whatever I was working on and go into firefighting mode.
Sound familiar? Most of us probably have similar stories to this and would have preferred knowing the network was at risk before it actually went down. Sound like a pipe dream? This proactive knowledge and readiness is what Red Hat Insights is all about. To prevent you from spending the majority of your working hours firefighting network issues after they happen, Insights has released new rules to detect issues and recommend fixes before they cause network downtime.
For example, one new rule detects a bug that causes a kernel panic when a NIC with a jumbo frame joins an IPv6 multicast group. Wouldn’t you prefer knowing about this issue before your inbox explodes with complaints? Learn about this and other rules we have released that detect these types of issues before they happen.
- FTP/TFTP service failure under high connection volume due to nf_conntrack expectations table being full. When there are huge numbers of client connections to an FTP/TFTP server, many “nf_conntrack: expectation table full” errors occur. Reference: How to fix the 'expectation table full' error.
- Network interface with bnx2x driver intermittently drops connectivity with DCB mode errors. A network interface with the bnx2x driver intermittently drops connectivity when DCB is disabled in hardware but the lldpad service is enabled. Reference: bnx2x displaying DCBX mode errors.
- Device name contains extra whitespace after last double-quote. When the device or hardware name in the configuration file (/etc/sysconfig/network-scripts/ifcfg*) contains a space after the last double-quotation mark ("), it can cause the network interface to become inaccessible and the server to go offline. Reference: udev: renamed network interface eth0 to _eth0_.
- Running ifdown on an alias interface such as eth0:1 removes all IPv4 IP addresses from the primary interface due to a known initscripts bug. When an alias interface is brought down using the ifdown command, the LABEL variable is not set due to a typo in the ifdown-eth script. Consequently, instead of removing just the addresses belonging to the alias interface, all IP addresses on the device are removed. Reference: ifdown on an alias interface removes all IP addresses.
- skb_over_panic happens on systems with jumbo frames and multiple IPv6 addresses. There is a known bug whereby generating an MLD listener report on devices with large MTUs and a high number of IPv6 addresses can trigger a skb_over_panic() event. Reference: skb_over_panic after add_grhead.
- System crashes when “ethtool -S” is executed due to a known bug of the qlcnic driver in RHEL 7. In some versions of the qlcnic driver, an issue exists that can cause it to drop its links and reset when ethtool -S is run against one of its managed devices. This command is executed by sosreport and various other collection or diagnostic tools, so it is possible to see a host start failing devices, returning I/O errors, or exhibiting other unexpected behavior after running such a tool. Reference: RHEL server with qlcnic driver experiences link failures, "LOOP DOWN" messages, path failures, I/O errors, and/or crashes when running sosreport or 'ethtool'.
- ONBOOT=no setting is ignored for interface aliases. When ONBOOT=no is set on alias interfaces (e.g., eth0:1), it has no effect; when the parent interface (e.g., eth0) is brought up, the alias will always come up. Reference: Setting ONBOOT=no is not valid for interfaces aliases.
- Docker containers are allowed to communicate regardless of “--icc=false” due to incorrect value of kernel bridge parameters. The Docker daemon creates iptables rules to prevent containers from communicating with each other when “--icc=false” is set. However, in Red Hat Enterprise Linux the default value of bridge-nf-call-iptables and bridge-nf-call-ip6tables is 0, which means iptables rules do not affect bridge traffic, so Docker containers are allowed to communicate regardless of the “--icc=false” option. Reference: Why does not --icc work well?
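The whitespace-after-quote failure described above is straightforward to detect mechanically. The following Python sketch is a simplified illustration, not the actual Insights rule:

```python
import re

def trailing_junk_after_quote(line):
    """Return True when anything, even a single space, follows the closing
    double quote of a KEY="value" assignment."""
    m = re.match(r'\s*[A-Za-z_][A-Za-z0-9_]*="[^"]*"(.*)$', line.rstrip("\n"))
    return bool(m) and m.group(1) != ""

print(trailing_junk_after_quote('DEVICE="eth0" '))  # True: the stray space corrupts the name
print(trailing_junk_after_quote('DEVICE="eth0"'))   # False
```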
Register your machines now and avoid the future headaches of troubleshooting networking issues.

Posted: 2016-09-23T15:44:13+00:00
Every system administrator knows the feeling of having to wake up in the middle of the night because a server crashed or lost connectivity. This is where Red Hat Insights comes in. Thanks to our expansive knowledge base, the Insights team has been able to identify several critical stability issues that could cause a system outage. Don’t let these issues catch you by surprise. Check out our latest stability rules here!
- The “rpmdbNextIterator” error exists in the rpm command output due to a corrupt RPM database. The rpm -qa command outputs the “rpmdbNextIterator” error in Red Hat Enterprise Linux 6 or 7 when the rpm database is corrupted. Normally, rebuilding the rpm database might fix this issue; however, restoring from a backup might be necessary in some extreme situations. Reference: Why does rpm command show error "rpmdbNextIterator"?
- RHEL 6.6 kernel panics when disconnecting storage due to known bugs. In the Red Hat Enterprise Linux 6.6 kernel, a bug was introduced whereby the removal of a Fibre Channel SCSI host might cause a kernel panic. Reference: RHEL 6.6 kernel panics when disconnecting storage.
- Network stops responding under high connection volume due to the ARP table getting too full. The Red Hat Enterprise Linux 6 or 7 kernel throws "Neighbour table overflow" messages after connecting to or discovering a large number of network hosts, which can lead to the network not responding. "Neighbour table" refers to the ARP cache; the overflow occurs when the ARP table gets too full and at least one new entry is denied because the size limit has been reached. Reference: Kernel throws "Neighbour table overflow" messages after connecting to or discovering a large number of network hosts.
- Unintentional bridge kernel module loading while sosreport is run. When sosreport is run, some kernel modules are loaded as a side effect: the output of sysctl -a changes after a run of sosreport due to the bridge kernel module being loaded. The loading of unwanted kernel modules consumes memory and might cause other issues. Reference: sosreport loads the bridge kernel module unintentionally.
- Network is down when the IP address of the loopback interface is assigned to the same subnet as that of the primary interface. If the IP address of the loopback interface is assigned to the same subnet as that of the primary interface, the network will be unreachable. Reference: IPv4 Connection Unstable.
- HA VMs restart twice in RHEV 3.5 because of a known bug in JSON RPC. In Red Hat Enterprise Virtualization (RHEV) 3.5 compatibility version, hosts newly added to a cluster use json-rpc by default. Due to a bug, json-rpc does not have the "_recovery" flag. When there is a network issue or a hypervisor is not responding, RHEV will try to fence the problematic hypervisor and restart the vdsm process on it; if this is successful, the HA-configured VM running on that hypervisor will be started twice. Reference: Possible duplicate HA VMs during vdsm reinitialize.
- The number of 802.1q VLANs in an OpenStack host may exceed the limitation of 4,096. Red Hat Enterprise Linux supports a maximum of 4,094 802.1q VLANs per host. When hosts in an OpenStack environment get within 90% of this threshold, a warning is shown. Reference: Red Hat Enterprise Linux OpenStack Platform 7 Architecture Guide.
- Unexpected NUMA node mapping when a CPU has more than 256 logical processors. When the BIOS populates the MADT with both x2APIC and local APIC entries (as per the ACPI spec), the kernel builds its processor table in the following order: BSP, x2APIC, local APIC. This results in processors on the same core not being separated by core count. Reference: Bad NUMA cpu numbering on server with more than 256 cpus.
- Unsupported Journal Mode Detected. This rule has been enhanced to now check /etc/fstab for issues that may not be present when running the mount command. Issues may not be apparent while the system is running, but could become a problem after reboot. Checks include looking for device existence and filesystem consistency between the mount command and /etc/fstab. Reference: Why are logs similar to "JBD: Spotted dirty metadata buffer" logged in /var/log/messages?
To see if you have systems affected by these new rules, check out the Stability category here.
As a Red Hat Insights customer, you can register machines now and remember what a good night’s sleep feels like.

Posted: 2016-09-16T20:07:01+00:00
CRC (Cyclic Redundancy Check) is a test to ensure data does not become corrupt when sent across networks or storage devices. The test begins by calculating a check value based on the contents of the data to be sent. The check value is recalculated when the data arrives at its destination; if the recalculated check value differs from the initial check value, the data has been corrupted.
CRC Errors and RHEL
Red Hat Enterprise Linux (RHEL) logs received CRC errors as well as the number of packets successfully transmitted through the network card. By dividing the number of errors by the total number of transactions, we can compute the network card’s CRC error percentage. This lets us review how severely, and how often, CRC errors affect a system. Figure 1 displays how often CRC errors are found on a monthly scale.
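The arithmetic behind the CRC error percentage is simple; the counter values below are invented for illustration (on a live system they would come from interface statistics such as ethtool -S output):

```python
# Invented counter values for illustration
rx_crc_errors = 120
rx_packets = 100_000

# Error percentage = errors / total transactions, expressed as a percent
crc_error_pct = 100.0 * rx_crc_errors / rx_packets
print(f"CRC error percentage: {crc_error_pct:.2f}%")  # 0.12%
```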
Using data to drive Insights
By reviewing historical data, we can measure the distribution of CRC error percentages. Figure 2 below displays the percentile versus CRC error percentage.
This graph shows the severity of CRC error percentages. For example, we can see that if a system has a CRC error percentage of 0.1%, then it is in the 87th percentile among systems encountering CRC errors. Red Hat Insights can use this information to create and improve rules to more accurately identify risk and severity.
To set the most effective rule threshold, we can review the growth of the CRC error percentage per percentile. Figure 3 illustrates that around the 95th percentile there is a large jump in growth. To ensure we inform our customers of a potential issue, we set our rule threshold at a CRC error percentage of 1%, which flags any system at or above roughly the 93rd percentile of CRC error percentages.
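The percentile-based threshold selection described above can be sketched in a few lines of Python. The samples here are synthetic, invented purely for illustration, and the nearest-rank method is one simple choice among several percentile definitions:

```python
def nearest_rank_percentile(data, q):
    """Nearest-rank percentile: the smallest sample with at least q% of the
    data at or below it (q in (0, 100])."""
    s = sorted(data)
    k = -(-q * len(s) // 100)  # ceil(q * n / 100)
    return s[max(int(k), 1) - 1]

# Synthetic CRC error percentages (illustrative only, not Red Hat's data)
samples = [0.01, 0.02, 0.05, 0.08, 0.10, 0.15, 0.30, 0.50, 1.20, 4.00]
print(nearest_rank_percentile(samples, 90))  # 1.2 with this toy data
```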
Red Hat Insights uses the results of this analysis to fine-tune our rules to better serve our customers. By mining historical data, Red Hat Insights is able to track not just how common issues are, but how severe they can become. Using this comprehensive support data, Red Hat Insights is able to provide customers with the best possible prescriptive solutions.

Posted: 2016-08-26T15:24:19+00:00