Red Hat Insights Blog
The Red Hat Insights team is pleased to highlight our first post-Summit 2017 service release of functionality and feature enhancements.
Red Hat Insights is a Software-as-a-Service (SaaS) offering that enables customers to prevent downtime and proactively monitor for infrastructure risk and critical security alerts detected in customer environments, all without requiring added infrastructure. Insights offers automated remediation capabilities via Ansible Playbooks, executive reporting and health scoring, as well as recommended guidance on how to quickly and securely fix any identified issues.
Read below or go check them out and let us know your thoughts with the "Provide Feedback" button.
Incident Detection [Beta Release Pending]
Detecting "incidents" within an infrastructure is a new concept in Red Hat Insights. Previously, Insights proactively detected issues you were at risk of encountering in the future, identifying them early so they could be acted upon before they occurred. This core functionality still exists; however, the Insights engine has been expanded to also detect critical issues we know are impacting your infrastructure at the time of analysis. By highlighting these incidents differently within the UI, we aim to direct immediate attention to them and prioritize them to be addressed quickly, preventing further or impending disruption.
Insights Analysis of OpenShift Infrastructure [Beta Release Pending]
The capabilities of Insights are being expanded to provide analysis of OpenShift infrastructure (masters and nodes).
Global Group Filtering [Beta & Stable]
Global group filters are now available throughout the UI on almost all pages. These filters restrict views to show only the results within a selected group. The selected filter persists as you navigate through Insights until it is reset or another group is chosen.
Additional Page Filtering Capabilities [Beta & Stable]
Additional filtering capabilities have been added to the Actions and Inventory views. Results can now be filtered by System Status (Checking-In or Stale), System Health (Affected or Healthy), and Incidents. Filtering is now designed to provide a consistent user experience on every page within Insights.
Red Hat Insights Blog Subscription [Beta & Stable]
In an effort to keep users up to date with the latest news regarding Red Hat Insights, users are now subscribed to the Red Hat Insights blog. New blog posts are published as new rules or features are added to Insights. Users may manage their own subscriptions to this blog.
Red Hat’s Status Page Integration [Available]
Integration with Red Hat’s Status Page (status.redhat.com) has been completed to provide up-to-date status of Red Hat Insights availability. The status page is used to communicate current outages, known availability issues, and upcoming maintenance windows for Red Hat Insights stable, beta, and the API.
Automatic Stale System Removal [Beta & Stable]
Automatic removal of stale systems helps users focus on the most up-to-date critical actions in their infrastructure without the noise of older stale systems. A “stale” system is one that is no longer checking in with the Insights service daily as expected. Once identified, the UI will highlight the system so action can be taken. After a system has been stale for one month, it will be automatically removed from Insights views.
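The lifecycle above (daily check-in expected, removal after a month of staleness) can be sketched in a few lines. This is an illustration only; the exact cut-offs Insights applies internally are an assumption here, and `classify_system` is a hypothetical helper, not an Insights API.

```python
from datetime import datetime, timedelta

# Assumed thresholds, paraphrased from the post: a check-in is expected
# daily, and a system is dropped after roughly one month of staleness.
CHECKIN_EXPECTED = timedelta(days=1)
REMOVAL_AFTER_STALE = timedelta(days=30)

def classify_system(last_checkin: datetime, now: datetime) -> str:
    """Classify a system as 'checking-in', 'stale', or 'removed'
    based on how long ago it last reported to the service."""
    age = now - last_checkin
    if age <= CHECKIN_EXPECTED:
        return "checking-in"
    if age <= CHECKIN_EXPECTED + REMOVAL_AFTER_STALE:
        return "stale"
    return "removed"
```

For example, a system last seen nine days ago would be flagged as stale, while one silent for two months would already have been removed from the views.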
Executive Reporting Enhancements [Beta]
Executive reporting was added in the April 2017 update of Red Hat Insights, providing users with views of historical trends and snapshots of infrastructure health. We have received multiple requests to enhance the reporting and have added the following features:
Tracking progress and reporting on the number of issues resolved over the past 30 days.
An appendix of all rule hits provides a quick report of issues identified by Insights within an account’s infrastructure and the number of impacted systems.
Overall Score improvements provide additional details on what the score means and how it is calculated, shown on hover. Additionally, the score color changes based on the score’s health.
Export to PDF allows users to save and share their complete executive report. [Coming Soon]
Planner / Playbook Generation Improvements [Beta]
The Planner and Playbook builder UI has been improved to allow for more flexibility when creating new plans or adding to existing ones. Systems can now be added by previously defined group, as individual systems, or all at once. Available actions are now displayed in intelligent views to allow for easier and quicker selection.
Thanks to all those who have helped beta test so far. We're always hard at work adding new features and functionality. Let us know how we can continue to improve Insights.

Posted: 2017-06-07T17:04:51+00:00
Pairing Ansible and Insights may be the smartest thing since putting peanut butter and jelly together. With this partnership, we’ve enabled the ability for you to download playbooks from Insights to solve the problems in your infrastructure. With a few clicks, you can stop worrying, kick back, and bask in the glorious rays of automation.
Our developers have done all the work of creating playbooks for you so that you don’t have to come up with them yourselves. We go through each rule in the Insights database, verify the steps, and create a playbook that deals with that exact problem on your systems. When you create an Insights Plan, all actions with available playbooks will have those playbooks automatically merged together into one single playbook. This makes it incredibly simple to fix many problems on all of your systems quickly and easily.
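Conceptually, merging the per-rule playbooks for a plan amounts to concatenating their plays in order. The sketch below models each playbook as a list of plays (mirroring how an Ansible playbook YAML file is a list of plays); the rule names and tasks are hypothetical, and this is not the actual Insights merge logic.

```python
def merge_playbooks(playbooks):
    """Concatenate the plays from several per-rule playbooks
    into a single combined playbook (a list of plays)."""
    merged = []
    for plays in playbooks:
        merged.extend(plays)
    return merged

# Two hypothetical per-rule playbooks
fix_sysctl = [{"name": "Apply kernel parameters", "hosts": "webservers",
               "tasks": [{"sysctl": {"name": "vm.swappiness", "value": "10"}}]}]
fix_ntp = [{"name": "Ensure ntpd is running", "hosts": "all",
            "tasks": [{"service": {"name": "ntpd", "state": "started"}}]}]

# One plan, one playbook: both fixes run from a single file
plan_playbook = merge_playbooks([fix_sysctl, fix_ntp])
```

Because each play carries its own host pattern, the combined playbook still targets each fix only at the systems selected for that action.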
If you already own Ansible Tower, you can easily take advantage of the Insights integration by selecting the plan you configured in Insights to run automatically as an Ansible playbook. Our playbooks will work in both Tower and Ansible Core, so you can utilize Insights automated remediation no matter what your Ansible infrastructure looks like.
All this functionality is also available over our REST API. Resources for creating maintenance plans, obtaining playbooks, verifying systems’ state, etc. are built-in. Our API Documentation is a good place to learn more about integrating Insights detection and remediation capabilities into external systems or scripts.
Amaze your colleagues and managers with your lightning fast response to critical infrastructure issues thanks to Ansible and Insights!
Stay tuned for Parts 2 and 3 of our Ansible and Insights series for a walkthrough on how to quickly set up remediation with Ansible Core and Ansible Tower.

Posted: 2017-05-15T21:06:53+00:00
Recently we rolled out a couple of new features to help you assess and prioritize your risk: the Likelihood and Impact ratings that you will see assigned to individual Insights rules.
Likelihood is the probability that a system will experience the impact described in the rule. Since we try to detect conditions proactively, before there is an impact, Likelihood is an important factor when prioritizing work. The higher the Likelihood, the more urgent it is to remediate the conditions proactively so you won’t be unexpectedly impacted.
Impact is similarly important for determining risk. If the impact is low, the priority to fix is lower. Consider an intermittent performance degradation, which might be low impact, versus an issue that could eat your data for lunch; data loss would generally be a higher impact.
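To make the combination concrete, here is a small sketch using hypothetical 1-4 scales; Insights does not publish its exact formula, so treat the multiplication and the example rule names below as illustrations only.

```python
def total_risk(likelihood: int, impact: int) -> int:
    """Combine a likelihood and an impact rating into a single
    total-risk score (hypothetical formula for illustration)."""
    return likelihood * impact

# Hypothetical rule hits as (name, likelihood, impact)
rules = [
    ("intermittent performance degradation", 3, 1),
    ("filesystem corruption leading to data loss", 2, 4),
    ("kernel panic under rare workload", 1, 4),
]

# Rank by total risk, highest first, to decide what to fix first
ranked = sorted(rules, key=lambda r: total_risk(r[1], r[2]), reverse=True)
```

Note how the data-loss issue outranks the more likely but low-impact performance problem: a moderate likelihood paired with a severe impact still produces the highest total risk.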
When you combine Likelihood and Impact you get your Total Risk. Insights gives you these three metrics to help you make better decisions about what should be fixed first. One of the main goals of Insights is to give you the information necessary to decide what is the most important and urgent thing to fix in your environment. We strive to help you avoid being impacted by an unplanned outage.

Posted: 2017-04-10T19:25:12+00:00
For many customers, Satellite is a vital part of their infrastructure - distributing and managing package updates, organizing systems, and providing a robust virtualization infrastructure. The overall health of your Satellite system can impact much of your daily workflow within your environment. Issues with Satellite can lead you into digging through log files, googling for answers, or calling support to find the source of the problem. With Insights, you can save multiple hours of troubleshooting time by having the root cause and the solution at your fingertips.
We have bundled these rules into one Satellite topic so that you can easily determine if Insights has detected an issue and what steps you should take to remediate it. And as usual, we’ll keep adding rules to the topic as we discover new issues related to Satellite.
Here is a list of rules initially included in the Satellite topic:
- Failure to synchronize content to Satellite due to deadlock in postgresql when database needs cleaning
- Database deadlock on Satellite server when serving too many connections to postgresql
- Decreased performance when clients with duplicate OSAD IDs connect to the Satellite server
- Newly synced content will not be available to clients due to taskomatic service not running
- Satellite 5 subscription certificate has expired

Posted: 2017-02-27T15:21:28+00:00
Most critical physical systems use multiple network interfaces bonded together to provide redundancy and, depending on the workload, greater network throughput. Bonding can be configured in either manner depending on the mode specified in the bonding configuration file. It is quite common to misconfigure bonding: the configuration is case sensitive, so something might be capitalized that shouldn’t be, or you might have misunderstood the documentation and configured an incorrect or suboptimal bonding mode. The Red Hat Insights team has identified a number of misconfigurations that can leave your system without the redundancy you expect, or that will degrade network performance when you most need it. We have bundled all of these rules into one Network Bonding topic so that you can easily know whether Insights has detected an issue and, if so, what steps you should take to remediate it. We’ll keep adding rules to the topic as we discover new issues related to network bonding.
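The case-sensitivity pitfall mentioned above is easy to screen for yourself. The sketch below checks the `mode=` token of a `BONDING_OPTS` string against the standard Linux bonding mode names; `check_bonding_mode` is a hypothetical helper for illustration, not part of Insights.

```python
# Standard Linux bonding modes, by number and by name. Mode names are
# case sensitive, which is one of the misconfigurations described above.
VALID_MODES = {
    "0", "balance-rr", "1", "active-backup", "2", "balance-xor",
    "3", "broadcast", "4", "802.3ad", "5", "balance-tlb",
    "6", "balance-alb",
}

def check_bonding_mode(bonding_opts: str) -> bool:
    """Return True if the mode= token in a BONDING_OPTS-style
    string names a valid bonding mode."""
    for token in bonding_opts.split():
        if token.startswith("mode="):
            return token.split("=", 1)[1] in VALID_MODES
    return False  # no mode specified at all
```

For example, `mode=802.3ad miimon=100` passes, while `mode=Active-Backup` fails the check because of its capitalization.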
Here is the list of rules initially included in the Network Bonding topic:
- Decreased network performance when GRO is not enabled on all parts of bonded interfaces
- Verify EtherChannel configuration
- Upgrade initscripts package
- Bonding might activate incorrect interface
- Bonding negotiation issue
- VLAN tagging failover issue on bonded interface
- Unexpected behavior with bad syntax in bond config
- Decreased network performance when not using a correct LACP hash in networking bonding
- Monitoring disabled for network bond
- Failure to generate a vmcore over the network when missing bonding option parameter in kdump configuration file
You may have noticed that the interface for Red Hat Insights underwent some changes recently. Our developers have been hard at work to provide a richer, more streamlined experience based on your feedback and recently released some new features. Here is a detailed list of recent Insights UI improvements.
- Introducing Topics - Topics are a new way to present groups of actionable intelligence providing Insights with additional categories such as SAP, Oracle, kdump and networking.
- Redesigned Overview - Our overview page provides a glimpse into your infrastructure health, upcoming plans, system registration. In addition, customers now receive a curated feed from Red Hat product and security teams, late-breaking vulnerabilities, and other Red Hat Insights news.
- Provide Direct Feedback - We truly value your input and many enhancements are directly related to customer feedback. The ability to quickly and easily provide feedback is now integrated within the interface giving customers a direct line for feedback regarding any Insights features, suggested improvements, or rules.
- Additional Views On Inventory - It is now possible to select between card or table views for enhanced sorting & filtering of systems, deployments, and assets in the inventory.
- Enhanced Actions Page - In an effort to continue providing quick access to the most critical information in your infrastructure, we have enhanced our Actions interface. New charts provide more infrastructure-wide visibility to identify risk based on severity levels.
- Notifications Icon - Receive alerts about systems not checking in, suggested or upcoming plans, or other critical Insights information from the new notifications icon. Click to view, drill down, or dismiss alerts.
If you would like to see the latest features in development, take a look at Insights Beta.

Posted: 2017-01-24T16:21:13+00:00
Red Hat Insights is all about making sure your systems are running as smoothly as possible. Not just for Red Hat applications, but also for your other enterprise apps. We’ve begun developing rules tailored to large enterprise applications that could use the fine tuning expertise that Red Hat provides. We’ve nailed down the optimal settings required by SAP apps, and now Insights can let you know if those are in place on your systems.
We’ve introduced SAP related rules for alerting you to system configurations which are not up to the specs recommended by either Red Hat or SAP. Having these enterprise apps on systems tailored to their specific needs can be greatly beneficial for the system and more importantly to the clients that have to use them.
We want the apps you entrust to Red Hat Enterprise Linux to be as effective and efficient as you need them to be. These new rules will help you accomplish that goal.
Here are the new SAP rules:

- SAP application incompatibility with installed RHEL version: SAP applications will encounter compatibility errors when not running on RHEL for SAP. (Reference: Overview of Red Hat Enterprise Linux for SAP Business Applications subscription)
- Decreased application performance when not running sap-netweaver tuned profile with SAP applications: Enable the sap-netweaver tuned profile to optimize hosts for SAP applications. (Reference: Overview of Red Hat Enterprise Linux for SAP Business Applications subscription)
- Decreased SAP application performance when using incorrect kernel parameters: When SAP's kernel parameter recommendations are not followed, SAP applications will experience decreased performance. (Reference: Red Hat Enterprise Linux 6.x: Installation and Upgrade - SAP Note)
- Decreased SAP application performance when file handle limits do not meet SAP requirements: Current file handle limits do not meet the application requirements as defined by SAP, resulting in decreased SAP application performance. (Reference: Red Hat Enterprise Linux 7.x: Installation and Upgrade - SAP Note)
- Time discrepancy in SAP applications when not running ntp on SAP servers: SAP strongly recommends running an ntp service on systems running SAP. (Reference: Red Hat Enterprise Linux 7.x: Installation and Upgrade - SAP Note)
- Database inconsistencies when UUIDD is not running with SAP applications: SAP applications require UUIDD to be installed and running in order to prevent UUIDs from being reused in the application; when UUIDD is not running, database inconsistencies can occur. (Reference: Linux UUID solutions - SAP Note)

Posted: 2017-01-06T13:54:30+00:00
The basic timekeeping standard for almost all of the world's local time zones is Coordinated Universal Time (UTC). UTC is derived from International Atomic Time (TAI) and Universal Time (UT1), also known as mean solar time because it is based on the time it takes the Earth to rotate once on its axis. Because the rotation of the Earth varies a bit over time, and its mean rotation speed is slowly decreasing, a deviation accumulates between UTC and UT1. When this deviation approaches 0.9 seconds, a leap second is inserted into the UTC time scale, adjusting UTC to actual Earth rotation.
Leap seconds correct a discontinuity in civil time: across the correction, time does not increase monotonically but is stepped by one second. Leap seconds present a challenge to computer system timekeeping because standard UNIX time is defined as a set number of seconds since 00:00:00 UTC on 1 January 1970, excluding leap seconds. A system clock cannot represent 23:59:60 because every minute has only 60 seconds and every day has only 86400 seconds.
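You can see this limitation directly from Python's standard library: POSIX time counts 86400-second days, so the second before the December 31, 2016 leap second insertion and the second after it are adjacent epoch values, and 23:59:60 itself is simply unrepresentable.

```python
import time
from datetime import datetime

# 2017-01-01T00:00:00Z as a POSIX timestamp. POSIX time has no slot
# for the leap second inserted at the end of 2016-12-31.
NEW_YEAR_2017 = 1483228800

before = time.gmtime(NEW_YEAR_2017 - 1)  # 2016-12-31 23:59:59 UTC
after = time.gmtime(NEW_YEAR_2017)       # 2017-01-01 00:00:00 UTC

# The leap second itself cannot even be constructed:
try:
    datetime(2016, 12, 31, 23, 59, 60)
except ValueError:
    pass  # seconds must be in the range 0..59
```

This is why the adjustment has to happen in the clock discipline (stepping, slewing, or smearing), which is exactly where the bugs described below live.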
To help you avoid downtime during the leap second insertion on December 31, 2016, Red Hat Insights has recently released a set of rules to detect various leap second issues. Check out the rules below.
Here are the new leap second rules:

- System clock changes instantly in leap second event in NTP system with slew mode configured [ntpd_slew_mode]: Previous versions of NTP (ntp-4.2.6p5-1 and greater) incorrectly changed the system clock instantaneously during a leap second event, despite ntpd being configured with -x. In certain applications, this could lead to a variety of system clock related problems such as incorrect event sorting or triggering. (Reference: Does Red Hat plan to release a ntp package including xleap.patch (Important for slew mode i.e. with -x ntp configuration)?)
- System hangs or has high CPU usage during leap second event in NTP systems [leapsec_system_hard_hang]: NTP systems can hang or encounter high CPU usage when a leap second event occurs. (References: Systems hang due to leap-second livelock; High CPU usage after inserting leap second)
- System clock inaccurate in leap second event in non-NTP systems following TAI timescale [tzdata_need_upgrade]: In a non-NTP RHEL host following the TAI timescale, you can configure the system to report time corrected for leap seconds by updating the tzdata package to the latest available version and then using the appropriate 'right' timezone files. (Reference: Leap Second queries related to tzdata)
- Ntpd service continues to announce upcoming leap second to clients following leap second insertion [ntpd_not_reset_leap_status]: Due to a known bug in the NTP package, the ntpd service does not reset the leap status and continues announcing an upcoming leap second to its clients after finishing a leap second insertion on the NTP server. (Reference: The ntpd leap status is not reset after inserting a leap second)
- Chronyd service in leap smear NTP server has ~50% chance of crash when configured with smoothtime directive and 1-second polling interval [leap_smear_chronyd_crash]: When chronyd is configured with the smoothtime directive and the smoothing process is updated with an extremely small offset, it may be unable to select a direction in which the offset should be smoothed out due to numerical errors in floating-point operations, causing an assertion failure. (Reference: Chronyd crashes when performing server leap smear)
To see if you have systems affected by these new rules, check out the Stability category here.

Posted: 2016-12-01T07:50:45+00:00
A system crash can be one of the most frustrating issues that administrators encounter in their day to day work. Crashes often strike without warning, require hard reboots, and can kill a process uncleanly, leaving various locked files in place that an admin must go back and manually clean up. These kinds of interruptions can take anywhere from a few minutes to a few hours to overcome. That’s time you could be spending engineering new solutions that change the world, or at the very least, drinking some coffee and catching up on your email.
Insights has your back when it comes to these day-wrecking errors. The Red Hat Knowledgebase is full of incidents that we have diagnosed and tracked down to specific causes. We use this data within Insights to catch these errors before they happen. We see the problems coming so you can take action to fix them before they pull you away from the creative work you’re doing.
Here are the new kernel panic rules:

- Kernel panic when scanning network for LUNs in bnx2fc driver: When scanning the network for Logical Unit Numbers (LUNs), a kernel panic can occur due to a race condition in the bnx2fc driver. A kernel panic can also occur from scanning for an FCoE-served LUN after an initial LUN scan. (Reference: Why "bnx2fc: byte_count = 62 != scsi_bufflen = 0," messages appears while accessing FCOE provisioned LUNS in RHEL6?)
- Kernel panic when using TSO in bnx2x driver: When booting a system with TCP segmentation offload (TSO) enabled and the boot parameters '[intel|amd]_iommu=on', a kernel panic occurs during transmission of tunneled TSO packets. (Reference: kernel crashes with instruction pointer at "rb_erase+0x1fb/0x390" if [intel|amd]_iommu=on and TSO with bnx2x driver are enabled)
- Kernel panic when using an Emulex network interface with GRO enabled in be2net driver: A kernel panic can occur when systems with an Emulex network interface are using the be2net driver with GRO set to "on." (Reference: Kernel panic when using the be2net driver occurs when networking is started)
- Kernel panic after 200+ days of uptime when using the TSC clock source: Early RHEL kernels were susceptible to a system panic after 208.5 days of uptime when the system was using the TSC clock source. (Reference: Red Hat Enterprise Linux systems using the TSC clock source reboots or panics when 'sched_clock()' overflows after an uptime of 208.5 days)
- Kernel panic after 200+ days of uptime on certain Xeon CPUs: Intel Xeon E5, E5 v2, and E7 v2 CPUs running certain Red Hat Enterprise Linux kernels are susceptible to a bug that can lead to a system panic based on accumulated uptime. (Reference: Servers with Intel® Xeon® Processor E5, Intel® Xeon® Processor E5 v2, or Intel® Xeon® Processor E7 v2 and certain versions of Red Hat Enterprise Linux 6 kernels become unresponsive/hung or incur a kernel panic)
- Kernel panic when using HP ProLiant Gen8 servers with older iLO 4 firmware versions: HP ProLiant Gen8-series servers with Integrated Lights-Out 4 (iLO 4) firmware versions 1.30, 1.32, 1.40, and 1.50 have a bug that, when combined with the HP watchdog driver (hpwdt), can cause system panics. (Reference: Why does the system crash with HP NMI Watchdog [hpwdt]?)
Register your machines now and spend less time fire-fighting crashed systems.

Posted: 2016-11-11T17:54:37+00:00
The only thing worse than a crash is not knowing why it happened. Insights can make sure kdump is there for you.
Recovery is by far the most important first step to take after a system goes down. However, after your systems have recovered, you'll want to perform some level of root cause analysis in order to understand why the crash happened and how to prevent future similar events. This type of analysis is impossible to perform without access to pre-crash system information.
Several weeks ago we published a blog post entitled Disaster Recovery, which outlined how many systems would be unable to properly generate a vmcore at the point of failure. You don’t want to be in this situation, which is why Red Hat Insights has developed rules to make sure that kdump is fully functional. Check out the rules below, which detect common misconfigurations of kdump.
Here are the new kdump rules:

- Failure to generate a vmcore when kdump and HP ASR are both enabled: There is a conflict between the HP Advanced Server Recovery Daemon (ASRD) and the kdump utility which can result in a system restart if vmcore collection exceeds the ASR timeout. (Reference: How can I disable Automatic System Recovery (ASR) on HP systems)
- Failure to generate a vmcore when tab characters exist in kdump configuration file: When tab characters exist in the kdump configuration file, kdump fails to generate a vmcore. (Reference: Why does kdump not work and I see "msh: can't execute 'makedumpfile': No such file or directory" in the kdump output?)
- Failure to generate a vmcore over the network when missing bonding option parameter in kdump configuration file: When there is no bonding option parameter specified in kdump.conf and kdump is configured to dump vmcores over the network, vmcore generation will fail. (Reference: kdump doesn't accept module options from ifcfg-* files)
- Failure to generate a vmcore when systems are configured with 250 or more LUNs: When kdump is enabled and the system is configured with 250 or more LUNs, kdump fails to generate a vmcore. (Reference: RHEL6: kdump on a system with many LUNs never finishes, or shows "soft lockups")
- Failure to generate a vmcore when inline comments exist in kdump configuration file: When inline comments exist in the kdump configuration file, kdump fails to generate a vmcore. (Reference: Red Hat Enterprise Linux kdump not creating a vmcore file when inline comments are used in /etc/kdump.conf)
- Failure to generate a vmcore when Intel IOMMU extensions are enabled: When Intel IOMMU extensions are enabled in the kernel, kdump fails to generate a vmcore. (Reference: Cannot collect a vmcore with kdump while Intel IOMMU is enabled)
- Failure to generate a vmcore when using older P410i/P220i controller firmware versions on HP BL460c G7 systems: When using older P410i/P220i controller firmware versions on HP BL460c G7 systems, kdump fails to generate a vmcore. (Reference: Why Kdump fails/ hangs on "HP BL460c G7" using "P220i/P410i" controller?)
- Failure to generate a vmcore when kdump is configured to save to local storage using HPSA or CCISS drivers: When kdump is configured to save the core dump to local storage using HPSA or CCISS drivers, it fails to generate a vmcore. (Reference: Why does kdump fail on HP system using the 'hpsa' driver for storage in Red Hat Enterprise Linux 6?)
- Failure to generate a vmcore on CCISS targets when firmware version is prior to v5.06: When kdump is configured to save the core dump to a CCISS target, it fails if the firmware version is prior to v5.06. (Reference: kdump failed to dump core file to cciss target running firmware versions lower than v5.06)
- Failure to generate a vmcore when saving to an SSH/NFS resource: When kdump is configured to save a core dump to an SSH/NFS resource using a static IP and no DNS setting, it fails to generate a vmcore. (Reference: Kdump over network requires DNS1 variable set when static IP address is used)
Register your machines now and always have the necessary information to diagnose a crashed system.

Posted: 2016-11-04T19:46:23+00:00