Red Hat Insights Blog
The Red Hat Insights team is pleased to highlight our first post-Summit 2017 service release for functionality and feature enhancement.
Red Hat Insights is a Software-as-a-Service (SaaS) that potentially prevents downtime by enabling customers to proactively monitor for infrastructure risks and critical security alerts detected in their environments, while requiring no added infrastructure. Insights offers automated remediation capabilities via Ansible Playbooks, as well as Executive Reporting features and Health Scoring, and recommends guidance on how to quickly and securely fix identified issues.
Our June 2017 release brings several new features to the Customer Portal Insights Web UI that are currently available for production environments, and beta features that are offered for testing and feedback in Insights Beta.
Read below for more information, or go check them out and let us know your thoughts by using the "Provide Feedback" button.
For more information about the latest Insights release, refer to our Red Hat Insights Release Notes.
Incident Detection [Beta Release Pending]
Detecting "Incidents" within an infrastructure is a new concept added to Red Hat Insights. Previously, Insights would proactively detect issues you were at risk of encountering in the future and identify them early so they could be acted upon before they're encountered. This core functionality still exists; however, the Insights engine has been expanded to now detect critical issues we know are currently impacting your infrastructure at the time of analysis. By highlighting these incidents differently within the UI, we aim to direct immediate attention and prioritize these incidents to be addressed quickly, preventing further or impending disruption.
Insights Analysis of OpenShift Infrastructure [Beta Release Pending]
This feature expands the capabilities of Insights to provide analysis of OpenShift infrastructures (masters and nodes).
Global Group Filtering [Beta & Stable]
Global group filters are now located throughout the UI, on almost all pages. These filters allow for modified views to only show the results within a selected group. The selected filter will remain with you as you navigate through Insights, until you reset or select another group.
Additional Page Filtering Capabilities [Beta & Stable]
Additional filtering capabilities have been added to Actions and Inventory views. Results can now be filtered by System Status (Checking-In or Stale), System Health (Affected or Healthy), and Incidents. Filtering is now designed to provide a consistent user experience no matter what page within Insights is being used.
Red Hat Insights Blog Subscription [Beta & Stable]
In an effort to keep users up to date with the latest news regarding Red Hat Insights, users are now automatically subscribed to the Red Hat Insights blog. New blog posts are submitted as new rules or features are added to Insights. Users can manage their subscriptions to this blog.
Red Hat’s Status Page Integration [Stable]
Integration with the Red Hat Status Page (status.redhat.com) has been completed and now provides up-to-date status of Red Hat Insights availability. The status page is used to communicate current outages, known availability issues or upcoming maintenance windows of Red Hat Insights stable, beta, and API.
Automatic Stale System Removal [Beta & Stable]
Automatic removal of stale systems helps users focus on the most up-to-date critical actions in their infrastructure, without the noise of older stale systems. A “stale” system is a system that is no longer checking-in with the Insights service daily, as expected. Once identified, the UI will highlight this system for action to be taken. After one month has passed with a stale status, the system will automatically be removed from Insights views.
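The stale-system lifecycle described above can be sketched in a few lines of Python. This is only an illustration: the post says "daily" and "one month", so the 1-day and 30-day thresholds, the `classify` helper, and the system names below are all assumptions rather than the actual Insights logic.

```python
from datetime import datetime, timedelta

# Assumed thresholds: the post says "daily" and "one month"; the exact
# cut-offs used by Insights are not stated, so these are illustrative.
STALE_AFTER = timedelta(days=1)
REMOVE_AFTER_STALE = timedelta(days=30)


def classify(last_checkin, now):
    """Return 'checking-in', 'stale', or 'removed' for one system."""
    since_checkin = now - last_checkin
    if since_checkin <= STALE_AFTER:
        return "checking-in"
    # Time spent stale starts once the expected daily check-in is missed.
    if since_checkin - STALE_AFTER < REMOVE_AFTER_STALE:
        return "stale"
    return "removed"


now = datetime(2017, 6, 7, 12, 0)
systems = {
    "web01": datetime(2017, 6, 7, 3, 0),   # checked in this morning
    "db01": datetime(2017, 5, 30, 3, 0),   # missed about a week of check-ins
    "old01": datetime(2017, 4, 1, 3, 0),   # stale for more than a month
}
for name, last_checkin in systems.items():
    print(name, classify(last_checkin, now))
```

Keeping the two thresholds as named constants makes it easy to see that removal depends on time spent stale, not on time since registration.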
Executive Reporting Enhancements [Beta]
Executive reporting was added in the April 2017 update of Red Hat Insights, providing users with views of historical trends and snapshots of infrastructure health. We have received multiple requests to enhance the reporting and have added the following features:
- Progress tracking and reporting on the number of issues resolved over the past 30 days.
- Appendix of all rule hits provides a quick report of issues identified by Insights within an account infrastructure, and the number of impacted systems.
- Overall Score improvements, on hover-over, provide additional details of what the score means and how it’s calculated. Additionally, the score color is modified based on the health of all systems.
- Export to PDF allows users to save and share their complete executive report. [Coming Soon]
Planner and Ansible Playbook Generation Improvements [Beta]
The Planner and playbook-builder UI has been improved to allow for more flexibility when adding to existing plans or creating new plans. Systems can now be added to previously specified groups, as individual systems, or all systems. Actions available to add are now displayed in intelligent views to allow for easier and quicker selection.
The Insights team thanks all those who helped beta test. We're always hard at work adding new features and functionality. Let us know how we can continue to improve Insights.
Posted: 2017-06-07T17:04:51+00:00
As we discussed in our previous blog post about enabling Ansible automation with Insights, we will look closer at taking findings from Insights and using the actionable intelligence provided to perform an automated remediation via Ansible playbook. Ansible Tower setup and remediation will be covered in an upcoming post.
Currently you can generate playbooks for Insights and Tower via Red Hat's customer portal. An upcoming release of Satellite 6 will further integrate Insights automated remediation into Satellite by allowing you to generate playbooks from the Satellite UI.
Prerequisites for being able to utilize Ansible functionality with Insights are:
- Active RHEL subscription
- Active Insights evaluation or entitlement
- RHEL 7 or RHEL 6.4 and later
- Ansible (or Ansible Tower) installed
- Insights systems registered and reporting with an identifiable problem
- Ability to manage systems via Ansible, with the Insights system hostname or "display name" used as the hostname in your Ansible inventory
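The last prerequisite is easy to get wrong, so a quick check before generating playbooks can help. The sketch below assumes a simple INI-style inventory and compares it with a list of Insights names; the helper names and the example hosts are hypothetical, and host ranges such as web[01:10] are deliberately not expanded.

```python
def inventory_hosts(inventory_text):
    """Collect host names from a simple INI-style Ansible inventory.

    Simplified on purpose: host ranges such as web[01:10] are not expanded.
    """
    hosts = set()
    in_host_section = True  # lines before any [section] are ungrouped hosts
    for raw in inventory_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            # [group:vars] and [group:children] sections do not list hosts.
            in_host_section = ":" not in line
            continue
        if in_host_section:
            hosts.add(line.split()[0])  # first token is the host name
    return hosts


def missing_from_inventory(insights_names, inventory_text):
    """Return Insights names that have no matching inventory entry."""
    return sorted(set(insights_names) - inventory_hosts(inventory_text))


example_inventory = """
[webservers]
web01.example.com ansible_user=root
web02.example.com

[webservers:vars]
ansible_port=22
"""
print(missing_from_inventory(
    ["web01.example.com", "db01.example.com"], example_inventory))
```

Any name reported here would need an inventory entry (or a matching display name in Insights) before a generated playbook could target it.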
Begin by logging into the Insights interface on the customer portal at https://access.redhat.com/insights
If you're already logged in, you'll be presented with the Insights Overview.
From the Overview you can see quickly if you have any systems that have automated remediation identified. In the lower right of the console under Planner you will see "# issues can be resolved automatically by Ansible" or something similar. You can use this to quickly see all items you can remediate with Ansible.
From here you have options. You can use Planner on the left nav menu to build a plan, you can click "Create a Plan/Playbook" from Overview, or you can use listed Actions (Actions -> Category) dropdowns for affected systems.
In this example we will navigate to Actions -> Security, and choose the "Kernel vulnerable to man-in-the-middle payload injection" action. We see that several systems are affected by this risk, and that it has a medium likelihood, a critical impact, and a high overall total risk. This Action is also Ansible enabled.
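How Insights derives the total risk from likelihood and impact is not spelled out in this post. As a rough illustration only, the ratings shown for this Action are consistent with averaging the two levels on a four-step scale; both the scale and the averaging rule below are assumptions, not the documented Insights formula.

```python
LEVELS = ["low", "medium", "high", "critical"]  # assumed four-step scale


def total_risk(likelihood, impact):
    """Combine two ratings by averaging their positions on the scale.

    Illustrative only: the post does not say how Insights combines them.
    Halves round up, so an in-between result leans toward the more
    severe level.
    """
    a = LEVELS.index(likelihood) + 1
    b = LEVELS.index(impact) + 1
    combined = (a + b + 1) // 2
    return LEVELS[combined - 1]


print(total_risk("medium", "critical"))
```

With these assumptions, a medium likelihood and a critical impact land on "high", matching the Action in this example.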
Clicking into the Action itself gives us a description of the problem and a list of systems affected. From here we can create a playbook for the affected systems.
I'll choose the three affected systems and use the Actions dropdown dialogue to Create a New Plan/Playbook.
Give this plan a name (this is important; if you're using Tower integration this name is how we quickly identify the playbooks within Tower as well) and ensure the systems selected are correct. Click "Save" and the plan is created. From here you can delete or edit the plan to specify a maintenance window and duration, edit systems associated with this plan, or Generate Playbook and Export to CSV. We want to generate a playbook, so click that button.
If the playbook you're building has options (like this example), you will be presented with a dialogue to decide which tasks you want to include in your Ansible playbook. Currently you may need to go to "Playbook Summary", as in the graphic above, to modify the playbook options. Since the selected machines are critical to my environment, and I can't afford to take downtime to fix them with a kernel update and reboot, I'll use the active mitigation and "Set sysctl ipv4 challenge ack limit". This will allow me to actively mitigate the system and make it non-vulnerable. A more permanent fix would be to update the kernel, but if I'm sure nothing is going to change my sysctl variable back (config management tools may reverse these changes if not also updated), then I would be safe with this active mitigation.
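Because a configuration management tool could silently undo the mitigation, it can be worth checking the persisted sysctl configuration after the run. The sketch below assumes the task sets the kernel's net.ipv4.tcp_challenge_ack_limit key; the post does not show the playbook contents, so the key name, the value 999999999, and both helper functions are assumptions for illustration.

```python
CHALLENGE_ACK_KEY = "net.ipv4.tcp_challenge_ack_limit"  # assumed key name


def sysctl_value(conf_text, key):
    """Return the last value assigned to key in sysctl.conf-style text."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        name, _, setting = line.partition("=")
        if name.strip() == key:
            value = setting.strip()  # later lines override earlier ones
    return value


def mitigation_persisted(conf_text, expected):
    """True if the persisted value matches what the playbook set."""
    return sysctl_value(conf_text, CHALLENGE_ACK_KEY) == str(expected)


# Hypothetical file contents; 999999999 is a placeholder value.
managed_conf = """
# managed by configuration management
net.ipv4.ip_forward = 0
net.ipv4.tcp_challenge_ack_limit = 999999999
"""
print(mitigation_persisted(managed_conf, 999999999))
```

Running a check like this against the files your configuration management tool owns shows whether the mitigation would survive its next run.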
Click Save to confirm your selection and finalize playbook generation by Downloading Playbook.
You can then use this downloaded Ansible playbook YML file to remediate the systems with: $ ansible-playbook $downloaded_filename.yml
Filenames follow a scheme of plan_name-plan_number-unixtime.yml and contain information inside about which remediation systems and rule versions are being utilized.
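If you keep many downloaded playbooks around, the naming scheme can be split back into its parts. This is a minimal sketch that assumes the plan number is an integer and the timestamp is in seconds; the function name and the example filename are made up for illustration.

```python
from datetime import datetime, timezone


def parse_playbook_filename(filename):
    """Split 'plan_name-plan_number-unixtime.yml' into its three parts."""
    if not filename.endswith(".yml"):
        raise ValueError("expected a .yml playbook file")
    stem = filename[:-len(".yml")]
    # Split from the right so plan names that contain hyphens stay intact.
    plan_name, plan_number, unixtime = stem.rsplit("-", 2)
    generated = datetime.fromtimestamp(int(unixtime), tz=timezone.utc)
    return plan_name, int(plan_number), generated


# Hypothetical filename following the scheme described above.
name, number, generated = parse_playbook_filename(
    "kernel-mitm-fix-1234-1496332558.yml")
print(name, number, generated.isoformat())
```

Splitting from the right is the design choice that matters here, since a plan name is free text and may itself contain hyphens.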
After the playbook run completes, and assuming there are no errors that need further investigation, refreshing the Planner interface shows that 3/3 systems have been remediated. The remediations were performed successfully, and these systems now have a check mark as their status.
That's how simple it is to start using Ansible playbooks to remediate systems reporting risks. Stay tuned for another upcoming blog post on how to scale this to your entire infrastructure with Ansible Tower.
Let us know your thoughts on the new features highlighted in our last post, in the comments on the blogs or with the Provide Feedback button inside of Insights!
Thanks from all of us here at the Insights engineering and product teams, and happy remediating. Stay tuned for part 3, where we will be using Ansible Tower and Insights for enterprise remediation.
-Will Nix
Posted: 2017-06-01T15:55:58+00:00
Pairing Ansible and Insights may be the smartest thing since putting peanut butter and jelly together. With this partnership, you can now download playbooks from Insights to solve the problems in your infrastructure. With a few clicks, you can stop worrying, kick back, and bask in the glorious rays of automation.
Our developers have done all the work of creating playbooks for you so that you don’t have to come up with them yourselves. We go through each rule in the Insights database, verify the steps, and create a playbook that deals with that exact problem on your systems. When you create an Insights Plan, all actions with available playbooks will have those playbooks automatically merged together into one single playbook. This makes it incredibly simple to fix many problems on all of your systems quickly and easily.
If you already own Ansible Tower, you can easily take advantage of the Insights integration by selecting the plan you configured in Insights to run automatically as an Ansible playbook. Our playbooks will work in both Tower and Ansible Core, so you can utilize Insights automated remediation no matter what your Ansible infrastructure looks like.
All this functionality is also available over our REST API. Resources for creating maintenance plans, obtaining playbooks, verifying systems’ state, etc. are built-in. Our API Documentation is a good place to learn more about integrating Insights detection and remediation capabilities into external systems or scripts.
Amaze your colleagues and managers with your lightning fast response to critical infrastructure issues thanks to Ansible and Insights!
Stay tuned for Parts 2 and 3 of our Ansible and Insights series for a walkthrough on how to quickly set up remediation with Ansible Core and Ansible Tower.
Part 2 now available here: Ansible and Insights Part 2 - Automating Ansible Core remediation
Posted: 2017-05-15T21:06:53+00:00
Recently we rolled out a couple new features to help you assess and prioritize your risk. These would be the Likelihood and Impact that you will see assigned to individual Insights Rules.
Likelihood is the probability that a system will experience the impact described in the rule. Since we are trying to be proactive in detecting the conditions before there is an impact, Likelihood is an important factor when prioritizing work. The higher the Likelihood, the more urgent it is to proactively remediate the conditions, so you won’t be unexpectedly impacted.
Impact is similarly important for determining risk. If the impact is low, then the priority to fix would be lower. For example, an intermittent performance degradation might be low impact, whereas an issue that could eat your data for lunch would generally be a higher impact because of the data loss.
When you combine Likelihood and Impact you get your Total Risk. Insights gives you these three metrics to help you make better decisions about what should be fixed first. One of the main goals of Insights is to give you the information necessary to decide what is the most important and urgent thing to fix in your environment. We strive to help you avoid being impacted by an unplanned outage.
Posted: 2017-04-10T19:25:12+00:00
For many customers, Satellite is a vital part of their infrastructure - distributing and managing package updates, organizing systems, and providing a robust virtualization infrastructure. The overall health of your Satellite system can impact much of your daily workflow within your environment. Issues with Satellite can lead you into digging through log files, googling for answers, or calling support to find the source of the problem. With Insights, you can save multiple hours of troubleshooting time by having the root cause and the solution at your fingertips.
We have bundled these rules into one Satellite topic so that you can easily determine if Insights has detected an issue and what steps you should take to remediate it. And as usual, we’ll keep adding rules to the topic as we discover new issues related to Satellite.
Here is a list of rules initially included in the Satellite topic:
- Failure to synchronize content to Satellite due to deadlock in postgresql when database needs cleaning
- Database deadlock on Satellite server when serving too many connections to postgresql
- Decreased performance when clients with duplicate OSAD IDs connect to the Satellite server
- Newly synced content will not be available to clients due to taskomatic service not running
- Satellite 5 subscription certificate has expired
Posted: 2017-02-27T15:21:28+00:00
Most critical physical systems use multiple network interfaces bonded together to provide redundancy and, depending on the workload, to provide greater network throughput. Bonding can be configured in either manner depending on the mode specified in the bonding configuration file.
It is quite common to misconfigure bonding. The configuration is case sensitive, so something might be capitalized that shouldn’t be. You might have misunderstood the documentation and configured an incorrect or suboptimal bonding mode. The Red Hat Insights team has identified a number of misconfigurations that can leave your system without the redundancy you expect, or that will degrade the network performance when you most need it.
We have bundled all of these rules into one Network Bonding topic so that you can easily know whether Insights has detected an issue and, if so, what steps you should take to remediate it. We’ll keep adding rules to the topic as we discover new issues related to network bonding.
Here is the list of rules initially included in the Network Bonding topic:
- Decreased network performance when GRO is not enabled on all parts of bonded interfaces
- Verify EtherChannel configuration
- Upgrade initscripts package
- Bonding might activate incorrect interface
- Bonding negotiation issue
- VLAN tagging failover issue on bonded interface
- Unexpected behavior with bad syntax in bond config
- Decreased network performance when not using a correct LACP hash in networking bonding
- Monitoring disabled for network bond
- Failure to generate a vmcore over the network when missing bonding option parameter in kdump configuration file
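To make two of these misconfigurations concrete (wrong capitalization in the bond options, and disabled link monitoring), here is a small sketch that inspects a BONDING_OPTS-style string. It is not the logic Insights uses; the function name is hypothetical, and it assumes the options are written as space-separated key=value pairs using the standard Linux bonding mode names.

```python
# Mode names and numbers accepted by the Linux bonding driver.
VALID_MODES = {
    "balance-rr", "active-backup", "balance-xor", "broadcast",
    "802.3ad", "balance-tlb", "balance-alb",
    "0", "1", "2", "3", "4", "5", "6",
}


def check_bonding_opts(opts):
    """Return a list of problems found in a BONDING_OPTS-style string."""
    problems = []
    settings = {}
    for token in opts.split():
        key, sep, value = token.partition("=")
        if not sep or not value:
            problems.append(f"malformed option: {token!r}")
            continue
        settings[key] = value

    mode = settings.get("mode")
    if mode is not None and mode not in VALID_MODES:
        if mode.lower() in VALID_MODES:
            problems.append(f"mode {mode!r} has wrong capitalization")
        else:
            problems.append(f"unknown mode {mode!r}")

    def positive(name):
        value = settings.get(name, "0")
        return value.isdigit() and int(value) > 0

    # Link monitoring needs a non-zero miimon or arp_interval.
    if not (positive("miimon") or positive("arp_interval")):
        problems.append("link monitoring is disabled")
    return problems


print(check_bonding_opts("mode=Active-Backup miimon=100"))
print(check_bonding_opts("mode=802.3ad"))
```

The first call flags the capitalized mode name, and the second flags a bond with no miimon or arp_interval set.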
You may have noticed that the interface for Red Hat Insights underwent some changes recently. Our developers have been hard at work to provide a richer, more streamlined experience based on your feedback and recently released some new features. Here is a detailed list of recent Insights UI improvements.
- Introducing Topics - Topics are a new way to present groups of actionable intelligence providing Insights with additional categories such as SAP, Oracle, kdump and networking.
- Redesigned Overview - Our overview page provides a glimpse into your infrastructure health, upcoming plans, and system registration. In addition, customers now receive a curated feed from Red Hat product and security teams, late-breaking vulnerabilities, and other Red Hat Insights news.
- Provide Direct Feedback - We truly value your input and many enhancements are directly related to customer feedback. The ability to quickly and easily provide feedback is now integrated within the interface giving customers a direct line for feedback regarding any Insights features, suggested improvements, or rules.
- Additional Views On Inventory - It is now possible to select between card or table views for enhanced sorting & filtering of systems, deployments, and assets in the inventory.
- Enhanced Actions Page - In an effort to continue providing quick access to the most critical information in your infrastructure, we have enhanced our Actions interface. New charts provide more infrastructure-wide visibility to identify risk based on severity levels.
- Notifications Icon - Receive alerts about systems not checking in, suggested or upcoming plans, or other critical Insights information from the new notifications icon. Click to view, drill down, or dismiss alerts.
If you would like to see the latest features in development, take a look at Insights Beta.
Posted: 2017-01-24T16:21:13+00:00
Red Hat Insights is all about making sure your systems are running as smoothly as possible. Not just for Red Hat applications, but also for your other enterprise apps. We’ve begun developing rules tailored to large enterprise applications that could use the fine tuning expertise that Red Hat provides. We’ve nailed down the optimal settings required by SAP apps, and now Insights can let you know if those are in place on your systems.
We’ve introduced SAP related rules for alerting you to system configurations which are not up to the specs recommended by either Red Hat or SAP. Having these enterprise apps on systems tailored to their specific needs can be greatly beneficial for the system and more importantly to the clients that have to use them.
We want the apps you entrust to Red Hat Enterprise Linux to be as effective and efficient as you need them to be. These new rules will help you accomplish that goal.
- Rule: SAP application incompatibility with installed RHEL Version
  Description: SAP applications will encounter compatibility errors when not running on RHEL for SAP.
  Reference: Overview of Red Hat Enterprise Linux for SAP Business Applications subscription
- Rule: Decreased application performance when not running sap-netweaver tuned profile with SAP applications
  Description: Enable the sap-netweaver tuned profile to optimize hosts for SAP applications
  Reference: Overview of Red Hat Enterprise Linux for SAP Business Applications subscription
- Rule: Decreased SAP application performance when using incorrect kernel parameters
  Description: When SAP's kernel parameter recommendations are not followed, SAP applications will experience decreased performance.
  Reference: Red Hat Enterprise Linux 6.x: Installation and Upgrade - SAP Note
- Rule: Decreased SAP application performance when file handler limits do not meet SAP requirements
  Description: Current file handle limits do not meet the application requirements as defined by SAP. This results in decreased SAP application performance.
  Reference: Red Hat Enterprise Linux 7.x: Installation and Upgrade - SAP Note
- Rule: Time discrepancy in SAP applications when not running ntp on SAP servers
  Description: SAP strongly recommends running an ntp service on systems running SAP
  Reference: Red Hat Enterprise Linux 7.x: Installation and Upgrade - SAP Note
- Rule: Database inconsistencies when UUIDD not running with SAP applications
  Description: SAP applications require UUIDD to be installed and running in order to prevent UUIDs from being reused in the application. When UUIDD is not running, database inconsistencies can occur.
  Reference: Linux UUID solutions - SAP Note
Posted: 2017-01-06T13:54:30+00:00
The basic timekeeping standard for almost all of the world's local time zones is called Coordinated Universal Time (UTC). UTC is derived from International Atomic Time (TAI) and Universal Time (UT1), also known as mean solar time because it is based on the time it takes for the Earth to rotate once on its axis. Because the rotation of the Earth varies a bit over time and its mean rotation speed is slowly decreasing, a deviation accumulates between UTC and UT1. When this deviation approaches 0.9 seconds, a leap second is inserted into the UTC time scale, which adjusts UTC to the actual rotation of the Earth.
Leap seconds introduce a discontinuity in civil time: the time does not increase monotonically but is stepped by one second. Leap seconds present a challenge to computer system timekeeping because standard UNIX time is defined as the number of seconds since 00:00:00 UTC on 1 January 1970, not counting leap seconds. A system clock that follows this definition cannot represent 23:59:60, because every minute has only 60 seconds and every day has only 86400 seconds.
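The UNIX time model can be seen directly from Python's standard library, which follows the same convention. The short sketch below uses the December 31, 2016 leap second as its example; it only illustrates the timekeeping model and says nothing about how any particular NTP or chrony configuration behaves.

```python
import calendar
from datetime import datetime

before = calendar.timegm((2016, 12, 31, 23, 59, 59))
after = calendar.timegm((2017, 1, 1, 0, 0, 0))

# A leap second was inserted between these two instants, yet UNIX time
# advances by only one second because it does not count leap seconds.
print(after - before)

# Every UNIX day is exactly 86400 seconds, so midnight is a multiple of it.
print(after % 86400)

# Python's datetime follows the same model and rejects second 60.
try:
    datetime(2016, 12, 31, 23, 59, 60)
    leap_second_accepted = True
except ValueError:
    leap_second_accepted = False
print(leap_second_accepted)
```

Since 23:59:60 has no UNIX timestamp of its own, a system has to step, slew, or smear its clock around the leap second.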
To help you avoid downtime during the leap second insertion on December 31, 2016, Red Hat Insights has recently released a set of rules to detect various leap second issues. Check out the rules below.
- Rule: System clock changes instantly in leap second event in NTP system with slew mode configured [ntpd_slew_mode]
  Description: Previous versions of NTP (ntp-4.2.6p5-1 and greater) incorrectly changed the system clock instantly during a leap second event, despite configuring ntpd with -x. In certain applications, this could lead to a variety of system clock related problems such as incorrect event sorting or triggering.
  Reference: Does Red Hat plan to release a ntp package including xleap.patch (Important for slew mode i.e. with -x ntp configuration)?
- Rule: System hangs or has high CPU usage during leap second event in NTP systems [leapsec_system_hard_hang]
  Description: NTP systems can hang or encounter high CPU usage when a leap second event occurs.
  Reference: Systems hang due to leap-second livelock., High CPU usage after inserting leap second
- Rule: System clock inaccurate in leap second event in non-NTP systems following TAI timescale [tzdata_need_upgrade]
  Description: In a non-NTP RHEL host following the TAI timescale, one can configure non-NTP RHEL systems to report time corrected for leap seconds by updating the tzdata package to the latest version available and then using appropriate 'right' timezone files.
  Reference: Leap Second queries related to tzdata
- Rule: Ntpd service continues to announce upcoming leap second to clients following leap second insertion [ntpd_not_reset_leap_status]
  Description: The ntpd service does not reset the leap status and continues announcing an upcoming leap second to its clients when finishing a leap second insertion in the NTP server due to a known bug in the NTP package.
  Reference: The ntpd leap status is not reset after inserting a leap second
- Rule: Chronyd service in leap smear NTP server has ~50% chance of crash when configured with smoothtime directive and 1-second polling interval [leap_smear_chronyd_crash]
  Description: When chronyd is configured with the smoothtime directive and the smoothing process is updated with an extremely small offset, it may not be able to select a direction in which the offset needs to be smoothed out due to numerical errors in floating-point operations and this causes an assertion failure.
  Reference: Chronyd crashes when performing server leap smear
To see if you have systems affected by these new rules, check out the Stability category here.
Posted: 2016-12-01T07:50:45+00:00
A system crash can be one of the most frustrating issues that administrators encounter in their day-to-day work. Crashes often strike without warning, require hard reboots, and can kill a process uncleanly, leaving various locked files in place that an admin must go back and clean up manually. These kinds of interruptions can take anywhere from a few minutes to a few hours to overcome. That’s time you could be spending engineering new solutions that change the world, or at the very least, drinking some coffee and catching up on your email.
Insights has your back when it comes to these day-wrecking errors. The Red Hat Knowledgebase is full of incidents that we have diagnosed and tracked down to specific causes. We use this data within Insights to catch these errors before they happen. We see the problems coming so you can take action to fix them before they pull you away from the creative work you’re doing.
- Rule: Kernel panic when scanning network for LUNs in bnx2fc driver
  Description: When scanning the network for Logical Unit Numbers (LUNs), a kernel panic can occur due to a race condition in the bnx2fc driver. A kernel panic can also occur from scanning for an FCoE-served LUN after an initial LUN scan.
  Reference: Why "bnx2fc: byte_count = 62 != scsi_bufflen = 0," messages appears while accessing FCOE provisioned LUNS in RHEL6?
- Rule: Kernel panic when using TSO in bnx2x driver
  Description: When booting a system with TCP segmentation offload (TSO) enabled and the boot parameters '[intel|amd]_iommu=on', a kernel panic occurs during transmission of tunneled TSO packets.
  Reference: kernel crashes with instruction pointer at "rb_erase+0x1fb/0x390" if [intel|amd]_iommu=on and TSO with bnx2x driver are enabled.
- Rule: Kernel panic when using an Emulex network interface with GRO enabled in be2net driver
  Description: A kernel panic can occur when systems with an Emulex network interface are using the be2net driver with GRO set to "on."
  Reference: Kernel panic when using the be2net driver occurs when networking is started
- Rule: Kernel panic after 200+ days of uptime when using the TSC clock source
  Description: Early RHEL kernels were susceptible to a system panic after 208.5 days of uptime when the system was using the TSC clock source.
  Reference: Red Hat Enterprise Linux systems using the TSC clock source reboots or panics when 'sched_clock()' overflows after an uptime of 208.5 days
- Rule: Kernel panic after 200+ days of uptime on certain Xeon CPUs
  Description: Intel Xeon E5, E5 v2, and E7 v2 CPUs running certain Red Hat Enterprise Linux kernels are susceptible to a bug that can lead to a system panic based on accumulated uptime.
  Reference: Servers with Intel® Xeon® Processor E5, Intel® Xeon® Processor E5 v2, or Intel® Xeon® Processor E7 v2 and certain versions of Red Hat Enterprise Linux 6 kernels become unresponsive/hung or incur a kernel panic
- Rule: Kernel panic when using HP Proliant Gen8 servers with older iLO4 firmware versions
  Description: HP ProLiant Gen8-series servers with Integrated Lights-Out 4 (iLO 4) firmware versions 1.30, 1.32, 1.40, and 1.50 have a bug that when combined with the HP watchdog driver (hpwdt) can cause system panics.
  Reference: Why does the system crash with HP NMI Watchdog [hpwdt]?
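The 208.5-day figure in the TSC clock source rule is not arbitrary. As a quick piece of arithmetic, and assuming (the post does not state this) that the overflow happens once the nanosecond counter reaches 2**54, the uptime works out as follows:

```python
NANOSECONDS_PER_DAY = 86400 * 10**9

# Assumption: the counter overflows once it reaches 2**54 nanoseconds.
overflow_ns = 2**54
overflow_days = overflow_ns / NANOSECONDS_PER_DAY
print(round(overflow_days, 1))
```

This explains why the affected systems fail only after months of continuous uptime, which makes the problem hard to catch in short-lived test environments.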
Register your machines now and spend less time fire-fighting crashed systems.
Posted: 2016-11-11T17:54:37+00:00