Policies and service definition
Understanding Red Hat OpenShift Service on AWS Policies
Abstract
Chapter 1. Red Hat OpenShift Service on AWS service definition
This documentation outlines the service definition for the Red Hat OpenShift Service on AWS (ROSA) managed service.
1.1. Account management
This section provides information about the service definition for Red Hat OpenShift Service on AWS account management.
1.1.1. Billing
Red Hat OpenShift Service on AWS is billed through Amazon Web Services (AWS) based on usage of the AWS components consumed by the service, such as load balancers, storage, EC2 instances, and other components, and on Red Hat subscriptions for the OpenShift service.
Any additional Red Hat software must be purchased separately.
1.1.2. Cluster self-service
Customers can self-service their clusters, including, but not limited to:
- Create a cluster
- Delete a cluster
- Add or remove an identity provider
- Add or remove a user from an elevated group
- Configure cluster privacy
- Add or remove machine pools and configure autoscaling
- Define upgrade policies
These tasks can be self-serviced by using the rosa CLI utility.
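For example, the following rosa CLI commands sketch a few of these self-service tasks. The cluster name, machine pool name, and replica count shown here are placeholders, and the exact flags available can vary by rosa CLI version.
$ rosa create cluster --cluster-name=my-cluster
$ rosa create machinepool --cluster=my-cluster --name=extra-pool --replicas=3
$ rosa delete cluster --cluster=my-cluster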
1.1.3. Compute
Single availability zone clusters require a minimum of 3 control plane nodes, 2 infrastructure nodes, and 2 worker nodes deployed to a single availability zone.
Multiple availability zone clusters require a minimum of 3 control plane nodes, 3 infrastructure nodes, and 3 worker nodes. Additional worker nodes must be purchased in multiples of three to maintain proper node distribution.
The Default machine pool node type and size cannot be changed after the cluster is created.
Control plane and infrastructure nodes are deployed and managed by Red Hat. There are at least 3 control plane nodes that handle etcd- and API-related workloads. There are at least 2 infrastructure nodes that handle metrics, routing, the web console, and other workloads. Control plane and infrastructure nodes are strictly for Red Hat workloads to operate the service, and customer workloads are not permitted to be deployed on these nodes.
1 vCPU core and 1 GiB of memory are reserved on each worker node to run processes required as part of the managed service. This includes, but is not limited to, audit log aggregation, metrics collection, DNS, image registry, and SDN.
1.1.4. AWS compute types
Red Hat OpenShift Service on AWS offers the following worker node types and sizes:
General purpose
- M5.xlarge (4 vCPU, 16 GiB)
- M5.2xlarge (8 vCPU, 32 GiB)
- M5.4xlarge (16 vCPU, 64 GiB)
Memory-optimized
- R5.xlarge (4 vCPU, 32 GiB)
- R5.2xlarge (8 vCPU, 64 GiB)
- R5.4xlarge (16 vCPU, 128 GiB)
Compute-optimized
- C5.2xlarge (8 vCPU, 16 GiB)
- C5.4xlarge (16 vCPU, 32 GiB)
1.1.5. Regions and availability zones
The following AWS regions are supported by Red Hat OpenShift 4 and by Red Hat OpenShift Service on AWS. Note: China and GovCloud (US) regions are not supported, even if they are supported for OpenShift 4.
- ap-northeast-1 (Tokyo)
- ap-northeast-2 (Seoul)
- ap-south-1 (Mumbai)
- ap-southeast-1 (Singapore)
- ap-southeast-2 (Sydney)
- ca-central-1 (Central)
- eu-central-1 (Frankfurt)
- eu-north-1 (Stockholm)
- eu-west-1 (Ireland)
- eu-west-2 (London)
- eu-west-3 (Paris)
- me-south-1 (Bahrain)
- sa-east-1 (São Paulo)
- us-east-1 (N. Virginia)
- us-east-2 (Ohio)
- us-west-1 (N. California)
- us-west-2 (Oregon)
Multiple availability zone clusters can only be deployed in regions with at least 3 availability zones. For more information, see the Regions and Availability Zones section in the AWS documentation.
Each new Red Hat OpenShift Service on AWS cluster is installed within an installer-created or preexisting Virtual Private Cloud (VPC) in a single region, with the option to deploy into a single availability zone (Single-AZ) or across multiple availability zones (Multi-AZ). This provides cluster-level network and resource isolation, and enables cloud-provider VPC settings, such as VPN connections and VPC Peering. Persistent volumes (PVs) are backed by AWS Elastic Block Storage (EBS), and are specific to the availability zone in which they are provisioned. Persistent volume claims (PVCs) do not bind to a volume until the associated pod resource is assigned into a specific availability zone to prevent unschedulable pods. Availability zone-specific resources are only usable by resources in the same availability zone.
The region and the choice of single or multiple availability zone cannot be changed after a cluster has been deployed.
1.1.6. Service Level Agreement (SLA)
Any SLAs for the service itself are defined in the Red Hat Enterprise Agreement Appendix 4 (Online Subscription Services).
1.1.7. Support
Red Hat OpenShift Service on AWS includes Red Hat Premium Support, which can be accessed by using the Red Hat Customer Portal.
See Red Hat OpenShift Service on AWS SLAs for support response times.
AWS support is subject to a customer’s existing support contract with AWS.
1.2. Logging
Red Hat OpenShift Service on AWS provides optional integrated log forwarding to AWS CloudWatch.
1.2.1. Cluster audit logging
Cluster audit logs are always enabled. Audit logs are streamed to a log aggregation system outside the cluster VPC for automated security analysis and secure retention for 1 year. Red Hat controls the log aggregation system. Customers do not have access. Customers can receive a copy of their cluster’s audit logs upon request through a support ticket. Audit log requests must specify a date and time range not to exceed 21 days. When requesting audit logs, customers should be aware that audit logs are many GB per day in size.
1.2.2. Application logging
Application logs sent to STDOUT are collected by Fluentd and forwarded to AWS CloudWatch through the cluster logging stack, if it is installed.
1.3. Monitoring
This section provides information about the service definition for Red Hat OpenShift Service on AWS monitoring.
1.3.1. Cluster metrics
Red Hat OpenShift Service on AWS clusters come with an integrated Prometheus stack for cluster monitoring, including CPU, memory, and network-based metrics. This is accessible through the web console. These metrics also allow for horizontal pod autoscaling based on CPU or memory metrics provided by a Red Hat OpenShift Service on AWS user.
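As an illustration of CPU-based horizontal pod autoscaling, a deployment can be autoscaled with a single oc command. The deployment name and thresholds below are placeholders.
$ oc autoscale deployment/frontend --min=2 --max=10 --cpu-percent=75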
1.3.2. Cluster status notification
Red Hat communicates the health and status of Red Hat OpenShift Service on AWS clusters through a combination of a cluster dashboard available in OpenShift Cluster Manager (OCM) and email notifications sent to the email address of the contact that originally deployed the cluster, as well as to any additional contacts specified by the customer.
1.4. Networking
This section provides information about the service definition for Red Hat OpenShift Service on AWS networking.
1.4.1. Custom domains for applications
To use a custom hostname for a route, you must update your DNS provider by creating a canonical name (CNAME) record. Your CNAME record should map the OpenShift canonical router hostname to your custom domain. The OpenShift canonical router hostname is shown on the Route Details page after a route is created. Alternatively, a wildcard CNAME record can be created once to route all subdomains for a given hostname to the cluster’s router.
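As a sketch, assuming a custom hostname of www.example.com and a canonical router hostname of router-default.apps.my-cluster.abcd.p1.openshiftapps.com (both placeholders), the DNS record and route creation might look like the following.
# CNAME record at your DNS provider (zone file syntax shown for illustration)
www.example.com.    CNAME    router-default.apps.my-cluster.abcd.p1.openshiftapps.com.
# Expose a service on the custom hostname
$ oc create route edge --service=frontend --hostname=www.example.com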
1.4.2. Domain validated certificates
Red Hat OpenShift Service on AWS includes TLS security certificates needed for both internal and external services on the cluster. For external routes, there are two separate TLS wildcard certificates that are provided and installed on each cluster: one is for the web console and route default hostnames, and the other is for the API endpoint. Let’s Encrypt is the certificate authority used for certificates. Routes within the cluster, such as the internal API endpoint, use TLS certificates signed by the cluster’s built-in certificate authority and require the CA bundle, which is made available in every pod, to trust the TLS certificate.
1.4.3. Custom certificate authorities for builds
Red Hat OpenShift Service on AWS supports the use of custom certificate authorities that are trusted by builds when pulling images from an image registry.
1.4.4. Load Balancers
Red Hat OpenShift Service on AWS uses up to five different load balancers:
- An internal control plane load balancer that is internal to the cluster and used to balance traffic for internal cluster communications.
- An external control plane load balancer that is used for accessing the OpenShift and Kubernetes APIs. This load balancer can be disabled in OCM. If this load balancer is disabled, Red Hat reconfigures the API DNS to point to the internal control plane load balancer.
- An external control plane load balancer for Red Hat that is reserved for cluster management by Red Hat. Access is strictly controlled, and communication is only possible from whitelisted bastion hosts.
- A default external router/ingress load balancer that is the default application load balancer, denoted by apps in the URL. The default load balancer can be configured in OCM to be either publicly accessible over the Internet or only privately accessible over a pre-existing private connection. All application routes on the cluster are exposed on this default router load balancer, including cluster services such as the logging UI, metrics API, and registry.
- Optional: A secondary router/ingress load balancer that is a secondary application load balancer, denoted by apps2 in the URL. The secondary load balancer can be configured in OCM to be either publicly accessible over the Internet or only privately accessible over a pre-existing private connection. If a Label match is configured for this router load balancer, then only application routes matching this label are exposed on it; otherwise, all application routes are also exposed on it.
- Optional: Load balancers for services. These load balancers enable non-HTTP/SNI traffic and the use of non-standard ports for services, and can be mapped to a service running on Red Hat OpenShift Service on AWS to enable advanced ingress features. They can be purchased in groups of 4 for standard clusters or provisioned without charge in Red Hat Customer Cloud Subscription (CCS) clusters; however, each AWS account has a quota that limits the number of Classic Load Balancers that can be used within each cluster.
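For illustration, a service load balancer is requested by creating a Service of type LoadBalancer; the service name, selector, and port below are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: tcp-example
spec:
  type: LoadBalancer
  selector:
    app: tcp-example
  ports:
  - protocol: TCP
    port: 8443
    targetPort: 8443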
1.4.5. Cluster ingress
Project administrators can add route annotations for many different purposes, including ingress control through IP allow-listing.
Ingress policies can also be changed by using NetworkPolicy objects, which leverage the ovs-networkpolicy plug-in. This allows for full control over the ingress network policy down to the pod level, including between pods on the same cluster and even in the same namespace.
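The following is a minimal NetworkPolicy sketch that restricts ingress to pods labeled app: web so that only pods in the same namespace labeled role: frontend can reach them on port 8080; all names and labels are illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-web
spec:
  podSelector:
    matchLabels:
      app: web
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 8080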
All cluster ingress traffic will go through the defined load balancers. Direct access to all nodes is blocked by cloud configuration.
1.4.6. Cluster egress
Pod egress traffic can be controlled by using EgressNetworkPolicy objects to prevent or limit outbound traffic in Red Hat OpenShift Service on AWS.
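A minimal EgressNetworkPolicy sketch that allows outbound traffic to a single external domain and denies all other external egress for a project might look like the following; the domain name is a placeholder.
apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: default
spec:
  egress:
  - type: Allow
    to:
      dnsName: www.example.com
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0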
Public outbound traffic from the control plane and infrastructure nodes is required to maintain cluster image security and cluster monitoring. This requires that the 0.0.0.0/0 route belongs only to the Internet gateway; it is not possible to route this range over private connections.
OpenShift 4 clusters use NAT gateways to present a public, static IP for any public outbound traffic leaving the cluster. Each availability zone a cluster is deployed into receives a distinct NAT gateway, therefore up to 3 unique static IP addresses can exist for cluster egress traffic. Any traffic that remains inside the cluster, or that does not go out to the public Internet, will not pass through the NAT gateway and will have a source IP address belonging to the node that the traffic originated from. Node IP addresses are dynamic; therefore, a customer must not rely on whitelisting individual IP addresses when accessing private resources.
Customers can determine their public static IP addresses by running a pod on the cluster and then querying an external service. For example:
$ oc run ip-lookup --image=busybox -i -t --restart=Never --rm -- /bin/sh -c "/bin/nslookup -type=a myip.opendns.com resolver1.opendns.com | grep -E 'Address: [0-9.]+'"
1.4.7. Cloud network configuration
Red Hat OpenShift Service on AWS allows for the configuration of a private network connection through AWS-managed technologies:
- VPN connections
- VPC peering
- Transit Gateway
- Direct Connect
Red Hat site reliability engineers (SREs) do not monitor private network connections. Monitoring these connections is the responsibility of the customer.
1.4.8. DNS forwarding
For Red Hat OpenShift Service on AWS clusters that have a private cloud network configuration, a customer can specify internal DNS servers available on that private connection that should be queried for explicitly provided domains.
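For illustration, DNS forwarding in OpenShift 4 is configured on the default DNS operator object; the zone name and upstream server addresses below are placeholders, and the exact fields may vary by OpenShift version.
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  servers:
  - name: example-private-zone
    zones:
    - example.corp
    forwardPlugin:
      upstreams:
      - 10.0.0.10
      - 10.0.0.11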
1.5. Storage
This section provides information about the service definition for Red Hat OpenShift Service on AWS storage.
1.5.1. Encrypted-at-rest OS and node storage
Control plane nodes use encrypted-at-rest AWS Elastic Block Store (EBS) storage.
1.5.2. Encrypted-at-rest PV
EBS volumes that are used for PVs are encrypted-at-rest by default.
1.5.3. Block storage (RWO)
Persistent volumes (PVs) are backed by AWS EBS, which supports only the ReadWriteOnce (RWO) access mode.
PVs can be attached to only a single node at a time and are specific to the availability zone in which they were provisioned; however, they can be attached to any node in that availability zone.
Each cloud provider has its own limits for how many PVs can be attached to a single node. See AWS instance type limits for details.
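A minimal PVC sketch requesting ReadWriteOnce block storage follows; the claim name, size, and gp2 storage class are assumptions and may differ on your cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp2
  resources:
    requests:
      storage: 10Gi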
1.6. Platform
This section provides information about the service definition for the Red Hat OpenShift Service on AWS platform.
1.6.1. Cluster backup policy
It is critical that customers have a backup plan for their applications and application data.
Application and application data backups are not a part of the Red Hat OpenShift Service on AWS service. All Kubernetes objects and persistent volumes (PVs) in each Red Hat OpenShift Service on AWS cluster are backed up to facilitate a prompt recovery in the unlikely event that a cluster becomes irreparably inoperable.
The backups are stored in a secure, multiple availability zone object storage bucket in the same account as the cluster. Node root volumes are not backed up because Red Hat CoreOS is fully managed by the Red Hat OpenShift Service on AWS cluster and no stateful data should be stored on a node’s root volume.
The following table shows the frequency of backups:
Component | Snapshot frequency | Retention | Notes |
---|---|---|---|
Full object store backup, all cluster PVs | Daily at 0100 UTC | 7 days | This is a full backup of all Kubernetes objects, as well as all mounted PVs in the cluster. |
Full object store backup, all cluster PVs | Weekly on Mondays at 0200 UTC | 30 days | This is a full backup of all Kubernetes objects, as well as all mounted PVs in the cluster. |
Full object store backup | Hourly at 17 minutes past the hour | 24 hours | This is a full backup of all Kubernetes objects. No PVs are backed up in this backup schedule. |
1.6.2. Autoscaling
Node autoscaling is available on Red Hat OpenShift Service on AWS.
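For example, node autoscaling can be enabled on a machine pool with the rosa CLI; the cluster name, pool name, and replica bounds shown are placeholders, and the available flags may vary by CLI version.
$ rosa create machinepool --cluster=my-cluster --name=autoscale-pool \
    --enable-autoscaling --min-replicas=3 --max-replicas=6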
1.6.3. Daemonsets
Customers can create and run daemonsets on Red Hat OpenShift Service on AWS. To restrict daemonsets to only running on worker nodes, use the following nodeSelector:
...
spec:
  nodeSelector:
    role: worker
...
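For context, the snippet above fits into a full daemonset manifest similar to the following sketch; the daemonset name, labels, and image are illustrative only.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-agent
spec:
  selector:
    matchLabels:
      app: example-agent
  template:
    metadata:
      labels:
        app: example-agent
    spec:
      nodeSelector:
        role: worker
      containers:
      - name: agent
        image: registry.example.com/agent:latest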
1.6.4. Multiple availability zone
In a multiple availability zone cluster, control plane nodes are distributed across availability zones and at least one worker node is required in each availability zone.
1.6.5. Node labels
Custom node labels are created by Red Hat during node creation and cannot be changed on Red Hat OpenShift Service on AWS clusters at this time. However, custom labels are supported when creating new machine pools.
1.6.6. OpenShift version
Red Hat OpenShift Service on AWS is run as a service and is kept up to date with the latest OpenShift Container Platform version. Upgrade scheduling to the latest version is available.
1.6.7. Upgrades
Upgrades can be scheduled by using the rosa CLI utility or through OpenShift Cluster Manager (OCM).
See the OpenShift Dedicated Life Cycle for more information on the upgrade policy and procedures.
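As an example of scheduling an upgrade with the rosa CLI, a command similar to the following can be used; the cluster name, target version, and schedule values are placeholders, and flags may vary by CLI version.
$ rosa upgrade cluster --cluster=my-cluster --version=4.7.2 \
    --schedule-date=2021-06-01 --schedule-time=02:00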
1.6.8. Windows Containers
Windows Containers are not available on Red Hat OpenShift Service on AWS at this time.
1.6.9. Container engine
Red Hat OpenShift Service on AWS runs on OpenShift 4 and uses CRI-O as the only available container engine.
1.6.10. Operating system
Red Hat OpenShift Service on AWS runs on OpenShift 4 and uses Red Hat CoreOS as the operating system for all control plane and worker nodes.
1.6.11. Kubernetes Operator support
All Operators listed in the Operator Hub marketplace should be available for installation. These operators are considered customer workloads, and are not monitored by Red Hat SRE.
1.7. Security
This section provides information about the service definition for Red Hat OpenShift Service on AWS security.
1.7.1. Authentication provider
Authentication for the cluster can be configured by using either the OpenShift Cluster Manager (OCM) cluster creation process or the rosa CLI. Red Hat OpenShift Service on AWS is not an identity provider, and all access to the cluster must be managed by the customer as part of their integrated solution. The use of multiple identity providers provisioned at the same time is supported. The following identity providers are supported:
- GitHub or GitHub Enterprise
- GitLab
- LDAP
- OpenID Connect
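For example, a GitHub identity provider can be added with the rosa CLI; the cluster name, organization, and credential values shown are placeholders.
$ rosa create idp --cluster=my-cluster --type=github \
    --client-id=<client_id> --client-secret=<client_secret> \
    --organizations=my-github-org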
1.7.2. Privileged containers
Privileged containers are available for users with the cluster-admin role. Usage of privileged containers as cluster-admin is subject to the responsibilities and exclusion notes in the Red Hat Enterprise Agreement Appendix 4 (Online Subscription Services).
1.7.3. Customer administrator user
In addition to normal users, Red Hat OpenShift Service on AWS provides access to a Red Hat OpenShift Service on AWS-specific group called dedicated-admin. Any users on the cluster that are members of the dedicated-admin group:
- Have administrator access to all customer-created projects on the cluster.
- Can manage resource quotas and limits on the cluster.
- Can add and manage NetworkPolicy objects.
- Are able to view information about specific nodes and PVs in the cluster, including scheduler information.
- Can access the reserved dedicated-admin project on the cluster, which allows for the creation of service accounts with elevated privileges and also gives the ability to update default limits and quotas for projects on the cluster.
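For example, a user can be added to this group with the rosa CLI; the user and cluster names are placeholders.
$ rosa grant user dedicated-admin --user=jane.doe --cluster=my-cluster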
1.7.4. Cluster administration role
The administrator of Red Hat OpenShift Service on AWS has default access to the cluster-admin role for your organization’s cluster. While logged into an account with the cluster-admin role, users have increased permissions to run privileged security contexts.
1.7.5. Project self-service
By default, all users have the ability to create, update, and delete their projects. This can be restricted if a member of the dedicated-admin group removes the self-provisioner role from authenticated users:
$ oc adm policy remove-cluster-role-from-group self-provisioner system:authenticated:oauth
Restrictions can be reverted by applying:
$ oc adm policy add-cluster-role-to-group self-provisioner system:authenticated:oauth
1.7.6. Regulatory compliance
See OpenShift Dedicated Process and Security Overview for the latest compliance information.
1.7.7. Network security
With Red Hat OpenShift Service on AWS, AWS provides standard DDoS protection on all load balancers, called AWS Shield. This provides 95% protection against the most commonly used layer 3 and 4 attacks on all of the public-facing load balancers used for Red Hat OpenShift Service on AWS. To provide additional protection, a 10-second timeout is applied: if an HTTP request to the haproxy router does not receive a response within 10 seconds, the connection is closed.
Chapter 2. Responsibility assignment matrix
This documentation outlines Red Hat, cloud provider, and customer responsibilities for the Red Hat OpenShift Service on AWS (ROSA) managed service.
2.1. Overview of responsibilities for Red Hat OpenShift Service on AWS
While Red Hat and Amazon Web Services (AWS) manage the Red Hat OpenShift Service on AWS service, the customer shares certain responsibilities. The Red Hat OpenShift Service on AWS services are accessed remotely, hosted on public cloud resources, created in customer-owned AWS accounts, and have underlying platform and data security that is owned by Red Hat.
If the cluster-admin role is added to a user, see the responsibilities and exclusion notes in the Red Hat Enterprise Agreement Appendix 4 (Online Subscription Services).
Resource | Incident and operations management | Change management | Identity and access management | Security and regulation compliance | Disaster recovery |
---|---|---|---|---|---|
Customer data | Customer | Customer | Customer | Customer | Customer |
Customer applications | Customer | Customer | Customer | Customer | Customer |
Developer services | Customer | Customer | Customer | Customer | Customer |
Platform monitoring | Red Hat | Red Hat | Red Hat | Red Hat | Red Hat |
Logging | Red Hat | Shared | Shared | Shared | Red Hat |
Application networking | Shared | Shared | Shared | Red Hat | Red Hat |
Cluster networking | Red Hat | Shared | Shared | Red Hat | Red Hat |
Virtual networking | Shared | Shared | Shared | Shared | Shared |
Master and infrastructure nodes | Red Hat | Red Hat | Red Hat | Red Hat | Red Hat |
Worker nodes | Red Hat | Red Hat | Red Hat | Red Hat | Red Hat |
Cluster version | Red Hat | Shared | Red Hat | Red Hat | Red Hat |
Capacity management | Red Hat | Shared | Red Hat | Red Hat | Red Hat |
Virtual storage | Red Hat and cloud provider | Red Hat and cloud provider | Red Hat and cloud provider | Red Hat and cloud provider | Red Hat and cloud provider |
Physical infrastructure and security | Cloud provider | Cloud provider | Cloud provider | Cloud provider | Cloud provider |
2.3. Customer responsibilities for data and applications
The customer is responsible for the applications, workloads, and data that they deploy to Red Hat OpenShift Service on AWS. However, Red Hat provides various tools to help the customer manage data and applications on the platform.
Resource | Red Hat responsibilities | Customer responsibilities |
---|---|---|
Customer data | | Maintain responsibility for all customer data stored on the platform and how customer applications consume and expose this data. |
Customer applications | | |
Developer services (CodeReady) | Make CodeReady Workspaces available as an add-on through OpenShift Cluster Manager (OCM). | Install, secure, and operate CodeReady Workspaces and the Developer CLI. |
Chapter 3. Understanding process and security for Red Hat OpenShift Service on AWS
This document details the Red Hat responsibilities for the managed Red Hat OpenShift Service on AWS (ROSA).
Acronyms and terms
- AWS - Amazon Web Services
- CEE - Customer Experience and Engagement (Red Hat Support)
- CI/CD - Continuous Integration / Continuous Delivery
- CVE - Common Vulnerabilities and Exposures
- OCM - OpenShift Cluster Manager
- PVs - Persistent Volumes
- ROSA - Red Hat OpenShift Service on AWS
- SRE - Red Hat Site Reliability Engineering
- VPC - Virtual Private Cloud
3.1. Incident and operations management
This documentation details the Red Hat responsibilities for the Red Hat OpenShift Service on AWS (ROSA) managed service.
3.1.1. Platform monitoring
Red Hat site reliability engineers (SREs) maintain a centralized monitoring and alerting system for all ROSA cluster components, the SRE services, and underlying AWS accounts. Platform audit logs are securely forwarded to a centralized security information and event monitoring (SIEM) system, where they may trigger configured alerts to the SRE team and are also subject to manual review. Audit logs are retained in the SIEM system for one year. Audit logs for a given cluster are not deleted at the time the cluster is deleted.
3.1.2. Incident management
An incident is an event that results in a degradation or outage of one or more Red Hat services. An incident can be raised by a customer or a Customer Experience and Engagement (CEE) member through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team.
Depending on the impact on the service and customer, the incident is categorized in terms of severity.
When managing a new incident, Red Hat uses the following general workflow:
- An SRE first responder is alerted to a new incident and begins an initial investigation.
- After the initial investigation, the incident is assigned an incident lead, who coordinates the recovery efforts.
- An incident lead manages all communication and coordination around recovery, including any relevant notifications and support case updates.
- The incident is recovered.
- The incident is documented and a root cause analysis (RCA) is performed within 3 business days of the incident.
- An RCA draft document will be shared with the customer within 7 business days of the incident.
3.1.3. Notifications
Platform notifications are configured using email. Some customer notifications are also sent to an account’s corresponding Red Hat account team, including a Technical Account Manager, if applicable.
The following activities can trigger notifications:
- Platform incident
- Performance degradation
- Cluster capacity warnings
- Critical vulnerabilities and resolution
- Upgrade scheduling
3.1.4. Backup and recovery
All Red Hat OpenShift Service on AWS clusters are backed up using AWS snapshots. Notably, this does not include customer data stored on persistent volumes (PVs). All snapshots are taken using the appropriate AWS snapshot APIs and are uploaded to a secure AWS S3 object storage bucket in the same account as the cluster.
Component | Snapshot frequency | Retention | Notes |
---|---|---|---|
Full object store backup, all SRE-managed cluster PVs | Daily | 7 days | This is a full backup of all Kubernetes objects, such as etcd, and all SRE-managed PVs in the cluster. |
Full object store backup, all SRE-managed cluster PVs | Weekly | 30 days | This is a full backup of all Kubernetes objects, such as etcd, and all SRE-managed PVs in the cluster. |
Full object store backup | Hourly | 24 hours | This is a full backup of all Kubernetes objects, such as etcd. No PVs are backed up in this backup schedule. |
Node root volume | Never | N/A | Nodes are considered to be short-term. Do not store anything critical on a node’s root volume. |
- The SRE rehearses recovery processes quarterly.
- Red Hat does not commit to any Recovery Point Objective (RPO) or Recovery Time Objective (RTO).
- Customers are responsible for taking regular backups of their data.
- Backups performed by the SRE are taken as a precautionary measure only. They are stored in the same region as the cluster.
- Customers can access the SRE backup data on request through a support case.
- Red Hat encourages customers to deploy multiple availability zone (multi-AZ) clusters with workloads that follow Kubernetes best practices to ensure high availability within a region.
- In the event an entire AWS region is unavailable, customers must install a new cluster in a different region and restore their apps using their backup data.
3.1.5. Cluster capacity
Evaluating and managing cluster capacity is a responsibility that is shared between Red Hat and the customer. Red Hat SRE is responsible for the capacity of all control plane and infrastructure nodes on the cluster.
Red Hat SRE also evaluates cluster capacity during upgrades and in response to cluster alerts. The impact of a cluster upgrade on capacity is evaluated as part of the upgrade testing process to ensure that capacity is not negatively impacted by new additions to the cluster. During a cluster upgrade, additional worker nodes are added to make sure that total cluster capacity is maintained during the upgrade process.
Capacity evaluations by the Red Hat SRE staff also happen in response to alerts from the cluster, after usage thresholds are exceeded for a certain period of time. Such alerts can also result in a notification to the customer.
3.2. Change management
This section describes the policies about how cluster changes, configuration changes, patches, and releases are managed.
Cluster changes are initiated in one of two ways:
- A customer initiates changes through self-service capabilities such as cluster deployment, worker node scaling, or cluster deletion.
- Red Hat site reliability engineering (SRE) initiates a change through Operator-driven capabilities, such as upgrades, patching, or configuration changes.
Change history is captured in the Cluster History section in the OpenShift Cluster Manager (OCM) Overview tab and is available to customers. The change history includes, but is not limited to, logs from the following changes:
- Adding or removing identity providers
- Adding or removing users to or from the dedicated-admins group
- Scaling the cluster compute nodes
- Scaling the cluster load balancer
- Scaling the cluster persistent storage
- Upgrading the cluster
The SRE-initiated changes that require manual intervention by SRE generally follow this process:
Preparing for change
- Change characteristics are identified and a gap analysis is performed against current state.
- Change steps are documented and validated.
- A communication plan and schedule are shared with all stakeholders.
- CI/CD and end-to-end tests are updated to automate change validation.
- A change request that captures change details is submitted for management approval.
Managing change
- Automated nightly CI/CD jobs pick up the change and run tests.
- The change is made to integration and stage environments, and manually validated before updating the customer cluster.
- Major change notifications are sent before and after the event.
Reinforcing the change
- Feedback on the change is collected and analyzed.
- Potential gaps are diagnosed to understand resistance and automate similar change requests.
- Corrective actions are implemented.
SRE only uses manual changes as a fallback process because manual intervention is considered to be a failure of change management.
3.2.1. Configuration management
The infrastructure and configuration of the Red Hat OpenShift Service on AWS environment is managed as code. SRE manages changes to the Red Hat OpenShift Service on AWS environment using a GitOps workflow and automated CI/CD pipeline.
Each proposed change undergoes a series of automated verifications immediately upon check-in. Changes are then deployed to a staging environment where they undergo automated integration testing. Finally, changes are deployed to the production environment. Each step is fully automated.
An authorized SRE reviewer must approve advancement to each step. The reviewer cannot be the same individual who proposed the change. All changes and approvals are fully auditable as part of the GitOps workflow.
3.2.2. Patch management
OpenShift Container Platform software and the underlying immutable Red Hat CoreOS (RHCOS) operating system image are patched for bugs and vulnerabilities in regular z-stream upgrades. Read more about RHCOS architecture in the OpenShift Container Platform documentation.
3.2.3. Release management
ROSA clusters can be configured for automatic upgrades on a schedule. Alternatively, you can perform manual upgrades by using the rosa CLI. For more details, see the Life Cycle policy.
Customers can review the history of all cluster upgrade events in their OCM web console on the Events tab.
3.3. Identity and access management
Most access by Red Hat site reliability engineering (SRE) teams is done using cluster Operators through automated configuration management.
3.3.1. SRE access to all Red Hat OpenShift Service on AWS clusters
SREs access Red Hat OpenShift Service on AWS clusters through the web console or command-line tools. Authentication requires multi-factor authentication (MFA) with industry-standard requirements for password complexity and account lockouts. SREs must authenticate as individuals to ensure auditability. All authentication attempts are logged to a Security Information and Event Management (SIEM) system.
SREs access private clusters using an encrypted tunnel through a hardened SRE support pod running in the cluster. Connections to the SRE support pod are permitted only from a secured Red Hat network using an IP allow-list. In addition to the cluster authentication controls described above, authentication to the SRE support pod is controlled by using SSH keys. SSH key authorization is limited to SRE staff and automatically synchronized with Red Hat corporate directory data. Corporate directory data is secured and controlled by HR systems, including management review, approval, and audits.
3.3.2. Privileged access controls in Red Hat OpenShift Service on AWS
SRE adheres to the principle of least privilege when accessing Red Hat OpenShift Service on AWS and AWS components. There are four basic categories of manual SRE access:
- SRE admin access through the Red Hat Portal with normal two-factor authentication and no privileged elevation.
- SRE admin access through the Red Hat corporate SSO with normal two-factor authentication and no privileged elevation.
- OpenShift elevation, which is a manual elevation using Red Hat SSO. Access is limited to 2 hours, is fully audited, and requires management approval.
- AWS access or elevation, which is a manual elevation for AWS console access. Access is limited to 60 minutes, is fully audited, and requires management approval.
Each of these access types have different levels of access to components:
Component | Typical SRE admin access (Red Hat Portal) | Typical SRE admin access (Red Hat SSO) | OpenShift elevation | Cloud provider access or elevation |
---|---|---|---|---|
OpenShift Cluster Manager (OCM) | R/W | No access | No access | No access |
OpenShift console | No access | R/W | R/W | No access |
Node operating system | No access | A specific list of elevated OS and network permissions. | A specific list of elevated OS and network permissions. | No access |
AWS Console | No access | No access, but this is the account used to request cloud provider access. | No access | All cloud provider permissions using the SRE identity. |
3.3.3. SRE access to AWS accounts
Red Hat personnel do not access AWS accounts in the course of routine Red Hat OpenShift Service on AWS operations. For emergency troubleshooting purposes, the SREs have well-defined and auditable procedures to access cloud infrastructure accounts.
SREs generate a short-lived AWS access token for the osdManagedAdminSRE user by using the AWS Security Token Service (STS). Access to the STS token is audit-logged and traceable back to individual users. The osdManagedAdminSRE user has the AdministratorAccess IAM policy attached.
3.3.4. Red Hat support access
Members of the Red Hat Customer Experience and Engagement (CEE) team typically have read-only access to parts of the cluster. Specifically, CEE has limited access to the core and product namespaces and does not have access to the customer namespaces.
Role | Core namespace | Layered product namespace | Customer namespace | AWS account* |
---|---|---|---|---|
OpenShift SRE | Read: All Write: Very limited [1] | Read: All Write: None | Read: None[2] Write: None | Read: All [3] Write: All [3] |
CEE | Read: All Write: None | Read: All Write: None | Read: None[2] Write: None | Read: None Write: None |
Customer administrator | Read: None Write: None | Read: None Write: None | Read: All Write: All | Read: All Write: All |
Customer user | Read: None Write: None | Read: None Write: None | Read: Limited[4] Write: Limited[4] | Read: None Write: None |
Everybody else | Read: None Write: None | Read: None Write: None | Read: None Write: None | Read: None Write: None |
- Limited to addressing common use cases such as failing deployments, upgrading a cluster, and replacing bad worker nodes.
- Red Hat associates have no access to customer data by default.
- SRE access to the AWS account is an emergency procedure for exceptional troubleshooting during a documented incident.
- Limited to what is granted through RBAC by the Customer Administrator, as well as namespaces created by the user.
3.3.5. Customer access
Customer access is limited to namespaces created by the customer and permissions that are granted using RBAC by the Customer Administrator role. Access to the underlying infrastructure or product namespaces is generally not permitted without cluster-admin access. More information on customer access and authentication can be found in the "Understanding Authentication" section of the documentation.
3.3.6. Access approval and review
New SRE user access requires management approval. Separated or transferred SRE accounts are removed as authorized users through an automated process. Additionally, the SRE performs periodic access review, including management sign-off of authorized user lists.
3.4. Security and regulation compliance
Security and regulation compliance includes tasks such as the implementation of security controls and compliance certification.
3.4.1. Data classification
Red Hat defines and follows a data classification standard to determine the sensitivity of data and highlight inherent risk to the confidentiality and integrity of that data while it is collected, used, transmitted, stored, and processed. Customer-owned data is classified at the highest level of sensitivity and handling requirements.
3.4.2. Data management
Red Hat OpenShift Service on AWS (ROSA) uses AWS KMS to help securely manage keys for encrypted data. These keys are used for control plane data volumes that are encrypted by default. Persistent volumes (PVs) for customer applications also use AWS KMS for key management.
When a customer deletes their ROSA cluster, all cluster data is permanently deleted, including control plane data volumes, customer application data volumes, such as PVs, and backup data.
3.4.3. Vulnerability management
Red Hat performs periodic vulnerability scanning of ROSA using industry standard tools. Identified vulnerabilities are tracked to their remediation according to timelines based on severity. Vulnerability scanning and remediation activities are documented for verification by third-party assessors in the course of compliance certification audits.
3.4.4. Network security
3.4.4.1. Firewall and DDoS protection
Each ROSA cluster is protected by a secure network configuration using firewall rules for AWS Security Groups. ROSA customers are also protected against DDoS attacks with AWS Shield Standard.
3.4.4.2. Private clusters and network connectivity
Customers can optionally configure their ROSA cluster endpoints, such as web console, API, and application router, to be made private so that the cluster control plane and applications are not accessible from the Internet. Red Hat SRE still requires Internet-accessible endpoints that are protected with IP allow-lists.
AWS customers can configure a private network connection to their ROSA cluster through technologies such as AWS VPC peering, AWS VPN, or AWS Direct Connect.
3.4.4.3. Cluster network access controls
Fine-grained network access control rules can be configured by customers, on a per-project basis, by using NetworkPolicy objects and the OpenShift SDN.
3.4.5. Penetration testing
Red Hat performs periodic penetration tests against ROSA. Tests are performed by an independent internal team by using industry standard tools and best practices.
Any issues that may be discovered are prioritized based on severity. Any issues found belonging to open source projects are shared with the community for resolution.
3.4.6. Compliance
ROSA follows common industry best practices for security and controls.
ROSA is certified for SOC 2 Type I and ISO 27001.
3.5. Disaster recovery
Red Hat OpenShift Service on AWS (ROSA) provides disaster recovery for failures that occur at the pod, worker node, infrastructure node, master node, and availability zone levels.
All disaster recovery requires that the customer use best practices for deploying highly available applications, storage, and cluster architecture, such as single-zone deployment or multi-zone deployment, to account for the level of desired availability.
One single-zone cluster will not provide disaster avoidance or recovery in the event of an availability zone or region outage. Multiple single-zone clusters with customer-maintained failover can account for outages at the zone or at the regional level.
One multi-zone cluster will not provide disaster avoidance or recovery in the event of a full region outage. Multiple multi-zone clusters with customer-maintained failover can account for outages at the regional level.
3.6. Additional resources
- For more information about customer or shared responsibilities, see the ROSA Responsibilities document.
- For more information about ROSA and its components, see the ROSA Service Definition.
Chapter 4. About availability for Red Hat OpenShift Service on AWS
Availability and disaster avoidance are extremely important aspects of any application platform. Although Red Hat OpenShift Service on AWS (ROSA) provides many protections against failures at several levels, customer-deployed applications must be appropriately configured for high availability. To account for outages that might occur with cloud providers, additional options are available such as deploying a cluster across multiple availability zones and maintaining multiple clusters with failover mechanisms.
4.1. Potential points of failure
Red Hat OpenShift Service on AWS (ROSA) provides many features and options for protecting your workloads against downtime, but applications must be architected appropriately to take advantage of these features.
ROSA can help further protect you against many common Kubernetes issues by adding Red Hat site reliability engineering (SRE) support and the option to deploy a multiple availability zone cluster, but there are a number of ways in which a container or infrastructure can still fail. By understanding potential points of failure, you can understand risks and appropriately architect both your applications and your clusters to be as resilient as necessary at each specific level.
An outage can occur at several different levels of infrastructure and cluster components.
4.1.1. Container or pod failure
By design, pods are meant to exist for a short time. Appropriately scaling services so that multiple instances of your application pods are running can protect against issues with any individual pod or container. The OpenShift node scheduler can also make sure these workloads are distributed across different worker nodes to further improve resiliency.
When accounting for possible pod failures, it is also important to understand how storage is attached to your applications. Single persistent volumes attached to single pods cannot leverage the full benefits of pod scaling, whereas replicated databases, database services, or shared storage can.
To avoid disruption to your applications during planned maintenance, such as upgrades, it is important to define a pod disruption budget. Pod disruption budgets are part of the Kubernetes API and can be managed with oc commands like other object types. They allow for the specification of safety constraints on pods during operations, such as draining a node for maintenance.
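A minimal pod disruption budget sketch that keeps at least two replicas of an application available during voluntary disruptions follows; the name and label selector are illustrative, and the policy/v1 API requires OpenShift 4.8 or later (earlier releases use policy/v1beta1).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: frontend-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: frontend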
4.1.2. Worker node failure
Worker nodes are the virtual machines that contain your application pods. By default, a ROSA cluster has a minimum of two worker nodes for a single availability-zone cluster. In the event of a worker node failure, pods are relocated to functioning worker nodes, as long as there is enough capacity, until any issue with an existing node is resolved or the node is replaced. More worker nodes means more protection against single-node outages, and ensures proper cluster capacity for rescheduled pods in the event of a node failure.
When accounting for possible node failures, it is also important to understand how storage is affected. EFS volumes are not affected by node failure. However, EBS volumes are not accessible if they are connected to a node that fails.
4.1.3. Cluster failure
ROSA clusters have at least three control plane nodes and three infrastructure nodes that are preconfigured for high availability, either in a single zone or across multiple zones, depending on the type of cluster you have selected. Control plane and infrastructure nodes have the same resiliency as worker nodes, with the added benefit of being managed completely by Red Hat.
In the event of a complete control plane outage, the OpenShift APIs will not function, and existing worker node pods are unaffected. However, if there is also a pod or node outage at the same time, the control planes must recover before new pods or nodes can be added or scheduled.
All services running on infrastructure nodes are configured by Red Hat to be highly available and distributed across infrastructure nodes. In the event of a complete infrastructure outage, these services are unavailable until these nodes have been recovered.
4.1.4. Zone failure
A zone failure from AWS affects all virtual components, such as worker nodes, block or shared storage, and load balancers that are specific to a single availability zone. To protect against a zone failure, ROSA provides the option for clusters that are distributed across three availability zones, known as multiple availability zone clusters. Existing stateless workloads are redistributed to unaffected zones in the event of an outage, as long as there is enough capacity.
4.1.5. Storage failure
If you have deployed a stateful application, then storage is a critical component and must be accounted for when thinking about high availability. A single block storage PV is unable to withstand outages even at the pod level. The best ways to maintain availability of storage are to use replicated storage solutions, shared storage that is unaffected by outages, or a database service that is independent of the cluster.
Chapter 5. Red Hat OpenShift Service on AWS update life cycle
5.1. Overview
Red Hat provides a published product life cycle for Red Hat OpenShift Service on AWS in order for customers and partners to effectively plan, deploy, and support their applications running on the platform. Red Hat publishes this life cycle in order to provide as much transparency as possible and may make exceptions from these policies as conflicts arise.
Red Hat OpenShift Service on AWS is a managed instance of Red Hat OpenShift and maintains an independent release schedule. More details about the managed offering can be found in the Red Hat OpenShift Service on AWS service definition. The availability of Security Advisories and Bug Fix Advisories for a specific version are dependent upon the Red Hat OpenShift Container Platform life cycle policy and subject to the Red Hat OpenShift Service on AWS maintenance schedule.
5.2. Definitions
Table 5.1. Version reference
Version format | Major | Minor | Patch | Major.minor.patch |
---|---|---|---|---|
| x | y | z | x.y.z |
Example | 4 | 5 | 21 | 4.5.21 |
- Major releases or X-releases
Referred to only as major releases or X-releases (X.y.z).
Example
- "Major release 5" → 5.y.z
- "Major release 4" → 4.y.z
- "Major release 3" → 3.y.z
- Minor releases or Y-releases
Referred to only as minor releases or Y-releases (x.Y.z).
Example
- "Minor release 4" → 4.4.z
- "Minor release 5" → 4.5.z
- "Minor release 6" → 4.6.z
- Patch releases or Z-releases
Referred to only as patch releases or Z-releases (x.y.Z).
Example
- "Patch release 14 of Minor release 5" → 4.5.14
- "Patch release 25 of Minor release 5" → 4.5.25
- "Patch release 26 of Minor release 6" → 4.6.26
5.3. Major versions X.y.z
Major versions of Red Hat OpenShift Service on AWS, for example version 4, are supported for one year following the release of a subsequent major version or the retirement of the product.
Example
- If version 5 were made available on Red Hat OpenShift Service on AWS on January 1, version 4 would be allowed to continue running on managed clusters for 12 months, until December 31. After this time, clusters would need to be upgraded or migrated to version 5.
5.4. Minor versions x.Y.z
Red Hat supports two minor versions of the major release.
- Y: The latest available minor release. For example, 4.8.
- Y-1: The previous minor version. For example, 4.7.
After an upgrade path from the previous minor version (Y-1) to the latest minor version (Y) is available, clusters running Y-2 must be upgraded within a 30-day grace period. Any cluster remaining on Y-2 30 days after notification of upgrade availability is classified as being in a limited support status until the cluster is upgraded to a supported release.
Example
- A customer’s cluster is currently running on 4.5.18. The latest version for 4.6 is 4.6.27.
- On February 25, 4.7.2 is released as an available upgrade path from 4.6.27 and the customer is notified.
- The cluster must be upgraded to 4.6.27 or later by March 25.
- If the upgrade has not been performed, then the cluster will have SRE alerting disabled and will be unsupported until it is upgraded to 4.6.27 or later.
5.5. Patch versions x.y.Z
During the period in which a minor release is supported, all OpenShift Container Platform patch releases will be supported unless otherwise specified.
For reasons of platform security and stability, a patch release might be deprecated, which would prevent installations of that release and trigger mandatory upgrades off that release.
Example
- 4.7.6 is found to contain a critical CVE.
- Any releases impacted by the CVE will be removed from the supported patch release list. In addition, any clusters running 4.7.6 will be scheduled for automatic upgrades within 48 hours.
5.6. Limited support status
While operating outside of the supported versions list, you might be asked to upgrade the cluster to a supported version when requesting support, unless you are within the 30-day grace period after version deprecation. Additionally, Red Hat does not make any runtime or SLA guarantees for clusters outside of the supported versions list at the end of the 30-day grace period.
Red Hat will provide best effort to ensure an upgrade path from an unsupported release to a supported release is available. However, if a supported upgrade path is no longer available, you might be required to create a new cluster and migrate your workloads.
5.7. Supported versions exception policy
Red Hat reserves the right to add or remove new or existing versions, or delay upcoming minor release versions, that have been identified to have one or more critical production impacting bugs or security issues without advance notice.
5.8. Install policy
While Red Hat recommends installing the latest supported release, Red Hat OpenShift Service on AWS supports installation of any supported release as covered by the preceding policy.
5.9. Mandatory upgrades
In the event that a Critical or Important CVE, or other bug identified by Red Hat, significantly impacts the security or stability of the cluster, the customer must upgrade to the next supported patch release within 48 hours.
In extreme circumstances and based on Red Hat’s assessment of the CVE criticality to the environment, if the upgrade to the next supported patch release has not been performed within 48 hours of notification, the cluster will be automatically updated to the latest patch release to mitigate potential security breach or instability.
5.10. Life cycle dates
Version | General availability | End of life |
---|---|---|
4.7 | Mar 24, 2021 | Release of 4.9 + 30 days |