Summary of the OpenShift Online Pro Service Disruption in the N. Virginia (pro-us-east-1) Region

We’d like to give you some additional information about the service disruption that occurred in the N. Virginia (pro-us-east-1) Region on the evening of October 19th, 2017. The OpenShift Online Pro team was debugging an issue that allowed users to use services that they had not purchased. At 3:51PM EST, an authorized OpenShift Online team member used an established script to execute a command to remove users who had created an account without purchasing. Unfortunately, one of the inputs to the command was entered incorrectly, and a larger set of users and their projects was removed than intended. For this set of users, the script immediately removed all projects, including deployment configurations, build configurations, customer images, and persistent volumes.

The OpenShift Online SRE team reacted quickly to restore projects, deployments, builds, and images from an etcd snapshot taken on October 19, 2017 at 3:00PM EST. We removed web console access at 4:52PM EST to prevent customer usage from impacting the restoration of persistent volumes. The most recent snapshots of the AWS EBS volumes backing OpenShift persistent volumes were taken on October 18, 2017 between 8:00PM EST and 8:11PM EST. All projects and persistent volumes were restored by 9:04PM EST. Web console access was restored at 9:07PM EST.
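
To make the restore points above concrete, the following minimal sketch computes how far each snapshot lagged behind the removal at 3:51PM EST. It assumes all times are in the same time zone and treats the removal time as the moment of loss; the variable and function names are illustrative only and are not part of any OpenShift tooling.

```python
from datetime import datetime, timedelta

# Times taken from the incident summary (all assumed to be in the same time zone).
removal_time = datetime(2017, 10, 19, 15, 51)        # command executed
etcd_snapshot = datetime(2017, 10, 19, 15, 0)        # etcd snapshot used for restore
ebs_snapshot_start = datetime(2017, 10, 18, 20, 0)   # earliest EBS volume snapshot
ebs_snapshot_end = datetime(2017, 10, 18, 20, 11)    # latest EBS volume snapshot


def exposure(snapshot_time: datetime, event_time: datetime) -> timedelta:
    """Return how much time elapsed between a snapshot and the event."""
    return event_time - snapshot_time


etcd_gap = exposure(etcd_snapshot, removal_time)
ebs_gap_max = exposure(ebs_snapshot_start, removal_time)
ebs_gap_min = exposure(ebs_snapshot_end, removal_time)

print("etcd gap:", etcd_gap)
print("EBS gap (worst case):", ebs_gap_max)
print("EBS gap (best case):", ebs_gap_min)
```

Under these assumptions, project metadata could be up to 51 minutes stale, while persistent volume data could be between 19 hours 40 minutes and 19 hours 51 minutes stale.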

We are making several changes to prevent this from occurring again. Removal of abusing users will continue to be part of our operations management, but we have modified the tool we use to remove users so that it strictly matches each user to the projects that user owns. This will prevent a similar event in the future. We are also ensuring that all our other operational tools have similar safety checks. We have various staging environments that test the impact of these scripts before they are able to run in production; however, our staging environments were not considered persistent, and project loss was not monitored. We are promoting one staging environment to a long-lived, persistent environment where such project loss would have been more evident. We are also reviewing our storage volume retention policies to ensure that we take snapshots more often than every 24 hours, reducing customer data loss in a disaster situation.
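
As an illustration of the kind of safety check described above, here is a minimal sketch of strict user-to-project matching with a cap on how many users one run may affect. This summary does not describe the actual tool's internals, so everything here (the `Project` type, `plan_removal`, `RemovalRefused`, and the `max_users` limit) is a hypothetical assumption rather than the real implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Project:
    """Hypothetical record pairing a project with its single owner."""
    name: str
    owner: str


class RemovalRefused(Exception):
    """Raised when the input looks unsafe and nothing should be removed."""


def plan_removal(target_users: Iterable[str],
                 projects: Iterable[Project],
                 max_users: int = 10) -> List[Project]:
    """Return the projects to remove, or raise if the input is suspicious.

    Matching is by exact owner name only: no prefixes, patterns, or
    empty strings that could silently widen the selection.
    """
    targets = [u for u in target_users]
    if not targets:
        raise RemovalRefused("no users given")
    if any(not u.strip() for u in targets):
        raise RemovalRefused("blank user name in input")
    target_set = set(targets)
    if len(target_set) > max_users:
        raise RemovalRefused(
            f"{len(target_set)} users exceeds limit of {max_users}")
    # Exact equality on the owner field: "alice" never matches "alice2".
    return [p for p in projects if p.owner in target_set]


projects = [
    Project("blog", "alice"),
    Project("shop", "alice2"),
    Project("api", "bob"),
]
plan = plan_removal(["alice"], projects)
print([p.name for p in plan])
```

The function only produces a plan; in a design like this, an operator could review the planned list before any removal is carried out.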

From the beginning of this event, we wanted to ensure our customers were aware of the incident. We first notified customers via our Status Page (https://status.pro.openshift.com) at 4:15PM EST, and continued to provide updates roughly every hour until 9:07PM EST, when the incident was resolved. Additionally, we sent an email to all affected users at 6:20PM EST. A resolution email was sent at 9:10PM EST.

Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our record of availability with OpenShift Online Pro, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.