AWS ROSA Cluster Build - image-registry is not available


We have been trying to deploy a ROSA cluster into our AWS STS environment for a couple of weeks now. Most of the issues have been on the networking side and with the company's implementation of specific Terraform code sets, such as whitelisting S3 (that caught us out!).

This latest issue, though, has us completely lost. The cluster builds, the bootstrap node does its work and finally destroys itself, and the master nodes take over the grunt work. They eventually fail near the end of the build with the following error:

"the cluster operator image-registry is not available"

There are some interesting observations from the log files, such as:

level=info msg=Cluster operator image-registry Progressing is True with Error: Progressing: Unable to apply resources: unable to sync storage configuration: AccessDenied: Access Denied
level=info msg=Progressing: \tstatus code: 403, request id: WAKY0VN0G4SYMCGD, host id: 4q8tIjVC7fZTu8zQ3PxWPGvU33LG2aHvWhecWyR5T3KNHIFZxlmjrplH+3Zal8Gmb63dRRN9kGSDo/mzKKkbYA==
level=info msg=NodeCADaemonProgressing: The daemon set node-ca is deployed
level=error msg=Cluster operator image-registry Degraded is True with Unavailable: Degraded: The deployment does not exist
level=info msg=Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:
level=info msg=Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer
level=info msg=Cluster operator insights Disabled is False with AsExpected:
level=info msg=Cluster operator insights SCAAvailable is True with Updated: SCA certs successfully updated in the etc-pki-entitlement secret
level=info msg=Cluster operator network ManagementStateDegraded is False with :
level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=error msg=failed to initialize the cluster: Cluster operator image-registry is not available " installID=8fttkm5k
time="2023-07-20T19:11:10Z" level=debug msg="no additional log fields found" installID=8fttkm5k
time="2023-07-20T19:11:10Z" level=error msg="failed due to install error" error="exit status 6" installID=8fttkm5k
time="2023-07-20T19:11:10Z" level=fatal msg="runtime error" error="exit status 6"

This suggests that something within the cluster potentially does not have access to the PVC it requires.
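For anyone wading through a similar install log, here is a minimal Python sketch that pulls out the operator entries that look unhealthy. The names `OPERATOR_ENTRY` and `failing_operators` are made up for illustration, and the "unhealthy" rule (error level, a reason of Error, or Degraded being True) is only my own heuristic, not anything defined by the installer.

```python
import re

# One "Cluster operator" entry, in the shape seen in the installer output:
#   level=<level> msg=Cluster operator <name> <Condition> is <Status> with <Reason>: <message>
OPERATOR_ENTRY = re.compile(
    r"level=(?P<level>\w+) msg=Cluster operator (?P<operator>\S+) "
    r"(?P<condition>\w+) is (?P<status>\w+) with (?P<reason>[^:]*): ?(?P<message>.*)"
)


def failing_operators(log_text):
    """Return (operator, condition, reason, message) tuples for suspicious entries."""
    findings = []
    # Split before every "level=... msg=" so this works whether the log has
    # one entry per line or has been pasted as a single long line.
    for chunk in re.split(r"(?=level=\w+ msg=)", log_text):
        m = OPERATOR_ENTRY.search(chunk)
        if m is None:
            continue
        # Heuristic (an assumption): what counts as "worth a look".
        unhealthy = (
            m["level"] == "error"
            or m["reason"] == "Error"
            or (m["condition"] == "Degraded" and m["status"] == "True")
        )
        if unhealthy:
            findings.append(
                (m["operator"], m["condition"], m["reason"], m["message"].strip())
            )
    return findings


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < install.log
    for operator, condition, reason, message in failing_operators(sys.stdin.read()):
        print(f"{operator}: {condition} ({reason}) {message}")
```

Run against the excerpt above, this should surface only the two image-registry entries, which at least narrows the search to the registry's storage configuration and the 403 it receives.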

It's all very difficult to troubleshoot. I can't run oc login commands from our bastion, where we run the rosa CLI, because the cluster is not in a state to receive setup commands, and I have no way as yet to SSH across to the master nodes. We are seemingly completely blocked, having spent hours upon hours on Red Hat, AWS, Google and Stack sites trying to resolve this.

Two more points to add. We have checked the roles that ROSA uses; they did have a network deny policy against them, but I have whitelisted them and it made no difference.

I'm also getting emails from Red Hat:

"Your cluster's installation is blocked because of the missing route to internet in the route table(s) associated with the supplied subnet(s) for cluster installation. Please review and validate the routes by following documentation and re-install the cluster: "

But there is no real information here to help; we have added everything from the pre-reqs to our egress proxy.
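One way to sanity-check the route table complaint offline is to dump the route tables for the VPC (for example with `aws ec2 describe-route-tables`) and look for subnets with no active default route. The sketch below is only that, a sketch: it assumes the input is the `RouteTables` list from that output for a single VPC, it treats any active `0.0.0.0/0` route as a "route to internet" (which may not be exactly what Red Hat's check looks for), and the function names and IDs are hypothetical.

```python
def has_default_route(route_table):
    """True if the route table contains an active 0.0.0.0/0 route."""
    for route in route_table.get("Routes", []):
        if (
            route.get("DestinationCidrBlock") == "0.0.0.0/0"
            and route.get("State", "active") == "active"
        ):
            return True
    return False


def subnets_without_default_route(route_tables, subnet_ids):
    """Return the subnet IDs whose effective route table lacks a default route.

    Assumes route_tables all belong to one VPC. A subnet with no explicit
    association falls back to that VPC's main route table.
    """
    explicit = {}
    main_table = None
    for table in route_tables:
        for association in table.get("Associations", []):
            if association.get("Main"):
                main_table = table
            if "SubnetId" in association:
                explicit[association["SubnetId"]] = table

    missing = []
    for subnet_id in subnet_ids:
        table = explicit.get(subnet_id, main_table)
        if table is None or not has_default_route(table):
            missing.append(subnet_id)
    return missing
```

With the parsed JSON loaded into `data`, something like `subnets_without_default_route(data["RouteTables"], ["subnet-aaa", "subnet-bbb"])` (placeholder subnet IDs) would list the subnets worth a closer look.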

Any ideas would be great!

Darren

Responses