Scaling MachineSet with acceleratedNetworking is causing panic in machine-controller on OpenShift Container Platform 4

Solution Verified - Updated -

Issue

  • A MachineSet was recently changed to enable accelerated networking. The change mostly worked, but some Machines became stuck during provisioning. In the openshift-machine-api namespace, the machine-controller was found to be crashing in a loop with the stack trace below. The only way to resolve it was to disable acceleratedNetworking and change the vmSize.

    E1003 07:32:21.097221       1 controller.go:129] Unable to set scale from zero annotations: unknown instance type: %sStandard_D8s_v5
    E1003 07:32:21.097234       1 controller.go:130] Autoscaling from zero will not work. To fix this, manually populate machine annotations for your instance type: %v[machine.openshift.io/vCPU machine.openshift.io/memoryMb machine.openshift.io/GPU]
    I1003 07:32:21.097505       1 logr.go:252] events "msg"="Warning"  "message"="Failed to set autoscaling from zero annotations, instance type unknown" "object"={"kind":"MachineSet","namespace":"openshift-machine-api","name":"ci-ln-14rihvk-1d09d-q7862-worker-nic-centralus1","uid":"9c96244b-70db-4a43-bae0-ef3db1881b5e","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"32999"} "reason"="FailedUpdate"
    I1003 07:32:21.119355       1 controller.go:72] controllers/MachineSet "msg"="Reconciling" "machineset"="ci-ln-14rihvk-1d09d-q7862-worker-centralus1" "namespace"="openshift-machine-api" 
    I1003 07:32:21.135903       1 controller.go:72] controllers/MachineSet "msg"="Reconciling" "machineset"="ci-ln-14rihvk-1d09d-q7862-worker-centralus2" "namespace"="openshift-machine-api" 
    I1003 07:32:21.174299       1 reflector.go:219] Starting reflector *v1.Secret (9m50.552991678s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250
    I1003 07:32:21.174318       1 reflector.go:255] Listing and watching *v1.Secret from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250
    I1003 07:32:21.583232       1 reconciler.go:404] Provisioning state is 'Succeeded' for machine ci-ln-14rihvk-1d09d-q7862-worker-centralus3-fk8wc
    I1003 07:32:21.583258       1 controller.go:319] ci-ln-14rihvk-1d09d-q7862-worker-centralus3-fk8wc: reconciling machine triggers idempotent update
    I1003 07:32:21.583264       1 actuator.go:173] Updating machine ci-ln-14rihvk-1d09d-q7862-worker-centralus3-fk8wc
    I1003 07:32:21.738375       1 machine_scope.go:176] ci-ln-14rihvk-1d09d-q7862-worker-centralus3-fk8wc: status unchanged
    I1003 07:32:21.738424       1 machine_scope.go:192] ci-ln-14rihvk-1d09d-q7862-worker-centralus3-fk8wc: patching machine
    I1003 07:32:21.798872       1 controller.go:175] ci-ln-14rihvk-1d09d-q7862-worker-nic-centralus1-9mw88: reconciling Machine
    I1003 07:32:21.798896       1 actuator.go:213] ci-ln-14rihvk-1d09d-q7862-worker-nic-centralus1-9mw88: actuator checking if machine exists
    W1003 07:32:21.885062       1 virtualmachines.go:93] vm ci-ln-14rihvk-1d09d-q7862-worker-nic-centralus1-9mw88 not found: %!w(string=compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Compute/virtualMachines/ci-ln-14rihvk-1d09d-q7862-worker-nic-centralus1-9mw88' under resource group 'ci-ln-14rihvk-1d09d-q7862-rg' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix")
    I1003 07:32:21.885098       1 controller.go:386] ci-ln-14rihvk-1d09d-q7862-worker-nic-centralus1-9mw88: reconciling machine triggers idempotent create
    I1003 07:32:21.885107       1 actuator.go:85] Creating machine ci-ln-14rihvk-1d09d-q7862-worker-nic-centralus1-9mw88
    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x18aa08e]
    
    goroutine 509 [running]:
    github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine.(*Reconciler).createNetworkInterface(0xc0000b4100, {0x1fa31c8, 0xc00018e310}, {0xc000d1c100, 0x39})
        /go/src/github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine/reconciler.go:508 +0x1ee
    github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine.(*Reconciler).CreateMachine(0xc0000b4100, {0x1fa31c8, 0xc00018e310})
        /go/src/github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine/reconciler.go:119 +0x105
    github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine.(*Reconciler).Create(0xc0000b4100, {0x1fa31c8, 0xc00018e310})
        /go/src/github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine/reconciler.go:97 +0x46
    github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine.(*Actuator).Create(0xc0005a1d10, {0x1, 0x1}, 0xc000b1c000)
        /go/src/github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine/actuator.go:96 +0x2c5
    github.com/openshift/machine-api-operator/pkg/controller/machine.(*ReconcileMachine).Reconcile(0xc000119040, {0x1fa3238, 0xc000c88ab0}, {{{0xc000620540, 0x1c20180}, {0xc0008a2840, 0x30}}})
        /go/src/github.com/openshift/machine-api-provider-azure/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go:387 +0xab4
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc0000c0160, {0x1fa3238, 0xc000c88a80}, {{{0xc000620540, 0x1c20180}, {0xc0008a2840, 0x413af4}}})
        /go/src/github.com/openshift/machine-api-provider-azure/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114 +0x26f
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0000c0160, {0x1fa3190, 0xc00081db80}, {0x1b18d80, 0xc000144860})
        /go/src/github.com/openshift/machine-api-provider-azure/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311 +0x33e
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0000c0160, {0x1fa3190, 0xc00081db80})
        /go/src/github.com/openshift/machine-api-provider-azure/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x205
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        /go/src/github.com/openshift/machine-api-provider-azure/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x85
    created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        /go/src/github.com/openshift/machine-api-provider-azure/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x357
    
  • Scaling MachineSet with acceleratedNetworking is causing panic in machine-controller on OpenShift Container Platform 4
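
The workaround mentioned in the first item above (disabling acceleratedNetworking and changing the vmSize) maps to two fields in the providerSpec of an Azure MachineSet. The fragment below is a minimal sketch of that change, not a verified fix. The replacement VM size is a placeholder, because the report does not say which size was chosen.

```yaml
# Fragment of an Azure MachineSet (machine.openshift.io/v1beta1).
# Only the two fields named in the workaround are shown.
spec:
  template:
    spec:
      providerSpec:
        value:
          acceleratedNetworking: false   # the panic was seen with accelerated networking enabled
          vmSize: <other_vm_size>        # placeholder; the log above mentions Standard_D8s_v5
```

Edits to a MachineSet affect only Machines created afterwards, so Machines already stuck in provisioning would likely need to be deleted and recreated by the MachineSet.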

Environment

  • Red Hat OpenShift Container Platform (RHOCP) 4
