Red Hat Ceph Storage 5, Alertmanager stuck in state <deleting>

Environment

  • Red Hat Ceph Storage 5

Issue

  • Alertmanager stuck in state <deleting>

Resolution

  • free up space on any full file system to ensure the Ceph daemons can function properly
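
    For example, to spot full file systems and a typical cleanup candidate on a Ceph node (assuming the default log location /var/log/ceph; adapt the paths to your environment):

    $ df -h
    $ du -sh /var/log/ceph/*
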
  • identify any blocking process on all cluster nodes by looking for a process that is waiting on input

    $ ps -ef | grep "import sys;exec(eval(sys.stdin.readline()))"
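
    A minimal sketch to run this check across all cluster nodes at once, assuming passwordless ssh to the hosts and jq being available (hostnames are taken from the orchestrator's host list):

    $ for host in $(cephadm shell ceph orch host ls --format json | jq -r '.[].hostname'); do \
        ssh "${host}" 'hostname; ps -ef | grep "[i]mport sys;exec(eval(sys.stdin.readline()))"'; \
      done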
    
  • kill these processes to unblock the orchestrator

    $ kill -9 <pid-of-process-found>
    

    Note: killing the process leaves an orphaned (defunct) ssh process behind. This process can only be removed by restarting the affected ceph-mgr daemon.
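
    A minimal sketch of such a restart (the daemon name mgr.node1.asyvoa is only an example taken from the debug output below; use the mgr daemon name reported in your cluster):

    $ cephadm shell ceph orch ps --daemon_type mgr
    $ cephadm shell ceph orch daemon restart mgr.node1.asyvoa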

  • ensure that proper monitoring of all file systems is in place so that full file systems are detected early
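
    A minimal sketch of such a check that could be run periodically (e.g. from cron), assuming a 90% usage threshold; ideally this is integrated into the existing monitoring/alerting stack instead:

    $ df -h --output=target,pcent | awk 'NR>1 && $2+0 >= 90 {print "WARNING: "$1" is "$2" full"}'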

Root Cause

  • a full file system blocks the ceph process from finishing; enable cephadm debug logging to observe the failing deployment attempt

    $ ceph config set mgr mgr/cephadm/log_to_cluster_level debug
    $ cephadm shell ceph -W cephadm --watch-debug
    [... output omitted ...]
    2022-12-19T06:48:48.851347-0500 mgr.node1.asyvoa [DBG] alertmanager.node1 container image registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.10
    2022-12-19T06:48:48.851396-0500 mgr.node1.asyvoa [DBG] args: --image registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.10 deploy --fsid 12381186-4094-11ed-8151-525400db0519 --name alertmanager.node1 --meta-json {"service_name": "alertmanager", "ports": [9093, 9094], "ip": null, "deployed_by": ["registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:31fbe18b6f81c53d21053a4a0897bc3875e8ee8ec424393e4d5c3c3afd388274", "registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:3075e8708792ebd527ca14849b6af4a11256a3f881ab09b837d7af0f8b2102ea"], "rank": null, "rank_generation": null, "extra_container_args": null} --config-json - --tcp-ports 9093 9094
    2022-12-19T06:48:48.851449-0500 mgr.node1.asyvoa [DBG] stdin: {"files": {"alertmanager.yml": "# This file is generated by cephadm.\n# See https://prometheus.io/docs/alerting/configuration/ for documentation.\n\nglobal:\n  resolve_timeout: 5m\n  http_config:\n    tls_config:\n      insecure_skip_verify: true\n\nroute:\n  receiver: 'default'\n  routes:\n    - group_by: ['alertname']\n      group_wait: 10s\n      group_interval: 10s\n      repeat_interval: 1h\n      receiver: 'ceph-dashboard'\n\nreceivers:\n- name: 'default'\n  webhook_configs:\n- name: 'ceph-dashboard'\n  webhook_configs:\n  - url: 'http://host.containers.internal:8081/api/prometheus_receiver'\n"}, "peers": ["host.containers.internal:9094"]}
    2022-12-19T06:48:50.821498-0500 mgr.node1.asyvoa [DBG] code: 0
    2022-12-19T06:48:50.821574-0500 mgr.node1.asyvoa [DBG] err: Redeploy daemon alertmanager.node1 ...
    2022-12-19T06:48:50.835695-0500 mgr.node1.asyvoa [DBG] mon_command: 'dashboard get-alertmanager-api-host' -> 0 in 0.001s
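
    Once the analysis is finished, the debug level can be reverted again, for example:

    $ ceph config rm mgr mgr/cephadm/log_to_cluster_level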
    
  • the stdin of the ssh shell does not return as expected and the process is never terminated correctly

    $ ps -ef | grep "import sys;exec(eval(sys.stdin.readline()))" 
    ceph       95060    5660  0 06:42 ?        00:00:00 ssh -C -F /tmp/cephadm-conf-qik7d34w -i /tmp/cephadm-identity-kx6idvdt -o ServerAliveInterval=7 -o ServerAliveCountMax=3 cephorch@node1 sudo python3 -c "import sys;exec(eval(sys.stdin.readline()))"
    
  • in addition, killing the process leaves an orphaned (defunct) ssh process in the process list

    $ ps -ef | grep defunct
    ceph        7965    5660  0 03:53 ?        00:00:00 [ssh] <defunct>
    ceph       69381    5660  0 05:38 ?        00:00:00 [ssh] <defunct>
    ceph       81785    5660  0 05:54 ?        00:00:00 [ssh] <defunct>
    ceph       95060    5660  0 06:42 ?        00:00:00 [ssh] <defunct>
    

Diagnostic Steps

  • checking the cluster health on the host that has no space left results in an error

    $ cephadm shell ceph health
    Inferring fsid 12381186-4094-11ed-8151-525400db0519
    Using recent ceph image registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:31fbe18b6f81c53d21053a4a0897bc3875e8ee8ec424393e4d5c3c3afd388274
    Error: lsetxattr /var/log/ceph/12381186-4094-11ed-8151-525400db0519: no space left on device
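
    On that host, confirm which file system has run full (a quick check, assuming the default log location /var/log/ceph referenced in the error above):

    $ df -h /var/log/ceph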
    
  • checking the cluster health from a different host reports HEALTH_OK

    $ cephadm shell ceph health
    HEALTH_OK
    
  • try removing the stale service

    $ cephadm shell ceph orch rm alertmanager
    Removed service alertmanager
    
  • check the state of the service, which is stuck in <deleting>

    $ cephadm shell ceph orch ls --refresh
    NAME                   PORTS        RUNNING  REFRESHED   AGE  PLACEMENT        
    alertmanager           ?:9093,9094      0/1  <deleting>  3m   count:1          
    [... output omitted ...]
    
  • check all cluster nodes for a process that is blocking on input

    $ ps -ef | grep ceph 
     ssh -C -F /tmp/cephadm-conf-qik7d34w -i /tmp/cephadm-identity-kx6idvdt -o ServerAliveInterval=7 -o ServerAliveCountMax=3 cephorch@node1 sudo python3 -c "import sys;exec(eval(sys.stdin.readline()))"
    
  • kill the blocking process identified before, i.e. the process running import sys;exec(eval(sys.stdin.readline()))

    $ kill -9 <pid-of-process>
    
  • redeploy the service that was blocked

    $ cephadm shell ceph orch apply alertmanager 
    Scheduled alertmanager update...
    
  • verify that the service has been deployed correctly

    $ cephadm shell ceph orch ps --refresh
    NAME                            HOST             PORTS        STATUS          REFRESHED   AGE  MEM USE  MEM LIM  VERSION          IMAGE ID      CONTAINER ID  
    alertmanager.node1          node1.example.com  *:9093,9094  running (110s)    42s ago  118s    16.0M        -                   57bb5bf33201  20e80925a5fc  
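
    Optionally, confirm that the Alertmanager endpoint responds (host name and port are taken from the output above and will differ in your environment):

    $ curl -s http://node1.example.com:9093/-/healthy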
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
