Red Hat Ceph Storage 5, Alertmanager stuck in state <deleting>
Environment
- Red Hat Ceph Storage 5
Issue
- Alertmanager stuck in state <deleting>
Resolution
- Clean up all full file systems to ensure the Ceph daemons can function properly
- Identify any blocking process on all Cluster nodes by looking for a process waiting on input:
$ ps -ef | grep "import sys;exec(eval(sys.stdin.readline()))"
- Kill these processes to unblock the orchestrator:
$ kill -9 <pid-of-process-found>
Note: killing the process leaves an orphaned ssh process behind. This process can only be removed by restarting the affected ceph-mgr daemon.
- Ensure that proper monitoring of all file systems is in place
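The first two Resolution steps can be sketched as a small helper script to run on each node. This is a minimal sketch, not part of the official procedure: the helper name find_blocked_pids is ours, and the kill is left commented out so the PID can be confirmed first.

```shell
#!/bin/sh
# Sketch: locate cephadm SSH helper processes blocked waiting on stdin.
# Run this on every cluster node; the helper name is illustrative.
PATTERN='import sys;exec(eval(sys.stdin.readline()))'

find_blocked_pids() {
  # grep -F matches the pattern literally; "grep -v grep" drops our own
  # grep from the listing; awk prints the PID column of "ps -ef" output.
  ps -ef | grep -F "$PATTERN" | grep -v grep | awk '{print $2}'
}

for pid in $(find_blocked_pids); do
  echo "blocked cephadm ssh helper: pid $pid"
  # kill -9 "$pid"   # uncomment once the PID is confirmed
done
```

Remember that killing a helper this way still leaves a defunct ssh entry that only a ceph-mgr restart removes, as noted above.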
Root Cause
- A full file system blocks the Ceph process from finishing:
$ ceph config set mgr mgr/cephadm/log_to_cluster_level debug
$ cephadm shell ceph -W cephadm --watch-debug
[... output omitted ...]
2022-12-19T06:48:48.851347-0500 mgr.node1.asyvoa [DBG] alertmanager.node1 container image registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.10
2022-12-19T06:48:48.851396-0500 mgr.node1.asyvoa [DBG] args: --image registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.10 deploy --fsid 12381186-4094-11ed-8151-525400db0519 --name alertmanager.node1 --meta-json {"service_name": "alertmanager", "ports": [9093, 9094], "ip": null, "deployed_by": ["registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:31fbe18b6f81c53d21053a4a0897bc3875e8ee8ec424393e4d5c3c3afd388274", "registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:3075e8708792ebd527ca14849b6af4a11256a3f881ab09b837d7af0f8b2102ea"], "rank": null, "rank_generation": null, "extra_container_args": null} --config-json - --tcp-ports 9093 9094
2022-12-19T06:48:48.851449-0500 mgr.node1.asyvoa [DBG] stdin: {"files": {"alertmanager.yml": "# This file is generated by cephadm.\n# See https://prometheus.io/docs/alerting/configuration/ for documentation.\n\nglobal:\n resolve_timeout: 5m\n http_config:\n tls_config:\n insecure_skip_verify: true\n\nroute:\n receiver: 'default'\n routes:\n - group_by: ['alertname']\n group_wait: 10s\n group_interval: 10s\n repeat_interval: 1h\n receiver: 'ceph-dashboard'\n\nreceivers:\n- name: 'default'\n webhook_configs:\n- name: 'ceph-dashboard'\n webhook_configs:\n - url: 'http://host.containers.internal:8081/api/prometheus_receiver'\n"}, "peers": ["host.containers.internal:9094"]}
2022-12-19T06:48:50.821498-0500 mgr.node1.asyvoa [DBG] code: 0
2022-12-19T06:48:50.821574-0500 mgr.node1.asyvoa [DBG] err: Redeploy daemon alertmanager.node1 ...
2022-12-19T06:48:50.835695-0500 mgr.node1.asyvoa [DBG] mon_command: 'dashboard get-alertmanager-api-host' -> 0 in 0.001s
- The stdin of the ssh shell does not return as expected, so the process is not terminated correctly:
$ ps -ef | grep "import sys;exec(eval(sys.stdin.readline()))"
ceph 95060 5660 0 06:42 ? 00:00:00 ssh -C -F /tmp/cephadm-conf-qik7d34w -i /tmp/cephadm-identity-kx6idvdt -o ServerAliveInterval=7 -o ServerAliveCountMax=3 cephorch@node1 sudo python3 -c "import sys;exec(eval(sys.stdin.readline()))"
- In addition, killing the process leaves an orphaned ssh process in the task list:
$ ps -ef | grep defunct
ceph 7965 5660 0 03:53 ? 00:00:00 [ssh] <defunct>
ceph 69381 5660 0 05:38 ? 00:00:00 [ssh] <defunct>
ceph 81785 5660 0 05:54 ? 00:00:00 [ssh] <defunct>
ceph 95060 5660 0 06:42 ? 00:00:00 [ssh] <defunct>
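To see how many orphaned ssh processes have accumulated, the defunct entries can be counted. A minimal sketch; the helper name count_defunct_ssh is ours:

```shell
#!/bin/sh
# Sketch: count the defunct ssh children left behind on this node.
count_defunct_ssh() {
  # ps shows the orphans as "[ssh] <defunct>"; grep -c counts matching
  # lines, with the brackets escaped so they are not a character class.
  ps -ef | grep -c '\[ssh\] <defunct>'
}

echo "defunct ssh processes: $(count_defunct_ssh)"
```

As noted in the Resolution, these entries can only be reaped by restarting the affected ceph-mgr daemon.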
Diagnostic Steps
- Checking the Cluster health on the host with no space left results in an error:
$ cephadm shell ceph health
Inferring fsid 12381186-4094-11ed-8151-525400db0519
Using recent ceph image registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:31fbe18b6f81c53d21053a4a0897bc3875e8ee8ec424393e4d5c3c3afd388274
Error: lsetxattr /var/log/ceph/12381186-4094-11ed-8151-525400db0519: no space left on device
- Checking the Cluster health from a different host reports HEALTH_OK:
$ cephadm shell ceph health
HEALTH_OK
- Try removing the stale service:
$ cephadm shell ceph orch rm alertmanager
Removed service alertmanager
- Check the state of the service, which is stuck in <deleting>:
$ cephadm shell ceph orch ls --refresh
NAME          PORTS        RUNNING  REFRESHED   AGE  PLACEMENT
alertmanager  ?:9093,9094  0/1      <deleting>  3m   count:1
[... output omitted ...]
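A quick way to spot any service stuck in <deleting> is to filter the orchestrator listing. A sketch, assuming the node can run cephadm shell; the helper name stuck_services is ours:

```shell
#!/bin/sh
# Sketch: report orchestrator services whose listing shows <deleting>.
stuck_services() {
  # awk prints the service name (column 1) of any row that contains the
  # literal string <deleting>; stderr noise from cephadm is suppressed.
  cephadm shell ceph orch ls --refresh 2>/dev/null | awk '/<deleting>/ {print $1}'
}

stuck_services
```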
- Check all Cluster nodes for a process that blocks on input:
$ ps -ef | grep ceph
ssh -C -F /tmp/cephadm-conf-qik7d34w -i /tmp/cephadm-identity-kx6idvdt -o ServerAliveInterval=7 -o ServerAliveCountMax=3 cephorch@node1 sudo python3 -c "import sys;exec(eval(sys.stdin.readline()))"
- Kill the blocking process (import sys;exec(eval(sys.stdin.readline()))) identified before:
$ kill -9 <pid-of-process>
- Redeploy the service that was blocked:
$ cephadm shell ceph orch apply alertmanager
Scheduled alertmanager update...
- Verify that the service has been deployed correctly:
$ cephadm shell ceph orch ps --refresh
NAME                HOST               PORTS        STATUS          REFRESHED  AGE   MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
alertmanager.node1  node1.example.com  *:9093,9094  running (110s)  42s ago    118s  16.0M    -                 57bb5bf33201  20e80925a5fc
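Since the root cause was a full file system, a simple usage check can serve as a stopgap until the monitoring called for in the Resolution is in place. A minimal sketch; the 90% threshold and the helper name check_fs_usage are our choices:

```shell
#!/bin/sh
# Sketch: warn when any mounted file system exceeds a usage threshold.
THRESHOLD=90

check_fs_usage() {
  # df -P emits stable POSIX columns; NR>1 skips the header; the Use%
  # value in column 5 is stripped of its % sign and compared numerically,
  # printing the mount point (column 6) of any file system at or over limit.
  df -P | awk -v limit="$THRESHOLD" \
    'NR>1 { sub(/%/, "", $5); if ($5+0 >= limit) print $6, $5 "%" }'
}

check_fs_usage
```

Running this from cron on each node would have flagged /var/log/ceph before the orchestrator became blocked.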
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.