The user-defined Prometheus pods are unable to create logs in a hostPath mount whose directory permissions are set to 755
Environment
- Red Hat OpenShift Container Platform 4.10+
- Prometheus Operator 2.32.1
Issue
After installing Prometheus by enabling monitoring and using a hostPath directory with permissions set to 755 as a persistent volume, the prometheus-user-workload pods go into CrashLoopBackOff and report the following message:
$ oc get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
prometheus-operator-b55fdf657-ljdnq 2/2 Running 0 74s 10.128.0.129 master01.ocp4.danliu.com <none> <none>
prometheus-user-workload-0 4/5 CrashLoopBackOff 2 (13s ago) 46s 10.131.0.58 worker02.ocp4.danliu.com <none> <none>
prometheus-user-workload-1 4/5 CrashLoopBackOff 2 (17s ago) 46s 10.128.2.161 worker03.ocp4.danliu.com <none> <none>
thanos-ruler-user-workload-0 3/3 Running 0 66s 10.128.2.160 worker03.ocp4.danliu.com <none> <none>
thanos-ruler-user-workload-1 3/3 Running 0 66s 10.129.3.28 worker01.ocp4.danliu.com <none> <none>
$ oc get po prometheus-user-workload-0 -o yaml
message: "ts=2023-06-08T05:38:49.062Z caller=main.go:532 level=info msg=\"Starting
opening query log file\" file=/prometheus/queries.active err=\"open /prometheus/queries.active:
permission denied\"\npanic: Unable to create mmap-ed active query log\n\ngoroutine"
reason: Error
Resolution
- SSH into the hostPath nodes and set the UID and GID to 65534 on the hostPath directory prometheus-db.
- Delete the pods that are working abnormally.
$ ls -lZd /mnt/prometheus-data/prometheus-db/
drwxr-xr-x. 2 root root system_u:object_r:container_file_t:s0 6 Jun 8 06:12 /mnt/prometheus-data/prometheus-db/
$ chown -R 65534:65534 /mnt/prometheus-data/prometheus-db/
$ ls -lZd /mnt/prometheus-data/prometheus-db/
drwxr-xr-x. 2 nfsnobody nfsnobody system_u:object_r:container_file_t:s0 6 Jun 8 06:12 /mnt/prometheus-data/prometheus-db/
$ oc get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
prometheus-operator-b55fdf657-ljdnq 2/2 Running 0 2m58s 10.128.0.129 master01.ocp4.danliu.com <none> <none>
prometheus-user-workload-0 5/5 Running 4 (89s ago) 2m30s 10.131.0.58 worker02.ocp4.danliu.com <none> <none>
prometheus-user-workload-1 4/5 CrashLoopBackOff 4 (47s ago) 2m30s 10.128.2.161 worker03.ocp4.danliu.com <none> <none>
thanos-ruler-user-workload-0 3/3 Running 0 2m50s 10.128.2.160 worker03.ocp4.danliu.com <none> <none>
thanos-ruler-user-workload-1 3/3 Running 0 2m50s 10.129.3.28 worker01.ocp4.danliu.com <none> <none>
$ oc delete po prometheus-user-workload-1
$ oc get po
NAME READY STATUS RESTARTS AGE
prometheus-operator-b55fdf657-ljdnq 2/2 Running 0 3m26s
prometheus-user-workload-0 5/5 Running 4 (117s ago) 2m58s
prometheus-user-workload-1 5/5 Running 0 16s
thanos-ruler-user-workload-0 3/3 Running 0 3m18s
thanos-ruler-user-workload-1 3/3 Running 0 3m18s
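The ownership check and fix above could be wrapped in a small script to run on each hostPath node. This is only a sketch: the function names dir_owner, owner_matches and ensure_owner are hypothetical, it assumes GNU coreutils stat is available on the node, and the chown step must be run as root. The path and the 65534:65534 owner are taken from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: print the numeric owner of a path as UID:GID.
# Assumes GNU coreutils stat (the -c option is not portable to BSD stat).
dir_owner() {
    stat -c '%u:%g' "$1"
}

# Hypothetical helper: succeed only if the path is owned by the expected UID:GID.
owner_matches() {
    [ "$(dir_owner "$1")" = "$2" ]
}

# Hypothetical helper: run chown -R only when the ownership differs.
ensure_owner() {
    dir=$1
    want=$2
    if owner_matches "$dir" "$want"; then
        echo "ok: $dir already owned by $want"
    else
        echo "fixing: $dir is owned by $(dir_owner "$dir"), expected $want"
        chown -R "$want" "$dir"
    fi
}

# Example, run as root on the node (path taken from this article):
# ensure_owner /mnt/prometheus-data/prometheus-db 65534:65534
```

Checking before changing anything keeps the script safe to re-run on nodes that have already been fixed. The crash-looping pods still need to be deleted afterwards, as shown above.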
Root Cause
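- The host path directory prometheus-db, created by the prometheus-user-workload pod, must be owned by UID and GID 65534. As the listing in the Resolution shows, the directory was owned by root:root with mode 755, which grants write access only to the owner, so Prometheus could not create /prometheus/queries.active.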
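To illustrate why mode 755 is not enough on its own, the following sketch models the classic Unix owner/group/other write check. It is a simplification, not how the kernel is implemented: it ignores the root override, supplementary groups, ACLs and SELinux. The function name can_write is hypothetical, and the sketch assumes that the Prometheus process runs as UID and GID 65534, as the fix in this article implies.

```shell
#!/bin/sh
# Hypothetical helper: can_write MODE OWNER_UID OWNER_GID PROC_UID PROC_GID
# MODE must be exactly three octal digits, for example 755.
# Succeeds if the process would get write permission under the basic
# owner/group/other rules (no root override, ACLs or SELinux considered).
can_write() {
    mode=$1 owner_uid=$2 owner_gid=$3 proc_uid=$4 proc_gid=$5
    if [ "$proc_uid" -eq "$owner_uid" ]; then
        digit=${mode%??}                     # owner digit
    elif [ "$proc_gid" -eq "$owner_gid" ]; then
        digit=${mode%?}; digit=${digit#?}    # group digit
    else
        digit=${mode#??}                     # other digit
    fi
    [ $(( digit & 2 )) -ne 0 ]               # bit value 2 is the write bit
}

# Before the fix: directory root:root (0:0), mode 755, process 65534:65534.
can_write 755 0 0 65534 65534 || echo "before chown: write denied"

# After the fix: directory 65534:65534, mode 755, same process.
can_write 755 65534 65534 65534 65534 && echo "after chown: write allowed"
```

Under these assumptions the process falls into the "other" class before the chown, where 755 grants only read and execute, and into the "owner" class afterwards, where write is granted.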