rabbitmq beam.smp taking huge memory and OOM triggered
Issue
- In Cisco CVIM, we are running containerized OpenStack.
- On the controller nodes, RabbitMQ runs as a container.
- The controllers run the RT (real-time) kernel.
1) Interestingly, the RabbitMQ container is being hit by the OOM killer; see the memory check sketched after the log excerpt.
$ egrep 'Out of memory:' journalctl_--no-pager_--catalog_--boot
Feb 25 05:24:34 overcloud-controller-0 kernel: Out of memory: Kill process 67978 (beam.smp) score 279 or sacrifice child
Feb 25 06:11:45 overcloud-controller-0 kernel: Out of memory: Kill process 67978 (beam.smp) score 295 or sacrifice child
Feb 25 06:11:57 overcloud-controller-0 kernel: Out of memory: Kill process 67978 (beam.smp) score 295 or sacrifice child
Feb 25 06:12:02 overcloud-controller-0 kernel: Out of memory: Kill process 68158 (1_scheduler) score 295 or sacrifice child
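As a first check, RabbitMQ's own view of its memory usage and high watermark can be compared against what the kernel reports. This is only a sketch; the container name 'rabbitmq' is an assumption and should be adjusted to the actual container name on the controller.
# container name is an assumption; confirm it with 'docker ps' first
$ docker exec rabbitmq rabbitmqctl status
$ docker exec rabbitmq rabbitmqctl environment | grep vm_memory_high_watermark
$ docker exec rabbitmq rabbitmqctl list_queues name messages memory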
2) The memory consumed by beam.smp is unexpectedly high (converted to GiB below).
$ egrep 'total-vm' journalctl_--no-pager_--catalog_--boot
Feb 25 05:24:34 overcloud-controller-0 kernel: Killed process 68280 (inet_gethost) total-vm:11588kB, anon-rss:36kB, file-rss:404kB, shmem-rss:0kB
Feb 25 06:11:45 overcloud-controller-0 kernel: Killed process 211651 (inet_gethost) total-vm:11588kB, anon-rss:120kB, file-rss:352kB, shmem-rss:0kB
Feb 25 06:11:57 overcloud-controller-0 kernel: Killed process 67978 (beam.smp) total-vm:147909308kB, anon-rss:100758012kB, file-rss:748kB, shmem-rss:0kB
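For scale, converting the beam.smp figures above from kB:
anon-rss: 100758012 kB / 1024 / 1024 ≈ 96 GiB resident
total-vm: 147909308 kB / 1024 / 1024 ≈ 141 GiB of virtual address space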
3) The SAR report shows high memory usage; swap is 100% consumed almost 3 hours before the OOM killer fires (the sar commands used are sketched after the tables).
kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
<..>
01:00:01 AM 54190112 340504072 86.27 11540 34951468 36155508 8.44 41790672 6719456 368
01:10:01 AM 49373836 345320348 87.49 11540 34749340 41526932 9.70 46599324 6718940 412
01:20:01 AM 35269672 359424512 91.06 11540 34754060 57483240 13.42 60666028 6719348 332
01:30:02 AM 7953964 386740220 97.98 11540 34758704 97590008 22.79 87932344 6719492 124
01:40:01 AM 32578000 362116184 91.75 372 11385164 92104328 21.51 63054644 8324204 332
01:50:04 AM 2336736 392357448 99.41 132 2824760 147438792 34.43 95592028 7305064 4
02:20:48 AM 2335260 392358924 99.41 132 1279816 162650364 37.98 96285684 6930152 4
02:40:16 AM 2502644 392191540 99.37 132 1192380 161715736 37.76 96350260 6642756 8
02:50:01 AM 103000584 291693600 73.90 132 1112164 22938660 5.36 1642460 1312364 264
03:00:01 AM 102893160 291801024 73.93 132 1133300 22938744 5.36 1658440 1403304 304
03:10:01 AM 102856416 291837768 73.94 132 1154932 22939996 5.36 1671028 1426084 272
03:20:01 AM 102750408 291943776 73.97 132 1162728 22944740 5.36 1682256 1520952 80
<..>
kbswpfree kbswpused %swpused kbswpcad %swpcad
<..>
01:20:01 AM 33554428 0 0.00 0 0.00
01:30:02 AM 33554428 0 0.00 0 0.00
01:40:01 AM 33516532 37896 0.11 9172 24.20
01:50:04 AM 17809660 15744768 46.92 546144 3.47
02:20:48 AM 0 33554428 100.00 810372 2.42
02:40:16 AM 504348 33050080 98.50 684172 2.07
02:50:01 AM 25557676 7996752 23.83 140084 1.75
03:00:01 AM 25596188 7958240 23.72 195308 2.45
03:10:01 AM 25601080 7953348 23.70 200756 2.52
03:20:01 AM 25646456 7907972 23.57 248732 3.15
03:30:01 AM 25650764 7903664 23.55 255668 3.23
<..>
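The memory and swap tables above can be regenerated from the collected sysstat data. This is a sketch; /var/log/sa/sa25 is an assumed path for the Feb 25 data file.
# -r reports memory utilization, -S reports swap utilization
$ sar -r -f /var/log/sa/sa25
$ sar -S -f /var/log/sa/sa25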
4) Many containers crashed and core files were generated, including one for beam.smp (a sketch for inspecting it follows the listing).
$ cd tech_support-overcloud-controller-0-20200225-233836/
$ cd var/crash/
$ ls -lrt
total 361888
-rw-r--r-- 1 cisco cisco 8729503 Feb 25 02:24 httpd.1582577612.212742.gz
-rw-r--r-- 1 cisco cisco 21158755 Feb 25 02:32 nova-scheduler.1582576462.89519.gz
-rw-r--r-- 1 cisco cisco 21210878 Feb 25 02:33 nova-scheduler.1582576472.89522.gz
-rw-r--r-- 1 cisco cisco 21596221 Feb 25 02:33 nova-scheduler.1582576480.89512.gz
-rw-r--r-- 1 cisco cisco 21234256 Feb 25 02:33 nova-scheduler.1582576477.89540.gz
-rw-r--r-- 1 cisco cisco 21404817 Feb 25 02:33 nova-scheduler.1582576490.89513.gz
-rw-r--r-- 1 cisco cisco 20861144 Feb 25 02:37 nova-scheduler.1582576490.89677.gz
-rw-r--r-- 1 cisco cisco 5722542 Feb 25 02:37 beam.smp.1582578459.215646.gz
-rw-r--r-- 1 cisco cisco 20802266 Feb 25 02:38 nova-scheduler.1582576514.89615.gz
-rw-r--r-- 1 cisco cisco 21221407 Feb 25 02:38 nova-scheduler.1582576512.89539.gz
-rw-r--r-- 1 cisco cisco 20959961 Feb 25 02:38 nova-scheduler.1582576487.89561.gz
-rw-r--r-- 1 cisco cisco 21602023 Feb 25 02:38 nova-scheduler.1582576498.89662.gz
-rw-r--r-- 1 cisco cisco 21304999 Feb 25 02:39 nova-scheduler.1582576461.89520.gz
-rw-r--r-- 1 cisco cisco 20784890 Feb 25 02:39 nova-scheduler.1582576465.89534.gz
-rw-r--r-- 1 cisco cisco 20765355 Feb 25 02:39 nova-scheduler.1582576509.89558.gz
-rw-r--r-- 1 cisco cisco 20793839 Feb 25 02:40 nova-scheduler.1582576479.89527.gz
-rw-r--r-- 1 cisco cisco 20994945 Feb 25 02:40 nova-scheduler.1582576517.89655.gz
-rw-r--r-- 1 cisco cisco 21234184 Feb 25 02:40 neutron-sriov-n.1582578572.77287.gz
-rw-r--r-- 1 cisco cisco 18145911 Feb 25 02:40 mysqld.1582576162.66763.gz
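If a backtrace is needed from the beam.smp core, it can be opened with gdb against the matching Erlang runtime. This is only a sketch; the beam.smp binary path is an assumption, and the core must be analyzed against the same binary and libraries the container was running.
# binary path is an assumption; run this inside the rabbitmq container image so the libraries match
$ gunzip beam.smp.1582578459.215646.gz
$ gdb /usr/lib64/erlang/erts-*/bin/beam.smp beam.smp.1582578459.215646
(gdb) bt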
5) We still believe that the 'killing and restarting containerd' events below are a consequence of the memory issue (a correlation check is sketched after the log excerpt).
<..>
Feb 25 05:31:41 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:41.219228836+09:00" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
Feb 25 05:31:46 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:46.795638023+09:00" level=info msg="libcontainerd: new containerd process, pid: 211851"
Feb 25 05:32:01 overcloud-controller-0 teamd[29604]: some periodic function calls missed (1)
Feb 25 05:31:52 overcloud-controller-0 xinetd[444502]: FAIL: mysqlchk service_limit from=172.20.106.33
Feb 25 05:32:01 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:49.836074937+09:00" level=info msg="killing and restarting containerd"
Feb 25 05:32:01 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:51.205138021+09:00" level=info msg="libcontainerd: new containerd process, pid: 211859"
Feb 25 05:32:01 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:52.263615030+09:00" level=info msg="killing and restarting containerd"
Feb 25 05:32:01 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:56.583755428+09:00" level=info msg="libcontainerd: new containerd process, pid: 211865"
<..>
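To line the containerd restarts up against the OOM events in item 1, both patterns can be pulled from the same captured journal file used above:
$ egrep 'Out of memory:|killing and restarting containerd' journalctl_--no-pager_--catalog_--boot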
- We need your prompt assistance in analyzing this issue.
Environment
- Red Hat OpenStack Platform 13.0 (RHOSP)