rabbitmq beam.smp taking huge memory and OOM triggered

Solution In Progress

Issue

  • In Cisco CVIM, we are running containerized OpenStack.

  • On the controller nodes, we run RabbitMQ as a container.

  • The controller nodes are running the realtime (RT) kernel.

1) Interestingly, we see the rabbitmq container (beam.smp) being targeted by the OOM killer:

$ egrep 'Out of memory:' journalctl_--no-pager_--catalog_--boot
Feb 25 05:24:34 overcloud-controller-0 kernel: Out of memory: Kill process 67978 (beam.smp) score 279 or sacrifice child
Feb 25 06:11:45 overcloud-controller-0 kernel: Out of memory: Kill process 67978 (beam.smp) score 295 or sacrifice child
Feb 25 06:11:57 overcloud-controller-0 kernel: Out of memory: Kill process 67978 (beam.smp) score 295 or sacrifice child
Feb 25 06:12:02 overcloud-controller-0 kernel: Out of memory: Kill process 68158 (1_scheduler) score 295 or sacrifice child
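
Before the OOM killer fires, a quick way to confirm that beam.smp is the process accumulating the memory is to look at its live footprint on the controller. A minimal check; the cgroup lookup confirms which container the PID belongs to (the PID placeholder below is to be filled in from the ps output):

$ ps -C beam.smp -o pid,ppid,vsz,rss,cmd --sort=-rss   # VSZ/RSS in kB, largest first
$ cat /proc/<beam.smp_pid>/cgroup                      # shows the container cgroup the PID runs in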

2) The memory consumed by beam.smp at the time of the kill is unexpectedly high: roughly 100 GB resident and ~148 GB of virtual memory.

$ egrep 'total-vm' journalctl_--no-pager_--catalog_--boot
Feb 25 05:24:34 overcloud-controller-0 kernel: Killed process 68280 (inet_gethost) total-vm:11588kB, anon-rss:36kB, file-rss:404kB, shmem-rss:0kB
Feb 25 06:11:45 overcloud-controller-0 kernel: Killed process 211651 (inet_gethost) total-vm:11588kB, anon-rss:120kB, file-rss:352kB, shmem-rss:0kB
Feb 25 06:11:57 overcloud-controller-0 kernel: Killed process 67978 (beam.smp) total-vm:147909308kB, anon-rss:100758012kB, file-rss:748kB, shmem-rss:0kB
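
To see where RabbitMQ itself thinks the memory is going (queue processes, binaries, connections, Mnesia, and so on), the broker's own memory breakdown and its configured high watermark can be dumped from inside the container. A sketch, assuming the container is simply named rabbitmq (adjust to the actual CVIM container name):

$ docker exec rabbitmq rabbitmqctl status | grep -A 20 '{memory,'
$ docker exec rabbitmq rabbitmqctl environment | grep vm_memory_high_watermark

The default watermark is 0.4, i.e. RabbitMQ only starts blocking publishers once it believes it is using 40% of the RAM it detects; in a container without a memory limit that is 40% of the entire host, so a backed-up queue can still drive the node deep into swap before the broker throttles anything.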

3) The SAR report shows high memory usage; swap is 100% consumed almost 3 hours before the OOM killer fires.

            kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
<..>
01:00:01 AM  54190112 340504072     86.27     11540  34951468  36155508      8.44  41790672   6719456       368
01:10:01 AM  49373836 345320348     87.49     11540  34749340  41526932      9.70  46599324   6718940       412
01:20:01 AM  35269672 359424512     91.06     11540  34754060  57483240     13.42  60666028   6719348       332
01:30:02 AM   7953964 386740220     97.98     11540  34758704  97590008     22.79  87932344   6719492       124
01:40:01 AM  32578000 362116184     91.75       372  11385164  92104328     21.51  63054644   8324204       332
01:50:04 AM   2336736 392357448     99.41       132   2824760 147438792     34.43  95592028   7305064         4
02:20:48 AM   2335260 392358924     99.41       132   1279816 162650364     37.98  96285684   6930152         4
02:40:16 AM   2502644 392191540     99.37       132   1192380 161715736     37.76  96350260   6642756         8
02:50:01 AM 103000584 291693600     73.90       132   1112164  22938660      5.36   1642460   1312364       264
03:00:01 AM 102893160 291801024     73.93       132   1133300  22938744      5.36   1658440   1403304       304
03:10:01 AM 102856416 291837768     73.94       132   1154932  22939996      5.36   1671028   1426084       272
03:20:01 AM 102750408 291943776     73.97       132   1162728  22944740      5.36   1682256   1520952        80
<..>

            kbswpfree kbswpused  %swpused  kbswpcad   %swpcad
<..>
01:20:01 AM  33554428         0      0.00         0      0.00
01:30:02 AM  33554428         0      0.00         0      0.00
01:40:01 AM  33516532     37896      0.11      9172     24.20
01:50:04 AM  17809660  15744768     46.92    546144      3.47
02:20:48 AM         0  33554428    100.00    810372      2.42
02:40:16 AM    504348  33050080     98.50    684172      2.07
02:50:01 AM  25557676   7996752     23.83    140084      1.75
03:00:01 AM  25596188   7958240     23.72    195308      2.45
03:10:01 AM  25601080   7953348     23.70    200756      2.52
03:20:01 AM  25646456   7907972     23.57    248732      3.15
03:30:01 AM  25650764   7903664     23.55    255668      3.23
<..>
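
For reference, the same memory and swap views can be regenerated for any other window from the sysstat data files collected on the controller (or shipped in the sosreport), assuming the default /var/log/sa location; for the 25th:

$ sar -r -f /var/log/sa/sa25    # memory utilization
$ sar -S -f /var/log/sa/sa25    # swap utilization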

4) Many container processes crashed and core files were generated, including one for beam.smp:

$ cd tech_support-overcloud-controller-0-20200225-233836/
$ cd var/crash/
$ ls -lrt
total 361888
-rw-r--r-- 1 cisco cisco  8729503 Feb 25 02:24 httpd.1582577612.212742.gz
-rw-r--r-- 1 cisco cisco 21158755 Feb 25 02:32 nova-scheduler.1582576462.89519.gz
-rw-r--r-- 1 cisco cisco 21210878 Feb 25 02:33 nova-scheduler.1582576472.89522.gz
-rw-r--r-- 1 cisco cisco 21596221 Feb 25 02:33 nova-scheduler.1582576480.89512.gz
-rw-r--r-- 1 cisco cisco 21234256 Feb 25 02:33 nova-scheduler.1582576477.89540.gz
-rw-r--r-- 1 cisco cisco 21404817 Feb 25 02:33 nova-scheduler.1582576490.89513.gz
-rw-r--r-- 1 cisco cisco 20861144 Feb 25 02:37 nova-scheduler.1582576490.89677.gz
-rw-r--r-- 1 cisco cisco  5722542 Feb 25 02:37 beam.smp.1582578459.215646.gz
-rw-r--r-- 1 cisco cisco 20802266 Feb 25 02:38 nova-scheduler.1582576514.89615.gz
-rw-r--r-- 1 cisco cisco 21221407 Feb 25 02:38 nova-scheduler.1582576512.89539.gz
-rw-r--r-- 1 cisco cisco 20959961 Feb 25 02:38 nova-scheduler.1582576487.89561.gz
-rw-r--r-- 1 cisco cisco 21602023 Feb 25 02:38 nova-scheduler.1582576498.89662.gz
-rw-r--r-- 1 cisco cisco 21304999 Feb 25 02:39 nova-scheduler.1582576461.89520.gz
-rw-r--r-- 1 cisco cisco 20784890 Feb 25 02:39 nova-scheduler.1582576465.89534.gz
-rw-r--r-- 1 cisco cisco 20765355 Feb 25 02:39 nova-scheduler.1582576509.89558.gz
-rw-r--r-- 1 cisco cisco 20793839 Feb 25 02:40 nova-scheduler.1582576479.89527.gz
-rw-r--r-- 1 cisco cisco 20994945 Feb 25 02:40 nova-scheduler.1582576517.89655.gz
-rw-r--r-- 1 cisco cisco 21234184 Feb 25 02:40 neutron-sriov-n.1582578572.77287.gz
-rw-r--r-- 1 cisco cisco 18145911 Feb 25 02:40 mysqld.1582576162.66763.gz
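
The compressed beam.smp core can be unpacked and opened against the Erlang VM binary to see what the emulator threads were doing when the process died. A rough sketch only; the beam.smp path below is an assumption and depends on the Erlang/RabbitMQ image in use, so the core is best analyzed inside (or against) that image:

$ gunzip beam.smp.1582578459.215646.gz
$ gdb /usr/lib64/erlang/erts-*/bin/beam.smp beam.smp.1582578459.215646
(gdb) thread apply all bt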

5) We still believe the 'killing and restarting containerd' events are a result of the memory pressure:

<..>
Feb 25 05:31:41 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:41.219228836+09:00" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
Feb 25 05:31:46 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:46.795638023+09:00" level=info msg="libcontainerd: new containerd process, pid: 211851"
Feb 25 05:32:01 overcloud-controller-0 teamd[29604]: some periodic function calls missed (1)
Feb 25 05:31:52 overcloud-controller-0 xinetd[444502]: FAIL: mysqlchk service_limit from=172.20.106.33
Feb 25 05:32:01 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:49.836074937+09:00" level=info msg="killing and restarting containerd"
Feb 25 05:32:01 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:51.205138021+09:00" level=info msg="libcontainerd: new containerd process, pid: 211859"
Feb 25 05:32:01 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:52.263615030+09:00" level=info msg="killing and restarting containerd"
Feb 25 05:32:01 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:56.583755428+09:00" level=info msg="libcontainerd: new containerd process, pid: 211865"
<..>
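
The 'transport is closing' / 'killing and restarting containerd' sequence is dockerd's own health monitor restarting containerd after it stops responding, which is consistent with the host being deep in swap at that time. Whether the restarts are still recurring can be checked from the journal (assuming the service unit is named docker):

$ journalctl -u docker --no-pager | egrep 'killing and restarting containerd|new containerd process'
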
  • We need your prompt assistance in analyzing this issue.

Environment

  • Red Hat OpenStack Platform 13.0 (RHOSP)
