rabbitmq beam.smp taking huge memory and OOM triggered

Solution In Progress

Issue

  • In Cisco CVIM, we are running containerized OpenStack.

  • On the controller nodes, we run RabbitMQ as a container.

  • The controller nodes are running the RT (real-time) kernel (a quick check follows this list).
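
The RT kernel can be confirmed directly on the controller; this is just a sanity check, and the exact release string will differ per deployment:

$ uname -r    # a kernel-rt build carries an ".rt" component in its release string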

1) Interestingly, we see the rabbitmq container hitting the OOM killer (a sketch for checking the broker's own memory accounting follows the output below).

$ egrep 'Out of memory:' journalctl_--no-pager_--catalog_--boot
Feb 25 05:24:34 overcloud-controller-0 kernel: Out of memory: Kill process 67978 (beam.smp) score 279 or sacrifice child
Feb 25 06:11:45 overcloud-controller-0 kernel: Out of memory: Kill process 67978 (beam.smp) score 295 or sacrifice child
Feb 25 06:11:57 overcloud-controller-0 kernel: Out of memory: Kill process 67978 (beam.smp) score 295 or sacrifice child
Feb 25 06:12:02 overcloud-controller-0 kernel: Out of memory: Kill process 68158 (1_scheduler) score 295 or sacrifice child
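
To see how much of that memory the broker itself accounts for, RabbitMQ's own view can be checked from inside the container. This is a minimal sketch; the container name rabbitmq is an assumption, and the exact output format depends on the RabbitMQ version shipped with RHOSP 13:

$ docker exec rabbitmq rabbitmqctl status | grep -A 15 '{memory,'                # Erlang VM memory breakdown
$ docker exec rabbitmq rabbitmqctl environment | grep vm_memory_high_watermark   # configured memory threshold
$ docker exec rabbitmq rabbitmqctl list_queues name messages memory | sort -n -k3 | tail   # largest queues by memory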

2) The memory consumed by beam.smp is unexpectedly high: at the time of the kill it had roughly 141 GiB of total-vm and 96 GiB of anonymous RSS (a sketch for tracking this live follows the output below).

$ egrep 'total-vm' journalctl_--no-pager_--catalog_--boot
Feb 25 05:24:34 overcloud-controller-0 kernel: Killed process 68280 (inet_gethost) total-vm:11588kB, anon-rss:36kB, file-rss:404kB, shmem-rss:0kB
Feb 25 06:11:45 overcloud-controller-0 kernel: Killed process 211651 (inet_gethost) total-vm:11588kB, anon-rss:120kB, file-rss:352kB, shmem-rss:0kB
Feb 25 06:11:57 overcloud-controller-0 kernel: Killed process 67978 (beam.smp) total-vm:147909308kB, anon-rss:100758012kB, file-rss:748kB, shmem-rss:0kB
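
For live tracking of beam.smp growth on the host, rather than waiting for the OOM killer, something along these lines can be run periodically; the field selection is only an example, and <beam.smp pid> is a placeholder:

$ ps -o pid,vsz,rss,etime,cmd -C beam.smp                            # virtual and resident size of the Erlang VM
$ grep -E 'VmPeak|VmSize|VmRSS|VmSwap' /proc/<beam.smp pid>/status   # peak/current usage from the kernel's view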

3) The SAR report shows high memory usage; swap is 100% consumed roughly three hours before the OOM kills (the sar commands to reproduce these tables follow the output below).

            kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
<..>
01:00:01 AM  54190112 340504072     86.27     11540  34951468  36155508      8.44  41790672   6719456       368
01:10:01 AM  49373836 345320348     87.49     11540  34749340  41526932      9.70  46599324   6718940       412
01:20:01 AM  35269672 359424512     91.06     11540  34754060  57483240     13.42  60666028   6719348       332
01:30:02 AM   7953964 386740220     97.98     11540  34758704  97590008     22.79  87932344   6719492       124
01:40:01 AM  32578000 362116184     91.75       372  11385164  92104328     21.51  63054644   8324204       332
01:50:04 AM   2336736 392357448     99.41       132   2824760 147438792     34.43  95592028   7305064         4
02:20:48 AM   2335260 392358924     99.41       132   1279816 162650364     37.98  96285684   6930152         4
02:40:16 AM   2502644 392191540     99.37       132   1192380 161715736     37.76  96350260   6642756         8
02:50:01 AM 103000584 291693600     73.90       132   1112164  22938660      5.36   1642460   1312364       264
03:00:01 AM 102893160 291801024     73.93       132   1133300  22938744      5.36   1658440   1403304       304
03:10:01 AM 102856416 291837768     73.94       132   1154932  22939996      5.36   1671028   1426084       272
03:20:01 AM 102750408 291943776     73.97       132   1162728  22944740      5.36   1682256   1520952        80
<..>

            kbswpfree kbswpused  %swpused  kbswpcad   %swpcad
<..>
01:20:01 AM  33554428         0      0.00         0      0.00
01:30:02 AM  33554428         0      0.00         0      0.00
01:40:01 AM  33516532     37896      0.11      9172     24.20
01:50:04 AM  17809660  15744768     46.92    546144      3.47
02:20:48 AM         0  33554428    100.00    810372      2.42
02:40:16 AM    504348  33050080     98.50    684172      2.07
02:50:01 AM  25557676   7996752     23.83    140084      1.75
03:00:01 AM  25596188   7958240     23.72    195308      2.45
03:10:01 AM  25601080   7953348     23.70    200756      2.52
03:20:01 AM  25646456   7907972     23.57    248732      3.15
03:30:01 AM  25650764   7903664     23.55    255668      3.23
<..>
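
The two tables above are the memory and swap views of the same sar data; assuming the standard sysstat layout, they can be reproduced from the binary file for that day (sa25 is an assumption based on the Feb 25 timestamps):

$ sar -r -f /var/log/sa/sa25    # memory: kbmemfree, kbmemused, %memused, ...
$ sar -S -f /var/log/sa/sa25    # swap:   kbswpfree, kbswpused, %swpused, ...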

4) Many container processes crashed and core files were generated, including one for beam.smp (a sketch for inspecting that core follows the listing below).

$ cd tech_support-overcloud-controller-0-20200225-233836/
$ cd var/crash/
$ ls -lrt
total 361888
-rw-r--r-- 1 cisco cisco  8729503 Feb 25 02:24 httpd.1582577612.212742.gz
-rw-r--r-- 1 cisco cisco 21158755 Feb 25 02:32 nova-scheduler.1582576462.89519.gz
-rw-r--r-- 1 cisco cisco 21210878 Feb 25 02:33 nova-scheduler.1582576472.89522.gz
-rw-r--r-- 1 cisco cisco 21596221 Feb 25 02:33 nova-scheduler.1582576480.89512.gz
-rw-r--r-- 1 cisco cisco 21234256 Feb 25 02:33 nova-scheduler.1582576477.89540.gz
-rw-r--r-- 1 cisco cisco 21404817 Feb 25 02:33 nova-scheduler.1582576490.89513.gz
-rw-r--r-- 1 cisco cisco 20861144 Feb 25 02:37 nova-scheduler.1582576490.89677.gz
-rw-r--r-- 1 cisco cisco  5722542 Feb 25 02:37 beam.smp.1582578459.215646.gz
-rw-r--r-- 1 cisco cisco 20802266 Feb 25 02:38 nova-scheduler.1582576514.89615.gz
-rw-r--r-- 1 cisco cisco 21221407 Feb 25 02:38 nova-scheduler.1582576512.89539.gz
-rw-r--r-- 1 cisco cisco 20959961 Feb 25 02:38 nova-scheduler.1582576487.89561.gz
-rw-r--r-- 1 cisco cisco 21602023 Feb 25 02:38 nova-scheduler.1582576498.89662.gz
-rw-r--r-- 1 cisco cisco 21304999 Feb 25 02:39 nova-scheduler.1582576461.89520.gz
-rw-r--r-- 1 cisco cisco 20784890 Feb 25 02:39 nova-scheduler.1582576465.89534.gz
-rw-r--r-- 1 cisco cisco 20765355 Feb 25 02:39 nova-scheduler.1582576509.89558.gz
-rw-r--r-- 1 cisco cisco 20793839 Feb 25 02:40 nova-scheduler.1582576479.89527.gz
-rw-r--r-- 1 cisco cisco 20994945 Feb 25 02:40 nova-scheduler.1582576517.89655.gz
-rw-r--r-- 1 cisco cisco 21234184 Feb 25 02:40 neutron-sriov-n.1582578572.77287.gz
-rw-r--r-- 1 cisco cisco 18145911 Feb 25 02:40 mysqld.1582576162.66763.gz
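
As a starting point for the beam.smp core, it can be unpacked and opened against the Erlang emulator binary from the rabbitmq container image. The erts path below is an assumption; it must match the erlang version actually installed in that image:

$ gunzip beam.smp.1582578459.215646.gz
$ gdb /usr/lib64/erlang/erts-*/bin/beam.smp beam.smp.1582578459.215646
(gdb) info threads
(gdb) bt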

5) We still believe that the 'killing and restarting containerd' messages below are a consequence of the memory issue (a few correlation commands follow the excerpt).

<..>
Feb 25 05:31:41 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:41.219228836+09:00" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
Feb 25 05:31:46 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:46.795638023+09:00" level=info msg="libcontainerd: new containerd process, pid: 211851"
Feb 25 05:32:01 overcloud-controller-0 teamd[29604]: some periodic function calls missed (1)
Feb 25 05:31:52 overcloud-controller-0 xinetd[444502]: FAIL: mysqlchk service_limit from=172.20.106.33
Feb 25 05:32:01 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:49.836074937+09:00" level=info msg="killing and restarting containerd"
Feb 25 05:32:01 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:51.205138021+09:00" level=info msg="libcontainerd: new containerd process, pid: 211859"
Feb 25 05:32:01 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:52.263615030+09:00" level=info msg="killing and restarting containerd"
Feb 25 05:32:01 overcloud-controller-0 dockerd-current[45116]: time="2020-02-25T05:31:56.583755428+09:00" level=info msg="libcontainerd: new containerd process, pid: 211865"
<..>
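
These messages are consistent with the view above that containerd is being killed under memory pressure and respawned by dockerd, rather than failing on its own. As a rough correlation check around the OOM window (the time range below is taken from the excerpt above):

$ docker stats --no-stream                                                        # per-container CPU/memory snapshot
$ journalctl -u docker --since '2020-02-25 05:00' --until '2020-02-25 06:30' --no-pager
$ journalctl -k --since '2020-02-25 05:00' --until '2020-02-25 06:30' | grep -i 'out of memory'
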
  • We need your prompt assistance in analyzing this issue.

Environment

  • Red Hat OpenStack Platform 13.0 (RHOSP)
