Detecting memory leaks with JEMALLOC
Introduction
Detecting memory leaks can be a challenging task when there is no reproducible test case, and detecting them in a production environment poses its own performance challenges: traditional memory leak detection tools such as Valgrind carry severe overhead. The JEMALLOC allocator takes a different approach to memory profiling and leak detection, one that can be used in production environments with no severe performance degradation. This article explains how to use the JEMALLOC userland memory allocator library to detect memory leaks in the context of Red Hat Directory Server 11 and its ns-slapd server process. The technique described can, however, be applied to any process that is either linked or loaded with JEMALLOC.
How it works
JEMALLOC profiling is based on statistical sampling of memory allocations, rather than the blanket tracking of every allocation used by traditional profiling tools. The main idea is that if the process runs long enough and the memory leak is repeatable and significant enough, the profiling samples will eventually catch it. Because only a small fraction of allocations is profiled, the application runs largely uninhibited, paying a small performance cost only on the operations that are actually sampled, which makes this approach suitable for most production usage scenarios.
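As a rough back-of-the-envelope illustration of why sampling catches significant leaks (the figures below are hypothetical assumptions, not measurements): with an average sampling interval of 2^19 bytes, a leak that accumulates 256 MiB over the lifetime of a run would be expected to be hit by several hundred samples.

```shell
# Hypothetical figures for illustration only.
leaked_bytes=$((256 * 1024 * 1024))   # assume the leak accumulates 256 MiB in total
interval=$((1 << 19))                 # average sampling interval of 2^19 bytes
expected_samples=$((leaked_bytes / interval))
echo "$expected_samples"              # prints 512
```

A tiny leak of only a few kilobytes, by contrast, could easily fall entirely between samples, which is the fidelity trade-off discussed later in this article.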
How to setup
Red Hat Directory Server 11 already ships with JEMALLOC installed and enabled out of the box; by default, however, all profiling functionality is disabled. To enable memory profiling, and leak detection in particular, edit the Directory Server's custom systemd configuration file
/etc/systemd/system/dirsrv@.service.d/custom.conf
or, if more than one instance is installed and you want to target a specific instance, use the per-instance configuration instead
/etc/systemd/system/dirsrv@<instance>.service.d/custom.conf
Add the following JEMALLOC MALLOC_CONF options:
[Service]
TimeoutStartSec=0
TimeoutStopSec=600
Environment=MALLOC_CONF=prof:true,prof_leak:true,lg_prof_sample:19,prof_final:true,stats_print:true,prof_prefix:/run/dirsrv/jeprof
Environment=LD_PRELOAD=/usr/lib64/dirsrv/lib/libjemalloc.so.2
These options tell JEMALLOC to:
- Enable profiling and leak detection
- Use a sampling interval of 2^19 bytes
- Generate a profile on process exit
- Print allocator internal statistics
- Store profile data under /run/dirsrv
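JEMALLOC parses MALLOC_CONF as a comma-separated list of option:value pairs. A quick sanity check is to split the configured value one option per line and confirm the expected switches are present; the value below is a copy of the one configured above.

```shell
# Copy of the MALLOC_CONF value from the systemd drop-in above.
MALLOC_CONF='prof:true,prof_leak:true,lg_prof_sample:19,prof_final:true,stats_print:true,prof_prefix:/run/dirsrv/jeprof'

# Print one option per line for easy inspection.
echo "$MALLOC_CONF" | tr ',' '\n'
```

On a running instance, the same variable can be inspected in the process environment (e.g. via /proc/&lt;pid&gt;/environ) to verify it was actually picked up.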
To apply this configuration change to the Directory Server process, execute:
# systemctl daemon-reload
# systemctl restart dirsrv@<instance>
To disable profiling afterwards, simply reverse these configuration changes: remove or comment out the MALLOC_CONF Environment line (or remove the custom configuration file entirely) and execute the above commands again.
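As a sketch of the reversal, the MALLOC_CONF line can be commented out in place with sed. For illustration this operates on a temporary copy of a minimal drop-in; the real file lives under /etc/systemd/system and requires root to edit.

```shell
# Work on a temporary copy of a minimal drop-in for illustration.
conf=$(mktemp)
printf '%s\n' '[Service]' \
    'Environment=MALLOC_CONF=prof:true,prof_leak:true' > "$conf"

# Comment out the MALLOC_CONF line; '&' re-inserts the matched text.
sed -i 's/^Environment=MALLOC_CONF=/#&/' "$conf"
cat "$conf"
```

After editing the real drop-in file, run systemctl daemon-reload and restart the instance as shown above for the change to take effect.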
How to profile
Once the Directory Server restarts, sampling and data collection are active, and all that is needed to get the relevant data is to reproduce the memory leak scenario.
Some very specific memory leaks can evade the configured sampling interval and thus might not show up in the profiling report. This can be addressed by instructing JEMALLOC to sample more often; sampling more often, however, has a greater impact on performance. The sampling interval is therefore always a trade-off between fidelity and performance.
lg_prof_sample:19
is the default sampling interval of 2^19 bytes (512 KiB), which is suitable in most cases as it strikes a good balance between fidelity and performance. Decreasing this value causes JEMALLOC to sample more often; increasing it causes it to sample less often. Each sample is taken after the specified number of bytes has been allocated. For server-type applications with large memory footprints that allocate frequently in response to client activity, the default value results in a fair number of samples, enough to capture the most significant memory leaks.
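The relationship between the lg_prof_sample value and the byte interval is simply a power of two, which is easy to compute when tuning:

```shell
# Convert an lg_prof_sample value to its average byte interval.
lg=19
bytes=$((1 << lg))
echo "lg_prof_sample:$lg => $bytes bytes ($((bytes / 1024)) KiB)"
```

Dropping the value to 17, for instance, would sample on average every 128 KiB, roughly quadrupling the number of samples taken, with a correspondingly higher performance cost.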
Data collection
Once the memory leak has been reproduced, the server process needs to be shut down in order to signal JEMALLOC to stop its sampling and data collection and write its profiling report, containing the collected data, to the specified location:
/run/dirsrv/jeprof.2613573.0.f.heap
The internal allocator statistics and the leak summary are logged separately via syslog:
/var/log/messages:
___ Begin jemalloc statistics ___
[ ... ]
--- End jemalloc statistics ---
ns-slapd[2613573]: <jemalloc>: Leak approximation summary: ~2966192 bytes, ~22 objects, >= 5 contexts
ns-slapd[2613573]: <jemalloc>: Run jeprof on "/run/dirsrv/jeprof.2613573.0.f.heap" for leak detail
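When scripting around these syslog messages, the approximate leaked byte count can be extracted with sed. The line below is a hardcoded copy of the sample message above, used here purely for illustration:

```shell
# Hardcoded sample of the syslog leak summary line for illustration.
line='ns-slapd[2613573]: <jemalloc>: Leak approximation summary: ~2966192 bytes, ~22 objects, >= 5 contexts'

# Pull out the approximate number of leaked bytes.
echo "$line" | sed -n 's/.*summary: ~\([0-9]*\) bytes.*/\1/p'   # prints 2966192
```

On a real system the input would come from grepping /var/log/messages for the "Leak approximation summary" marker instead of a hardcoded string.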
The profiling data, along with the syslog messages, can then either be sent directly to Red Hat Support for further analysis or turned into a profiling report and analyzed locally. How to analyze memory leaks, however, is out of scope for this article.
Local reports
To produce a report from the profiling data, the jeprof tool is required. This tool is normally part of the Directory Server 11 distribution and can be found at the following location:
/usr/lib64/dirsrv/bin/jeprof
It can also be obtained from JEMALLOC upstream GitHub page if needed.
Note that in order to generate meaningful reports, the jeprof tool needs the relevant debuginfo packages for all binaries and libraries being profiled. These are needed to match the collected profiling data for call stack unwinding and related purposes. For Directory Server they are:
389-ds-base-debuginfo
389-ds-base-libs-debuginfo
These packages can be installed using the following command:
debuginfo-install 389-ds-base
With these packages in place, the jeprof tool can be used to generate and inspect reports based on the collected profiling data. Should a memory leak occur outside the Directory Server code base, e.g. in one of the libraries it is linked with, the debuginfo packages for those libraries would need to be installed as well.
jeprof --show_bytes /usr/sbin/ns-slapd /run/dirsrv/jeprof.2613573.0.f.heap
Using local file /usr/sbin/ns-slapd.
Using local file /run/dirsrv/jeprof.2613573.0.f.heap.
Welcome to jeprof! For help, type 'help'.
(jeprof) top
Total: 2998001 B
829411 27.7% 27.7% 829411 27.7% slapi_ch_malloc
592551 19.8% 47.4% 592551 19.8% sqlite3_errmsg
527365 17.6% 65.0% 527365 17.6% PORT_ZAlloc_Util
524368 17.5% 82.5% 524368 17.5% slapi_ch_calloc
524304 17.5% 100.0% 524304 17.5% __GI___strdup
0 0.0% 100.0% 1119917 37.4% C_GetInterface
0 0.0% 100.0% 592551 19.8% NSS_Initialize
0 0.0% 100.0% 592551 19.8% NSS_IsInitialized
0 0.0% 100.0% 527365 17.6% PK11_FreeSymKey
0 0.0% 100.0% 527365 17.6% PK11_PubUnwrapSymKeyWithFlagsPerm
(jeprof)
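In the top output, entries with a non-zero first column are the leaf frames that directly performed the sampled allocations, while zero-valued entries appear only further up the call stacks. A saved copy of such a report can be post-processed with standard tools; a minimal sketch, using a hardcoded excerpt of the report above:

```shell
# Hardcoded excerpt of the jeprof "top" output above, saved for post-processing.
top=$(mktemp)
cat > "$top" <<'EOF'
  829411  27.7%  27.7%   829411  27.7% slapi_ch_malloc
  592551  19.8%  47.4%   592551  19.8% sqlite3_errmsg
  527365  17.6%  65.0%   527365  17.6% PORT_ZAlloc_Util
       0   0.0% 100.0%  1119917  37.4% C_GetInterface
EOF

# Keep only leaf allocators (non-zero flat bytes) and rank them by size.
awk '$1 > 0 { print $1, $6 }' "$top" | sort -rn
```

This kind of filtering makes it easy to see which allocation sites dominate the leaked bytes before drilling into full call stacks interactively.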
The jeprof tool has a number of reporting options and formats however discussing those is out of scope for this article. Outside of the available documentation, Red Hat Support can advise on profiling and report generating options further, depending on specific use cases and requirements.