28.5. Troubleshooting NVDIMM
28.5.1. Monitoring NVDIMM Health Using S.M.A.R.T.
Prerequisites
- On some systems, the acpi_ipmi driver must be loaded before health information can be retrieved. To load it, use the following command:
# modprobe acpi_ipmi
Procedure
- To access the health information, use the following command:
# ndctl list --dimms --health
...
{
  "dev":"nmem0",
  "id":"802c-01-1513-b3009166",
  "handle":1,
  "phys_id":22,
  "health":
  {
    "health_state":"ok",
    "temperature_celsius":25.000000,
    "spares_percentage":99,
    "alarm_temperature":false,
    "alarm_spares":false,
    "temperature_threshold":50.000000,
    "spares_threshold":20,
    "life_used_percentage":1,
    "shutdown_state":"clean"
  }
}
...
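The health fields shown above lend themselves to a scripted check. The following Python sketch is one possible approach and is not part of ndctl. It assumes that the output of ndctl list --dimms --health has been captured as text and parses as a JSON array of DIMM objects (or a single object) shaped like the example. The helper name unhealthy_dimms is hypothetical.

```python
import json


def unhealthy_dimms(ndctl_json):
    """Return (dev, reasons) pairs for DIMMs whose health data looks suspicious.

    ndctl_json: text captured from `ndctl list --dimms --health`. Assumed to be
    a JSON array of DIMM objects like the example above, or a single object.
    """
    dimms = json.loads(ndctl_json)
    if isinstance(dimms, dict):
        dimms = [dimms]

    flagged = []
    for dimm in dimms:
        health = dimm.get("health", {})
        reasons = []
        # Any state other than "ok" is treated as a reason to investigate.
        state = health.get("health_state", "ok")
        if state != "ok":
            reasons.append("health_state=" + str(state))
        # The alarm_* fields are booleans in the example output.
        for alarm in ("alarm_temperature", "alarm_spares"):
            if health.get(alarm):
                reasons.append(alarm)
        if reasons:
            flagged.append((dimm.get("dev"), reasons))
    return flagged
```

On a live system, the input text could be obtained with subprocess.run(["ndctl", "list", "--dimms", "--health"], capture_output=True, text=True, check=True).stdout. A DIMM that reports no health object is not flagged by this sketch, so an empty result does not prove that every device is healthy.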
28.5.2. Detecting and Replacing a Broken NVDIMM
To deal with a broken NVDIMM, you need to:
- Detect which NVDIMM device is failing,
- Back up data stored on it, and
- Physically replace the device.
Procedure 28.3. Detecting and Replacing a Broken NVDIMM
- To detect the broken DIMM, use the following command:
# ndctl list --dimms --regions --health --media-errors --human
The badblocks field shows which NVDIMM is broken. Note its name in the dev field. In the following example, the NVDIMM named nmem0 is broken:
Example 28.1. Health Status of NVDIMM Devices
# ndctl list --dimms --regions --health --media-errors --human
...
  "regions":[
    {
      "dev":"region0",
      "size":"250.00 GiB (268.44 GB)",
      "available_size":0,
      "type":"pmem",
      "numa_node":0,
      "iset_id":"0xXXXXXXXXXXXXXXXX",
      "mappings":[
        {
          "dimm":"nmem1",
          "offset":"0x10000000",
          "length":"0x1f40000000",
          "position":1
        },
        {
          "dimm":"nmem0",
          "offset":"0x10000000",
          "length":"0x1f40000000",
          "position":0
        }
      ],
      "badblock_count":1,
      "badblocks":[
        {
          "offset":65536,
          "length":1,
          "dimms":[
            "nmem0"
          ]
        }
      ],
      "persistence_domain":"memory_controller"
    }
  ]
}
- Use the following command to find the phys_id attribute of the broken NVDIMM:
# ndctl list --dimms --human
From the previous example, you know that nmem0 is the broken NVDIMM. Therefore, find the phys_id attribute of nmem0. In the following example, the phys_id is 0x10:
Example 28.2. The phys_id Attributes of NVDIMMs
# ndctl list --dimms --human
[
  {
    "dev":"nmem1",
    "id":"XXXX-XX-XXXX-XXXXXXXX",
    "handle":"0x120",
    "phys_id":"0x1c"
  },
  {
    "dev":"nmem0",
    "id":"XXXX-XX-XXXX-XXXXXXXX",
    "handle":"0x20",
    "phys_id":"0x10",
    "flag_failed_flush":true,
    "flag_smart_event":true
  }
]
- Use the following command to find the memory slot of the broken NVDIMM:
# dmidecode
In the output, find the entry where the Handle identifier matches the phys_id attribute of the broken NVDIMM. The Locator field lists the memory slot used by the broken NVDIMM. In the following example, the nmem0 device matches the 0x0010 identifier and uses the DIMM-XXX-YYYY memory slot:
Example 28.3. NVDIMM Memory Slot Listing
# dmidecode
...
Handle 0x0010, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0004
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 125 GB
        Form Factor: DIMM
        Set: 1
        Locator: DIMM-XXX-YYYY
        Bank Locator: Bank0
        Type: Other
        Type Detail: Non-Volatile Registered (Buffered)
...
- Back up all data in the namespaces on the NVDIMM. If you do not back up the data before replacing the NVDIMM, the data will be lost when you remove the NVDIMM from your system.
Warning
In some cases, such as when the NVDIMM is completely broken, the backup might fail.
To prevent this, regularly monitor your NVDIMM devices using S.M.A.R.T. as described in Section 28.5.1, “Monitoring NVDIMM Health Using S.M.A.R.T.” and replace failing NVDIMMs before they break.

Use the following command to list the namespaces on the NVDIMM:
# ndctl list --namespaces --dimm=DIMM-ID-number
In the following example, the nmem0 device contains the namespace0.0 and namespace0.2 namespaces, which you need to back up:
Example 28.4. NVDIMM Namespaces Listing
# ndctl list --namespaces --dimm=0
[
  {
    "dev":"namespace0.2",
    "mode":"sector",
    "size":67042312192,
    "uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "raw_uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "sector_size":4096,
    "blockdev":"pmem0.2s",
    "numa_node":0
  },
  {
    "dev":"namespace0.0",
    "mode":"sector",
    "size":67042312192,
    "uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "raw_uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "sector_size":4096,
    "blockdev":"pmem0s",
    "numa_node":0
  }
]
- Replace the broken NVDIMM physically.
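The lookup chain in this procedure (bad blocks, then DIMM name, then phys_id, then memory slot) can be sketched in Python. This is an illustrative sketch under stated assumptions, not a supported tool: the three function names are hypothetical, and each function expects the text of a command's output in the shape shown in the examples of this procedure (a JSON object with a regions key, a JSON array of DIMMs, and plain dmidecode text, respectively).

```python
import json
import re


def dimms_with_badblocks(listing_json):
    """Names of DIMMs referenced by any badblocks entry of any region.

    listing_json: output of
    `ndctl list --dimms --regions --health --media-errors --human`,
    assumed to be a JSON object with a "regions" array.
    """
    listing = json.loads(listing_json)
    broken = set()
    for region in listing.get("regions", []):
        for block in region.get("badblocks", []):
            broken.update(block.get("dimms", []))
    return sorted(broken)


def phys_id_of(dimms_json, dev):
    """phys_id of the DIMM named dev as an integer, or None if not listed.

    dimms_json: output of `ndctl list --dimms --human`, assumed to be a JSON
    array. With --human the value is a hex string such as "0x10"; without it,
    a plain number. Both forms are accepted here.
    """
    for dimm in json.loads(dimms_json):
        if dimm.get("dev") == dev:
            value = dimm["phys_id"]
            return int(value, 16) if isinstance(value, str) else value
    return None


def locator_for_handle(dmidecode_text, handle):
    """Locator of the dmidecode entry whose Handle equals handle, or None."""
    in_entry = False
    for line in dmidecode_text.splitlines():
        header = re.match(r"\s*Handle (0x[0-9A-Fa-f]+),", line)
        if header:
            # A new entry starts; remember whether it is the one we want.
            in_entry = int(header.group(1), 16) == handle
        elif in_entry and line.strip().startswith("Locator:"):
            # "Bank Locator:" lines do not match because of the prefix check.
            return line.split(":", 1)[1].strip()
    return None
```

Chained together, the result of dimms_with_badblocks feeds phys_id_of, and that value feeds locator_for_handle. With the example outputs in this procedure, the chain resolves nmem0 to the phys_id 0x10 and the memory slot DIMM-XXX-YYYY. Comparing the handle numerically rather than as text avoids a mismatch between 0x10 in the ndctl output and 0x0010 in the dmidecode output.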
