28.5. Troubleshooting NVDIMM

28.5.1. Monitoring NVDIMM Health Using S.M.A.R.T.

Some NVDIMMs support Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) interfaces for retrieving health information.
Monitor NVDIMM health regularly to prevent data loss. If S.M.A.R.T. reports problems with the health status of an NVDIMM, replace it as described in Section 28.5.2, “Detecting and Replacing a Broken NVDIMM”.

Prerequisites

  • On some systems, the acpi_ipmi driver must be loaded before health information can be retrieved. To load the driver, use the following command:
    # modprobe acpi_ipmi

Procedure

  • To access the health information, use the following command:
    # ndctl list --dimms --health
    ...
        {
          "dev":"nmem0",
          "id":"802c-01-1513-b3009166",
          "handle":1,
          "phys_id":22,
          "health":
          {
            "health_state":"ok",
            "temperature_celsius":25.000000,
            "spares_percentage":99,
            "alarm_temperature":false,
            "alarm_spares":false,
            "temperature_threshold":50.000000,
            "spares_threshold":20,
            "life_used_percentage":1,
            "shutdown_state":"clean"
          }
         }
    ...
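For regular monitoring, the JSON output can be checked by a script. The following is a minimal sketch, not part of ndctl: the function name unhealthy_dimms and the rule for flagging a device (a health_state other than "ok", or any alarm_* field set to true) are assumptions made for illustration. It also assumes that the command output is either a flat JSON array of DIMMs or an object with a "dimms" key. The sample data is copied from the output above.

```python
import json

def unhealthy_dimms(ndctl_json):
    """Return the names of DIMMs whose health report looks suspicious."""
    data = json.loads(ndctl_json)
    # Assumption: the output is a bare list of DIMMs, or an object with a "dimms" key.
    dimms = data.get("dimms", []) if isinstance(data, dict) else data
    flagged = []
    for dimm in dimms:
        health = dimm.get("health")
        if not health:
            continue  # no health data reported for this DIMM
        # Collect every alarm_* field that is set to true.
        alarms = [key for key, value in health.items()
                  if key.startswith("alarm_") and value is True]
        if health.get("health_state") != "ok" or alarms:
            flagged.append(dimm["dev"])
    return flagged

# Sample taken from the output shown above.
sample = """
[
  {
    "dev":"nmem0",
    "id":"802c-01-1513-b3009166",
    "handle":1,
    "phys_id":22,
    "health":
    {
      "health_state":"ok",
      "temperature_celsius":25.000000,
      "spares_percentage":99,
      "alarm_temperature":false,
      "alarm_spares":false,
      "temperature_threshold":50.000000,
      "spares_threshold":20,
      "life_used_percentage":1,
      "shutdown_state":"clean"
    }
  }
]
"""

print(unhealthy_dimms(sample))  # prints []
```

On a live system, the sample string could be replaced with the captured output of the ndctl command above, for example by using subprocess.run with capture_output=True.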
    

28.5.2. Detecting and Replacing a Broken NVDIMM

If error messages related to an NVDIMM appear in your system log or are reported by S.M.A.R.T., an NVDIMM device might be failing. In that case, you need to:
  1. Detect which NVDIMM device is failing.
  2. Back up the data stored on it.
  3. Physically replace the device.

Procedure 28.3. Detecting and Replacing a Broken NVDIMM

  1. To detect the broken NVDIMM, use the following command:
    # ndctl list --dimms --regions --health --media-errors --human
    
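    The badblocks field shows which NVDIMM is broken. Note the device name listed in the dimms array of the badblocks entry; it is the same name that appears in the dev field of the NVDIMM listing. In the following example, the NVDIMM named nmem0 is broken: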

    Example 28.1. Health Status of NVDIMM Devices

    # ndctl list --dimms --regions --health --media-errors --human
    
    ...
      "regions":[
        {
          "dev":"region0",
          "size":"250.00 GiB (268.44 GB)",
          "available_size":0,
          "type":"pmem",
          "numa_node":0,
          "iset_id":"0xXXXXXXXXXXXXXXXX",
          "mappings":[
            {
              "dimm":"nmem1",
              "offset":"0x10000000",
              "length":"0x1f40000000",
              "position":1
            },
            {
              "dimm":"nmem0",
              "offset":"0x10000000",
              "length":"0x1f40000000",
              "position":0
            }
          ],
          "badblock_count":1,
          "badblocks":[
            {
              "offset":65536,
              "length":1,
              "dimms":[
                "nmem0"
              ]
            }
          ],
          "persistence_domain":"memory_controller"
        }
      ]
    }
    
  2. Use the following command to find the phys_id attribute of the broken NVDIMM:
    # ndctl list --dimms --human
    
    From the previous example, you know that nmem0 is the broken NVDIMM. Therefore, find the phys_id attribute of nmem0. In the following example, the phys_id is 0x10:

    Example 28.2. The phys_id Attributes of NVDIMMs

    # ndctl list --dimms --human
    
    [
      {
        "dev":"nmem1",
        "id":"XXXX-XX-XXXX-XXXXXXXX",
        "handle":"0x120",
        "phys_id":"0x1c"
      },
      {
        "dev":"nmem0",
        "id":"XXXX-XX-XXXX-XXXXXXXX",
        "handle":"0x20",
        "phys_id":"0x10",
        "flag_failed_flush":true,
        "flag_smart_event":true
      }
    ]
    
  3. Use the following command to find the memory slot of the broken NVDIMM:
    # dmidecode
    
    In the output, find the entry where the Handle identifier matches the phys_id attribute of the broken NVDIMM. The Locator field lists the memory slot used by the broken NVDIMM. In the following example, the phys_id of the nmem0 device (0x10) matches the entry with Handle 0x0010, and the device uses the DIMM-XXX-YYYY memory slot:

    Example 28.3. NVDIMM Memory Slot Listing

    # dmidecode
    
    ...
    Handle 0x0010, DMI type 17, 40 bytes
    Memory Device
            Array Handle: 0x0004
            Error Information Handle: Not Provided
            Total Width: 72 bits
            Data Width: 64 bits
            Size: 125 GB
            Form Factor: DIMM
            Set: 1
            Locator: DIMM-XXX-YYYY
            Bank Locator: Bank0
            Type: Other
            Type Detail: Non-Volatile Registered (Buffered)
    ...
    
  4. Back up all data in the namespaces on the NVDIMM. If you do not back up the data before replacing the NVDIMM, the data will be lost when you remove the NVDIMM from your system.

    Warning

    In some cases, such as when the NVDIMM is completely broken, the backup might fail.
    To prevent this, regularly monitor your NVDIMM devices using S.M.A.R.T. as described in Section 28.5.1, “Monitoring NVDIMM Health Using S.M.A.R.T.” and replace failing NVDIMMs before they break.

    Use the following command to list the namespaces on the NVDIMM:
    # ndctl list --namespaces --dimm=DIMM-ID-number

    In the following example, the nmem0 device contains the namespace0.0 and namespace0.2 namespaces, which you need to back up:

    Example 28.4. NVDIMM Namespaces Listing

    # ndctl list --namespaces --dimm=0
    
    [
      {
        "dev":"namespace0.2",
        "mode":"sector",
        "size":67042312192,
        "uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
        "raw_uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
        "sector_size":4096,
        "blockdev":"pmem0.2s",
        "numa_node":0
      },
      {
        "dev":"namespace0.0",
        "mode":"sector",
        "size":67042312192,
        "uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
        "raw_uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
        "sector_size":4096,
        "blockdev":"pmem0s",
        "numa_node":0
      }
    ]
    
  5. Replace the broken NVDIMM physically.
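
The lookup in steps 1 to 3 of the procedure (bad blocks, then phys_id, then memory slot) can be sketched in Python. This is an illustration only, not a supported tool: the function names dimms_with_badblocks, phys_id_of, and locator_for_handle are hypothetical, and the sample strings are shortened copies of Examples 28.1 to 28.3. The sketch assumes the JSON layout shown in those examples and a dmidecode text layout in which each entry starts with a "Handle 0x...," line.

```python
import json
import re

def dimms_with_badblocks(regions_json):
    """Collect the DIMM names listed under any region's "badblocks" entries."""
    data = json.loads(regions_json)
    # Assumption: regions appear under a "regions" key, or as a bare list.
    regions = data.get("regions", []) if isinstance(data, dict) else data
    names = set()
    for region in regions:
        for block in region.get("badblocks", []):
            names.update(block.get("dimms", []))
    return sorted(names)

def phys_id_of(dimms_json, dev):
    """Return the phys_id of the named DIMM as an integer, or None if not found."""
    for dimm in json.loads(dimms_json):
        if dimm.get("dev") == dev:
            phys_id = dimm.get("phys_id")
            # With --human the value is a hex string such as "0x10";
            # without it the value is a plain number.
            return int(phys_id, 16) if isinstance(phys_id, str) else phys_id
    return None

def locator_for_handle(dmidecode_text, handle):
    """Return the Locator field of the dmidecode entry with the given handle."""
    current = None
    for line in dmidecode_text.splitlines():
        match = re.match(r"\s*Handle (0x[0-9A-Fa-f]+),", line)
        if match:
            current = int(match.group(1), 16)  # a new entry starts here
        elif current == handle and line.strip().startswith("Locator:"):
            return line.split(":", 1)[1].strip()
    return None

# Shortened samples based on Examples 28.1, 28.2, and 28.3.
REGIONS = """{"regions":[{"dev":"region0","badblock_count":1,
  "badblocks":[{"offset":65536,"length":1,"dimms":["nmem0"]}]}]}"""
DIMMS = """[{"dev":"nmem1","handle":"0x120","phys_id":"0x1c"},
  {"dev":"nmem0","handle":"0x20","phys_id":"0x10"}]"""
DMIDECODE = """Handle 0x0010, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0004
        Size: 125 GB
        Locator: DIMM-XXX-YYYY
        Bank Locator: Bank0
"""

for dev in dimms_with_badblocks(REGIONS):
    handle = phys_id_of(DIMMS, dev)
    print(dev, hex(handle), locator_for_handle(DMIDECODE, handle))
# prints: nmem0 0x10 DIMM-XXX-YYYY
```

Comparing the handles as integers rather than as strings avoids a mismatch between the "0x10" form printed by ndctl and the zero-padded "0x0010" form printed by dmidecode.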