Chapter 3. Troubleshooting networking issues

This chapter lists basic troubleshooting procedures related to networking and the Network Time Protocol (NTP).

3.1. Prerequisites

  • A running Red Hat Ceph Storage cluster.

3.2. Basic networking troubleshooting

Red Hat Ceph Storage depends heavily on a reliable network connection. Red Hat Ceph Storage nodes use the network to communicate with each other. Networking issues can cause many problems with Ceph OSDs, such as OSDs flapping or being incorrectly reported as down. Networking issues can also cause clock skew errors on the Ceph Monitors. In addition, packet loss, high latency, or limited bandwidth can impact cluster performance and stability.

Prerequisites

  • Root-level access to the node.

Procedure

  1. Install the net-tools and telnet packages. They can help when troubleshooting network issues that can occur in a Ceph storage cluster:

    Red Hat Enterprise Linux 7

    [root@mon ~]# yum install net-tools
    [root@mon ~]# yum install telnet

    Red Hat Enterprise Linux 8

    [root@mon ~]# dnf install net-tools
    [root@mon ~]# dnf install telnet

  2. Verify that the cluster_network and public_network parameters in the Ceph configuration file include the correct values:

    Example

    [root@mon ~]# cat /etc/ceph/ceph.conf | grep net
    cluster_network = 192.168.1.0/24
    public_network = 192.168.0.0/24

  3. Verify that the network interfaces are up:

    Example

    [root@mon ~]# ip link list
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    2: enp22s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
        link/ether 40:f2:e9:b8:a0:48 brd ff:ff:ff:ff:ff:ff

  4. Verify that the Ceph nodes are able to reach each other using their short host names. Perform this check on each node in the storage cluster:

    Syntax

    ping SHORT_HOST_NAME

    Example

    [root@mon ~]# ping osd01
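
    To run this check against several nodes in one pass, you can wrap the ping in a small loop. A minimal sketch; the host names mon01, osd01, and osd02 are hypothetical, substitute the short host names of your own nodes:

```shell
# Report reachability for each short host name passed as an argument.
check_hosts() {
    for host in "$@"; do
        if ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
            echo "$host: reachable"
        else
            echo "$host: UNREACHABLE"
        fi
    done
}

# Hypothetical node names; substitute your own.
check_hosts mon01 osd01 osd02
```

    Run the sketch on each node so that every host verifies its path to every other host.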

  5. If you use a firewall, ensure that Ceph nodes are able to reach each other on their appropriate ports. The firewall-cmd tool validates the firewall zone configuration, and the telnet tool verifies that a port is open:

    Syntax

    firewall-cmd --info-zone=ZONE
    telnet IP_ADDRESS PORT

    Example

    [root@mon ~]# firewall-cmd --info-zone=public
    public (active)
      target: default
      icmp-block-inversion: no
      interfaces: enp1s0
      sources: 192.168.0.0/24
      services: ceph ceph-mon cockpit dhcpv6-client ssh
      ports: 9100/tcp 8443/tcp 9283/tcp 3000/tcp 9092/tcp 9093/tcp 9094/tcp 9094/udp
      protocols:
      masquerade: no
      forward-ports:
      source-ports:
      icmp-blocks:
      rich rules:
    
    [root@mon ~]# telnet 192.168.0.22 9100
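
    If telnet is not installed, a Bash /dev/tcp redirection can perform the same port check. A minimal sketch; the IP address and the Ceph Monitor port 6789 in the last line are example values, substitute your own:

```shell
# Check whether a TCP port on a host accepts connections.
check_port() {
    local host=$1 port=$2
    if timeout 2 bash -c "exec 3<> /dev/tcp/$host/$port" 2> /dev/null; then
        echo "$host:$port open"
    else
        echo "$host:$port closed"
    fi
}

# Example: the default Ceph Monitor port on a hypothetical host.
check_port 192.168.0.22 6789
```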

  6. Verify that there are no errors on the interface counters, that the network connectivity between nodes has the expected latency, and that there is no packet loss.

    1. Using the ethtool command:

      Syntax

      ethtool -S INTERFACE

      Example

      [root@mon ~]# ethtool -S enp22s0f0 | grep errors
      NIC statistics:
           rx_fcs_errors: 0
           rx_align_errors: 0
           rx_frame_too_long_errors: 0
           rx_in_length_errors: 0
           rx_out_length_errors: 0
           tx_mac_errors: 0
           tx_carrier_sense_errors: 0
           tx_errors: 0
           rx_errors: 0
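
      To surface only the counters that are actually incrementing, you can filter the ethtool statistics for non-zero error values. A minimal sketch; the interface name enp22s0f0 is taken from the example above, replace it with your own:

```shell
# Print only the error counters whose value is greater than zero.
nonzero_errors() {
    awk -F: '/error/ && $2 + 0 > 0 { gsub(/^[ \t]+/, "", $1); print $1 ": " $2 + 0 }'
}

# Suppress stderr so the pipeline is quiet if the interface name is wrong.
ethtool -S enp22s0f0 2> /dev/null | nonzero_errors
```

      An empty result means no error counters are incrementing on that interface.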

    2. Using the ifconfig command:

      Example

      [root@mon ~]# ifconfig
      enp22s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
      inet 10.8.222.13  netmask 255.255.254.0  broadcast 10.8.223.255
      inet6 2620:52:0:8de:42f2:e9ff:feb8:a048  prefixlen 64  scopeid 0x0<global>
      inet6 fe80::42f2:e9ff:feb8:a048  prefixlen 64  scopeid 0x20<link>
      ether 40:f2:e9:b8:a0:48  txqueuelen 1000  (Ethernet)
      RX packets 4219130  bytes 2704255777 (2.5 GiB)
      RX errors 0  dropped 0  overruns 0  frame 0
      TX packets 1418329  bytes 738664259 (704.4 MiB)
      TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
      device interrupt 16

    3. Using the netstat command:

      Example

      [root@mon ~]# netstat -ai
      Kernel Interface table
      Iface          MTU   RX-OK RX-ERR RX-DRP RX-OVR  TX-OK TX-ERR TX-DRP TX-OVR Flg
      docker0       1500       0      0      0 0           0      0      0      0 BMU
      eno2          1500       0      0      0 0           0      0      0      0 BMU
      eno3          1500       0      0      0 0           0      0      0      0 BMU
      eno4          1500       0      0      0 0           0      0      0      0 BMU
      enp0s20u13u5  1500  253277      0      0 0           0      0      0      0 BMRU
      enp22s0f0     9000  234160      0      0 0      432326      0      0      0 BMRU
      lo           65536   10366      0      0 0       10366      0      0      0 LRU

  7. For performance issues, in addition to the latency checks, use the iperf3 tool to verify the network bandwidth between all nodes of the storage cluster. The iperf3 tool does a simple point-to-point network bandwidth test between a server and a client.

    1. Install the iperf3 package on the Red Hat Ceph Storage nodes between which you want to check the bandwidth:

      Red Hat Enterprise Linux 7

      [root@mon ~]# yum install iperf3

      Red Hat Enterprise Linux 8

      [root@mon ~]# dnf install iperf3

    2. On a Red Hat Ceph Storage node, start the iperf3 server:

      Example

      [root@mon ~]# iperf3 -s
      -----------------------------------------------------------
      Server listening on 5201
      -----------------------------------------------------------

      Note

      The default port is 5201, but can be set using the -p command argument.

    3. On a different Red Hat Ceph Storage node, start the iperf3 client:

      Example

      [root@osd ~]# iperf3 -c mon
      Connecting to host mon, port 5201
      [  4] local xx.x.xxx.xx port 52270 connected to xx.x.xxx.xx port 5201
      [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
      [  4]   0.00-1.00   sec   114 MBytes   954 Mbits/sec    0    409 KBytes
      [  4]   1.00-2.00   sec   113 MBytes   945 Mbits/sec    0    409 KBytes
      [  4]   2.00-3.00   sec   112 MBytes   943 Mbits/sec    0    454 KBytes
      [  4]   3.00-4.00   sec   112 MBytes   941 Mbits/sec    0    471 KBytes
      [  4]   4.00-5.00   sec   112 MBytes   940 Mbits/sec    0    471 KBytes
      [  4]   5.00-6.00   sec   113 MBytes   945 Mbits/sec    0    471 KBytes
      [  4]   6.00-7.00   sec   112 MBytes   937 Mbits/sec    0    488 KBytes
      [  4]   7.00-8.00   sec   113 MBytes   947 Mbits/sec    0    520 KBytes
      [  4]   8.00-9.00   sec   112 MBytes   939 Mbits/sec    0    520 KBytes
      [  4]   9.00-10.00  sec   112 MBytes   939 Mbits/sec    0    520 KBytes
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID] Interval           Transfer     Bandwidth       Retr
      [  4]   0.00-10.00  sec  1.10 GBytes   943 Mbits/sec    0             sender
      [  4]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec                  receiver
      
      iperf Done.

      This output shows a network bandwidth of approximately 940 Mbits/second between the Red Hat Ceph Storage nodes, along with no retransmissions (Retr) during the test.

      Red Hat recommends that you validate the network bandwidth between all the nodes in the storage cluster.
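
      To test several peers in sequence and keep only the summary figures, you can loop over the nodes and extract the sender line from each report. A minimal sketch; the host names osd01 and osd02 are hypothetical, and each peer must already be running iperf3 -s:

```shell
# Extract the sender bandwidth figure from an iperf3 text report (stdin).
sender_bandwidth() {
    awk '/sender/ { print $(NF-3), $(NF-2) }'
}

# Hypothetical peer hosts; each must be running "iperf3 -s".
for peer in osd01 osd02; do
    result=$(iperf3 -c "$peer" 2> /dev/null | sender_bandwidth)
    echo "$peer: ${result:-test failed}"
done
```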

  8. Ensure that all nodes have the same network interconnect speed. Slower attached nodes might slow down faster connected ones. Also, ensure that the inter-switch links can handle the aggregated bandwidth of the attached nodes:

    Syntax

    ethtool INTERFACE

    Example

    [root@mon ~]# ethtool enp22s0f0
    Settings for enp22s0f0:
    Supported ports: [ TP ]
    Supported link modes:   10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Half 1000baseT/Full
    Supported pause frame use: No
    Supports auto-negotiation: Yes
    Supported FEC modes: Not reported
    Advertised link modes:  10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Half 1000baseT/Full
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Advertised FEC modes: Not reported
    Link partner advertised link modes:  10baseT/Half 10baseT/Full
                                         100baseT/Half 100baseT/Full
                                         1000baseT/Full
    Link partner advertised pause frame use: Symmetric
    Link partner advertised auto-negotiation: Yes
    Link partner advertised FEC modes: Not reported
    Speed: 1000Mb/s
    Duplex: Full
    Port: Twisted Pair
    PHYAD: 1
    Transceiver: internal
    Auto-negotiation: on
    MDI-X: off
    Supports Wake-on: g
    Wake-on: d
    Current message level: 0x000000ff (255)
           drv probe link timer ifdown ifup rx_err tx_err
    Link detected: yes
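
    To compare interconnect speeds across many nodes quickly, you can reduce each ethtool report to its Speed and Duplex lines. A minimal sketch; run it on every node and compare the results. The interface name enp22s0f0 is taken from the example above:

```shell
# Print only the negotiated speed and duplex from an ethtool report (stdin).
link_speed() {
    awk -F': ' '/Speed|Duplex/ { gsub(/^[ \t]+/, "", $1); print $1 ": " $2 }'
}

ethtool enp22s0f0 2> /dev/null | link_speed
```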

3.3. Basic chrony NTP troubleshooting

This section includes basic chrony troubleshooting steps.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the Ceph Monitor node.

Procedure

  1. Verify that the chronyd daemon is running on the Ceph Monitor hosts:

    Example

    [root@mon ~]# systemctl status chronyd

  2. If chronyd is not running, enable and start it:

    Example

    [root@mon ~]# systemctl enable chronyd
    [root@mon ~]# systemctl start chronyd

  3. Ensure that chronyd is synchronizing the clocks correctly:

    Example

    [root@mon ~]# chronyc sources
    [root@mon ~]# chronyc sourcestats
    [root@mon ~]# chronyc tracking
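
    Ceph reports clock skew when a Monitor clock drifts by more than the mon_clock_drift_allowed threshold, which defaults to 0.05 seconds. As a rough check, you can compare the chronyd system time offset against that default. A minimal sketch that parses the chronyc tracking output:

```shell
# Warn if the chronyd system time offset exceeds the default
# mon_clock_drift_allowed threshold of 0.05 seconds.
check_offset() {
    awk '/^System time/ {
        offset = $4
        if (offset + 0 > 0.05)
            print "WARNING: offset " offset "s exceeds 0.05s"
        else
            print "OK: offset " offset "s"
    }'
}

chronyc tracking 2> /dev/null | check_offset
```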

3.4. Basic NTP troubleshooting

This section includes basic NTP troubleshooting steps.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the Ceph Monitor node.

Procedure

  1. Verify that the ntpd daemon is running on the Ceph Monitor hosts:

    Example

    [root@mon ~]# systemctl status ntpd

  2. If ntpd is not running, enable and start it:

    Example

    [root@mon ~]# systemctl enable ntpd
    [root@mon ~]# systemctl start ntpd

  3. Ensure that ntpd is synchronizing the clocks correctly:

    Example

    [root@mon ~]# ntpq -p
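
    As a rough check of peer quality, you can flag any peer whose reported offset exceeds the 0.05 second clock skew threshold. Note that ntpq -p reports offsets in milliseconds, so the comparison below uses 50 ms. A minimal sketch that parses the ntpq -p output:

```shell
# Flag peers in "ntpq -p" output whose clock offset exceeds 50 ms
# (the offset column is reported in milliseconds).
check_peers() {
    awk 'NR > 2 {
        offset = $9
        if (offset < 0) offset = -offset
        if (offset > 50) print $1 ": offset " $9 " ms"
    }'
}

ntpq -p 2> /dev/null | check_peers
```

    An empty result means all peers are within 50 ms of the local clock.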

Additional Resources

  • See the How to troubleshoot NTP issues solution on the Red Hat Customer Portal for advanced NTP troubleshooting steps.
  • See the Clock skew section in the Red Hat Ceph Storage Troubleshooting Guide for further details.