How to set rsyslog keepAlive to prevent long idle TCP sessions from being disconnected by network devices
Environment
- Red Hat Enterprise Linux 8
- Red Hat Enterprise Linux 9
Issue
- When rsyslog sends logs to a remote server, network devices along the route (such as NAT gateways and firewalls) disconnect long-idle TCP sessions.
- The sending side typically logs the following:
omfwd: remote server at 192.168.xx.xx:514 seems to have closed connection. This often happens when the remote peer (or an interim system like a load balancer or firewall) shuts down or aborts a connection. Rsyslog will re-open the connection if configured to do so (we saw a generic IO Error, which usually goes along with that behaviour).
action 'action-7-builtin:omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension.
- The receiving side typically logs the following:
rsyslogd: netstream session xxxxxxxxxxx from 1.2.3.4 will be closed due to error: Connection reset by peer
Resolution
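- Enable keepalive packets at the TCP socket layer. When keepalive probes go unanswered, rsyslog forcibly terminates the unresponsive TCP session and automatically establishes a new connection. The probes also generate periodic traffic on an otherwise idle session, so intermediate devices are less likely to expire it, provided KeepAlive.Time is shorter than the device's idle timeout.
- The following is a reference example. Adjust the values to your environment.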
action(type="omfwd"
    KeepAlive="on"                  # enable TCP keepalive on the forwarding socket
    KeepAlive.Probes="3"            # unanswered probes before the connection is considered dead
    KeepAlive.Interval="10"         # seconds between probes
    KeepAlive.Time="60"             # idle seconds before the first probe
    queue.filename="fwdRule1"       # unique name prefix for spool files
    queue.maxdiskspace="1g"         # 1gb space limit (use as much as possible)
    queue.saveonshutdown="on"       # save messages to disk on shutdown
    queue.type="LinkedList"         # run asynchronously
    action.resumeRetryCount="-1"    # infinite retries if host is down
    Target="192.168.xx.xx" Port="514" Protocol="tcp"
)
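To illustrate what "keepalive at the TCP socket layer" means, the following minimal Python sketch sets the Linux socket options that the three KeepAlive.* parameters appear to correspond to. The helper name `enable_keepalive` is hypothetical, and the mapping shown in the comments is an assumption based on the parameter descriptions, not a statement of how rsyslog is implemented.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=3):
    """Apply Linux TCP keepalive settings to a socket (illustrative helper)."""
    # Turn keepalive on for this socket (compare: KeepAlive="on").
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Idle seconds before the first probe (compare: KeepAlive.Time).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between probes (compare: KeepAlive.Interval).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Unanswered probes before the connection is dropped (compare: KeepAlive.Probes).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


# No connection is needed to set or read the options.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
idle_seconds = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
print("idle seconds before first probe:", idle_seconds)
sock.close()
```

The `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` constants are Linux-specific, which matches the caveat in the upstream documentation that the functionality may not be available on all platforms.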
Root Cause
- Time to detect a dead TCP connection = KeepAlive.Time + (KeepAlive.Interval × KeepAlive.Probes).
- See the following upstream documentation:
KeepAlive
Enable or disable keep-alive packets at the tcp socket layer. The default is to disable them.
KeepAlive.Probes
The number of unacknowledged probes to send before considering the connection dead and notifying the application layer. The default, 0, means that the operating system defaults are used. This has only effect if keep-alive is enabled. The functionality may not be available on all platforms.
KeepAlive.Time
The interval between the last data packet sent (simple ACKs are not considered data) and the first keepalive probe; after the connection is marked to need keepalive, this counter is not used any further. The default, 0, means that the operating system defaults are used. This has only effect if keep-alive is enabled. The functionality may not be available on all platforms.
KeepAlive.Interval
The interval for keep alive packets.
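As a worked example of the detection-time formula in this section, the sketch below computes the value for the reference configuration. The function names are hypothetical. It assumes the values are in seconds, that a value of 0 falls back to the kernel settings under /proc/sys/net/ipv4, and that the usual Linux defaults (7200, 75, 9) apply if those files cannot be read.

```python
from pathlib import Path

# Assumed location of the Linux kernel keepalive defaults.
PROC = Path("/proc/sys/net/ipv4")


def os_default(name, fallback):
    """Read a kernel keepalive default, or return the fallback if unreadable."""
    try:
        return int((PROC / name).read_text())
    except (OSError, ValueError):
        return fallback


def detection_time(idle=0, interval=0, probes=0):
    """Seconds until an unresponsive peer is detected."""
    # A value of 0 means "use the operating system default".
    idle = idle or os_default("tcp_keepalive_time", 7200)
    interval = interval or os_default("tcp_keepalive_intvl", 75)
    probes = probes or os_default("tcp_keepalive_probes", 9)
    return idle + interval * probes


# Reference configuration: Time=60, Interval=10, Probes=3.
print(detection_time(60, 10, 3))  # 60 + 10 * 3 = 90
```

With stock Linux defaults (7200, 75, and 9) the same formula gives 7875 seconds, a little over two hours, which is far longer than the idle timeout of many NAT devices and firewalls.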
Diagnostic Steps
- A packet capture shows a large number of TCP retransmissions toward port 514:
192.168.xx.xx → 1.2.3.4 RSH 145 Client -> Server data
192.168.xx.xx → 1.2.4.4 TCP 145 [TCP Retransmission] 50672 → 514 [PSH, ACK] Seq=74 Ack=1 Win=229 Len=77 TSval=1690056461 TSecr=3145985470
192.168.xx.xx → 1.2.3.4 RSH 145 Client -> Server data
192.168.xx.xx → 1.2.3.4 TCP 145 [TCP Retransmission] 55582 → 514 [PSH, ACK] Seq=74 Ack=1 Win=229 Len=77 TSval=1287642961 TSecr=3233031486
192.168.xx.xx → 1.2.3.4 TCP 218 [TCP Retransmission] 50672 → 514 [PSH, ACK] Seq=1 Ack=1 Win=229 Len=150 TSval=1690056669 TSecr=3145985470
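To quantify retransmissions in a capture like the one above, a small script can count them per connection. This is a minimal sketch that assumes the one-line-per-packet text format shown in this section; `count_retransmissions` is a hypothetical helper, not part of any capture tool.

```python
import re
from collections import Counter

# Matches "[TCP Retransmission] <source port> <arrow> <destination port>".
RETRANS = re.compile(r"\[TCP Retransmission\]\s+(\d+)\s+\S+\s+(\d+)")


def count_retransmissions(lines):
    """Count retransmitted segments per (source port, destination port)."""
    counts = Counter()
    for line in lines:
        match = RETRANS.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts


# Three lines copied from the capture in this section.
sample = [
    "192.168.xx.xx → 1.2.3.4 RSH 145 Client -> Server data",
    "192.168.xx.xx → 1.2.3.4 TCP 145 [TCP Retransmission] 55582 → 514 [PSH, ACK] Seq=74 Ack=1 Win=229 Len=77 TSval=1287642961 TSecr=3233031486",
    "192.168.xx.xx → 1.2.3.4 TCP 218 [TCP Retransmission] 50672 → 514 [PSH, ACK] Seq=1 Ack=1 Win=229 Len=150 TSval=1690056669 TSecr=3145985470",
]

for (src_port, dst_port), n in count_retransmissions(sample).items():
    print(f"{src_port} -> {dst_port}: {n} retransmission(s)")
```

Repeated retransmissions of the same sequence number from one source port, with no acknowledgment from the peer, are consistent with an intermediate device having silently dropped the session.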