RHEL 6.5 connect() behaviour differs from bind() behaviour regarding local TCP ports

Solution In Progress - Updated -

Environment

Red Hat Enterprise Linux 6.5

Issue

  • The bind() call assigns a local IP address and port before connect() is called. Consequently, it is possible to bind to a local TCP port that is already in use by another IP address. However, the connect() call returns [Errno 99] Cannot assign requested address when it has to assign a local TCP port itself, effectively disregarding the 4-tuple information. This only occurs if the process that originally used the local TCP port called bind() prior to connect().

Resolution

There are a few possibilities:

  • The cleanest workaround is to avoid calling bind() for local ports before calling connect(). It is fine to use bind() for listening sockets, as they are not affected. When all programs call connect() on its own, the problem is bypassed: the hash table is initialised by __inet_hash_connect() instead of inet_csk_get_port(), which ensures that subsequent programs calling connect() perform the 4-tuple check and succeed.

  • Increase the local TCP port range. Note that the TCP port is defined as a 16-bit unsigned integer in the TCP header, so its maximum is 65535. The defaults are:

# cat /proc/sys/net/ipv4/ip_local_port_range
32768   61000
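To put the default range in perspective, the following is a minimal sketch that counts how many local ports a given range provides. It assumes both bounds of the range are usable (an inclusive range); the helper names parse_port_range and port_count are illustrative and not part of any library.

```python
def parse_port_range(text):
    """Parse the two whitespace-separated integers found in ip_local_port_range."""
    low, high = (int(field) for field in text.split())
    return low, high


def port_count(low, high):
    """Number of ports in the inclusive range [low, high] (assumed inclusive)."""
    return high - low + 1


# Sample content matching the default output shown above.
DEFAULT_RANGE = "32768\t61000"
low, high = parse_port_range(DEFAULT_RANGE)
print("%d-%d provides %d local ports" % (low, high, port_count(low, high)))
```

On a live system the same string could be read from /proc/sys/net/ipv4/ip_local_port_range instead of the sample constant.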

It can be increased as follows:

sysctl -w net.ipv4.ip_local_port_range="1024    65535"

  • If the problem persists after increasing the local port range, and there are sockets in the TIME-WAIT state and TCP timestamps are enabled (net.ipv4.tcp_timestamps=1), another possibility is to set net.ipv4.tcp_tw_reuse to 1. This allows sockets in the TIME-WAIT state to be reused for new outgoing connections.

  • If removing the bind() call is not an option, then have all processes call bind() with both the local IP address and the TCP port number before calling connect(). Note that SPORT can equal 0, in which case the kernel chooses the port.

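import socket

SPORT = 0                 # local (source) port; 0 lets the kernel choose one
HOST = '192.168.x.4'      # remote host
PORT = 3000               # remote port
SADDR = '192.168.x.12'    # local (source) IP address

# Bind to the local address and port first, then connect to the remote end.
source_address = (SADDR, int(SPORT))
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(source_address)
s.connect((HOST, PORT))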
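The two client styles discussed in this section can also be tried without a remote host. The following is a rough sketch that uses a throwaway listener on the loopback address; the loopback address and the helper names connect_with_bind and connect_only are assumptions made for illustration only, and the sketch does not reproduce the port exhaustion itself.

```python
import socket

# Throwaway listener on loopback so the example needs no remote host.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))      # port 0: the kernel picks a free port
listener.listen(5)
target = listener.getsockname()


def connect_with_bind(local_ip, remote):
    """bind() to (local_ip, 0) first, then connect(); returns the socket."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((local_ip, 0))
    s.connect(remote)
    return s


def connect_only(remote):
    """connect() without bind(); the kernel picks local address and port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(remote)
    return s


a = connect_with_bind("127.0.0.1", target)
b = connect_only(target)
local_a = a.getsockname()            # (local IP, local port) chosen for a
local_b = b.getsockname()            # (local IP, local port) chosen for b
print("bind+connect local address:", local_a)
print("connect-only local address:", local_b)

for s in (a, b, listener):
    s.close()
```

Whichever style is chosen, the recommendation above is to use it consistently across all processes that make outgoing connections.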

Root Cause

When the initial connections take place, the corresponding tb hash entries are allocated by bind() via inet_csk_get_port(). When another process then calls connect() on its own, which invokes the __inet_hash_connect() kernel routine, that routine traverses the tb entries previously allocated via bind(). Because inet_csk_get_port() set the tb->fastreuse variable to a value >= 0, the __inet_hash_connect() code skips the 4-tuple check and advances to the next port. It cycles through all ports in the range, hits the same condition on each, and finally returns -EADDRNOTAVAIL.
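The port search described above can be illustrated with a small toy model. This is only a sketch of the logic as described in this article, not the kernel code: the dictionary of buckets, the set of established 4-tuples, the three-port range and the 192.0.2.x addresses are all assumptions chosen for illustration.

```python
import errno


def hash_connect(buckets, established, laddr, raddr, rport, ports):
    """Toy model of the local port search performed for connect().

    buckets maps port -> fastreuse value of its bind bucket (tb).
    established is a set of (laddr, lport, raddr, rport) tuples in use.
    Returns the chosen port, or -EADDRNOTAVAIL if every port is skipped.
    """
    for port in ports:
        fastreuse = buckets.get(port)            # None: no bucket for this port yet
        if fastreuse is not None:
            if fastreuse >= 0:
                continue                         # bucket made by bind(): skipped, no 4-tuple check
            if (laddr, port, raddr, rport) in established:
                continue                         # 4-tuple genuinely in use
        buckets[port] = -1                       # bucket now owned by a connect() socket
        established.add((laddr, port, raddr, rport))
        return port
    return -errno.EADDRNOTAVAIL


ports = range(32768, 32771)                      # tiny hypothetical port range

# Case 1: every port already has a bucket created via bind() (fastreuse >= 0).
bind_buckets = {p: 0 for p in ports}
result_bind = hash_connect(bind_buckets, set(), "192.0.2.12", "192.0.2.4", 3000, ports)

# Case 2: every port has a bucket created via connect() to a different remote host.
connect_buckets = {p: -1 for p in ports}
in_use = {("192.0.2.12", p, "192.0.2.5", 3000) for p in ports}
result_connect = hash_connect(connect_buckets, in_use, "192.0.2.12", "192.0.2.4", 3000, ports)

print("buckets from bind():   ", result_bind)
print("buckets from connect():", result_connect)
```

In the first case the search fails even though no 4-tuple conflicts, while in the second case the 4-tuple check lets the first port be shared.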

Please refer to the "Resolution" section for a list of workarounds.

  • Based on the above we would like the following answered:

An explanation for the behaviour observed?

The connect() call uses a completely different kernel function to select a port and initialise the port hash table.

In net/ipv4/tcp_ipv4.c

        tcp_set_state(sk, TCP_SYN_SENT);
        err = inet_hash_connect(&tcp_death_row, sk);
        if (err)
                goto failure;

Here is a back trace from an stap script that demonstrates this.

__inet_hash_connect called by python with pid 4882
 0xffffffff81487f20 : __inet_hash_connect+0x0/0x380 [kernel]
 0xffffffff814882ef : inet_hash_connect+0x4f/0x60 [kernel]
 0xffffffff814a075a : tcp_v4_connect+0x2aa/0x570 [kernel]
 0xffffffff814b0952 : inet_stream_connect+0x272/0x2c0 [kernel]
 0xffffffff81436227 : sys_connect+0xd7/0xf0 [kernel]
 0xffffffff8100b072 : system_call_fastpath+0x16/0x1b [kernel]

The bind() call uses the inet_csk_get_port() function.

In net/ipv4/inet_connection_sock.c

int inet_csk_get_port(struct sock *sk, unsigned short snum)

Here is a back trace from an stap script that demonstrates this.

inet_csk_get_port called by python with pid 5138
 0xffffffff8148a640 : inet_csk_get_port+0x0/0x4a0 [kernel]
 0xffffffff814b0aaa : inet_bind+0x10a/0x200 [kernel]
 0xffffffff81436390 : sys_bind+0xd0/0xf0 [kernel]
 0xffffffff8100b072 : system_call_fastpath+0x16/0x1b [kernel]

Why does binding help in this scenario?

The bind() call stores the local port in the TCP socket, in inet_sk(sk)->num. When connect() is subsequently called, it invokes the __inet_hash_connect() kernel routine. Because inet_sk(sk)->num is > 0, the routine bypasses the port search code that was failing to do the 4-tuple check. Instead, it performs the 4-tuple check by calling check_established(), which returns 0 because the IP address is different, and therefore connect() succeeds.

Here is a back trace from an stap script that demonstrates the conflict check performed by bind(). Note that this check is not called by connect().

inet_csk_bind_conflict called by python with pid 
 0xffffffff81489630 : inet_csk_bind_conflict+0x0/0xf0 [kernel]
 0xffffffff8148a801 : inet_csk_get_port+0x1c1/0x4a0 [kernel]
 0xffffffff814b0aaa : inet_bind+0x10a/0x200 [kernel]
 0xffffffff81436390 : sys_bind+0xd0/0xf0 [kernel]
 0xffffffff8100b072 : system_call_fastpath+0x16/0x1b [kernel]
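The path taken when the local port is already set by bind() can be sketched in the same toy style. This is a simplified illustration of the 4-tuple comparison as described in this article, not the kernel implementation; the function names mirror the kernel routine for readability only, and the addresses and port 40000 are hypothetical.

```python
import errno


def check_established(established, laddr, lport, raddr, rport):
    """Toy 4-tuple check: 0 if the tuple is unused, else -EADDRNOTAVAIL."""
    if (laddr, lport, raddr, rport) in established:
        return -errno.EADDRNOTAVAIL
    return 0


def connect_with_bound_port(established, laddr, lport, raddr, rport):
    """Toy model of connect() on a socket whose local port was set by bind()."""
    err = check_established(established, laddr, lport, raddr, rport)
    if err == 0:
        established.add((laddr, lport, raddr, rport))   # record the new connection
    return err


# Port 40000 is already used by a connection from another local IP address.
established = {("192.0.2.11", 40000, "192.0.2.4", 3000)}

# Same local port, different local IP address: the 4-tuple differs, so it succeeds.
first = connect_with_bound_port(established, "192.0.2.12", 40000, "192.0.2.4", 3000)

# Repeating the identical 4-tuple is refused.
second = connect_with_bound_port(established, "192.0.2.12", 40000, "192.0.2.4", 3000)

print(first, second)
```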

Why isn't the complete 4 tuple information used for determining port reusability in the failing case?

When the initial connections take place, the corresponding tb hash entries are allocated by bind() via inet_csk_get_port(). When another process then calls connect() on its own, which invokes the __inet_hash_connect() kernel routine, that routine traverses the tb entries previously allocated via bind(). Because inet_csk_get_port() set the tb->fastreuse variable to a value >= 0, the __inet_hash_connect() code skips the 4-tuple check and advances to the next port. It cycles through all ports in the range, hits the same condition on each, and finally returns -EADDRNOTAVAIL.

Diagnostic Steps

Processes may log the following error when the local TCP ports are exhausted.

[Errno 99] Cannot assign requested address
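An application can detect this condition by comparing the exception's errno with errno.EADDRNOTAVAIL rather than matching the message text. The following is a minimal sketch; the helper names is_local_port_failure and try_connect are hypothetical. Note that EADDRNOTAVAIL can also be raised for other reasons, such as binding to an address that is not configured locally, so it is a hint rather than proof of port exhaustion.

```python
import errno
import socket


def is_local_port_failure(exc):
    """True if exc carries EADDRNOTAVAIL (Errno 99 on Linux)."""
    return getattr(exc, "errno", None) == errno.EADDRNOTAVAIL


def try_connect(remote, timeout=5.0):
    """Connect to remote; return the socket, or None on EADDRNOTAVAIL."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(remote)
        return s
    except OSError as exc:           # socket.error is an alias of OSError in Python 3
        s.close()
        if is_local_port_failure(exc):
            print("no local address/port could be assigned: %s" % exc)
            return None
        raise                        # any other error is left to the caller
```

A caller might log the event and retry later, since the condition can clear once connections close.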
