RHEL 6.5 connect() behaviour differs from bind() behaviour regarding local TCP ports
Environment
Red Hat Enterprise Linux 6.5
Issue
- The bind() call assigns a local IP address and port before connect() is called. Consequently, it is possible to bind to a local TCP port that is already in use by another IP address. However, the connect() call returns [Errno 99] Cannot assign requested address when it goes to assign a local TCP port, effectively disregarding the 4 tuple information. This only occurs if bind() is called prior to connect() from the process that originally used the local TCP port.
Resolution
There are a few possibilities:
- The most elegant workaround is to avoid calling bind() for local ports prior to calling connect(). It is fine to use bind() for listen sockets as they are not affected. By having all programs call connect() on its own, this problem is bypassed because the hash table is initialized by __inet_hash_connect() instead of inet_csk_get_port(), which ensures subsequent programs calling connect() will do the 4 tuple check and work.
- Increase the local TCP port range. Note that the TCP port is defined as a 16-bit unsigned integer in the TCP header, so its maximum is 65535. The defaults are:
# cat /proc/sys/net/ipv4/ip_local_port_range
32768 61000
It can be increased as follows:
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
- If the problem still persists after increasing the local port range, another possibility, if there are sockets in the TIME-WAIT state and TCP timestamps are enabled (net.ipv4.tcp_timestamps=1), is to set the net.ipv4.tcp_tw_reuse variable to 1. This will allow the sockets in a TIME-WAIT state to be reused.
- If removing the bind() call is not an option, then use the bind() call specifying both the IP address and TCP port number before calling connect() for all processes. Note that SPORT can equal 0, in which case the kernel selects the local port.
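import socket

SPORT = 0                # local (source) port; 0 lets the kernel choose one
SADDR = '192.168.x.12'   # local (source) IP address to bind to
HOST = '192.168.x.4'     # remote IP address
PORT = 3000              # remote TCP port

# Bind to the local address and port before connecting.
source_address = (SADDR, int(SPORT))
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(source_address)
s.connect((HOST, PORT))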
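The tunables mentioned in the list above can be inspected from a script before anything is changed. The following is a minimal sketch, assuming a Linux host where the net.ipv4 sysctls are readable under /proc/sys/net/ipv4/. The helper names (parse_port_range, ephemeral_port_count, read_sysctl) are made up for illustration and are not part of any Red Hat tool.

```python
import os

# Assumption: on Linux, net.ipv4.* sysctls are exposed under this directory.
PROC_IPV4 = "/proc/sys/net/ipv4"


def parse_port_range(text):
    """Parse the 'low high' pair as printed by ip_local_port_range."""
    low, high = (int(field) for field in text.split())
    if not 0 < low <= high <= 65535:
        raise ValueError("invalid port range: %r" % text)
    return low, high


def ephemeral_port_count(low, high):
    """Number of local ports in the inclusive range low..high."""
    return high - low + 1


def read_sysctl(name):
    """Return the raw text of a net.ipv4 sysctl, or None if it cannot be read."""
    try:
        with open(os.path.join(PROC_IPV4, name)) as handle:
            return handle.read().strip()
    except (IOError, OSError):
        return None


if __name__ == "__main__":
    raw = read_sysctl("ip_local_port_range")
    if raw is not None:
        low, high = parse_port_range(raw)
        print("local port range: %d-%d (%d ports)"
              % (low, high, ephemeral_port_count(low, high)))
    for name in ("tcp_timestamps", "tcp_tw_reuse"):
        print("%s = %s" % (name, read_sysctl(name)))
```

With the default range of 32768 to 61000 the count is 28233 ports; with the widened range of 1024 to 65535 it is 64512.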
Root Cause
When the initial connections take place, the corresponding tb hash entries are allocated by bind() via inet_csk_get_port(). When connect() is then called on its own from another process, it calls the __inet_hash_connect() kernel routine, which traverses the tb entries that were previously allocated via bind(). Because the tb->fastreuse variable was set to a value >= 0 by inet_csk_get_port(), the __inet_hash_connect() code skips the 4 tuple check and advances to the next port. It cycles through all of the ports, hitting the same condition on each, and finally returns -EADDRNOTAVAIL.
Please refer to the "Resolution" section for a list of workarounds.
- Based on the above we would like the following answered:
An explanation for the behaviour observed?
The connect() call uses a completely different kernel function to select a port and initialise the port hash table.
In net/ipv4/tcp_ipv4.c:
    tcp_set_state(sk, TCP_SYN_SENT);
    err = inet_hash_connect(&tcp_death_row, sk);
    if (err)
        goto failure;
Here is a back trace from an stap script that demonstrates this.
__inet_hash_connect called by python with pid 4882
0xffffffff81487f20 : __inet_hash_connect+0x0/0x380 [kernel]
0xffffffff814882ef : inet_hash_connect+0x4f/0x60 [kernel]
0xffffffff814a075a : tcp_v4_connect+0x2aa/0x570 [kernel]
0xffffffff814b0952 : inet_stream_connect+0x272/0x2c0 [kernel]
0xffffffff81436227 : sys_connect+0xd7/0xf0 [kernel]
0xffffffff8100b072 : system_call_fastpath+0x16/0x1b [kernel]
The bind() call uses the inet_csk_get_port() function.
In net/ipv4/inet_connection_sock.c:
int inet_csk_get_port(struct sock *sk, unsigned short snum)
Here is a back trace from an stap script that demonstrates this.
inet_csk_get_port called by python with pid 5138
0xffffffff8148a640 : inet_csk_get_port+0x0/0x4a0 [kernel]
0xffffffff814b0aaa : inet_bind+0x10a/0x200 [kernel]
0xffffffff81436390 : sys_bind+0xd0/0xf0 [kernel]
0xffffffff8100b072 : system_call_fastpath+0x16/0x1b [kernel]
Why does binding help in this scenario?
The bind() call updates the TCP socket with the local port in inet_sk(sk)->num. When connect() is subsequently called, it calls the __inet_hash_connect() kernel routine. As inet_sk(sk)->num is > 0, it bypasses the code that was failing to do the 4 tuple check. Instead, it does the 4 tuple check by calling check_established(), which returns 0 because the IP address is different, and therefore connect() succeeds.
Here is a back trace from an stap script showing the conflict check that bind() performs. Note that this check is not called by connect().
inet_csk_bind_conflict called by python with pid
0xffffffff81489630 : inet_csk_bind_conflict+0x0/0xf0 [kernel]
0xffffffff8148a801 : inet_csk_get_port+0x1c1/0x4a0 [kernel]
0xffffffff814b0aaa : inet_bind+0x10a/0x200 [kernel]
0xffffffff81436390 : sys_bind+0xd0/0xf0 [kernel]
0xffffffff8100b072 : system_call_fastpath+0x16/0x1b [kernel]
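The effect described in the answer above, where bind() fixes the local port on the socket before connect() runs, can be observed from user space with getsockname(). This is a minimal sketch, assuming a loopback listener on 127.0.0.1 stands in for the remote service. The function name and addresses are illustrative only, and the sketch does not reproduce the port exhaustion itself.

```python
import socket


def bind_then_connect(local_ip, remote_addr):
    """Bind to (local_ip, 0), then connect.

    Returns the local port reported after bind() and after connect().
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the kernel to choose a free local port during bind().
        s.bind((local_ip, 0))
        port_after_bind = s.getsockname()[1]
        s.connect(remote_addr)
        port_after_connect = s.getsockname()[1]
    finally:
        s.close()
    return port_after_bind, port_after_connect


if __name__ == "__main__":
    # Stand-in listener on loopback so the sketch is self-contained.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    try:
        before, after = bind_then_connect("127.0.0.1", listener.getsockname())
        print("local port after bind(): %d, after connect(): %d" % (before, after))
    finally:
        listener.close()
```

The port is already non-zero after bind() and is unchanged by connect(), which is consistent with connect() finding a local port already set on the socket.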
Why isn't the complete 4 tuple information used for determining port reusability in the failing case?
When the initial connections take place, the corresponding tb hash entries are allocated by bind() via inet_csk_get_port(). When connect() is then called on its own from another process, it calls the __inet_hash_connect() kernel routine, which traverses the tb entries that were previously allocated via bind(). Because the tb->fastreuse variable was set to a value >= 0 by inet_csk_get_port(), the __inet_hash_connect() code skips the 4 tuple check and advances to the next port. It cycles through all of the ports, hitting the same condition on each, and finally returns -EADDRNOTAVAIL.
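The port search described above can be modelled in a few lines. The following is a deliberately simplified toy model in Python, not kernel code. The Bucket class, the negative FASTREUSE_CONNECT_ONLY marker for entries created by connect(), and the check_established callback are assumptions made for illustration, based on the description in this section. The model also scans the range linearly from the low end, which the real routine does not necessarily do.

```python
import errno

# Assumption for this model: entries created by connect() carry a negative
# fastreuse value, while entries created by bind() carry a value >= 0.
FASTREUSE_CONNECT_ONLY = -1


class Bucket(object):
    """Toy stand-in for a tb hash entry covering one local port."""

    def __init__(self, port, fastreuse):
        self.port = port
        self.fastreuse = fastreuse


def pick_local_port(port_range, buckets, check_established):
    """Model of the port search done by connect() without a prior bind().

    port_range: (low, high), inclusive.
    buckets: dict mapping port -> Bucket for ports that already have an entry.
    check_established: callable(port) -> True if the 4 tuple would be unique.
    Returns the chosen port, or -errno.EADDRNOTAVAIL if no port is usable.
    """
    low, high = port_range
    for port in range(low, high + 1):
        tb = buckets.get(port)
        if tb is None:
            return port  # no entry yet, so the port is free
        if tb.fastreuse >= 0:
            continue  # entry came from bind(): skipped without a 4 tuple check
        if check_established(port):
            return port  # entry came from connect(): the 4 tuple check decides
    return -errno.EADDRNOTAVAIL
```

In this model, a range whose entries were all created by bind() yields -EADDRNOTAVAIL even when check_established would accept every port, which mirrors the failure described above.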
Diagnostic Steps
Processes may log the following error when the local TCP ports are exhausted.
[Errno 99] Cannot assign requested address
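An application can tell this condition apart from other connection failures by checking the error number rather than the message text. This is a minimal sketch, assuming Linux, where EADDRNOTAVAIL has the value 99. The helper names are hypothetical, and the 5 second timeout is an arbitrary choice.

```python
import errno
import socket


def is_addr_not_available(exc):
    """True if exc carries the 'Cannot assign requested address' error number."""
    return getattr(exc, "errno", None) == errno.EADDRNOTAVAIL


def try_connect(remote_addr, timeout=5.0):
    """Connect without calling bind() first.

    Returns the connected socket, or None if the kernel could not assign a
    local address or port. Any other error is re-raised.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(remote_addr)
    except socket.error as exc:
        s.close()
        if is_addr_not_available(exc):
            return None
        raise
    return s
```

A caller could log the None case separately, for example alongside the current value of net.ipv4.ip_local_port_range, to confirm that local port exhaustion is the cause.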
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.