Network file transfers drop to zero after 1 hour
Working on a problem where large file transfers over a 1 Gbps dedicated WAN circuit (running over an MPLS cloud) run at optimal speed for an hour, then throttle back to nearly zero. Occasionally the file transfer ramps up briefly, then drops back down to a trickle of data flowing to the receiving end. Both ends are running RHEL 5.5 on Dell servers. We've checked that all ports are set to 1 Gbps or greater from end to end. No packet loss, no retransmissions. Latency is 60 ms end to end, as the circuit is 2000 miles long.
This problem does not occur with UDP, only TCP. MTU is set to 1500, and the TCP stack on the RHEL kernels is set to the default values:
net.ipv4.udp_wmem_min = 4096
net.ipv4.udp_rmem_min = 4096
net.ipv4.udp_mem = 4651200 6201600 9302400
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216
net.ipv4.tcp_mem = 196608 262144 393216
net.ipv4.igmp_max_memberships = 20
net.core.optmem_max = 20480
net.core.rmem_default = 129024
net.core.wmem_default = 129024
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
vm.lowmem_reserve_ratio = 256 256 32
vm.overcommit_memory = 2
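For reference, the bandwidth-delay product of this link is roughly 7.5 MB, which is how much data must be in flight to keep the pipe full. A quick sketch of the arithmetic:

```shell
# Bandwidth-delay product (BDP): bytes that must be in flight to fill the pipe.
# Figures from the question above: 1 Gbps line rate, 60 ms round-trip time.
RATE_BYTES_PER_SEC=$((1000000000 / 8))       # 1 Gbps expressed in bytes/sec
RTT_MS=60
BDP=$((RATE_BYTES_PER_SEC * RTT_MS / 1000))  # bytes needed in flight
echo "BDP: ${BDP} bytes"                     # prints "BDP: 7500000 bytes"
```

The 16 MB `tcp_rmem`/`tcp_wmem` maximums above do cover this, so kernel autotuning has headroom; the question is whether the window actually stays that large once the transfer degrades.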
Responses
How are you doing the file copy? Is it possible that you're overwhelming the storage on the receiver? That would show up as "TCP Zero Window" advertisements coming from the receiving end; the TCP congestion control algorithm might then take a while to ramp the connection back up again over the long link.
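To check for this directly, a capture filter along these lines (a sketch; the interface and port are placeholders for your environment) matches TCP segments advertising a zero window:

```shell
# Match TCP segments whose advertised window field (bytes 14-15 of the TCP
# header) is zero. Run on the receiver while the transfer is crawling;
# eth0 and port 22 are placeholders for your interface and transfer port.
tcpdump -ni eth0 'tcp port 22 and tcp[14:2] = 0'
```

A steady stream of matches during the slowdown would point at the receiver (or its storage) rather than the network.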
The watcher-cron script might be helpful to see what each end is actually doing:
Gathering system baseline resource usage for IO performance issues
https://access.redhat.com/site/articles/279063
Consult the manpages of the relevant tools to help interpret the output at the time you see the slowness.
With 1Gbps at 60ms, you might want to tune the initial socket buffer size to help TCP along a bit:
How do I tune RHEL for better TCP performance over a WAN connection?
https://access.redhat.com/site/solutions/168483
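As a sketch of what that tuning might look like (illustrative values, sized against the roughly 7.5 MB bandwidth-delay product of a 1 Gbps / 60 ms path; validate before persisting in /etc/sysctl.conf):

```shell
# Illustrative WAN tuning: keep the existing 16 MB ceilings, but raise the
# default (middle) values so connections start with more window headroom.
sysctl -w net.ipv4.tcp_rmem="4096 4194304 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 4194304 16777216"
# Defaults for sockets that don't set buffer sizes explicitly
sysctl -w net.core.rmem_default=4194304
sysctl -w net.core.wmem_default=4194304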
Keep in mind your transfer's only going to go as fast as the slowest component in the system. If you have 10GbE connected to 10Mbps storage, you can only achieve a streaming data rate of 10Mbps.
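A quick way to sanity-check the receiver's streaming write rate (a sketch; the path and size are placeholders, and the file should exceed RAM for a cache-free number):

```shell
# Streaming write test on the receiving host. conv=fdatasync makes dd wait
# for the data to reach disk before reporting a transfer rate.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 conv=fdatasync
rm -f /tmp/ddtest
```

If the reported rate is well below ~115 MB/s, the storage, not the network, is the ceiling for this transfer.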
If you have a support entitlement, feel free to open a case for us to assist further with investigation.
Hey Bryan,
Are you able to do a packet analysis (Wireshark, ExtraHop, etc.)? net.ipv4.tcp_sack might be a tunable you could look into. Unfortunately, the network card's offload features, or the firewall(s) between the environments, might also be playing a factor.
This appeared to be an interesting write-up specifically on the Cisco FWSM:
https://supportforums.cisco.com/docs/DOC-12668#TCP_Sequence_Number_Randomization_and_SACK
I would be suspicious of SCP here. SSH has a hard-coded, small internal buffer size which can restrict transfers over long, fat pipes, although results vary:
- http://www.spikelab.org/blog/transfer-largedata-scp-tarssh-tarnc-compared.html
- http://crashingdaily.wordpress.com/2007/07/09/scp-is-slow-hey-not-so-fast/
An easy way to confirm would be to test with something that isn't SCP. Can you try FTP or NFS, or even just netcat, between the two systems?
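A minimal netcat check might look like this (a sketch; the host name is a placeholder, and older netcat builds such as RHEL 5's expect `-l -p PORT` rather than `-l PORT`):

```shell
# On the receiver (hypothetical host "rhel-rx"): discard everything arriving
# on TCP port 5001.
nc -l -p 5001 > /dev/null

# On the sender: stream 1 GiB of zeros across the link and time it.
time dd if=/dev/zero bs=1M count=1024 | nc rhel-rx 5001
```

If raw TCP sustains line rate past the point where SCP falls over, the bottleneck is the copy tool, not the network or the TCP stack.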
If you require encryption, FTP has secure options. If you truly require SCP, you could apply the Dynamic Window patch from HPN-SSH. Unfortunately, there's no patch against RHEL 5's old OpenSSH 4.3p2; the earliest version the HPN author maintains a patch for is 4.7p1, and I don't know whether that compiles on RHEL 5. I've only ever used HPN-SSH on RHEL 6, which ships 5.3p1.
Don't forget to try socket buffer tuning too.
