Slow upload speed from Windows stations to SSL hosts

Question

Hi there,

For the last 3 or 4 months I have been facing a really strange problem in a network of around 100 workstations running mixed OSes (Windows 7, MacOS and Linux). Uploads to some HTTPS websites are very slow from Windows (and only from Windows!) workstations. A good reproducible test is an upload to wetransfer.com, sendspace or similar sites, where I get speeds of around 5KB/s, taking about 6 hours to upload a 60MB file. Doing the same test (at the same time) from a Mac or Linux station, I get the full link speed (which is 50Mb/s) and the upload finishes in less than a minute.

Trying to figure out the issue, I ran Wireshark on the router (a Debian Linux box running iptables), where I noticed lots of TCP retransmissions coming from the Windows host, as you can see in the link below:

CloudShark Link

I know what those retransmissions mean, and that they may be a symptom of some other problem, but I can't understand why this only happens on Windows workstations. This problem is seriously affecting the company's work, since it makes it impossible to send emails with attachments, use TeamViewer for remote access and so on. Last but not least, I tested one of the workstations connected directly to the internet link with a public IP address, and the problem simply went away, so the issue seems to be directly related to my Linux firewall.

Thanks in advance for any help on that.

Best,

Danilo

Answer 1

Hi Mussolini,

You have quite an interesting trace file. I will give my answer first and then describe how I arrived at it.

Summary

It could be that the network card on your Debian box is reassembling several TCP segments (i.e. individual IP packets) into one large segment, and that the box then fails to forward these large packets. Fixing the problem would require a configuration change on your Linux firewall / router.

Detailed Analysis

Based on the display filter provided in the link to CloudShark, I assume that the Windows machine is using the IP address 192.168.8.26. Unfortunately, the trace does not contain a packet that could be used to fingerprint the operating system.

This system is visible with 2 TCP sessions in the trace file:

A session on TCP port 3389 to 192.168.8.1 (probably your gateway / router / firewall)
A session on TCP port 443 to 54.231.9.49

As a bonus - and already identified by kishan pandey as a potential indicator for trouble - we get 4 ICMP messages "fragmentation needed" (type 3, code 4).

As kishan already pointed out, the "fragmentation needed" messages are triggered by abnormally large frames. Apply the filter ip.len > 1500 to see that 192.168.9.120 is also generating these messages.

Background: Packet sizes

Unless jumbo frames are used, the maximum size of an IP packet on Ethernet is 1500 bytes. As 20 bytes are used for the IP header and another 20 bytes for the TCP header, this leaves 1460 bytes for application data. This value is called the "maximum segment size" (MSS).

The packet size might be reduced to accommodate PPPoE, IPsec or other headers. (We ignore TCP and IP options to keep things simple.) Both endpoints of the TCP connection have to know how much data can be stuffed into a packet, so they exchange their MSS during the handshake. You would see the value 1460 within your LAN, or 1452 if the remote site is using PPPoE.
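The arithmetic above can be checked with a short Python sketch. The function name mss_for_mtu and the constants are illustrative choices of mine, and IP and TCP options are ignored, as in the text.

```python
IP_HEADER = 20       # bytes, IPv4 header without options
TCP_HEADER = 20      # bytes, TCP header without options
PPPOE_OVERHEAD = 8   # bytes, PPPoE header (6) plus PPP protocol field (2)


def mss_for_mtu(mtu: int) -> int:
    """Return the largest TCP payload that fits into one IP packet of size mtu."""
    return mtu - IP_HEADER - TCP_HEADER


ethernet_mss = mss_for_mtu(1500)                 # plain Ethernet LAN
pppoe_mss = mss_for_mtu(1500 - PPPOE_OVERHEAD)   # link that carries PPPoE

print("Ethernet MSS:", ethernet_mss)
print("PPPoE MSS:   ", pppoe_mss)
```

Any IP packet carrying more TCP payload than this cannot have left a host with a standard 1500-byte MTU in one piece.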

Jumbo frames in the trace

Unfortunately, the trace does not show the connection start for the two clients. So we don't know the maximum segment size. Still it is a safe bet to assume that the MSS is 1460 or less.

TCP reassembly

It would be great to have a trace file that is recorded not on the router itself but on a separate device (using a SPAN port or a tap). I am pretty sure that such a trace would show that 192.168.8.26 is sending packets with an IP length of 1500 or less. In other words: you do not have jumbo frames in your LAN.

Still, your trace shows jumbo frames. These are most likely generated in your Debian Linux box. This could be the work of TCP Offloading in your network card (in the Windows world it is called "TCP chimney"), or a result of the software used in your Linux box.

Artefacts of the TCP reassembly

The reassembly within your Linux box becomes clear when looking at the IP Identification field. Add the IP ID and the IP length as columns (fields ip.id and ip.len). Normally the IP ID is incremented by one with each packet. Notice that here the IP ID increments in a non-linear way: every time the IP length exceeds 1500 bytes, the IP ID increases by more than 1.
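The same check can be done outside Wireshark on exported ip.id / ip.len pairs. The following Python sketch is only an illustration: the function names and the sample values are made up (they are not taken from the trace), and it assumes the packets come from a single sender in capture order.

```python
from typing import List, Optional, Tuple

MTU = 1500  # assumed LAN MTU, i.e. no jumbo frames

Packet = Tuple[int, int]  # (ip_id, ip_len)


def id_steps(packets: List[Packet]) -> List[Tuple[int, int, Optional[int]]]:
    """For each packet return (ip_id, ip_len, IP ID distance to the next packet)."""
    rows = []
    for i, (ip_id, ip_len) in enumerate(packets):
        if i + 1 < len(packets):
            # The IP ID is a 16-bit counter, so take the difference modulo 65536.
            step = (packets[i + 1][0] - ip_id) % 65536
        else:
            step = None  # no following packet to compare with
        rows.append((ip_id, ip_len, step))
    return rows


def oversized(packets: List[Packet], mtu: int = MTU):
    """Keep only the packets whose IP length exceeds the MTU."""
    return [row for row in id_steps(packets) if row[1] > mtu]


# Hypothetical values for illustration only.
sample = [(100, 1500), (101, 4420), (104, 1500), (105, 2960), (107, 1500)]

for ip_id, ip_len, step in oversized(sample):
    print(f"ip.id={ip_id} ip.len={ip_len} next IP ID is {step} higher")
```

In this made-up sample the 4420-byte packet (three 1460-byte segments plus 40 bytes of headers) is followed by an IP ID jump of 3, which is the pattern you would expect if three original packets were merged into one.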

The jumbo frame now exists within the memory of your firewall / router. When forwarding the packet to the external interface (or maybe, when the NAT process is applied) the IP stack notices that the frame exceeds the maximum packet length for the interface and discards the packet. The source (192.168.8.26) is informed with an "ICMP fragmentation needed".
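The forwarding decision described above can be written down as a toy model in Python. This is a simplified sketch of generic IPv4 forwarding rules, not the actual Linux kernel code; the function name is made up, and it assumes the sender set the "don't fragment" (DF) bit, which is what leads to ICMP type 3, code 4.

```python
def forward_decision(ip_len: int, df_set: bool, egress_mtu: int = 1500) -> str:
    """Toy model of what a router does with a packet of ip_len bytes."""
    if ip_len <= egress_mtu:
        return "forward"
    if df_set:
        # Too big and must not be fragmented: drop and tell the sender.
        return "drop, send ICMP type 3 code 4 (fragmentation needed)"
    return "fragment and forward"


print(forward_decision(1500, df_set=True))
print(forward_decision(4420, df_set=True))
```

From the point of view of 192.168.8.26 the second case is the confusing one: it never sent a 4420-byte packet, yet it is told that its packet was too large.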

You don't see the ICMP packet for every single frame, because the kernel throttles the number of ICMP packets.

Why Windows, and not Linux?

I could imagine that the problem also exists with Linux clients. Windows and Linux will probably react in different ways to the ICMP "fragmentation needed" message. As the sender did nothing wrong, the message is confusing at best. If you show us a trace file with both Windows and Linux boxes transmitting simultaneously, we can see the difference.

How to fix this?

Please check the configuration of your Linux box. At some level (either network card or network stack) multiple incoming TCP segments are combined into the jumbo frame. Either use the offloading mechanism of your external network card to transmit the large packet or disable the segment reassembly for incoming data.
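As a starting point for that check, here is a hedged Python sketch. It assumes that the reassembly is done by the receive offload features that ethtool reports as generic-receive-offload (GRO) and large-receive-offload (LRO); that is my guess and has not been confirmed for this box. The interface name eth0 and the sample output are made up. The script only builds the command and does not run it.

```python
from typing import Dict, List, Tuple

# Long feature names as printed by "ethtool -k", mapped to the short names
# accepted by "ethtool -K". Only the two receive-coalescing features are listed.
SHORT_NAMES = {
    "generic-receive-offload": "gro",
    "large-receive-offload": "lro",
}


def parse_features(text: str) -> Dict[str, Tuple[bool, bool]]:
    """Parse "ethtool -k" output into {feature: (enabled, fixed)}."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        if not sep or not value.strip():
            continue  # skips the "Features for eth0:" header and blank lines
        words = value.split()
        features[name] = (words[0] == "on", "[fixed]" in words)
    return features


def disable_command(dev: str, text: str) -> List[str]:
    """Build an "ethtool -K" command that turns off enabled, changeable features."""
    features = parse_features(text)
    args: List[str] = []
    for long_name, short in SHORT_NAMES.items():
        enabled, fixed = features.get(long_name, (False, False))
        if enabled and not fixed:
            args += [short, "off"]
    return ["ethtool", "-K", dev] + args if args else []


# Hypothetical output of "ethtool -k eth0", shortened for illustration.
SAMPLE = """Features for eth0:
rx-checksumming: on
tcp-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
"""

print(" ".join(disable_command("eth0", SAMPLE)))
```

Run the resulting command as root on the internal interface, repeat the upload test from a Windows station, and check whether the packets with ip.len > 1500 disappear from the trace.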

Good luck and happy hunting

PS: Just to satisfy my curiosity, I would be interested to know

a) what Linux systems look like in the trace
b) what software is involved in your Linux box (Kernel, version, possible firewall / proxy ...)
c) what parameter fixes the behaviour