Why receiving RST because of many DUP ACKs?

Question

My server serves many clients concurrently (between 300 and 800 at any moment).

I wrote both the server and the client implementation, and clients are getting disconnected for reasons I don't understand. Occasionally even I get disconnected for this unknown reason, and it is ruining the quality of the service.

I logged in to the server with WinSCP and ran the following test:

I started uploading a 200 MB file to the server, and at 20% the server disconnected me. WinSCP asked me to enter my user name and password again. The last moments are visible in the log below:

Cloudshark log

The disconnection happened at time 5:42. The full capture was more than 1300 seconds long, so this version is a trimmed excerpt.

The server runs CentOS 5 (64-bit) with 1 Gbit bandwidth.

I can't even upload a file to server without getting disconnected. What must I do to fix this?

Edit: There was a logical deadlock in my software. The deadlock prevented the application from reading from and writing to its sockets. The network buffers filled up while it was stuck, and in the end Linux killed the socket connections to resolve the situation. That is why other software on the server was affected too.
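
The buffer build-up described in the edit can be reproduced on a loopback connection. The following is only a minimal sketch, not the original software: a receiver that never calls recv() stands in for the deadlocked application, and a non-blocking sender writes until the kernel refuses more data. The function name, chunk size, and safety limit are illustrative assumptions.

```python
import socket


def fill_until_blocked(chunk_size=65536, limit=1 << 30):
    """Send to a peer that never reads.

    Returns (bytes accepted by the kernel, whether the sender was blocked).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.connect(listener.getsockname())
    receiver, _ = listener.accept()  # accepted, but never read from

    sender.setblocking(False)
    payload = b"x" * chunk_size
    total = 0
    blocked = False
    try:
        while total < limit:  # safety cap so the loop always terminates
            try:
                total += sender.send(payload)
            except BlockingIOError:
                # Send and receive buffers are full: nothing more fits.
                blocked = True
                break
    finally:
        sender.close()
        receiver.close()
        listener.close()
    return total, blocked


if __name__ == "__main__":
    total, blocked = fill_until_blocked()
    print(f"kernel accepted {total} bytes; blocked={blocked}")
```

The number of bytes accepted depends on the socket buffer sizes of the system it runs on. Once that point is reached, a sender sees no further progress until the stuck application starts reading again.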

Accepted Answer

I can't even upload a file to server without getting disconnected.

O.K., your problem seems to affect different applications on that server (your own application, WinSCP, etc.), which leads me to conclude that the problem lies in one of the following:

The server itself is somehow overloaded from time to time (CPU load)
The interface of the server is broken (or the driver). Check with netstat -ni and kernel logs (dmesg)
The switch or switch port of the server is broken or overloaded (flooding). Check the switch port statistics and the switch logs.
There is a traffic burst in the local network that overloads local network components. Check with capture files taken at the server.
Any other system in the path to the server (firewall, load balancer, router) is overloaded from time to time. Ask the admins.
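
For the interface check, the error and drop counters that netstat -ni reports can also be read from /proc/net/dev on Linux. The following is a minimal sketch, assuming the usual /proc/net/dev layout of two header lines followed by 16 counters per interface; the function names are illustrative.

```python
def parse_proc_net_dev(text):
    """Return {interface: {"rx_errs", "rx_drop", "tx_errs", "tx_drop"}}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, values = line.split(":", 1)
        fields = values.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        counters[name.strip()] = {
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "tx_errs": int(fields[10]),
            "tx_drop": int(fields[11]),
        }
    return counters


def suspicious_interfaces(counters):
    """Names of interfaces with at least one non-zero error or drop counter."""
    return sorted(name for name, c in counters.items() if any(c.values()))


if __name__ == "__main__":
    with open("/proc/net/dev") as f:
        stats = parse_proc_net_dev(f.read())
    for name in suspicious_interfaces(stats):
        print(name, stats[name])
```

Non-zero counters are not proof of a fault on their own. Counters that keep growing between two runs while the disconnects occur are a better hint.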

What must I do to fix this?

Well, that's a lot of possible problems and you will not be able to identify all of them by looking at network traces (capture files), especially if you capture the traffic at the client (as in your sample on cloudshark).

The best way to eliminate (possibly) faulty components is to run some tests locally (client and server in the same subnet), to see if a file upload (scp) gets interrupted as well.

If YES: take a look at

the switch and/or switch-port of the server
the server interface
the server load
iptables on the server (rate limiting or similar)

If NO: take a look at

other components in the path to the server (firewalls, router, etc.)

To answer your question:

Why receiving RST because of many DUP ACKs?

After many retransmissions, the client closes the connection with a RESET. It (basically) gives up because the server no longer answers.
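
To get a feeling for how long such a sender keeps trying, here is a simplified model of timeout-based retransmission with exponential backoff. It is only a sketch: the initial timeout, the retry count, and the cap are hypothetical parameters, and real TCP stacks differ in all three.

```python
def retransmission_times(initial_rto, max_retries, max_rto=60.0):
    """Model a sender that doubles its retransmission timeout after each retry.

    Returns (times of each retransmission, time at which the sender gives up),
    both in seconds after the original transmission.
    """
    times = []
    now = 0.0
    rto = initial_rto
    for _ in range(max_retries):
        now += rto                   # timer expires, segment is sent again
        times.append(now)
        rto = min(rto * 2, max_rto)  # back off, up to the cap
    give_up = now + rto              # the last retransmission also times out
    return times, give_up


if __name__ == "__main__":
    times, give_up = retransmission_times(initial_rto=1.0, max_retries=5)
    print("retransmissions at", times, "giving up at", give_up)
```

In this model, a sender with a 1 second initial timeout and 5 retries retransmits at 1, 3, 7, 15 and 31 seconds and gives up after 63 seconds.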

Regards
Kurt

Answer 2

The trace file provided was taken at the client side. It shows that one full-size segment, tcp.seq==399361, is never acknowledged by the server, while at the same time we still see packets in the reverse direction. So we can assume that we still have connectivity and that this single packet is causing the problem.
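
The same check can be sketched in code once the sequence and acknowledgment numbers have been exported from a capture. This is a simplified illustration, assuming relative sequence numbers without wraparound; apart from 399361, the numbers in the usage example are made up, and the duplicate-ACK count is cruder than Wireshark's own analysis.

```python
def first_unacked_segment(sent_segments, ack_numbers):
    """Find the earliest sent segment not covered by the highest cumulative ACK.

    sent_segments: list of (seq, length) tuples sent by one side.
    ack_numbers:   list of ACK numbers received from the peer, in capture order.
    Returns a (seq, length) tuple, or None if everything was acknowledged.
    """
    highest_ack = max(ack_numbers, default=0)
    pending = [seg for seg in sent_segments if seg[0] + seg[1] > highest_ack]
    return min(pending, default=None)


def count_duplicate_acks(ack_numbers):
    """Count ACKs that repeat the ACK number seen immediately before."""
    return sum(1 for prev, cur in zip(ack_numbers, ack_numbers[1:]) if cur == prev)


if __name__ == "__main__":
    # Illustrative data: the peer keeps asking for the segment at 399361.
    sent = [(396441, 1460), (397901, 1460), (399361, 1460), (400821, 1460)]
    acks = [397901, 399361, 399361, 399361]
    print(first_unacked_segment(sent, acks), count_duplicate_acks(acks))
```

Running the same functions on the client-side and the server-side trace shows whether both ends agree on which segment is missing.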

Hello, regarding "What must I do to fix this?": as Kurt mentions, you should have a look at the server side and see whether CentOS saw the retransmitted packets and, if so, whether the seq/ack numbers are correct. (Compare the real sequence numbers in both the client and the server trace.) Sometimes the TCP SACK option confuses devices, so it might be worth a try to disable it on the server side and see if it helps.

Add the following line to /etc/sysctl.conf:

 net.ipv4.tcp_sack = 0

Then run "/sbin/sysctl -p /etc/sysctl.conf" to load the settings into the running kernel.
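
To confirm that the running kernel picked up the setting, the value can be read back through /proc/sys. The following is a small sketch; the helper name and its root parameter are illustrative, and it assumes a Linux system with procfs mounted.

```python
from pathlib import Path


def read_sysctl(name, root="/proc/sys"):
    """Read a sysctl such as 'net.ipv4.tcp_sack' from procfs as a string."""
    # Dots in the sysctl name correspond to directory levels under /proc/sys.
    path = Path(root, *name.split("."))
    return path.read_text().strip()


if __name__ == "__main__":
    print("net.ipv4.tcp_sack =", read_sysctl("net.ipv4.tcp_sack"))
```

A result of 0 means SACK is disabled and 1 means it is enabled.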