My game client intermittently loses its connection to my game server (about 0.2% of users per minute). When the connection is dropped, the client usually receives error code 10053 (WSAECONNABORTED). There are about 100k users, so about 200 users are dropped per minute. I need to find the root cause, but so far I can't. For now I have written code to work around the problem: when the client detects the lost connection, it tries to connect again. Even with this workaround, I want to know the root cause. Here is my game client/server environment.
Here is my approach to solving this problem.

First of all, I doubted my in-house code. I reviewed my code very carefully but couldn't find any problems.

Second, I captured packets on my server. Before the connection is lost, my server stops receiving any packets from the client. The server doesn't receive an ACK, so it retransmits again and again. The client probably can't receive any packets from the server either. (I can't reproduce this problem on my own computer, so I can't capture packets on the client side.) The reason I think the client can't receive any packets is that the client is disconnected with error code 10053 before the server's retransmission timeout. The client doesn't call closesocket explicitly.

I wonder why the client and server can't receive any packets. What's wrong? If the cause were congestion, the reconnect attempt should fail too, shouldn't it? But almost every reconnect attempt succeeds! I don't know who drops my packets and why.

I suspected the firewall, so I disabled the firewall on the server (on just one machine), but the problem still occurs.

Finally, I suspect the Linux kernel/TCP stack or the NIC device driver. But the Linux kernel/TCP stack is generally considered very stable, isn't it?

Do you have any idea? I'd very much appreciate any feedback.

** My Wireshark capture is attached.
asked 13 Sep '13, 02:34 plotonix |
2 Answers:
So, the client entered retransmission and finally aborted the connection when retries were exhausted. The new SYN packet went through immediately, so we can exclude a general IP connectivity problem. I'd say the problem is with a security device in the path that is dropping packets of your TCP session for whatever reason. Good luck in finding out more using traces at the endpoints! ;-)
Regards
Matthias
answered 13 Sep '13, 13:29 mrEEde
Well, that's hard to do, as you won't be able to figure that out easily. A drop by a security device is usually 'silent', meaning you don't know where it happens. One option I see is this: on the server (better on the client as well), if you detect multiple retransmissions, DUP ACKs, etc., you could start a new thread and start sending the last packet (the one you don't get an answer for) with increasing TTL. If you're lucky, you will at least be able to nail down the approximate device that could be dropping the packet, which is the one after which you don't get ICMP time exceeded anymore. This will obviously only work if the routers on the way do send the ICMP messages and they are not filtered on the way to your server.
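To make the interpretation of such a TTL probe run concrete, here is a minimal sketch in Python. It does not send any packets. It only assumes you have already collected, for each TTL value starting at 1, the address of the router that answered with ICMP time exceeded, or `None` if nothing came back. The function name `locate_silent_drop` and the hop names are made up for illustration.

```python
from typing import Optional, Sequence, Tuple


def locate_silent_drop(
    replies: Sequence[Optional[str]],
) -> Optional[Tuple[int, str, int]]:
    """Interpret the results of probing with increasing TTL.

    replies[i] is the address of the router that answered the probe sent
    with TTL i + 1 with an ICMP time exceeded message, or None if no ICMP
    message was received for that TTL.

    Returns (last_ttl, last_router, suspect_ttl), where last_router is the
    last hop that still answered and suspect_ttl is the hop right after it,
    or None if no hop answered at all.
    """
    last_ttl = None
    for ttl, router in enumerate(replies, start=1):
        if router is not None:
            # Keep the highest TTL that produced an answer. Routers that
            # never send ICMP leave gaps, so we do not stop at the first None.
            last_ttl = ttl
    if last_ttl is None:
        return None
    return last_ttl, replies[last_ttl - 1], last_ttl + 1


# Hypothetical probe results: two routers answer, then silence.
result = locate_silent_drop(["router1", "router2", None, None])
print(result)
```

With these made-up results the last answering hop is `router2` at TTL 2, so the device at hop 3 is the first candidate to look at. Keep in mind that a router which simply does not send ICMP messages looks the same as a device that drops the packet, so the result is a hint, not proof.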
The firewall drops the TCP packet and thus the last station you get an ICMP message from would be Router2.
Regards
answered 14 Sep '13, 04:49 Kurt Knochner ♦ edited 14 Sep '13, 14:54
@Kurt Knochner Thanks for your comment. I have some questions about your comment.
2. I'm not a fluent English speaker. :( I couldn't understand this sentence: "device that could be dropping the packet, which is the one after which you don't get ICMP time exceeded anymore". If you could explain it in other words, I would appreciate it.
3. Why do you suggest increasing the TTL?
(14 Sep '13, 07:25) plotonix
Well, actually I don't think it will be possible with the standard TCP/IP API calls, so you need to either use libpcap in a kind of monitoring thread of your server and look for retransmissions yourself (rather hard) or use some scripting together with Wireshark/tshark. As soon as you detect a retransmission, you fire up a script and try to implement what I mentioned above. You can use packet injection tools to do that. Although that sounds like a weird hack, it might be your best option to identify the part where everything fails.
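As a rough illustration of the "look for retransmissions yourself" part, here is a minimal Python sketch. It is a strong simplification of what Wireshark's TCP analysis does: it assumes you have already extracted sender, sequence number and payload length for each captured segment of one connection, and it flags a segment as a retransmission when the same sender has already sent the same sequence number with the same payload length. The `Segment` type and the function name are made up for illustration; sequence number wraparound, partial overlaps and keep-alives are ignored.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Segment(NamedTuple):
    src: str     # sender of the segment, e.g. "server" or "client"
    seq: int     # TCP sequence number of the first payload byte
    length: int  # TCP payload length in bytes


def find_retransmissions(segments: Iterable[Segment]) -> List[int]:
    """Return the indexes of segments that repeat earlier payload data."""
    seen: Set[Tuple[str, int, int]] = set()
    hits: List[int] = []
    for index, seg in enumerate(segments):
        if seg.length == 0:
            # Pure ACKs carry no data, so they cannot be data retransmissions.
            continue
        key = (seg.src, seg.seq, seg.length)
        if key in seen:
            hits.append(index)
        else:
            seen.add(key)
    return hits


# Hypothetical capture: the server sends the same 50 bytes three times.
capture = [
    Segment("server", 1, 100),
    Segment("client", 1, 0),
    Segment("server", 101, 50),
    Segment("server", 101, 50),
    Segment("server", 101, 50),
]
print(find_retransmissions(capture))
```

If you go the tshark route instead, the display filter `tcp.analysis.retransmission` marks such packets for you, so the script only has to react to them.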
Did you try to capture the traffic off-box, meaning on a mirror port of the switch? Maybe the problem is caused by the NIC (or the driver) of your server. Maybe you should do that first!
Neither am I ;-)
Look at the picture in my answer. The last device that sends an ICMP response is the last hop before the device that possibly drops the TCP packets.
You will only see ICMP messages if you implement the 'TTL hack' I tried to describe.
(14 Sep '13, 14:54) Kurt Knochner ♦
@Kurt Knochner Thanks for your reply. I've got what you mean. I'll try it and share the result.
(15 Sep '13, 18:21) plotonix
@mrEEde Thanks for your comment.
Of course, this trace is filtered on the client's ip addr and the port.
I totally agree with your answer. It's not a general IP connectivity problem.
When I turned off the firewall on my server, the problem still occurred. So the firewall on my server is not what drops my packets.
I want to know who drops my packets (intermittently) and why. How can I trace this problem and find out who and why?
With your server serving 100,000 users I gather that the clients are spread all over the world. So, finding "THE" device that is dropping those packets by looking at a single connection is close to impossible, as there are probably many devices out there that are causing your 'problem'. Furthermore, those devices are most likely not in your scope of influence, and the owners will not be too keen to spend their time figuring out what is going on unless you have a valid business reason. To sum it up, I think 200 drops out of 100,000 user sessions is not too bad ;-)