This is a static archive of our old Q&A Site. Please post any new questions and answers at ask.wireshark.org.

Losing connection with weird behavior.


My game client intermittently loses its connection to my game server (about 0.2% of connections per minute). When the connection is dropped, the client usually receives error code 10053 (WSAECONNABORTED).

There are about 100k users, so about 200 users are dropped per minute.

I need to find the root cause, but so far I can't.

For now I have written code to work around this problem. When my client detects that the connection is lost, it tries to connect again.
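A minimal sketch of this kind of reconnect workaround, written in Python for illustration. The real client uses an in-house Windows library, so every name here (`reconnect_with_backoff`, `connect`) is hypothetical, and the exponential backoff between attempts is an assumption that is not described above:

```python
import time


def reconnect_with_backoff(connect, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    connect is a hypothetical callable that returns a connection object
    or raises OSError (for example ConnectionAbortedError for 10053).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except OSError as exc:
            last_error = exc
            # Wait longer after each failure, but not after the last one.
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

With roughly 200 drops per minute, spacing out the retries is one way to keep reconnects from arriving at the server in bursts; whether that matters here is a guess.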

Despite this workaround, I want to know the root cause of the problem.

Here is my game C/S's environment.

  • Client

    • works on Windows (XP, Vista, 7)
    • uses in-house network library (it's very simple)
  • Server

    • works on Linux (Ubuntu 12.04.1 - kernel 3.2.0-29 / Ubuntu 12.04.2 - kernel 3.5.0-23)
    • using iptables or ufw

Here is my approach to solve this problem.

First of all, I suspected my in-house code. I reviewed it very carefully but couldn't find any problems.

Second, I tried to capture packets on my server. Before the connection is lost, the server stops receiving any packets from the client. It never receives an ACK, so it retransmits again and again.

The client probably can't receive any packets from the server either. (I can't reproduce this problem on my own computer, so I can't capture packets on the client side.)

The reason I think the client can't receive any packets is that it is disconnected with error code 10053 before the server's retransmission timeout expires. The client doesn't call closesocket explicitly.

Why can neither the client nor the server receive any packets? What's wrong? If the cause were congestion, the reconnect attempts should fail too, shouldn't they? But almost all reconnect attempts succeed!

I don't know who is dropping my packets, or why.

I suspected the firewall, so I disabled it on the server (on just one machine). But the problem still occurs.

Finally, I suspect the Linux kernel/TCP stack or the NIC device driver. But the Linux kernel/TCP stack is generally considered very stable, isn't it?

Do you have any ideas? I would really appreciate any feedback.

** Attached is my Wireshark capture:

  • port 4134 is the client
  • port 7781 is the server
  • The client should send a ping message (20 bytes) to the server every 3 seconds.
  • But there are no client ping messages from 768 to 786.
  • The server sends a ping message (30 bytes) every 20 seconds. At 786 the server tries to send one.
  • But the server never receives an ACK from the client.
  • At 815, the client tries to reconnect to the server!! (This is my workaround.) But the server never received a FIN or RST.
  • Maybe the client sent a RST to the server, but the server never received it.
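The missing pings in a trace like this can also be found mechanically. Below is a rough Python sketch that scans the arrival times of client pings and reports intervals much longer than the expected 3 seconds. The function name, the tolerance value, and the sample timestamps are all made up for illustration; they are not taken from the capture.

```python
def find_ping_gaps(arrival_times, expected_interval=3.0, tolerance=1.0):
    """Return (previous_time, next_time, approx_missed) for each silent period.

    arrival_times: sorted capture timestamps (seconds) of the client's pings.
    approx_missed is only an estimate of how many pings never arrived.
    """
    gaps = []
    for prev, cur in zip(arrival_times, arrival_times[1:]):
        delta = cur - prev
        if delta > expected_interval + tolerance:
            approx_missed = round(delta / expected_interval) - 1
            gaps.append((prev, cur, approx_missed))
    return gaps


# Hypothetical arrival times: regular pings, then a long silence.
print(find_ping_gaps([0.0, 3.0, 6.0, 30.0]))
```

The timestamps could come from a tshark field export of the capture; the exact filter depends on how the ping message is identified, so that part is left out here.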


asked 13 Sep '13, 02:34


plotonix


2 Answers:

  • Assuming this trace is filtered on the remote client's IP address, the client port number 'wraps' from 4134 to 1737. That is very unlikely for a Windows client, which uses incremental ephemeral ports.
  • The SYN packet arrives with a reduced MSS of 1414.
  • The client's 'ping' packet was due at 761.64x but was never seen at the server.
  • The server's ping packets don't go through either.
  • At 815.708 into the trace, 54 seconds after the first missing packet, the client reconnects, immediately after the abort, which is how you say your client is coded.

So, the client entered retransmission and finally aborted the connection when its retries were exhausted. The new SYN packet went through immediately, so we can exclude a general IP connectivity problem. I'd say the problem is with a security device in the path that is dropping packets of your TCP session for whatever reason.

Good luck in finding out more using traces at the endpoints ! ;-)

Regards Matthias

answered 13 Sep '13, 13:29


mrEEde

@mrEEde Thanks for your comment.

Of course, this trace is filtered on the client's IP address and port.

I totally agree with your answer. It's not a general IP connectivity problem.

When I turned off the firewall on my server, the problem still occurred. So the firewall on my server is not what is dropping my packets.

I want to know who drops my packets (intermittently) and why. How can I trace this problem and find out 'who and why'?

(13 Sep '13, 18:12) plotonix

With your server serving 100,000 users, I gather that the clients are spread all over the world. So finding "THE" device that is dropping those packets by looking at a single connection is close to impossible, as there are probably many devices out there that are causing your 'problem'. Furthermore, those devices are most likely not within your sphere of influence, and their owners will not be too keen to spend time figuring out what is going on unless you have a valid business reason. To sum it up, I think 200 drops out of 100,000 user sessions is not too bad ;-)

(15 Sep '13, 07:40) mrEEde


I want to know who drops my packets (intermittently) and why. How can I trace this problem and find out 'who and why'?

Well, that's hard to do, as you won't be able to figure that out easily. A drop by a security device is usually 'silent', meaning you don't know where it happens.

One option I see is this:

On the server (better on the client as well): If you detect multiple retransmissions, DUP ACKs, etc., you could start a new thread and start sending the last packet (the one you don't get an answer for) with increasing TTL. If you're lucky, you will at least be able to nail down the approx. device that could be dropping the packet, which is the one after which you don't get ICMP time exceeded anymore. This will obviously only work if the routers on the way do send the ICMP messages and they are not filtered on the way to your server.

Server -> Router1 --> Router2 --> Firewall/IDS/Whatever --> Router3 --> Router4
TTL:1     # <- ICMP
TTL:2                # <- ICMP
TTL:3                             :: drop
TTL:4                             :: drop
TTL:5                             :: drop

The firewall drops the TCP packet and thus the last station you get an ICMP message from would be Router2.
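To make the reading of the diagram concrete, here is a small Python sketch that only interprets the outcome of such a probe run; it does not send any packets. It assumes the results have already been collected into a mapping from TTL to the address that returned ICMP time exceeded (or `None` when nothing came back). The function name and the addresses are hypothetical.

```python
def last_responding_hop(probe_results):
    """Return (ttl, address) of the highest TTL that still drew an ICMP reply.

    probe_results maps each probed TTL to the address of the router that
    answered with ICMP time exceeded, or to None if there was no answer.
    Returns (None, None) if no hop answered at all.
    """
    last_ttl, last_addr = None, None
    for ttl in sorted(probe_results):
        addr = probe_results[ttl]
        if addr is not None:
            last_ttl, last_addr = ttl, addr
    return last_ttl, last_addr


# Mirrors the diagram: Router1 and Router2 answer, everything beyond is silent.
results = {1: "192.0.2.1", 2: "192.0.2.2", 3: None, 4: None, 5: None}
print(last_responding_hop(results))
```

The suspect device is then somewhere after the returned hop. Routers that forward traffic but never send ICMP would show up as `None` too, so a single silent hop in the middle of otherwise answering hops is not conclusive.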

Regards
Kurt

answered 14 Sep '13, 04:49


Kurt Knochner ♦

edited 14 Sep '13, 14:54

@Kurt Knochner

Thanks for your comment. I have some questions about your comment.

  1. How does the server application know about a TCP retransmission? There is no callback or notification. As far as I know, retransmission is handled entirely by TCP, and the server application can't know when it happens. So I can't start a new thread and so on at that moment. How could I do it?

  2. I'm not a fluent English speaker. :( I couldn't understand your sentence. "device that could be dropping the packet, which is the one after which you don't get ICMP time exceeded anymore" If you could explain this sentence in other words, I would appreciate it.

  3. Why do you suggest increasing the TTL?

  4. As you said, if the routers on the way do send the ICMP messages to my server, I will know who drops my packets! I already tried to capture ICMP messages on my server, but there is nothing special. I'll try it again.
(14 Sep '13, 07:25) plotonix

How does the server application know about a TCP retransmission?

Well, actually I don't think it will be possible with the standard TCP/IP API calls, so you need to either use libpcap in a kind of monitoring thread of your server and look for retransmissions yourself (rather hard), or use some scripting together with Wireshark/tshark. As soon as you detect a retransmission, you fire up a script and try to implement what I mentioned above. You can use packet injection tools to do that. Although that sounds like a weird hack, it might be your best option to identify the part where everything fails.
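As a rough illustration of the "look for retransmissions yourself" part, the Python sketch below flags data segments whose bytes were all sent before. It assumes the segments of one direction of one TCP connection have already been parsed into `(timestamp, seq, payload_len)` tuples, for example from a tshark export of `tcp.seq` and `tcp.len`. The function name and sample values are hypothetical, and sequence number wraparound is ignored.

```python
def find_retransmissions(segments):
    """Return the segments that carry only previously sent bytes.

    segments: iterable of (timestamp, seq, payload_len) tuples in capture
    order, all from the same sender on the same connection.
    """
    highest_end = None  # one past the highest sequence number seen so far
    retransmitted = []
    for timestamp, seq, payload_len in segments:
        if payload_len == 0:
            continue  # pure ACKs carry no data to retransmit
        end = seq + payload_len
        if highest_end is not None and end <= highest_end:
            retransmitted.append((timestamp, seq, payload_len))
        else:
            highest_end = end
    return retransmitted


# Hypothetical server-side view: one 30-byte ping sent, then resent three times.
sample = [(786.0, 1000, 30), (786.3, 1000, 30), (786.9, 1000, 30), (788.1, 1000, 30)]
print(len(find_retransmissions(sample)))
```

Wireshark and tshark already mark such packets themselves (display filter `tcp.analysis.retransmission`), so in a scripted setup it is probably simpler to trigger on that than to reimplement the detection.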

Second, I tried to capture packets on my server.

Did you try to capture the traffic off-box, meaning on a mirror port of the switch? Maybe the problem is caused by the NIC (or the driver) of your server. Maybe you should do that first!

I'm not a fluent English speaker.

Neither am I ;-)

I couldn't understand your sentence.

Look at the picture in my answer. The last device that sends an ICMP response is the last hop before the device that possibly drops the TCP packets.

I already tried to capture ICMP messages on my server, but there is nothing special.

You will only see ICMP messages, if you implement the 'TTL hack' I tried to describe.

(14 Sep '13, 14:54) Kurt Knochner ♦

@Kurt Knochner

Thanks for your reply. I've got what you mean. I'll try it and share the result.

(15 Sep '13, 18:21) plotonix