Hi everyone,

I am a Wireshark n00b and in desperate need of help. We have an SMPP client on some of our servers, and it connects to an SMSC. They chat backwards and forwards using SMPP, which is just a protocol over TCP. Everything was working great until 1 August 2012 (a very strange date for things to suddenly go pear-shaped). One of the SMSCs that we connect to suddenly started exhibiting terrible connection stability. All the other SMSCs that we connect to have been perfect; we can maintain those connections for hours on end. This one SMSC drops the connection every couple of seconds.

We have run many, many traces looking at the SMPP protocol, and I am pretty sure that we are doing nothing wrong according to the SMPP spec (that, and every other SMSC is stable). So I was reading the Wireshark forums and came across the advice to look at connection RSTs. When I run a Wireshark trace against their IP address and open "Expert Infos", the log is full of them. Their IP is mostly the source. Occasionally we are the source, but that is because they have not answered keep-alive packets. I also see a lot of TCP DUP ACKs (which I believe are a good indicator of packet loss). The connection also appears to be very unstable when we put them under load. During low traffic it seems to be fine, except for our receivers, which reset the binds every now and then because the SMSC does not acknowledge SMPP keep-alive packets.

Is there anybody here who can confirm my suspicions, look at this trace, and tell me what they think?

Sorry, just to add: in previous questions that I have read, the contributors always mention that it would be nice to get the developers' input. We are the developers of the SMPP software, so we can answer any questions you have. For example, the cause of the TCP resets from us is that the keep-alive packets are not answered within 30 seconds, so we assume a dead connection and restart.
asked 09 Aug '12, 00:52 uriDium edited 09 Aug '12, 01:11
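The 30-second rule described in the question (no answer to a keep-alive within 30 seconds means the connection is treated as dead) can be illustrated with a minimal sketch. This is not the poster's actual code: the class name `KeepAliveWatchdog`, its methods, and the injectable `clock` parameter are all hypothetical, and the only value taken from the question is the 30-second timeout.

```python
import time


class KeepAliveWatchdog:
    """Hypothetical helper: flags a connection as dead when a keep-alive
    request has been waiting for a response for too long."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self.timeout_s = timeout_s    # 30 s is the value given in the question
        self._clock = clock           # injectable so the logic can be tested
        self._sent_at = None          # time of the oldest unanswered request

    def request_sent(self):
        # Only the oldest outstanding request matters for the timeout.
        if self._sent_at is None:
            self._sent_at = self._clock()

    def response_received(self):
        # Any response clears the outstanding request.
        self._sent_at = None

    def is_dead(self):
        if self._sent_at is None:
            return False
        return self._clock() - self._sent_at >= self.timeout_s


# Demonstration with a fake clock instead of real waiting.
now = [0.0]
watchdog = KeepAliveWatchdog(timeout_s=30.0, clock=lambda: now[0])
watchdog.request_sent()
now[0] = 29.0
alive_at_29 = not watchdog.is_dead()
now[0] = 31.0
dead_at_31 = watchdog.is_dead()
print(alive_at_29, dead_at_31)
```

With logic of this kind, a reset sent by the client is a symptom of the missing response rather than its cause, which matches the explanation given in the question.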
One Answer:
There are a lot of these messages
Take a look at 'tcp.stream == 10' (the first stream with a 3-way handshake). Then take a look at 'tcp.stream == 11'. Apparently the SYN packet did not arrive and had to be retransmitted. I guess you have "some/too much" packet loss somewhere along the way. The TCP resets are just the final act in that play. I suggest capturing in front of both systems and then comparing the capture files to verify that. Below is a "picture" with the suggested capture points (CP).
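A retransmitted SYN like the one in stream 11 can also be found mechanically. The following is a minimal sketch, assuming the packets have already been exported from the capture into simple records; the `Pkt` record, its field names, the function `find_retransmitted_syns`, and the sample data are all invented for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Pkt(NamedTuple):
    """Hypothetical minimal packet record, e.g. built from a CSV export."""
    time: float   # seconds since the start of the capture
    src: str      # "ip:port" of the sender
    dst: str      # "ip:port" of the receiver
    syn: bool     # SYN flag set
    ack: bool     # ACK flag set
    seq: int      # raw TCP sequence number


def find_retransmitted_syns(packets):
    """Return {(src, dst, seq): [timestamps]} for SYNs seen more than once."""
    seen = defaultdict(list)
    for p in packets:
        # A pure SYN (no ACK) opens a connection. The same src/dst/seq
        # showing up again means the first attempt went unanswered.
        if p.syn and not p.ack:
            seen[(p.src, p.dst, p.seq)].append(p.time)
    return {key: times for key, times in seen.items() if len(times) > 1}


# Invented sample: one clean handshake start and one SYN that is resent.
sample = [
    Pkt(0.00, "10.0.0.1:5000", "10.0.0.2:2775", True, False, 100),
    Pkt(0.05, "10.0.0.2:2775", "10.0.0.1:5000", True, True, 900),
    Pkt(1.00, "10.0.0.1:5001", "10.0.0.2:2775", True, False, 200),
    Pkt(4.00, "10.0.0.1:5001", "10.0.0.2:2775", True, False, 200),
]
retransmitted = find_retransmitted_syns(sample)
print(retransmitted)
```

Inside Wireshark itself, a display filter combining `tcp.flags.syn == 1` with `tcp.analysis.retransmission` should surface the same kind of packet without any scripting.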
If the packet loss takes place on your systems (I don't think so, as several systems are affected), you can only debug your implementation. If the packet loss takes place on the transfer network, you need to know all the routers on that path (traceroute) and then work along that chain to find the place where the packets are lost. If the routers are not within your control, you can run a TCP performance test to verify the quality of the connection (xjperf).
BTW: Did you upgrade any router/firewall/VPN firmware on the magic date, 1 August?
Regards
answered 09 Aug '12, 01:36 Kurt Knochner ♦ edited 09 Aug '12, 01:48
Hi Kurt. It is so strange. This pretty much started at exactly 09:00 AM on 1 August. We have called in our hosting infrastructure provider and they deny changing anything. I am inclined to believe them; otherwise we "should" (maybe) see this with all our other SMSCs (most terminating in the same country). We have run a tracert and are trying to get both sides into a conference call.
@Kurt. So we ran similar Wireshark traces to SMSCs in the same country, during the same period and at the same traffic volumes: 0 TCP RST, 0 DUP ACK. So I am pretty sure that it must be something downstream from us. We are just seriously struggling to get the parties on both sides to cooperate. They just keep pointing fingers.
Ah, the good old game :-) Capture at both sides and as soon as you have an idea where the packets get lost, you can start pointing fingers (IF it's not your systems causing the trouble) ;-))
@Kurt thanks for all the help. Just a little more please :) If we get them to do the same capture on their side, what are we looking for? Where do we go from here? If their traces also show a ton of drops and DUP acks does that mean it is something in between them and us? Or if it looks clean then it is something further downstream and in their infrastructure? Like a faulty switch or router?
Where the packet loss takes place.
If they DO see all packets you see, then there is no packet loss on the network. Then the problem is within your application.
If they do NOT see all packets you see, then there is packet loss on the network. In that case you need to work your way along the network path to figure out where the packet loss takes place. That's the most tedious part, especially if the parties don't want to cooperate.
Yes.
Maybe.
I suggest capturing directly in front of the communicating machines and then working your way up, device by device. Hard work, but there is no better way to do it.
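The comparison of two capture points described above comes down to asking which packets were seen at one side but never at the other. A minimal sketch, assuming each capture has been reduced to a list of identifying tuples; the tuple layout `(src, dst, seq, payload_len)`, the function name `lost_in_transit`, and the sample data are hypothetical, and a NAT device or a firewall that rewrites addresses or sequence numbers would need a different key.

```python
from collections import Counter


def lost_in_transit(sent, received):
    """Return packet keys captured near the sender but missing near the receiver.

    Each key is a hypothetical tuple (src, dst, seq, payload_len) that is
    assumed to survive the trip unchanged.
    """
    remaining = Counter(received)
    missing = []
    for key in sent:
        if remaining[key] > 0:
            remaining[key] -= 1     # matched one arrival at the far side
        else:
            missing.append(key)     # left the near side, never arrived
    return missing


# Invented sample: the segment with seq 101 is sent twice but arrives once.
cp1 = [
    ("client", "smsc", 100, 0),
    ("client", "smsc", 101, 16),
    ("client", "smsc", 101, 16),    # retransmission of the same segment
    ("client", "smsc", 117, 16),
]
cp2 = [
    ("client", "smsc", 100, 0),
    ("client", "smsc", 101, 16),
    ("client", "smsc", 117, 16),
]
missing = lost_in_transit(cp1, cp2)
print(missing)
```

An empty result for both directions would point away from the network and towards one of the end systems; a non-empty result says that the loss happens somewhere between the two capture points, and moving one capture point closer narrows it down.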
@Kurt Sorry, I lied. We need a little more help: I recall the SMSC complaining that the Wireshark trace contains a lot of "Header checksum: 0x0000 [incorrect, should be 0x7b67 (may be caused by "IP checksum offload"?)]" warnings. We read, however, that this may be caused by the NIC not having completed the checksum calculation yet at the point of capture. Some other forums had questions relating to packet loss and dropped connections, and the answers suggested disabling checksum offloading on the NICs. This seemed to help some guys. What do you recommend?
That's just a strategy to keep away work and problems ;-)
Well, offloading "could" be causing problems too, however I don't think so in your case, that's why I did not mention it. Anyway, give it a try. Look for the advanced properties of your NIC driver.
However, I suggest capturing first in front of the system and not on the system. This will eliminate the wrong checksums and prove that offloading does not cause the problem. See Capture Setup
Let me clarify that a bit. By "the problem is within your application" I mean that the receiving application (or OS) on 196.11.146.34 (or the NATed device behind that IP) does not handle the packets in a proper way. So, if that is your system, then it is your problem. If that is their system/application, it is their problem.
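To see why a capture taken on the sending host can show a header checksum of 0x0000, it helps to compute the value the NIC would fill in later. The sketch below implements the standard IPv4 header checksum (one's-complement sum of the 16-bit header words, with the checksum field counted as zero). The function name `ipv4_header_checksum` is hypothetical, and the header bytes are a generic textbook example, not a packet from this trace.

```python
import struct


def ipv4_header_checksum(header: bytes) -> int:
    """Compute the IPv4 header checksum, treating bytes 10-11 as zero."""
    if len(header) < 20 or len(header) % 2:
        raise ValueError("expected a complete IPv4 header of even length")
    # Blank out the checksum field so a stored value does not affect the sum.
    data = header[:10] + b"\x00\x00" + header[12:]
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    # Fold the carries back in (one's-complement addition).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Example header as it might look on the sending host with offloading on:
# the checksum field (bytes 10-11) is still 0x0000.
header = bytes.fromhex("45000073000040004011" "0000" "c0a80001c0a800c7")
stored = int.from_bytes(header[10:12], "big")
expected = ipv4_header_checksum(header)
print(f"stored 0x{stored:04x}, should be 0x{expected:04x}")
```

A mismatch of this kind in a capture taken on the sender only shows that the checksum had not been written yet when the packet was copied to the capture; a capture taken in front of the system shows what actually went onto the wire.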
Yip, that IP is theirs. We just got off a 2 hour conference call. We sit pinging that IP with packet loss. They are now calling in their routing and switching team. Thanks for all the help.
good luck.