We have two Windows XP PCs, 10.10.10.1 and 10.10.10.2, connected directly to each other (so not via a router). Both run a user application, and the two applications continuously exchange information via TCP/IP. When the communication between the two PCs is logged with Wireshark, it shows an ARP request being broadcast every 10 minutes to refresh the ARP cache. However, once in a while the ARP exchange seems to disrupt the whole network communication. In Wireshark it is followed by a "TCP previous segment lost" message. Here is an example:
Does anyone have any idea why this is happening? asked 12 Jul '12, 05:18 rv_deventer
2 Answers:
Let's sum it up in an answer. Based on the information gathered while analyzing the problem, I conclude: Windows: ARP Behavior
Look at the end.
Also this:
Conclusion: It looks like Windows dropped one (or more) of your TCP packets, although the information above seems to relate only to UDP. Maybe the information in the first link is wrong, or the behavior is the same for TCP and other protocols. However, the observed behavior does not fully match the explanation above: at least one packet should have been queued and sent after ARP finished. But maybe that is just the packet Wireshark marks with "TCP Previous segment was not captured", and another packet with payload data was dropped. Without insight into the application, we will never know.
Question: Why does it happen only once in a while?
Solution for your problem, in the order I would recommend:
It's apparently the same for other operating systems; however, the "ARP queue length" is different. Linux: You can configure the ARP queue length with the parameter
AIX: Same for AIX with
Regards
answered 27 Jul '12, 02:15 Kurt Knochner ♦
Thanks Kurt for this extensive 'research'. I understand it is some kind of default behaviour. It is still strange that Windows seems to drop all packets except the last one. Anyway, we think we have 'fixed' it as follows: at every start-up, our application uses SendARP etc. to discover the MAC address of the other PC once, and then adds a static entry to the ARP cache. Thanks, Rudy
PS. I tried to award points to you, but for some reason I cannot set the award points slider to a value other than 1. (?) (31 Jul '12, 05:56) rv_deventer
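For readers who want to try the static-entry workaround described above, here is a minimal Python sketch of the "add a static entry" step. It assumes the MAC address has already been discovered (for example via SendARP) and that the Windows `arp -s <ip> <mac>` command is used to add the entry; the helper names and the example MAC address are made up for illustration, and the thread does not say how the original application actually adds the entry.

```python
import subprocess


def format_mac(mac):
    """Format 6 raw bytes in the dash-separated style used by Windows arp.exe."""
    if len(mac) != 6:
        raise ValueError("expected a 6-byte Ethernet address")
    return "-".join(f"{b:02x}" for b in mac)


def static_arp_command(ip, mac):
    """Build the argument list for `arp -s <ip> <mac>` (hypothetical helper)."""
    return ["arp", "-s", ip, format_mac(mac)]


def add_static_entry(ip, mac):
    """Run `arp -s`; this typically needs administrator rights on Windows."""
    subprocess.run(static_arp_command(ip, mac), check=True)
```

Building the command separately from running it makes it easy to log or inspect what would be executed, e.g. `static_arp_command("10.10.10.2", bytes.fromhex("00aa0062c609"))` with a made-up MAC address.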
You're welcome and I learned something too ;-))
That's the way the Windows stack seems to work. It will only buffer one packet during the ARP request. So, if several packets are sent during that time, all of them are dropped except the last one.
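To make the effect concrete, here is a toy Python model of such a hold queue. It is only a sketch of the behaviour described in this thread, not the actual Windows implementation: the class and method names are invented, and the `capacity` argument merely stands in for a configurable queue length such as the one mentioned for Linux and AIX.

```python
class PendingArpQueue:
    """Toy model of a per-destination queue that holds packets during ARP resolution."""

    def __init__(self, capacity=1):
        self.capacity = capacity  # 1 models the single-packet behaviour described above
        self.held = []            # packets waiting for the ARP reply
        self.dropped = []         # packets evicted because the queue was full

    def send_while_resolving(self, packet):
        """Queue a packet while the ARP reply is outstanding; evict the oldest on overflow."""
        self.held.append(packet)
        while len(self.held) > self.capacity:
            self.dropped.append(self.held.pop(0))

    def arp_reply_received(self):
        """Resolution finished: return the packets that would now go on the wire."""
        sent, self.held = self.held, []
        return sent
```

With the default capacity of one, three segments sent back to back during resolution leave only the third one to be transmitted; the receiver then sees a gap in the sequence numbers, which is what Wireshark flags.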
SendArp() is O.K, but the static ARP entry is problematic!! This entry will be there until the machine reboots. Imagine, the NIC of the computer for which you have a static entry needs to be replaced (or the whole machine needs to be replaced). After it boots up, you will still try to contact the old MAC address due to the static ARP cache entry. It will take you (your admins, customers) quite some time to figure out what's going wrong. It's certainly better to use SendArp() throughout your application, triggered by a timer.
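A timer-triggered refresh along those lines could look roughly like the Python sketch below. It calls the real `SendARP` function from `iphlpapi.dll` through `ctypes`, so `send_arp` only works on Windows; the function names, the loop structure and the idea of passing the resolver in as a parameter are assumptions made for illustration. Whether a periodic `SendARP` call really keeps the cache entry from expiring on Windows XP is taken from the discussion above and not verified here.

```python
import socket
import sys
import threading


def ip_to_ipaddr(ip):
    """Convert a dotted-quad IPv4 string to the integer form SendARP expects
    (the four address bytes in network order, as laid out in memory)."""
    return int.from_bytes(socket.inet_aton(ip), sys.byteorder)


def send_arp(ip):
    """Resolve ip with iphlpapi SendARP (Windows only) and return the MAC as text."""
    import ctypes  # windll is only available on Windows
    mac = (ctypes.c_ubyte * 6)()
    mac_len = ctypes.c_ulong(6)
    rc = ctypes.windll.iphlpapi.SendARP(
        ctypes.c_ulong(ip_to_ipaddr(ip)), ctypes.c_ulong(0),
        ctypes.byref(mac), ctypes.byref(mac_len))
    if rc != 0:
        raise OSError(rc, "SendARP failed")
    return "-".join(f"{b:02x}" for b in mac[:mac_len.value])


def keep_arp_fresh(ip, interval_s, stop, resolve=send_arp):
    """Call resolve(ip) every interval_s seconds until the stop event is set."""
    while not stop.is_set():
        try:
            resolve(ip)
        except OSError:
            pass  # peer did not answer this round; try again next round
        stop.wait(interval_s)
```

In an application this might run in a background thread, for example `threading.Thread(target=keep_arp_fresh, args=("10.10.10.2", 300, stop), daemon=True).start()` with `stop = threading.Event()`; the 300-second interval is an arbitrary value chosen to be shorter than the 10-minute renewal seen in the captures.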
That's because you only have 1 karma point yourself. Just accept the answer (check mark) and, if you like, vote it up. (31 Jul '12, 06:14) Kurt Knochner ♦
I agree with your suggestion that calling SendARP throughout the application is the better solution. With our solution, we are aware that we have an issue when the computer for which we have the static entry needs to be replaced. As a first approach we are going to instruct our service department (who do the replacements) to reboot both computers, which will prevent it. We can then still decide in the future to implement the timer-triggered refresh... (31 Jul '12, 06:25) rv_deventer
BTW: Why do you need to do anything at all? TCP recovers from the situation by its retransmission mechanism. (31 Jul '12, 07:03) Kurt Knochner ♦
Not a real idea, just some guessing for now.
Is that "once in a while" on the regular 10-minute schedule (the Windows XP ARP cache renewal time for used entries)?
If NO: You say the computers are connected directly to each other. I assume a CAT5/6 cable. Maybe the RJ45 connector is not fully plugged in (at either end), and small movements (vibrations) make the connector lose contact with some pins, which could cause packet loss. Maybe even the link state is lost. However, the time difference between the last TCP packet and the ARP request is probably too short to re-establish the link. Anyway, can you please check the physical connection (unplug/plug) at both ends?
If YES: I have no idea (yet) what causes the loss of at least one packet (10.10.10.2 -> 10.10.10.1).
Regards
answered 25 Jul '12, 13:40 Kurt Knochner ♦ edited 25 Jul '12, 13:43
Hi Kurt, Thanks for your answer. The ARP broadcast and ARP cache renewal happen every 10 minutes. Most renewals do not disrupt communication; it happens just once in a while. We did replace the UTP cable between the PCs. That did not help. Furthermore, we see the same issue at different customers (we have several systems in the field). We also did the following test: we added a static entry to the ARP cache. That fixed the communication problem, which proves that the ARP renewal is what disrupts the communication. I uploaded 4 dump files, which show the issue:
Regards Rudy (27 Jul '12, 00:10) rv_deventer
Looking at capture Dump6.cap, I can see the following: Wireshark shows "TCP previous segment not captured" in BOTH capture files (frame #2329 in PC_10.10.10.2_Dump6.cap and frame #2049 in PC_10.10.10.1_Dump6.cap). I conclude: the Windows TCP/IP stack must have dropped the packet internally, before it was sent to the network, maybe due to the ARP request. A quick search on Google revealed a similar problem (with UDP). Unfortunately the link to experts-exchange.com is dead. I guess it's either a bug in Windows or "works as designed" ;-))
UPDATE: A rather old link from Microsoft itself, pointing at a similar problem. Maybe it's not just a problem in routing mode and is still not fixed in Windows XP.
Doing a bit more "research" ... There needs to be (or at least should be) a queue in the TCP/IP stack that holds packets until the ARP request has finished. Linux: Take a look at this description of ARP handling in the kernel.
You can configure the ARP queue length with the parameter
AIX: Same for AIX with
Maybe there is also a parameter for Windows, however I was not able to find one (yet). I guess this pretty much explains it:
Cite: ARP queues only one outbound IP datagram for a given destination address while that IP address is being resolved to a MAC address. .... An application can compensate for this by calling the Iphlpapi.dll routine SendArp() to establish an arp cache entry, before sending the stream of packets. Solution for your problem:
(27 Jul '12, 01:25) Kurt Knochner ♦
Can you post a capture file somewhere (perhaps www.cloudshark.org) that shows the problem? It’s hard to tell much from a text printout with only a single ARP request/response pair and only 11 packets total. It’s likely that the ARP and the lost segment are unrelated and it’s just coincidence that the lost segment sometimes comes right after an ARP.