I am having a weird problem. We have one set of routers (Router-A) at location A and another set of routers (Router-B) at a different location B. A server at A is connected to Router-A, and a few clients at B are connected to Router-B. All the clients are in the same VLAN. Router-A and Router-B are connected via Metro Ethernet (provided by a third party). Everything otherwise works fine and also used to work fine before.

Recently we decided to upgrade the clients. During this upgrade, each client downloads an image from the server. For some clients the download completes successfully. For other clients the download fails partway through.

So we captured packets at several locations. We captured packets on the server using tcpdump, and also sniffed the packets on Router-A's interface connected to the server and on its interface connected to the Metro Ethernet device. We also captured packets on the client using tcpdump, and sniffed the packets on Router-B's VLAN interface connected to the clients and on its interface connected to the Metro Ethernet device.

I can see the TCP packet being generated by the server, and the same packet is also seen on Router-A's interface connected to the server and on its interface connected to the Metro Ethernet device. But that particular TCP packet is not received on Router-B's interface connected to the Metro Ethernet device.

One interesting thing is that the same segment gets dropped every time. The client sends a DUP ACK for that segment, the DUP ACK is seen on the server, and the server retransmits the TCP segment. The server retransmits the segment around 7 times. Neither the original packet nor any of the retransmitted packets is seen on Router-B's interface connected to the Metro Ethernet device. I also tried downloading an older version of the image from the client, and that download completes successfully. Every time, the same segment (sequence number) gets dropped.

Note: I have relative sequence numbering turned on in Wireshark so that I know which segment is getting dropped.
asked 09 Mar '11, 01:31 blueguy777
One Answer:
The client will only send a DUP ack if it receives data that it does not expect. I assume the client receives packets after the missing one. A few possible causes come to mind:
Are you able to share the tracefile? answered 09 Mar '11, 01:54 SYN-bit ♦♦
One more item to add to your list:
As a workaround, it could be helpful to reduce the clients' TCP window sizes so that they slow down the server's transmission while updating (protecting the WAN link), and set them back to normal later.
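If the download client on your devices is something you can modify, one way to shrink the advertised window is to set a small receive buffer on the socket before connecting. This is only a sketch in Python: the helper name and the 8192 byte value are made up for illustration, and whether your upgrade client exposes such a setting is an assumption.

```python
import socket


def make_small_window_socket(rcvbuf_bytes: int = 8192) -> socket.socket:
    """Create a TCP socket with a reduced receive buffer (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set the buffer before connect() so it can influence the window
    # (and window scale) advertised during the handshake.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    return sock


sock = make_small_window_socket()
# The OS may round or adjust the requested value, so read it back.
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
sock.close()
```

The operating system is free to adjust the requested size, so it is worth reading the value back and confirming the advertised window in a capture.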
The client is receiving one segment after the dropped one. That's why it sends the DUP ack for the previous lost segment.
I checked the IP and TCP checksum and I don't see any issue with it. And I can't use TAP as we don't have it.
The DF bit is set in the IP header and the MSS value in the SYN is 1460. The data block sent in these TCP segments is 1448 bytes, which gives a 1514 byte frame as captured on the wire. I guess this is okay with Ethernet, as previous segments of the same size are sent and received without any issues.
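For reference, the 1448 and 1514 figures can be reproduced with a little arithmetic. A small Python sketch, assuming plain IPv4 without IP options, no VLAN tag in the capture, and 12 bytes of TCP options for the timestamp (the function names are made up for illustration):

```python
# Header sizes in bytes (IPv4 without options, untagged Ethernet assumed).
ETHERNET_HEADER = 14
IPV4_HEADER = 20
TCP_HEADER = 20
TCP_TIMESTAMP_OPTION = 12  # 10 byte timestamp option plus 2 bytes of padding


def max_payload(mss: int, tcp_options: int = TCP_TIMESTAMP_OPTION) -> int:
    """Largest data block per segment once TCP options are taken out of the MSS."""
    return mss - tcp_options


def frame_length(payload: int, tcp_options: int = TCP_TIMESTAMP_OPTION) -> int:
    """Captured Ethernet frame length (without FCS) for a given TCP payload."""
    return ETHERNET_HEADER + IPV4_HEADER + TCP_HEADER + tcp_options + payload


print(max_payload(1460))   # data bytes per full-size segment
print(frame_length(1448))  # frame size seen in the capture
```

With an MSS of 1460 this gives 1448 data bytes and a 1514 byte frame, which is the normal maximum for untagged Ethernet.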
The weird thing is that every time the same segment, along with its retransmissions, gets dropped. However, later, when the client gives up waiting for these segments and initiates termination of the TCP session with a FIN/ACK, the server responds with an ACK. This ACK reaches the client without any issues.
When I check the pattern of the segment in question (above TCP) with the previous version of the image being downloaded, it is different. This may be because the image versions are different.
But remember, the same version (latest) and the same segment with same pattern passes without any issues to other clients in the same Vlan.
I can't share the capture file.
You mentioned that the TCP segments are 1448 instead of 1460 bytes. The most common cause for this is that the "TCP timestamp option" is used (RFC 1323). Do the clients that are able to complete the download also use the "TCP timestamp option"? (You can see this by checking the TCP options in the packet details pane or, more easily, by checking whether they have a 1448 or 1460 byte payload.) I've seen issues before with devices not properly supporting the "TCP timestamp" option.
Yes, the client and server support the TCP timestamp option. The TCP packets contain 12 bytes of options and have TSval and TSecr.
The server appears to be using a different type of FTP, in which it uses some algorithm and creates TCP packets with segment sizes of 17, 97, 515, 519, and 1448 bytes. I guess this should be fine as long as it is within the MSS value and also within the MTU. The same algorithm is used for the previous version of the image and with other clients as well.
Also I can see the dropped TCP segment having a payload of 1448 bytes.
Can you check a download of a client that does work? Is the "TCP timestamp option" used? Even though the client and the server support it, some intermediate devices might have a problem with it.
I checked with the download of the client that works and I can see the timestamp option set.
And the values from both server and client keep incrementing after every few packets. Since we know what the next sequence number should be, is there any way I can check whether the timestamp values are being incremented properly?
As per RFC 1323, the timestamps should increase monotonically and each tick should be between 1 ms and 1 s (advice).
Can you post the output of the following command (for like 15 packets before and after the first retransmitted packet)?
tshark -nlr <file> -T fields -e frame.number -e frame.time_relative -e tcp.srcport -e tcp.dstport -e tcp.seq -e tcp.ack -e tcp.len -e tcp.options.timestamp.tsval -e tcp.options.timestamp.tsecr -e tcp.options.sack_le -e tcp.options.sack_re
5100 216.580769000 5500 62669 1581394 4554 17
5101 216.581510000 62669 5500 4554 1581411 0
5102 216.597198000 62669 5500 4554 1581411 8
5103 216.597871000 5500 62669 1581411 4562 515
5104 216.598126000 5500 62669 1581926 4562 1448
5105 216.598131000 5500 62669 1583374 4562 97
5106 216.598298000 5500 62669 1583471 4562 1448
5107 216.598302000 5500 62669 1584919 4562 97
5108 216.599523000 62669 5500 4562 1583374 0
5109 216.599680000 62669 5500 4562 1584919 0
5110 216.639115000 62669 5500 4562 1585016 0
5111 216.639244000 5500 62669 1585016 4562 591
5112 216.640287000 62669 5500 4562 1585607 0
5113 217.148803000 62669 5500 4562 1585607 8
5114 217.149457000 5500 62669 1585607 4570 515
5115 217.149665000 5500 62669 1586122 4570 1448
5116 217.149672000 5500 62669 1587570 4570 97
5117 217.149887000 5500 62669 1587667 4570 1448
5118 217.149890000 5500 62669 1589115 4570 97
5119 217.150554000 62669 5500 4570 1586122 0
5120 217.150952000 62669 5500 4570 1587570 0
5121 217.150963000 62669 5500 4570 1587667 0
5122 217.151131000 62669 5500 4570 1587667 0 1589115 1589212
5123 217.569384000 5500 62669 1587667 4570 1448
5124 218.389324000 5500 62669 1587667 4570 1448
5125 220.029156000 5500 62669 1587667 4570 1448
5126 223.298948000 5500 62669 1587667 4570 1448
5127 229.828599000 5500 62669 1587667 4570 1448
5128 242.887717000 5500 62669 1587667 4570 1448
5129 269.015942000 5500 62669 1587667 4570 1448
5130 277.368719000 62669 5500 4570 1587667 0 1589115 1589212
5131 277.368864000 5500 62669 1589212 4571 0
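The retransmissions of sequence number 1587667 in the output above (frames 5123 to 5129, with frame 5117 being the original) follow the usual exponential backoff of the TCP retransmission timer. A quick Python sketch to check this from the relative capture times; the helper names are made up for illustration:

```python
# Relative capture times (seconds) taken from the output above:
# frame 5117 (original segment) followed by frames 5123 to 5129 (retransmissions).
send_times = [
    217.149887, 217.569384, 218.389324, 220.029156,
    223.298948, 229.828599, 242.887717, 269.015942,
]


def intervals(times):
    """Gaps between consecutive transmissions of the same segment."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


def backoff_ratios(gaps):
    """Ratio of each gap to the previous one; about 2 means the timer doubles."""
    return [later / earlier for earlier, later in zip(gaps, gaps[1:])]


gaps = intervals(send_times)
ratios = backoff_ratios(gaps)
print([round(gap, 3) for gap in gaps])
print([round(ratio, 2) for ratio in ratios])
```

Each gap is roughly double the previous one, which is what a sender does when it gets no acknowledgement for the segment at all.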
Hmmm... at which point was this data captured? AFAIK, the "TCP timestamp option" should be in each packet or in none. Could you check all 4 tracefiles to see how TCP timestamps are handled? Also pay extra attention to differences in the TCP options in the SYN and SYN/ACK packets at the four capture locations.
It looks like the ACK containing the TCP timestamp option does not get accepted by the server, and therefore it retransmits the last packet until it times out.
SYNbit, Thanks for your patience. I really appreciate your support. I will revisit the captures once again and get back to you very soon.
Hi SYNbit, I checked the captures once again and paid special attention to the SYN and SYN/ACK packets on the four locations. I feel they are not getting changed in transit.
I guess the server is retransmitting the last packet because it gets a DUP ack from the client and not due to the timestamp options.
The timestamps in all the captures match each other.
I just looked at the data again. There are no timestamps in the output (which can be due to an old tshark version); the two values I took for timestamps are actually the SACK_LE and SACK_RE. They prove that frame 5116 arrived at the client (ACK=1587667) and that frame 5118 arrived at the client (SACK_LE=1589115, SACK_RE=1589212), but that frame 5117 did not. But you already knew that from looking at the other traces.
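To make the ACK/SACK reasoning concrete, here is a minimal Python sketch that derives the missing byte range from a cumulative ACK and its SACK blocks, using the values from frame 5122. The function name is made up for illustration.

```python
def missing_ranges(ack, sack_blocks):
    """Return (start, end) byte ranges the receiver is still missing.

    ack is the cumulative ACK number; sack_blocks is a list of
    (left_edge, right_edge) pairs from the SACK option.
    """
    holes = []
    next_expected = ack
    for left, right in sorted(sack_blocks):
        if left > next_expected:
            # Everything between the last contiguous byte and this block is a hole.
            holes.append((next_expected, left))
        next_expected = max(next_expected, right)
    return holes


# Frame 5122: ACK=1587667, SACK_LE=1589115, SACK_RE=1589212
for start, end in missing_ranges(1587667, [(1589115, 1589212)]):
    print(start, end, end - start)
```

The single hole starts at 1587667 and is 1448 bytes long, which is exactly the sequence number and length of frame 5117.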
I think you should log a case with your metronet provider, as you see packets going into their network that do not come out.
I am using Wireshark version 1.4.4. Is there a different tshark version I should use? Please let me know; I will download it and provide the output.
We have already raised a ticket with our Metronet provider.
I am doing a parallel investigation to understand what would make a device drop just this particular segment.
Output from v1.5:
5101 216.581510000 62887 6600 4554 1581411 0 2934349266 1010959456
5102 216.597198000 62887 6600 4554 1581411 8 2934349282 1010959456
5103 216.597871000 6600 62887 1581411 4562 515 1010959457 2934349282
5104 216.598126000 6600 62887 1581926 4562 1448 1010959457 2934349282
5105 216.598131000 6600 62887 1583374 4562 97 1010959457 2934349282
5106 216.598298000 6600 62887 1583471 4562 1448 1010959457 2934349282
5107 216.598302000 5500 62669 1584919 4562 97 1010959457 2934349282
5108 216.599523000 62669 5500 4562 1583374 0 2934349284 1010959457
5109 216.599680000 62669 5500 4562 1584919 0 2934349285 1010959457
5110 216.639115000 62669 5500 4562 1585016 0 2934349325 1010959457
5111 216.639244000 5500 62669 1585016 4562 591 1010959462 2934349325
5112 216.640287000 62669 5500 4562 1585607 0 2934349326 1010959462
5113 217.148803000 62669 5500 4562 1585607 8 2934349847 1010959462
5114 217.149457000 5500 62669 1585607 4570 515 1010959513 2934349847
5115 217.149665000 5500 62669 1586122 4570 1448 1010959513 2934349847
5116 217.149672000 5500 62669 1587570 4570 97 1010959513 2934349847
5117 217.149887000 5500 62669 1587667 4570 1448 1010959513 2934349847
5118 217.149890000 5500 62669 1589115 4570 97 1010959513 2934349847
5119 217.150554000 62669 5500 4570 1586122 0 2934349849 1010959513
5120 217.150952000 62669 5500 4570 1587570 0 2934349849 1010959513
5121 217.150963000 62669 5500 4570 1587667 0 2934349849 1010959513
5122 217.151131000 62669 5500 4570 1587667 0 2934349849 1010959513 1589115 1589212
5123 217.569384000 5500 62669 1587667 4570 1448 1010959555 2934349849
5124 218.389324000 5500 62669 1587667 4570 1448 1010959637 2934349849
5125 220.029156000 5500 62669 1587667 4570 1448 1010959801 2934349849
5126 223.298948000 5500 62669 1587667 4570 1448 1010960128 2934349849
5127 229.828599000 5500 62669 1587667 4570 1448 1010960781 2934349849
5128 242.887717000 5500 62669 1587667 4570 1448 1010962087 2934349849
5129 269.015942000 5500 62669 1587667 4570 1448 1010964699 2934349849
5130 277.368719000 62669 5500 4570 1587667 0 2934411514 1010959513 1589115 1589212
5131 277.368864000 5500 62669 1589212 4571 0 1010965535 2934411514
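One way to check whether the TSval values advance properly is to compare them against the capture times. A rough Python sketch using the server's retransmissions (frames 5123 to 5129) from the output above; the function name is made up for illustration, and the 32-bit wraparound of TSval is ignored for simplicity:

```python
# (relative capture time in seconds, TSval) for frames 5123 to 5129
samples = [
    (217.569384, 1010959555),
    (218.389324, 1010959637),
    (220.029156, 1010959801),
    (223.298948, 1010960128),
    (229.828599, 1010960781),
    (242.887717, 1010962087),
    (269.015942, 1010964699),
]


def ms_per_tick(points):
    """Milliseconds of capture time per TSval tick between consecutive samples."""
    rates = []
    for (t0, ts0), (t1, ts1) in zip(points, points[1:]):
        ticks = ts1 - ts0
        if ticks <= 0:
            raise ValueError("TSval did not increase between samples")
        rates.append((t1 - t0) * 1000.0 / ticks)
    return rates


for rate in ms_per_tick(samples):
    print(round(rate, 2))
```

For these frames each tick works out to roughly 10 ms, which is inside the 1 ms to 1 s range mentioned earlier, and the values only ever increase.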
The timestamps look OK. One thing you could try is to extract frames 5115 to 5118 from the trace on the server and save them in a small tracefile, then replay them (again from the server) with bittwist. I'm curious whether in that case all packets go through, none go through, or all but 5117 again.
I will check with the server team about running bittwist on the server. I believe they won't do it.
Well, they can be sent from a different system as long as it's in the same vlan (or else you need to change the destination mac-addresses to get the packet accepted by the first hop to the client).
This is just a step in making it easier for the Metronet provider to pinpoint the problem (well, it's also a bit for my curiosity, I must admit), so don't jump through too many hoops to get it done.
Yes, you are right. I would love to use bittwist to generate these packets and check whether they arrive on Router-B. Even I am curious as to what's making this particular packet get dropped.
Anyway, I have raised a ticket with the Metro provider and given them the information that we are not receiving the segment on Router-B. Let's see what their answer will be.
I appreciate your support and efforts towards this issue. I will update this post once I receive an answer from the provider.
Thanks