This is a static archive of our old Q&A Site. Please post any new questions and answers at ask.wireshark.org.

Intermittent SMB File Transfer Failures


Hi,

I'm using packet captures to troubleshoot failed file transfers between two hosts and have a few questions about what I'm seeing. It looks to me like there is a loop somewhere, and I need someone here to weigh in on that theory so I know whether to continue with my current method of troubleshooting or go down a different rabbit hole.

THE SCENARIO:

A Solaris Zone on an LDOM (client) is using SMBv1 to copy log files from several Windows 2008 servers. These W2K8 servers are themselves VMs running on a VMware vSphere 5 host on a Cisco UCS. Most of the time the file transfers work but sometimes they fail. The frequency of the failures changes. Sometimes 100 transfers will work and then the 101st fails. Other times every 2nd transfer fails. The client experiences about a 20 second delay before an error is returned and it closes the connection. From the captures I believe that the client end is good. I am seeing some things I can't explain on the server side.

MY OBSERVATIONS:

1) I can see the 20 second delay on the client side - frame #654 to #655.

2) The server side capture is where I'm seeing things that I can't explain.

3) There are DUP ACKs from the client and retransmissions from the server, neither of which shows up in the client capture. I've captured at various points in the path to rule out a bad capture at the client or a malfunctioning client.

4) The DUP ACKs (frames 642-646, 650, 667, 669, 671, 673, 675, 677, 679) all have an Ethernet padding of ac:1f:80:36:ac:1f. When everything is working properly, normal ACKs have an Ethernet padding of aa:aa:00:00:aa:aa. All other values in the DUP ACKs (e.g. src and dst MAC addresses) seem to be the same as in a normal ACK.

5) The DUP ACKs also have an IP.ID value that doesn't fit into the normal range of ACKs from client to server. In fact, the IP.ID value of each DUP ACK (client to server) is exactly the same as that of the retransmission (server to client) that immediately precedes it.

6) The DUP ACKs appear less than 1ms after each retransmission, which is less than the iRTT.
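
Observations 5 and 6 describe a pattern that could be checked mechanically on an exported packet list. The following is only a minimal Python sketch of that check: the `Packet` record, the function name `find_echoed_ip_ids`, the host labels and all sample values are invented for illustration and are not taken from the captures.

```python
from collections import namedtuple

# Hypothetical, simplified packet record; every value below is invented.
Packet = namedtuple("Packet", "frame time src dst ip_id")

def find_echoed_ip_ids(packets, max_gap=0.001):
    """Return frame numbers of packets whose IP ID equals that of the
    immediately preceding packet travelling in the opposite direction,
    when that packet was seen less than max_gap seconds earlier."""
    suspects = []
    for prev, cur in zip(packets, packets[1:]):
        reverse_direction = cur.src == prev.dst and cur.dst == prev.src
        same_ip_id = cur.ip_id == prev.ip_id
        too_fast = cur.time - prev.time < max_gap
        if reverse_direction and same_ip_id and too_fast:
            suspects.append(cur.frame)
    return suspects

sample = [
    Packet(1, 0.0000, "server", "client", 0x1A2B),  # retransmission
    Packet(2, 0.0004, "client", "server", 0x1A2B),  # ACK reusing the same IP ID
    Packet(3, 0.0500, "client", "server", 0x0042),  # ordinary ACK
]
print(find_echoed_ip_ids(sample))
```

The `max_gap` default of 1 ms mirrors observation 6. In practice it would be set to something below the measured iRTT, since a real reply from the client cannot arrive faster than that.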

THE CAPTURES:

client side

server side

normal padding

abnormal padding

The client and server captures have both been sanitized using TraceWrangler, but unfortunately this changed the Ethernet padding. The normal and abnormal padding captures each contain a few unsanitized packets so you can see what I see.

THE QUESTIONS:

1) Does the value used in Ethernet padding hold any significance or is it randomly generated? It doesn't look random to me.

2) Can I use the Ethernet padding value to figure out where the frames are originating, since I don't see them on the client side?

3) Can I assume a server side problem, since the IP.ID value on the DUP ACKs is the same as for the retransmissions (and is out of range compared to other client to server traffic) and the DUP ACKs happen right after each retransmission?

Thanks,

Bruno

asked 12 Aug '15, 13:02


Bruno Wollmann


One Answer:


My guess is there is an "invisible" device between client and server. Take a look at the TTL values in the server trace coming from 172.31.20.97 - most of the time it's 58 (probably 6 hops away), but when the trouble starts we see a TTL of 254 (1 hop away). It looks like there is something right in front of the server, probably a load balancer or traffic shaper. You should identify that device and find out what it's doing, because it looks like it's the reason for your trouble.

The best way to see this is to select the TTL field in any of the packets and use the popup menu to select "Apply as column". Then filter on "ip.src == 172.31.20.97".
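
The hop counts above rest on the usual assumption that a sender starts with a TTL of 64, 128 or 255 and that each router decrements it by one. A minimal Python sketch of that arithmetic follows. The function name `estimate_hops` and the list of initial TTLs are illustrative assumptions, and a host configured with a different initial TTL would make the estimate wrong.

```python
# Common initial TTL values used by many IP stacks (an assumption;
# a host can be configured to start from a different value).
COMMON_INITIAL_TTLS = (64, 128, 255)

def estimate_hops(observed_ttl):
    """Return (assumed_initial_ttl, hops), taking the smallest common
    initial TTL that is not below the observed value."""
    if not 0 <= observed_ttl <= 255:
        raise ValueError("TTL must be between 0 and 255")
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial, initial - observed_ttl

for ttl in (58, 254):
    initial, hops = estimate_hops(ttl)
    print(f"TTL {ttl}: assumed initial {initial}, about {hops} hop(s) away")
```

With the two values seen in the server trace, this gives 6 hops for TTL 58 (assumed initial 64) and 1 hop for TTL 254 (assumed initial 255).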

answered 12 Aug '15, 14:02


Jasper ♦♦

It was my guess, too.

(12 Aug '15, 14:26) Christian_R

I see it now that you pointed it out. I got so focused on the ip.id and padding I stopped looking for other clues. There is a transparent LB in the same VLAN as the server. I will check this out and let you know what I find.

Thanks

(12 Aug '15, 21:13) Bruno Wollmann

I finally got to the bottom of this.

As pointed out by Jasper, the problem was 1 hop away. In this section of the customer's network there are two transparent devices: one is a firewall and the other is an IPS. The captures proved that the IPS was causing the problem, but it took a few weeks to convince the vendor of this. After a couple more packet captures and many emails back and forth, the vendor finally acknowledged a bug in their code. This was a great learning experience for me, and I was able to get answers to questions #1 and #2 in my original post. I learned that Ethernet padding doesn't really mean anything, as I couldn't use it to pinpoint the source of the trouble packets.

(14 Sep '15, 20:01) Bruno Wollmann