We have a problem with DNS resolution errors in Chrome and "Page Cannot be Displayed" in IE. We have ruled out the internet connection and also the firewall itself: when connected directly there are no problems, and when wired into the firewall on a separate LAN created for that connection, no problems. When connected through the local LAN this happens 1-2 times per hour per user. It happens regardless of the DNS server used: our domain controller's DNS forwarded to our ISP or to Google, or a user pointed directly at Google DNS, same problem. What we see in Wireshark is 15-20 ARP requests for multiple 169.xxx.xxx.xxx addresses, with our domain controller asking for the IPs. It will run perfectly fine, then we see these requests, then a failure; hit refresh and it's fine. We are not sure what is causing it; we went through the network extensively and found no other issues. Any ideas on where to start looking for where these requests are being generated? asked 16 Nov '15, 11:13 ericspt
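For reference, ARP requests like the ones described (they are presumably for APIPA addresses in 169.254.0.0/16; the MAC address below is only a placeholder for the domain controller's) can be isolated with a display filter along these lines:

    arp.opcode == 1 && arp.dst.proto_ipv4 == 169.254.0.0/16 && eth.src == 00:11:22:33:44:55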
2 Answers:
A DNS query is normally not sent out for each access to a given web site, because DNS records have a time validity (TTL) of tens of minutes, if not days. Once a query for a given name has been answered, the answer is cached for the validity period indicated. So "1-2 times per hour per user" says almost nothing without knowledge of the web pages visited and even their contents (as there may be URLs pointing to different sites). So I would first run a capture at an affected workstation. Next, I would use a display filter to look at the DNS and ARP traffic around the failures. Also, would you mind exporting those arp packets into a separate .pcap file, posting them somewhere and placing a link here?

answered 16 Nov '15, 12:31 sindy

I will upload some tomorrow once we grab another set of captures. This happens regardless of website, and one of the sites 43 people use 24/7. The ERP system is cloud based so it gets hit pretty heavily. However, it does happen on any site; it's just that not many people surf the web elsewhere, or are even allowed to. When pinging Google DNS with the timeout set to 800 ms I get about 8 percent packet loss; moving it up to a 1000 ms wait time I get 0. Pinging anything local I have no problems. Connected directly to the connection I have no problems. We did not test the computer on the private network which excludes the local LAN, but we can tomorrow. Thanks (16 Nov '15, 14:26) ericspt

I ran a capture tonight. Lots of traffic since it was RDP, but I went to 10-15 sites just to run through and saw no errors, pinging 8.8.8.8 the entire time. There are some 169 requests. (16 Nov '15, 14:54) ericspt

Just BTW: next time you could try the default capture filter with REMOTEHOST, as you can see here: https://wiki.wireshark.org/CaptureFilters#Default_Capture_Filters This filter tries to filter out your own remote session. (16 Nov '15, 15:19) Christian_R

@ericspt: Your answers have been converted to a comment as that's how this site works. Please read the FAQ for more information. (17 Nov '15, 01:10) Jaap ♦

@ericspt: the fact that no icmp echo request was lost should indicate that nothing is wrong at the L2/L3 level and that the issue is really related to DNS, as Chrome's error message suggests.
Do I read you right that while you were taking this capture, the issue you're after did not occur? If so, this capture cannot say more, and another one is necessary to get further. If you know for which particular URL the issue occurred while you were taking the original capture mentioned in the question, please post that one together with information about the affected URL(s); otherwise please run another capture and ask the people to note down the URLs which needed a retry.

To write something more than just additional questions: in addition to the arp requests asking for the MAC addresses associated with the automatic IP addresses (169.254.0.0/16), the domain controller also sends LLMNR PTR queries (LLMNR is similar enough to DNS that a common Wireshark dissector is used for both), asking for the hostnames associated with those addresses, to the multicast address 224.0.0.252; a display filter matching LLMNR traffic will show them. So it seems there are two machines on the LAN using these automatic IPs which want something from the domain controller, and the domain controller sends the arp and llmnr requests in order to be able to send them its response. And these multicast queries could somehow (due to a bug) affect processing of the DNS responses at your workstations. To confirm or disprove this hypothesis, the above-mentioned capture taken while the issue is really occurring is necessary. (17 Nov '15, 02:01) sindy

Thank you for your comment. We had thought DNS was the original issue, but if we change a PC's DNS to Google the problem still occurs.
The only thing I have not tried is setting the PC's DNS to the ISP's DNS; however, we have redundant ISPs and the problem happens regardless of which ISP we route traffic through, using either that ISP's DNS or Google's DNS. (17 Nov '15, 03:01) ericspt

Following the "related to DNS" path, which is supported by your initial observation that the arp requests occur at the same time as the issue, I was suspecting that those multicast LLMNR queries, sent by other machines and received at the workstation at the same time, may affect processing of regular DNS responses at the workstation due to some bug in the DNS handling code. These multicast LLMNR requests cannot reach the test workstation when it is connected to the firewall using a separate LAN, which would explain why the issue does not occur in that case. But in the meantime I've also realized that as the issue hasn't popped up while you were taking the published capture, we cannot conclude anything even about the L2/L3 level (unless you are 150% sure that the issue was occurring while you tested the ping to 8.8.8.8 without capturing and no ping was lost). So I have to insist that the analysis must be done on a capture which spans an occurrence of the issue, where the particular affected URL, and preferably also the time of the event, is known. Also bear in mind that DNS-like protocols are used for other purposes too (like proxy auto-detection, as can be seen in your capture), and we don't know how narrowly pointed the browsers' error messages are. BTW, as such a request never came from the 10.99.79.200 where you took the capture: do you use an http proxy in the network, and did you perhaps change that setting on the test workstation when connecting it directly to the firewall using the dedicated LAN? (17 Nov '15, 03:42) sindy

We do not use an HTTP proxy. The domain controller is the DNS server and I see nothing at all unusual in the logs. I did get a trace run for most of the day with 3 failures, but the file is too large to upload to Cloudshark, even on a paid account. Is there a way to strip out the good TCP packet info to get the size down? What is odd is that now I am seeing a bunch of duplicate IPs at the top end of the range. They have about 110-120 devices on the network on a given day. With the latest capture I have 4 duplicates at 10.99.79.117-121. I cannot identify these devices. The only DHCP server is the one on the domain controller; there are no others. We did have 2 Cisco 300 series switches, which we factory reset and turned off STP. We also have 2 main HP 1920 switches, which are brand new and also have STP turned off. Neither of them is running DHCP, nor is the firewall. We have gone through the network device by device, and it looks like another walk-through to record the MAC addresses of each device may be needed. (18 Nov '15, 05:27) ericspt
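Two display filters that can help with this kind of hunt (a sketch; in Wireshark releases of this era the DHCP fields live under "bootp", in 3.x and later they were renamed to "dhcp"):

    arp.duplicate-address-detected
    bootp.option.dhcp == 2

The first one flags IP addresses that Wireshark has seen claimed by more than one MAC; the second shows DHCP Offers, so any Offer not coming from the domain controller would point at a rogue DHCP server.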
Yes, there is: you can set a display filter to show only the packets from 30 seconds before to 30 seconds after the event, and then use File -> Export Specified Packets and choose "All packets" and "Displayed". It might be a good idea to do this for all three events. Also, you don't necessarily need to use Cloudshark; you may place the capture files on any cloud service allowing public access to the files. But even in that case, please export only the interesting minutes.
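A sketch of such a time-window filter (the timestamps are purely illustrative; substitute the time of the actual event):

    frame.time >= "Nov 18, 2015 09:14:30" && frame.time <= "Nov 18, 2015 09:15:30"

The same trim can also be done from the command line with editcap, for example editcap -A "2015-11-18 09:14:30" -B "2015-11-18 09:15:30" whole-day.pcapng event1.pcapng, where -A and -B cut by absolute start and stop time.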
Turning STP off may be a good idea in some configurations and a very bad one in others.
It is a typical setup, but I was suspecting the DNS handling at the workstation of getting confused, not the one at the domain controller. But let's come back to that after you publish the capture exports. (18 Nov '15, 05:39) sindy

I had to strip a lot out of the capture: https://www.cloudshark.org/captures/ea47ba8e8bbe Trimmed to 128 bytes per packet, removed all SMB2 and SSL traffic. I do not have timestamps to match; they did the capture but did not record the times. I do see 5 duplicate IPs, but I have scanned the entire log and the only DHCP announcements are from our server. There are no other DHCP servers on the network. I also cannot identify those devices. It makes me wonder if something is changing its MAC address and that is causing the IPs to be handed out twice. There seems to be a fair amount of bad traffic, but honestly I do not know how much 'bad' traffic is normal. (18 Nov '15, 12:16) ericspt
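For that kind of size reduction, two command-line sketches (file names are illustrative; keeping UDP, ARP and ICMP preserves the DNS/LLMNR traffic the analysis needs):

    editcap -s 128 full.pcapng trimmed.pcapng
    tshark -r full.pcapng -Y "udp or arp or icmp" -w dns_and_friends.pcapng

The first truncates every packet to 128 bytes; the second writes out only the packets matching the display filter.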
If they haven't told you the times of the events, have they told you the page names they were visiting? Otherwise we are looking for a needle in a haystack. I'll continue analysing the trace, but some starting point would be really helpful.
If none of the duplicates is 10.99.79.200 it should not be the cause of the issue; if it is, the DNS response could have been sent to the twin, except that it sounds stupid to me that only DNS responses would be hijacked and not the rest of the traffic.
Do you mean "handed out" by DHCP? If there would be two identical MAC addresses in the subnet, the DHCP would give both of them the same IP (unless it looks at other information in the DHCP discover/request, which I doubt), but the switches could go mad due to seeing the same MAC at different ports.
That depends on what you call bad traffic. As many protocols are fault-tolerant, quite a lot of network problems as well as ill application behaviour may remain unnoticed, because the recovery mechanisms repair the communication so it seems OK from the user's perspective. (18 Nov '15, 13:03) sindy

From a quick look into the trace, I would say the most uncommon thing is the termination (RST) of a lot of HTTP and SSL sessions at the end of the trace. The termination is initiated by the system 10.99.79.200. I haven't found any duplicate IP inside the trace. And which packet numbers do you call "bad traffic"? (18 Nov '15, 14:25) Christian_R

@ericspt: I feel quite strange about the capture: its total time span is only about 2'40".
So I've spent quite some time looking for unanswered DNS queries and found none; but if there were just three issues during the whole day, I doubt you'd be lucky enough to have one of them in those 2'40"... So the first question is: what are the time span and size of the original file before you processed it (maybe the capture ended prematurely on disk quota exhaustion)? Next, if the original file is really much bigger, you'll have to take it as it is, set the display filter "dns" and then use "File -> Export Specified Packets" as described earlier. The arp requests to the two automatic IPs are very frequent (often once per second, with some pauses), so they cannot be used as a symptom. The llmnr queries for the same two automatic IPs come more or less together at multiples of 8 seconds (16, 24, 48, 56), so they are also not likely to indicate the time when the issue happened. (18 Nov '15, 14:45) sindy

My mistake, wrong link. (19 Nov '15, 09:59) ericspt

Cloudshark asked me for a login (which it never did before), so I signed in with a Google account and got an error. (19 Nov '15, 10:04) sindy

All set, I had it set to private. (19 Nov '15, 15:39) ericspt

Hm, you've filtered out all UDP, so there is no way to see what was happening with DNS and its friends, as they are UDP. So now please set the display filter on the original capture file to "udp", export the displayed packets and see how large the result is (or use tshark to do the same, whatever suits you better). If I'm able to find some anomalies in the DNS communication, you would then distill time intervals rather than packet types for further processing. BTW, what is your time zone? (20 Nov '15, 00:17) sindy

We are in Eastern. Here is the full UDP-only traffic. (20 Nov '15, 07:34) ericspt

OK, so after all we are getting somewhere. Using a display filter on the DNS traffic you can see, by frame numbers or time, that the anomalies happen only during certain time intervals. Now let's take the last occurrence of this effect and see what preceded it. Set a display filter covering that time span, then take the original .pcap with everything in it, apply the same filter and export the result. The time span given includes both DNS transaction IDs 0xf431 and 0x849a. In your two last files merged together, there is nothing during that time that could explain the behaviour at the protocol level, but maybe your filtering was too aggressive on the udp-less one? After posting the capture, please take another two captures simultaneously at the domain controller and at the workstation. (20 Nov '15, 09:20) sindy

Another possibility is that the controller itself could not get the response from the higher-level DNS server, which the capture from it should answer as well (you would see it sending DNS requests with the same contents but a different transaction ID). (20 Nov '15, 09:24) sindy

Awesome, thank you for not only the help but the detailed explanation. I will get that file uploaded tonight. As for your second comment, this was what I first investigated; I figured it could be an on/off-net DNS issue. They do have 2 ISPs. We route all general traffic over their cable connection, and specific traffic for their ERP system over a secondary fiber connection. However, we have tested with all traffic on one ISP with that ISP's DNS servers as forwarders, and we have done the same test with the other ISP. We have also tried Google's DNS servers on both ISPs, with similar results across the board.
(20 Nov '15, 15:10) ericspt

Basically I am asking you to take a capture at the domain controller in parallel with the workstation, in order to avoid spotting a rare case different from the one you're really after. I do remember you've written that a workstation on a separate LAN has no problems while a workstation on the common LAN does have them even if it reaches out directly to the external DNS, but from the current capture I cannot say more. A capture from a workstation on the common LAN configured to use an external DNS will almost inevitably have to follow. Just make sure to capture all interfaces of the domain controller if it has more than one. And as the firewall device may be guilty as well, the best would be if you could take a capture on it (inside and outside, of course) in parallel with the workstation and domain controller as mentioned above. BTW, I've just spotted that Wireshark 2.0.0's display filters distinguish between dns, llmnr and mdns, so where a display filter of just "dns" used to be enough, you may now need "dns or llmnr or mdns". Disconnecting for today as I live 6 hours ahead of you :-) (20 Nov '15, 15:45) sindy

Sorry, I have been out of town for a week. I did 3 captures: Domain Controller, PC1 and PC2. For PC2 I cannot whittle the file down enough to upload it, but it did show a ton of LLMNR requests, which I have disabled. Here is the capture from the PC https://www.cloudshark.org/captures/9cca014edc09 and from the DC https://www.cloudshark.org/captures/a08679de67a5 Both were started within a few minutes of each other; the PC2 capture was done at the exact same time. For some reason I cannot get it to strip out enough data to shrink the file for upload. Can I strip TCP data or not? (08 Dec '15, 06:16) ericspt

You don't really expect me to be in sync after a three weeks' pause, do you :-) ? After running through the conversation briefly, I think UDP and arp traffic should be enough. And you are not bound to Cloudshark; you can use any other way to post the files (which, however, does not mean you should not get rid of the tcp part before posting). (08 Dec '15, 06:46) sindy

The first published capture from the PC is useless for analysis, as there is no icmp rejecting a DNS answer in it. (08 Dec '15, 07:31) sindy

And in the capture from the domain controller, I've randomly taken one DNS conversation which resulted in an icmp "destination port unreachable" report coming from the initial inquirer (.66) in reaction to the response packet, and I got quite confused by your overall network arrangement. Use a display filter to isolate that conversation,
and add the ethernet source and ethernet destination of the packet as columns to the packet list (right-click each corresponding line in the packet dissection pane and choose "Apply as Column"). And then please explain to me why the packet for 8.8.4.4 is sent to the MAC of the .66, which is the initial inquirer. Otherwise this particular exchange is harmless; the explanation is simple: the DC sent the response at a moment when the inquirer had already scheduled a retransmission of the query, so the inquirer was satisfied by that response and rejected, using icmp, the one triggered by the retransmission. (08 Dec '15, 07:51) sindy

Here is the filtered content https://www.cloudshark.org/captures/c95f60faf850 I'm working on following along, but can you please explain: "And then please explain to me why the packet for 8.8.4.4 is sent to the MAC of the .66 which is the initial inquirer." A bit more background: they had another managed-service software installed on each PC that acted as a web proxy. This was all removed, and I've verified that the Auto Detect Proxy setting is off. I've also verified the DNS is correct on the DNS forwarder. They have multiple ISPs with failover, so we use Google as the DNS for either provider. I have been looking for something else that could act as a proxying device and found nothing. There are only 2 devices on the network we have not touched, which are their DVR and phone system. Both can be unplugged from the network, but I'm trying to get an idea of where you believe the problem may be. Thanks (08 Dec '15, 08:31) ericspt

It seems I've confused you completely, sorry. I was asking you to apply that filter so that you could see the same thing I was looking at, not to filter it out and send it back to me :) What I do want is the tcp-less capture from PC2, however. Now a retry: I was asking you to look at that situation because I was puzzled by the machine .66 acting as a gateway towards 8.8.4.4 for the domain controller and wanted an explanation of that from you; if the .66 is a gateway towards the ISP with 8.8.4.4 behind it, it has little reason to send a DNS query, while if it is a workstation, why on earth should it be a gateway towards the ISP? Separately from that, when I was looking for other occurrences of icmp rejection of a DNS response, I found 1000+ such events in the capture from the DC, so I added a condition that the icmp should come from the .66 and got about 40 such packets. Then I took the first 5 of those. The next step was to filter out all DNS communication regarding the fqdn which the .66's request 0x4a46 asks for, limited to the interval between the very first 0x4a46 request sent by .66 and its icmp rejection of the response which finally came after the .66 gave up waiting. The corresponding filter combines the fqdn with that frame range.
If you apply that filter yourself, you'll see that several machines were querying that same fqdn at the same time, some of them repeatedly, and it seems to me that the DC was simply overloaded handling these requests.
(08 Dec '15, 11:25) sindy

OK, thank you for the explanation. I was wondering why you were asking me to apply the filter :) Here is PC2 https://www.cloudshark.org/captures/f6ff6ff75ccf .66 is a laptop within the company; I have to find the specific device. I'm wondering where the DNS overload would come from... there are not a lot of devices on the network. Could this be a MITM LLMNR problem? We have shut off LLMNR but have to reboot each PC for it to take effect. (08 Dec '15, 11:58) ericspt

Stop, that is completely crazy. There are two TTLs in Google's response: the CNAME has 17xx seconds and the A record only 59. Which makes the DC's clients repeat their requests again in less than a minute. Everything goes fine until frames 131776 and 131777 (relative second 1063), where the DC answers with an A-record TTL of 29 seconds. After that TTL expires, everyone asks again (with retransmissions) between relative seconds 1094 and 1103, with no response from the DC. The DC then provides all the answers starting from second 1103 (including answers to retransmissions), but provides only CNAME records, no A record (which has expired, which is the reason why everybody asked for an update in the same second). The DC itself doesn't ask 8.8.4.4 nor anyone else, however, and keeps answering with only the CNAME, which the workstations don't accept and ask again, finally stuffing the DC's DNS process. The last answer like this (no A, just CNAME) is frame 165568. Then the DC changes its suicidal strategy for a while and starts sending A records but with TTL=0; the first answer like that is frame 165604. This lasts until 165757, and then, out of the blue, without visibly asking anyone, the DC starts returning a different A record (a different IP address) with a 59-second TTL, only for that crazy situation to repeat after another minute. I'm tired of reading that mess; can you investigate what that fqdn is and why it chooses this weird policy of being resolved to a given IP for only one minute and then changing the IP? Is it a malicious network or your customer's main site? (08 Dec '15, 12:04) sindy

That domain is the previous managed-service software. We are checking all of the PCs right now to see if it still exists. It did act as proxying software before, and they always had trouble with it. We removed it from each PC, but checking that one it is apparent some part of it remains. Just so I understand: you are saying something else is responding somewhere in the chain with a DNS resolution other than the DC's local DNS or the Google forwarder, and that is causing these odd TTL lengths? Until today I never realized LLMNR existed, until I saw all the traffic, but I guess you'd never run into it with DNS working properly. What I do not understand is why changing the individual PC to Google's DNS directly did not help the problem, while connecting that PC directly to the firewall, skipping the local LAN, on its own private network with Google acting as the DNS did resolve the problem. (08 Dec '15, 12:13) ericspt
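To spot the TTL behaviour described above without reading every packet, a display filter along these lines can be used (the 30-second threshold is only illustrative):

    dns.flags.response == 1 && dns.resp.type == 1 && dns.resp.ttl < 30

This shows responses containing A records with suspiciously short TTLs; combining it with the fqdn of interest (dns.qry.name) narrows it to the problem domain.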
I am not sure I understand your question; what I am saying is what I described above, and I've got no feedback from you on whether there are more network cards on the DC or not.
So before proceeding, please find out how to get rid of the remainders of that old software generating these DNS requests. Then we may, if still necessary, deal with the remaining oddities (how the DC queries its DNS server normally, and why it occasionally uses the .66 as a gateway and, even more surprisingly, does so successfully). (08 Dec '15, 13:54) sindy

Thank you for the detailed explanation. The DC is a Hyper-V VM; the HV host is 2012 R2 and the DC is the same. There were 2 NICs assigned on independent vSwitches: one for LAN traffic, the second for a private replication network on a completely different subnet, directly connected to a replica server. During the initial testing, the first thing I did was remove that 2nd vSwitch and NIC and push replication over the regular LAN. So right now there is only 1 NIC, and I have verified there is only 1 NIC listed; there is only 1 DNS and DHCP binding, and it is to that NIC. I am starting to piece this together, not the problem, but how this software ended up on that PC. The old domain controller pushed this software out to each PC, then had a script that ran on boot and reinstalled it in case it was removed. We removed the script deployment from that PC, ran an uninstall against each PC, then removed it from the server. In the middle of this, that PC was brand new and deployed the morning this happened, then left the office, so it was never cleaned. It acted in concert with their old internet access appliance, which was replaced by a single firewall that manages internet access via AD instead of a client on each PC. It did act as a proxying server. How it has taken over as an active gateway to push traffic I have no idea. The 2 PCs that were captured have been wiped clean, and we have scoured everything we can to find something odd on the network. The new setup was completely new: new domain, rejoined all the PCs, complete DC setup from scratch, 100 percent identical to all our other setups. New firewall, switches, etc. Of course our first thought was that something was wrong with the firewall or our config, since there are 2 in high availability. After spending 2 weeks determining it was not the firewall or hardware, we started to look elsewhere. I truly appreciate your help and insight. I feel we are pretty technically competent, but the simple fact is that basic flat networks like this pretty much just work, and this is one oddball problem. I had an idea what was going on, some type of DNS-related response issue, but digging through captures and pinpointing it is not something we are as proficient at as we should be; that will change in 2016. (08 Dec '15, 20:35) ericspt

So we removed that software, and the problem continues. I turned off LLMNR, but requests for it continue, and the one that doesn't make sense to me is LLMNR requests from the domain controller even after turning it off and rebooting, and it is looking for a root DNS server. I'm running another capture in the morning. (09 Dec '15, 14:52) ericspt

Please be more specific about which problem continues. The failed DNS requests in general, or everybody on the network asking for that famous FQDN?
I don't like those virtualized setups where machines talking to each other get their time in interleaved slices on the same physical CPU; you can then see effects where the DNS queries come from various machines, which even retransmit them, and then the DC gets its time slice and sends three responses to the same query within ten microseconds, because it received the original request and its two retransmissions in a burst of packets which were waiting in a queue to be processed. Has the DC got enough priority in access to the physical CPU? (09 Dec '15, 15:07) sindy

All I know right now is that they continue to get the "Page Cannot be Displayed" error. Only the DC is asking for the root server. What I do not understand is that we have disabled LLMNR but it continues to show up on the network. Although this is a virtualized environment, we are well over-provisioned on CPU/RAM compared to what is needed. We have 40 setups identical to this one right now, down to the hardware, with larger user bases and no issues. It's an 8-core E5 server with 8 slices assigned to the DC and 16 slices to an RDS server which sees light usage. I usually get around 1-2 percent utilization on the whole server; in this Hyper-V setup there are 64 total slices to be handed out. If I run a long-term capture and then use your ICMP filter from above, changing the address to the computer that is having the issue, I should be able to provide back the information needed, correct? I went and did a quick grab from the DC and filtered out just ICMP, with a lot of Port Unreachables: https://www.cloudshark.org/captures/9b0010eb08ed But they do not seem to have the same issue you reference, the odd TTL times. (09 Dec '15, 15:15) ericspt
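For reference, the kind of "icmp rejection of a DNS response" filter referred to above could look like this (the workstation address is illustrative; substitute the machine you are investigating):

    icmp.type == 3 && icmp.code == 3 && ip.src == 10.99.79.66

Type 3 / code 3 is "destination port unreachable", which is what a host sends back when a late DNS response arrives for a query it has already given up on.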
Partially. Finding all dns.id values which match that filter, i.e. all cases where a DNS response arrived too late to be expected, is just the first step. Then you have to find those for which the icmp is not a consequence of the time slicing complained about above, but a consequence of the first response arriving seconds after the first query; so far I've only managed to do that manually: right-click the line with "query id", choose "Apply as Column", find the "Id" column header in the packet list pane, and click it once to sort the displayed packets by that column. Then manually build a display filter listing those IDs, in the form "dns.id == ... || dns.id == ..." (as many values as your patience allows), apply the filter, and when the re-scan ends, click the leftmost column header of the packet list to sort by frame number again. This will show you which icmp rejections came after second DNS responses and which came after the first ones, and you should also see the increasing time between query retransmissions by the client. Next, choose any of the dns.id values which belong to conversations with an icmp rejection of the first response ever, use the subject of the query as a display filter (right-click the line with the fqdn in the query packet and choose "Prepare as Filter -> Selected"), and manually add "&& frame.number >= n && frame.number <= m"
to the prepared filter, where n and m are the frame numbers of the first query and the icmp rejection, respectively. Then apply the filter. You should see what else was happening (so far regarding DNS) during that time. Normally, you should see an attempt by the DC to query the upstream DNS, which fails for some reason and is only answered a short while before the DC responds to the original query. Or you may see the DC's query being delayed, rather than the upstream DNS's response. Or you may see no upstream query at all, because the DC knew the response but was too busy to send it... and depending on all that, you have to choose the next search strategy. (09 Dec '15, 15:42) sindy
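Putting those steps together, the resulting filters would look roughly like this; the transaction IDs are the ones mentioned earlier in this thread, while the fqdn and the frame numbers are placeholders to be replaced with values from your own capture:

    dns.id == 0x4a46 || dns.id == 0x849a
    dns.qry.name == "old-agent.example.com" && frame.number >= 151000 && frame.number <= 152000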
My question is, though: why does changing the DNS on the individual PC result in the same problem? I set it to Google and skip the DC entirely, and the problem continues. If I take that same PC and put it on its own private network through the same firewall and ISP, no problems. (09 Dec '15, 16:14) ericspt

I went back and went through the ICMP failures; they were all port 137 and ports 67/68, which is NetBIOS and DHCP. Also, the users on the RDS server have the same problem, and they are on the same vSwitch, which means their DNS requests should never traverse the network outside of the server itself. (09 Dec '15, 16:58) ericspt

I have to go back and look at the DC. Directly on the DC I tried NSLOOKUP for an external name (www.yahoo.com) and the lookup timed out. (09 Dec '15, 17:46) ericspt

Then I tried to ping the address and everything started to work again.
(09 Dec ‘15, 17:46) ericspt
That sounds like "the egg and the chicken" issue. You cannot ping a hostname without previously resolving it to an IP address. While "previously" definitely means "before sending the icmp echo request", it may also mean "before typing the ping command on the keyboard". In this particular case it means that either the DC already knew one of www.google.com's IP addresses from before, or the DNS operation became fixed spontaneously between your nslookup of www.yahoo.com and your nslookup of www.google.com. Besides, if you use just nslookup with a name and no explicit server, it asks the DNS server configured on the machine. So, as you can see the DNS timeouts already on the DC, please open two CLI windows, and in just one of them point nslookup at the external DNS server. Now repeat asking in both CLI windows for fake, made-up fqdns. Another point is that Wireshark on Windows cannot capture at the loopback interface (localhost, 127.0.0.1), so you will not see nslookup's queries to the local DNS in the capture. But you should see the local DNS's queries triggered by them. Have a good hunt. (10 Dec '15, 02:25) sindy
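A sketch of what those two windows could look like (the made-up name and the external server address are only examples):

    rem Window 1: resolve via the DNS server configured on the DC
    nslookup doesnotexist-test-1234.example.com

    rem Window 2: resolve the same name directly against an external server
    nslookup doesnotexist-test-1234.example.com 8.8.8.8

Using a name that cannot already be cached anywhere makes both windows trigger fresh queries, so the behaviour of the DC's own resolution path and of the direct external path can be compared at the same moment.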
I am back to the firewall company and Microsoft. Having seen this problem first-hand with NSLOOKUP, I can confirm that if the user opens up a command prompt fast enough they can communicate with the DC fine, but NSLOOKUP will fail for any external host for about 20-30 seconds, then start working again. We put in a rule to allow all DNS traffic and skip everything else in the firewall, and it is only catching the Remote Desktop users' traffic, off a single IP, which happens to be on the same vSwitch. The firewall manages traffic based on AD authentication, so I'm thinking something is not right there, although we have this setup identical everywhere else. My next step is to take away all the rules, open the internet up completely, remove all of the auditing and see if it works. It may be tied to a rule somewhere. All the rest seems to have been a red herring; still good to get it cleaned up, but this looks like a firewall/Windows problem. (10 Dec '15, 16:20) ericspt

I think my knowledge of English is not that bad after all those years, but for some parts of your previous comment I would need a translation (and I don't have the red herring in mind). But what has really attracted my attention is what you wrote about the firewalls, as it did not come to my mind before to put together the bits of information you've mentioned about the two ISPs and the two firewalls in high availability.
Firewalls normally use a very simple philosophy for UDP: after a packet from socket A to socket B is forwarded from the protected side to the public side, UDP packets from socket B to socket A (assumed to be responses) are permitted to pass from the public side to the private side only for a limited period of time, which may be short for the first response and longer for any subsequent one, based on the idea of something like a "UDP session". For this to work, both directions of the UDP conversation have to pass through the same firewall. Could it be that at some point there is a single public address, but the request packets from it to the internet go through one (interface of a) firewall while the packets back to it come through the other one? E.g. due to some dynamic switching between the ISPs when choosing the output route? (11 Dec '15, 06:19) sindy
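As a generic illustration of that "UDP session" idea (this is plain Linux netfilter syntax, not the firewall product used here; the chain and port 53 are just examples), the reply direction is only allowed while the state entry created by the outbound query still exists:

    # outbound DNS query creates (or refreshes) a connection-tracking entry
    iptables -A FORWARD -p udp --dport 53 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
    # replies are accepted only while that entry is still alive; a late reply,
    # or one arriving through a different firewall, is dropped
    iptables -A FORWARD -p udp --sport 53 -m conntrack --ctstate ESTABLISHED -j ACCEPT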
To clarify my statement "if the user opens up a command prompt fast enough": the problem happens within a 30-second window at a time, so if they were not quick enough to run nslookup, we miss the error window. As for the rest of your comments, the first thing we did was completely simplify the network.

We have managed to fix the problem for half of the users. 45 users are on Remote Desktop, and writing a specific rule that forces DNS traffic to bypass all other firewall rules has removed the problem for them. The other users continue to see the problem even though we have tried similar rules. The firewall company has an engineer working on it. It is very strange, because yesterday we finished an identical job: same set of firewalls, same switches, same servers, identical setup, etc.; they even have the same ISPs on the same local subnet, being a quarter mile away. No problems. We start with a base image for our firewalls, so it's not as if we made an error along the way. Our AD setup is all template based, as are GPOs, access rules, etc. Part of why this has been so baffling is that so much of what we do is based on a repeatable base template for nearly everything. Still, I appreciate the insight; it has helped quite a bit, and we'll be spending a lot more time with Wireshark and training this year. (13 Dec '15, 07:23) ericspt
So here was our final resolution to the problem: the firewall was dropping traffic due to unauthenticated users. What was happening is that the firewall would authenticate the user via AD, and that worked. On a fairly random schedule, our managed-service software queries each machine and does so under another account; that account is blocked from all external access. WMI would report the new account name and temporarily associate that user's PC with the service account; the PC would then hit the DNS and the firewall would drop the traffic, because that account is not authorized. Since the schedule was so random, it didn't hit us until we saw that all DNS requests were being processed under a single user account. So when the firewall asked WMI for the user of the PC, 95 percent of the time it reported the actual user, and for short intervals it would report the service user, which caused the failure. After a few seconds WMI would report back that the real user was associated with the computer and all would work again. answered 17 Dec '15, 13:22 ericspt edited 17 Dec '15, 13:23
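For anyone hitting something similar: a quick way to see which account a WMI-based identity lookup would return for a given PC (a generic check, assuming the firewall polls something like Win32_ComputerSystem; the hostname is only an example) is:

    wmic /node:WORKSTATION01 computersystem get username

Running this repeatedly while the monitoring agent's scheduled job fires should show the momentary switch from the interactive user to the service account described above.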
Is it possible to upload a capture showing the problem to Cloudshark?