We have a problem with DNS resolution errors in Chrome and "Page Cannot be Displayed" in IE. We have ruled out the internet connection and also the firewall itself: when connected directly there are no problems, and when wired into the firewall on a separate LAN created for that connection, no problems. When connected through the local LAN this happens 1-2 times per hour per user. It happens regardless of the DNS server used: our domain controller's DNS forwarded to our ISP or to Google, or a user pointed directly at Google DNS, same problem. What we see in Wireshark is 15-20 ARP requests for multiple 169.xxx.xxx.xxx addresses, with our domain controller asking for the IPs. It will run perfectly fine, then we see these requests, then a failure; hit refresh and it's fine. We are not sure what is causing it; we went through the network extensively and found no other issues. Any ideas on where to start looking for where these requests are being generated? asked 16 Nov '15, 11:13 ericspt
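For reference, ARP requests like the ones described (they are presumably for APIPA addresses in 169.254.0.0/16; the MAC address below is only a placeholder for the domain controller's) can be isolated with a display filter along these lines:

    arp.opcode == 1 && arp.dst.proto_ipv4 == 169.254.0.0/16 && eth.src == 00:11:22:33:44:55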
2 Answers:
A DNS query is normally not sent out for each access to a given web site, because DNS records have a time validity (TTL) of tens of minutes, if not days. Once a query for a given name has been answered, the answer is cached for the validity period indicated. So "1-2 times per hour per user" says almost nothing without knowledge of the web pages visited and even their contents (as there may be URLs pointing to different sites). So I would first run a capture at an affected workstation. Next, I would use a display filter to look at the DNS and ARP traffic around the failures. Also, would you mind exporting those arp packets into a separate .pcap file, posting them somewhere and placing a link here?

answered 16 Nov '15, 12:31 sindy

I will upload some tomorrow once we grab another set of captures. This happens regardless of website, and one of the sites 43 people use 24/7. The ERP system is cloud based so it gets hit pretty heavily. However, it does happen on any site; it's just that not many people surf the web elsewhere, or are even allowed to. When pinging Google DNS with the timeout set to 800 ms I get about 8 percent packet loss; moving it up to a 1000 ms wait time I get 0. Pinging anything local I have no problems. Connected directly to the connection I have no problems. We did not test the computer on the private network which excludes the local LAN, but we can tomorrow. Thanks (16 Nov '15, 14:26) ericspt

I ran a capture tonight. Lots of traffic since it was RDP, but I went to 10-15 sites just to run through and saw no errors, pinging 8.8.8.8 the entire time. There are some 169 requests. (16 Nov '15, 14:54) ericspt

Just BTW: next time you could try the default capture filter with REMOTEHOST, as you can see here: https://wiki.wireshark.org/CaptureFilters#Default_Capture_Filters This filter tries to filter out your own remote session. (16 Nov '15, 15:19) Christian_R

@ericspt: Your answers have been converted to a comment as that's how this site works. Please read the FAQ for more information. (17 Nov '15, 01:10) Jaap ♦

@ericspt: the fact that no icmp echo request was lost should indicate that nothing is wrong at the L2/L3 level and that the issue is really related to DNS, as Chrome's error message suggests.
Do I read you right that while you were taking this capture, the issue you're after did not occur? If so, this capture cannot say more, and another one is necessary to get further. If you know for which particular URL the issue occurred while you were taking the original capture mentioned in the question, please post that one together with information about the affected URL(s); otherwise please run another capture and ask the people to note down the URLs which needed a retry.

To write something more than just additional questions: in addition to the arp requests asking for the MAC addresses associated with the automatic IP addresses (169.254.0.0/16), the domain controller also sends LLMNR PTR queries (LLMNR is similar enough to DNS that a common Wireshark dissector is used for both), asking for the hostnames associated with those addresses, to the multicast address 224.0.0.252; a display filter matching LLMNR traffic will show them. So it seems there are two machines on the LAN using these automatic IPs which want something from the domain controller, and the domain controller sends the arp and llmnr requests in order to be able to send them its response. And these multicast queries could somehow (due to a bug) affect processing of the DNS responses at your workstations. To confirm or disprove this hypothesis, the above-mentioned capture taken while the issue is really occurring is necessary. (17 Nov '15, 02:01) sindy

Thank you for your comment. We had thought DNS was the original issue, but if we change a PC's DNS to Google the problem still occurs.
The only thing I have not tried is setting the PC's DNS to the ISP's DNS; however, we have redundant ISPs and the problem happens regardless of which ISP we route traffic through, using either that ISP's DNS or Google's DNS. (17 Nov '15, 03:01) ericspt

Following the "related to DNS" path, which is supported by your initial observation that the arp requests occur at the same time as the issue, I was suspecting that those multicast LLMNR queries, sent by other machines and received at the workstation at the same time, may affect processing of regular DNS responses at the workstation due to some bug in the DNS handling code. These multicast LLMNR requests cannot reach the test workstation when it is connected to the firewall using a separate LAN, which would explain why the issue does not occur in that case. But in the meantime I've also realized that as the issue hasn't popped up while you were taking the published capture, we cannot conclude anything even about the L2/L3 level (unless you are 150% sure that the issue was occurring while you tested the ping to 8.8.8.8 without capturing and no ping was lost). So I have to insist that the analysis must be done on a capture which spans an occurrence of the issue, where the particular affected URL, and preferably also the time of the event, is known. Also bear in mind that DNS-like protocols are used for other purposes too (like proxy auto-detection, as can be seen in your capture), and we don't know how narrowly pointed the browsers' error messages are. BTW, as such a request never came from the 10.99.79.200 where you took the capture: do you use an http proxy in the network, and did you perhaps change that setting on the test workstation when connecting it directly to the firewall using the dedicated LAN? (17 Nov '15, 03:42) sindy

We do not use an HTTP proxy. The domain controller is the DNS server and I see nothing at all unusual in the logs. I did get a trace run for most of the day with 3 failures, but the file is too large to upload to Cloudshark, even on a paid account. Is there a way to strip out the good TCP packet info to get the size down? What is odd is that now I am seeing a bunch of duplicate IPs at the top end of the range. They have about 110-120 devices on the network on a given day. With the latest capture I have 4 duplicates at 10.99.79.117-121. I cannot identify these devices. The only DHCP server is the one on the domain controller; there are no others. We did have 2 Cisco 300 series switches, which we factory reset and turned off STP. We also have 2 main HP 1920 switches, which are brand new and also have STP turned off. Neither of them is running DHCP, nor is the firewall. We have gone through the network device by device, and it looks like another walk-through to record the MAC addresses of each device may be needed. (18 Nov '15, 05:27) ericspt
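Two display filters that can help with this kind of hunt (a sketch; in Wireshark releases of this era the DHCP fields live under "bootp", in 3.x and later they were renamed to "dhcp"):

    arp.duplicate-address-detected
    bootp.option.dhcp == 2

The first one flags IP addresses that Wireshark has seen claimed by more than one MAC; the second shows DHCP Offers, so any Offer not coming from the domain controller would point at a rogue DHCP server.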
Yes, there is: you can set a display filter to show only the packets from 30 seconds before to 30 seconds after the event, and then use File -> Export Specified Packets and choose "All packets" and "Displayed". It might be a good idea to do this for all three events. Also, you don't necessarily need to use Cloudshark; you may place the capture files on any cloud service allowing public access to the files. But even in that case, please export only the interesting minutes.
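A sketch of such a time-window filter (the timestamps are purely illustrative; substitute the time of the actual event):

    frame.time >= "Nov 18, 2015 09:14:30" && frame.time <= "Nov 18, 2015 09:15:30"

The same trim can also be done from the command line with editcap, for example editcap -A "2015-11-18 09:14:30" -B "2015-11-18 09:15:30" whole-day.pcapng event1.pcapng, where -A and -B cut by absolute start and stop time.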
Turning STP off may be a good idea in some configurations and a very bad one in others.
It is a typical setup, but I was suspecting the DNS handling at the workstation of getting confused, not the one at the domain controller. But let's come back to that after you publish the capture exports. (18 Nov '15, 05:39) sindy

I had to strip a lot out of the capture: https://www.cloudshark.org/captures/ea47ba8e8bbe Trimmed to 128 bytes per packet, removed all SMB2 and SSL traffic. I do not have timestamps to match; they did the capture but did not record the times. I do see 5 duplicate IPs, but I have scanned the entire log and the only DHCP announcements are from our server. There are no other DHCP servers on the network. I also cannot identify those devices. It makes me wonder if something is changing its MAC address and that is causing the IPs to be handed out twice. There seems to be a fair amount of bad traffic, but honestly I do not know how much 'bad' traffic is normal. (18 Nov '15, 12:16) ericspt
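For that kind of size reduction, two command-line sketches (file names are illustrative; keeping UDP, ARP and ICMP preserves the DNS/LLMNR traffic the analysis needs):

    editcap -s 128 full.pcapng trimmed.pcapng
    tshark -r full.pcapng -Y "udp or arp or icmp" -w dns_and_friends.pcapng

The first truncates every packet to 128 bytes; the second writes out only the packets matching the display filter.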
If they haven't told you the times of the events, have they told you the page names they were visiting? Otherwise we are looking for a needle in a haystack. I'll continue analysing the trace, but some starting point would be really helpful.
If none of the duplicates is 10.99.79.200 it should not be the cause of the issue; if it is, the DNS response could have been sent to the twin, except that it sounds stupid to me that only DNS responses would be hijacked and not the rest of the traffic.
Do you mean "handed out" by DHCP? If there would be two identical MAC addresses in the subnet, the DHCP would give both of them the same IP (unless it looks at other information in the DHCP discover/request, which I doubt), but the switches could go mad due to seeing the same MAC at different ports.
That depends on what you call bad traffic. As many protocols are fault-tolerant, quite a lot of network problems as well as ill application behaviour may remain unnoticed, because the recovery mechanisms repair the communication so it seems OK from the user's perspective. (18 Nov '15, 13:03) sindy

From a quick look into the trace, I would say the most uncommon thing is the termination (RST) of a lot of HTTP and SSL sessions at the end of the trace. The termination is initiated by the system 10.99.79.200. I haven't found any duplicate IP inside the trace. And which packet numbers do you call "bad traffic"? (18 Nov '15, 14:25) Christian_R

@ericspt: I feel quite strange about the capture: its total time span is only about 2'40".
So I've spent quite some time looking for unanswered DNS queries and found none; but if there were just three issues during the whole day, I doubt you'd be lucky enough to have one of them in those 2'40"... So the first question is: what are the time span and size of the original file before you processed it (maybe the capture ended prematurely on disk quota exhaustion)? Next, if the original file is really much bigger, you'll have to take it as it is, set the display filter "dns" and then use "File -> Export Specified Packets" as described earlier. The arp requests to the two automatic IPs are very frequent (often once per second, with some pauses), so they cannot be used as a symptom. The llmnr queries for the same two automatic IPs come more or less together at multiples of 8 seconds (16, 24, 48, 56), so they are also not likely to indicate the time when the issue happened. (18 Nov '15, 14:45) sindy

My mistake, wrong link. (19 Nov '15, 09:59) ericspt

Cloudshark asked me for a login (which it never did before), so I signed in with a Google account and got an error. (19 Nov '15, 10:04) sindy

All set, I had it set to private. (19 Nov '15, 15:39) ericspt

Hm, you've filtered out all UDP, so there is no way to see what was happening with DNS and its friends, as they are UDP. So now please set the display filter on the original capture file to "udp", export the displayed packets and see how large the result is (or use tshark to do the same, whatever suits you better). If I'm able to find some anomalies in the DNS communication, you would then distill time intervals rather than packet types for further processing. BTW, what is your time zone? (20 Nov '15, 00:17) sindy

We are in Eastern. Here is the full UDP-only traffic. (20 Nov '15, 07:34) ericspt

OK, so after all we are getting somewhere. Using a display filter on the DNS traffic you can see, by frame numbers or time, that the anomalies happen only during certain time intervals. Now let's take the last occurrence of this effect and see what preceded it. Set a display filter covering that time span, then take the original .pcap with everything in it, apply the same filter and export the result. The time span given includes both DNS transaction IDs 0xf431 and 0x849a. In your two last files merged together, there is nothing during that time that could explain the behaviour at the protocol level, but maybe your filtering was too aggressive on the udp-less one? After posting the capture, please take another two captures simultaneously at the domain controller and at the workstation. (20 Nov '15, 09:20) sindy

Another possibility is that the controller itself could not get the response from the higher-level DNS server, which the capture from it should answer as well (you would see it sending DNS requests with the same contents but a different transaction ID). (20 Nov '15, 09:24) sindy

Awesome, thank you for not only the help but the detailed explanation. I will get that file uploaded tonight. As for your second comment, this was what I first investigated; I figured it could be an on/off-net DNS issue. They do have 2 ISPs. We route all general traffic over their cable connection, and specific traffic for their ERP system over a secondary fiber connection. However, we have tested with all traffic on one ISP with that ISP's DNS servers as forwarders, and we have done the same test with the other ISP. We have also tried Google's DNS servers on both ISPs, with similar results across the board.
(20 Nov '15, 15:10) ericspt

Basically I am asking you to take a capture at the domain controller in parallel with the workstation, in order to avoid spotting a rare case different from the one you're really after. I do remember you've written that a workstation on a separate LAN has no problems while a workstation on the common LAN does have them even if it reaches out directly to the external DNS, but from the current capture I cannot say more. A capture from a workstation on the common LAN configured to use an external DNS will almost inevitably have to follow. Just make sure to capture all interfaces of the domain controller if it has more than one. And as the firewall device may be guilty as well, the best would be if you could take a capture on it (inside and outside, of course) in parallel with the workstation and domain controller as mentioned above. BTW, I've just spotted that Wireshark 2.0.0's display filters distinguish between dns, llmnr and mdns, so where a display filter of just "dns" used to be enough, you may now need "dns or llmnr or mdns". Disconnecting for today as I live 6 hours ahead of you :-) (20 Nov '15, 15:45) sindy

Sorry, I have been out of town for a week. I did 3 captures: Domain Controller, PC1 and PC2. For PC2 I cannot whittle the file down enough to upload it, but it did show a ton of LLMNR requests, which I have disabled. Here is the capture from the PC https://www.cloudshark.org/captures/9cca014edc09 and from the DC https://www.cloudshark.org/captures/a08679de67a5 Both were started within a few minutes of each other; the PC2 capture was done at the exact same time. For some reason I cannot get it to strip out enough data to shrink the file for upload. Can I strip TCP data or not? (08 Dec '15, 06:16) ericspt

You don't really expect me to be in sync after a three weeks' pause, do you :-) ? After running through the conversation briefly, I think UDP and arp traffic should be enough. And you are not bound to Cloudshark; you can use any other way to post the files (which, however, does not mean you should not get rid of the tcp part before posting). (08 Dec '15, 06:46) sindy

The first published capture from the PC is useless for analysis, as there is no icmp rejecting a DNS answer in it. (08 Dec '15, 07:31) sindy

And in the capture from the domain controller, I've randomly taken one DNS conversation which resulted in an icmp "destination port unreachable" report coming from the initial inquirer (.66) in reaction to the response packet, and I got quite confused by your overall network arrangement. Use a display filter to isolate that conversation,
and add the ethernet source and ethernet destination of the packet as columns to the packet list (right-click each corresponding line in the packet dissection pane and choose "Apply as Column"). And then please explain to me why the packet for 8.8.4.4 is sent to the MAC of the .66, which is the initial inquirer. Otherwise this particular exchange is harmless; the explanation is simple: the DC sent the response at a moment when the inquirer had already scheduled a retransmission of the query, so the inquirer was satisfied by that response and rejected, using icmp, the one triggered by the retransmission. (08 Dec '15, 07:51) sindy

Here is the filtered content https://www.cloudshark.org/captures/c95f60faf850 I'm working on following along, but can you please explain: "And then please explain to me why the packet for 8.8.4.4 is sent to the MAC of the .66 which is the initial inquirer." A bit more background: they had another managed-service software installed on each PC that acted as a web proxy. This was all removed, and I've verified that the Auto Detect Proxy setting is off. I've also verified the DNS is correct on the DNS forwarder. They have multiple ISPs with failover, so we use Google as the DNS for either provider. I have been looking for something else that could act as a proxying device and found nothing. There are only 2 devices on the network we have not touched, which are their DVR and phone system. Both can be unplugged from the network, but I'm trying to get an idea of where you believe the problem may be. Thanks (08 Dec '15, 08:31) ericspt

It seems I've confused you completely, sorry. I was asking you to apply that filter so that you could see the same thing I was looking at, not to filter it out and send it back to me :) What I do want is the tcp-less capture from PC2, however. Now a retry: I was asking you to look at that situation because I was puzzled by the machine .66 acting as a gateway towards 8.8.4.4 for the domain controller and wanted an explanation of that from you; if the .66 is a gateway towards the ISP with 8.8.4.4 behind it, it has little reason to send a DNS query, while if it is a workstation, why on earth should it be a gateway towards the ISP? Separately from that, when I was looking for other occurrences of icmp rejection of a DNS response, I found 1000+ such events in the capture from the DC, so I added a condition that the icmp should come from the .66 and got about 40 such packets. Then I took the first 5 of those. The next step was to filter out all DNS communication regarding the fqdn which the .66's request 0x4a46 asks for, limited to the interval between the very first 0x4a46 request sent by .66 and its icmp rejection of the response which finally came after the .66 gave up waiting. The corresponding filter combines the fqdn with that frame range.
If you apply that filter yourself, you'll see that several machines were querying that same fqdn at the same time, some of them repeatedly, and it seems to me that the DC was simply overloaded handling these requests.
(08 Dec '15, 11:25) sindy

OK, thank you for the explanation. I was wondering why you were asking me to apply the filter :) Here is PC2 https://www.cloudshark.org/captures/f6ff6ff75ccf .66 is a laptop within the company; I have to find the specific device. I'm wondering where the DNS overload would come from... there are not a lot of devices on the network. Could this be a MITM LLMNR problem? We have shut off LLMNR but have to reboot each PC for it to take effect. (08 Dec '15, 11:58) ericspt

Stop, that is completely crazy. There are two TTLs in Google's response: the CNAME has 17xx seconds and the A record only 59. Which makes the DC's clients repeat their requests again in less than a minute. Everything goes fine until frames 131776 and 131777 (relative second 1063), where the DC answers with an A-record TTL of 29 seconds. After that TTL expires, everyone asks again (with retransmissions) between relative seconds 1094 and 1103, with no response from the DC. The DC then provides all the answers starting from second 1103 (including answers to retransmissions), but provides only CNAME records, no A record (which has expired, which is the reason why everybody asked for an update in the same second). The DC itself doesn't ask 8.8.4.4 nor anyone else, however, and keeps answering with only the CNAME, which the workstations don't accept and ask again, finally stuffing the DC's DNS process. The last answer like this (no A, just CNAME) is frame 165568. Then the DC changes its suicidal strategy for a while and starts sending A records but with TTL=0; the first answer like that is frame 165604. This lasts until 165757, and then, out of the blue, without visibly asking anyone, the DC starts returning a different A record (a different IP address) with a 59-second TTL, only for that crazy situation to repeat after another minute. I'm tired of reading that mess; can you investigate what that fqdn is and why it chooses this weird policy of being resolved to a given IP for only one minute and then changing the IP? Is it a malicious network or your customer's main site? (08 Dec '15, 12:04) sindy

That domain is the previous managed-service software. We are checking all of the PCs right now to see if it still exists. It did act as proxying software before, and they always had trouble with it. We removed it from each PC, but checking that one it is apparent some part of it remains. Just so I understand: you are saying something else is responding somewhere in the chain with a DNS resolution other than the DC's local DNS or the Google forwarder, and that is causing these odd TTL lengths? Until today I never realized LLMNR existed, until I saw all the traffic, but I guess you'd never run into it with DNS working properly. What I do not understand is why changing the individual PC to Google's DNS directly did not help the problem, while connecting that PC directly to the firewall, skipping the local LAN, on its own private network with Google acting as the DNS did resolve the problem. (08 Dec '15, 12:13) ericspt
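To spot the TTL behaviour described above without reading every packet, a display filter along these lines can be used (the 30-second threshold is only illustrative):

    dns.flags.response == 1 && dns.resp.type == 1 && dns.resp.ttl < 30

This shows responses containing A records with suspiciously short TTLs; combining it with the fqdn of interest (dns.qry.name) narrows it to the problem domain.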
I am not sure I understand your question; what I am saying is what I described above, and I've got no feedback from you on whether there are more network cards on the DC or not.
So before proceeding, please find out how to get rid of the remainders of that old software generating these DNS requests. Then we may, if still necessary, deal with the remaining oddities (how the DC queries its DNS server normally, and why it occasionally uses the .66 as a gateway and, even more surprisingly, does so successfully). (08 Dec '15, 13:54) sindy

Thank you for the detailed explanation. The DC is a Hyper-V VM; the HV host is 2012 R2 and the DC is the same. There were 2 NICs assigned on independent vSwitches: one for LAN traffic, the second for a private replication network on a completely different subnet, directly connected to a replica server. During the initial testing, the first thing I did was remove that 2nd vSwitch and NIC and push replication over the regular LAN. So right now there is only 1 NIC, and I have verified there is only 1 NIC listed; there is only 1 DNS and DHCP binding, and it is to that NIC. I am starting to piece this together, not the problem, but how this software ended up on that PC. The old domain controller pushed this software out to each PC, then had a script that ran on boot and reinstalled it in case it was removed. We removed the script deployment from that PC, ran an uninstall against each PC, then removed it from the server. In the middle of this, that PC was brand new and deployed the morning this happened, then left the office, so it was never cleaned. It acted in concert with their old internet access appliance, which was replaced by a single firewall that manages internet access via AD instead of a client on each PC. It did act as a proxying server. How it has taken over as an active gateway to push traffic I have no idea. The 2 PCs that were captured have been wiped clean, and we have scoured everything we can to find something odd on the network. The new setup was completely new: new domain, rejoined all the PCs, complete DC setup from scratch, 100 percent identical to all our other setups. New firewall, switches, etc. Of course our first thought was that something was wrong with the firewall or our config, since there are 2 in high availability. After spending 2 weeks determining it was not the firewall or hardware, we started to look elsewhere. I truly appreciate your help and insight. I feel we are pretty technically competent, but the simple fact is that basic flat networks like this pretty much just work, and this is one oddball problem. I had an idea what was going on, some type of DNS-related response issue, but digging through captures and pinpointing it is not something we are as proficient at as we should be; that will change in 2016. (08 Dec '15, 20:35) ericspt

So we removed that software, and the problem continues. I turned off LLMNR, but requests for it continue, and the one that doesn't make sense to me is LLMNR requests from the domain controller even after turning it off and rebooting, and it is looking for a root DNS server. I'm running another capture in the morning. (09 Dec '15, 14:52) ericspt

Please be more specific about which problem continues. The failed DNS requests in general, or everybody on the network asking for that famous FQDN?
I don't like those virtualized setups where machines talking to each other get their time in interleaved slices on the same physical CPU; you can then see effects where the DNS queries come from various machines, which even retransmit them, and then the DC gets its time slice and sends three responses to the same query within ten microseconds, because it received the original request and its two retransmissions in a burst of packets which were waiting in a queue to be processed. Has the DC got enough priority in access to the physical CPU? (09 Dec '15, 15:07) sindy

All I know right now is that they continue to get the "Page Cannot be Displayed" error. Only the DC is asking for the root server. What I do not understand is that we have disabled LLMNR but it continues to show up on the network. Although this is a virtualized environment, we are well over-provisioned on CPU/RAM compared to what is needed. We have 40 setups identical to this one right now, down to the hardware, with larger user bases and no issues. It's an 8-core E5 server with 8 slices assigned to the DC and 16 slices to an RDS server which sees light usage. I usually get around 1-2 percent utilization on the whole server; in this Hyper-V setup there are 64 total slices to be handed out. If I run a long-term capture and then use your ICMP filter from above, changing the address to the computer that is having the issue, I should be able to provide back the information needed, correct? I went and did a quick grab from the DC and filtered out just ICMP, with a lot of Port Unreachables: https://www.cloudshark.org/captures/9b0010eb08ed But they do not seem to have the same issue you reference, the odd TTL times. (09 Dec '15, 15:15) ericspt
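For reference, the kind of "icmp rejection of a DNS response" filter referred to above could look like this (the workstation address is illustrative; substitute the machine you are investigating):

    icmp.type == 3 && icmp.code == 3 && ip.src == 10.99.79.66

Type 3 / code 3 is "destination port unreachable", which is what a host sends back when a late DNS response arrives for a query it has already given up on.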
Partially. Finding all dns.id values which match that filter, i.e. all cases where a DNS response arrived too late to be expected, is just the first step. Then you have to find those for which the icmp is not a consequence of the time slicing complained about above, but a consequence of the first response arriving seconds after the first query; so far I've only managed to do that manually: right-click the line with "query id", choose "Apply as Column", find the "Id" column header in the packet list pane, and click it once to sort the displayed packets by that column. Then manually build a display filter listing those IDs, in the form "dns.id == ... || dns.id == ..." (as many values as your patience allows), apply the filter, and when the re-scan ends, click the leftmost column header of the packet list to sort by frame number again. This will show you which icmp rejections came after second DNS responses and which came after the first ones, and you should also see the increasing time between query retransmissions by the client. Next, choose any of the dns.id values which belong to conversations with an icmp rejection of the first response ever, use the subject of the query as a display filter (right-click the line with the fqdn in the query packet and choose "Prepare as Filter -> Selected"), and manually add "&& frame.number >= n && frame.number <= m"
to the prepared filter, where n and m are the frame numbers of the first query and the icmp rejection, respectively. Then apply the filter. You should see what else was happening (so far regarding DNS) during that time. Normally, you should see an attempt by the DC to query the upstream DNS, which fails for some reason and is only answered a short while before the DC responds to the original query. Or you may see the DC's query being delayed, rather than the upstream DNS's response. Or you may see no upstream query at all, because the DC knew the response but was too busy to send it... and depending on all that, you have to choose the next search strategy. (09 Dec '15, 15:42) sindy
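Putting those steps together, the resulting filters would look roughly like this; the transaction IDs are the ones mentioned earlier in this thread, while the fqdn and the frame numbers are placeholders to be replaced with values from your own capture:

    dns.id == 0x4a46 || dns.id == 0x849a
    dns.qry.name == "old-agent.example.com" && frame.number >= 151000 && frame.number <= 152000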
My question is, though: why does changing the DNS on the individual PC result in the same problem? I set it to Google and skip the DC entirely, and the problem continues. If I take that same PC and put it on its own private network through the same firewall and ISP, no problems. (09 Dec '15, 16:14) ericspt

I went back and went through the ICMP failures; they were all port 137 and ports 67/68, which is NetBIOS and DHCP. Also, the users on the RDS server have the same problem, and they are on the same vSwitch, which means their DNS requests should never traverse the network outside of the server itself. (09 Dec '15, 16:58) ericspt

I have to go back and look at the DC. Directly on the DC I tried NSLOOKUP for an external name (www.yahoo.com) and the lookup timed out. (09 Dec '15, 17:46) ericspt

Then I tried to ping the address and everything started to work again.
(09 Dec ‘15, 17:46) ericspt
That sounds like "the egg and the chicken" issue. You cannot ping a hostname without previously resolving it to an IP address. While "previously" definitely means "before sending the icmp echo request", it may also mean "before typing the ping command on the keyboard". In this particular case it means that either the DC already knew one of www.google.com's IP addresses from before, or the DNS operation became fixed spontaneously between your nslookup of www.yahoo.com and your nslookup of www.google.com. Besides, if you use just nslookup with a name and no explicit server, it asks the DNS server configured on the machine. So, as you can see the DNS timeouts already on the DC, please open two CLI windows, and in just one of them point nslookup at the external DNS server. Now repeat asking in both CLI windows for fake, made-up fqdns. Another point is that Wireshark on Windows cannot capture at the loopback interface (localhost, 127.0.0.1), so you will not see nslookup's queries to the local DNS in the capture. But you should see the local DNS's queries triggered by them. Have a good hunt. (10 Dec '15, 02:25) sindy
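A sketch of what those two windows could look like (the made-up name and the external server address are only examples):

    rem Window 1: resolve via the DNS server configured on the DC
    nslookup doesnotexist-test-1234.example.com

    rem Window 2: resolve the same name directly against an external server
    nslookup doesnotexist-test-1234.example.com 8.8.8.8

Using a name that cannot already be cached anywhere makes both windows trigger fresh queries, so the behaviour of the DC's own resolution path and of the direct external path can be compared at the same moment.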
I am back to the firewall company and Microsoft. Having seen this problem first-hand with NSLOOKUP, I can confirm that if the user opens up a command prompt fast enough they can communicate with the DC fine, but NSLOOKUP will fail for any external host for about 20-30 seconds, then start working again. We put in a rule to allow all DNS traffic and skip everything else in the firewall, and it is only catching the Remote Desktop users' traffic, off a single IP, which happens to be on the same vSwitch. The firewall manages traffic based on AD authentication, so I'm thinking something is not right there, although we have this setup identical everywhere else. My next step is to take away all the rules, open the internet up completely, remove all of the auditing and see if it works. It may be tied to a rule somewhere. All the rest seems to have been a red herring; still good to get it cleaned up, but this looks like a firewall/Windows problem. (10 Dec '15, 16:20) ericspt

I think my knowledge of English is not that bad after all those years, but for some parts of your previous comment I would need a translation (and I don't have the red herring in mind). But what has really attracted my attention is what you wrote about the firewalls, as it did not come to my mind before to put together the bits of information you've mentioned about the two ISPs and the two firewalls in high availability.
Firewalls normally use a very simple philosophy for UDP: after a packet from socket A to socket B is forwarded from the protected side to the public side, UDP packets from socket B to socket A (assumed to be responses) are permitted to pass from the public side to the private side only for a limited period of time, which may be short for the first response and longer for any subsequent one, based on the idea of something like a "UDP session". For this to work, both directions of the UDP conversation have to pass through the same firewall. Could it be that at some point there is a single public address, but the request packets from it to the internet go through one (interface of a) firewall while the packets back to it come through the other one? E.g. due to some dynamic switching between the ISPs when choosing the output route? (11 Dec '15, 06:19) sindy
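As a generic illustration of that "UDP session" idea (this is plain Linux netfilter syntax, not the firewall product used here; the chain and port 53 are just examples), the reply direction is only allowed while the state entry created by the outbound query still exists:

    # outbound DNS query creates (or refreshes) a connection-tracking entry
    iptables -A FORWARD -p udp --dport 53 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
    # replies are accepted only while that entry is still alive; a late reply,
    # or one arriving through a different firewall, is dropped
    iptables -A FORWARD -p udp --sport 53 -m conntrack --ctstate ESTABLISHED -j ACCEPT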
To clarify my statement "if the user opens up a command prompt fast enough": the problem happens within a 30-second window at a time, so if they were not quick enough to run nslookup, we miss the error window. As for the rest of your comments, the first thing we did was completely simplify the network.

We have managed to fix the problem for half of the users. 45 users are on Remote Desktop, and writing a specific rule that forces DNS traffic to bypass all other firewall rules has removed the problem for them. The other users continue to see the problem even though we have tried similar rules. The firewall company has an engineer working on it. It is very strange, because yesterday we finished an identical job: same set of firewalls, same switches, same servers, identical setup, etc.; they even have the same ISPs on the same local subnet, being a quarter mile away. No problems. We start with a base image for our firewalls, so it's not as if we made an error along the way. Our AD setup is all template based, as are GPOs, access rules, etc. Part of why this has been so baffling is that so much of what we do is based on a repeatable base template for nearly everything. Still, I appreciate the insight; it has helped quite a bit, and we'll be spending a lot more time with Wireshark and training this year. (13 Dec '15, 07:23) ericspt
So here was our final resolution to the problem: the firewall was dropping traffic due to unauthenticated users. What was happening is that the firewall would authenticate the user via AD, and that worked. On a fairly random schedule, our managed-service software queries each machine and does so under another account; that account is blocked from all external access. WMI would report the new account name and temporarily associate that user's PC with the service account; the PC would then hit the DNS and the firewall would drop the traffic, because that account is not authorized. Since the schedule was so random, it didn't hit us until we saw that all DNS requests were being processed under a single user account. So when the firewall asked WMI for the user of the PC, 95 percent of the time it reported the actual user, and for short intervals it would report the service user, which caused the failure. After a few seconds WMI would report back that the real user was associated with the computer and all would work again. answered 17 Dec '15, 13:22 ericspt edited 17 Dec '15, 13:23
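For anyone hitting something similar: a quick way to see which account a WMI-based identity lookup would return for a given PC (a generic check, assuming the firewall polls something like Win32_ComputerSystem; the hostname is only an example) is:

    wmic /node:WORKSTATION01 computersystem get username

Running this repeatedly while the monitoring agent's scheduled job fires should show the momentary switch from the interactive user to the service account described above.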
Is it possible to upload a capture showing the problem to Cloudshark?