Environment: VMware ESXi 5.5 hosting Windows Server 2012, connected via a VLAN with 3 hops (Cisco switches) to Modbus TCP devices: Advantech ADAM-5000/TCP (x5) and Advantech ADAM-6260 TCP (x1). The devices can be polled using the Advantech ADAM/Apax .NET utility version 2.05.10. I'm running Wonderware System Platform 2014 R2 SP1 as the SCADA, with DASMBTCP version 3.0.1 as the Data Acquisition Server.

The ADAM-6260 device is rock solid and never loses connection. The ADAM-5000/TCP devices all drop their connection on port 502 after 7 hours 20 minutes of stable operation. Pinging the devices is always possible, even after the Modbus TCP port closes. A physical reset of the ADAM-5000 brings the device back to life.

If I connect a Win7 laptop to the server switch and route through the VLAN to the ADAM-5000, the connection is rock solid. If I connect a Win7 laptop to the ADAM-5000 via a cross-over cable, the connection is rock solid. As soon as the device is connected and polled from the VMware Windows Server 2012, Modbus TCP fails after 7 hours 20 minutes. I've tried using a Telnet client, but port 502 is definitely closed. The Advantech utility error message is "Connect Module Failed! Reached maximum number of connections".

The issue seems to be independent of the Wonderware DA Server and the Advantech ADAM/Apax .NET utility, because the devices still drop out after 7 hours 20 minutes even when both are disabled and not polling. The ADAM-6260 is solid, and Advantech support put this down to a feature called "Host Idle Timeout". This is not available on the ADAM-5000, although I have requested it.

The table below should help guide through the Wireshark log.
I wanted to attach several screenshots of packet captures and upload a capture to CloudShark, but company policy doesn't allow it and I'm unable to add files directly to this post. Any guidance or insights would be very much appreciated; I've been working on this issue for 6 weeks now. Many thanks in advance. Chris Dell.

asked 13 Apr '17, 03:47 chrisdell, edited 13 Apr '17, 14:51
One Answer:
Working with the brief text excerpt of the capture I can see the following:
In summary: the master sends a request, the slave fails to respond, the master closes the connection, the slave fails to complete the connection close, the master successfully opens another connection and sends a query, and the slave responds by hard-closing (resetting) the connection. Looks to me like an application issue on the slave.

answered 13 Apr '17, 05:22 grahamb ♦
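For anyone wanting to reproduce the first step of that sequence outside the SCADA software, here is a minimal Python sketch of a single Modbus TCP poll. It is only an illustration: the function names, the unit ID of 1, the register address 0 and the count of 2 are assumptions, not values taken from the capture. It reports whether the slave answered, stayed silent, closed the connection, or reset it.

```python
import socket
import struct


def build_read_request(transaction_id, unit_id, start_addr, count):
    """Build a Modbus TCP 'Read Holding Registers' (function 0x03) request."""
    # MBAP header: transaction id, protocol id (always 0), length of the
    # bytes that follow (unit id + function + address + count = 6).
    return struct.pack(">HHHBBHH", transaction_id, 0, 6,
                       unit_id, 0x03, start_addr, count)


def poll_once(host, port=502, timeout=5.0):
    """Send one request and classify what the slave does with it."""
    # Unit id 1, address 0 and count 2 are placeholder values (assumptions).
    request = build_read_request(1, 1, 0, 2)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            data = sock.recv(260)  # a Modbus TCP frame is at most 260 bytes
    except socket.timeout:
        return "no response"       # connect or reply timed out
    except ConnectionResetError:
        return "reset by slave"    # the hard close (RST) described above
    except OSError as exc:
        return f"connect failed: {exc}"
    return "response" if data else "closed by slave"
```

Calling `poll_once("<device address>")` from both the VM and the laptop, while capturing at the slave side, would give two directly comparable traces.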
Thanks for your quick response Graham. I've tried to get the vendor to make a firmware change but so far I've hit resistance. If you don't mind I'm going to send your explanation to Advantech and see if they can make any firmware change.
Having access to the full capture would also enable analysis of other things such as multiple master connection attempts etc.
Personally I don't think there's any sensitive data in a Modbus capture, it's only register values. The exception is where multiple registers are used to return such things as strings, then there might be sensitive data. The IP addresses can be anonymized, but you've already shown those (although they might have been fuzzed).
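If only a text export of the capture can be shared, the addresses could be replaced consistently before posting. The following is a rough Python sketch, not a vetted anonymizer: `anonymize_ips` is a hypothetical helper, it only handles IPv4 addresses written in dotted form, and it maps them into the 192.0.2.0/24 documentation range. It does not touch register values, so any strings carried in registers would still need checking by hand.

```python
import re

# Matches dotted-quad IPv4 addresses in text (no range validation).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def anonymize_ips(text, mapping=None):
    """Replace each distinct IPv4 address with a stable placeholder."""
    mapping = {} if mapping is None else mapping

    def replace(match):
        ip = match.group(0)
        if ip not in mapping:
            # 192.0.2.0/24 is reserved for documentation; this simple
            # scheme only covers up to 254 distinct addresses.
            mapping[ip] = f"192.0.2.{len(mapping) + 1}"
        return mapping[ip]

    return IPV4.sub(replace, text)
```

Passing the same `mapping` dictionary to every call keeps the substitutions consistent across several exported files.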
This afternoon I tried a Tofino Modbus Firewall between the network and the ADAM5000 and early tests are looking good. The units aren't pingable anymore but the Modbus traffic is steady. I'll know for sure over the weekend.
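For a soak test like this, a small script can stand in for the manual Telnet check and log when port 502 stops accepting connections. This is a sketch under assumptions: the function names and the 60-second interval are made up for illustration, and each probe briefly occupies one of the device's connection slots, so the interval should stay long given the "Reached maximum number of connections" error.

```python
import socket
import time
from datetime import datetime


def port_is_open(host, port=502, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # The context manager closes the socket straight away so the
        # probe does not hold a connection slot on the device.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(host, port=502, interval_s=60, checks=None):
    """Log the port state every interval_s seconds (forever if checks is None)."""
    done = 0
    while checks is None or done < checks:
        state = "open" if port_is_open(host, port) else "CLOSED"
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} {host}:{port} {state}")
        done += 1
        if checks is None or done < checks:
            time.sleep(interval_s)
```

The timestamp of the first "CLOSED" line shows whether the 7 hours 20 minutes (26,400 seconds) pattern still holds.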
I missed the part in your question about it only being a problem when running in VMWare.
This implies that the VMware setup is doing something different, in which case, where are you capturing? If you are capturing on the VMware guest or host, you may not be seeing the "real" network traffic. Ideally, capture with a tap, or a mirror or SPAN port, between the slave and its switch.
As it works everywhere except under VMware, the difference in behaviour can only be found by analysing the full captures. As you're unable to provide them, you're on your own there.
The Modbus firewall must be subtly modifying the traffic in some way such that it doesn't trigger the abnormal slave behaviour. Again you'll need to analyse the captures (again at the slave side of the switch) to try and spot any differences.
The unit behind the firewall device has been stable for over 16 hours now. I'm going to implement this as the solution. Further investigation will have to wait, as the project has been on hold since I hit this issue over a month ago. Thanks for your input.
I know that sometimes business needs preclude completing an investigation, but I'd still be concerned that the issue has been postponed and not actually fixed.