Here is my problem. In an industrial automation environment we have one Modbus master system and several Modbus slaves (13 PLC/PC devices). Modbus communication is working, but frequently (about twice per hour) we lose the connection to one of the devices (every time a different device). This lasts a couple of seconds (10-20 sec) and then the communication starts again. We can see that the frequency of losing the connection is related to the number of slaves active on the network: with fewer slaves on the network it still happens, but less often. In our Modbus master system there is no good log available to see what is happening, so I used Wireshark to capture the data traffic between the master and the slaves. I've got one occurrence captured, but to be honest I really don't know what I am looking at in Wireshark. Hopefully someone can explain to me what the Wireshark log can tell me. Thanks,
Here is a link to the file on CloudShark. The problem is between 220.127.116.11 and 18.104.22.168, from 03 min 46 sec until 04 min 06 sec. Hopefully this will help to find my problem. Thanks
Screenshot of the configuration in the master system (Wonderware ArchestrA). The Modbus devices are Elau PLC controllers. We also communicate with an X-ray machine which has a dedicated PC-based controller. Thanks to everyone for the cooperative thinking and helping.
asked 13 Nov '12, 04:21
edited 15 Nov '12, 13:44
Kurt Knochner ♦
@Bill Meier might be on to something here.
In the first capture, if you filter by the rtu's address (ip.addr == 22.214.171.124) and look at the frames preceding the connection close, i.e. frame 29986 onwards, you can see a sequence of Read Input Registers requests.
Looking at the Modbus/TCP headers you can see the "Transaction Identifier" that allows the master to match up responses to queries and have multiple queries in flight. Standard Modbus doesn't have this as it's a strict request/response protocol, but the TCP variant does.
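For illustration only, here's a minimal Python sketch (not anything taken from the master or rtu, just the standard MBAP layout) showing how those headers are structured and how several ADUs can be packed into one TCP segment, which is why a single frame below can carry several TIs:

    import struct

    def iter_mbap(payload: bytes):
        """Yield (transaction_id, unit_id, pdu) for each Modbus/TCP ADU
        packed into a single TCP segment.

        MBAP header, big-endian: transaction id (2 bytes), protocol id (2),
        length (2, counting the unit id plus the PDU), unit id (1).
        """
        offset = 0
        while offset + 7 <= len(payload):
            tid, proto, length, unit = struct.unpack_from(">HHHB", payload, offset)
            yield tid, unit, payload[offset + 7 : offset + 6 + length]
            offset += 6 + length   # the next ADU starts right after this one

    # e.g. a segment carrying three queries would yield TIs 28532, 28533, 28534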
So the query in frame 29986 has TIs 28532-28534. The response (30004) carries the responses to those TIs. The next query (30006) has TIs 28535 & 28536. The next response (30067) has the responses to those. The next query (30069) has TIs 28537-28540, but the next response (30107) only has TIs 28537 & 28538, so TIs 28539 & 28540 are still outstanding (in flight). The master then sends a query (30109) with TI 28541 and then another query (30126) with TIs 28542-28544. Note that the latter two are actually coil writes, so I'm speculating that the master pushed the writes out immediately (as we all like output ops to happen quickly), so now we have TIs 28539-28544 all in flight. The rtu responds (30149) with TIs 28539 & 28540, so TIs 28541-28544 are still in flight. Now we have the nearly 10-second gap until the rtu sends the TCP keep-alive (36817), and very quickly after that the master closes the connection.
I think that the number of TIs in flight (4) causes the master to not send any more queries until the rtu responds, and as this doesn't happen within 10 seconds or so (a master timeout?) the master recycles the connection. I would further speculate that when the master has a write request, it may exceed the "normal" number of in-flight requests.
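To make that speculation concrete, here's a toy Python sketch of a master that caps its in-flight requests. MAX_IN_FLIGHT and the transmit callable are assumptions for illustration only, not anything we can read out of the capture; the point is that writes bypassing such a cap would produce exactly the overrun seen above.

    from collections import deque

    MAX_IN_FLIGHT = 2              # assumption: the rtu's real limit is unknown

    class ModbusMaster:
        """Toy model of a master that caps outstanding transaction ids."""

        def __init__(self, transmit):
            self.transmit = transmit   # callable(tid, query): puts bytes on the wire
            self.next_tid = 0
            self.in_flight = set()     # TIDs sent but not yet answered
            self.backlog = deque()     # queries waiting for a free slot

        def send(self, query):
            if len(self.in_flight) >= MAX_IN_FLIGHT:
                self.backlog.append(query)   # hold back instead of overrunning the rtu
                return
            self.next_tid = (self.next_tid + 1) & 0xFFFF
            self.in_flight.add(self.next_tid)
            self.transmit(self.next_tid, query)

        def on_response(self, tid):
            self.in_flight.discard(tid)
            while self.backlog and len(self.in_flight) < MAX_IN_FLIGHT:
                self.send(self.backlog.popleft())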
You need to check the permitted number of in-flight requests for the rtu, and (if possible) configure the master to not exceed that. Out of interest can you name the master software and the rtu vendor?
I also note that the master is not coalescing reads, i.e. it's sending out "overlapping" read requests and thus not making the most efficient use of bandwidth. For example, frame 29869 requests 70 registers from starting index 201 (201-270), and then the next query (29916), in TI 28528, requests 1 register from index 206. This new read is entirely contained within the previous query.
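For illustration, a small Python sketch of the kind of coalescing I mean, merging overlapping (start, count) register reads before they hit the wire. This is an assumption about what a tidier master could do, not what this one does:

    def coalesce(reads):
        """Merge overlapping/adjacent (start, count) register reads into the
        fewest covering requests. E.g. (201, 70) already covers (206, 1)."""
        if not reads:
            return []
        spans = sorted((start, start + count) for start, count in reads)
        merged = [list(spans[0])]
        for lo, hi in spans[1:]:
            if lo <= merged[-1][1]:                      # overlaps or touches
                merged[-1][1] = max(merged[-1][1], hi)
            else:
                merged.append([lo, hi])
        return [(lo, hi - lo) for lo, hi in merged]

    # coalesce([(201, 70), (206, 1)]) -> [(201, 70)]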
answered 14 Nov '12, 03:25
When referring to frames of interest it's best to use frame numbers rather than times, which can be affected by the viewer's time zone.
I filtered by the IP address of the target rtu (ip.addr == 126.96.36.199) and can see that at frame 36817 the rtu sent a TCP keep-alive 9 seconds after the last communication from the master (30207). The master ACK'd that, then almost immediately closed the connection and opened it again. After the ACK from the master indicating the new connection was open (37003), there is again a 9-second delay before the rtu sends a TCP keep-alive (43200) and the master sends back the ACK (43201). Half a second later the master starts sending requests.
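If you want to pull those keep-alives out programmatically, here's a quick sketch assuming pyshark (a tshark wrapper) is installed; the capture file name is hypothetical, and the IP is just the one shown above:

    import pyshark   # assumption: pyshark is installed

    RTU_IP = "126.96.36.199"        # the rtu's address as shown in this capture
    CAPTURE = "capture.pcapng"      # hypothetical file name; use your own

    # List the keep-alives on the rtu's connection with their timestamps so
    # the idle gaps described above can be measured precisely.
    cap = pyshark.FileCapture(
        CAPTURE,
        display_filter=f"ip.addr == {RTU_IP} && tcp.analysis.keep_alive",
    )
    for pkt in cap:
        print(pkt.number, pkt.sniff_time)
    cap.close()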
So what happened around this period of interruption? Clearing the display filter to see all packets shows that the master is quite happily exchanging packets with other Modbus/TCP rtus, along with plenty of other traffic, e.g. CIP, some SQL Server and some TPKT. There doesn't seem to be any issue with the network; maybe the master is just too busy?
answered 13 Nov '12, 07:23
I also spent a little time doing an analysis ... :)
As noted in the previous answers:
For the first capture I spent a little time looking at the traffic on the connection which failed after a hiccup. The one thing I noticed is that there was a long sequence of repeated "READ INPUT REGISTER" queries just before the hiccup. In fact, there were 24 successive queries, and this is the largest number of successive "READ INPUT REGISTER" queries on any connection in the whole capture. Is this meaningful? I've no idea.
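For anyone who wants to reproduce that count, here's a rough sketch assuming pyshark is installed. Note it counts one query per frame (frames carrying several ADUs are undercounted, since pyshark exposes only the first Modbus layer), and the file name is hypothetical:

    import pyshark   # assumption: pyshark is installed

    # Longest run of back-to-back "Read Input Registers" (function code 4)
    # queries per TCP stream -- one way to reproduce the count above.
    cap = pyshark.FileCapture(
        "capture.pcapng",                               # hypothetical file name
        display_filter="mbtcp && tcp.dstport == 502",   # Modbus/TCP queries only
    )
    runs, best = {}, {}
    for pkt in cap:
        s = pkt.tcp.stream
        if pkt.modbus.func_code == "4":     # Read Input Registers
            runs[s] = runs.get(s, 0) + 1
            best[s] = max(best.get(s, 0), runs[s])
        else:
            runs[s] = 0                     # a different query breaks the run
    cap.close()
    print(sorted(best.items(), key=lambda kv: -kv[1])[:5])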
To see if there's any pattern related to the failures which can be seen in the captures, additional captures covering the time period of a failure would be needed.
answered 13 Nov '12, 14:47
Bill Meier ♦♦
The screenshot is pretty hard to read, especially with these migraine-inducing coloring rules. As far as I can tell you have some inactivity in there, leading to TCP keep-alives (all normal), and then a normal session teardown using FIN-ACK-FIN-ACK. Then a new connection is started (SYN). As far as I can tell from the screenshot there is no critical situation there - it looks like inactivity between master and slave leading to a "we don't need this session anymore" behaviour.
Try filtering for "tcp.analysis.flags" and see if there's something bad happening - if you find a connection that has zero-window messages, long delays (see the time column) or reset packets (filter for "tcp.flags.reset==1") you might be onto something, but it is often hard to tell with the Modbus stuff.
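If wading through the GUI gets tedious, a quick script can do the same triage in one pass; this assumes pyshark is installed, and the file name is hypothetical:

    import pyshark   # assumption: pyshark is installed

    # Flag the symptoms mentioned above in one pass: anything tshark's TCP
    # analysis marks (retransmissions, zero windows, ...) plus explicit resets.
    cap = pyshark.FileCapture(
        "capture.pcapng",   # hypothetical file name
        display_filter="tcp.analysis.flags || tcp.flags.reset == 1",
    )
    for pkt in cap:
        print(pkt.number, pkt.sniff_time, pkt.highest_layer)
    cap.close()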
answered 13 Nov '12, 04:42